How a Large Language Model Processes Text
Before a large language model (LLM) can interpret language, raw text passes through several processing stages. The model never directly “reads” words like humans do. Instead, it transforms text into numerical representations that can be processed mathematically.
1. Raw Text Input
The process begins with ordinary text:
The cat sat on the mat.
At this stage, the input is just a sequence of characters:
- Letters
- Spaces
- Punctuation
- Unicode symbols
2. Text Normalization
Some systems standardize the text before further processing.
Examples include:
- Converting line endings
- Normalizing Unicode characters
- Standardizing quotation marks
- Removing unusual whitespace
“Hello world”
may become:
"Hello world"
3. Tokenization
The text is split into smaller units called tokens. Tokens are not always complete words.
"The cat sat on the mat."
↓
["The", " cat", " sat", " on", " the", " mat", "."]
Rare or complex words may be split into subword pieces:
"unbelievable"
↓
["un", "believ", "able"]
4. Token IDs
Each token is converted into an integer ID.
"The" → 523
" cat" → 9812
The sentence becomes a sequence of numbers:
[523, 9812, 4410, 299, 262, 7819, 13]
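This step is a plain dictionary lookup. The mapping below is hypothetical, chosen to mirror the example IDs above (real vocabularies contain tens of thousands of entries):

```python
# Hypothetical token-to-ID table matching the article's example numbers
token_to_id = {"The": 523, " cat": 9812, " sat": 4410, " on": 299,
               " the": 262, " mat": 7819, ".": 13}

tokens = ["The", " cat", " sat", " on", " the", " mat", "."]
ids = [token_to_id[t] for t in tokens]
print(ids)  # [523, 9812, 4410, 299, 262, 7819, 13]
```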
5. Embeddings
Token IDs are converted into vectors called embeddings.
523 →
[0.12, -0.88, 1.44, ...]
These vectors encode learned relationships between words, concepts, grammar, and usage patterns.
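Mechanically, an embedding is a row lookup in a learned matrix. A toy sketch with random values standing in for trained weights (real models use hundreds or thousands of dimensions):

```python
import numpy as np

vocab_size, d_model = 10_000, 8      # toy sizes for illustration
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))  # learned in training

ids = [523, 9812, 4410]
vectors = embedding_table[ids]       # one d_model-dim row per token ID
print(vectors.shape)                 # (3, 8)
```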
6. Positional Encoding
The attention mechanism at the heart of a transformer has no built-in sense of token order, so positional information is added to each embedding.
"dog bites man"
"man bites dog"
Positional encoding helps the model distinguish between these.
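One classic scheme is the sinusoidal encoding from the original transformer paper, where each position gets a unique pattern of sine and cosine values (many modern models use other schemes, such as learned or rotary embeddings):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    # Each position gets sin/cos values at geometrically spaced frequencies.
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10_000 ** (dims / d_model))
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

# Added element-wise to token embeddings, so the same word at a
# different position produces a different input vector.
pe = sinusoidal_positions(seq_len=3, d_model=8)
print(pe.shape)  # (3, 8)
```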
7. Transformer Processing
The embeddings pass through multiple transformer layers.
Inside each layer:
- Tokens exchange information through attention
- Context accumulates across the sequence
- Richer representations are built layer by layer
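The structure of a single layer can be sketched schematically. This is a toy pre-norm residual layer with stand-in attention and feed-forward functions, not a full implementation:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Normalize each token vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_layer(x, attn, mlp):
    # Pre-norm residual structure used by most modern transformers:
    # attention mixes information ACROSS tokens, the MLP then
    # transforms each token independently.
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x

# Toy stand-ins for the learned sublayers
x = np.ones((3, 8))
out = transformer_layer(x, attn=lambda h: h, mlp=lambda h: h * 0.5)
print(out.shape)  # (3, 8)
```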
8. Attention Mechanism
Attention allows tokens to determine which other tokens matter most for interpretation.
"The animal didn't cross the street because it was tired."
The model learns that “it” likely refers to “animal”.
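The core computation is scaled dot-product attention: each token's query is compared against every token's key, and the resulting weights mix the value vectors. A minimal NumPy sketch:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: each row of the weight matrix
    # says how much that token attends to every other token.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 tokens, 8-dim vectors
out, w = attention(x, x, x)          # self-attention over one sequence
print(out.shape, w.shape)            # (5, 8) (5, 5)
```

In a real model, Q, K, and V are separate learned projections of the token vectors rather than the vectors themselves.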
9. Internal Representations
As information moves deeper through the network:
- Early layers tend to capture surface features and syntax
- Middle layers tend to capture relationships between tokens
- Later layers tend to encode more abstract meaning
10. Next-Token Prediction
At its core, an LLM outputs a probability distribution over the possible next tokens.
Input:
"The cat sat on the"
Output Probabilities:
" mat" → 72%
" floor" → 9%
" chair" → 4%
11. Sampling and Decoding
The model selects the next token using decoding strategies such as:
- Greedy decoding
- Temperature sampling
- Top-k sampling
- Nucleus (top-p) sampling
These strategies trade off determinism against diversity in the generated text.
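Temperature and top-k sampling can be combined in one small function (greedy decoding is simply taking the argmax instead of sampling):

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, rng=None):
    # Temperature < 1 sharpens the distribution, > 1 flattens it;
    # top-k keeps only the k highest-scoring tokens before sampling.
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

idx = sample_token([4.0, 1.9, 1.1], temperature=0.7, top_k=2)
```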
12. Detokenization
Finally, tokens are converted back into readable text.
["The", " cat", " sat", " on", " the", " mat", "."]
↓
"The cat sat on the mat."
Key Insight
The model never literally understands language the way humans do. Internally it processes:
- Token IDs
- Vectors
- Matrix operations
- Attention weights
- Probability distributions
Meaning emerges from statistical patterns learned during training.
Complete Pipeline
Raw Text
↓
Normalization
↓
Tokenization
↓
Token IDs
↓
Embeddings
↓
Positional Encoding
↓
Transformer Layers + Attention
↓
Next-Token Prediction
↓
Sampling
↓
Output Tokens
↓
Readable Text
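The whole pipeline runs in a loop, one token at a time. A schematic sketch with toy stand-ins (a character-level tokenizer and a fake model that always scores the next character highest, just to make the loop runnable):

```python
import numpy as np

class ToyTokenizer:
    # Character-level stand-in for a real subword tokenizer
    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(chr(i) for i in ids)

def toy_model(ids: list[int]) -> np.ndarray:
    # Fake model: gives the highest score to the character after the last one
    logits = np.zeros(256)
    logits[(ids[-1] + 1) % 256] = 1.0
    return logits

def generate(prompt, model, tokenizer, max_new_tokens=3):
    # Autoregressive loop: encode, predict, sample, append, repeat, decode
    ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        next_id = int(model(ids).argmax())  # greedy decoding
        ids.append(next_id)
    return tokenizer.decode(ids)

print(generate("abc", toy_model, ToyTokenizer()))  # abcdef
```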