Day 2 — How Large Language Models Actually Work?
“To master any tool, you must first understand what it is doing beneath the surface. The engineers who build the best AI systems are not the ones who treat models as black boxes — they are the ones who understand why the black box behaves the way it does.”

Why This Day Matters
Most AI engineers use LLMs the way most people use smartphones — effectively, but without understanding what’s happening inside. That’s fine for consumers. It’s limiting for engineers.
When you understand how LLMs work internally, you gain several concrete engineering advantages:
- You know why prompts structured certain ways work better
- You can predict where a model will fail and design around it
- You can make better decisions about when to use reasoning models vs. standard models
- You can debug hallucinations more effectively by understanding their source
- You can explain AI system behavior to stakeholders with precision
This is not a math lecture. We will build intuition — deep, durable intuition — without requiring you to implement backpropagation from scratch. When the math is relevant, we’ll show it. When it isn’t, we’ll use analogies that hold up under scrutiny.
Part 1: What Is a Neural Network?
The Biological Inspiration (and Its Limits)
Neural networks are loosely inspired by the brain’s neurons — cells that receive signals, process them, and fire outputs to other neurons. The analogy is useful but limited. Artificial neural networks are mathematical functions, not biological systems.
Here’s the core idea: a neural network is a function that transforms inputs into outputs by passing data through layers of interconnected mathematical operations.
Input → [Layer 1] → [Layer 2] → [Layer 3] → ... → Output
(weights) (weights) (weights)
Each “layer” applies a mathematical transformation to the data. The weights — the numbers that govern those transformations — are what the network “learns” during training.
The Building Block: A Single Neuron
A single artificial neuron does something remarkably simple:
output = activation_function(w₁x₁ + w₂x₂ + w₃x₃ + ... + bias)
It takes multiple inputs (x₁, x₂, x₃...), multiplies each by a learned weight (w₁, w₂, w₃...), adds them up, adds a bias term, and passes the result through a non-linear activation function (like ReLU or GELU).
That’s it. One neuron. The magic of neural networks comes from stacking millions or billions of these neurons together in layers.
How Networks Learn: The Training Loop
Training a neural network follows a three-step loop:
Step 1 — Forward Pass: Feed input data through the network layer by layer. Produce a prediction.
Step 2 — Calculate Loss: Compare the prediction to the correct answer using a loss function. Measure how wrong the model was.
Step 3 — Backward Pass (Backpropagation): Using calculus (specifically the chain rule), calculate how much each weight contributed to the error. Update the weights to reduce the error slightly.
Repeat this millions of times across billions of examples. The weights gradually converge to values that allow the network to make accurate predictions. This is what “training” means.
# Conceptual training loop (simplified)
for batch in training_data:
# Forward pass
predictions = model(batch.inputs)
# Calculate how wrong we were
loss = loss_function(predictions, batch.targets)
# Backward pass: compute gradients
loss.backward()
# Update weights to reduce loss
optimizer.step()
optimizer.zero_grad()
For LLMs, the training task is deceptively simple: predict the next token. Given the sequence “The cat sat on the”, predict that the next word is likely “mat” or “floor” or “roof.” Do this billions of times across trillions of tokens of text from the internet, books, and code — and the model learns to encode a surprisingly rich model of language, knowledge, and reasoning.
Part 2: The Transformer Architecture
Before Transformers: The Problem with Sequential Processing
Before 2017, the dominant approach for processing text was Recurrent Neural Networks (RNNs). RNNs processed text sequentially — token by token, left to right. This created two serious problems:
- Long-range dependencies were hard to capture. If the answer to a question depends on a word 500 tokens earlier in the document, the signal about that word had to survive 500 steps of processing. It usually got diluted.
- Sequential processing couldn’t be parallelized. Training was slow because you had to process token 1 before token 2 before token 3.
The Transformer solved both problems simultaneously. It processes all tokens in parallel, and it uses a mechanism called attention to directly connect any token to any other token regardless of distance.
The Transformer: A High-Level View
Input Text
↓
[Tokenizer] → Convert words to token IDs
↓
[Embedding Layer] → Convert token IDs to vectors
↓
[Positional Encoding] → Add position information
↓
[Attention Block] ←──┐
[Feed-Forward Block] │ × N layers
[Layer Norm] ──────┘
↓
[Output Head] → Predict next token probabilities
↓
Sample from distribution → Generated token
Let’s examine each component.
Tokenization: How Text Becomes Numbers
LLMs don’t process characters or words — they process tokens. A token is typically a subword unit, somewhere between a character and a word.
from tiktoken import encoding_for_model
enc = encoding_for_model("gpt-4")
tokens = enc.encode("Hello, how are you today?")
print(tokens)
# [9906, 11, 1268, 527, 499, 3432, 30]
print(f"Characters: {len('Hello, how are you today?')}") # 25
print(f"Tokens: {len(tokens)}") # 7
Different text encodes to different numbers of tokens. As a rule of thumb, 1 token ≈ 0.75 English words. But code, non-English text, and special characters can be much less efficient — sometimes 1 character = 1 token.
Why this matters for engineers: Token count drives API cost, context window usage, and latency. Understanding tokenization helps you estimate costs and optimize prompts.
Embeddings: Meaning as Geometry
Once text is tokenized, each token ID is converted to an embedding — a vector of floating-point numbers (typically 768 to 12,288 dimensions depending on model size).
The key insight: these vectors encode semantic meaning as geometry.
"king" - "man" + "woman" ≈ "queen"
"Paris" - "France" + "Italy" ≈ "Rome"
Words with similar meanings cluster together in this high-dimensional space. The model learns these relationships during training — not from explicit rules, but from statistical patterns in text.
This embedding space is also the foundation of embedding models — specialized models trained to produce these semantic vectors for search and retrieval. We’ll use them extensively in RAG systems.
Positional Encoding: Teaching Order Without Sequential Processing
Since Transformers process all tokens in parallel, they need a way to know which token came first, second, third. Positional encoding adds position information to each token’s embedding.
Modern models (like Llama and GPT-4) use Rotary Position Embeddings (RoPE), which elegantly encode relative position and scale to very long sequences — enabling the 1M+ context windows we see in 2026.
Part 3: Self-Attention — The Core Breakthrough
Self-attention is the mechanism that makes Transformers powerful. It allows every token to directly attend to every other token in the sequence, regardless of distance.
The Intuition
Imagine you’re reading: “The animal didn’t cross the street because it was too tired.”
What does “it” refer to? The animal or the street? To resolve this, your brain looks back at context — at “animal” and “street” and their relationship to the rest of the sentence.
Self-attention does something analogous. For each token, it computes a weighted sum of all other tokens in the sequence — weighted by how “relevant” each token is to understanding the current one.
The Mechanics: Q, K, V
Self-attention uses three matrices — Query (Q), Key (K), and Value (V) — all learned during training.
Think of it like a library search:
- Query = “What am I looking for?”
- Key = “What does each book say it’s about?”
- Value = “What’s actually in each book?”
For each token, we compute:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
In plain English:
- For each token, compute how well it matches every other token (QK^T)
- Scale by the square root of the dimension to stabilize gradients (/ √d_k)
- Apply softmax to get probabilities that sum to 1
- Use these probabilities to take a weighted sum of the values (V)
The result: each token’s representation is updated to incorporate information from all other tokens, weighted by relevance.
import torch
import torch.nn.functional as F
def self_attention(Q, K, V, d_k):
"""
Minimal self-attention implementation.
Args:
Q, K, V: Query, Key, Value matrices [batch, seq_len, d_k]
d_k: Key dimension (for scaling)
Returns:
Attention output [batch, seq_len, d_k]
"""
# Compute attention scores
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
# Convert to probabilities
attention_weights = F.softmax(scores, dim=-1)
# Weighted sum of values
output = torch.matmul(attention_weights, V)
return output, attention_weights
Multi-Head Attention: Learning Multiple Relationships
Real Transformer models use multi-head attention — they run the attention mechanism in parallel multiple times (e.g., 32 or 96 “heads”), each learning to attend to different types of relationships.
One head might learn to attend to syntactic relationships (subjects and verbs). Another might learn semantic similarity. Another might learn to resolve coreferences (like “it” → “animal”). Together, they capture a rich representation of language.
Multi-Head Attention = Concat(head_1, head_2, ..., head_h) × W_O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Part 4: The Full Forward Pass
Let’s trace a complete forward pass through a Transformer LLM.
Input: "The weather today is"
↓
Tokenize: [791, 9282, 3432, 374]
↓
Embed: 4 vectors of dim 4096 (for a large model)
↓
Add positional encoding
↓
Layer 1:
Multi-Head Self-Attention (96 heads)
→ Each token attends to all others
Add & Normalize
Feed-Forward Network (MLP)
Add & Normalize
↓
Layer 2:
(same structure)
↓
... × 32 layers (GPT-3) or 80 layers (GPT-4)
↓
Output projection → vocabulary size (50,257 for GPT models)
↓
Softmax → probability distribution over all tokens
↓
Sample (temperature=0.7) → "sunny" (or "cold", or "beautiful")
This entire forward pass happens in parallel for all tokens simultaneously — that’s why Transformers can be trained so much faster than RNNs, and why they can process long contexts efficiently.
Part 5: What “Training” Really Means at Scale
The Scale of Modern Training
GPT-4 was trained on approximately:
- ~1 trillion tokens of text
- Months of compute time
- Thousands of NVIDIA A100 GPUs
- Estimated cost: $50–100 million
Llama 3.1 405B was trained on:
- 15 trillion tokens
- 16,000 H100 GPUs
- ~6 months
This is why you don’t train frontier models from scratch. You use them as base models and adapt them (through prompting, fine-tuning, RAG) for your use case.
Pre-training, SFT, and RLHF
Modern LLMs are trained in stages:
Stage 1 — Pre-training: Self-supervised learning on massive text corpora. The model learns to predict the next token. This is where the model acquires its knowledge and capabilities.
Stage 2 — Supervised Fine-Tuning (SFT): Training on human-written instruction-response pairs. This teaches the model to follow instructions, not just complete text.
Stage 3 — RLHF / DPO: Training using human preference data. Human raters compare model outputs and choose the better one. The model learns to produce outputs humans prefer — safer, more helpful, better formatted.
Pre-trained Base Model → SFT → RLHF/DPO → Chat Model (what you use via API)
Understanding these stages matters for engineering decisions: when you fine-tune, you’re typically starting from a chat model (post-RLHF) and adding domain-specific knowledge or behavior on top.
Part 6: Scaling Laws — Why Bigger Is (Usually) Better
In 2020, OpenAI researchers discovered a remarkable empirical regularity: LLM performance improves predictably as you scale up model size, dataset size, and compute — following smooth power laws.
This is captured in Chinchilla scaling laws (2022, DeepMind):
Loss ∝ N^(-α) × D^(-β)
Where:
N = number of model parameters
D = number of training tokens
α, β ≈ 0.5 (empirically)
Key implication: For a given compute budget, you should train a smaller model on more data, rather than a larger model on less data. Chinchilla (70B parameters, 1.4T tokens) outperformed GPT-3 (175B parameters, 300B tokens) — with the same compute budget.
This has driven the industry toward:
- Training 7B-70B models on 2–15 trillion tokens (Llama 3, Qwen, Mistral)
- Investing heavily in data quality, not just scale
- “Inference-time compute” — spending more compute at generation time rather than training time (reasoning models)
The Emergence Phenomenon
One of the most striking aspects of scaling is emergence — the sudden appearance of new capabilities at scale thresholds that weren’t present in smaller models.
Examples of emergent capabilities:
- Multi-step arithmetic (emerged around 100B parameters)
- Chain-of-thought reasoning (emerged in the 100B+ range)
- Instruction following without fine-tuning (appeared in very large models)
- In-context learning from a few examples
This has a direct engineering implication: a task that a small model handles poorly might be solved by a larger model — sometimes without any changes to your prompt or system.
Part 7: How This Knowledge Makes You a Better Engineer
Insight 1: Why Prompt Structure Matters
The attention mechanism means that every token in your prompt can influence every other token’s representation. Long, unfocused prompts create noise. Precise, structured prompts help the model allocate attention to what matters.
# ❌ Noisy prompt — attention is diffuse
prompt = """
Hello, I was wondering if you could maybe help me with something.
I have a lot of text data and I want to understand what the main
topics are. Could you help? The text is customer feedback for
our software product. If that's possible, that would be great!
"""
# ✅ Structured prompt - attention is focused
prompt = """
Task: Identify the top 3 topics in customer feedback.
Format: Return a JSON array of {topic, frequency_estimate, example_quote}
Customer feedback:
{feedback_text}
"""
Insight 2: Why LLMs Hallucinate
Hallucination emerges from the training objective: predict the next most probable token. If the model has seen many texts that confidently state something, it will confidently state it — regardless of whether it’s true. The model has no explicit “I don’t know” signal; it has learned to produce fluent, confident text.
Mitigation: RAG (retrieve ground truth before generating), temperature reduction (more conservative sampling), and structured output with citations.
Insight 3: Why Context Position Matters
Research shows that LLMs pay more attention to information at the beginning and end of the context window — the “lost in the middle” problem. If you have critical information, put it at the top of your system prompt or just before the user query — not buried in the middle.
Insight 4: Why Reasoning Models Work
Reasoning models (o3, Gemini 2.5 Thinking) are trained to generate extended internal reasoning before answering. This leverages a key property of Transformers: each generated token can attend to all previous tokens, including the model’s own reasoning steps. By “thinking out loud,” the model can use intermediate conclusions to arrive at better final answers — essentially using the context window as working memory.
🔍 Common Mistakes to Avoid
Mistake 1: Treating All Models as Equivalent
GPT-4o and GPT-4o-mini have fundamentally different parameter counts and capabilities. Using the cheaper model for tasks requiring deep reasoning is a common source of quality failures. Always match model capability to task complexity.
Mistake 2: Ignoring the Lost-in-the-Middle Effect
Placing your most important instructions or context in the middle of a long prompt consistently underperforms placing them at the beginning or end. Structure your prompts accordingly.
Mistake 3: Underestimating Temperature’s Effect
Temperature 0.0 is not always better. For creative tasks, summarization, and open-ended questions, some temperature (0.3–0.7) produces better outputs. For factual, structured, or code generation tasks, lower temperature (0.0–0.2) is usually preferable.
Mistake 4: Assuming More Tokens = Better Understanding
LLMs don’t “understand” your prompt better if you explain more. They pattern-match against training data. A concise, well-structured prompt often outperforms a verbose explanation of the same thing.
💼 Quick Questions
Q1: What is the attention mechanism, and why was it a breakthrough over previous architectures?
Answer: Attention allows every token in a sequence to directly attend to every other token, computing a weighted relevance score. This solves two problems of RNNs: (1) long-range dependencies — attention directly connects distant tokens regardless of position, and (2) parallelism — all attention computations happen simultaneously, enabling efficient training on large datasets.
Q2: What is the difference between pre-training, SFT, and RLHF?
Answer: Pre-training is self-supervised training on massive text corpora (next-token prediction) — this is where the model acquires general knowledge and language understanding. SFT (Supervised Fine-Tuning) trains the model on instruction-response pairs to follow instructions. RLHF (Reinforcement Learning from Human Feedback) trains the model to produce outputs that human raters prefer, improving safety, helpfulness, and formatting.
Q3: What are scaling laws, and what do they imply for AI engineering decisions?
Answer: Scaling laws describe the predictable power-law relationship between model performance and scale (parameters, data, compute). Chinchilla scaling laws showed that for a given compute budget, training a smaller model on more data often outperforms a larger model on less data. For engineers, this means: don’t assume the largest model is always best — a well-trained smaller model might outperform a larger one on your specific task, often at much lower cost.
Q4: Why do LLMs hallucinate, and what are the architectural reasons?
Answer: LLMs are trained to predict the next most probable token based on patterns in training data. They have no explicit truth-checking mechanism or “I don’t know” signal — they always produce the statistically most likely continuation. Hallucination occurs when the model generates a plausible-sounding but factually incorrect sequence. Mitigation strategies include RAG (grounding in retrieved documents), lower temperature, and structured output with mandatory source citation.
Q5: What is the “lost in the middle” problem?
Answer: Research shows that LLMs pay disproportionately more attention to information at the beginning and end of the context window. Critical information placed in the middle of a long prompt is processed less effectively. The practical implication: put your most important instructions and context either at the start of the system prompt or immediately before the user query.
🏭 Production Considerations
Context Window as a Resource: Every token in the context window costs money and adds latency. In production systems, actively manage context — summarize conversation history, compress retrieved documents, and measure the relationship between context length and output quality.
Temperature in Production: For production AI systems handling factual queries, structured data extraction, or code generation, set temperature between 0.0 and 0.2. Higher temperatures increase variance, which is generally undesirable in consistent production pipelines.
Model Versioning: Models are updated by providers without notice. GPT-4o today may behave differently from GPT-4o in three months. Pin model versions explicitly (gpt-4o-2024-11-20, not just gpt-4o) for production systems where consistency matters.
# ✅ Version-pinned production call
response = client.chat.completions.create(
model="gpt-4o-2024-11-20", # Pinned version, not just "gpt-4o"
messages=messages,
temperature=0.1,
seed=42 # For reproducibility testing
)
⚡ Performance & Scalability Insights
KV Cache: Modern inference servers (vLLM, TGI) use a Key-Value cache to store attention computations for previously seen tokens. This dramatically speeds up generation for subsequent tokens and for batched requests with shared prefixes (e.g., the same system prompt).
Prefill vs. Decode: LLM inference has two phases — prefill (processing the input tokens in parallel) and decode (generating output tokens one at a time). Prefill is fast and highly parallelizable. Decode is the bottleneck. Strategies like speculative decoding address this by using a small draft model to generate candidate tokens and a large model to verify them.
Batch Size and Throughput: For server-side AI applications, increasing batch size (serving multiple requests together) significantly improves GPU utilization and throughput — at the cost of per-request latency. This tradeoff is fundamental to AI infrastructure design.
🔑 Key Takeaways
- LLMs are next-token predictors trained at scale. The sophistication of their outputs emerges from the scale of training, the richness of training data, and the power of the attention mechanism — not from explicit rules or knowledge encoding.
- Self-attention is the core innovation. By allowing every token to attend to every other token, Transformers capture long-range relationships efficiently and in parallel — solving the fundamental limitation of prior architectures.
- Training has stages that shape behavior. Pre-training builds knowledge. SFT builds instruction-following. RLHF/DPO aligns the model with human preferences. Understanding these stages helps you reason about why models behave the way they do.
- The “lost in the middle” effect is real and consequential. Put critical context at the beginning or end of your prompt. This is one of the most actionable insights from LLM internals research.
- Scaling laws are predictive. The relationship between model size, data, and performance is empirical and consistent. This drives hardware investment, training strategy, and the cost structure of the entire AI industry.
📚 Further Reading & Resources
- The Illustrated Transformer (Jay Alammar) — The best visual explanation of Transformers ever written
- “Attention Is All You Need” (Vaswani et al., 2017) — The original paper
- “Training Compute-Optimal Large Language Models” (Hoffmann et al., 2022) — The Chinchilla scaling laws paper
- “Lost in the Middle” (Liu et al., 2023) — The research paper on context position effects
- Andrej Karpathy’s “Let’s Build GPT” — The best hands-on video for understanding Transformers from scratch
