Day 2 — How Large Language Models Actually Work?

Neeraj Kushwaha

“To master any tool, you must first understand what it is doing beneath the surface. The engineers who build the best AI systems are not the ones who treat models as black boxes — they are the ones who understand why the black box behaves the way it does.”

Why This Day Matters

Most AI engineers use LLMs the way most people use smartphones — effectively, but without understanding what’s happening inside. That’s fine for consumers. It’s limiting for engineers.

When you understand how LLMs work internally, you gain several concrete engineering advantages:

You know why prompts structured certain ways work better
You can predict where a model will fail and design around it
You can make better decisions about when to use reasoning models vs. standard models
You can debug hallucinations more effectively by understanding their source
You can explain AI system behavior to stakeholders with precision

This is not a math lecture. We will build intuition — deep, durable intuition — without requiring you to implement backpropagation from scratch. When the math is relevant, we’ll show it. When it isn’t, we’ll use analogies that hold up under scrutiny.

Part 1: What Is a Neural Network?

The Biological Inspiration (and Its Limits)

Neural networks are loosely inspired by the brain’s neurons — cells that receive signals, process them, and fire outputs to other neurons. The analogy is useful but limited. Artificial neural networks are mathematical functions, not biological systems.

Here’s the core idea: a neural network is a function that transforms inputs into outputs by passing data through layers of interconnected mathematical operations.

Input → [Layer 1] → [Layer 2] → [Layer 3] → ... → Output
         (weights)   (weights)   (weights)

Each “layer” applies a mathematical transformation to the data. The weights — the numbers that govern those transformations — are what the network “learns” during training.

The Building Block: A Single Neuron

A single artificial neuron does something remarkably simple:

output = activation_function(w₁x₁ + w₂x₂ + w₃x₃ + ... + bias)

It takes multiple inputs (x₁, x₂, x₃...), multiplies each by a learned weight (w₁, w₂, w₃...), adds them up, adds a bias term, and passes the result through a non-linear activation function (like ReLU or GELU).

That’s it. One neuron. The magic of neural networks comes from stacking millions or billions of these neurons together in layers.

How Networks Learn: The Training Loop

Training a neural network follows a three-step loop:

Step 1 — Forward Pass: Feed input data through the network layer by layer. Produce a prediction.

Step 2 — Calculate Loss: Compare the prediction to the correct answer using a loss function. Measure how wrong the model was.

Step 3 — Backward Pass (Backpropagation): Using calculus (specifically the chain rule), calculate how much each weight contributed to the error. Update the weights to reduce the error slightly.

Repeat this millions of times across billions of examples. The weights gradually converge to values that allow the network to make accurate predictions. This is what “training” means.

# Conceptual training loop (simplified)
for batch in training_data:
    # Forward pass
    predictions = model(batch.inputs)
    
    # Calculate how wrong we were
    loss = loss_function(predictions, batch.targets)
    
    # Backward pass: compute gradients
    loss.backward()
    
    # Update weights to reduce loss
    optimizer.step()
    optimizer.zero_grad()

For LLMs, the training task is deceptively simple: predict the next token. Given the sequence “The cat sat on the”, predict that the next word is likely “mat” or “floor” or “roof.” Do this billions of times across trillions of tokens of text from the internet, books, and code — and the model learns to encode a surprisingly rich model of language, knowledge, and reasoning.

Part 2: The Transformer Architecture

Before Transformers: The Problem with Sequential Processing

Before 2017, the dominant approach for processing text was Recurrent Neural Networks (RNNs). RNNs processed text sequentially — token by token, left to right. This created two serious problems:

Long-range dependencies were hard to capture. If the answer to a question depends on a word 500 tokens earlier in the document, the signal about that word had to survive 500 steps of processing. It usually got diluted.
Sequential processing couldn’t be parallelized. Training was slow because you had to process token 1 before token 2 before token 3.

The Transformer solved both problems simultaneously. It processes all tokens in parallel, and it uses a mechanism called attention to directly connect any token to any other token regardless of distance.

The Transformer: A High-Level View

Input Text
    ↓
[Tokenizer] → Convert words to token IDs
    ↓
[Embedding Layer] → Convert token IDs to vectors
    ↓
[Positional Encoding] → Add position information
    ↓
[Attention Block] ←──┐
[Feed-Forward Block]  │  × N layers
[Layer Norm]    ──────┘
    ↓
[Output Head] → Predict next token probabilities
    ↓
Sample from distribution → Generated token

Let’s examine each component.

Tokenization: How Text Becomes Numbers

LLMs don’t process characters or words — they process tokens. A token is typically a subword unit, somewhere between a character and a word.

from tiktoken import encoding_for_model
enc = encoding_for_model("gpt-4")
tokens = enc.encode("Hello, how are you today?")
print(tokens)
# [9906, 11, 1268, 527, 499, 3432, 30]
print(f"Characters: {len('Hello, how are you today?')}")  # 25
print(f"Tokens: {len(tokens)}")                           # 7

Different text encodes to different numbers of tokens. As a rule of thumb, 1 token ≈ 0.75 English words. But code, non-English text, and special characters can be much less efficient — sometimes 1 character = 1 token.

Why this matters for engineers: Token count drives API cost, context window usage, and latency. Understanding tokenization helps you estimate costs and optimize prompts.

Embeddings: Meaning as Geometry

Once text is tokenized, each token ID is converted to an embedding — a vector of floating-point numbers (typically 768 to 12,288 dimensions depending on model size).

The key insight: these vectors encode semantic meaning as geometry.

"king" - "man" + "woman" ≈ "queen"
"Paris" - "France" + "Italy" ≈ "Rome"

Words with similar meanings cluster together in this high-dimensional space. The model learns these relationships during training — not from explicit rules, but from statistical patterns in text.

This embedding space is also the foundation of embedding models — specialized models trained to produce these semantic vectors for search and retrieval. We’ll use them extensively in RAG systems.

Positional Encoding: Teaching Order Without Sequential Processing

Since Transformers process all tokens in parallel, they need a way to know which token came first, second, third. Positional encoding adds position information to each token’s embedding.

Modern models (like Llama and GPT-4) use Rotary Position Embeddings (RoPE), which elegantly encode relative position and scale to very long sequences — enabling the 1M+ context windows we see in 2026.

Part 3: Self-Attention — The Core Breakthrough

Self-attention is the mechanism that makes Transformers powerful. It allows every token to directly attend to every other token in the sequence, regardless of distance.

The Intuition

Imagine you’re reading: “The animal didn’t cross the street because it was too tired.”

What does “it” refer to? The animal or the street? To resolve this, your brain looks back at context — at “animal” and “street” and their relationship to the rest of the sentence.

Self-attention does something analogous. For each token, it computes a weighted sum of all other tokens in the sequence — weighted by how “relevant” each token is to understanding the current one.

The Mechanics: Q, K, V

Self-attention uses three matrices — Query (Q), Key (K), and Value (V) — all learned during training.

Think of it like a library search:

Query = “What am I looking for?”
Key = “What does each book say it’s about?”
Value = “What’s actually in each book?”

For each token, we compute:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

In plain English:

For each token, compute how well it matches every other token (QK^T)
Scale by the square root of the dimension to stabilize gradients (/ √d_k)
Apply softmax to get probabilities that sum to 1
Use these probabilities to take a weighted sum of the values (V)

The result: each token’s representation is updated to incorporate information from all other tokens, weighted by relevance.

import torch
import torch.nn.functional as F
def self_attention(Q, K, V, d_k):
    """
    Minimal self-attention implementation.
    
    Args:
        Q, K, V: Query, Key, Value matrices [batch, seq_len, d_k]
        d_k: Key dimension (for scaling)
    
    Returns:
        Attention output [batch, seq_len, d_k]
    """
    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    
    # Convert to probabilities
    attention_weights = F.softmax(scores, dim=-1)
    
    # Weighted sum of values
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights

Multi-Head Attention: Learning Multiple Relationships

Real Transformer models use multi-head attention — they run the attention mechanism in parallel multiple times (e.g., 32 or 96 “heads”), each learning to attend to different types of relationships.

One head might learn to attend to syntactic relationships (subjects and verbs). Another might learn semantic similarity. Another might learn to resolve coreferences (like “it” → “animal”). Together, they capture a rich representation of language.

Multi-Head Attention = Concat(head_1, head_2, ..., head_h) × W_O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Part 4: The Full Forward Pass

Let’s trace a complete forward pass through a Transformer LLM.

Input: "The weather today is"
    ↓
Tokenize: [791, 9282, 3432, 374]
    ↓
Embed: 4 vectors of dim 4096 (for a large model)
    ↓
Add positional encoding
    ↓
Layer 1:
  Multi-Head Self-Attention (96 heads)
  → Each token attends to all others
  Add & Normalize
  Feed-Forward Network (MLP)
  Add & Normalize
    ↓
Layer 2:
  (same structure)
    ↓
... × 32 layers (GPT-3) or 80 layers (GPT-4)
    ↓
Output projection → vocabulary size (50,257 for GPT models)
    ↓
Softmax → probability distribution over all tokens
    ↓
Sample (temperature=0.7) → "sunny" (or "cold", or "beautiful")

This entire forward pass happens in parallel for all tokens simultaneously — that’s why Transformers can be trained so much faster than RNNs, and why they can process long contexts efficiently.

Part 5: What “Training” Really Means at Scale

The Scale of Modern Training

GPT-4 was trained on approximately:

~1 trillion tokens of text
Months of compute time
Thousands of NVIDIA A100 GPUs
Estimated cost: $50–100 million

Llama 3.1 405B was trained on:

15 trillion tokens
16,000 H100 GPUs
~6 months

This is why you don’t train frontier models from scratch. You use them as base models and adapt them (through prompting, fine-tuning, RAG) for your use case.

Pre-training, SFT, and RLHF

Modern LLMs are trained in stages:

Stage 1 — Pre-training: Self-supervised learning on massive text corpora. The model learns to predict the next token. This is where the model acquires its knowledge and capabilities.

Stage 2 — Supervised Fine-Tuning (SFT): Training on human-written instruction-response pairs. This teaches the model to follow instructions, not just complete text.

Stage 3 — RLHF / DPO: Training using human preference data. Human raters compare model outputs and choose the better one. The model learns to produce outputs humans prefer — safer, more helpful, better formatted.

Pre-trained Base Model → SFT → RLHF/DPO → Chat Model (what you use via API)

Understanding these stages matters for engineering decisions: when you fine-tune, you’re typically starting from a chat model (post-RLHF) and adding domain-specific knowledge or behavior on top.

Part 6: Scaling Laws — Why Bigger Is (Usually) Better

In 2020, OpenAI researchers discovered a remarkable empirical regularity: LLM performance improves predictably as you scale up model size, dataset size, and compute — following smooth power laws.

This is captured in Chinchilla scaling laws (2022, DeepMind):

Loss ∝ N^(-α) × D^(-β)

Where:
  N = number of model parameters
  D = number of training tokens
  α, β ≈ 0.5 (empirically)

Key implication: For a given compute budget, you should train a smaller model on more data, rather than a larger model on less data. Chinchilla (70B parameters, 1.4T tokens) outperformed GPT-3 (175B parameters, 300B tokens) — with the same compute budget.

This has driven the industry toward:

Training 7B-70B models on 2–15 trillion tokens (Llama 3, Qwen, Mistral)
Investing heavily in data quality, not just scale
“Inference-time compute” — spending more compute at generation time rather than training time (reasoning models)

The Emergence Phenomenon

One of the most striking aspects of scaling is emergence — the sudden appearance of new capabilities at scale thresholds that weren’t present in smaller models.

Examples of emergent capabilities:

Multi-step arithmetic (emerged around 100B parameters)
Chain-of-thought reasoning (emerged in the 100B+ range)
Instruction following without fine-tuning (appeared in very large models)
In-context learning from a few examples

This has a direct engineering implication: a task that a small model handles poorly might be solved by a larger model — sometimes without any changes to your prompt or system.

Part 7: How This Knowledge Makes You a Better Engineer

Insight 1: Why Prompt Structure Matters

The attention mechanism means that every token in your prompt can influence every other token’s representation. Long, unfocused prompts create noise. Precise, structured prompts help the model allocate attention to what matters.

# ❌ Noisy prompt — attention is diffuse
prompt = """
Hello, I was wondering if you could maybe help me with something. 
I have a lot of text data and I want to understand what the main 
topics are. Could you help? The text is customer feedback for 
our software product. If that's possible, that would be great!
"""
# ✅ Structured prompt - attention is focused
prompt = """
Task: Identify the top 3 topics in customer feedback.
Format: Return a JSON array of {topic, frequency_estimate, example_quote}
Customer feedback:
{feedback_text}
"""

Insight 2: Why LLMs Hallucinate

Hallucination emerges from the training objective: predict the next most probable token. If the model has seen many texts that confidently state something, it will confidently state it — regardless of whether it’s true. The model has no explicit “I don’t know” signal; it has learned to produce fluent, confident text.

Mitigation: RAG (retrieve ground truth before generating), temperature reduction (more conservative sampling), and structured output with citations.

Insight 3: Why Context Position Matters

Research shows that LLMs pay more attention to information at the beginning and end of the context window — the “lost in the middle” problem. If you have critical information, put it at the top of your system prompt or just before the user query — not buried in the middle.

Insight 4: Why Reasoning Models Work

Reasoning models (o3, Gemini 2.5 Thinking) are trained to generate extended internal reasoning before answering. This leverages a key property of Transformers: each generated token can attend to all previous tokens, including the model’s own reasoning steps. By “thinking out loud,” the model can use intermediate conclusions to arrive at better final answers — essentially using the context window as working memory.

🔍 Common Mistakes to Avoid

Mistake 1: Treating All Models as Equivalent

GPT-4o and GPT-4o-mini have fundamentally different parameter counts and capabilities. Using the cheaper model for tasks requiring deep reasoning is a common source of quality failures. Always match model capability to task complexity.

Mistake 2: Ignoring the Lost-in-the-Middle Effect

Placing your most important instructions or context in the middle of a long prompt consistently underperforms placing them at the beginning or end. Structure your prompts accordingly.

Mistake 3: Underestimating Temperature’s Effect

Temperature 0.0 is not always better. For creative tasks, summarization, and open-ended questions, some temperature (0.3–0.7) produces better outputs. For factual, structured, or code generation tasks, lower temperature (0.0–0.2) is usually preferable.

Mistake 4: Assuming More Tokens = Better Understanding

LLMs don’t “understand” your prompt better if you explain more. They pattern-match against training data. A concise, well-structured prompt often outperforms a verbose explanation of the same thing.

💼 Quick Questions

Q1: What is the attention mechanism, and why was it a breakthrough over previous architectures?

Answer: Attention allows every token in a sequence to directly attend to every other token, computing a weighted relevance score. This solves two problems of RNNs: (1) long-range dependencies — attention directly connects distant tokens regardless of position, and (2) parallelism — all attention computations happen simultaneously, enabling efficient training on large datasets.

Q2: What is the difference between pre-training, SFT, and RLHF?

Answer: Pre-training is self-supervised training on massive text corpora (next-token prediction) — this is where the model acquires general knowledge and language understanding. SFT (Supervised Fine-Tuning) trains the model on instruction-response pairs to follow instructions. RLHF (Reinforcement Learning from Human Feedback) trains the model to produce outputs that human raters prefer, improving safety, helpfulness, and formatting.

Q3: What are scaling laws, and what do they imply for AI engineering decisions?

Answer: Scaling laws describe the predictable power-law relationship between model performance and scale (parameters, data, compute). Chinchilla scaling laws showed that for a given compute budget, training a smaller model on more data often outperforms a larger model on less data. For engineers, this means: don’t assume the largest model is always best — a well-trained smaller model might outperform a larger one on your specific task, often at much lower cost.

Q4: Why do LLMs hallucinate, and what are the architectural reasons?

Answer: LLMs are trained to predict the next most probable token based on patterns in training data. They have no explicit truth-checking mechanism or “I don’t know” signal — they always produce the statistically most likely continuation. Hallucination occurs when the model generates a plausible-sounding but factually incorrect sequence. Mitigation strategies include RAG (grounding in retrieved documents), lower temperature, and structured output with mandatory source citation.

Q5: What is the “lost in the middle” problem?

Answer: Research shows that LLMs pay disproportionately more attention to information at the beginning and end of the context window. Critical information placed in the middle of a long prompt is processed less effectively. The practical implication: put your most important instructions and context either at the start of the system prompt or immediately before the user query.

🏭 Production Considerations

Context Window as a Resource: Every token in the context window costs money and adds latency. In production systems, actively manage context — summarize conversation history, compress retrieved documents, and measure the relationship between context length and output quality.

Temperature in Production: For production AI systems handling factual queries, structured data extraction, or code generation, set temperature between 0.0 and 0.2. Higher temperatures increase variance, which is generally undesirable in consistent production pipelines.

Model Versioning: Models are updated by providers without notice. GPT-4o today may behave differently from GPT-4o in three months. Pin model versions explicitly (gpt-4o-2024-11-20, not just gpt-4o) for production systems where consistency matters.

# ✅ Version-pinned production call
response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",  # Pinned version, not just "gpt-4o"
    messages=messages,
    temperature=0.1,
    seed=42  # For reproducibility testing
)

⚡ Performance & Scalability Insights

KV Cache: Modern inference servers (vLLM, TGI) use a Key-Value cache to store attention computations for previously seen tokens. This dramatically speeds up generation for subsequent tokens and for batched requests with shared prefixes (e.g., the same system prompt).

Prefill vs. Decode: LLM inference has two phases — prefill (processing the input tokens in parallel) and decode (generating output tokens one at a time). Prefill is fast and highly parallelizable. Decode is the bottleneck. Strategies like speculative decoding address this by using a small draft model to generate candidate tokens and a large model to verify them.

Batch Size and Throughput: For server-side AI applications, increasing batch size (serving multiple requests together) significantly improves GPU utilization and throughput — at the cost of per-request latency. This tradeoff is fundamental to AI infrastructure design.

🔑 Key Takeaways

LLMs are next-token predictors trained at scale. The sophistication of their outputs emerges from the scale of training, the richness of training data, and the power of the attention mechanism — not from explicit rules or knowledge encoding.
Self-attention is the core innovation. By allowing every token to attend to every other token, Transformers capture long-range relationships efficiently and in parallel — solving the fundamental limitation of prior architectures.
Training has stages that shape behavior. Pre-training builds knowledge. SFT builds instruction-following. RLHF/DPO aligns the model with human preferences. Understanding these stages helps you reason about why models behave the way they do.
The “lost in the middle” effect is real and consequential. Put critical context at the beginning or end of your prompt. This is one of the most actionable insights from LLM internals research.
Scaling laws are predictive. The relationship between model size, data, and performance is empirical and consistent. This drives hardware investment, training strategy, and the cost structure of the entire AI industry.

📚 Further Reading & Resources

The Illustrated Transformer (Jay Alammar) — The best visual explanation of Transformers ever written
“Attention Is All You Need” (Vaswani et al., 2017) — The original paper
“Training Compute-Optimal Large Language Models” (Hoffmann et al., 2022) — The Chinchilla scaling laws paper
“Lost in the Middle” (Liu et al., 2023) — The research paper on context position effects
Andrej Karpathy’s “Let’s Build GPT” — The best hands-on video for understanding Transformers from scratch

Day 2 — How Large Language Models Actually Work?

Why This Day Matters

Part 1: What Is a Neural Network?

The Biological Inspiration (and Its Limits)

The Building Block: A Single Neuron

How Networks Learn: The Training Loop

Part 2: The Transformer Architecture

Before Transformers: The Problem with Sequential Processing

The Transformer: A High-Level View

Tokenization: How Text Becomes Numbers

Embeddings: Meaning as Geometry

Positional Encoding: Teaching Order Without Sequential Processing

Part 3: Self-Attention — The Core Breakthrough

The Intuition

The Mechanics: Q, K, V

Multi-Head Attention: Learning Multiple Relationships

Part 4: The Full Forward Pass

Part 5: What “Training” Really Means at Scale

The Scale of Modern Training

Pre-training, SFT, and RLHF

Part 6: Scaling Laws — Why Bigger Is (Usually) Better

The Emergence Phenomenon

Part 7: How This Knowledge Makes You a Better Engineer

Insight 1: Why Prompt Structure Matters

Insight 2: Why LLMs Hallucinate

Insight 3: Why Context Position Matters

Insight 4: Why Reasoning Models Work

🔍 Common Mistakes to Avoid

Mistake 1: Treating All Models as Equivalent

Mistake 2: Ignoring the Lost-in-the-Middle Effect

Mistake 3: Underestimating Temperature’s Effect

Mistake 4: Assuming More Tokens = Better Understanding

💼 Quick Questions

🏭 Production Considerations

⚡ Performance & Scalability Insights

🔑 Key Takeaways

📚 Further Reading & Resources

Related

Leave a Reply Cancel reply

Why This Day Matters

Part 1: What Is a Neural Network?

The Biological Inspiration (and Its Limits)

The Building Block: A Single Neuron

How Networks Learn: The Training Loop

Part 2: The Transformer Architecture

Before Transformers: The Problem with Sequential Processing

The Transformer: A High-Level View

Tokenization: How Text Becomes Numbers

Embeddings: Meaning as Geometry

Positional Encoding: Teaching Order Without Sequential Processing

Part 3: Self-Attention — The Core Breakthrough

The Intuition

The Mechanics: Q, K, V

Multi-Head Attention: Learning Multiple Relationships

Part 4: The Full Forward Pass

Part 5: What “Training” Really Means at Scale

The Scale of Modern Training

Pre-training, SFT, and RLHF

Part 6: Scaling Laws — Why Bigger Is (Usually) Better

The Emergence Phenomenon

Part 7: How This Knowledge Makes You a Better Engineer

Insight 1: Why Prompt Structure Matters

Insight 2: Why LLMs Hallucinate

Insight 3: Why Context Position Matters

Insight 4: Why Reasoning Models Work

🔍 Common Mistakes to Avoid

Mistake 1: Treating All Models as Equivalent

Mistake 2: Ignoring the Lost-in-the-Middle Effect

Mistake 3: Underestimating Temperature’s Effect

Mistake 4: Assuming More Tokens = Better Understanding

💼 Quick Questions

🏭 Production Considerations

⚡ Performance & Scalability Insights

🔑 Key Takeaways

📚 Further Reading & Resources

Related

You May Also Like

Day 3 — The Transformer Architecture Deep Dive

Day 1 — Welcome to the AI Era: The 2026 Landscape

Leave a Reply Cancel reply