Day 3 — The Transformer Architecture Deep Dive

Neeraj Kushwaha

“Architecture is destiny. The design decisions made in 2017 — self-attention, residual connections, layer normalization — are still the dominant paradigm in 2026. Understanding them is understanding the foundation of modern AI.”

Why This Day Matters

On Day 2, we covered the concepts behind Transformers — attention, training, scaling. Today we go into the engineering specifics of the architecture itself.

This matters for several concrete reasons:

When you choose between models, you’re choosing between architectural decisions. When you read a model card saying “32 layers, 4096 hidden dim, 32 attention heads,” you need to know what that means for capability and cost. When you deploy models at scale, understanding the architecture helps you predict memory requirements and optimize inference. And when you fine-tune, the architecture determines what you can modify and how.

This is the day the Transformer stops being a black box and becomes a system you can reason about.

Part 1: The Three Transformer Families

The original 2017 Transformer was an encoder-decoder model designed for translation: encode an English sentence, decode its German or French translation. Since then, the architecture has split into three distinct families with different strengths.

┌──────────────────────────────────────────────────────────────────┐
│                    TRANSFORMER FAMILIES                          │
├──────────────────┬──────────────────┬────────────────────────────┤
│   ENCODER-ONLY   │  DECODER-ONLY    │    ENCODER-DECODER         │
│   (Reader)       │  (Writer)        │    (Translator)            │
├──────────────────┼──────────────────┼────────────────────────────┤
│ BERT, RoBERTa,   │ GPT-4, Claude,   │ T5, BART, mT5,             │
│ DeBERTa          │ Llama, Gemini,   │ Original Transformer       │
│                  │ Mistral          │                            │
├──────────────────┼──────────────────┼────────────────────────────┤
│ Best for:        │ Best for:        │ Best for:                  │
│ Classification   │ Generation       │ Translation                │
│ NER              │ Chat             │ Summarization              │
│ Embeddings       │ Code             │ QA (seq2seq)               │
│ Search           │ Reasoning        │                            │
└──────────────────┴──────────────────┴────────────────────────────┘

Encoder-Only: Built for Understanding

Encoder-only models (BERT and its descendants) process the entire input at once, in both directions: each token attends to every other token, including those that come after it.

This makes them excellent at understanding tasks: sentiment classification, named entity recognition, semantic search. The model builds a rich representation of the entire input.

Why they’re not used for chat: These models are trained to fill in masked tokens using context on both sides, not to predict the next token from the left context alone. At generation time the tokens to the right don’t exist yet, so there is no natural way to produce text one token at a time.

Decoder-Only: Built for Generation

Decoder-only models (GPT-4, Claude, Llama, Gemini) generate text one token at a time using causal attention: each token can attend only to itself and the tokens before it. This is implemented with a causal mask that sets the attention scores for future positions to negative infinity before the softmax, so their attention weights become exactly zero.

"The cat sat on the mat"
Token "sat" can attend to: ["The", "cat", "sat"]
Token "sat" CANNOT attend to: ["on", "the", "mat"]

This constraint is what enables autoregressive generation: generate one token, append it to the context, generate the next. Repeat.
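A minimal sketch of the causal mask, in plain Python so it runs without dependencies. The function name causal_attention_weights and the uniform score values are illustrative assumptions, not taken from any particular model. Future positions are set to negative infinity before the softmax, so their weights come out as zero.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix.

    scores[i][j] is the raw attention score of query token i for key token j.
    Entries with j > i (future tokens) are masked to -inf before the softmax.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

tokens = ["The", "cat", "sat", "on", "the", "mat"]
# Hypothetical raw scores: every pair gets the same score of 1.0
scores = [[1.0] * len(tokens) for _ in tokens]
weights = causal_attention_weights(scores)

sat = tokens.index("sat")
visible = [tokens[j] for j, w in enumerate(weights[sat]) if w > 0]
print(visible)  # ['The', 'cat', 'sat']
```

Real implementations build the mask once as a lower-triangular matrix and apply it to whole batches of scores, but the effect on each row is the same as in this loop.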

This is the dominant architecture for nearly all frontier chat and generation models in 2026.

Encoder-Decoder: Built for Transformation

Encoder-decoder models (T5, BART) combine both approaches. The encoder builds a rich representation of the input, and the decoder generates the output while attending to both the encoder output and the previously generated output tokens.

The decoder uses cross-attention — attention over the encoder’s representations — in addition to causal self-attention. This is powerful for translation and summarization but less flexible than pure decoders for general-purpose generation.

T5 is worth studying because it reframes everything as text-to-text: classification, QA, summarization — all become “generate the answer as text.” This philosophical shift influenced how we think about language models today.

Part 2: The Anatomy of a Decoder-Only Transformer

Since decoder-only is the dominant architecture for LLMs, let’s dissect it completely.

The Residual Stream

A Transformer layer doesn’t replace the token representations; it adds to them. This is the residual connection, one of the most important innovations in deep learning: instead of overwriting information, each layer adds a refinement to the existing representation.

x_output = x_input + Layer(x_input)
Instead of: x_output = Layer(x_input)

Why does this matter?

Gradient flow: During backpropagation, gradients can flow directly through the residual connection without being transformed, which largely mitigates the vanishing gradient problem in deep networks.
Additive structure: You can think of each layer as adding information to a shared “residual stream.” Each layer reads from this stream, computes a delta, and writes back. This conceptual model (popularized by interpretability researchers at Anthropic) is extremely useful for thinking about mechanistic interpretability.
Depth without degradation: Residual connections are why you can train networks with 80, 100, or 200 layers without performance degrading.

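import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Pre-norm Transformer block: x + Attn(Norm(x)), then x + FFN(Norm(x))."""

    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        # PyTorch built-ins stand in here for custom attention / FFN classes
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model); in a boolean mask, True means "may not attend"
        # Residual connection around attention (normalize first: pre-norm)
        h = self.norm1(x)
        attn_out, _ = self.attention(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out

        # Residual connection around feed-forward
        x = x + self.feed_forward(self.norm2(x))

        return x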

Layer Normalization

Layer normalization normalizes the activations across the feature dimension (not the batch dimension, which is what BatchNorm does). This stabilizes training by keeping activation values in a reasonable range.

LayerNorm(x) = γ × (x - μ) / √(σ² + ε) + β
Where:
  μ = mean of x across features
  σ² = variance of x across features
  γ, β = learned scale and shift parameters
  ε = small constant for numerical stability
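As a hedged illustration, here is layer normalization for a single token in plain Python, using the common √(variance + ε) form (the one PyTorch’s nn.LayerNorm uses). The function name layer_norm and the input values are made up for this sketch.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector, then apply learned scale and shift."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d  # population variance over features
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

x = [2.0, 4.0, 6.0, 8.0]   # one token with 4 features (made-up values)
gamma = [1.0] * 4          # identity scale
beta = [0.0] * 4           # zero shift
y = layer_norm(x, gamma, beta)
print([round(v, 3) for v in y])  # [-1.342, -0.447, 0.447, 1.342]
```

With γ = 1 and β = 0 the output has mean 0 and variance close to 1, regardless of the scale of the input. The statistics are computed per token, so the result does not depend on what else is in the batch.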

Pre-norm vs. Post-norm: The original Transformer applied normalization after each sublayer (post-norm). Modern LLMs such as Llama and Mistral apply it before each sublayer (pre-norm), as the TransformerLayer code above does. Pre-norm has been shown to train more stably at large scale, which is why it is the default in most openly documented LLMs.

Feed-Forward Networks: More Than You’d Expect

After each attention block, every token’s representation passes through a position-wise feed-forward network. This is a two-layer MLP applied independently to each token:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

Modern models use SwiGLU (Llama, Mistral) or GELU activations instead of ReLU:

SwiGLU(x) = (SiLU(xW₁) ⊙ xW₃) W₂
Where SiLU(z) = z × σ(z) and σ is the sigmoid function

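In the original design the FFN dimension is 4× the model dimension: for a model with d_model=4096, the FFN hidden size is 16,384. SwiGLU models have three weight matrices instead of two, so they usually pick a smaller multiplier (Llama 3.1 70B uses 28,672 = 3.5 × 8,192). Either way, the FFN is the largest component by parameter count in most Transformers, containing roughly two-thirds or more of the per-layer parameters.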
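To make the gated FFN concrete, here is a small sketch in plain Python. It assumes the usual SwiGLU form (SiLU on a gate projection, multiplied elementwise by an up projection, followed by a down projection, with no biases). The names w_gate, w_up, w_down and the tiny matrices are illustrative, not from any real model.

```python
import math

def silu(z):
    """SiLU / Swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (SiLU(x W_gate) * (x W_up)) W_down, without biases."""
    gate = [silu(g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down)

def ffn_params(d_model, d_ff, gated):
    """Weight count of one FFN block (biases ignored)."""
    return (3 if gated else 2) * d_model * d_ff

# Tiny made-up example: d_model = 2, d_ff = 3
x = [1.0, -1.0]
w_gate = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
w_up = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = swiglu_ffn(x, w_gate, w_up, w_down)
print([round(v, 4) for v in out])

print(ffn_params(4096, 16384, gated=False))  # classic two-matrix FFN at 4x width
print(ffn_params(8192, 28672, gated=True))   # Llama 3.1 70B dimensions from Part 3
```

The parameter counts show why gated FFNs use a hidden size below 4×: the third matrix adds 50% more weights at the same width.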

What do FFNs actually do? Research suggests that FFNs function as “key-value memories” — they store factual associations learned during training. When the model knows that “Paris is the capital of France,” that knowledge is likely encoded in the FFN weights. This has direct implications for why RAG and fine-tuning work the way they do.

Part 3: Architectural Dimensions and What They Mean

When you read a model card, you’ll see parameters like these (from Llama 3.1 70B):

Model Architecture: Llama (Decoder-only Transformer)
Hidden Dimension (d_model): 8192
Number of Layers: 80
Number of Attention Heads: 64
Number of KV Heads: 8          # GQA
FFN Hidden Dimension: 28672
Vocabulary Size: 128,256
Context Length: 131,072 tokens
Activation: SwiGLU
Position Embedding: RoPE

Let’s decode each dimension:

Hidden Dimension (d_model)

The size of each token’s representation vector. Larger = richer representation = more expressive = more parameters.

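Parameter count ≈ 12 × L × d_model²
Where L = number of layers
(per layer: 4 × d_model² for attention plus 8 × d_model² for a 4× FFN; embeddings ignored)

For Llama 3.1 70B: 12 × 80 × 8192² ≈ 64 billion parameters, within about 10% of the actual 70B. The gap comes from the embedding matrices and from Llama’s GQA and SwiGLU layers, which do not follow the 4 + 8 split exactly.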
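The rule of thumb can be checked against a per-matrix count. The sketch below uses the Llama 3.1 70B dimensions from the model card above and assumes untied input and output embeddings and no biases; the function name llama_style_param_count is made up for this example.

```python
def llama_style_param_count(d_model, n_layers, n_heads, n_kv_heads, d_ff, vocab_size):
    """Approximate weight count of a dense decoder-only model with GQA and SwiGLU.

    Ignores the small normalization weights and assumes untied input and output
    embeddings (an assumption of this sketch).
    """
    d_head = d_model // n_heads
    attn = 2 * d_model * d_model                  # W_q and W_o
    attn += 2 * d_model * (n_kv_heads * d_head)   # W_k and W_v (smaller under GQA)
    ffn = 3 * d_model * d_ff                      # gate, up, and down projections
    embeddings = 2 * vocab_size * d_model         # token embedding + output head
    return n_layers * (attn + ffn) + embeddings

rule_of_thumb = 12 * 80 * 8192**2
detailed = llama_style_param_count(8192, 80, 64, 8, 28672, 128256)
print(f"12 * L * d_model^2: {rule_of_thumb / 1e9:.1f}B")  # 64.4B
print(f"per-matrix count:   {detailed / 1e9:.1f}B")       # 70.6B
```

The quick formula is good enough for sizing decisions; the detailed count is what you need when the attention or FFN shapes depart from the classic layout.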

Number of Layers (Depth)

Depth adds the ability to compose representations — each layer can compute more abstract functions of the previous layer’s output. Very deep models can solve more complex problems.

However, depth has diminishing returns beyond a point — width (d_model) eventually matters more.

Attention Heads

More heads = more types of relationships the model can learn to attend to simultaneously. GPT-3 has 96 heads. Llama 3.1 70B has 64. Each head operates on a d_head = d_model / n_heads dimensional subspace.

Grouped Query Attention (GQA)

A key optimization in modern models. Standard multi-head attention (MHA) has one K and V matrix per head. Multi-query attention (MQA) shares one K, V pair across all heads. Grouped Query Attention (GQA) is a middle ground — groups of heads share K, V matrices.

MHA:  Q=64 heads, K=64 heads, V=64 heads
MQA:  Q=64 heads, K=1 head,   V=1 head
GQA:  Q=64 heads, K=8 heads,  V=8 heads  (Llama 3.1)

Why this matters in production: GQA dramatically reduces the KV cache size during inference — the memory required to store past token representations. For a 70B model serving thousands of concurrent users, this is the difference between feasible and infeasible deployment.
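A short calculation shows the size of the effect. It uses the Llama 3.1 70B dimensions from the model card above (80 layers, d_head = 8192 / 64 = 128) and assumes 2 bytes per value (fp16/bf16); the function name is illustrative.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys plus values, for every layer."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

# Same model, three attention variants that differ only in the number of KV heads
for name, n_kv_heads in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    per_token = kv_cache_bytes_per_token(80, n_kv_heads, 128)
    per_8k_context_gib = per_token * 8192 / 1024**3
    print(f"{name}: {per_token / 1024:.0f} KiB per token, "
          f"{per_8k_context_gib:.2f} GiB per 8K-token sequence")
# MHA: 2560 KiB per token, 20.00 GiB per 8K-token sequence
# GQA: 320 KiB per token, 2.50 GiB per 8K-token sequence
# MQA: 40 KiB per token, 0.31 GiB per 8K-token sequence
```

Multiply the per-sequence figure by the number of concurrent requests to see why the KV head count, not the query head count, drives serving memory.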

Part 4: Key Architectural Innovations in 2024–2026

RoPE: Rotary Position Embeddings

The original Transformer used fixed sinusoidal position encodings. Modern models use RoPE — a position encoding applied inside the attention computation that encodes relative positions rather than absolute ones.

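import torch

def rotate_half(x):
    """Map the two halves (x1, x2) of the last dimension to (-x2, x1)."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, positions, head_dim):
    """Apply Rotary Position Embeddings to query and key tensors.

    q, k: (..., seq_len, head_dim); positions: (seq_len,) tensor of token indices.
    """
    # One rotation frequency per pair of dimensions
    inv_freq = 1.0 / (10000 ** (2 * torch.arange(head_dim // 2) / head_dim))
    theta = positions[:, None].float() * inv_freq   # (seq_len, head_dim // 2)
    # Duplicate the angles so they line up with both halves of the head dimension
    theta = torch.cat((theta, theta), dim=-1)       # (seq_len, head_dim)
    cos, sin = theta.cos(), theta.sin()

    # Rotate query and key
    q_rotated = q * cos + rotate_half(q) * sin
    k_rotated = k * cos + rotate_half(k) * sin
    return q_rotated, k_rotated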

Why RoPE enables long contexts: Unlike absolute position encodings that degrade at positions the model wasn’t trained on, RoPE can be extended to longer sequences (with some fine-tuning). This is why Llama 3.1 can handle 128K tokens and why extended context fine-tuning is feasible.
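The relative-position property can be seen with a single pair of dimensions. In this plain-Python sketch (function names, the angle step theta, and the vectors are all made up), the query and key are each rotated by an angle proportional to their position, and the resulting attention score depends only on the distance between them.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector (one RoPE dimension pair) by the given angle."""
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def rope_score(q, k, q_pos, k_pos, theta=0.1):
    """Dot product of a query and a key after position-dependent rotation."""
    qr = rotate(q, q_pos * theta)
    kr = rotate(k, k_pos * theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 2.0), (0.5, -1.0)   # made-up query and key values
near = rope_score(q, k, q_pos=5, k_pos=2)
far = rope_score(q, k, q_pos=1005, k_pos=1002)
print(math.isclose(near, far, abs_tol=1e-9))  # True: both pairs are 3 tokens apart
```

Because only the offset matters, the model never has to learn what an absolute position means, which is part of why rescaling the rotation frequencies can stretch the usable context.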

Mixture-of-Experts (MoE)

MoE replaces the FFN layer with multiple “expert” FFN networks and a router that selects which experts to activate for each token. Only 2–8 of N experts are active per token.

Standard FFN: Active parameters = total parameters
MoE FFN:      Active expert parameters ≈ (k/N) × total expert parameters
Where k = experts activated per token, N = total experts
(attention, embedding, and any shared-expert weights are always active)

DeepSeek-V3 (the base model behind DeepSeek-R1) has 671B total parameters, but only ~37B are active per token. This gives it the capacity of a much larger model at roughly the per-token compute of a 37B dense model. GPT-4 is widely rumored, though not confirmed, to use a similar architecture.

Engineering implication: MoE models have much lower per-token compute relative to their total parameter count, but all expert weights must still be loaded into GPU memory. Serving them therefore takes as much memory (and as many GPUs) as a dense model of the same total size, even though only a fraction of the weights is used for any given token.
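A toy router makes the mechanism concrete. This sketch assumes top-k routing with a softmax over the selected experts’ scores, which is one common scheme rather than the method of any specific model. The experts here are simple scaling functions standing in for FFNs, and the router scores are made up.

```python
import math

def route_token(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

def moe_ffn(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts; the rest are skipped."""
    gates = route_token(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())

# Eight toy "experts": expert i multiplies its scalar input by i + 1
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # made-up router scores

gates = route_token(router_logits)
print(gates)                                  # {1: 0.5, 4: 0.5}
print(moe_ffn(10.0, experts, router_logits))  # 0.5 * 20 + 0.5 * 50 = 35.0
```

Six of the eight experts are never called for this token, which is where the compute saving comes from, yet all eight must exist in memory because the next token may route elsewhere.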

Sliding Window Attention

Full attention is quadratic in sequence length, O(n²), which becomes expensive for very long contexts. Mistral 7B uses sliding window attention (an idea that goes back to earlier long-document models such as Longformer), where each token attends only to the W most recent tokens, making attention O(n×W).

Full Attention:          Token attends to all N previous tokens → O(N²)
Sliding Window (W=4096): Token attends to 4096 previous tokens → O(N×W)

This enables efficient processing of very long documents at the cost of not attending to the very distant past — acceptable for many tasks, but less suitable for tasks requiring global context.
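The saving is easy to count. This sketch assumes the window of W tokens includes the current token itself; the function names are illustrative.

```python
def attended_positions(query_pos, window=None):
    """Key positions a causal query may attend to, optionally limited to a window."""
    start = 0 if window is None else max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))

def attention_pairs(seq_len, window=None):
    """Total number of (query, key) pairs scored across the whole sequence."""
    return sum(len(attended_positions(i, window)) for i in range(seq_len))

print(attended_positions(6, window=3))      # [4, 5, 6]
print(attention_pairs(8192))                # 33558528 (full causal attention)
print(attention_pairs(8192, window=1024))   # 7864832 (sliding window)
```

Information from outside the window can still reach a token indirectly: each layer looks back W positions, so stacked layers widen the effective receptive field.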

Part 5: Comparing Major Model Architectures

GPT-4 Family (OpenAI)

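Exact architecture is proprietary. The details below come from unconfirmed reports and should be treated as rumor:

Likely MoE, reportedly with ~8 experts and 2 active per token
~1.8 trillion total parameters (unverified)
Attention details (for example, whether GQA is used) are undisclosed
Tokenizer: cl100k_base (~100K vocabulary), published in OpenAI’s tiktoken library
Context: 128K tokens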

Claude Family (Anthropic)

Also proprietary, but:

Decoder-only Transformer
Strong Constitutional AI alignment training
Anthropic has published research on mechanistic interpretability of their models
Context: 200K tokens (Claude 3 and later)

Llama 3.1 (Meta, Open Source)

Fully documented architecture:

Decoder-only, dense (not MoE) Transformer
Pre-norm with RMSNorm (faster than LayerNorm)
SwiGLU activation
GQA for efficient serving
RoPE for long context
Vocabulary: 128K tokens
Available: 8B, 70B, 405B

Mistral/Mixtral (Mistral AI)

Mistral 7B: Dense, GQA, sliding window attention
Mixtral 8×7B: MoE with 8 experts, 2 active, ~12.9B active params
Efficient for deployment due to small active parameter count

DeepSeek-V3 / R1

671B total parameters, ~37B active (MoE)
R1 adds strong reasoning capabilities (trained on math and code)
Open weights; R1 is released under the MIT license
Competitive with GPT-4o on many benchmarks at much lower inference cost

Part 6: Practical Architecture Decisions for Engineers

How to Estimate Memory Requirements

import math

def estimate_gpu_memory_gb(
    params_billion: float,
    precision: str = "fp16",
    kv_cache_tokens: int = 4096,
    batch_size: int = 1
) -> dict:
    """
    Estimate GPU memory requirements for serving an LLM.
    
    Args:
        params_billion: Model size in billions of parameters
        precision: "fp32", "fp16", "bf16", "int8", or "int4"
        kv_cache_tokens: Context length for KV cache
        batch_size: Number of concurrent requests
    
    Returns:
        Dictionary with memory estimates
    """
    bytes_per_param = {
        "fp32": 4,
        "fp16": 2,
        "bf16": 2,
        "int8": 1,
        "int4": 0.5
    }[precision]
    
    # Model weights
    model_memory_gb = (params_billion * 1e9 * bytes_per_param) / (1024**3)
    
    # KV cache (rough estimate)
    # Each token requires 2 × num_layers × n_kv_heads × d_head × bytes
    # (about 0.3 MB for Llama 3.1 70B with GQA at fp16). This heuristic uses a
    # conservative 0.5 MB per token for a 70B model and scales it linearly.
    kv_cache_mb_per_token = 0.5 * (params_billion / 70)  # Scale with model size
    kv_cache_gb = (kv_cache_mb_per_token * kv_cache_tokens * batch_size) / 1024
    
    # Activations and overhead (~20% of model memory)
    overhead_gb = model_memory_gb * 0.2
    
    total_gb = model_memory_gb + kv_cache_gb + overhead_gb
    
    # Round the GPU count up so an exact multiple of 80 GB is not over-counted
    num_gpus = max(1, math.ceil(total_gb / 80))
    
    return {
        "model_weights_gb": round(model_memory_gb, 1),
        "kv_cache_gb": round(kv_cache_gb, 1),
        "overhead_gb": round(overhead_gb, 1),
        "total_gb": round(total_gb, 1),
        "recommended_gpu": f"{num_gpus}× H100 80GB"
    }

# Examples
print(estimate_gpu_memory_gb(7, "fp16"))
# {'model_weights_gb': 13.0, 'kv_cache_gb': 0.2, 'overhead_gb': 2.6,
#  'total_gb': 15.8, 'recommended_gpu': '1× H100 80GB'}
print(estimate_gpu_memory_gb(70, "fp16", kv_cache_tokens=8192, batch_size=32))
# {'model_weights_gb': 130.4, 'kv_cache_gb': 128.0, 'overhead_gb': 26.1,
#  'total_gb': 284.5, 'recommended_gpu': '4× H100 80GB'}

How to Choose Between Architectures

Decision Framework: Which model architecture for my use case?
1. Need to generate text? → Decoder-only (GPT, Claude, Llama, Mistral)
2. Need embeddings / classification? → Encoder-only (BERT, E5, BGE)
   OR use decoder models with embedding fine-tuning
3. Need translation / structured transformation? → Encoder-decoder (T5, BART)
   OR use a strong decoder-only model with prompting (often simpler)
4. Constrained on cost per query? → MoE models (DeepSeek, Mixtral)
   OR quantized smaller dense models (Llama 3.1 8B Q8)
5. Need very long context? → Models with RoPE + long-context fine-tuning
   (Llama 3.1 128K, Claude 200K, Gemini 1.5 1M)

🔍 Common Mistakes to Avoid

Mistake 1: Confusing Total Parameters with Active Parameters

A 671B MoE model (DeepSeek) uses ~37B parameters per token. A 70B dense model (Llama 3.1 70B) uses all 70B. The MoE model may be faster per token at inference despite having nearly 10× more total parameters, but it still needs memory for all 671B.

Mistake 2: Ignoring Memory Bandwidth

GPU memory bandwidth (not just capacity) is often the bottleneck for LLM inference. An H100’s 3.35 TB/s bandwidth determines how fast weights can be read. This is why quantization (reducing precision) speeds up inference — fewer bytes to read.
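A back-of-the-envelope bound illustrates the point. It assumes single-sequence decoding where every generated token requires reading all weights from GPU memory once, and it ignores KV cache reads and compute time, so real throughput is lower. The function name and the 8B model size are chosen for illustration.

```python
def max_decode_tokens_per_second(params_billion, bytes_per_param, bandwidth_tb_s=3.35):
    """Upper bound on single-sequence decode speed when reading weights dominates.

    Each generated token streams all (active) weights from GPU memory once, so
    throughput cannot exceed memory bandwidth divided by model size in bytes.
    """
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# An 8B dense model on one H100 (3.35 TB/s), at three weight precisions
for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    bound = max_decode_tokens_per_second(8, bytes_per_param)
    print(f"8B model, {precision}: at most ~{bound:.1f} tokens/s per sequence")
```

Halving the bytes per parameter doubles the bound, which is the bandwidth argument for quantization. Batching raises total throughput because one pass over the weights serves many sequences.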

Mistake 3: Not Accounting for KV Cache in Memory Planning

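The KV cache grows linearly with sequence length and batch size. For Llama 3.1 70B at fp16 (80 layers, 8 KV heads, d_head = 128), each token takes about 320 KiB, so 100 concurrent users at 8K context need roughly 250 GB for the KV cache alone, more than the model weights. Always budget for this in production deployments.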

Mistake 4: Assuming Bigger Always Means Smarter

Architectural choices matter as much as scale. A well-trained Mistral 7B with GQA and SwiGLU can outperform models about twice its size (its authors reported it beating Llama 2 13B on the benchmarks they tested). Quality of training data and alignment procedure often matter more than raw parameter count.

💼 Quick Questions

Q1: What is the difference between encoder-only and decoder-only Transformers? Which architecture do GPT-4 and BERT use?

Answer: Encoder-only (BERT): bidirectional attention, each token attends to all tokens. Best for understanding tasks (classification, embeddings). Decoder-only (GPT-4): causal attention, each token attends only to previous tokens. Best for generation tasks (chat, code, reasoning). GPT-4 uses decoder-only; BERT uses encoder-only.

Q2: What is Grouped Query Attention (GQA) and why does it matter for production inference?

Answer: GQA shares Key and Value matrices across groups of Query heads, reducing the KV cache size. In standard MHA with 64 heads, you store 64 K and V vectors per token per layer. With GQA (8 KV heads), you store only 8 — an 8× memory reduction. This directly reduces GPU memory requirements for serving, enabling larger batch sizes or longer contexts with the same hardware.

Q3: What is the KV cache, and how does it affect scaling LLM serving?

Answer: The KV cache stores the Key and Value tensors computed for all previous tokens during autoregressive generation. Without it, every generation step would recompute the keys and values for the entire prefix. With it, each step computes Q, K, and V only for the new token and attends over the cached entries, so the per-step attention cost is O(n) rather than O(n²). The cost is memory: the KV cache scales with 2 × batch_size × context_length × num_layers × n_kv_heads × d_head, and can exceed model weight memory at scale.

Q4: What is Mixture-of-Experts and why is it used in frontier models?

Answer: MoE replaces the dense FFN layer with N expert FFN networks plus a router. Only k experts (for example, 2 of 8 in Mixtral) are activated per token. This allows a model to have much higher total parameter count (and thus capacity) while maintaining manageable per-token compute cost. DeepSeek-V3 has 671B total parameters but only ~37B active per token, giving frontier-quality outputs at a fraction of the inference compute of a comparable dense model.

Q5: What is the “residual stream” interpretation of Transformers?

Answer: Rather than viewing each Transformer layer as a complete transformation, the residual stream view sees each layer as reading from and writing to a shared vector stream. Each attention and FFN block reads the current stream, computes a small update (delta), and adds it back. This additive structure means layers are composable and interpretable — useful for mechanistic interpretability research and for understanding why residual connections are so important for deep network training.

🏭 Production Considerations

Tensor Parallelism: For models larger than a single GPU, tensor parallelism splits each matrix multiplication across multiple GPUs. A 70B model might use 2–4× H100 80GB GPUs with tensor parallelism, each GPU holding a shard of every weight matrix.

Pipeline Parallelism: Alternatively, pipeline parallelism distributes layers across GPUs. GPU 0 runs layers 1–20, GPU 1 runs layers 21–40, etc. This is simpler to implement but introduces pipeline bubbles (idle time between microbatches).

Quantization Impact on Architecture: Not all parts of a model quantize equally. Attention weights are more sensitive to quantization than FFN weights. The first and last layers are generally kept at higher precision. Modern quantization schemes (AWQ, GPTQ, EXL2) account for these sensitivities.

⚡ Performance & Scalability Insights

The Hardware Lottery: The dominance of Transformer architecture is partly a “hardware lottery” — these architectures happen to be extremely efficient on GPUs due to their reliance on matrix multiplications (GEMM operations), which GPUs were already optimized to perform at scale. Alternative architectures (SSMs, RNNs) may be equally capable but have historically underperformed due to less efficient GPU utilization.

FlashAttention: A key optimization that computes attention in a memory-efficient way by tiling the computation to avoid materializing the full attention matrix. FlashAttention-3 (H100-optimized) reports roughly 1.5–2× speedup over FlashAttention-2 on H100s, with output mathematically identical to standard attention. Most production inference stacks use some FlashAttention variant.

Compile-Time Optimization: PyTorch’s torch.compile() can provide a noticeable inference speedup by optimizing the computation graph at compile time (figures of 20–40% are often quoted, though the gain depends heavily on the model and hardware). Combined with FlashAttention and quantization, you can often achieve 3–5× throughput improvements over naive PyTorch inference.

🔑 Key Takeaways

Three Transformer families, one dominant: Encoder-only for understanding, encoder-decoder for seq2seq, decoder-only for generation. Nearly all frontier chat models (GPT, Claude, Llama, Gemini) are decoder-only.
Residual connections are foundational. Every Transformer layer adds to (not replaces) the residual stream. This enables training of very deep networks and gives each layer a focused role.
GQA and MoE are the two architectural innovations most impacting 2026 production AI. GQA cuts KV cache memory. MoE decouples model capacity from per-token compute. Together, they’ve made frontier-quality models economically deployable.
Model size ≠ memory used ≠ compute per token. For a 671B MoE model, all three of these numbers are different. Understand the distinction to make correct infrastructure decisions.
Architecture shapes what a model can do. In today’s open models, long-context capability typically builds on RoPE, efficient serving on GQA, and high capability at low per-token cost on MoE. These are engineering choices, not magic.

📚 Further Reading & Resources

The Annotated Transformer (Harvard NLP) — Full Transformer implementation with explanations inline
Llama 3 Architecture Overview (Meta) — Real architecture details from a frontier open model
“FlashAttention: Fast and Memory-Efficient Exact Attention” (Dao et al., 2022) — The paper that changed inference optimization
GQA Paper (Ainslie et al., 2023) — The grouped query attention paper
“Mixture of Experts Explained” (Hugging Face Blog) — Excellent practical overview of MoE
