Day 9: LLM Inference Fundamentals — Temperature, Top-P, and Sampling

“The difference between a poet and a fact-checker isn’t intelligence — it’s entropy. Temperature is how you dial between the two.”

Why This Matters

You’ve learned to call the API. You’ve sent prompts. You’ve gotten responses. But most developers use the default parameters and wonder why the output is sometimes too random, sometimes too repetitive, sometimes too robotic.

Sampling parameters are the dials on your mixing board. Get them wrong and your AI application produces outputs that feel wrong to users — even if they can’t articulate why. Get them right and the output feels exactly tailored to the task.

These parameters also directly affect:

Output quality: wrong temperature → incoherent or boring responses
Reliability: high randomness in production → unpredictable behavior
Cost: verbose outputs cost more tokens
User trust: repetitive or hallucinated outputs erode confidence

Understanding what happens inside the model during generation is the difference between guessing at parameters and making informed engineering decisions.

Part 1: How LLMs Actually Generate Text

The Sampling Loop

Before you can understand temperature, you need to understand what the model is actually doing when it generates a response. Here’s the process at each step:

Input tokens → Transformer layers → Logits vector → Softmax → Probability distribution → Sample → Next token → Repeat

Let’s break each step down:

Step 1: The model produces logits

After processing your input through all transformer layers, the model outputs a vector of raw scores called logits: one score per token in the vocabulary. GPT-4’s tokenizer has a vocabulary of roughly 100,000 tokens, so at each generation step the model produces roughly 100,000 numbers.

# Conceptually (you don't call this directly in production):
# logits shape: [vocab_size], e.g. [100277]

logits = [
    2.3,   # token " the"
    1.8,   # token " a"
    4.7,   # token " Paris"
    0.1,   # token " London"
    -2.1,  # token " banana"
    # ... 100,272 more scores
]

Step 2: Softmax converts logits to probabilities

The softmax function converts the raw scores into a valid probability distribution that sums to 1.0:

import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))  # numerical stability
    return exp_logits / exp_logits.sum()

# Input logits
logits = np.array([2.3, 1.8, 4.7, 0.1, -2.1])

# Output probabilities
probs = softmax(logits)
# [0.078, 0.048, 0.864, 0.009, 0.001]
# " Paris" has 86.4% probability, so the model is very confident

Step 3: Sample from the distribution

The model samples one token from this probability distribution. This is not argmax (always picking the highest probability token) — it’s random sampling weighted by probability. This is why running the same prompt twice can give different outputs.

import random

tokens = [" the", " a", " Paris", " London", " banana"]
probs = [0.078, 0.048, 0.864, 0.009, 0.001]

# Weighted random sample
next_token = random.choices(tokens, weights=probs, k=1)[0]
# " Paris" most of the time, but occasionally " the" or " a"

Step 4: Append and repeat

The sampled token is appended to the sequence, and the entire process repeats until the model generates an end-of-sequence token or hits max_tokens.

This sampling loop is where every parameter you’re about to learn exerts its influence.
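
The four steps can be sketched end to end with a toy stand-in for the model. Everything here is an illustrative assumption: `toy_logits` is a hypothetical function that returns hand-written scores over a five-token vocabulary, not a real model call.

```python
import math
import random

VOCAB = [" the", " capital", " is", " Paris", "<eos>"]

def toy_logits(sequence):
    """Hypothetical stand-in for a model: favors the token after the last one."""
    last = sequence[-1] if sequence else None
    nxt = (VOCAB.index(last) + 1) % len(VOCAB) if last in VOCAB else 0
    return [4.0 if i == nxt else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(max_tokens=10, seed=0):
    rng = random.Random(seed)
    sequence = []
    for _ in range(max_tokens):
        probs = softmax(toy_logits(sequence))              # steps 1 and 2
        token = rng.choices(VOCAB, weights=probs, k=1)[0]  # step 3
        if token == "<eos>":                               # stop condition
            break
        sequence.append(token)                             # step 4
    return sequence

print("".join(generate()))
```

Changing `seed` changes which path through the vocabulary gets sampled, which is the same effect you see when an identical prompt returns different completions.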

Part 2: Temperature — The Creativity Dial

What Temperature Does

Temperature modifies the logits before the softmax step. It divides each logit by the temperature value:

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature BEFORE softmax
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    return exp_logits / exp_logits.sum()

The effect is profound:

import numpy as np

logits = np.array([4.7, 2.3, 1.8, 0.1, -2.1])  # Simulated logits
tokens = ["Paris", "Rome", "London", "Tokyo", "Banana"]

temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]

for temp in temperatures:
    probs = softmax_with_temperature(logits, temp)
    formatted = ", ".join(f"{t}:{p:.3f}" for t, p in zip(tokens, probs))
    print(f"T={temp}: {formatted}")

Output:

T=0.1: Paris:1.000, Rome:0.000, London:0.000, Tokyo:0.000, Banana:0.000
T=0.5: Paris:0.989, Rome:0.008, London:0.003, Tokyo:0.000, Banana:0.000
T=1.0: Paris:0.864, Rome:0.078, London:0.048, Tokyo:0.009, Banana:0.001
T=1.5: Paris:0.712, Rome:0.144, London:0.103, Tokyo:0.033, Banana:0.008
T=2.0: Paris:0.599, Rome:0.180, London:0.141, Tokyo:0.060, Banana:0.020

The intuition:

Low temperature (→ 0): Distribution becomes peaky — the highest-probability token dominates. Behavior approaches greedy (deterministic) decoding. Confident, focused, repetitive.
High temperature (→ ∞): Distribution flattens — all tokens become roughly equally likely. Behavior becomes random, creative, incoherent at extremes.
Temperature = 1.0: The model’s learned distribution, unmodified.
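
One practical detail follows from the low-temperature case: dividing by a temperature of exactly 0 is undefined, so a sampler has to treat it as a special case. A minimal sketch, assuming the common convention of falling back to argmax (the name `sample_token` is illustrative, not a library function):

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Return the index of the next token for a given temperature."""
    if temperature <= 0:
        # Greedy decoding: pick the highest logit, no randomness involved.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]  # unnormalized probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [4.7, 2.3, 1.8, 0.1, -2.1]
rng = random.Random(0)
greedy = {sample_token(logits, 0.0, rng) for _ in range(100)}
hot = {sample_token(logits, 2.0, rng) for _ in range(1000)}

print(greedy)    # {0}
print(len(hot))  # more than one distinct index at T=2.0
```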

Temperature as Entropy

If you want the physics framing: temperature in LLMs is borrowed from statistical mechanics. Softmax with temperature has the same form as the Boltzmann distribution, p_i ∝ exp(-E_i / T), with each logit playing the role of a negative energy. Temperature is not itself entropy, but raising it spreads probability across more states, which raises the entropy of the distribution.

Low temperature = low entropy = predictable. High temperature = high entropy = unpredictable.
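
To make the link concrete, the Shannon entropy of the temperature-scaled distribution can be computed directly. This sketch reuses the five example logits from earlier in this part; the helper names are illustrative.

```python
import math

def softmax_t(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [4.7, 2.3, 1.8, 0.1, -2.1]
temps = (0.1, 0.5, 1.0, 1.5, 2.0)
entropies = [entropy_bits(softmax_t(logits, t)) for t in temps]

for t, h in zip(temps, entropies):
    print(f"T={t}: {h:.3f} bits")
```

The entropy rises with every step up in temperature and is bounded above by log2(5) ≈ 2.32 bits, the entropy of a uniform distribution over five tokens.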

Temperature Reference Table

Temperature    Behavior                          Best Use Cases
───────────────────────────────────────────────────────────────────
0.0 – 0.1    Near-deterministic, greedy         Fact retrieval, structured output,
                                                 routing decisions, classification
                                                 
0.2 – 0.4    Focused, minimal variation         Code generation, SQL, function calls,
                                                 math, data extraction, Q&A with facts
                                                 
0.5 – 0.7    Balanced (recommended default)     General chat, summarization,
                                                 analysis, question answering
                                                 
0.7 – 1.0    Creative, some surprise            Blog writing, marketing copy,
                                                 product descriptions, emails
                                                 
1.0 – 1.3    High creativity, occasional drift  Brainstorming, poetry, story starters,
                                                 creative ideation
                                                 
1.5+         Unpredictable, often incoherent    Experimental use only —
                                                 rarely appropriate in production

Code: Temperature in Practice

from openai import AsyncOpenAI
import asyncio

client = AsyncOpenAI()

async def compare_temperatures(prompt: str, temperatures: list[float]):
    """Send the same prompt at different temperatures and compare outputs."""
    
    print(f"Prompt: {prompt}\n{'='*60}")
    
    for temp in temperatures:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
            max_tokens=100,
        )
        print(f"\nTemperature {temp}:")
        print(response.choices[0].message.content)
        print("-" * 40)

async def main():
    # Creative prompt — notice how outputs diverge at higher temps
    await compare_temperatures(
        prompt="Complete this sentence: The old lighthouse keeper noticed something strange in the fog —",
        temperatures=[0.1, 0.7, 1.2],
    )
    
    # Factual prompt — higher temp introduces errors
    await compare_temperatures(
        prompt="What is the capital of Australia?",
        temperatures=[0.0, 0.7, 1.5],
    )

asyncio.run(main())

Key observation: On the factual question, temperature 0.0 reliably answers “Canberra.” At 1.5, you may occasionally see “Sydney” (common misconception) or other errors as lower-probability tokens get sampled.

Part 3: Top-P (Nucleus Sampling)

What Top-P Does

Top-P, also called nucleus sampling, provides an alternative to temperature for controlling randomness. Instead of rescaling the entire distribution, it selects the smallest set of tokens whose cumulative probability reaches or exceeds P, then samples only from that set.

def top_p_sampling(probs: np.ndarray, p: float) -> np.ndarray:
    """
    Zero out probabilities for tokens outside the nucleus.
    
    The nucleus = smallest set of tokens whose cumulative probability >= p
    """
    # Sort tokens by probability (descending)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    
    # Find the nucleus — tokens whose cumulative prob reaches p
    cumulative_probs = np.cumsum(sorted_probs)
    
    # Find the cutoff point
    cutoff_idx = np.searchsorted(cumulative_probs, p) + 1
    
    # Zero out tokens outside the nucleus
    filtered_probs = np.zeros_like(probs)
    nucleus_indices = sorted_indices[:cutoff_idx]
    filtered_probs[nucleus_indices] = probs[nucleus_indices]
    
    # Renormalize
    filtered_probs /= filtered_probs.sum()
    
    return filtered_probs

Walkthrough with numbers:

tokens = ["Paris", "Rome", "London", "Berlin", "Osaka", "Banana", ...]
probs  = [0.55,   0.20,   0.12,    0.06,    0.04,   0.002,  ...]
cumsum = [0.55,   0.75,   0.87,    0.93,    0.97,   0.972,  ...]

# top_p = 0.90
# Cumulative probability first reaches 0.90 at "Berlin" (0.87 at "London", 0.93 at "Berlin")
# Nucleus = {Paris, Rome, London, Berlin}
# "Osaka", "Banana" and everything after: excluded, probability set to 0
# Sample only from the nucleus, renormalized
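
The same walkthrough can be checked in a few lines. This is a minimal sketch over just the six listed tokens (the trailing "..." entries are left out, so these probabilities sum to 0.972 rather than 1.0), and `nucleus` is an illustrative helper name.

```python
def nucleus(tokens, probs, p):
    """Return the smallest prefix (by descending probability) whose mass reaches p."""
    ranked = sorted(zip(tokens, probs), key=lambda tp: tp[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

tokens = ["Paris", "Rome", "London", "Berlin", "Osaka", "Banana"]
probs = [0.55, 0.20, 0.12, 0.06, 0.04, 0.002]

print(nucleus(tokens, probs, 0.90))  # ['Paris', 'Rome', 'London', 'Berlin']
```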

The Key Difference: Top-P Adapts

The critical insight about top-P: the nucleus size adapts dynamically to the model’s confidence.

When the model is very confident (one token has 95% probability), the nucleus is tiny — just 1–2 tokens. The model stays focused.

When the model is uncertain (many tokens each with 5–10% probability), the nucleus is large — perhaps 20–30 tokens. The model is given more freedom.

This is the core advantage over temperature alone: top-P adjusts to the model’s own uncertainty.

Model state          top_p=0.9 behavior
───────────────────────────────────────
Very confident:      Nucleus = 1-2 tokens  → stays focused
Somewhat confident:  Nucleus = 5-10 tokens → some variation
Uncertain:           Nucleus = 15-20 tokens → explores options
Very uncertain:      Nucleus = 30-50 tokens → broad exploration
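
A quick way to see this adaptation is to count the nucleus size for a peaked and a flat distribution under the same top_p. The two distributions below are made-up illustrations over a 64-token vocabulary, not measurements from a real model.

```python
def nucleus_size(probs, p):
    """Number of tokens needed, in descending order, for cumulative mass >= p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

confident = [0.95] + [0.05 / 63] * 63   # one dominant token, 63 unlikely ones
uncertain = [1 / 64] * 64               # 64 equally likely tokens

print(nucleus_size(confident, 0.9))  # 1
print(nucleus_size(uncertain, 0.9))  # 58
```

The top_p setting is identical in both calls; only the shape of the distribution changes the number of candidates.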

Top-P Values Guide

Top-P    Effect                            Use Cases
──────────────────────────────────────────────────────────────────
1.0      No filtering (sample all tokens)  API default; use when temperature is the only control
0.95     Very light filtering              Creative writing, brainstorming
0.9      Standard — recommended default   General purpose, chat
0.85     Moderate focus                   Summarization, analysis  
0.7      Focused                          Factual Q&A, technical docs
0.5      Highly focused                   Structured outputs, classification
<0.3     Very narrow nucleus              Near-deterministic (use temp=0 instead)

Part 4: Top-K Sampling

Top-K is simpler: keep only the top K highest-probability tokens, zero out the rest, and sample from those K tokens.

def top_k_sampling(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep only the top-k probability tokens."""
    top_k_indices = np.argsort(probs)[-k:]  # indices of top k tokens
    
    filtered_probs = np.zeros_like(probs)
    filtered_probs[top_k_indices] = probs[top_k_indices]
    filtered_probs /= filtered_probs.sum()  # renormalize
    
    return filtered_probs

Top-K vs Top-P comparison:

Feature            Top-K                      Top-P
────────────────────────────────────────────────────────────
Nucleus size       Fixed (always K tokens)    Adaptive (varies by confidence)
Vocabulary         Cuts at a fixed count      Cuts at a probability threshold
Behavior when      K might still include      P automatically focuses to
model is           low-prob noise             just the confident tokens
confident
Behavior when      K might miss many          P includes all reasonable
model is           reasonable tokens          tokens dynamically
uncertain
Exposed by         Ollama, LM Studio,         OpenAI, Anthropic, Gemini,
                   Anthropic, Gemini          Ollama (OpenAI has no top-K)

Top-K with Ollama's native API:

import httpx
import asyncio

async def ollama_with_top_k(prompt: str, k: int = 40):
    """Ollama supports top_k — useful for local model inference."""
    
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.2",
                "prompt": prompt,
                "options": {
                    "top_k": k,          # Keep only the K most likely tokens
                    "top_p": 0.9,        # Nucleus filtering on top of top-K
                    "temperature": 0.7,  # Logit scaling (stage order is runtime-specific)
                    "num_predict": 200,
                },
                "stream": False,
            },
            timeout=60.0,
        )
        return response.json()["response"]


async def main():
    response = await ollama_with_top_k(
        "Write a short story about AI",
        k=50
    )
    print(response)

asyncio.run(main())

Part 5: Frequency and Presence Penalties

These parameters come from the OpenAI API and are also accepted by many OpenAI-compatible servers. They modify the logits based on which tokens have already appeared in the generated text.

Frequency Penalty

Reduces probability of tokens in proportion to how often they’ve already appeared.

New logit = Original logit − (frequency_penalty × count_of_token_in_output)

Effect: tokens that appear frequently get their probability reduced more severely.

# Example: if "the" has appeared 5 times so far and frequency_penalty = 0.5:
# "the" logit is reduced by 0.5 × 5 = 2.5
# This significantly reduces the probability of "the" appearing again

frequency_penalty    Effect
──────────────────────────────────────────────────────────
0.0                  No penalty (default)
0.3                  Light reduction of repetition
0.6                  Moderate — good for longer texts
1.0                  Strong — model avoids most repeated tokens
2.0                  Very strong — output becomes highly varied
                     but may sacrifice coherence

Presence Penalty

Reduces probability of tokens that have appeared at all in the output, regardless of how many times.

New logit = Original logit − presence_penalty × (1 if token has appeared else 0)

Effect: any token that’s appeared at all gets a flat penalty. Encourages the model to introduce new topics and vocabulary.
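
Both formulas can be applied in one pass over the logits. A minimal sketch, assuming the additive form given above and a made-up three-token vocabulary; `apply_penalties` is an illustrative name, not an API function.

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract count-scaled and flat penalties from tokens already generated.

    logits: dict mapping token -> raw score
    generated: list of tokens produced so far
    """
    counts = Counter(generated)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        adjusted[token] = (
            logit
            - frequency_penalty * count                   # grows with each repeat
            - presence_penalty * (1 if count > 0 else 0)  # flat, applied once
        )
    return adjusted

logits = {"the": 3.0, "cat": 2.0, "dog": 2.0}
generated = ["the", "cat", "the", "the", "the", "the"]  # "the" x5, "cat" x1

adjusted = apply_penalties(logits, generated, frequency_penalty=0.5, presence_penalty=0.25)
print(adjusted)  # {'the': 0.25, 'cat': 1.25, 'dog': 2.0}
```

"the" loses 2.5 from the frequency term plus 0.25 from the presence term, "cat" loses 0.5 plus 0.25, and the unseen "dog" is untouched.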

response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a blog post about machine learning"}],
    temperature=0.7,
    frequency_penalty=0.3,  # Reduce repeated words
    presence_penalty=0.2,   # Encourage topic diversity
    max_tokens=800,
)

Penalty Decision Guide

Situation                              Recommendation
──────────────────────────────────────────────────────────────────────
Short responses (< 100 tokens)         penalties = 0 (negligible benefit)
Long-form content (blog posts, docs)   frequency_penalty = 0.3–0.5
Repetitive output complaints           frequency_penalty = 0.5–0.7
Output stays on one topic too long     presence_penalty = 0.2–0.4
Structured output / JSON               penalties = 0 (can corrupt structure)
Code generation                        penalties = 0 (syntax must repeat)

Part 6: Repetition Penalty (Open-Source / Ollama)

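Local inference runtimes for open-weight models (Llama, Mistral, Qwen) usually expose a different control, a repetition penalty, which works multiplicatively rather than additively. Ollama and llama.cpp call it repeat_penalty; Hugging Face Transformers calls it repetition_penalty.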

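import asyncio
import httpx

# Ollama / llama.cpp repeat_penalty
# Values > 1.0 penalize repetition; 1.0 = no penalty
async def generate_with_repeat_penalty() -> str:
    async with httpx.AsyncClient() as http_client:
        response = await http_client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.2",
                "prompt": "Write a story about a robot",
                "options": {
                    "temperature": 0.8,
                    "top_p": 0.9,
                    "top_k": 40,
                    "repeat_penalty": 1.1,      # 1.0 = off, 1.1 = light, 1.3 = strong
                    "repeat_last_n": 64,        # How many past tokens to check
                    "num_predict": 300,
                },
                "stream": False,
            },
            timeout=60.0,
        )
        response.raise_for_status()  # Surface HTTP errors instead of a KeyError below
        return response.json()["response"]

print(asyncio.run(generate_with_repeat_penalty()))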

repeat_penalty    Effect
──────────────────────────────────────────
1.0               No penalty (neutral)
1.05 – 1.1        Light — subtle reduction
1.1 – 1.2         Moderate — good default
1.2 – 1.3         Strong — visible variety
1.4+              Very strong — can cause incoherence
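
For contrast with the additive penalties in Part 5, here is a sketch of the multiplicative form. It assumes the widely used convention in which positive logits are divided by the penalty and negative logits are multiplied by it, so a repeated token always becomes less likely. Runtimes differ in details such as the lookback window, so treat this as an illustration rather than the exact Ollama implementation; `apply_repeat_penalty` is a hypothetical helper name.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    """Scale down the logits of tokens seen in the recent window.

    logits: dict mapping token -> raw score
    recent_tokens: tokens inside the lookback window (compare repeat_last_n)
    """
    seen = set(recent_tokens)
    adjusted = {}
    for token, logit in logits.items():
        if token not in seen:
            adjusted[token] = logit            # unseen tokens are untouched
        elif logit > 0:
            adjusted[token] = logit / penalty  # shrink positive scores
        else:
            adjusted[token] = logit * penalty  # push negative scores further down
    return adjusted

logits = {"robot": 2.2, "the": -1.0, "garden": 0.5}
adjusted = apply_repeat_penalty(logits, recent_tokens=["robot", "the"], penalty=1.1)
print(adjusted)
```

Unlike the frequency penalty, this form does not grow with the number of repeats: a token seen once and a token seen ten times within the window receive the same scaling.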

Part 7: How Parameters Interact

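The parameters don’t operate independently: each one transforms the logits or the probability distribution in sequence. The exact order is implementation-specific (Hugging Face Transformers and vLLM, for example, apply repetition-style penalties before temperature scaling), but a representative pipeline looks like this: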

Raw logits
    │
    ▼
Apply temperature (divide logits by T)
    │
    ▼
Apply frequency/presence penalties (subtract from logits)
    │
    ▼
Convert to probabilities via softmax
    │
    ▼
Apply Top-K filtering (zero out all but top K)
    │
    ▼
Apply Top-P filtering (zero out below nucleus threshold)
    │
    ▼
Renormalize probabilities
    │
    ▼
Sample one token
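
The stages can be chained into one function. This is a sketch that follows the order drawn above, written in pure Python over a short list of logits; production samplers work on tensors, and `sample_next` and its argument names are illustrative assumptions.

```python
import math
import random

def sample_next(logits, counts, temperature=1.0, frequency_penalty=0.0,
                presence_penalty=0.0, top_k=0, top_p=1.0, rng=random):
    """Return the index of the sampled token. Requires temperature > 0."""
    # 1. Temperature scaling
    scores = [x / temperature for x in logits]
    # 2. Frequency and presence penalties, based on per-token counts so far
    scores = [s - frequency_penalty * c - presence_penalty * (c > 0)
              for s, c in zip(scores, counts)]
    # 3. Softmax
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-K: keep the k most probable indices (0 disables the filter)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # 5. Top-P: keep the smallest prefix whose mass reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # 6. Renormalization happens implicitly through the weights, then sample
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [4.7, 2.3, 1.8, 0.1, -2.1]
rng = random.Random(0)
picks = {sample_next(logits, [0] * 5, temperature=1.5, top_k=3, top_p=0.9, rng=rng)
         for _ in range(500)}
print(picks)  # a subset of {0, 1, 2}: top_k=3 rules out indices 3 and 4
```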

Critical interaction: Temperature + Top-P

OpenAI recommends: change either temperature OR top_p, not both simultaneously. Modifying both creates compounding effects that are hard to reason about.

# ✅ Correct — change one
{"temperature": 0.3, "top_p": 1.0}    # Cool temperature, no top-p filtering
{"temperature": 1.0, "top_p": 0.8}    # Default temperature, nucleus filtering

# ⚠️ Technically works, but harder to reason about
{"temperature": 0.7, "top_p": 0.9}    # Both active — interactions compound

# In practice, most production teams use temperature alone
# and only add top_p when they need adaptive nucleus behavior

Part 8: Production Parameter Presets

Here’s a tested set of parameter presets covering the most common production use cases. Save this as a reference:

# parameter_presets.py
from dataclasses import dataclass
from typing import Optional

@dataclass
class SamplingParams:
    temperature: float
    top_p: float
    frequency_penalty: float
    presence_penalty: float
    max_tokens: Optional[int] = None
    description: str = ""


PRESETS = {
    
    # ── Deterministic / Structured ──────────────────────────────────────────────
    "deterministic": SamplingParams(
        temperature=0.0,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        description="Greedy decoding: the same input gives the same output in "
                    "almost all cases. Use for: classification, routing, "
                    "structured JSON, math.",
    ),
    
    "extraction": SamplingParams(
        temperature=0.1,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        description="Near-deterministic. Use for: data extraction, NER, "
                    "schema filling, fact retrieval.",
    ),
    
    # ── Code ────────────────────────────────────────────────────────────────────
    "code": SamplingParams(
        temperature=0.2,
        top_p=0.95,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        description="Low temperature keeps output close to the most likely "
                    "syntax. A light top-p trims the unlikely tail. Use for: "
                    "code generation, SQL, function calls.",
    ),
    
    # ── Balanced / General Purpose ───────────────────────────────────────────────
    "default": SamplingParams(
        temperature=0.7,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        description="General purpose. Good starting point for most tasks.",
    ),
    
    "chat": SamplingParams(
        temperature=0.7,
        top_p=0.95,
        frequency_penalty=0.1,
        presence_penalty=0.0,
        description="Conversational assistant. Light frequency penalty prevents "
                    "repetitive phrasing in longer conversations.",
    ),
    
    # ── Analysis / Writing ───────────────────────────────────────────────────────
    "analysis": SamplingParams(
        temperature=0.5,
        top_p=0.9,
        frequency_penalty=0.2,
        presence_penalty=0.0,
        description="Analytical tasks: summarization, document analysis, "
                    "report generation. Focused but not robotic.",
    ),
    
    "writing": SamplingParams(
        temperature=0.8,
        top_p=0.95,
        frequency_penalty=0.4,
        presence_penalty=0.1,
        description="Long-form writing: blog posts, articles, marketing copy. "
                    "Frequency penalty prevents word repetition over long outputs.",
    ),
    
    # ── Creative ────────────────────────────────────────────────────────────────
    "creative": SamplingParams(
        temperature=1.0,
        top_p=0.95,
        frequency_penalty=0.5,
        presence_penalty=0.2,
        description="Creative writing: stories, poetry, brainstorming. "
                    "High temperature + penalties encourage diverse vocabulary.",
    ),
    
    "brainstorm": SamplingParams(
        temperature=1.1,
        top_p=0.98,
        frequency_penalty=0.5,
        presence_penalty=0.4,
        description="Ideation: generating diverse options, exploring possibilities. "
                    "High presence penalty pushes toward covering new angles.",
    ),
}


def apply_preset(preset_name: str, **overrides) -> dict:
    """
    Get parameter dict for a preset, with optional overrides.
    
    Usage:
        params = apply_preset("code", max_tokens=500)
        response = await client.chat.completions.create(**params, ...)
    """
    if preset_name not in PRESETS:
        raise ValueError(f"Unknown preset: {preset_name}. "
                         f"Available: {list(PRESETS.keys())}")
    
    preset = PRESETS[preset_name]
    result = {
        "temperature": preset.temperature,
        "top_p": preset.top_p,
        "frequency_penalty": preset.frequency_penalty,
        "presence_penalty": preset.presence_penalty,
    }
    
    if preset.max_tokens is not None:
        result["max_tokens"] = preset.max_tokens
    
    result.update(overrides)
    return result


# ── Usage examples ─────────────────────────────────────────────────────────────
async def example_usage():
    from openai import AsyncOpenAI
    client = AsyncOpenAI()
    
    # Code generation — low temperature, no penalties
    code_params = apply_preset("code", max_tokens=500)
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Write a Python function to parse a JWT token"
        }],
        **code_params,
    )
    
    # Creative story — high temperature, strong penalties
    creative_params = apply_preset("creative", max_tokens=600)
    story = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Write the opening paragraph of a noir detective story"
        }],
        **creative_params,
    )

Part 9: Parameters Across Providers

Each provider exposes these controls differently. Here’s the unified reference:

# ── OpenAI ──────────────────────────────────────────────────────────────────────
await openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    temperature=0.7,           # 0.0 – 2.0
    top_p=0.95,                # 0.0 – 1.0
    frequency_penalty=0.3,     # -2.0 – 2.0
    presence_penalty=0.1,      # -2.0 – 2.0
    max_tokens=500,
    n=1,                       # Number of completions to generate
    seed=42,                   # For reproducibility (not strict guarantee)
)

# ── Anthropic ───────────────────────────────────────────────────────────────────
await anthropic_client.messages.create(
    model="claude-sonnet-4-20250514",
    messages=[...],
    temperature=0.7,           # 0.0 – 1.0 (note: max is 1.0, not 2.0)
    top_p=0.95,                # 0.0 – 1.0
    top_k=40,                  # Top-K (Anthropic supports this too)
    max_tokens=500,
    # Note: No frequency/presence penalties in Anthropic API
)

# ── Gemini ───────────────────────────────────────────────────────────────────────
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    generation_config=genai.GenerationConfig(
        temperature=0.7,           # 0.0 – 2.0
        top_p=0.95,                # 0.0 – 1.0
        top_k=40,                  # Integer > 0
        max_output_tokens=500,
        candidate_count=1,         # Number of candidate completions
        stop_sequences=["END"],    # Custom stop strings
    )
)

# ── Ollama (local models) ─────────────────────────────────────────────────────────
{
    "model": "llama3.2",
    "options": {
        "temperature": 0.7,        # 0.0 – 2.0
        "top_p": 0.9,              # 0.0 – 1.0
        "top_k": 40,               # Positive integer
        "repeat_penalty": 1.1,     # > 1.0 penalizes repetition
        "repeat_last_n": 64,       # Context window for repeat penalty
        "num_predict": 500,        # max_tokens equivalent
        "seed": -1,                # -1 = random, positive = reproducible
        "num_ctx": 4096,           # Context window size
        "mirostat": 0,             # Advanced: Mirostat sampling (0=off, 1, 2)
        "mirostat_tau": 5.0,       # Target entropy (Mirostat only)
        "mirostat_eta": 0.1,       # Learning rate (Mirostat only)
    }
}

Provider Quirks Worth Knowing

Provider      Temperature max    Penalties            Top-K     Notes
──────────────────────────────────────────────────────────────────────────────────────
OpenAI        2.0                freq + presence ✓    No        seed for best-effort reproducibility
Anthropic     1.0                None                 Yes       Narrower temperature range than OpenAI
Gemini        2.0                Model-dependent      Yes       candidate_count returns multiple samples
Ollama        2.0                repeat_penalty ✓     Yes       Mirostat available

Part 10: Determinism and Reproducibility

The Seed Parameter

OpenAI provides a seed parameter for reproducibility, but it’s not a hard guarantee:

# Same seed → usually same output (but not guaranteed across model updates)
response1 = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Pick a random number from 1-10"}],
    temperature=0.7,
    seed=42,
)

response2 = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Pick a random number from 1-10"}],
    temperature=0.7,
    seed=42,
)

# Check system_fingerprint: if it differs, the backend configuration changed between calls
print(response1.system_fingerprint)  # fp_a2c3b1...
print(response2.system_fingerprint)  # fp_a2c3b1... (same = same model)
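
What a seed buys you is easiest to see locally: a seeded pseudo-random generator makes the same draws every time, so sampling from a fixed distribution is repeatable. This sketch uses Python's `random.Random` as a stand-in for the sampler's random source, with made-up tokens and weights. Hosted APIs have other sources of variation, which is why the API-level seed is best effort only.

```python
import random

tokens = ["3", "7", "5", "1", "9"]
weights = [0.5, 0.2, 0.2, 0.05, 0.05]  # illustrative next-token probabilities

def sample_sequence(seed, n=8):
    """Draw n tokens using a generator initialized with the given seed."""
    rng = random.Random(seed)
    return rng.choices(tokens, weights=weights, k=n)

run_a = sample_sequence(seed=42)
run_b = sample_sequence(seed=42)
print(run_a == run_b)  # True: identical seed, identical draws
```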

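Closest to Determinism: Temperature = 0

If you need reproducibility (evaluation pipelines, unit tests, regression testing), use temperature=0. This removes sampling randomness, although hosted APIs can still show occasional run-to-run differences, so compare outputs with some tolerance rather than assuming byte-identical results.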

# For evaluation and testing — ALWAYS use temperature=0
async def evaluate_model(test_cases: list[dict]) -> list[dict]:
    results = []
    for test in test_cases:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": test["input"]}],
            temperature=0,      # Greedy decoding, critical for fair evaluation
            top_p=1.0,
            seed=0,             # Belt and suspenders
        )
        results.append({
            "input": test["input"],
            "expected": test["expected"],
            "actual": response.choices[0].message.content,
        })
    return results

🔍 Common Mistakes

1. Using temperature=1.0 for factual questions

Temperature 1.0 is the model’s raw, unmodified distribution — built for language modeling, not factual accuracy. For Q&A, use 0.1–0.3. A common symptom: correct answers most of the time, wrong answers occasionally in production.

2. Using temperature=0 for creative tasks

At temperature 0, the model produces the single most probable completion. This works great for facts, but creative writing becomes robotic and formulaic. Users notice.

3. Setting both frequency and presence penalties high

High frequency_penalty + high presence_penalty together create outputs that desperately avoid any familiar phrasing. The result sounds alien and stilted. Use one at a time, at moderate values.

4. Not adjusting temperature for structured output (JSON)

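When you need JSON output, use temperature=0 and the response_format={"type": "json_object"} parameter, and mention JSON explicitly in the prompt, because the API rejects JSON-mode requests whose messages never mention it. Randomness in structured output makes keys and field values less consistent from run to run, which breaks downstream parsing.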

# ✅ Correct — deterministic for structured output
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract name and age as JSON from: John is 32"}],
    temperature=0,                              # Critical
    response_format={"type": "json_object"},   # Ensure valid JSON
)

5. Ignoring output length when applying penalties

frequency_penalty and presence_penalty apply to the output so far during generation. For very short outputs (1-3 sentences), penalties have almost no visible effect. Don’t apply penalties to short Q&A responses expecting any difference.

6. Mixing temperature and top_p both to non-default values

This creates compound filtering effects that are hard to interpret. Pick one primary control. OpenAI’s own documentation says: “We generally recommend altering this or temperature but not both.”

💼 Quick Questions

Q: What’s the difference between temperature and top-P at a mechanistic level?

Temperature scales all logits before softmax, uniformly adjusting the sharpness of the entire distribution. Top-P filters the already-converted probability distribution, removing low-probability tokens entirely. Temperature changes the shape of the distribution; top-P changes its support (which tokens are possible at all). They operate on different stages of the sampling pipeline.

Q: Why would you use top-P instead of just using a lower temperature?

Top-P adapts to the model’s confidence dynamically. When the model is very confident (one token has 90% probability), top-P=0.9 samples from that single token. When the model is uncertain (50 tokens each at 2%), top-P=0.9 keeps about 45 of them. A fixed temperature can’t adapt this way: it applies the same amount of sharpening regardless of how confident the model already is.

Q: How would you configure parameters for an AI system that classifies customer support tickets into categories?

temperature=0, top_p=1.0, no penalties. Classification is deterministic by nature — you want the model’s highest-confidence prediction every time. Any randomness introduces inconsistency across repeated runs on identical tickets. Also add response_format={"type": "json_object"} and specify the exact category names in the prompt.

Q: A client complains that their LLM-generated product descriptions are repetitive. What do you investigate?

First check frequency_penalty: it is likely 0.0 (the default), so try 0.3–0.5. Also check whether the prompt template over-constrains the model (tight word-by-word instructions force repetition regardless of penalties). Check output length too: penalties act on tokens already generated, so they have little effect on very short descriptions. If using Claude, note there are no frequency penalties; instead craft the prompt to explicitly ask for varied vocabulary.

🏭 Production Considerations

Evaluation first, parameters second. Before tuning parameters in production, build an offline evaluation dataset with 50–100 representative examples and a scoring rubric. Tune parameters against that dataset, not live traffic. Parameter changes that feel like improvements in casual testing can silently degrade quality for edge cases.

Temperature drift in fine-tuned models. If you fine-tune a model, the learned distribution changes. A temperature that worked well for the base model may produce different behavior on the fine-tuned version. Always re-evaluate parameters after fine-tuning.

Separate parameter profiles per use case. Large production applications have multiple AI-powered features: chat, summarization, code generation, classification. Each needs its own parameter preset. Maintain a central params_config.yaml and load it at startup — never hardcode parameters scattered across the codebase.
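A central preset registry could be sketched as follows. The preset values are illustrative only, and `RAW_CONFIG` stands in for whatever a YAML parser would return after reading params_config.yaml at startup. The names `SamplingPreset`, `load_presets`, and `params_for` are hypothetical.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SamplingPreset:
    temperature: float
    top_p: float = 1.0
    max_tokens: int = 1024

# Assumed to be the parsed contents of params_config.yaml (illustrative values).
RAW_CONFIG = {
    "classification": {"temperature": 0.0, "max_tokens": 50},
    "summarization": {"temperature": 0.3, "max_tokens": 500},
    "chat": {"temperature": 0.7, "top_p": 0.95},
}

def load_presets(raw):
    """Turn the parsed config into validated, immutable presets."""
    return {name: SamplingPreset(**values) for name, values in raw.items()}

PRESETS = load_presets(RAW_CONFIG)

def params_for(use_case):
    """Look up the preset for a feature; fail loudly on unknown use cases."""
    if use_case not in PRESETS:
        raise KeyError(f"No sampling preset for use case {use_case!r}")
    return asdict(PRESETS[use_case])

print(params_for("classification"))
```

Because the dataclass rejects unknown field names, a typo in the config file fails at startup rather than silently falling back to provider defaults.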

Log parameters with every request. When debugging quality issues, you need to know exactly what parameters produced the output. Log model, temperature, top_p, max_tokens, and any penalties alongside the input/output in your observability pipeline.
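A structured log record along these lines is one possible shape. The field names and the `log_llm_call` helper are assumptions for illustration, and a real pipeline would likely redact or sample prompts before logging them.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm.requests")

def log_llm_call(params, prompt, output, finish_reason):
    """Emit one JSON log line that ties the output to the exact parameters used."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": params.get("model"),
        "temperature": params.get("temperature"),
        "top_p": params.get("top_p"),
        "max_tokens": params.get("max_tokens"),
        "frequency_penalty": params.get("frequency_penalty"),
        "presence_penalty": params.get("presence_penalty"),
        "finish_reason": finish_reason,
        "prompt": prompt,
        "output": output,
    }
    logger.info(json.dumps(record))
    return record

record = log_llm_call(
    {"model": "example-model", "temperature": 0.2, "max_tokens": 300},
    prompt="Summarize this ticket.",
    output="Customer reports a duplicate charge.",
    finish_reason="stop",
)
print(record["temperature"], record["finish_reason"])
```

Parameters that were not set explicitly are logged as null, which makes it visible later that the provider default was in effect.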

Watch for finish_reason: length. If a response ends with finish_reason=length, the model was cut off by max_tokens and the output is incomplete. Check this field on every response and raise max_tokens or retry, because truncated text (especially truncated JSON) is hard to detect from the content alone.
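A small guard like the one below can make truncation fail loudly. It assumes the response has already been converted to a plain dict shaped like an OpenAI chat completion (choices, finish_reason, message.content). `TruncatedOutputError` and `extract_text` are hypothetical names.

```python
class TruncatedOutputError(RuntimeError):
    """Raised when the model stopped because it hit max_tokens."""

def extract_text(response):
    """Return the first choice's text, refusing output that was cut off."""
    choice = response["choices"][0]
    if choice["finish_reason"] == "length":
        raise TruncatedOutputError("Output hit max_tokens; raise the limit or retry.")
    return choice["message"]["content"]

complete = {"choices": [{"finish_reason": "stop", "message": {"content": "All done."}}]}
cut_off = {"choices": [{"finish_reason": "length", "message": {"content": '{"items": ["a", '}}]}

print(extract_text(complete))
try:
    extract_text(cut_off)
except TruncatedOutputError as err:
    print("truncated:", err)
```

Raising an exception rather than returning the partial text forces the caller to decide explicitly between retrying with a higher limit and surfacing the failure.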

🔑 Key Takeaways

LLMs generate text by sampling from a probability distribution — understanding this makes all other parameters intuitive
Temperature controls distribution sharpness — lower = more confident/deterministic, higher = more creative/random
Top-P adapts the nucleus dynamically to the model’s own confidence level
Top-K provides fixed-size filtering — simpler but less adaptive than top-P
Frequency penalty reduces repeated tokens proportionally to how often they’ve appeared
Presence penalty encourages topic diversity by penalizing any token that’s appeared at all
Use temperature=0 for evaluation, classification, and structured output — never introduce randomness where you need consistency
Don’t tune both temperature and top-P simultaneously — pick one primary control
Different use cases need different presets — code, creative writing, factual Q&A, and structured output each have optimal settings
Provider APIs differ in their ranges and supported parameters — Anthropic’s max temperature is 1.0, not 2.0

📚 Further Reading

OpenAI API Reference — Chat Completions — canonical parameter documentation
The Nucleus Sampling Paper (2020) — Holtzman et al., original top-P research
Hugging Face — How to Generate — excellent visual explanation of sampling strategies
Temperature and Creativity in LLMs — empirical study of temperature effects on factuality vs creativity
