
Day 9: LLM Inference Fundamentals — Temperature, Top-P, and Sampling

“The difference between a poet and a fact-checker isn’t intelligence — it’s entropy. Temperature is how you dial between the two.”

Why This Matters

You’ve learned to call the API. You’ve sent prompts. You’ve gotten responses. But most developers use the default parameters and wonder why the output is sometimes too random, sometimes too repetitive, sometimes too robotic.

Sampling parameters are the dials on your mixing board. Get them wrong and your AI application produces outputs that feel wrong to users — even if they can’t articulate why. Get them right and the output feels exactly tailored to the task.

These parameters also directly affect:

  • Output quality: wrong temperature → incoherent or boring responses
  • Reliability: high randomness in production → unpredictable behavior
  • Cost: verbose outputs cost more tokens
  • User trust: repetitive or hallucinated outputs erode confidence

Understanding what happens inside the model during generation is the difference between guessing at parameters and making informed engineering decisions.

Part 1: How LLMs Actually Generate Text

The Sampling Loop

Before you can understand temperature, you need to understand what the model is actually doing when it generates a response. Here’s the process at each step:

Input tokens → Transformer layers → Logits vector → Softmax → Probability distribution → Sample → Next token → Repeat

Let’s break each step down:

Step 1: The model produces logits

After processing your input through all transformer layers, the model outputs a vector of raw scores called logits — one score per token in the vocabulary. GPT-4 has a vocabulary of ~100,000 tokens. So for each generation step, the model produces 100,000 numbers.

# Conceptually (you don't call this directly in production):
# logits shape: [vocab_size] e.g., [100,277]

logits = [
    2.3,   # token " the"
    1.8,   # token " a"
    4.7,   # token " Paris"
    0.1,   # token " London"
    -2.1,  # token " banana"
    # ... 100,272 more scores
]

Step 2: Softmax converts logits to probabilities

The softmax function converts the raw scores into a valid probability distribution that sums to 1.0:

import numpy as np

def softmax(logits):
    # Subtract the max logit first for numerical stability
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

# Input logits
logits = np.array([2.3, 1.8, 4.7, 0.1, -2.1])

# Output probabilities
probs = softmax(logits)
# [0.078, 0.048, 0.864, 0.009, 0.001]
# " Paris" has 86.4% probability, so the model is quite confident

Step 3: Sample from the distribution

The model samples one token from this probability distribution. This is not argmax (always picking the highest probability token) — it’s random sampling weighted by probability. This is why running the same prompt twice can give different outputs.

import random

tokens = [" the", " a", " Paris", " London", " banana"]
probs = [0.078, 0.048, 0.864, 0.009, 0.001]

# Weighted random sample
next_token = random.choices(tokens, weights=probs, k=1)[0]
# " Paris" most of the time, but occasionally " the" or " a"

Step 4: Append and repeat

The sampled token is appended to the sequence, and the entire process repeats until the model generates an end-of-sequence token or hits max_tokens.

This sampling loop is where every parameter you’re about to learn exerts its influence.
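The four steps can be put together in a toy loop. This is a minimal sketch, not how a real model works internally: the five-token vocabulary and the `TOY_LOGITS` lookup table are invented for illustration, and the "model" here only looks at the last token, whereas a real transformer conditions on the whole sequence.

```python
import math
import random

# Hypothetical toy vocabulary and "model": logits depend only on the last token.
VOCAB = ["The", " capital", " is", " Paris", "<eos>"]

TOY_LOGITS = {
    None:       [5.0, 0.0, 0.0, 0.0, -5.0],    # start of sequence
    "The":      [-5.0, 5.0, 0.0, 0.0, -5.0],
    " capital": [-5.0, -5.0, 5.0, 0.0, -5.0],
    " is":      [-5.0, -5.0, -5.0, 5.0, 0.0],
    " Paris":   [-5.0, -5.0, -5.0, -5.0, 5.0],
}


def softmax(logits):
    """Step 2: turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(max_tokens=10, seed=0):
    """Run the loop: logits -> softmax -> sample -> append -> repeat."""
    rng = random.Random(seed)
    output = []
    last = None
    for _ in range(max_tokens):
        probs = softmax(TOY_LOGITS[last])                    # steps 1 and 2
        token = rng.choices(VOCAB, weights=probs, k=1)[0]    # step 3
        if token == "<eos>":                                 # stop condition
            break
        output.append(token)                                 # step 4
        last = token
    return output


print("".join(generate()))
```

Changing `seed` shows why the same prompt can produce different outputs: the probabilities are identical on every run, only the random draw differs.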

Part 2: Temperature — The Creativity Dial

What Temperature Does

Temperature modifies the logits before the softmax step. It divides each logit by the temperature value:

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature BEFORE softmax
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    return exp_logits / exp_logits.sum()

The effect is profound:

import numpy as np

logits = np.array([4.7, 2.3, 1.8, 0.1, -2.1])  # Simulated logits
tokens = ["Paris", "Rome", "London", "Tokyo", "Banana"]

temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]

for temp in temperatures:
    probs = softmax_with_temperature(logits, temp)
    formatted = ", ".join(f"{t}:{p:.3f}" for t, p in zip(tokens, probs))
    print(f"T={temp}: {formatted}")

Output:

T=0.1: Paris:1.000, Rome:0.000, London:0.000, Tokyo:0.000, Banana:0.000
T=0.5: Paris:0.989, Rome:0.008, London:0.003, Tokyo:0.000, Banana:0.000
T=1.0: Paris:0.864, Rome:0.078, London:0.048, Tokyo:0.009, Banana:0.001
T=1.5: Paris:0.712, Rome:0.144, London:0.103, Tokyo:0.033, Banana:0.008
T=2.0: Paris:0.599, Rome:0.180, London:0.141, Tokyo:0.060, Banana:0.020

The intuition:

  • Low temperature (→ 0): Distribution becomes peaky — the highest-probability token dominates. Behavior approaches greedy (deterministic) decoding. Confident, focused, repetitive.
  • High temperature (→ ∞): Distribution flattens — all tokens become roughly equally likely. Behavior becomes random, creative, incoherent at extremes.
  • Temperature = 1.0: The model’s learned distribution, unmodified.

Temperature as Entropy

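If you want the physics framing: temperature in LLMs is borrowed from statistical mechanics. Softmax with temperature has the same form as the Boltzmann distribution, where the probability of a state is proportional to exp(−E / T) and the logits play the role of negative energies. At low temperature a physical system settles into its lowest-energy states; at high temperature it spreads across many states, so its entropy (disorder) rises. Temperature is not entropy itself, but raising it increases the entropy of the distribution you sample from.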

Low entropy = predictable = low temperature. High entropy = unpredictable = high temperature.
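The relationship between temperature and entropy can be checked numerically. The sketch below reuses the simulated logits from the temperature example and computes the Shannon entropy (in bits) of the resulting distribution at each temperature. The helper names are illustrative, and it uses plain Python rather than NumPy so it runs without dependencies.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability tokens contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


logits = [4.7, 2.3, 1.8, 0.1, -2.1]  # same simulated logits as above

for t in (0.1, 0.5, 1.0, 1.5, 2.0):
    h = entropy_bits(softmax_with_temperature(logits, t))
    print(f"T={t}: entropy = {h:.3f} bits")
```

Entropy rises with every step up in temperature and is bounded above by log2(5) ≈ 2.32 bits, the entropy of a uniform distribution over five tokens.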

Temperature Reference Table

Temperature    Behavior                            Best Use Cases
─────────────────────────────────────────────────────────────────────────────────
0.0 – 0.1      Near-deterministic, greedy          Fact retrieval, structured output,
                                                   routing decisions, classification

0.2 – 0.4      Focused, minimal variation          Code generation, SQL, function calls,
                                                   math, data extraction, Q&A with facts

0.5 – 0.7      Balanced (recommended default)      General chat, summarization,
                                                   analysis, question answering

0.7 – 1.0      Creative, some surprise             Blog writing, marketing copy,
                                                   product descriptions, emails

1.0 – 1.3      High creativity, occasional drift   Brainstorming, poetry, story starters,
                                                   creative ideation

1.5+           Unpredictable, often incoherent     Experimental use only —
                                                   rarely appropriate in production

Code: Temperature in Practice

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def compare_temperatures(prompt: str, temperatures: list[float]):
    """Send the same prompt at different temperatures and compare outputs."""

    print(f"Prompt: {prompt}\n{'='*60}")

    for temp in temperatures:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
            max_tokens=100,
        )
        print(f"\nTemperature {temp}:")
        print(response.choices[0].message.content)
        print("-" * 40)

async def main():
    # Creative prompt — notice how outputs diverge at higher temps
    await compare_temperatures(
        prompt="Complete this sentence: The old lighthouse keeper noticed something strange in the fog —",
        temperatures=[0.1, 0.7, 1.2],
    )

    # Factual prompt — higher temp introduces errors
    await compare_temperatures(
        prompt="What is the capital of Australia?",
        temperatures=[0.0, 0.7, 1.5],
    )

asyncio.run(main())

Key observation: On the factual question, temperature 0.0 reliably answers “Canberra.” At 1.5, you may occasionally see “Sydney” (common misconception) or other errors as lower-probability tokens get sampled.

Part 3: Top-P (Nucleus Sampling)

What Top-P Does

Top-P, also called nucleus sampling, provides an alternative to temperature for controlling randomness. Instead of scaling the entire distribution, it dynamically selects the smallest set of tokens whose cumulative probability reaches P, then samples only from that set.

def top_p_sampling(probs: np.ndarray, p: float) -> np.ndarray:
    """
    Zero out probabilities for tokens outside the nucleus.

    The nucleus = smallest set of tokens whose cumulative probability >= p
    """
    # Sort tokens by probability (descending)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]

    # Find the nucleus — tokens whose cumulative prob reaches p
    cumulative_probs = np.cumsum(sorted_probs)

    # Find the cutoff point: index of the first cumulative value >= p, plus one
    cutoff_idx = np.searchsorted(cumulative_probs, p) + 1

    # Zero out tokens outside the nucleus
    filtered_probs = np.zeros_like(probs)
    nucleus_indices = sorted_indices[:cutoff_idx]
    filtered_probs[nucleus_indices] = probs[nucleus_indices]

    # Renormalize
    filtered_probs /= filtered_probs.sum()

    return filtered_probs

Walkthrough with numbers:

tokens = ["Paris", "Rome", "London", "Berlin", "Osaka", "Banana", ...]
probs  = [0.55,    0.20,   0.12,     0.06,     0.04,    0.002,    ...]
cumsum = [0.55,    0.75,   0.87,     0.93,     0.97,    0.972,    ...]

# top_p = 0.90
# After "London" the cumulative probability is 0.87, still below 0.90
# Adding "Berlin" brings it to 0.93 ≥ 0.90, so nucleus = {Paris, Rome, London, Berlin}
# "Osaka", "Banana" and everything after: excluded, probability set to 0
# Sample only from the nucleus, renormalized

The Key Difference: Top-P Adapts

The critical insight about top-P: the nucleus size adapts dynamically to the model’s confidence.

When the model is very confident (one token has 95% probability), the nucleus is tiny — just 1–2 tokens. The model stays focused.

When the model is uncertain (many tokens each with 5–10% probability), the nucleus is much larger: at top_p=0.9 it takes roughly 10–20 such tokens to accumulate 90% of the probability mass. The model is given more freedom.

This is the core advantage over temperature alone: top-P adjusts to the model’s own uncertainty.

Model state            top_p=0.9 behavior
──────────────────────────────────────────────────────────────
Very confident         Nucleus = 1-2 tokens   → stays focused
Somewhat confident     Nucleus = 5-10 tokens  → some variation
Uncertain              Nucleus = 15-20 tokens → explores options
Very uncertain         Nucleus = 30-50 tokens → broad exploration
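The adaptive behavior is easy to verify by counting how many tokens end up in the nucleus for two hand-made distributions. This is a small sketch with invented probabilities; `nucleus_size` is an illustrative helper, not part of any library.

```python
def nucleus_size(probs, p):
    """Count the tokens in the smallest set whose cumulative probability reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)  # p was never reached (e.g. rounding), keep everything


confident = [0.95, 0.02, 0.01, 0.01, 0.01]  # one dominant token
uncertain = [0.04] * 25                     # 25 equally likely tokens

print(nucleus_size(confident, 0.9))  # 1
print(nucleus_size(uncertain, 0.9))  # 23
```

The same top_p=0.9 keeps a single token in the first case and 23 in the second, without any change to the parameter.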

Top-P Values Guide

Top-P    Effect                              Use Cases
──────────────────────────────────────────────────────────────────────────────────────
1.0      No filtering (sample all tokens)    API default; use when temperature is your only control
0.95     Very light filtering                Creative writing, brainstorming
0.9      Standard — recommended default      General purpose, chat
0.85     Moderate focus                      Summarization, analysis
0.7      Focused                             Factual Q&A, technical docs
0.5      Highly focused                      Structured outputs, classification
<0.3     Very narrow nucleus                 Near-deterministic (use temp=0 instead)

Part 4: Top-K Sampling

Top-K is simpler: keep only the top K highest-probability tokens, zero out the rest, and sample from those K tokens.

def top_k_sampling(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep only the top-k probability tokens (k must be >= 1)."""
    top_k_indices = np.argsort(probs)[-k:]  # indices of top k tokens

    filtered_probs = np.zeros_like(probs)
    filtered_probs[top_k_indices] = probs[top_k_indices]
    filtered_probs /= filtered_probs.sum()  # renormalize

    return filtered_probs

Top-K vs Top-P comparison:

Feature               Top-K                        Top-P
──────────────────────────────────────────────────────────────────────────────────────
Nucleus size          Fixed (always K tokens)      Adaptive (varies by confidence)

Vocabulary            Cuts at a fixed count        Cuts at a probability threshold

Behavior when         K might still include        P automatically focuses to
model is confident    low-prob noise               just the confident tokens

Behavior when         K might miss many            P includes all reasonable
model is uncertain    reasonable tokens            tokens dynamically

Used in               Older generation, Ollama,    GPT-4, Claude, Gemini (default),
                      LM Studio, local models      modern production APIs
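The two "behavior" rows of the comparison can be reproduced with a few lines of code. The sketch below applies a fixed K=5 and a top_p of 0.9 to one confident and one flat distribution; the probabilities and helper names are made up for illustration.

```python
def top_k_keep(probs, k):
    """Probabilities that survive a top-k filter (before renormalizing)."""
    return sorted(probs, reverse=True)[:k]


def top_p_keep(probs, p):
    """Probabilities that survive a top-p filter (before renormalizing)."""
    kept, cumulative = [], 0.0
    for prob in sorted(probs, reverse=True):
        kept.append(prob)
        cumulative += prob
        if cumulative >= p:
            break
    return kept


confident = [0.95] + [0.005] * 10   # one clear winner plus low-probability noise
flat = [0.04] * 25                  # 25 equally reasonable tokens

# Confident model: top-k still lets four noise tokens through, top-p keeps one.
print(len(top_k_keep(confident, 5)), len(top_p_keep(confident, 0.9)))  # 5 1

# Uncertain model: top-k discards 20 reasonable tokens, top-p keeps 23.
print(len(top_k_keep(flat, 5)), len(top_p_keep(flat, 0.9)))            # 5 23
```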

Top-K with Ollama’s native API:

import asyncio

import httpx

async def ollama_with_top_k(prompt: str, k: int = 40) -> str:
    """Ollama supports top_k — useful for local model inference."""

    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.2",
                "prompt": prompt,
                "options": {
                    "top_k": k,          # Keep top-K tokens
                    "top_p": 0.9,        # Nucleus filtering
                    "temperature": 0.7,  # Rescale logits (sampler order is backend-specific)
                    "num_predict": 200,
                },
                "stream": False,
            },
            timeout=60.0,
        )
        response.raise_for_status()  # Surface HTTP errors instead of a KeyError below
        return response.json()["response"]


async def main():
    response = await ollama_with_top_k(
        "Write a short story about AI",
        k=50,
    )
    print(response)

asyncio.run(main())

Part 5: Frequency and Presence Penalties

These parameters come from the OpenAI API (many OpenAI-compatible servers support them too). They modify the logits based on which tokens have already appeared in the generated text.

Frequency Penalty

Reduces probability of tokens in proportion to how often they’ve already appeared.

New logit = Original logit − (frequency_penalty × count_of_token_in_output)

Effect: tokens that appear frequently get their probability reduced more severely.

# Example: if "the" has appeared 5 times so far and frequency_penalty = 0.5:
# "the" logit is reduced by 0.5 × 5 = 2.5
# This significantly reduces the probability of "the" appearing again

frequency_penalty    Effect
──────────────────────────────────────────────────────────
0.0                  No penalty (default)
0.3                  Light reduction of repetition
0.6                  Moderate — good for longer texts
1.0                  Strong — model avoids most repeated tokens
2.0                  Very strong — output becomes highly varied
                     but may sacrifice coherence

Presence Penalty

Reduces probability of tokens that have appeared at all in the output, regardless of how many times.

New logit = Original logit − presence_penalty × (1 if token has appeared else 0)

Effect: any token that’s appeared at all gets a flat penalty. Encourages the model to introduce new topics and vocabulary.
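Both formulas fit in one small function. This sketch applies them to a list of logits indexed by token id; the function name and the three-token example are illustrative, and a hosted API performs the equivalent step on its side.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with frequency and presence penalties subtracted."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= frequency_penalty * count  # grows with each repeat
        adjusted[token_id] -= presence_penalty           # flat, applied once
    return adjusted


logits = [2.0, 1.0, 0.5]        # token ids 0, 1, 2
generated = [0, 0, 0, 0, 0, 1]  # token 0 appeared 5 times, token 1 once

adjusted = apply_penalties(logits, generated, frequency_penalty=0.5, presence_penalty=0.2)
print([round(x, 2) for x in adjusted])  # [-0.7, 0.3, 0.5]
```

Token 0 loses 0.5 × 5 + 0.2 = 2.7, token 1 loses 0.5 × 1 + 0.2 = 0.7, and token 2, which has not appeared yet, is untouched.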

response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a blog post about machine learning"}],
    temperature=0.7,
    frequency_penalty=0.3,  # Reduce repeated words
    presence_penalty=0.2,   # Encourage topic diversity
    max_tokens=800,
)

Penalty Decision Guide

Situation                               Recommendation
──────────────────────────────────────────────────────────────────────────
Short responses (< 100 tokens)          penalties = 0 (negligible benefit)
Long-form content (blog posts, docs)    frequency_penalty = 0.3–0.5
Repetitive output complaints            frequency_penalty = 0.5–0.7
Output stays on one topic too long      presence_penalty = 0.2–0.4
Structured output / JSON                penalties = 0 (can corrupt structure)
Code generation                         penalties = 0 (syntax must repeat)

Part 6: Repetition Penalty (Open-Source / Ollama)

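Open-source inference stacks running models such as Llama, Mistral, and Qwen use a different control that works multiplicatively rather than additively. Hugging Face Transformers calls it repetition_penalty; Ollama and llama.cpp call it repeat_penalty.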

# Ollama / llama.cpp repeat_penalty
# Values > 1.0 penalize repetition; 1.0 = no penalty
import asyncio

import httpx

async def story_with_repeat_penalty() -> str:
    async with httpx.AsyncClient() as http_client:
        response = await http_client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.2",
                "prompt": "Write a story about a robot",
                "options": {
                    "temperature": 0.8,
                    "top_p": 0.9,
                    "top_k": 40,
                    "repeat_penalty": 1.1,  # 1.0 = off, 1.1 = light, 1.3 = strong
                    "repeat_last_n": 64,    # How many past tokens to check
                    "num_predict": 300,
                },
                "stream": False,
            },
            timeout=60.0,
        )
        return response.json()["response"]

print(asyncio.run(story_with_repeat_penalty()))

repeat_penalty    Effect
──────────────────────────────────────────────────
1.0               No penalty (neutral)
1.05 – 1.1        Light — subtle reduction
1.1 – 1.2         Moderate — good default
1.2 – 1.3         Strong — visible variety
1.4+              Very strong — can cause incoherence
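"Multiplicative" is easier to see in code. The sketch below shows one common formulation of the rule, the one used in Hugging Face Transformers and llama.cpp as far as I know: positive logits are divided by the penalty and negative logits are multiplied by it, so a repeated token always becomes less likely. Treat the exact behavior of any particular backend as an assumption to verify; the function name is illustrative.

```python
def apply_repeat_penalty(logits, recent_ids, penalty=1.1):
    """Penalize every token id seen in the recent window, each at most once."""
    adjusted = list(logits)
    for token_id in set(recent_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty  # shrink positive scores toward 0
        else:
            adjusted[token_id] *= penalty  # push negative scores further down
    return adjusted


logits = [2.2, -1.0, 0.5]
generated = [0, 1, 0, 0]

# repeat_last_n limits the window: only the most recent tokens are checked
repeat_last_n = 64
recent = generated[-repeat_last_n:]

print([round(x, 2) for x in apply_repeat_penalty(logits, recent, penalty=1.1)])
# [2.0, -1.1, 0.5]
```

Unlike the additive frequency penalty, the size of the adjustment here depends on the logit itself and not on how many times the token appeared.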

Part 7: How Parameters Interact

The parameters don’t operate independently. They are applied in sequence, first to the logits and then to the probability distribution. The exact order is implementation-specific and hosted APIs do not document it in full, but a typical pipeline looks like this:

Raw logits
    ↓
Apply temperature (divide logits by T)
    ↓
Apply frequency/presence penalties (subtract from logits)
    ↓
Convert to probabilities via softmax
    ↓
Apply Top-K filtering (zero out all but top K)
    ↓
Apply Top-P filtering (zero out below nucleus threshold)
    ↓
Renormalize probabilities
    ↓
Sample one token
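The whole pipeline fits in one function. This is a sketch of the order listed above, under the assumption that a backend follows exactly that order (real implementations may differ). `sample_next_token` and its parameter defaults are illustrative; temperature 0 is treated as greedy decoding to avoid dividing by zero.

```python
import math
import random
from collections import Counter


def sample_next_token(logits, generated_ids, *, temperature=1.0, top_k=0,
                      top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0,
                      rng=None):
    """Return the index of the next token; top_k=0 disables top-k filtering."""
    rng = rng or random.Random()
    counts = Counter(generated_ids)

    def penalize(index, score):
        count = counts.get(index, 0)
        return score - frequency_penalty * count - presence_penalty * (count > 0)

    if temperature == 0:
        # Greedy decoding: take the best penalized score, no sampling
        scores = [penalize(i, x) for i, x in enumerate(logits)]
        return max(range(len(scores)), key=scores.__getitem__)

    # Temperature, then penalties, then softmax
    scores = [penalize(i, x / temperature) for i, x in enumerate(logits)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-K: keep the K most probable token indices
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Top-P: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the surviving weights before sampling
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)
print(sample_next_token([4.7, 2.3, 1.8, 0.1, -2.1], [], temperature=0.7, top_p=0.9, rng=rng))
```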

Critical interaction: Temperature + Top-P


OpenAI recommends: change either temperature OR top_p, not both simultaneously. Modifying both creates compounding effects that are hard to reason about.

# ✅ Correct — change one
{"temperature": 0.3, "top_p": 1.0}  # Cool temperature, no top-p filtering
{"temperature": 1.0, "top_p": 0.8}  # Default temperature, nucleus filtering

# ⚠️ Technically works, but harder to reason about
{"temperature": 0.7, "top_p": 0.9}  # Both active — interactions compound

# In practice, many teams use temperature alone
# and only add top_p when they need adaptive nucleus behavior
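One way to see the compounding effect: temperature reshapes the distribution before top-P measures it, so the same top_p value keeps a different number of tokens at each temperature. The sketch below uses the simulated logits from the temperature section; the helper names are illustrative.

```python
import math


def softmax_t(logits, temperature):
    """Stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs, p):
    """Count the tokens in the smallest set whose cumulative probability reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)


logits = [4.7, 2.3, 1.8, 0.1, -2.1]

for t in (0.5, 1.0, 1.5, 2.0):
    size = nucleus_size(softmax_t(logits, t), 0.9)
    print(f"T={t}, top_p=0.9 -> nucleus of {size} token(s)")
# T=0.5 -> 1, T=1.0 -> 2, T=1.5 -> 3, T=2.0 -> 3
```

With top_p fixed at 0.9, raising the temperature from 0.5 to 1.5 triples the number of candidate tokens, which is why changing both at once makes results harder to attribute to either setting.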

Part 8: Production Parameter Presets

Here’s a tested set of parameter presets covering the most common production use cases. Save this as a reference:

# parameter_presets.py
from dataclasses import dataclass
from typing import Optional

@dataclass
class SamplingParams:
    temperature: float
    top_p: float
    frequency_penalty: float
    presence_penalty: float
    max_tokens: Optional[int] = None
    description: str = ""


PRESETS = {

    # ── Deterministic / Structured ──────────────────────────────────────────────
    "deterministic": SamplingParams(
        temperature=0.0,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        description="Greedy decoding: the same input gives (nearly) the same output. "
                    "Use for: classification, routing, structured JSON, math.",
    ),

    "extraction": SamplingParams(
        temperature=0.1,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        description="Near-deterministic. Use for: data extraction, NER, "
                    "schema filling, fact retrieval.",
    ),

    # ── Code ────────────────────────────────────────────────────────────────────
    "code": SamplingParams(
        temperature=0.2,
        top_p=0.95,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        description="Low temperature preserves syntax. Slight top-p prevents "
                    "hallucinating unusual APIs. Use for: code generation, SQL, "
                    "function calls.",
    ),

    # ── Balanced / General Purpose ───────────────────────────────────────────────
    "default": SamplingParams(
        temperature=0.7,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        description="General purpose. Good starting point for most tasks.",
    ),

    "chat": SamplingParams(
        temperature=0.7,
        top_p=0.95,
        frequency_penalty=0.1,
        presence_penalty=0.0,
        description="Conversational assistant. Light frequency penalty prevents "
                    "repetitive phrasing in longer conversations.",
    ),

    # ── Analysis / Writing ───────────────────────────────────────────────────────
    "analysis": SamplingParams(
        temperature=0.5,
        top_p=0.9,
        frequency_penalty=0.2,
        presence_penalty=0.0,
        description="Analytical tasks: summarization, document analysis, "
                    "report generation. Focused but not robotic.",
    ),

    "writing": SamplingParams(
        temperature=0.8,
        top_p=0.95,
        frequency_penalty=0.4,
        presence_penalty=0.1,
        description="Long-form writing: blog posts, articles, marketing copy. "
                    "Frequency penalty prevents word repetition over long outputs.",
    ),

    # ── Creative ────────────────────────────────────────────────────────────────
    "creative": SamplingParams(
        temperature=1.0,
        top_p=0.95,
        frequency_penalty=0.5,
        presence_penalty=0.2,
        description="Creative writing: stories, poetry, brainstorming. "
                    "High temperature + penalties encourage diverse vocabulary.",
    ),

    "brainstorm": SamplingParams(
        temperature=1.1,
        top_p=0.98,
        frequency_penalty=0.5,
        presence_penalty=0.4,
        description="Ideation: generating diverse options, exploring possibilities. "
                    "High presence penalty pushes toward covering new angles.",
    ),
}


def apply_preset(preset_name: str, **overrides) -> dict:
    """
    Get parameter dict for a preset, with optional overrides.

    Usage:
        params = apply_preset("code", max_tokens=500)
        response = await client.chat.completions.create(model=..., messages=..., **params)
    """
    if preset_name not in PRESETS:
        raise ValueError(f"Unknown preset: {preset_name}. "
                         f"Available: {list(PRESETS.keys())}")

    preset = PRESETS[preset_name]
    result = {
        "temperature": preset.temperature,
        "top_p": preset.top_p,
        "frequency_penalty": preset.frequency_penalty,
        "presence_penalty": preset.presence_penalty,
    }

    # Only send max_tokens when the preset actually sets one
    if preset.max_tokens is not None:
        result["max_tokens"] = preset.max_tokens

    result.update(overrides)
    return result


# ── Usage examples ─────────────────────────────────────────────────────────────
async def example_usage():
    from openai import AsyncOpenAI
    client = AsyncOpenAI()

    # Code generation — low temperature, no penalties
    code_params = apply_preset("code", max_tokens=500)
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Write a Python function to parse a JWT token"
        }],
        **code_params,
    )

    # Creative story — high temperature, strong penalties
    creative_params = apply_preset("creative", max_tokens=600)
    story = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Write the opening paragraph of a noir detective story"
        }],
        **creative_params,
    )

Part 9: Parameters Across Providers

Each provider exposes these controls differently. Here’s the unified reference:

# ── OpenAI ──────────────────────────────────────────────────────────────────────
await openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    temperature=0.7,        # 0.0 – 2.0
    top_p=0.95,             # 0.0 – 1.0
    frequency_penalty=0.3,  # -2.0 – 2.0
    presence_penalty=0.1,   # -2.0 – 2.0
    max_tokens=500,
    n=1,                    # Number of completions to generate
    seed=42,                # For reproducibility (not strict guarantee)
)

# ── Anthropic ───────────────────────────────────────────────────────────────────
await anthropic_client.messages.create(
    model="claude-sonnet-4-20250514",
    messages=[...],
    temperature=0.7,        # 0.0 – 1.0 (note: max is 1.0, not 2.0)
    top_p=0.95,             # 0.0 – 1.0
    top_k=40,               # Top-K (Anthropic supports this too)
    max_tokens=500,
    # Note: No frequency/presence penalties in Anthropic API
)

# ── Gemini ───────────────────────────────────────────────────────────────────────
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    generation_config=genai.GenerationConfig(
        temperature=0.7,         # 0.0 – 2.0
        top_p=0.95,              # 0.0 – 1.0
        top_k=40,                # Integer > 0
        max_output_tokens=500,
        candidate_count=1,       # Number of candidate completions
        stop_sequences=["END"],  # Custom stop strings
    )
)

# ── Ollama (local models) ─────────────────────────────────────────────────────────
{
    "model": "llama3.2",
    "options": {
        "temperature": 0.7,     # 0.0 – 2.0
        "top_p": 0.9,           # 0.0 – 1.0
        "top_k": 40,            # Positive integer
        "repeat_penalty": 1.1,  # > 1.0 penalizes repetition
        "repeat_last_n": 64,    # Context window for repeat penalty
        "num_predict": 500,     # max_tokens equivalent
        "seed": -1,             # -1 = random, positive = reproducible
        "num_ctx": 4096,        # Context window size
        "mirostat": 0,          # Advanced: Mirostat sampling (0=off, 1, 2)
        "mirostat_tau": 5.0,    # Target entropy (Mirostat only)
        "mirostat_eta": 0.1,    # Learning rate (Mirostat only)
    }
}

Provider Quirks Worth Knowing

Provider     Temperature max    Penalties            Top-K    Notes
──────────────────────────────────────────────────────────────────────────────────────────
OpenAI       2.0                freq + presence ✓    No       seed for reproducibility
Anthropic    1.0                None                 Yes      Stricter than OpenAI range
Gemini       2.0                None                 Yes      candidate_count for multiple candidates
Ollama       2.0                repeat_penalty ✓     Yes      Mirostat available

Part 10: Determinism and Reproducibility

The Seed Parameter

OpenAI provides a seed parameter for reproducibility, but it’s not a hard guarantee:

# Same seed → usually same output (but not guaranteed across model updates)
response1 = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Pick a random number from 1-10"}],
    temperature=0.7,
    seed=42,
)

response2 = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Pick a random number from 1-10"}],
    temperature=0.7,
    seed=42,
)

# Check system_fingerprint: if it differs, the backend configuration changed between calls
print(response1.system_fingerprint)  # fp_a2c3b1...
print(response2.system_fingerprint)  # fp_a2c3b1... (same = same backend configuration)

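Near-Determinism: Temperature = 0

If you need reproducibility (evaluation pipelines, unit tests, regression testing), use temperature=0. This selects the highest-probability token at every step, which removes sampling randomness. Hosted APIs can still show small run-to-run differences caused by floating-point and batching effects, so treat the output as highly repeatable rather than guaranteed bit-for-bit: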

# For evaluation and testing — ALWAYS use temperature=0
async def evaluate_model(test_cases: list[dict]) -> list[dict]:
    results = []
    for test in test_cases:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": test["input"]}],
            temperature=0,  # Greedy decoding: minimizes run-to-run variation
            top_p=1.0,
            seed=0,         # Belt and suspenders
        )
        results.append({
            "input": test["input"],
            "expected": test["expected"],
            "actual": response.choices[0].message.content,
        })
    return results
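Reproducibility is worth measuring rather than assuming. One simple approach is to run the same input several times and compute how often the outputs agree. The helper below is a hypothetical sketch that works on a list of already-collected output strings, so it does not depend on any particular API.

```python
from collections import Counter


def agreement_rate(outputs):
    """Fraction of outputs that match the most common output (after trimming whitespace)."""
    if not outputs:
        raise ValueError("need at least one output")
    normalized = [text.strip() for text in outputs]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Example: four runs of the same prompt, one of which disagrees
runs = ["Canberra", "Canberra ", "Sydney", "Canberra"]
print(agreement_rate(runs))  # 0.75
```

A rate below 1.0 on an evaluation set at temperature=0 is a signal to pin the model version and check `system_fingerprint` before trusting score differences between runs.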

🔍 Common Mistakes

1. Using temperature=1.0 for factual questions

Temperature 1.0 is the model’s raw, unmodified distribution — built for language modeling, not factual accuracy. For Q&A, use 0.1–0.3. A common symptom: correct answers most of the time, wrong answers occasionally in production.

2. Using temperature=0 for creative tasks

At temperature 0, the model picks the single most probable token at every step (greedy decoding). This works great for facts, but creative writing becomes robotic and formulaic. Users notice.

3. Setting both frequency and presence penalties high

High frequency_penalty + high presence_penalty together create outputs that desperately avoid any familiar phrasing. The result sounds alien and stilted. Use one at a time, at moderate values.

4. Not adjusting temperature for structured output (JSON)

When you need JSON output, use temperature=0 and the response_format={"type": "json_object"} parameter. Randomness raises the odds of malformed or off-schema output. Note that JSON mode requires the word “JSON” to appear somewhere in your messages; otherwise the API rejects the request.

# ✅ Correct — deterministic for structured output
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Extract name and age from: John is 32. Respond in JSON.",
    }],
    temperature=0,                            # Critical
    response_format={"type": "json_object"},  # Ensure valid JSON
)

5. Ignoring output length when applying penalties

frequency_penalty and presence_penalty apply to the output so far during generation. For very short outputs (1-3 sentences), penalties have almost no visible effect. Don’t apply penalties to short Q&A responses expecting any difference.

6. Mixing temperature and top_p both to non-default values

This creates compound filtering effects that are hard to interpret. Pick one primary control. OpenAI’s own documentation says: “We generally recommend altering this or temperature but not both.”

💼 Quick Questions

Q: What’s the difference between temperature and top-P at a mechanistic level?

Temperature scales all logits before softmax, uniformly adjusting the sharpness of the entire distribution. Top-P filters the already-converted probability distribution, removing low-probability tokens entirely. Temperature changes the shape of the distribution; top-P changes its support (which tokens are possible at all). They operate on different stages of the sampling pipeline.

Q: Why would you use top-P instead of just using a lower temperature?

Top-P adapts to the model’s confidence dynamically. When the model is very confident (one token has 90% probability), top-P=0.9 samples from just 1–2 tokens. When the model is uncertain (50 tokens each at 2%), top-P=0.9 keeps about 45 of them. A fixed temperature can’t adapt this way: it applies the same amount of sharpening regardless of how confident the model already is.

Q: How would you configure parameters for an AI system that classifies customer support tickets into categories?

temperature=0, top_p=1.0, no penalties. Classification is deterministic by nature — you want the model’s highest-confidence prediction every time. Any randomness introduces inconsistency across repeated runs on identical tickets. Also add response_format={"type": "json_object"} and specify the exact category names in the prompt.

Q: A client complains that their LLM-generated product descriptions are repetitive. What do you investigate?

First check frequency_penalty — it’s likely 0.0 (default). Increase to 0.3–0.5. Also check if the prompt template is over-constraining the model (tight word-by-word instructions force repetition regardless of penalties). Check output length — short outputs can’t avoid repetition even with penalties active. If using Claude, note there are no frequency penalties; instead craft the prompt to explicitly ask for varied vocabulary.

🏭 Production Considerations

Evaluation first, parameters second. Before tuning parameters in production, build an offline evaluation dataset with 50–100 representative examples and a scoring rubric. Tune parameters against that dataset, not live traffic. Parameter changes that feel like improvements in casual testing can silently degrade quality for edge cases.

Temperature drift in fine-tuned models. If you fine-tune a model, the learned distribution changes. A temperature that worked well for the base model may produce different behavior on the fine-tuned version. Always re-evaluate parameters after fine-tuning.

Separate parameter profiles per use case. Large production applications have multiple AI-powered features: chat, summarization, code generation, classification. Each needs its own parameter preset. Maintain a central params_config.yaml and load it at startup — never hardcode parameters scattered across the codebase.
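A minimal sketch of central preset loading with validation. It parses JSON so that it runs with the standard library alone; a YAML file would work the same way with a YAML parser in place of `json.loads`. The preset names, the required-key rule, and the function name are illustrative assumptions.

```python
import json

REQUIRED_KEYS = {"temperature", "top_p"}  # assumed minimum for every preset


def load_presets(text):
    """Parse preset definitions and fail fast if one is incomplete."""
    presets = json.loads(text)
    for name, params in presets.items():
        missing = REQUIRED_KEYS - params.keys()
        if missing:
            raise ValueError(f"Preset {name!r} is missing {sorted(missing)}")
    return presets


# In a real application this text would be read from the config file at startup
CONFIG_TEXT = """
{
  "classification": {"temperature": 0.0, "top_p": 1.0},
  "chat": {"temperature": 0.7, "top_p": 0.95, "frequency_penalty": 0.1}
}
"""

presets = load_presets(CONFIG_TEXT)
print(presets["chat"]["temperature"])  # 0.7
```

Validating at startup means a typo in the config surfaces at deploy time instead of as a silent fallback to provider defaults.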

Log parameters with every request. When debugging quality issues, you need to know exactly what parameters produced the output. Log model, temperature, top_p, max_tokens, and any penalties alongside the input/output in your observability pipeline.
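A sketch of what such a log record could look like, assuming one JSON line per request. The field names and the `build_request_log` helper are hypothetical; adapt them to whatever your logging pipeline expects.

```python
import json
import time

# Sampling parameters worth recording with every request
LOGGED_PARAMS = ("temperature", "top_p", "max_tokens",
                 "frequency_penalty", "presence_penalty")


def build_request_log(model, params, prompt, output):
    """Serialize one request/response pair with its sampling parameters."""
    record = {
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "output": output,
    }
    for name in LOGGED_PARAMS:
        record[name] = params.get(name)  # None means the provider default was used
    return json.dumps(record, ensure_ascii=False)


line = build_request_log(
    "gpt-4o",
    {"temperature": 0.2, "top_p": 1.0},
    prompt="Classify this ticket",
    output="billing",
)
print(line)
```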

Watch for finish_reason: length. If a response ends with finish_reason=length, the model was cut off by max_tokens. Increase max_tokens or the output may be incomplete in ways that are hard to detect automatically.
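A small check along these lines can run on every response. The sketch assumes the OpenAI-style response shape, where each entry in `response.choices` carries a `finish_reason` attribute; the stand-in object below only imitates that shape so the example runs offline.

```python
from types import SimpleNamespace


def truncated_choices(response):
    """Return the indices of choices that were cut off by max_tokens."""
    return [
        index
        for index, choice in enumerate(response.choices)
        if choice.finish_reason == "length"
    ]


# Stand-in for an API response, for illustration only
fake_response = SimpleNamespace(choices=[
    SimpleNamespace(finish_reason="stop"),
    SimpleNamespace(finish_reason="length"),
])

cut_off = truncated_choices(fake_response)
if cut_off:
    print(f"Warning: choices {cut_off} hit max_tokens and may be incomplete")
```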

🔑 Key Takeaways

  1. LLMs generate text by sampling from a probability distribution — understanding this makes all other parameters intuitive
  2. Temperature controls distribution sharpness — lower = more confident/deterministic, higher = more creative/random
  3. Top-P adapts the nucleus dynamically to the model’s own confidence level
  4. Top-K provides fixed-size filtering — simpler but less adaptive than top-P
  5. Frequency penalty reduces repeated tokens proportionally to how often they’ve appeared
  6. Presence penalty encourages topic diversity by penalizing any token that’s appeared at all
  7. Use temperature=0 for evaluation, classification, and structured output — never introduce randomness where you need consistency
  8. Don’t tune both temperature and top-P simultaneously — pick one primary control
  9. Different use cases need different presets — code, creative writing, factual Q&A, and structured output each have optimal settings
  10. Provider APIs differ in their ranges and supported parameters — Anthropic’s max temperature is 1.0, not 2.0
