Key Model Parameters

Temperature, top-p, max tokens, frequency penalty — what each one does and when to change it.

These Parameters Actually Matter

Most tutorials mention temperature and move on. But if you're building anything beyond a quick prototype, you need to understand what each parameter does — and more importantly, when cranking one up will hurt you.

Temperature

Temperature controls the randomness of token selection during sampling. The model produces a probability distribution over its vocabulary at each step. Temperature reshapes that distribution before you sample from it.

Temperature 0.0: Always pick the highest-probability token. Deterministic (with caveats — floating point and GPU non-determinism mean you might get slight variation in practice). Use this for classification, extraction, structured output, anything where you want consistency.
Temperature 0.7: The reasonable default for most tasks. Some creativity, mostly coherent.
Temperature 1.0: Raw distribution, no reshaping. More varied outputs.
Temperature 1.5–2.0: Distribution flattened further. Outputs get weird. Useful for brainstorming divergent ideas, not for anything requiring accuracy. Only OpenAI accepts values in this range; Anthropic's API caps temperature at 1.0.

The analogy: temperature 0 is a student who always gives the most expected answer. Temperature 2 is that same student after three espressos and a philosophy seminar.
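To make the reshaping concrete, here is a minimal sketch of temperature scaling applied to a softmax. The logits are made-up numbers for three candidate tokens, and `softmax_with_temperature` is a hypothetical helper, not part of any provider SDK. Real inference stacks do this on the full vocabulary, but the arithmetic is the same idea.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # illustrative scores, not from a real model
for t in (0.0, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, the top token's share shrinks and the tail gains probability, which is why high settings produce more surprising output.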

Top-p (Nucleus Sampling)

Top-p (also called nucleus sampling) limits token selection to the smallest set of tokens whose cumulative probability exceeds p. At top_p=0.9, the model only samples from tokens that together account for 90% of the probability mass.

This is adaptive: if the model is very confident (one token has 95% probability), top-p of 0.9 might only include one token. If the model is uncertain, top-p might include hundreds of tokens.
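The adaptive behavior is easier to see in code. The sketch below is a toy nucleus filter over a small dictionary of token probabilities; `top_p_filter` and both distributions are invented for illustration, and real implementations work on tensors over the whole vocabulary.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the surviving tokens form a valid distribution.
    return {token: prob / total for token, prob in kept}


confident = {"Paris": 0.95, "Lyon": 0.03, "Nice": 0.02}
uncertain = {"red": 0.28, "blue": 0.24, "green": 0.2, "gold": 0.16, "pink": 0.12}

print(top_p_filter(confident, 0.9))       # a single token survives
print(len(top_p_filter(uncertain, 0.9)))  # all five tokens survive
```

The same p value keeps one candidate in the confident case and five in the uncertain one, which is the whole point of nucleus sampling.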

Top-p and temperature interact. Many practitioners set one and leave the other at its default. OpenAI recommends not adjusting both simultaneously. As a rule of thumb: use temperature for coarse control, leave top-p at 1.0, or vice versa.

Top-k

Top-k limits sampling to the top k tokens by probability, regardless of their cumulative probability. top_k=50 means always sample from the 50 highest-probability tokens.

Top-k is a blunter instrument than top-p. It doesn't adapt to the model's confidence level — you get 50 candidates whether the top token has 99% probability or 5% probability. Anthropic's API exposes top-k; OpenAI's does not.
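For comparison, here is an equally small sketch of top-k filtering. `top_k_filter` and the two distributions are hypothetical; the point is only that k fixes the candidate count regardless of how the probability mass is spread.

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}


confident = {"Paris": 0.99, "Lyon": 0.006, "Nice": 0.004}
flat = {"red": 0.4, "blue": 0.35, "green": 0.25}

# Both distributions keep exactly two candidates, however confident the model is.
print(top_k_filter(confident, 2))
print(top_k_filter(flat, 2))
```

In the confident case the second candidate is kept even though it is a long shot, which is the bluntness described above.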

Max Tokens

Max tokens (called max_tokens in Anthropic, max_completion_tokens in newer OpenAI versions) sets a hard ceiling on output length. The model will stop generating at this limit even mid-sentence.

Set this lower than you think you need for cost control in production. For a customer support bot, 512 tokens is probably enough. For a summarization task, 1024. For a full code generation task, you might need 4096+.

If you're getting truncated responses, this is the first thing to check.
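One way to check for truncation programmatically is to look at the reason the provider gives for stopping. The sketch below assumes the values documented for each API at the time of writing: `stop_reason == "max_tokens"` on an Anthropic message and `finish_reason == "length"` on an OpenAI chat completion choice. `was_truncated` is a hypothetical helper, so verify the field names against your SDK version.

```python
# Stop/finish reason that signals the output hit the token ceiling.
# Assumed values; confirm against the current API reference for each provider.
TRUNCATION_REASONS = {
    "anthropic": "max_tokens",  # read from response.stop_reason
    "openai": "length",         # read from response.choices[0].finish_reason
}


def was_truncated(provider, reason):
    """Return True when the reported reason means the token limit was reached."""
    return TRUNCATION_REASONS[provider] == reason


print(was_truncated("anthropic", "max_tokens"))  # True
print(was_truncated("openai", "stop"))           # False
```

Logging this flag in production makes it obvious when a limit is too tight, rather than discovering it from half-finished answers.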

Frequency Penalty and Presence Penalty (OpenAI)

These are OpenAI-specific parameters that Anthropic doesn't directly expose.

Frequency penalty (-2.0 to 2.0; positive values are the usual choice): Reduces the probability of tokens that have already appeared, proportional to how many times they've appeared. Good for reducing repetition in long generations.

Presence penalty (-2.0 to 2.0; positive values are the usual choice): Reduces the probability of any token that has appeared at all, regardless of frequency. Encourages the model to introduce new topics and vocabulary.

For most use cases, leave both at 0. If you're getting repetitive outputs, try frequency_penalty=0.3 before reaching for higher values.
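The difference between the two penalties is clearer as arithmetic. The sketch below follows the adjustment described in OpenAI's documentation, where each logit is reduced by the token's count times the frequency penalty, plus a flat presence penalty if the token has appeared at all. `apply_penalties` and the logits are illustrative, not the provider's actual implementation.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the generated text."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        adjusted[token] = (
            logit
            - count * frequency_penalty                       # grows with each repeat
            - (1.0 if count > 0 else 0.0) * presence_penalty  # flat, applied once
        )
    return adjusted


logits = {"the": 3.0, "cat": 2.0, "dog": 2.0}  # made-up scores
history = ["the", "cat", "the", "the"]

print(apply_penalties(logits, history, frequency_penalty=0.5))
# {'the': 1.5, 'cat': 1.5, 'dog': 2.0}
print(apply_penalties(logits, history, presence_penalty=0.5))
# {'the': 2.5, 'cat': 1.5, 'dog': 2.0}
```

With the frequency penalty, "the" is hit three times as hard as "cat"; with the presence penalty, both lose the same amount.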

Stop Sequences

Stop sequences (stop_sequences in Anthropic, stop in OpenAI) tell the model to stop generating when it produces a specific string. This is more reliable than max_tokens for structured output because you can stop exactly where you want.

Common uses:
- Stop on \n for single-line responses
- Stop on </answer> if you're using XML-style output formatting
- Stop on ### if you're generating structured documents
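The provider applies stop sequences server-side, but the effect can be simulated locally to see what you get back. In this sketch, `truncate_at_stop` is a hypothetical helper that cuts text at the earliest stop sequence and leaves the sequence itself out, which matches how the returned text typically behaves.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence (sequence excluded)."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


raw = "<answer>42</answer> Let me also explain why..."
print(truncate_at_stop(raw, ["</answer>"]))  # <answer>42
```

Because the closing tag is not part of the output, a parser downstream should expect the opening tag without its match, or you can append the tag yourself.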

Practical Parameter Recipes

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Anthropic: classification/extraction (deterministic)
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=256,
    temperature=0.0,
    messages=[{"role": "user", "content": "Classify this as positive, neutral, or negative: ..."}],
)

# Anthropic: general chat (balanced)
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    temperature=0.7,
    messages=[{"role": "user", "content": "Explain how async/await works in Python."}],
)

# Anthropic: creative writing (expressive)
# Temperature is left at its default of 1.0 and only top_p is set, in line with
# the advice above; newer Claude models may reject requests that specify both.
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=2048,
    top_p=0.95,
    messages=[{"role": "user", "content": "Write an opening scene for a noir detective story."}],
)
```

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# OpenAI: classification (deterministic)
response = client.chat.completions.create(
    model="gpt-4o",
    max_completion_tokens=256,
    temperature=0.0,
    messages=[{"role": "user", "content": "Classify this as positive, neutral, or negative: ..."}],
)

# OpenAI: general chat
response = client.chat.completions.create(
    model="gpt-4o",
    max_completion_tokens=1024,
    temperature=0.7,
    messages=[{"role": "user", "content": "Explain how async/await works in Python."}],
)

# OpenAI: reducing repetition in longer output
response = client.chat.completions.create(
    model="gpt-4o",
    max_completion_tokens=2048,
    temperature=0.8,
    frequency_penalty=0.3,
    messages=[{"role": "user", "content": "Write a detailed technical blog post about..."}],
)
```

A Note on "Reasoning" Models

OpenAI's o1/o3 series and Anthropic's extended thinking mode work differently: they run their own internal chain-of-thought before producing output. For these models, temperature is either fixed or less meaningful. You configure them differently, usually by setting a thinking budget (Anthropic) or a reasoning effort level (OpenAI) rather than temperature. Don't apply the same mental model to reasoning models as you would to standard completions.
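As a rough illustration of configuring by budget rather than temperature, the sketch below builds the keyword arguments for an Anthropic request with extended thinking. It assumes the `thinking` parameter shape from Anthropic's documentation at the time of writing (`type` plus `budget_tokens`, with the budget smaller than `max_tokens`); `build_thinking_request` is a hypothetical helper, so check the current API reference before relying on it.

```python
def build_thinking_request(prompt, max_tokens=4096, budget_tokens=2048):
    """Assemble kwargs for client.messages.create with extended thinking enabled."""
    if budget_tokens >= max_tokens:
        # Assumption: the thinking budget must leave room for the visible answer.
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "model": "claude-opus-4-5",
        "max_tokens": max_tokens,
        # No temperature here: the thinking budget is the main knob.
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }


request = build_thinking_request("Plan a database migration with zero downtime.")
print(request["thinking"])
```

The resulting dictionary can be unpacked into `client.messages.create(**request)`; keeping it as plain data makes the budget easy to tune per task.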
