New LLMs, such as OpenAI’s GPT-5 family, come in different versions (e.g., GPT-5, GPT-5-mini, and GPT-5-nano) and can be run at different settings (high, medium, low, and minimal). Below, we explore the differences between these versions by gathering their benchmark performance and the cost of running those benchmarks.
Price vs. Success: Key Takeaways
Our analysis focuses on the GPT-5 family, OpenAI’s newest models. We ran six benchmarks spanning reasoning, coding, instruction following, and math.
Our analysis revealed:
- On average across benchmarks, GPT-5 (high) and GPT-5 (medium) deliver nearly identical success rates (65% vs. 64%), yet GPT-5 (high) costs almost twice as much ($511 vs. $280). They are followed by GPT-5-mini (high), GPT-5 (low), and GPT-5-mini (medium), with success rates of 62%, 61%, and 60%, respectively, at much lower prices of $105, $90, and $28. In other words, accepting a drop of roughly five percentage points in success rate lets tasks be completed at up to 18 times lower cost by switching from GPT-5 (high) to GPT-5-mini (medium).
- GPT-5-mini (high) outperforms GPT-5 (low) in nearly every benchmark, and does so at the same or lower cost. In IFBench, success rates are 75% vs. 67%; in AIME 2025, 97% vs. 83%; in Humanity’s Last Exam, 20% vs. 18%; and in GPQA Diamond, 83% vs. 81%. They tie on SciCode at 39%, yet GPT-5-mini (high) still comes in at a lower cost.
- The most expensive model, GPT-5 (high), only outperforms the second-best performer in three benchmarks, and even then, the margin is no greater than 3%. In all other benchmarks, it is outperformed by cheaper alternatives.
High-medium-low-minimal parameter settings
Although LLM parameters are often described in terms of numerical adjustments, they may also be expressed as qualitative ranges such as high, medium, and low. These ranges are not fixed standards; instead, they are conceptual categories that describe how much influence a parameter exerts on the model’s output.
Working with these qualitative levels makes it quick to select settings for different tasks, depending on the desired level of creativity, determinism, or length. They are especially useful when adjusting temperature, top-p, max tokens, and penalty parameters.
The medium setting corresponds to a model’s default, unmodified behavior.
Minimal setting:
- Temperature: 0.0–0.2
- Top-p / Top-k: Very low (top-p ≈ 0.1–0.2, top-k = 1–5)
- Max tokens: Short limit
- Penalties: Very low or none
- Effects:
- Highly deterministic, almost identical outputs each time.
- Very concise, factual, and rigid.
- Best for code, math, database queries, or strict compliance answers.
- Very constrained, with low randomness, favoring predictability and precision.
Low setting:
- Temperature: 0.2–0.3
- Top-p / Top-k: Low (top-p ≈ 0.3–0.5, top-k = 5–10)
- Max tokens: Short to medium
- Penalties: Low to moderate
- Effects:
- Mostly deterministic but allows minor variations.
- Reduces robotic repetition compared to minimal.
- Suitable for summaries, structured explanations, or professional writing with a consistent style.
Medium setting:
- Temperature: 0.4–0.7
- Top-p / Top-k: Moderate (top-p ≈ 0.7–0.9, top-k = 20–50)
- Max tokens: Medium length
- Penalties: Moderate, to avoid repetition but allow some creativity
- Effects:
- Balanced between accuracy and creativity.
- Produces natural responses that vary slightly across runs.
- Suitable for general Q&A, drafting, and brainstorming.
High setting:
- Temperature: 0.8–1.2
- Top-p / Top-k: High (top-p ≈ 0.95–1.0, top-k = 50–100)
- Max tokens: Large limit for longer outputs
- Penalties: Medium to high, encouraging variety and novelty
- Effects:
- Highly creative and diverse outputs.
- Less predictable, with a greater risk of hallucinations.
- Best for storytelling, ideation, roleplay, and creative writing.
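As a rough sketch of how these qualitative levels might translate into concrete API arguments, the snippet below defines illustrative presets (midpoints of the ranges above, not official values for any model) and passes them to a chat-completions call with the OpenAI Python SDK. The model name is only an example, and top-k is omitted because the OpenAI API does not expose it.

```python
from openai import OpenAI  # assumes the official openai Python SDK is installed

# Illustrative presets only: rough midpoints of the ranges described above.
PRESETS = {
    "minimal": {"temperature": 0.1,  "top_p": 0.15, "max_tokens": 256,
                "frequency_penalty": 0.0, "presence_penalty": 0.0},
    "low":     {"temperature": 0.25, "top_p": 0.4,  "max_tokens": 512,
                "frequency_penalty": 0.2, "presence_penalty": 0.1},
    "medium":  {"temperature": 0.6,  "top_p": 0.85, "max_tokens": 1024,
                "frequency_penalty": 0.4, "presence_penalty": 0.3},
    "high":    {"temperature": 1.0,  "top_p": 0.97, "max_tokens": 2048,
                "frequency_penalty": 0.6, "presence_penalty": 0.5},
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(prompt: str, level: str = "medium") -> str:
    """Send a prompt using one of the qualitative presets above."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; substitute whichever model you use
        messages=[{"role": "user", "content": prompt}],
        **PRESETS[level],
    )
    return response.choices[0].message.content

print(generate("Write a one-sentence product description.", level="minimal"))
print(generate("Brainstorm five taglines for a coffee shop.", level="high"))
```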
To decide which level to use, consider:
- Task type/purpose: If you need accuracy (legal, medical, code, factual), choose minimal or medium. If you need creativity, voice, or novelty, high may be the better choice.
- Tolerance for errors: How costly are occasional quirks or mistakes? If your tolerance is low, avoid high randomness.
- Computational constraints: High output lengths and high randomness often require more compute and memory.
- Model size: Larger models tend to cope better with high randomness, while smaller models may degrade significantly under high settings.
- Desired output length: Longer generated text can drift, so high randomness plus long length is riskier.
GPT-5
GPT-5 is the flagship model designed for coding, reasoning, and agentic tasks across domains. It balances higher reasoning ability with medium speed, making it suitable for complex, multi-step tasks where accuracy and adaptability are crucial.
- Context window: 400,000
- Max output tokens: 128,000
- Knowledge cutoff: September 30, 2024
- Reasoning: Higher, with reasoning token support
Pricing per 1M tokens:
- Input: $1.25
- Cached input: $0.125
- Output: $10.00
Modalities
- Text: input and output
- Image: input only
- Audio: not supported
GPT-5 Mini
GPT-5 Mini is a smaller, faster, and more affordable version of GPT-5. It retains strong reasoning ability while being better suited to well-defined tasks.
- Context window: 400,000
- Max output tokens: 128,000
- Knowledge cutoff: May 31, 2024
- Features: Supports web search, file search, and code interpreter.
Pricing per 1M tokens:
- Input: $0.25
- Cached input: $0.025
- Output: $2.00
GPT-5 Nano
GPT-5 Nano is the fastest and cheapest option, designed for lightweight tasks such as classification and summarization.
- Context window: 400,000
- Max output tokens: 128,000
- Knowledge cutoff: May 31, 2024
- Features: Supports file search, image generation, and code interpreter (but not web search).
Pricing per 1M tokens:
- Input: $0.05
- Cached input: $0.005
- Output: $0.40
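To make the pricing above concrete, here is a small sketch that estimates the cost of a single request for each model from the per-1M-token rates listed; the token counts in the example are made up.

```python
# Per-1M-token prices (USD) from the pricing sections above.
PRICES = {
    "gpt-5":      {"input": 1.25, "cached_input": 0.125, "output": 10.00},
    "gpt-5-mini": {"input": 0.25, "cached_input": 0.025, "output": 2.00},
    "gpt-5-nano": {"input": 0.05, "cached_input": 0.005, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request from its token counts."""
    p = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * p["input"]
            + cached_tokens * p["cached_input"]
            + output_tokens * p["output"]) / 1_000_000

# Hypothetical request: 10,000 prompt tokens and 2,000 generated tokens.
for name in PRICES:
    print(f"{name}: ${request_cost(name, 10_000, 2_000):.4f}")
# gpt-5: $0.0325, gpt-5-mini: $0.0065, gpt-5-nano: $0.0013
```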
Key differences
Reasoning:
- GPT-5 has the highest reasoning ability. It is best suited for complex, multi-step tasks that require logical depth, such as advanced coding, scientific problem-solving, or nuanced decision-making.
- GPT-5 mini retains strong reasoning skills but is more effective for structured or well-defined prompts where efficiency matters as much as depth.
- GPT-5 nano offers only average reasoning ability. It works best when the task is straightforward and does not require deep analysis.
Speed:
- GPT-5 nano is the fastest option, optimized for real-time or large-scale applications where response time is critical.
- GPT-5 mini strikes a balance between speed and reasoning, making it suitable for everyday workloads that require both quick responses and accuracy.
- GPT-5 is slower than its smaller versions due to its more intensive reasoning process, but this trade-off yields higher-quality outputs.
Cost:
- GPT-5 nano is the most cost-efficient. Its low input and output token rates make it practical for high-volume workloads such as batch summarization or classification at scale.
- GPT-5 mini sits in the mid-range, offering a balance between cost and capability. It is often chosen when users want efficiency without sacrificing too much reasoning performance.
- GPT-5 is the most expensive. It is generally reserved for scenarios where accuracy, reliability, and advanced reasoning are more important than cost.
What are LLM parameters?
LLM parameters are settings that control how large language models (LLMs) generate text during inference. They do not alter the learned weights of a pre-trained model, which are determined during training on large amounts of data. Instead, they shape how the next token is selected from the probability distribution the model predicts.
By adjusting LLM parameters, users can influence factors such as creativity, determinism, response length, repetitiveness, and coherence. This is crucial for tailoring a model’s behavior to specific tasks, whether generating creative content, producing structured outputs, or performing technical tasks such as code generation.
Key parameters include the temperature parameter, top-p sampling, maximum tokens, frequency penalty, presence penalty, and stop sequences. Each affects the model’s output in different ways and must be used carefully to strike a balance between output quality, computational resources, and model performance.
Temperature parameter
The temperature parameter controls how sharply the model’s probability distribution is sampled. By scaling the logits, temperature modifies the likelihood of selecting high-probability tokens.
- Low temperature values (0.0–0.3): The model produces deterministic responses, favoring the most probable tokens. This setting is practical for focused outputs such as summarization or factual Q&A.
- Higher temperature values (0.7–1.2): The model generates a broader range of outputs, where higher values increase creativity but also risk incoherence. This suits creative writing or brainstorming.
- Extreme values: A temperature of 0 forces the model to always select the most probable token, while very high values may lead to incoherent text.
Adjusting LLM parameters, such as temperature, requires considering the model size. Larger models handle high randomness more effectively, while smaller models may compromise output quality at higher settings.
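Conceptually, temperature divides the logits before the softmax. The short NumPy sketch below (with made-up logits) shows how a low temperature sharpens the distribution toward the top token, while a high temperature flattens it.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn raw logits into token probabilities, scaled by temperature."""
    scaled = logits / max(temperature, 1e-6)  # guard against division by zero
    scaled -= scaled.max()                    # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])       # made-up logits for four tokens
print(softmax_with_temperature(logits, 0.2))  # near one-hot: top token dominates
print(softmax_with_temperature(logits, 0.7))  # balanced
print(softmax_with_temperature(logits, 1.2))  # flatter: more diverse sampling
```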
Top p sampling
Top p sampling, also known as nucleus sampling, limits token selection to the smallest set whose cumulative probability exceeds a specified threshold (p).
- Lower values (e.g., p = 0.5): The model samples only from the most probable tokens, resulting in coherent responses but reduced diversity.
- Higher values (e.g., p = 0.9): The model generates from a wider pool, supporting the creation of creative content, but risks producing off-topic continuations.
Compared to temperature, top-p provides more direct control over the diversity of token selection. Combining the two effectively is crucial for balancing creativity and precision in the generated text.
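A minimal sketch of nucleus sampling, assuming you already have a probability vector over tokens: sort tokens by probability, keep the smallest prefix whose cumulative mass reaches p, renormalize, and sample.

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample a token index from the smallest set of tokens whose
    cumulative probability reaches the threshold p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                    # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

probs = np.array([0.60, 0.18, 0.10, 0.07, 0.05])  # illustrative distribution
print(top_p_sample(probs, p=0.5))   # nucleus = {token 0}: effectively greedy
print(top_p_sample(probs, p=0.9))   # nucleus = top four tokens: more diverse
```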
Max tokens
The max tokens parameter, sometimes called the token limit, defines the maximum number of tokens the model can produce in its generated output. It directly influences response length and the allocation of computational resources.
- A small number ensures concise outputs but may truncate essential details.
- A larger value allows detailed explanations but requires more computational resources and increases costs.
The token limit is also bounded by the context window, which includes both the prompt and the generated text. Exceeding this upper limit is not possible, regardless of the max tokens setting.
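As a small sketch of this interaction (token counts here are illustrative): the effective output budget is the smaller of the requested max tokens and whatever room the prompt leaves in the context window.

```python
def effective_output_budget(context_window: int, prompt_tokens: int,
                            max_tokens: int) -> int:
    """Output is capped by both max_tokens and the remaining context window."""
    remaining = max(context_window - prompt_tokens, 0)
    return min(max_tokens, remaining)

# Hypothetical model with a 4,000-token context window and a 3,000-token prompt:
print(effective_output_budget(4_000, prompt_tokens=3_000, max_tokens=2_000))  # -> 1000
```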
Frequency penalty parameter
The frequency penalty parameter adjusts the likelihood of repeated tokens based on their frequency of occurrence in the output.
- Positive values: Reduce repetition, leading to more diverse text.
- Negative values: Encourage the reuse of tokens, which can be beneficial when a term must appear multiple times in a document.
- Excessively high penalties: Can harm coherence, since natural repetition is often necessary in human-like writing.
This parameter is particularly useful for avoiding redundancy in long texts.
Presence penalty
The presence penalty reduces the probability of tokens that have appeared even once in the text. Unlike the frequency penalty, it does not consider the number of occurrences but instead enforces novelty.
- Positive values: Promote exploration of new topics, helpful for creative writing or brainstorming sessions.
- Negative values: Can push the model’s behavior toward reinforcing certain words, which may be helpful in specific tasks like structured dialogues.
The presence penalty helps ensure the model generates diverse ideas, but it must be balanced to avoid unnatural avoidance of necessary terms.
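To illustrate the difference between the two penalties, the sketch below applies both to a logit vector, following the commonly described formulation in which the frequency penalty scales with how many times a token has already appeared, while the presence penalty is a one-time deduction. Values and counts are made up.

```python
import numpy as np

def apply_penalties(logits: np.ndarray, counts: np.ndarray,
                    frequency_penalty: float, presence_penalty: float) -> np.ndarray:
    """Lower the logits of tokens that have already been generated.

    counts[i] is how many times token i has appeared in the output so far."""
    adjusted = logits.astype(float)
    adjusted -= frequency_penalty * counts        # grows with each repetition
    adjusted -= presence_penalty * (counts > 0)   # binary: has the token appeared at all?
    return adjusted

logits = np.array([2.0, 1.5, 1.0])   # made-up logits for three candidate tokens
counts = np.array([3, 1, 0])         # token 0 used three times, token 1 once, token 2 never
print(apply_penalties(logits, counts, frequency_penalty=0.5, presence_penalty=0.5))
# -> [0.0, 0.5, 1.0]: the heavily repeated token is penalized the most
```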
Stop sequences
Stop sequences define explicit strings or tokens that cause the model’s output to halt. This allows the user to set a clear stopping point in the generated text outputs.
- Common in structured applications such as code generation, dialogue systems, or when aligning output to a template.
- Helps enforce predictable response length and prevents the model from producing irrelevant continuations.
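For instance, a stop sequence can be passed directly to a chat-completions call. The sketch below uses the OpenAI Python SDK with an illustrative model name and a made-up delimiter string; generation halts as soon as the model would emit that string.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; substitute your own
    messages=[{"role": "user",
               "content": "List three fruits, then write END_OF_ANSWER."}],
    stop=["END_OF_ANSWER"],  # output is cut off before this string appears
    max_tokens=100,
)
print(response.choices[0].message.content)  # the stop string itself is not returned
```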
Seed and determinism
Some APIs allow you to specify a random seed. With the same prompt, parameter settings, and seed, the model will generally produce the same output, as long as other conditions remain constant.
- Useful for testing and evaluating parameters effectively.
- Important for comparing different parameter settings without introducing randomness into the results.
This approach supports reproducibility in experiments but may not guarantee exact outputs across different backends or AI models.
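A sketch of a seeded call with the OpenAI Python SDK (the `seed` parameter is supported there, though determinism is described as best-effort; the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()

def run(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        seed=42,               # same seed + same settings -> usually the same output
    )
    return response.choices[0].message.content

# With identical prompts, settings, and seed, the two outputs should usually match.
print(run("Suggest a tagline for a coffee shop.") == run("Suggest a tagline for a coffee shop."))
```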
Differences between key parameters
Different LLM parameters influence the behavior of large language models in various ways. Understanding their differences is crucial for utilizing parameters effectively to strike a balance between model performance, output quality, and computational resources.
- Temperature parameter vs Top p sampling (nucleus sampling): Temperature adjusts the shape of the distribution, while top p sampling sets a cutoff point for candidate tokens. They can be combined, but altering one does not have the same effect as the other.
- Frequency penalty vs presence penalty: Frequency penalty scales with the number of tokens repeated, while presence penalty is a binary check on whether a token has appeared. Both are useful in adjusting LLM parameters for different outcomes, but their effects on the model’s responses diverge.
- Max tokens vs context window: The max tokens parameter limits the generated output directly, whereas the context window is a fixed capability of the model. For example, if a model has a 4,000-token limit, a prompt of 3,000 tokens means the upper limit for output is 1,000 tokens, regardless of the maximum tokens setting.