New LLMs, such as OpenAI’s GPT-5 family, come in different versions (e.g., GPT-5, GPT-5-mini, and GPT-5-nano) and with various parameter settings, including high, medium, low, and minimal.
Below, we explore the differences between these model versions by comparing their benchmark performance and the cost of running those benchmarks.
Price vs. Success: Key Takeaways
We used the GPT-5 family in our analysis, as it is the newest model family from OpenAI, and evaluated it on six benchmarks covering reasoning, coding, instruction-following, and math.
Our analysis revealed:
- On average across benchmarks, GPT-5 (high) and GPT-5 (medium) deliver nearly identical success rates (65% vs. 64%), yet GPT-5 (high) costs almost twice as much ($511 vs. $280). They are followed by GPT-5-mini (high), GPT-5 (low), and GPT-5-mini (medium), with success rates of 62%, 61%, and 60%, respectively, at much lower prices of $105, $90, and $28. In other words, accepting only a ~5-percentage-point drop in success rate cuts costs by roughly 18x when switching from GPT-5 (high) to GPT-5-mini (medium).
- GPT-5-mini (high) outperforms GPT-5 (low) in nearly every benchmark, and does so at the same or lower cost. In IFBench, success rates are 75% vs. 67%; in AIME 2025, 97% vs. 83%; in Humanity’s Last Exam, 20% vs. 18%; and in GPQA Diamond, 83% vs. 81%. They tie on SciCode at 39%, yet GPT-5-mini (high) still comes in at a lower cost.
- The most expensive model, GPT-5 (high), outperforms the second-best performer on only three benchmarks, and even then, the margin is no greater than 3%. In all other benchmarks, it is outperformed by cheaper alternatives.
High-medium-low-minimal parameter settings
Although LLM parameters are often described in terms of numerical adjustments, they may also be expressed as qualitative ranges such as high, medium, and low. These ranges are not fixed standards; instead, they are conceptual categories that describe how much influence a parameter exerts on the model’s output.
Using these qualitative levels helps in quickly selecting settings for different tasks, depending on the desired level of creativity, determinism, or length. The levels are especially useful when adjusting top-p, max tokens, and penalty parameters.
The medium setting corresponds to the model's default behavior, i.e., what you get without adjusting any parameters.
Minimal setting:
- Top-p / Top-k: Very low (top-p ≈ 0.1–0.2, top-k = 1–5)
- Max tokens: Short limit
- Penalties: Very low or none
- Effects:
  - Highly deterministic, almost identical outputs each time.
  - Very concise, factual, and rigid.
  - Best for code, math, database queries, or strict compliance answers.
  - Very constrained, with low randomness, favoring predictability and precision.
Low setting:
- Top-p / Top-k: Low (top-p ≈ 0.3–0.5, top-k = 5–10)
- Max tokens: Short to medium
- Penalties: Low to moderate
- Effects:
  - Mostly deterministic but allows minor variations.
  - Reduces robotic repetition compared to minimal.
  - Suitable for summaries, structured explanations, or professional writing with a consistent style.
Medium setting:
- Top-p / Top-k: Moderate (top-p ≈ 0.7–0.9, top-k = 20–50)
- Max tokens: Medium length
- Penalties: Moderate, to avoid repetition but allow some creativity
- Effects:
  - Balanced between accuracy and creativity.
  - Produces natural responses that vary slightly across runs.
  - Suitable for general Q&A, drafting, and brainstorming.
High setting:
- Top-p / Top-k: High (top-p ≈ 0.95–1.0, top-k = 50–100)
- Max tokens: Large limit for longer outputs
- Penalties: Medium to high, encouraging variety and novelty
- Effects:
  - Highly creative and diverse outputs.
  - Less predictable, with a greater risk of hallucinations.
  - Best for storytelling, ideation, roleplay, and creative writing.
To decide which level to use, consider:
- Task type/purpose: If you need accuracy (legal, medical, code, factual), choose minimal or medium. If you need creativity, voice, or novelty, high might be a better fit.
- Tolerance for errors: How costly are occasional quirks or mistakes? If your tolerance is low, avoid high randomness.
- Computational constraints: High output lengths and high randomness often require more compute and memory.
- Model size: Larger models tend to cope better with high randomness, while smaller models may degrade significantly under high settings.
- Desired output length: Longer generated text can drift, so high randomness plus long length is riskier.
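As a rough illustration, the qualitative levels above can be captured as parameter presets. The sketch below uses the OpenAI Python SDK's Chat Completions interface; the numeric values are illustrative assumptions drawn from the ranges listed above, not official presets, and top-k is omitted because the OpenAI API does not expose it. Reasoning-focused models may reject some sampling parameters, so the example assumes a model that accepts them (here, gpt-4o-mini as a placeholder).

```python
from openai import OpenAI

# Illustrative presets derived from the qualitative levels above.
# The numeric values are assumptions based on the listed ranges, not official settings.
PRESETS = {
    "minimal": {"top_p": 0.15, "max_tokens": 256,  "frequency_penalty": 0.0, "presence_penalty": 0.0},
    "low":     {"top_p": 0.4,  "max_tokens": 512,  "frequency_penalty": 0.3, "presence_penalty": 0.2},
    "medium":  {"top_p": 0.8,  "max_tokens": 1024, "frequency_penalty": 0.5, "presence_penalty": 0.4},
    "high":    {"top_p": 0.97, "max_tokens": 2048, "frequency_penalty": 0.8, "presence_penalty": 0.6},
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate(prompt: str, level: str = "medium", model: str = "gpt-4o-mini") -> str:
    """Call the Chat Completions API with the chosen qualitative preset."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **PRESETS[level],
    )
    return response.choices[0].message.content


print(generate("Write a tagline for a travel blog.", level="high"))
```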
GPT-5
GPT-5 is the flagship model designed for coding, reasoning, and agentic tasks across domains. It balances higher reasoning ability with medium speed, making it suitable for complex, multi-step tasks where accuracy and adaptability are crucial.
- Context window: 400,000
- Max output tokens: 128,000
- Knowledge cutoff: September 30, 2024
- Reasoning: Higher, with reasoning token support
Pricing (per 1M tokens)
- Input: $1.25
- Cached input: $0.125
- Output: $10.00
Modalities
- Text: input and output
- Image: input only
- Audio: not supported
GPT-5 Mini
GPT-5 Mini is a smaller, faster, and more affordable version of GPT-5. It keeps a strong reasoning ability while being better suited for well-defined tasks.
- Context window: 400,000
- Max output tokens: 128,000
- Knowledge cutoff: May 31, 2024
- Features: Supports web search, file search, and code interpreter.
Pricing per 1M tokens:
- Input: $0.25
- Cached input: $0.025
- Output: $2.00
GPT-5 Nano
GPT-5 Nano is the fastest and cheapest option, designed for lightweight tasks such as classification and summarization.
- Context window: 400,000
- Max output tokens: 128,000
- Knowledge cutoff: May 31, 2024
- Features: Supports file search, image generation, and code interpreter (but not web search).
Pricing per 1M tokens:
- Input: $0.05
- Cached input: $0.005
- Output: $0.40
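To make the price gaps concrete, the sketch below estimates the cost of a single request for each model using the per-1M-token prices listed above. Cached-input discounts are ignored, and the token counts are made-up examples.

```python
# Per-1M-token prices from the listings above (input, output), in USD.
PRICES = {
    "gpt-5":      {"input": 1.25, "output": 10.00},
    "gpt-5-mini": {"input": 0.25, "output": 2.00},
    "gpt-5-nano": {"input": 0.05, "output": 0.40},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request, ignoring cached-input discounts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Example: a 2,000-token prompt with a 1,000-token answer.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 1_000):.4f}")
```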
GPT-5 series features
The GPT-5 series introduces several capabilities that improve control, formatting, and efficiency. These features apply to GPT-5, GPT-5 Mini, and GPT-5 Nano.
Verbosity parameter
The verbosity parameter allows developers to influence the level of detail in model outputs without modifying the prompt.
It accepts three values:
- Low: short and concise results
- Medium: balanced results (default)
- High: detailed outputs suitable for explanation, documentation, or review
Higher verbosity leads to longer responses and higher use of output tokens.
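A minimal sketch of setting verbosity through the Responses API in the OpenAI Python SDK is shown below. The text={"verbosity": ...} field reflects the GPT-5 launch documentation; verify the exact field name against the current API reference.

```python
from openai import OpenAI

client = OpenAI()

# Ask for a concise answer without rewriting the prompt itself.
response = client.responses.create(
    model="gpt-5",
    input="Explain what a context window is.",
    text={"verbosity": "low"},  # "low", "medium" (default), or "high"
)
print(response.output_text)
```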
Free-form function calling
The GPT-5 series supports custom tool calls that accept raw text output instead of structured JSON. This makes it possible to generate code, SQL queries, or configuration text that is passed directly into external runtimes such as:
- Code sandboxes
- SQL engines
- Shell environments
- Configuration systems
The custom tool type does not support parallel tool calls. It is intended for situations where natural text is preferable to a strict JSON schema.
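The sketch below registers a custom tool that receives raw SQL text rather than JSON arguments. It follows the custom-tool shape described at GPT-5's launch; the type: "custom" field and the "custom_tool_call" output item name should be checked against the current Responses API reference, and sql_exec is a hypothetical tool name.

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    input="Find the ten most recent orders in the orders table.",
    tools=[
        {
            "type": "custom",    # free-form tool: the model sends plain text, not JSON
            "name": "sql_exec",  # hypothetical tool that runs SQL against our database
            "description": "Executes a single read-only SQL query and returns the rows.",
        }
    ],
)

# The tool call's input is raw SQL text that can be passed straight to a SQL engine.
for item in response.output:
    if item.type == "custom_tool_call":
        print(item.input)
```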
Context-free grammar (CFG) support
Models can produce text constrained by a grammar defined with Lark or regex syntax. This ensures that the generated text follows strict structural rules. Common use cases include:
- Enforcing specific SQL dialects
- Restricting timestamps or identifiers
- Validating configuration formats
When using CFGs, developers define terminals and rules that describe the set of acceptable strings. The model produces only outputs that match these rules.
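Below is a hedged sketch of constraining a custom tool's output with a Lark grammar, following the grammar-format option described for GPT-5 custom tools. The exact shape of the format block may differ in the current API, constrained_sql is a hypothetical tool name, and the grammar itself is only a toy example.

```python
from openai import OpenAI

client = OpenAI()

# Toy Lark grammar: the model may only emit "SELECT <column> FROM <table>;"
sql_grammar = r"""
start: "SELECT " COLUMN " FROM " TABLE ";"
COLUMN: /[a-z_]+/
TABLE: /[a-z_]+/
"""

response = client.responses.create(
    model="gpt-5",
    input="Write a query that returns the email column from the users table.",
    tools=[
        {
            "type": "custom",
            "name": "constrained_sql",  # hypothetical tool name
            "description": "Emits a single SQL statement matching the grammar.",
            "format": {                 # grammar-constrained output (shape per launch docs)
                "type": "grammar",
                "syntax": "lark",
                "definition": sql_grammar,
            },
        }
    ],
)
```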
Minimal reasoning mode
Minimal reasoning mode reduces or removes reasoning tokens, which lowers latency and improves time-to-first-token.
It is suitable for tasks such as:
- Classification
- Short rewrites
- Structured extraction
- Basic formatting operations
When no reasoning setting is provided, the default effort level is medium.
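A sketch of requesting minimal reasoning effort via the Responses API follows; reasoning={"effort": "minimal"} matches the GPT-5 documentation, with "minimal", "low", "medium" (default), and "high" as the accepted values.

```python
from openai import OpenAI

client = OpenAI()

# Simple classification task: minimal reasoning keeps latency and token usage low.
response = client.responses.create(
    model="gpt-5-nano",
    input="Classify the sentiment of this review as positive, negative, or neutral: "
          "'The battery died after two days.'",
    reasoning={"effort": "minimal"},
)
print(response.output_text)
```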
Key differences
The three models differ primarily in reasoning depth, speed, and cost. The new features can be used across all models, but their impact varies depending on the model’s design.
Reasoning
- GPT-5 provides the strongest reasoning capability. It is appropriate for complex, multi-step problems in coding, scientific analysis, or decision support.
- GPT-5 Mini offers strong reasoning for structured prompts with predictable task boundaries.
- GPT-5 Nano has moderate reasoning performance and works best on tasks that do not require deep analysis.
- Minimal reasoning mode can be used with all models and provides the most significant benefit for GPT-5 Nano and GPT-5 Mini, given their speed advantage.
Speed
- GPT-5 Nano is the fastest option and is effective for real-time or large-scale workloads.
- GPT-5 Mini balances speed with reasoning, making it suitable for regular production workloads.
- GPT-5 is slower because it performs more internal reasoning, but this results in more precise output.
- Minimal reasoning mode can further reduce latency, particularly for Nano.
Cost
- GPT-5 Nano has the lowest cost per token. It is preferred for high-volume tasks such as batch classification or summarization.
- GPT-5 Mini sits in the mid-range, offering a balance between capability and cost.
- GPT-5 is the most expensive model and is typically used when accuracy and consistency take priority.
- Verbosity settings influence cost because higher verbosity produces more output tokens.
What are LLM parameters?
LLM parameters are settings that control how large language models (LLMs) generate text during inference. They do not alter the learned weights of a pre-trained model, which are determined during training on large amounts of data. Instead, they shape how the model selects the next token from the probability distribution it predicts.
By adjusting LLM parameters, users can influence factors such as creativity, determinism, response length, repetitiveness, and coherence. This is crucial for tailoring a model’s behavior to specific tasks, whether generating creative content, producing structured outputs, or performing technical tasks such as code generation.
Key parameters include top-p sampling, maximum tokens, frequency penalty, presence penalty, and stop sequences. Each affects the model’s output differently and must be used carefully to balance output quality, computational resources, and model performance.
Top-p sampling
Top-p sampling, also known as nucleus sampling, limits token selection to the smallest set of tokens whose cumulative probability exceeds a specified threshold (p).
- Lower values (e.g., p = 0.5): The model samples only from the most probable tokens, resulting in coherent responses but reduced diversity.
- Higher values (e.g., p = 0.9): The model generates from a wider pool, supporting the creation of creative content, but risks producing off-topic continuations.
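As a quick illustration, here is the same prompt issued at a low and a high top-p value through the Chat Completions API, assuming a model that exposes the top_p sampling parameter:

```python
from openai import OpenAI

client = OpenAI()

prompt = "Suggest a name for a hiking app."

for top_p in (0.5, 0.9):
    # Lower top-p samples from a narrower nucleus of high-probability tokens;
    # higher top-p widens the pool and increases run-to-run diversity.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        top_p=top_p,
    )
    print(f"top_p={top_p}: {response.choices[0].message.content}")
```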
Max tokens
The max tokens parameter, sometimes called the token limit, defines the maximum number of tokens the model can produce in its generated output. It directly influences response length and the allocation of computational resources.
- A small number ensures concise outputs but may truncate essential details.
- A larger value allows detailed explanations but requires more computational resources and increases costs.
The token limit is also bounded by the context window, which includes both the prompt and the generated text. Exceeding this upper limit is not possible, regardless of the max tokens setting.
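The sketch below caps the generated output and checks whether the model stopped because it hit the cap. The max_tokens parameter and the "length" finish reason are standard Chat Completions fields; note that some newer models expect max_completion_tokens instead.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the causes of the French Revolution."}],
    max_tokens=150,  # hard cap on generated tokens; the prompt still counts against the context window
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # The cap truncated the answer; retry with a larger limit or request a shorter summary.
    print("Output was cut off at the max_tokens limit.")
print(choice.message.content)
```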
Frequency penalty parameter
The frequency penalty parameter adjusts the likelihood of repeated tokens based on their frequency of occurrence in the output.
- Positive values: Reduce repetition, leading to more diverse text.
- Negative values: Encourage the reuse of tokens, which can be beneficial when a term must appear multiple times in a document.
- Excessively high penalties can harm coherence, as natural repetition is often necessary in human-like writing.
This parameter is particularly useful for avoiding redundancy in long texts.
Presence penalty
The presence penalty reduces the probability of tokens that have appeared even once in the text. Unlike the frequency penalty, it does not consider the number of occurrences but instead enforces novelty.
- Positive values: Promote exploration of new topics, helpful for creative writing or brainstorming sessions.
- Negative values: Can push the model’s behavior toward reinforcing certain words, which may be helpful in specific tasks like structured dialogues.
The presence penalty helps ensure the model generates diverse ideas, but it must be balanced to avoid unnatural avoidance of necessary terms.
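The sketch below contrasts the two penalties on a task that invites repetition. Both frequency_penalty and presence_penalty are Chat Completions parameters that accept values roughly between -2.0 and 2.0.

```python
from openai import OpenAI

client = OpenAI()

prompt = "Write a short product description for a reusable water bottle."

settings = {
    "no penalties":    {"frequency_penalty": 0.0, "presence_penalty": 0.0},
    "less repetition": {"frequency_penalty": 0.8, "presence_penalty": 0.0},  # scales with repeat count
    "more new topics": {"frequency_penalty": 0.0, "presence_penalty": 0.8},  # one-time penalty per seen token
}

for label, params in settings.items():
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        **params,
    )
    print(f"--- {label} ---\n{response.choices[0].message.content}\n")
```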
Stop sequences
Stop sequences define explicit strings or tokens that cause the model's output to halt, giving the user a clear stopping point in the generated text.
- Common in structured applications such as code generation, dialogue systems, or when aligning output to a template.
- Helps enforce predictable response length and prevents the model from producing irrelevant continuations.
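A small sketch: the stop sequence below cuts generation off when the model starts a second list item, which is a simple way to force exactly one item per call.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "List project risks as numbered items, starting with '1.'"}],
    stop=["2."],  # generation halts before the stop string is emitted
)

# The output contains only the first item; the stop sequence itself is not included.
print(response.choices[0].message.content)
```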
Seed and determinism
Some APIs allow you to specify a random seed. With the same prompt and parameter settings, this typically yields the same generated output, as long as other conditions remain constant.
- Useful for testing and evaluating parameters effectively.
- Important for comparing different parameter settings without introducing randomness into the results.
This approach supports reproducibility in experiments but may not guarantee exact outputs across different backends or AI models.
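The sketch below requests a fixed seed; in the OpenAI Chat Completions API this is best-effort determinism, and the returned system_fingerprint can be compared across calls to detect backend changes that would explain differing outputs.

```python
from openai import OpenAI

client = OpenAI()


def run(seed: int) -> tuple[str, str | None]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Give me three taglines for a coffee shop."}],
        seed=seed,        # same seed + same parameters -> same output, on a best-effort basis
        temperature=0.7,
    )
    return response.choices[0].message.content, response.system_fingerprint


a, fp_a = run(42)
b, fp_b = run(42)
print("Identical outputs:", a == b, "| same backend:", fp_a == fp_b)
```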
Differences between key parameters
Different LLM parameters influence the behavior of large language models in various ways. Understanding their differences is crucial for utilizing parameters effectively to strike a balance between model performance, output quality, and computational resources.
- Frequency penalty vs presence penalty: Frequency penalty scales with the number of tokens repeated, while presence penalty is a binary check on whether a token has appeared. Both are useful in adjusting LLM parameters for different outcomes, but their effects on the model’s responses diverge.
- Max tokens vs context window: The max tokens parameter limits the generated output directly, whereas the context window is a fixed capability of the model. For example, if a model has a 4,000-token context window, a 3,000-token prompt leaves an upper limit of 1,000 tokens for the output, regardless of the max tokens setting.