
LLM Parameters: GPT-5 High, Medium, Low and Minimal

Sıla Ermut
updated on Dec 26, 2025

New LLMs, such as OpenAI’s GPT-5 family, come in different versions (e.g., GPT-5, GPT-5-mini, and GPT-5-nano) and with various parameter settings, including high, medium, low, and minimal.

Below, we explore the differences between these model versions by gathering their benchmark performance and the costs to run the benchmarks.

Price vs. Success: Key Takeaways

We used the GPT-5 family in our analysis, as it is OpenAI's newest model line, and evaluated it on six benchmarks spanning reasoning, coding, instruction-following, and math.


Our analysis revealed:

  • On average across benchmarks, GPT-5 (high) and GPT-5 (medium) deliver nearly identical success rates (65% vs. 64%), yet GPT-5 (high) costs almost twice as much ($511 vs. $280). They are followed by GPT-5-mini (high), GPT-5 (low), and GPT-5-mini (medium), with success rates of 62%, 61%, and 60%, respectively, at much lower prices of $105, $90, and $28. In other words, accepting a roughly 5-percentage-point drop in success rate by switching from GPT-5 (high) to GPT-5-mini (medium) cuts the cost of running these tasks by up to 18 times.
  • GPT-5-mini (high) outperforms GPT-5 (low) in nearly every benchmark, and does so at the same or lower cost. In IFBench, success rates are 75% vs. 67%; in AIME 2025, 97% vs. 83%; in Humanity’s Last Exam, 20% vs. 18%; and in GPQA Diamond, 83% vs. 81%. They tie on SciCode at 39%, yet GPT-5-mini (high) still comes in at a lower cost.
  • The most expensive model, GPT-5 (high), outperforms the second-best performer on only three benchmarks, and even then, the margin is no greater than 3%. In all other benchmarks, it is outperformed by cheaper alternatives.

High-medium-low-minimal parameter settings

Although LLM parameters are often described in terms of numerical adjustments, they may also be expressed as qualitative ranges such as high, medium, low, and minimal. These ranges are not fixed standards; instead, they are conceptual categories that describe how much influence a parameter exerts on the model's output.

Using these qualitative levels makes it easier to choose settings quickly for different tasks, depending on the desired level of creativity, determinism, or length. They are particularly useful when adjusting top-p, max tokens, and the penalty parameters.

The medium setting typically corresponds to the model's default behavior, that is, the output produced without adjusting any parameters.

Minimal setting:

  • Top-p / Top-k: Very low (top-p ≈ 0.1–0.2, top-k = 1–5)
  • Max tokens: Short limit
  • Penalties: Very low or none
  • Effects:
    • Highly deterministic, almost identical outputs each time.
    • Very concise, factual, and rigid.
    • Best for code, math, database queries, or strict compliance answers.
    • Very constrained, with low randomness, favoring predictability and precision.

Low setting:

  • Top-p / Top-k: Low (top-p ≈ 0.3–0.5, top-k = 5–10)
  • Max tokens: Short to medium
  • Penalties: Low to moderate
  • Effects:
    • Mostly deterministic but allows minor variations.
    • Reduces robotic repetition compared to minimal.
    • Suitable for summaries, structured explanations, or professional writing with a consistent style.

Medium setting:

  • Top-p / Top-k: Moderate (top-p ≈ 0.7–0.9, top-k = 20–50)
  • Max tokens: Medium length
  • Penalties: Moderate, to avoid repetition but allow some creativity
  • Effects:
    • Balanced between accuracy and creativity.
    • Produces natural responses that vary slightly across runs.
    • Suitable for general Q&A, drafting, and brainstorming.

High setting:

  • Top-p / Top-k: High (top-p ≈ 0.95–1.0, top-k = 50–100)
  • Max tokens: Large limit for longer outputs
  • Penalties: Medium to high, encouraging variety and novelty
  • Effects:
    • Highly creative and diverse outputs.
    • Less predictable, with a greater risk of hallucinations.
    • Best for storytelling, ideation, roleplay, and creative writing.

To decide which level to use, consider:

  • Task type/purpose: If you need accuracy (legal, medical, code, factual), choose minimal or low. If you need creativity, voice, or novelty, high may be the better choice.
  • Tolerance for errors: How bad are occasional quirks or mistakes? If low, avoid high randomness.
  • Computational constraints: High output lengths and high randomness often require more compute and memory.
  • Model size: Larger models tend to cope better with high randomness, while smaller models may degrade significantly under high settings.
  • Desired output length: Longer generated text can drift, so high randomness plus long length is riskier.
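
As a rough illustration, the sketch below maps these qualitative levels to concrete sampling parameters in a single API call. The numeric values and the LEVELS mapping are our own assumptions based on the ranges above, not official presets, and some reasoning models (including the GPT-5 family) may not accept every sampling parameter.

```python
# Sketch: mapping the qualitative levels above to concrete sampling parameters.
# The numeric values are illustrative assumptions, not official presets, and
# some models (e.g., reasoning models) reject parameters such as top_p or the
# penalties, so check the model's documentation before relying on them.
from openai import OpenAI

LEVELS = {
    "minimal": {"top_p": 0.15, "max_tokens": 256,  "frequency_penalty": 0.0, "presence_penalty": 0.0},
    "low":     {"top_p": 0.40, "max_tokens": 512,  "frequency_penalty": 0.3, "presence_penalty": 0.2},
    "medium":  {"top_p": 0.80, "max_tokens": 1024, "frequency_penalty": 0.5, "presence_penalty": 0.4},
    "high":    {"top_p": 1.00, "max_tokens": 4096, "frequency_penalty": 0.8, "presence_penalty": 0.8},
}

def generate(prompt: str, level: str = "medium", model: str = "gpt-4o-mini") -> str:
    """Send a prompt with the sampling settings for the chosen level (model is an example)."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **LEVELS[level],
    )
    return response.choices[0].message.content

print(generate("Summarize the benefits of unit testing.", level="low"))
```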

GPT-5

GPT-5 is the flagship model designed for coding, reasoning, and agentic tasks across domains. It balances higher reasoning ability with medium speed, making it suitable for complex, multi-step tasks where accuracy and adaptability are crucial.

  • Context window: 400,000
  • Max output tokens: 128,000
  • Knowledge cutoff: September 30, 2024
  • Reasoning: Higher, with reasoning token support

Pricing (per 1M tokens)

  • Input: $1.25
  • Cached input: $0.125
  • Output: $10.00

Modalities

  • Text: input and output
  • Image: input only
  • Audio: not supported

GPT-5 Mini

GPT-5 Mini is a smaller, faster, and more affordable version of GPT-5. It retains strong reasoning ability and is best suited to well-defined tasks.

  • Context window: 400,000
  • Max output tokens: 128,000
  • Knowledge cutoff: May 31, 2024
  • Features: Supports web search, file search, and code interpreter.

Pricing (per 1M tokens)

  • Input: $0.25
  • Cached input: $0.025
  • Output: $2.00

GPT-5 Nano

GPT-5 Nano is the fastest and cheapest option, designed for lightweight tasks such as classification and summarization.

  • Context window: 400,000
  • Max output tokens: 128,000
  • Knowledge cutoff: May 31, 2024
  • Features: Supports file search, image generation, and code interpreter (but not web search).

Pricing (per 1M tokens)

  • Input: $0.05
  • Cached input: $0.005
  • Output: $0.40
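
To put these rates in perspective, the short sketch below estimates the cost of a single request from the per-1M-token prices listed above; the token counts are made-up example values.

```python
# Worked example: estimating per-request cost from the prices listed above
# (USD per 1M tokens). The token counts are illustrative.
PRICES = {
    "gpt-5":      {"input": 1.25, "cached_input": 0.125, "output": 10.00},
    "gpt-5-mini": {"input": 0.25, "cached_input": 0.025, "output": 2.00},
    "gpt-5-nano": {"input": 0.05, "cached_input": 0.005, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Cost in USD for one request, counting cached input tokens at the cached rate."""
    p = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * p["input"] + cached_tokens * p["cached_input"] + output_tokens * p["output"]) / 1_000_000

# Example: 10,000 input tokens (2,000 of them cached) and 1,500 output tokens.
for model in PRICES:
    print(model, f"${request_cost(model, 10_000, 1_500, cached_tokens=2_000):.4f}")
```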

GPT-5 series features

The GPT-5 series introduces several capabilities that improve control, formatting, and efficiency. These features apply to GPT-5, GPT-5 Mini, and GPT-5 Nano.

Verbosity parameter

The verbosity parameter allows developers to influence the level of detail in model outputs without modifying the prompt.
It accepts three values:

  • Low: short and concise results
  • Medium: balanced results (default)
  • High: detailed outputs suitable for explanation, documentation or review

Higher verbosity produces longer responses and therefore consumes more output tokens.
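
A minimal sketch of setting verbosity through the OpenAI Responses API is shown below; treat the exact request shape (verbosity nested under the text field) as something to verify against the current API reference.

```python
# Sketch: requesting a concise answer via the verbosity setting.
# Assumes the Responses API accepts verbosity under the "text" field.
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-5",
    input="Explain what a context window is.",
    text={"verbosity": "low"},  # "low", "medium" (default), or "high"
)
print(response.output_text)
```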

Free-form function calling

The GPT-5 series supports custom tool calls that accept raw text output instead of structured JSON. This makes it possible to generate code, SQL queries or configuration text that is passed directly into external runtimes such as:

  • Code sandboxes
  • SQL engines
  • Shell environments
  • Configuration systems

The custom tool type does not support parallel tool calls. It is intended for situations where natural text is preferable to a strict JSON schema.
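
A hedged sketch of a free-form tool definition follows; the tool name run_sql and its description are hypothetical, and the exact custom-tool schema and response fields should be checked against the API reference.

```python
# Sketch: a free-form ("custom") tool that receives raw text instead of JSON
# arguments. The tool name and description are hypothetical examples.
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-5",
    input="Write a query that counts orders per customer.",
    tools=[{
        "type": "custom",
        "name": "run_sql",
        "description": "Executes a raw SQL query against the analytics database.",
    }],
)

# If the model chose to call the tool, the raw text payload (here, SQL) is
# carried on the tool-call item rather than as structured JSON arguments.
for item in response.output:
    if item.type == "custom_tool_call":
        print(item.input)
```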

Context-free grammar (CFG) support

Models can produce text constrained by a grammar defined with Lark or regex syntax. This ensures that the generated text follows strict structural rules. Common use cases include:

  • Enforcing specific SQL dialects
  • Restricting timestamps or identifiers
  • Validating configuration formats

When using CFGs, developers define terminals and rules that describe the set of acceptable strings. The model produces only outputs that match these rules.
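
For illustration, here is a small Lark-syntax grammar that admits only timestamps such as 2025-08-07 14:30:00, together with a hypothetical way of attaching it to a custom tool; the exact field names (format, syntax, definition) are assumptions to verify against the API reference.

```python
# Sketch: a Lark-syntax grammar that accepts only timestamps like "2025-08-07 14:30:00".
timestamp_grammar = r"""
start: DATE " " TIME
DATE: /[0-9]{4}-[0-9]{2}-[0-9]{2}/
TIME: /[0-9]{2}:[0-9]{2}:[0-9]{2}/
"""

# Hypothetical attachment of the grammar to a custom tool; the field names
# ("format", "syntax", "definition") are assumptions, so verify them against
# the current API documentation before use.
grammar_tool = {
    "type": "custom",
    "name": "emit_timestamp",
    "description": "Returns a single timestamp.",
    "format": {"type": "grammar", "syntax": "lark", "definition": timestamp_grammar},
}
```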

Minimal reasoning mode

Minimal reasoning mode reduces or removes reasoning tokens. This reduces latency and improves time-to-first-token.
It is suitable for tasks such as:

  • Classification
  • Short rewrites
  • Structured extraction
  • Basic formatting operations

When no reasoning setting is provided, the default effort level is medium.
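
A minimal sketch of selecting the reasoning effort via the Responses API is shown below; "minimal" is one of the documented effort levels for the GPT-5 series.

```python
# Sketch: requesting minimal reasoning effort for a low-latency classification task.
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-5-nano",
    input="Classify the sentiment of: 'The update fixed every bug I reported.'",
    reasoning={"effort": "minimal"},  # other values: "low", "medium" (default), "high"
)
print(response.output_text)
```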

Key differences

The three models differ primarily in reasoning depth, speed, and cost. The new features can be used across all models, but their impact varies by model.

Reasoning

  • GPT-5 provides the strongest reasoning capability. It is appropriate for complex, multi-step problems in coding, scientific analysis, or decision support.
  • GPT-5 Mini offers strong reasoning for structured prompts with predictable task boundaries.
  • GPT-5 Nano has moderate reasoning performance and works best on tasks that do not require deep analysis.
  • Minimal reasoning mode can be used with all models and provides the most significant benefit for GPT-5 Nano and GPT-5 Mini, given their speed advantage.

Speed

  • GPT-5 Nano is the fastest option and is effective for real-time or large-scale workloads.
  • GPT-5 Mini balances speed with reasoning, making it suitable for regular production workloads.
  • GPT-5 is slower because it performs more internal reasoning, but this results in more precise output.
  • Minimal reasoning mode can further reduce latency, particularly for Nano.

Cost

  • GPT-5 Nano has the lowest cost per token. It is preferred for high-volume tasks such as batch classification or summarization.
  • GPT-5 Mini sits in the mid-range, offering a balance between capability and cost.
  • GPT-5 is the most expensive model and is typically used when accuracy and consistency take priority.
  • Verbosity settings influence cost because higher verbosity produces more output tokens.

What are LLM parameters?

LLM parameters are settings that influence how large language models (LLMs) generate text during inference. These controls do not modify the learned weights of a pre-trained model. Instead, they shape how the model samples from its probability distribution over candidate next tokens when generating responses.

Large language models are neural network systems, typically built on the transformer model architecture. During training, the model learns numerical values called weights and biases. Weights represent the importance assigned to different inputs, allowing the model to capture relationships between words, concepts, and context. Biases are constant values added within layers that help activate neurons under certain conditions. Together, these values define the model’s ability to recognize complex patterns in language.

Inference parameters, by contrast, operate after training. They shape how the model’s learned knowledge is used, without changing the underlying weights. Adjusting LLM parameters allows users to influence output diversity, predictability, repetition, and output length, which is essential for optimizing model performance across specific tasks such as creative writing, structured generation, or technical explanations.

Key parameters include top-p (nucleus) sampling, top-k sampling, max tokens, frequency penalty, presence penalty, and stop sequences. Together, these sampling parameters control the generated output while balancing output quality, computational cost, and inference efficiency.

Model size, parameters, and training fundamentals

The number of parameters in large language models can reach into the billions. Larger models typically have a stronger ability to handle nuanced language, long-range dependencies, and complex reasoning. This improved model performance comes at the cost of higher computational power requirements during both training and inference.

Smaller models require fewer computational resources and offer better computational efficiency, but they may struggle with more complex patterns or longer context windows. Choosing between larger models and smaller models depends on the task, acceptable latency, and available infrastructure. See LLM scaling laws to learn how AI researchers evaluate the effect of model size, data quality, and training strategy.

Several training parameters shape how a model learns before inference:

  • Batch size refers to the number of training samples processed before the model updates its weights. Larger batch sizes improve training efficiency but increase memory usage.
  • Learning rate controls how quickly the model adjusts its weights and biases. Higher values speed up learning but risk instability, while lower values promote steady convergence.
  • Hyperparameters define external settings such as model size, batch size, and learning rate, shaping the overall training process.

After pre-training, fine-tuning and alignment are essential. Fine-tuning adapts a pre-trained model to domain-specific data or tasks, while alignment ensures the generated text reflects human intent.

Parameter-efficient fine-tuning (PEFT) improves computational efficiency by freezing most parameters and updating only a small subset of task-relevant parameters.

Top-p sampling

Top-p sampling, also known as nucleus sampling, limits token selection to the smallest group whose cumulative probability exceeds a given threshold p. Instead of selecting from a fixed number of tokens, the model dynamically chooses from probable tokens that together account for the specified probability mass.

  • Lower values (for example, p = 0.5) restrict sampling to a narrow set of the highest probability tokens, resulting in coherent but less varied text.
  • Higher values (for example, p = 0.9) allow sampling from a broader pool, increasing output diversity but also the risk of drifting off-topic.

Top-k sampling

Top-k sampling restricts the model's choice to the k highest-probability tokens for the next step in text generation. By narrowing the candidate set, this parameter directly affects predictability and variety.

  • Lower top-k values limit selection to a small set of highly probable tokens, producing more predictable and focused outputs.
  • Higher values expand the candidate pool, increasing variability and supporting more diverse language.

While top-p sampling adapts dynamically based on the probability mass, top-k sampling uses a fixed cutoff. The two are often compared during model evaluation to determine optimal settings for specific tasks.
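
The sketch below illustrates the difference with a toy next-token distribution: top-k keeps a fixed number of candidates, while top-p keeps however many candidates are needed to cover the probability mass p. The tokens and probabilities are made up for illustration.

```python
# Sketch: top-k vs. top-p (nucleus) filtering over a toy next-token distribution.
import numpy as np

tokens = np.array(["the", "a", "cat", "dog", "runs", "quantum"])
probs  = np.array([0.40, 0.25, 0.15, 0.10, 0.07, 0.03])  # toy values, sum to 1

def top_k_filter(tokens, probs, k):
    """Keep only the k highest-probability tokens and renormalize."""
    idx = np.argsort(probs)[::-1][:k]
    kept = probs[idx]
    return tokens[idx], kept / kept.sum()

def top_p_filter(tokens, probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # number of tokens to keep
    idx = order[:cutoff]
    kept = probs[idx]
    return tokens[idx], kept / kept.sum()

print(top_k_filter(tokens, probs, k=3))    # always exactly 3 candidates
print(top_p_filter(tokens, probs, p=0.75)) # as many candidates as needed to reach 75%
```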

Max tokens (output length)

The max_tokens parameter defines the maximum number of tokens the model can generate in a single response. It directly determines output length and influences computational cost.

  • Lower maximum values enforce concise responses but may cut off important details.
  • Higher values allow more detailed explanations but require more computational resources and increase inference time.

The maximum number of tokens is constrained by the context window, which includes both the input data and the generated output. If the combined number of tokens exceeds the model’s token limit, generation will stop regardless of the max tokens setting.
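
As a simple illustration of that relationship, the sketch below computes how many output tokens a request can actually receive once the input is accounted for; the limits are the GPT-5 series figures quoted earlier, and the token counts are example values.

```python
# Sketch: checking how much output a request can receive.
# Input and output together must fit in the context window, and the output
# can never exceed the model's max output tokens or the requested max_tokens.
CONTEXT_WINDOW = 400_000  # GPT-5 series context window (input + output)
MAX_OUTPUT = 128_000      # GPT-5 series maximum output tokens

def effective_output_budget(input_tokens: int, requested_max_tokens: int) -> int:
    """Largest output the model can actually produce for this request."""
    remaining = CONTEXT_WINDOW - input_tokens
    return max(0, min(requested_max_tokens, MAX_OUTPUT, remaining))

print(effective_output_budget(input_tokens=350_000, requested_max_tokens=128_000))  # 50000
```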

Frequency penalty parameter

The frequency penalty adjusts the probability of tokens based on how often they have already appeared in the generated text.

  • Positive values reduce repetition, improving output quality in longer responses.
  • Negative values encourage reuse, which can be helpful for documents that require consistent terminology.

Excessively high penalties can harm coherence, as natural repetition is often necessary for human-like text. This parameter is most effective when optimizing model performance for long-form text generation.

Presence penalty

The presence penalty reduces the probability of tokens that have appeared at least once, regardless of frequency. This encourages the model to introduce new ideas.

  • Positive values promote novelty and exploration, which are helpful in brainstorming and creative writing.
  • Negative values reinforce existing terms, which may help in structured or constrained outputs.

Presence penalty is a valuable control for guiding idea diversity, but it should be applied carefully to avoid unnatural avoidance of key terms.
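
As a rough mental model, both penalties can be viewed as amounts subtracted from a token's logit before sampling: the frequency penalty scales with how many times the token has already appeared, while the presence penalty is a one-time reduction applied once the token has appeared at all. The sketch below mirrors the commonly documented form of this adjustment; exact implementations vary by provider.

```python
# Sketch: how frequency and presence penalties adjust a token's logit.
# Mirrors the commonly documented form; exact implementations differ by provider.
def penalized_logit(logit: float, count: int,
                    frequency_penalty: float = 0.0,
                    presence_penalty: float = 0.0) -> float:
    """count = how many times this token already appears in the generated text."""
    return logit - count * frequency_penalty - (1.0 if count > 0 else 0.0) * presence_penalty

# A token seen 3 times is pushed down more by the frequency penalty than a
# token seen once; the presence penalty subtracts the same fixed amount from both.
print(penalized_logit(2.0, count=3, frequency_penalty=0.5, presence_penalty=0.4))  # ~0.1
print(penalized_logit(2.0, count=1, frequency_penalty=0.5, presence_penalty=0.4))  # ~1.1
```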

Stop sequences

Stop sequences define specific tokens or strings that signal the model to halt generation. They are commonly used in structured applications.

  • Useful for enforcing templates in dialogue systems or code generation.
  • Help control output length and prevent irrelevant continuations.

Stop sequences improve predictability in generated text outputs without relying solely on token limits.
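
For example, a stop sequence can end generation at a template boundary; the marker string below is arbitrary.

```python
# Sketch: ending generation at a marker using a stop sequence.
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; any chat model that supports `stop`
    messages=[{"role": "user", "content": "List three database indexing tips, then write END."}],
    stop=["END"],  # generation halts before the marker would be emitted
)
print(completion.choices[0].message.content)
```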

Seed and determinism

Some systems allow users to specify a random seed, ensuring that the same input data and parameter settings produce the same generated output.

  • Useful for model evaluation and testing.
  • Helps compare different parameter configurations without random variation affecting the results.

Deterministic generation supports reproducibility, although exact outputs may still vary across different AI models or deployment environments.
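
Where supported, the seed is passed alongside the other settings; in the OpenAI Chat Completions API this is the seed parameter, and reproducibility remains best-effort rather than guaranteed.

```python
# Sketch: requesting best-effort reproducibility with a fixed seed.
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{"role": "user", "content": "Name one advantage of unit tests."}],
    seed=42,        # same seed + same inputs -> similar outputs (best effort)
    temperature=0,  # further reduces run-to-run variation
)
print(completion.choices[0].message.content)
```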

Differences between key parameters

Understanding how key parameters differ helps when adjusting LLM parameters for optimal results.

  • Frequency penalty vs presence penalty: Frequency penalty scales with how often a token appears, while presence penalty applies once after a token first appears.
  • Top-k vs. top-p sampling: Top-k limits selection to a fixed number of tokens, while top-p dynamically selects tokens based on cumulative probability.
  • Max tokens vs context window: Max tokens caps output length, while the context window is a fixed upper bound covering both input and output tokens.

Careful tuning of these parameters allows practitioners to balance output quality, computational efficiency, and LLM performance across applications such as retrieval augmented generation, analytical tasks, and open-ended text generation.

