
LLM Scaling Laws: Analysis from AI Researchers

Sıla Ermut
updated on Nov 26, 2025

Large language models are usually trained as neural language models that predict the next token in natural language text. The term LLM scaling laws refers to empirical regularities that link model performance to the compute, training data, and model parameters used during training.

We gathered perspectives on LLM scaling laws from five academic papers and from three AI labs and researchers, including NVIDIA and the MIT-IBM Watson AI Lab.

Evidence from academic scaling law research

“Scaling Laws for Neural Language Models”, Kaplan & McCandlish, 2020

This paper introduced the first widely cited language model scaling laws. The authors examined transformer-based neural language models and found clear power law relationships between model loss and three quantities. These are model size measured by the number of parameters, dataset size measured by total training tokens, and total training compute measured in floating-point operations.

The paper shows that model loss follows a power law in each of these variables when the other two are not limiting factors, that is, when parameters, data, and compute are scaled in a balanced manner.
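
As a rough illustration of the functional form (a sketch that plugs in approximate exponents and constants reported in the paper, not a reproduction of its exact fits):

# Illustrative sketch of the single-variable power laws from Kaplan et al.
# The constants below are approximate values reported in the paper and are used
# only to show the functional form L(x) = (x_c / x) ** alpha.

def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    # Test loss as a function of non-embedding parameter count,
    # assuming data and compute are not bottlenecks.
    return (n_c / n_params) ** alpha_n

def loss_from_tokens(n_tokens, d_c=5.4e13, alpha_d=0.095):
    # Test loss as a function of dataset size in tokens,
    # assuming model size and compute are not bottlenecks.
    return (d_c / n_tokens) ** alpha_d

# Doubling the parameter count lowers the predicted loss by a constant factor
# of 2 ** (-alpha_n), roughly 0.95, which is the defining property of a power law.
print(loss_from_params(1e9), loss_from_params(2e9))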

The authors also identify a computationally optimal scaling regime where, for a fixed compute budget, there is an optimal combination of model parameters and dataset size that yields the lowest achievable loss. When one variable is underscaled relative to the others, model loss does not follow the expected power-law relationship, and performance gains are limited.

Figure 1: The figure shows how test loss changes with model size under different compute budgets and training step counts, revealing the optimal balance between model size, compute, and training duration for best performance.

This work established the foundation for later research on language model scaling laws. It also demonstrated that model shape and depth have a smaller effect than total parameter count and training tokens when compute is fixed. This insight influenced how later researchers designed training algorithms for large language models.1

“Training Compute-Optimal Large Language Models”, Hoffmann, Borgeaud & Mensch, 2022

This paper reevaluates the earlier laws for neural language models using a large set of controlled experiments. It models loss as a joint function of model parameters and training data size, and finds that many earlier large models were undertrained for their parameter count. When researchers train larger models with insufficient training data, the resulting model quality does not align with predictions from traditional scaling laws.

The authors show that, for a fixed compute budget, optimal performance is achieved when model parameters and training tokens are scaled in roughly equal proportion. This result is widely known as the Chinchilla scaling law. It states that compute-optimal training requires a near-proportional relationship between the number of parameters and the number of training tokens, roughly 20 training tokens per parameter in the paper's fits.
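
As a back-of-the-envelope sketch of what this implies for a given budget (assuming the widely used approximation of about 6·N·D training FLOPs and the roughly 20-tokens-per-parameter ratio; both are simplifications of the paper's fitted results):

# Sketch: compute-optimal model size and token count for a fixed FLOP budget,
# assuming training compute C ~ 6 * N * D and the rough Chinchilla ratio D ~ 20 * N.
# These constants simplify the paper's fitted results and are not exact.

def chinchilla_optimal(flop_budget, tokens_per_param=20.0):
    # C = 6 * N * D with D = tokens_per_param * N gives C = 6 * tokens_per_param * N ** 2.
    n_params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e23)  # e.g., a 1e23 FLOP training budget
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")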

This approach produces smaller models that perform better than larger models trained on limited data. It also supports efficient model selection, as researchers can fit scaling laws to smaller models and predict language model performance for larger configurations before training.

Figure 2: The figure overlays predictions from several methods, all indicating that today’s large models are oversized and should instead be smaller and trained longer.2

“Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws”, Sardana, Portes & Doubov, 2025

This paper extends compute optimal scaling by including inference costs. Instead of optimizing training compute alone, the authors fix a target model loss and minimize the combined compute cost of training and inference over the lifetime of the system.

The analysis shows that the optimal balance between model size and training data can shift substantially when inference-time compute is large. In high-usage environments, smaller models trained on more data can achieve the same performance as larger models at lower total cost. This occurs because parameter count strongly affects inference costs, while training tokens affect only training compute.
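
A minimal sketch of the accounting behind this argument, using the common approximations of about 6·N·D FLOPs for training and 2·N FLOPs per generated token at inference; the two model configurations below are hypothetical examples assumed to reach similar quality, not figures from the paper:

# Sketch: lifetime compute for training plus inference.
# Uses the common approximations of ~6 * N * D training FLOPs and
# ~2 * N FLOPs per inference token; the model configurations are hypothetical.

def lifetime_flops(n_params, train_tokens, inference_tokens):
    training = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * inference_tokens
    return training + inference

inference_volume = 2e12  # assumed lifetime inference demand: 2 trillion tokens

# Hypothetical Chinchilla-style model vs. a smaller model trained on more data.
big = lifetime_flops(70e9, 1.4e12, inference_volume)
small = lifetime_flops(30e9, 4.0e12, inference_volume)

print(f"70B model: {big:.2e} total FLOPs")
print(f"30B model: {small:.2e} total FLOPs")
# With heavy inference demand, the smaller, longer-trained model ends up cheaper overall.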

This approach highlights the importance of inference scaling alongside training compute. It also suggests that efficient model configurations may not follow the same patterns as those predicted by the Chinchilla optimal regime when models are deployed at scale. For organizations that rely heavily on language models, this work provides a framework for selecting smaller models that deliver the same performance at lower inference costs.

Figure 3: The graphs compare the ratios of total cost, parameter count, and training tokens between real-world cost-optimal models and Chinchilla-style models.3

“Sloth: Scaling laws for LLM skills to predict multi-benchmark performance across families”, Polo, Somerstep & Choshen, 2025

Sloth introduces a new approach to modeling scaling laws for large language models by shifting the focus from model loss to benchmark-level performance. Instead of treating tasks separately, Sloth identifies a set of latent skills that capture the performance of language models across different benchmarks. These skills represent general capabilities such as reasoning or knowledge retrieval.

The framework models how each skill scales with the model’s parameters and the training data. Sloth uses simple features, such as the logarithms of model and dataset sizes, to describe how these skills change within a model family. Once fitted, Sloth can predict how larger models in the same family will perform on many benchmarks without training them.
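
A highly simplified sketch of the general idea (not Sloth's actual estimator): factor observed benchmark scores into a few latent skills, model each skill as a linear function of log model size and log training tokens, and use the fitted mapping to predict a larger model's benchmark scores. All numbers below are synthetic.

import numpy as np

# Highly simplified latent-skill sketch (not Sloth's actual model); synthetic data only.
rng = np.random.default_rng(0)
n_models, n_benchmarks, n_skills = 12, 8, 2

log_params = rng.uniform(20, 26, n_models)                 # log parameter counts
log_tokens = rng.uniform(25, 29, n_models)                 # log training-token counts
scores = rng.uniform(0.2, 0.9, (n_models, n_benchmarks))   # synthetic benchmark scores

# Step 1: latent skills via a truncated SVD of the score matrix.
U, S, Vt = np.linalg.svd(scores, full_matrices=False)
skills = U[:, :n_skills] * S[:n_skills]                    # (n_models, n_skills)
loadings = Vt[:n_skills]                                   # (n_skills, n_benchmarks)

# Step 2: each skill as a linear function of the log-size features.
X = np.column_stack([np.ones(n_models), log_params, log_tokens])
coef, _, _, _ = np.linalg.lstsq(X, skills, rcond=None)     # (3, n_skills)

# Step 3: predict benchmark scores for a larger model in the same family.
x_new = np.array([1.0, 26.5, 29.5])                        # hypothetical larger model
predicted_scores = (x_new @ coef) @ loadings
print(predicted_scores)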

By using Sloth’s predictions, organizations can decide where to allocate computational resources and avoid training configurations that are unlikely to achieve the desired performance. This supports more rational planning of training runs under real-world constraints.4

“Densing law of LLMs”, Xiao, Cai & Zhao, 2025

The paper examines how efficiently models use their parameters. It introduces the concept of capability density, defined as the ratio of a model’s effective parameter count to its actual parameter count. Effective parameter count is estimated by fitting scaling laws to existing models and asking how large a reference model would need to be to match current performance.
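
A minimal sketch of the inversion step behind such an estimate, assuming a previously fitted loss-versus-parameters power law for a reference family (the constants borrow the illustrative Kaplan-style values used above; the paper's actual estimation works from downstream benchmark performance rather than loss alone):

# Sketch: capability density = effective parameters / actual parameters.
# Assumes a reference scaling law loss(N) = (n_c / N) ** alpha fitted on an
# earlier model family; the constants are illustrative, not the paper's.

def effective_params(observed_loss, n_c=8.8e13, alpha=0.076):
    # Invert loss = (n_c / N) ** alpha to find the reference-family size N
    # that would be needed to reach the observed loss.
    return n_c / observed_loss ** (1.0 / alpha)

def capability_density(observed_loss, actual_params):
    return effective_params(observed_loss) / actual_params

# Example: a 7B-parameter model that matches the loss a 13B reference model would reach.
reference_loss_13b = (8.8e13 / 13e9) ** 0.076
print(capability_density(reference_loss_13b, actual_params=7e9))  # ~1.86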

The authors observe that the best models at each time point show rising capability density. This means that newer models achieve a given performance with fewer parameters than older models. The trend appears approximately exponential over time.

This observation suggests that progress in large language models is not only about scaling model size but also about improving model architecture, training data quality, and training algorithms. The paper argues that tracking parameter efficiency is essential for understanding future directions in natural language processing and machine learning.

Figure 4: The graph shows estimated capability density for open-source base LLMs across five reasoning and coding benchmarks, with circle size indicating model parameter count, and a trend line suggesting a “densing law” in which peak capability density rises exponentially over time.5

Opinions on LLM scaling laws from major AI labs and researchers

NVIDIA, 2025

NVIDIA presents scaling laws as practical tools for designing and training large language models. It highlights three primary scaling axes:

  • Model size.
  • Dataset size.
  • Compute resources.

According to NVIDIA, scaling any of these factors in the correct regime results in predictable improvements in model quality.

The article also emphasizes the importance of test-time compute. Modern systems spend more compute at inference, using techniques such as extended reasoning sequences. This adds a new dimension to scaling laws, extending beyond the original focus on training tokens and model parameters.
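
As a rough sketch of why test-time compute becomes a scaling axis of its own (using the common approximation of about 2·N FLOPs per generated token; the token counts are assumptions):

# Sketch: inference compute grows with the number of generated tokens, so extended
# reasoning multiplies per-query cost even when the model itself is unchanged.
# Uses the common ~2 * N FLOPs-per-token approximation; token counts are assumptions.

def inference_flops_per_query(n_params, generated_tokens):
    return 2.0 * n_params * generated_tokens

n_params = 70e9
short_answer = inference_flops_per_query(n_params, generated_tokens=200)
long_reasoning = inference_flops_per_query(n_params, generated_tokens=4000)

print(f"Short answer:   {short_answer:.1e} FLOPs")
print(f"Long reasoning: {long_reasoning:.1e} FLOPs ({long_reasoning / short_answer:.0f}x)")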

NVIDIA uses these ideas to explain why demand for compute resources continues to grow, even as models become more parameter-efficient. It suggests that both training and inference will remain significant drivers of compute use in future natural language processing systems.6

Cameron Wolfe, LLM researcher at Netflix, 2025

Cameron Wolfe provides a practitioner-centered summary of the scaling laws of language models. He explains how the original power law relationships from the academic literature apply to current models and how practitioners can use these curves to estimate achievable model performance before training larger models.

Wolfe discusses the roles of model shape and architecture in scaling and notes that, while traditional scaling laws focus on parameter count, practical systems must also consider data quality and training algorithms. The piece highlights concerns about the availability of high-quality data and how these constraints may affect the training of future larger models.

The discussion presents scaling laws as guidance for evaluating existing models and for estimating how model performance may change when training data is expanded or when model parameters are adjusted.7

MIT-IBM Watson AI Lab, 2025

The research from MIT-IBM Watson AI Lab presents a practical method for constructing scaling laws that predict the performance of larger language models using results from smaller models. The researchers compile a broad meta-dataset that includes 485 pretrained models, detailed training metadata, and more than 1 million performance measurements. This dataset is used to test over 1,000 candidate scaling laws and identify patterns that generalize across different model families.

The study outlines clear steps for fitting scaling laws under compute constraints. It recommends defining a compute budget and a target performance, then training a small collection of models at different sizes rather than concentrating the budget on the largest model. Intermediate checkpoints are highlighted as valuable sources of information, while measurements from the earliest training steps are best excluded because they are noisy.
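
A minimal sketch of the kind of fit these guidelines describe, assuming a handful of (parameters, loss) observations from small training runs and a simple saturating power law in model size (the data points below are synthetic, and the functional form is a simplification of the joint laws discussed above):

import numpy as np
from scipy.optimize import curve_fit

# Sketch of the recommended workflow: fit a simple scaling law to losses from several
# small runs, then extrapolate to a larger configuration before committing compute to it.
# The observations are synthetic; a real fit would pool final losses and intermediate
# checkpoints from a handful of small models.

def scaling_law(n_params, E, A, alpha):
    # Irreducible loss E plus a power-law term that shrinks with model size.
    return E + A * n_params ** (-alpha)

n_obs = np.array([1e8, 2e8, 5e8, 1e9, 2e9, 5e9])            # small-model sizes
loss_obs = scaling_law(n_obs, 1.7, 120.0, 0.25)             # synthetic "observed" losses
loss_obs = loss_obs + np.random.default_rng(0).normal(0.0, 0.01, loss_obs.shape)

fitted, _ = curve_fit(scaling_law, n_obs, loss_obs, p0=[1.0, 50.0, 0.2], maxfev=20000)
E, A, alpha = fitted

print(f"fitted law: loss(N) = {E:.2f} + {A:.1f} * N^(-{alpha:.3f})")
print(f"predicted loss at 30B parameters: {scaling_law(30e9, *fitted):.3f}")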

The authors show that when these guidelines are followed, predictions can approach the lower bound set by random-seed variability. Even when predictions are less precise, scaling laws remain useful for comparing training choices and identifying promising configurations.

The work notes that performance varies significantly across model families, which reinforces the importance of using diverse training settings when fitting scaling laws.8

What do leading researchers say about the future of scaling?

Views supporting the continued validity of scaling laws

Across the research landscape, there is consistent evidence that scaling laws hold within the tested regimes. Foundational work shows clear power law relationships between model parameters, training data size, and training compute when models are trained in balanced settings.

Later studies refine this picture by demonstrating that compute optimal training requires aligning model size with the volume of training tokens, and that this alignment improves model performance relative to earlier approaches.

Additional work on multitask evaluation shows that benchmark performance also scales predictably when expressed in terms of a smaller set of latent skills. This reinforces the view that language model scaling laws remain reliable tools for forecasting model performance when dataset size and compute resources are allocated appropriately.

Views emphasizing efficient compute allocation

A second line of research argues that progress increasingly depends on how compute is distributed rather than on expanding parameter count alone. Analyses of compute optimal training show that models require sufficient training data to reach their potential and that larger models trained on limited data are often inefficient.

Work that incorporates inference costs extends this idea by showing that the total cost of a model depends on both training compute and inference-time compute.

This perspective suggests that future scaling efforts will emphasize efficient configurations that jointly optimize model size, training tokens, and expected inference volume. It frames the design of large language models as an exercise in compute allocation rather than as a pursuit of maximal parameter growth.

Views emphasizing the growing importance of efficiency and density

Another viewpoint focuses on parameter efficiency and the effective use of computational resources. Research that tracks capability density shows that newer models achieve stronger performance with fewer parameters than earlier models. This indicates that architectural improvements, data quality, and training algorithms play a significant role in performance gains.

Technical commentary also highlights the growing importance of inference behavior and post-training improvements. When combined, these findings suggest that future systems will rely on efficient model design and better training methods rather than uncontrolled expansion of parameter count. The emphasis shifts from larger models to more capable models that use their parameters more effectively.

Constraints on future LLM scaling

Compute and energy limits

A recurring theme in the literature is the heavy compute demand required to train and deploy large language models. Training large models consumes significant compute resources, while inference at scale incurs substantial operational costs.

These factors impose economic limits on scaling even when theoretical scaling laws indicate further gains. As models grow, energy consumption and hardware requirements become increasingly challenging to manage.

Data availability constraints

Another constraint is the availability of high-quality data. Traditional formulations of scaling laws assume access to abundant training data, but this assumption is no longer reliable.

Several analyses point to the limited supply of high-quality text and the increasing need for curated or synthetic data. As training data size becomes a limiting factor, data quality becomes as crucial as parameter count in determining model performance.

Economic and compute budget constraints

Practical scaling is limited not only by technical factors but also by financial and organizational considerations. Research that focuses on performance prediction shows that compute budget planning is essential for determining which training runs are feasible.

Commentary on industry practices highlights the rising cost of compute and the need for organizations to allocate their resources carefully. These factors constrain how far scaling can be pushed in real-world environments.

Algorithmic and architectural constraints

Research on scaling laws emphasizes that predictable improvements occur only when models are trained in balanced regimes. Work that analyzes parameter efficiency demonstrates that architectural advances can shift the relationship between model size and performance.

Additional commentary shows that training algorithms influence how effectively scaling laws apply. These insights imply that simple parameter scaling cannot continue indefinitely and that progress will increasingly depend on new training methods and model architectures.

Implications of LLM scaling for businesses and industries

Implications based on compute optimal scaling

Findings from compute optimal training indicate that organizations should not rely on parameter count alone when evaluating model quality. Smaller models trained with more data can surpass larger undertrained models.

Research incorporating inference costs shows that total cost depends heavily on the inference compute accumulated over a model's deployment. Businesses should evaluate both training compute and inference compute when selecting models and deciding which systems to deploy at scale.

Implications based on parameter efficiency

Evidence of rising parameter efficiency shows that the most capable models increasingly achieve strong performance with fewer parameters.

An analysis of compute usage shows that inference load contributes significantly to operational costs. Organizations should consider parameter efficiency, expected inference volume, and dataset size when assessing model options. Models that achieve higher performance per parameter may offer better economic value.

Implications based on multi-benchmark skill scaling

Research on multi-benchmark scaling provides a method for predicting how various tasks will improve as models scale. This helps organizations identify which natural language processing applications benefit from larger models and which may require specialized training. The ability to forecast performance across tasks also supports more informed compute investment decisions and training strategies.

