
LLM Scaling Laws: Analysis from AI Researchers in 2026

Sıla Ermut
updated on Jan 27, 2026

Large language models predict the next token based on patterns learned from text data. The term LLM scaling laws refers to empirical regularities that link model performance to the amount of compute, training data, and model parameters used during training.

To understand how these relationships influence modern model design in practice, we reviewed findings from five academic papers and insights from three major AI labs and researchers.

Key takeaways

Leading researchers converge on the following key insights:

  • Model performance does not depend solely on parameter count. Data quantity and quality are equally critical.
  • Scaling decisions should be based on task requirements rather than assuming larger models are always better.
  • Parameter-efficient architectures can achieve competitive performance at lower training and inference costs.
  • In real-world deployments, inference costs can outweigh training costs and should be considered when choosing a model size.

Evidence from academic scaling law research

“Scaling Laws for Neural Language Models”, Kaplan & McCandlish, 2020

Kaplan et al. introduced the first widely cited scaling laws for neural language models.

In their analysis, model performance follows power-law relationships with respect to three key variables: the number of model parameters, the size of the training dataset (measured in tokens), and the total training compute.

By systematically varying these three factors, the authors showed that increasing any one of them leads to predictable reductions in loss, provided the others are appropriately scaled.
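As a rough sketch of the functional form: when the other two factors are not limiting, the paper models test loss as a separate power law in each variable. The exponents below are approximately the values reported by Kaplan et al., while the constants N_c, D_c, and C_c are empirically fitted and depend on the setup.

```latex
% Approximate power-law forms from Kaplan et al. (2020); N_c, D_c, C_c are fitted constants.
\begin{align*}
L(N)        &= (N_c / N)^{\alpha_N},               & \alpha_N        &\approx 0.076 \\
L(D)        &= (D_c / D)^{\alpha_D},               & \alpha_D        &\approx 0.095 \\
L(C_{\min}) &= (C_c / C_{\min})^{\alpha_C^{\min}}, & \alpha_C^{\min} &\approx 0.050
\end{align*}
```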

Figure 1: The figure shows how test loss changes with model size under different compute budgets and training step counts, revealing the optimal balance between model size, compute, and training duration for best performance.

This work established the foundation for later research on language model scaling laws. It also demonstrated that model shape and depth matter less than total parameter count and training tokens when compute is fixed, an insight that shaped how later researchers sized and trained large language models.1

“Training Compute-Optimal Large Language Models”, Hoffmann, Borgeaud & Mensch, 2022

This paper reevaluates the earlier laws for neural language models using a large set of controlled experiments. It models loss as a joint function of model parameters and training data size, and finds that many earlier large models were undertrained for their parameter count. When researchers train larger models with insufficient training data, the resulting model quality does not align with predictions from traditional scaling laws.

The authors show that, for a fixed compute budget, optimal performance is achieved when the number of parameters and the number of training tokens are scaled in roughly equal proportion. This result is widely known as the Chinchilla scaling law: compute-optimal training pairs each increase in model size with a near-proportional increase in training tokens, which works out to roughly 20 tokens per parameter.

This approach produces smaller models that perform better than larger models trained on limited data. It also supports efficient model selection, as researchers can fit scaling laws to smaller models and predict language model performance for larger configurations before training.
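To make the allocation concrete, here is a minimal Python sketch of Chinchilla-style budgeting. It assumes the common approximation that training compute is about 6 × parameters × tokens FLOPs, together with the roughly 20-tokens-per-parameter rule of thumb; both are simplifications rather than the paper's exact fitted coefficients.

```python
import math

# Minimal sketch of Chinchilla-style compute-optimal allocation.
# Assumptions: training compute C ~ 6 * N * D FLOPs, and roughly
# 20 training tokens per parameter at the compute-optimal point.
TOKENS_PER_PARAM = 20.0

def compute_optimal_allocation(flops_budget):
    """Split a training FLOPs budget into parameters (N) and tokens (D)."""
    # Solve 6 * N * (TOKENS_PER_PARAM * N) = C for N.
    n_params = math.sqrt(flops_budget / (6.0 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    n, d = compute_optimal_allocation(1e24)  # hypothetical 1e24 FLOP budget
    print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Under these assumptions, a 10^24 FLOP budget points to a model on the order of 90 billion parameters trained on roughly 1.8 trillion tokens, rather than a much larger model trained on less data.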

Figure 2: The figure overlays predictions from several methods, all indicating that today’s large models are oversized and should instead be smaller and trained longer.2

“Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws”, Sardana, Portes & Doubov, 2025

Sardana et al. extend the Chinchilla framework by incorporating inference costs into compute-optimal scaling.

Instead of minimizing training compute alone, they fix a target performance level and optimize the combined cost of training and inference over the model’s lifetime.

This shift leads to an important practical insight: in high-usage settings, smaller models trained on more data can often match the performance of larger models while incurring lower total compute costs.
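A back-of-the-envelope sketch of the lifetime-cost argument, using the common approximations of about 6ND FLOPs for training and about 2N FLOPs per generated token at inference. The model sizes, token counts, and serving volumes are hypothetical, and the comparison assumes for illustration that the two configurations reach similar quality.

```python
# Sketch of the lifetime-cost idea behind inference-aware scaling.
# Assumptions: ~6 * N * D FLOPs for training and ~2 * N FLOPs per generated
# token at inference; real deployments differ (batching, caching, quantization).

def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Approximate total compute over a model's lifetime."""
    training = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * inference_tokens
    return training + inference

# Hypothetical comparison: a 70B model trained on 1.4T tokens vs. a 30B model
# trained on 4T tokens, both serving 10T tokens of inference over their lifetime.
big = lifetime_flops(70e9, 1.4e12, 10e12)
small = lifetime_flops(30e9, 4.0e12, 10e12)
print(f"70B model: {big:.2e} FLOPs, 30B model: {small:.2e} FLOPs")
```

In this illustrative setup, the smaller, longer-trained model is cheaper over its lifetime (roughly 1.3 × 10^24 versus 2.0 × 10^24 FLOPs), and the gap widens as serving volume grows.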

Figure 3: The graphs compare the ratios of total cost, parameter count, and training tokens between real-world cost-optimal models and Chinchilla-style models.3

“Sloth: Scaling laws for LLM skills to predict multi-benchmark performance across families”, Polo, Somerstep & Choshen, 2025

Sloth introduces a new approach to modeling scaling laws for large language models by shifting the focus from model loss to benchmark-level performance. Instead of treating tasks separately, Sloth identifies a set of latent skills that capture the performance of language models across different benchmarks. These skills represent general capabilities such as reasoning or knowledge retrieval.

The framework models how each skill scales with the model’s parameters and the training data. Sloth uses simple features, such as the logarithms of model and dataset sizes, to describe how these skills change within a model family. Once fitted, Sloth can predict how larger models in the same family will perform on many benchmarks without training them.
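The sketch below illustrates the general idea of regressing a benchmark score on the logarithms of model and dataset size within one family. It is a deliberate simplification for illustration, not the authors' Sloth implementation (which factors many benchmarks into shared latent skills), and all numbers are hypothetical.

```python
import numpy as np

# Illustrative sketch: predict a benchmark score from log(parameters) and
# log(training tokens) within one model family. Not the Sloth implementation.

# Hypothetical observations (params, tokens, score) from smaller family members.
params = np.array([0.5e9, 1e9, 3e9, 7e9])
tokens = np.array([0.1e12, 0.3e12, 0.6e12, 1.5e12])
scores = np.array([0.26, 0.34, 0.44, 0.52])

# Design matrix: intercept, log(params), log(tokens).
X = np.column_stack([np.ones_like(params), np.log(params), np.log(tokens)])
coef, *_ = np.linalg.lstsq(X, scores, rcond=None)

# Extrapolate to a larger, not-yet-trained member of the same family.
x_new = np.array([1.0, np.log(70e9), np.log(14e12)])
print(f"predicted score: {x_new @ coef:.3f}")
```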

By using Sloth’s predictions, organizations can decide where to allocate computational resources and avoid training configurations unlikely to achieve the desired performance. This supports more principled planning of training runs under real-world constraints.4

“Densing law of LLMs”, Xiao, Cai & Zhao, 2025

The paper examines how efficiently models use their parameters. It introduces the concept of capability density, defined as the ratio of a model’s effective parameter count to its actual parameter count. Effective parameter count is estimated by fitting scaling laws to existing models and asking how large a reference model would need to be to match current performance.
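A small sketch of the density calculation under a hypothetical reference curve. The logistic form and its coefficients below are stand-ins for the scaling-law fit used in the paper, so the output is illustrative only.

```python
import math

# Sketch of capability density: invert a reference score-vs-size curve to find
# how many parameters a reference model would need to match a new model's score,
# then divide by the new model's actual parameter count.
# The logistic reference curve and its coefficients are hypothetical stand-ins.

def reference_score(n_params, a=0.25, b=-5.0):
    """Hypothetical reference curve: benchmark score as a function of size."""
    return 1.0 / (1.0 + math.exp(-(a * math.log(n_params) + b)))

def effective_params(score, a=0.25, b=-5.0):
    """Invert the reference curve to get the 'effective' parameter count."""
    logit = math.log(score / (1.0 - score))
    return math.exp((logit - b) / a)

def capability_density(score, actual_params):
    return effective_params(score) / actual_params

# A hypothetical newer 7B model scoring 0.70 where the reference 7B scores ~0.66:
print(f"density ~ {capability_density(0.70, 7e9):.2f}")
```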

The authors observe that the best models at each time point show rising capability density. This means that newer models achieve a given performance with fewer parameters than older models. The trend appears approximately exponential over time.

This observation suggests that progress in large language models is not only about scaling model size but also about improving model architecture, training data quality, and training algorithms. The paper argues that tracking parameter efficiency is essential for understanding future directions in natural language processing and machine learning.

Figure 4: The graph shows estimated capability density for open-source base LLMs across five reasoning and coding benchmarks, with circle size indicating model parameter count, and a trend line suggesting a “densing law” in which peak capability density rises exponentially over time.5

LLM scaling laws opinions from major AI labs and researchers

Beyond academic scaling laws, industry researchers and practitioners emphasize how these principles translate into real-world model development and deployment.

The following perspectives illustrate how different stakeholders, from hardware providers to applied researchers, interpret and apply scaling laws in practice.

NVIDIA, 2025

From an infrastructure perspective, NVIDIA presents scaling laws as practical tools for designing and training large language models. It highlights three primary scaling axes:

  • Model size.
  • Dataset size.
  • Compute resources.

According to NVIDIA, scaling any of these factors in the correct regime results in predictable improvements in model quality.

The article also emphasizes the importance of test-time compute. Modern systems spend more compute at inference time, using techniques such as extended reasoning sequences. This adds a new dimension to scaling laws, extending beyond the original focus on training tokens and model parameters.
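Rough arithmetic shows why this matters. Assuming the common approximation of about 2N FLOPs per generated token for a dense N-parameter model, extending answers from a few hundred tokens to a few thousand reasoning tokens multiplies serving compute accordingly; the figures below are hypothetical.

```python
# Back-of-the-envelope sketch of test-time compute as a scaling axis.
# Assumption: ~2 * N FLOPs per generated token for a dense N-parameter model.

def inference_flops(n_params, tokens_per_answer, num_queries):
    return 2.0 * n_params * tokens_per_answer * num_queries

N = 30e9        # hypothetical 30B-parameter model
queries = 1e9   # one billion queries over a deployment period

short = inference_flops(N, tokens_per_answer=300, num_queries=queries)
reasoning = inference_flops(N, tokens_per_answer=5_000, num_queries=queries)
print(f"short answers: {short:.2e} FLOPs, extended reasoning: {reasoning:.2e} FLOPs")
```

At these illustrative volumes, the extended-reasoning scenario alone is on the order of 3 × 10^23 FLOPs, comparable to or larger than the training compute of many mid-sized models.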

NVIDIA uses these ideas to explain why demand for compute resources continues to grow, even as models become more parameter-efficient. It suggests that both training and inference will remain significant drivers of compute use in future natural language processing systems.6

Cameron Wolfe, LLM researcher at Netflix, 2025

From a practitioner’s standpoint, Cameron Wolfe explains how the original power-law relationships from the academic literature apply to current models, and how practitioners can use these curves to estimate achievable performance before training larger models.

Wolfe discusses the roles of model shape and architecture in scaling and notes that, while traditional scaling laws focus on parameter count, practical systems must also consider data quality and training algorithms. The piece highlights concerns about the availability of high-quality data and how these constraints may affect the training of future larger models.

The discussion presents scaling laws as guidance for evaluating existing models and for estimating how model performance may change when training data is expanded or when model parameters are adjusted.7

MIT-IBM Watson AI Lab, 2025

Taking a more methodological view, the researchers from the MIT-IBM Watson AI Lab analyze scaling laws across multiple architectures and datasets.

The researchers compile a broad meta-dataset that includes 485 pretrained models, detailed training metadata, and more than 1 million performance measurements. This dataset is used to test over 1,000 candidate scaling laws and identify patterns that generalize across different model families.

The study outlines clear steps for fitting scaling laws under compute constraints. It recommends defining a compute budget and target performance, then training a small collection of models at different sizes rather than focusing on the largest models. Intermediate checkpoints are highlighted as valuable sources of information, while measurements from very early in training are discouraged because they are noisy.
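As a sketch of what that workflow can look like in code, the snippet below fits a saturating power law, L(N) = E + A·N^(−α), to final losses from a few small runs and extrapolates to a larger target. The functional form and all numbers are illustrative assumptions, not the study's fitted values.

```python
import numpy as np
from scipy.optimize import curve_fit

# Sketch of the recommended workflow: fit a simple saturating power law to
# losses from a handful of small models, then extrapolate to a larger target.

def loss_curve(n_params, E, A, alpha):
    return E + A * n_params ** (-alpha)

# Hypothetical final losses from four small training runs.
n_params = np.array([1e8, 3e8, 1e9, 3e9])
losses = np.array([3.60, 3.22, 2.90, 2.68])

popt, _ = curve_fit(loss_curve, n_params, losses, p0=[2.0, 100.0, 0.2], maxfev=20_000)
E, A, alpha = popt

# Predict the loss of a larger, not-yet-trained model.
print(f"E={E:.2f}, A={A:.1f}, alpha={alpha:.3f}, "
      f"predicted loss at 30B params: {loss_curve(30e9, *popt):.2f}")
```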

The authors show that when these guidelines are followed, predictions can approach the lower bound set by random-seed variability. Even when predictions are less precise, scaling laws remain useful for comparing training choices and identifying promising configurations.

The work notes that performance varies significantly across model families, which reinforces the importance of using diverse training settings when fitting scaling laws.8

What do leading researchers say about the future of scaling?

Views supporting the continued validity of scaling laws

Across the research landscape, there is consistent evidence that scaling laws hold within the tested regimes. Foundational work shows clear power-law relationships between model parameters, training data size, and training compute when models are trained in balanced settings.

Later studies refine this picture by demonstrating that compute-optimal training requires aligning model size with the volume of training tokens, and that this alignment improves model performance relative to earlier approaches.

Additional work on multitask evaluation shows that benchmark performance also scales predictably when expressed in terms of a smaller set of latent skills. This reinforces the view that language model scaling laws remain reliable tools for forecasting model performance when dataset size and compute resources are allocated appropriately.

Views emphasizing efficient compute allocation

A second line of research argues that progress increasingly depends on how compute is distributed rather than on expanding parameter count alone. Analyses of compute-optimal training show that models require sufficient training data to reach their potential and that larger models trained on limited data are often inefficient.

Work that incorporates inference costs extends this idea by showing that the total cost of a model depends on both training compute and inference-time compute.

This perspective suggests that future scaling efforts will emphasize efficient configurations that jointly optimize model size, training tokens, and expected inference volume. It frames the design of large language models as an exercise in compute allocation rather than as a pursuit of maximal parameter growth.

Views emphasizing the growing importance of efficiency and density

Another viewpoint focuses on parameter efficiency and the effective use of computational resources. Research that tracks capability density shows that newer models achieve stronger performance with fewer parameters than earlier models. This indicates that architectural improvements, data quality, and training algorithms play a significant role in performance gains.

Technical commentary also highlights the growing importance of inference behavior and post-training improvements. When combined, these findings suggest that future systems will rely on efficient model design and better training methods rather than uncontrolled expansion of parameter count. The emphasis shifts from larger models to more capable models that use their parameters more effectively.

Constraints on future LLM scaling

Compute and energy limits

A recurring theme in the literature is the heavy compute demand required to train and deploy large language models. Training large models consumes significant compute resources, while inference at scale incurs substantial operational costs.

These factors impose economic limits on scaling even when theoretical scaling laws indicate further gains. As models grow, energy consumption and hardware requirements become increasingly challenging to manage.

Data availability constraints

Another constraint is the availability of high-quality data. Traditional formulations of scaling laws assume access to abundant training data, but this assumption is no longer reliable.

Several analyses point to the limited supply of high-quality text and the increasing need for curated or synthetic data. As training data size becomes a limiting factor, data quality becomes as crucial as parameter count in determining model performance.

Economic and compute budget constraints

Practical scaling is limited not only by technical factors but also by financial and organizational considerations. Research that focuses on performance prediction shows that compute budget planning is essential for determining which training runs are feasible.

Commentary on industry practices highlights the rising cost of compute and the need for organizations to allocate their resources carefully. These factors constrain how far scaling can be pushed in real-world environments.

Algorithmic and architectural constraints

Research on scaling laws emphasizes that predictable improvements occur only when models are trained in balanced regimes. Work that analyzes parameter efficiency demonstrates that architectural advances can shift the relationship between model size and performance.

Additional commentary shows that training algorithms influence how effectively scaling laws apply. These insights imply that simple parameter scaling cannot continue indefinitely and that progress will increasingly depend on new training methods and model architectures.


Sıla Ermut
Industry Analyst
Sıla Ermut is an industry analyst at AIMultiple focused on email marketing and sales videos. She previously worked as a recruiter in project management and consulting firms. Sıla holds a Master of Science degree in Social Psychology and a Bachelor of Arts degree in International Relations.
