Updated on Jun 13, 2025

Reproducible AI: Why it Matters & How to Improve it in 2025

Reproducibility is a fundamental aspect of the scientific method, enabling researchers to replicate an experiment or study and achieve consistent results using the same methodology. This principle is equally vital in artificial intelligence (AI) and machine learning (ML) applications, where the ability to reproduce outcomes ensures the reliability and robustness of models and findings. However:

  • ~5% of AI researchers share source code and less than a third of them share test data in their research papers. 1
  • Less than a third of AI research is reproducible, i.e. verifiable. 2

This is commonly referred to as the reproducibility or replication crisis in AI. We explore why reproducibility is important for AI and how businesses can improve reproducibility in their AI applications.3

What is reproducibility in artificial intelligence?

In the context of AI, reproducibility refers to the ability to achieve the same or similar results using the same dataset and AI algorithm within the same environment.

  • The dataset refers to the training data that the AI algorithm takes as input to learn from and make predictions.
  • The AI algorithm consists of model type, model parameters and hyperparameters, features, and other code.
  • The environment refers to the software and hardware used to run the algorithm.

To achieve reproducibility in AI systems, changes in all three components must be tracked and recorded.
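
As a minimal illustration of what tracking all three components can look like in practice, the sketch below records a dataset fingerprint, the algorithm configuration, and basic environment details in a single manifest file. This is a hedged sketch: the function name, file names, and manifest keys are illustrative and not part of any specific framework.

```python
import hashlib
import json
import platform
import sys

def record_run_manifest(dataset_path: str, hyperparams: dict,
                        out_path: str = "run_manifest.json") -> dict:
    """Record the three components that determine reproducibility:
    the dataset, the algorithm configuration, and the environment."""
    with open(dataset_path, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "dataset": {"path": dataset_path, "sha256": dataset_hash},
        "algorithm": hyperparams,  # model type, parameters, hyperparameters
        "environment": {
            "python": sys.version,
            "platform": platform.platform(),
        },
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```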

Why is reproducibility important in AI?

Reproducibility is crucial for both AI research and AI applications in the enterprise because:

  • For AI / ML research, scientific progress depends on the ability of independent researchers to scrutinize and reproduce the results of a study.4 Machine learning cannot be improved or applied in other areas if its essential components are not documented for reproducibility. A lack of reproducibility blurs the line between scientific production and marketing.
  • For AI applications in business, reproducibility enables building AI systems that are less error-prone. Fewer errors benefit businesses and their customers by increasing reliability and predictability, since teams can trace which components lead to which results. This is also necessary to convince decision-makers to scale AI systems so that more users can benefit from them.

What are the challenges regarding reproducible AI?

Updated on Sep 24, 2024

  • Randomness/Stochasticity: Different results from stochastic gradient descent (SGD) in deep learning
  • Lack of standardization in data preprocessing: Different stopword removal in NLP affecting model performance
  • Non-deterministic hardware/software: Differences in results on NVIDIA GPUs vs. AMD GPUs
  • Hyperparameter tuning: Learning rate differences in XGBoost drastically changing performance
  • Lack of documentation/code sharing: Transformer models missing detailed implementation of layer normalization
  • Versioning issues: TensorFlow 1.x vs. TensorFlow 2.x API changes affecting reproducibility
  • Dataset availability/variability: Proprietary healthcare datasets that aren’t accessible for replication
  • Computational resources: State-of-the-art models like GPT-4 requiring massive GPU clusters to replicate training
  • Overfitting to specific test sets: Reporting results only on specific dataset splits, overfitting to test data
  • Bias/cherry-picking results: Reporting only the best experimental run without disclosing other outcomes

1. Randomness and Stochastic Nature of Algorithms

Many AI models, especially deep learning algorithms, incorporate randomness during their training and inference processes. For instance, random weight initialization, dropout layers, and stochastic gradient descent (SGD) contribute to variability even when using the same dataset, codebase, and environment.
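
A common mitigation is to fix the seeds of every random number generator involved in training. The sketch below shows a minimal seed-setting helper for a PyTorch workflow; note that fixed seeds reduce run-to-run variability but do not by themselves guarantee bit-identical results, especially on GPUs.

```python
import os
import random

import numpy as np
import torch

def set_global_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness in a PyTorch training run
    so that repeated runs start from the same state."""
    random.seed(seed)                     # Python's built-in RNG
    np.random.seed(seed)                  # NumPy (shuffling, augmentation)
    torch.manual_seed(seed)               # CPU weight init, dropout masks
    torch.cuda.manual_seed_all(seed)      # all visible GPUs
    os.environ["PYTHONHASHSEED"] = str(seed)

set_global_seed(42)
```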

This issue is especially pronounced in Large Language Models (LLMs), such as GPT-4, Gemini, or LLaMA, which are inherently probabilistic. Even when prompted with the same input and configuration, they may generate different outputs, particularly if temperature or top-k sampling parameters are adjusted. These settings control the randomness of output generation:

  • Temperature adjusts the probability distribution used during token sampling. A higher temperature (e.g., 1.0) produces more diverse, creative outputs, while a lower temperature (e.g., 0.2) yields more deterministic responses.
  • Top-k or top-p (nucleus) sampling further controls randomness by limiting the range of tokens considered at each step.

Real Life Example: Asking an LLM to summarize the same paragraph twice with a temperature of 0.9 may yield significantly different summaries. This variability makes it difficult to verify or reproduce model behavior unless settings are fixed and explicitly documented.

In enterprise applications, such as contract summarization, chatbot replies, or AI coding assistants, this unpredictability poses challenges for debugging, compliance, and quality assurance. Teams may struggle to trace which configuration led to a specific output unless all parameters, including the random seed and temperature, are logged consistently.
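
As a hedged illustration, the snippet below pins and logs the sampling parameters for an OpenAI-style chat completions client (the `openai` Python SDK); the model name is a placeholder, and the `seed` parameter is a best-effort determinism hint rather than a guarantee.

```python
from openai import OpenAI  # assumes the `openai` Python package is installed

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

generation_config = {
    "model": "gpt-4o",     # placeholder model name
    "temperature": 0.0,    # minimize sampling randomness
    "top_p": 1.0,
    "seed": 42,            # best-effort determinism, not a hard guarantee
}

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize: ..."}],
    **generation_config,
)

# Log the exact configuration alongside the output for later auditing.
print(generation_config)
print(response.choices[0].message.content)
```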

2. Lack of Standardization in Data Preprocessing

Preprocessing steps such as data augmentation, normalization, and feature extraction are often not consistently documented or shared. Small changes in how data is preprocessed, even seemingly minor ones like rounding errors, can lead to different results. This is particularly true for image processing or natural language processing tasks, where data variability is high.
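
A simple guard is to pin and version-control the preprocessing itself rather than relying on library defaults, which can change between releases. The sketch below uses a hypothetical fixed stopword list and a deterministic tokenizer; the names and word list are illustrative.

```python
import re

# A fixed, version-controlled stopword list instead of whatever the
# installed NLP library happens to ship by default.
STOPWORDS_V1 = {"the", "a", "an", "and", "or", "of", "to", "in"}

def preprocess_v1(text: str) -> list[str]:
    """Deterministic, documented preprocessing: lowercase, keep
    alphanumeric tokens, drop the pinned stopword list."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS_V1]

print(preprocess_v1("The model converged in 10 epochs."))
# ['model', 'converged', '10', 'epochs']
```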

3. Non-Deterministic Hardware and Software

The execution of AI algorithms can vary across different hardware (CPUs, GPUs, TPUs) and even on the same hardware due to underlying non-deterministic processes in libraries like TensorFlow or PyTorch. Differences in versions of these libraries can introduce further variability, even when code and data are identical.
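
Frameworks expose switches that trade some speed for determinism. The PyTorch settings below request deterministic kernels; even with them, results may still differ across different hardware or library versions.

```python
import os
import torch

# Some CUDA operations additionally require this environment variable
# to be set before the first CUDA call.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Prefer deterministic kernels; operations without a deterministic
# implementation will raise an error instead of silently varying.
torch.use_deterministic_algorithms(True)

# cuDNN: disable autotuning (which may pick different kernels per run)
# and request deterministic convolution algorithms.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```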

4. Hyperparameter Tuning

Many AI models rely on hyperparameters, such as learning rate, batch size, or regularization strength, which need to be fine-tuned. Often, these are not shared in enough detail, or their selection is not explained rigorously, making it difficult to reproduce results. Also, slight changes in hyperparameters can result in very different performance outcomes.
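
A lightweight remedy is to load hyperparameters from a version-controlled configuration file instead of hard-coding them, so every reported result maps to a concrete, reviewable config. The file name and keys below are hypothetical.

```python
import json

# hyperparams.json is a hypothetical, version-controlled file, e.g.:
# {"learning_rate": 0.1, "max_depth": 6, "n_estimators": 200, "seed": 42}
with open("hyperparams.json") as f:
    hyperparams = json.load(f)

# Pass exactly these values to the training code and log them with the
# results, so small changes (e.g. learning rate) remain traceable.
print(hyperparams)
```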

5. Lack of Detailed Documentation and Code Sharing

Even when research papers provide code, it may not be complete or fully aligned with the published results. Some critical elements, such as specific libraries, model weights, or data pipelines, might not be disclosed, hindering exact reproduction.

6. Versioning Issues

The dynamic nature of AI software ecosystems means that libraries and frameworks are constantly evolving. A model trained using a specific version of a library might not perform the same when run on a later version, even if the code remains unchanged. Keeping track of versions for all dependencies can be difficult, and versioning is often poorly documented.
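
At minimum, the exact versions of critical dependencies can be captured at runtime and stored next to the model artifacts. A minimal sketch using only the standard library (the package list is illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

# Packages whose exact versions can affect results; extend as needed.
packages = ["numpy", "torch", "scikit-learn"]

pinned = {}
for name in packages:
    try:
        pinned[name] = version(name)
    except PackageNotFoundError:
        pinned[name] = "not installed"

print(pinned)  # store this alongside the trained model
```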

7. Dataset Availability and Variability

Some datasets used in AI research are proprietary or not publicly available, making it impossible to replicate studies. Even when datasets are available, there can be variations due to sampling, updates, or different preprocessing techniques applied at the time of research.

8. Computational Resources

Reproducing state-of-the-art AI models often requires significant computational resources, including specialized hardware like GPUs or TPUs. Researchers or practitioners without access to the same level of resources may find it hard to replicate results.

9. Overfitting to Specific Test Sets

In some cases, models are inadvertently overfitted to specific test sets or benchmarks. When these models are tested in different environments or on slightly altered datasets, the results may not generalize, making reproducibility challenging.
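
One way to make such claims easier to verify is to evaluate across several folds rather than a single favorable split. A minimal sketch with scikit-learn on a toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Evaluating on several folds makes it harder to (accidentally)
# tune a model against one particular test set.
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: mean={scores.mean():.3f}, std={scores.std():.3f}")
```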

10. Bias in Reporting and Cherry-Picking Results

Researchers may report the best-performing version of a model after multiple runs without specifying the variability across runs or disclosing the total number of experiments conducted. This selective reporting skews the perceived reproducibility of results.
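
A simple countermeasure is to run the same experiment with several seeds and report the distribution, not just the best run. The sketch below uses a toy stand-in for the training run; the numbers are illustrative.

```python
import random
import statistics

def train_and_evaluate(seed: int) -> float:
    """Stand-in for a full training run; returns a toy metric that
    varies with the seed, as real runs do."""
    rng = random.Random(seed)
    return 0.90 + rng.uniform(-0.02, 0.02)

seeds = [0, 1, 2, 3, 4]
scores = [train_and_evaluate(s) for s in seeds]

# Report the full distribution over runs, not only max(scores).
print(f"best={max(scores):.3f}, mean={statistics.mean(scores):.3f}, "
      f"std={statistics.stdev(scores):.3f}, n={len(scores)}")
```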

The role of AI researchers in addressing reproducibility

AI researchers develop cutting-edge models, but they also bear responsibility for ensuring that their work can be verified and trusted. Despite calls for transparency, many research outputs still fall short in practice:

  • An analysis of NeurIPS (Conference on Neural Information Processing Systems) papers found that only 42% included code, and just 23% provided links to datasets.
  • Most AI studies lack sufficient detail to be independently reproduced, often due to inadequate documentation of hyperparameters, training conditions, and evaluation protocols.
  • Nearly 70% of AI researchers admitted they had struggled to reproduce someone else’s results, even within the same subfield.

To overcome these issues, the AI research community must:

  • Adopt open science practices: Sharing code, data, and detailed experiment logs enables peer verification and scientific integrity.
  • Standardize reporting: Following structured formats like the Machine Learning Reproducibility Checklist helps ensure essential details are documented.
  • Promote cross-institutional validation: Encouraging independent replication by other research teams helps identify generalizability and reliability.

How to improve reproducibility in AI?

The best way to achieve AI reproducibility in the enterprise is by leveraging MLOps best practices. MLOps involves streamlining the artificial intelligence and machine learning lifecycle with automation and a unified framework within an organization.

Some MLOps tools and techniques that facilitate reproducibility are: 

  • Experiment tracking: Experiment tracking tools record important information about ML experiments, such as parameters, metrics, and artifacts, in a structured manner (see the sketch after this list).
  • Data Lineage: Data lineage tracks where data originates, how it is transformed, and where it flows over the data lifecycle, with records and visualizations.
  • Model Versioning: Similarly, model versioning tools keep track of different versions of AI models, with different model types, parameters, hyperparameters, etc., and allow companies to compare them.
  • Model Registry: A model registry is a central repository for all models and their metadata. It helps data scientists access different models and their properties at different times.
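
As a brief illustration of experiment tracking, the sketch below logs parameters, a metric, and a config artifact with MLflow; the experiment name, parameter values, and attached file are illustrative, and other tracking tools offer similar APIs.

```python
import mlflow

mlflow.set_experiment("churn-model")  # illustrative experiment name

with mlflow.start_run():
    mlflow.log_param("model_type", "xgboost")
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_param("seed", 42)

    # ... train and evaluate the model here ...

    mlflow.log_metric("test_auc", 0.87)      # illustrative value
    # Attach the exact config file (assumes it exists on disk).
    mlflow.log_artifact("hyperparams.json")
```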

For more on these tools, feel free to check our article on MLOps tools and our data-driven list of MLOps platforms. 

Apart from the tools, MLOps also helps businesses improve reproducibility by facilitating communication between data scientists, IT staff, subject matter experts, and operations professionals.

What does reliable AI mean & how does it relate to reproducible AI?

Reliable AI refers to systems that perform consistently and correctly under varied conditions. This includes producing accurate, fair, and safe outputs across different environments and data inputs. A key pillar of reliability is reproducibility, the ability to recreate the same results using the same inputs and methods, even when the system is deployed in new contexts or by different teams.

  • Consistency Across Runs: Reproducible AI ensures that repeated training or inference under the same conditions yields the same results, critical for validating reliability.
  • Debugging and Auditing: Reliable systems must be transparent and accountable. Reproducibility allows stakeholders to trace how a decision was made and verify it independently.
  • Robust Testing: To ensure reliability, AI must be tested under multiple conditions. Reproducibility enables standardized testing procedures to validate performance claims.
  • Trust Building: When results can be consistently reproduced, users and regulators are more likely to trust the AI’s reliability and safety.
  • Scientific Integrity: In AI research, reproducibility is essential for peer review and advancement. Reliable systems depend on this foundation to ensure theoretical soundness translates to practical dependability.
Altay is an industry analyst at AIMultiple. He has a background in international political economy, multilateral organizations, development cooperation, global politics, and data analysis.


Comments


Richard Rudd-Orthner
Oct 04, 2023 at 09:14

I have been working on this and have achieved it on CPU. Repeatable determinism, or reproducibility, is a keystone of dependable systems and, when applied in convolutional networks, can yield higher accuracy.

These are some of the peer-reviewed publications made in IEEE venues.

• [1] R. Rudd-Orthner and L. Mihaylova, “Non-random weight initialisation in deep learning networks for repeatable determinism,” in Peer-Reviewed Proc. of the 10th IEEE International Conference on Dependable Systems, Services and Technologies (DESSERT-19), Leeds, UK, 2019.
o This conference paper showed that an alternative to random initialisation is possible, providing almost equal performance but with reproducibility. Presented at the UK, Ukraine and Northern Ireland IEEE branches’ conference in Leeds.

• [2] R. Rudd-Orthner and L. Mihaylova, “Repeatable determinism using non-random weight initialisations in smart city applications of deep learning,” Journal of Reliable Intelligent Environments, Smart Cities special edition, vol. 6, no. 1, pp. 31-49, 2020.
o This journal paper raised the approach to equivalent performance by using the limits from He and Xavier initialisation, making the earlier reproducibility result more general, although it was limited to dense layers.

• [3] R. Rudd-Orthner and L. Mihaylova, “Non-random weight initialisation in deep convolutional networks applied to safety critical artificial intelligence,” in Peer-Reviewed Proc. of the 13th International Conference on Developments in eSystems Engineering (DeSe), Liverpool, UK, 2020.
o This conference paper demonstrated an approach for convolutional layers as an alternative to random initialisation, providing higher performance with reproducibility. Presented at the UK and UAE IEEE branches’ conference in Liverpool, held virtually.

• [4] R. Rudd-Orthner and L. Mihaylova, “Deep ConvNet: non-random weight initialization for repeatable determinism with FSGM,” Sensors, vol. 21, no. 14, p. 4772, 2021.
o This journal paper extended the work to colour-image proofs and used the FSGM cyber attack as a method for measuring the effect in transferred learning.

• [5] R. Rudd-Orthner and L. Mihaylova, “Multi-type aircraft of remote sensing images: MTARSI2,” Zenodo, 30 June 2021. [Online]. Available: https://zenodo.org/record/5044950#.YcWalmDP2Ul. [Accessed 30 June 2021].
o This is the colour dataset used.

• [6] R. Rudd-Orthner, “Artificial Intelligence Methods for Security and Cyber Security Systems,” University of Sheffield, Sheffield, UK, 2022.
o This is the final full write-up in context and with other approaches.

