Data-Centric AI: What it is & 3 Best Practices to Adopt It in 2024
Over the past decade, the predominant approach to developing artificial intelligence (AI) systems has been to collect large amounts of data and train complex algorithms on it. This approach, called the model-centric approach to AI, relies on large datasets both to achieve good performance and to offset data problems such as noise and inaccurate labels.
The need to collect and process large amounts of data presents several challenges:
- Not all industries and business units have access to large datasets. Companies with a large user base can collect huge datasets, but predicting a rare disease or detecting a new type of fraud means working with a small number of data points.
- Big data is costly. Several factors contribute to this:
  - Collecting, labeling, and validating data for machine learning models can cost between $10,500 and $85,000.
  - Processing large amounts of data requires computers with high processing power, especially if you are working with images and videos. The computational resources needed to produce a large AI model doubled roughly every 3.5 months between 2012 and 2018.
  - Large models can have a huge environmental cost. A 2019 study estimates that training a single large deep learning model can produce CO2 emissions roughly equivalent to the total lifetime carbon footprint of five cars.
- Big data has privacy risks. The average cost of a data breach was $4.24 million in 2021.
Recently, industry leaders have started to discuss data-centric AI, which advocates putting data at the center of AI research and development. In this article, we’ll explore what data-centric AI is and how you can adopt such an approach in your business.
What is data-centric AI?
Data-centric AI is an approach to AI development that focuses on improving the data, rather than the model, to build better-performing AI systems. To better understand it, let’s look at how this approach differs from a model-centric approach to AI:
Data-Centric AI vs Model-Centric AI
In a model-centric approach to AI development, the training data is treated as a relatively fixed component, and the process of improving the model performance revolves around experimentation with model architectures and parameters.
This approach has been the dominant paradigm for AI development over the last decade and has contributed to state-of-the-art model architectures such as neural networks, as Andrew Ng, founder of Landing AI and a prominent AI researcher at Stanford University and Google Brain, points out.
In the model-centric approach, data issues such as noise and inaccurate labels are solved by collecting large datasets and optimizing the model so that it averages over good and bad data. Data cleaning is certainly involved, but it is often limited and done manually.
A data-centric approach shifts the focus to improving the data, rather than the model architecture, to improve performance. This can involve:
- Improving the quality of data labels
- Collecting complete and representative data
- Minimizing data bias
Hence, the iterative process consists of improving data quality while keeping the model relatively fixed. However, a successful AI application does not depend solely on either good data or a good model but on a combination of a well-designed model and high-quality data. The point of data-centric AI is that practitioners spend most of their effort on improving model architectures while the data is often overlooked: only 1% of AI research deals with data.
Andrew Ng argues that a data-centric AI approach leads to better-performing models. Figure 1 shows the improvement in accuracy with a model-centric approach vs. a data-centric approach:
Figure 1. The impact of improving the code vs. the data on model performance. Source: deeplearning.ai
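To make the idea of keeping the model fixed while improving the data more concrete, here is a minimal sketch. It is not taken from Andrew Ng's experiments: the nearest-centroid classifier, the one-dimensional data points, and all function names are illustrative assumptions. The same simple model is fitted twice, once on a training set with two wrong labels and once after those labels are corrected.

```python
from statistics import mean

def fit_centroids(samples):
    """Return {label: mean feature value} for a list of (feature, label) pairs."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}

def predict(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def accuracy(centroids, test_set):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(predict(centroids, x) == y for x, y in test_set)
    return hits / len(test_set)

# Hypothetical data: in noisy_train, the points 7 and 8 carry the wrong label.
noisy_train = [(1, 0), (2, 0), (3, 0), (7, 0), (8, 0), (9, 1)]
clean_train = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
test_set = [(2, 0), (4, 0), (6, 1), (6.5, 1), (9, 1)]

# The model code is identical in both runs; only the labels differ.
noisy_acc = accuracy(fit_centroids(noisy_train), test_set)
clean_acc = accuracy(fit_centroids(clean_train), test_set)
print(f"noisy labels: {noisy_acc:.2f}, corrected labels: {clean_acc:.2f}")
```

In this toy setup the mislabeled points pull the class-0 centroid toward class 1, so fixing two labels improves accuracy without touching the model. Real datasets will rarely behave this cleanly.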
Applying a data-centric approach to AI development
1. Leverage MLOps practices
Data-centric AI emphasizes spending more time on the data than on the model. Time spent on the model includes model selection, hyperparameter optimization, and experiment tracking, as well as model deployment and monitoring. Automating and streamlining these ML lifecycle processes therefore plays an important role in a data-centric approach, because it frees up time for working on the data.
Companies must adopt practices known as MLOps to standardize and automate model-building processes. MLOps involves:
- Automated pipelines that streamline machine learning lifecycle management
- A unified framework that is followed across the organization and facilitates communication and collaboration
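As a rough illustration of what an automated pipeline can look like, the sketch below chains named steps so that a data check always runs before training. It does not use any specific MLOps tool: `run_pipeline`, `validate`, and `train` are hypothetical names, and the "training" step is only a stand-in that returns the majority label.

```python
from typing import Callable

# A pipeline is an ordered list of (name, step) pairs; each step receives the
# output of the previous one. All step names and logic here are hypothetical.
Step = Callable[[object], object]

def run_pipeline(steps: list[tuple[str, Step]], payload: object) -> object:
    """Run every step in order and report progress after each one."""
    for name, step in steps:
        payload = step(payload)
        print(f"step '{name}' finished")
    return payload

def validate(rows):
    """Fail early if any row lacks a label, before any training happens."""
    missing = [r for r in rows if r.get("label") is None]
    if missing:
        raise ValueError(f"{len(missing)} row(s) have no label")
    return rows

def train(rows):
    """Stand-in for a real training job: returns the majority label."""
    labels = [r["label"] for r in rows]
    return max(set(labels), key=labels.count)

pipeline = [("validate", validate), ("train", train)]
rows = [
    {"x": 1, "label": "ok"},
    {"x": 2, "label": "ok"},
    {"x": 3, "label": "fault"},
]
model = run_pipeline(pipeline, rows)
```

Because every run goes through the same ordered steps, a dataset with missing labels stops the pipeline at the validation step instead of silently producing a model.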
2. Use tools and techniques to improve data quality
As we mentioned above, there are different aspects of high-quality data:
Quality of data labels
Labels provide information about the content of the data. It is essential for most AI algorithms to train on accurately and consistently labeled data. The problem with inaccurate labels is obvious: they provide incorrect information to the algorithm. But consistency is also important.
For instance, this image of two coffee cups (Figure 2) can be labeled individually (in red) or together (in blue). There is nothing wrong with either method, but it is important that the labeling remains consistent across different data points. A computer vision model trained on inconsistently labeled data can produce unintended results.
Figure 2. Different ways to label objects in an image.
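One simple way to surface inconsistent labeling is to have more than one annotator label the same images and compare the results. The sketch below assumes a hypothetical annotation export of `(image_id, annotator, number_of_boxes)` tuples and flags images where annotators drew a different number of boxes, as in the coffee cup example. It is a starting point for a manual review, not a full agreement metric.

```python
from collections import defaultdict

# Hypothetical annotations: (image_id, annotator, number_of_boxes_drawn).
annotations = [
    ("img_001", "ann_a", 2),  # one box per cup
    ("img_001", "ann_b", 1),  # one box around both cups
    ("img_002", "ann_a", 3),
    ("img_002", "ann_b", 3),
]

def find_inconsistent(annotations):
    """Return image ids whose annotators disagree on the number of boxes."""
    counts = defaultdict(set)
    for image_id, _annotator, n_boxes in annotations:
        counts[image_id].add(n_boxes)
    return sorted(image_id for image_id, seen in counts.items() if len(seen) > 1)

to_review = find_inconsistent(annotations)
print(to_review)
```

Images returned by `find_inconsistent` are candidates for relabeling under a single, written labeling convention.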
Complete and representative data
Gaps and missing information in the data lead to inaccurate results. It is important to have a training dataset that contains enough data points for each class and accurately represents the underlying real-world phenomenon.
Some healthcare AI models built to detect COVID-19 illustrate the problem of incomplete and unrepresentative data. For instance, one model was trained on a dataset that used chest scans of children as the non-COVID examples. As a result, the model learned to identify children, not COVID-19 cases.
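A basic check for this kind of problem is to cross-tabulate the labels against metadata and look for groups that only ever appear with one label. The sketch below assumes hypothetical `(age_group, label)` records; the group names and labels are made up for illustration.

```python
from collections import defaultdict

# Hypothetical metadata for a chest-scan dataset: (age_group, label).
records = [
    ("adult", "covid"), ("adult", "covid"), ("adult", "non-covid"),
    ("child", "non-covid"), ("child", "non-covid"),
]

def single_label_groups(records):
    """Return {group: label} for groups that appear with only one label."""
    labels_by_group = defaultdict(set)
    for group, label in records:
        labels_by_group[group].add(label)
    return {
        group: next(iter(labels))
        for group, labels in labels_by_group.items()
        if len(labels) == 1
    }

flagged = single_label_groups(records)
print(flagged)
```

A flagged group does not prove the model will take a shortcut, but it marks a place where the model could learn the group instead of the condition, so more data for that group may be needed.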
Minimizing data bias
Building AI systems involves human decision-making at many points, from data collection to labeling, which inevitably introduces biases. The outputs of AI models then reflect these biases. It may be impossible to eliminate bias, but you can minimize it with careful design.
Feel free to check our article on bias in AI and how to reduce it.
3. Involve domain expertise
Creating datasets with domain knowledge is essential to a data-centric approach to AI development. Different industries, business functions, or even different problems within the same domain may have intricacies that can escape a data scientist. Domain experts can provide the ground truth for the specific business use case where you want to apply AI and determine if the dataset accurately represents the problem at hand.
For instance, if you want to use a machine learning algorithm for the predictive maintenance of wind turbines, you will need engineers, wind turbine operators, and maintenance workers, in addition to data scientists who build the model. They can provide knowledge about the locations of the sensors, the physical quantities measured by the sensors, or the statistical properties and time-series behavior of the measurements.
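Domain knowledge of this kind can be written down as explicit data checks. The sketch below assumes that engineers have supplied a plausible range for each sensor. The sensor names and limits are invented for illustration and are not real turbine specifications.

```python
# Hypothetical plausible ranges a turbine engineer might supply; the sensor
# names and limits are illustrative assumptions, not real specifications.
EXPERT_RANGES = {
    "gearbox_temp_c": (-20.0, 120.0),
    "rotor_speed_rpm": (0.0, 25.0),
}

def implausible_readings(readings, ranges=EXPERT_RANGES):
    """Return (row_index, sensor, value) for readings outside the expert range."""
    flagged = []
    for i, row in enumerate(readings):
        for sensor, value in row.items():
            low, high = ranges[sensor]
            if not low <= value <= high:
                flagged.append((i, sensor, value))
    return flagged

readings = [
    {"gearbox_temp_c": 65.0, "rotor_speed_rpm": 14.2},
    {"gearbox_temp_c": 999.0, "rotor_speed_rpm": 13.9},  # likely a sensor fault
]
flagged = implausible_readings(readings)
print(flagged)
```

Readings flagged this way can go back to the operators, who can say whether they reflect a sensor fault or a real event the model should learn from.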
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 60% of Fortune 500 every month.