We follow ethical norms & our process for objectivity.

This research is not funded by any sponsors.

Updated on Jun 27, 2025

AI Data Collection: Risks, Challenges & Tools in 2025

Cem Dilmegani

with Sıla Ermut

AI builders need fresh, high-quality data:

Research labs need more data to develop state-of-the-art models and agents.
Software providers and enterprises need fresh data to fine-tune models and agents.

However, data collection comes with risks. For example, enterprises need to ensure that data is collected ethically to minimize reputational risk.

Dive in for a comprehensive guide on AI data collection solutions to help business leaders and developers navigate its challenges:

What are the data sources for AI training or inference?

Web

No state-of-the-art LLM or large multimodal model (LMM) avoids web data. The web includes almost all public data and a significant volume of private data. AI models need web data for:

Training, including fine-tuning: High-performing models almost always require fresh web data.
Inference: For RAG and other inference-time compute operations, web data is necessary to feed facts to models.

Real-life example: While most LLM developers do not share their data sources, the first version of LLaMA disclosed its sources, and it relied 100% on web data.¹

Private or licensed data

Enterprises are the largest owners of private data. Making this data available for training can unlock further improvements in large language models. Since this data was generated or gathered in the past, it can be pre-packaged and bought off the shelf quickly.

Real-life examples:

OpenAI forged more than 30 media partnerships to fuel its models.²
Anthropic, which does not have a content partnership with Reddit, was sued over its use of Reddit’s data.³

Crowdsourced data

Suppose an image recognition system requires image data of road signs. Through public crowdsourcing, its developers can obtain these images from the public by providing some instructions to users of the network and creating a data-sharing platform.

Working with a third-party crowdsourcing platform or service provider can make this method more cost-effective and improve data quality. However, this method cannot be used for projects involving sensitive or confidential data.

Real-world example: Reinforcement learning from human feedback (RLHF) was the major technical improvement between GPT-3 and ChatGPT. RLHF involves collecting human ratings of AI models’ responses. These ratings are then used as training data to improve the model’s responses in line with human preferences.
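
The step from human ratings to training data can be sketched as follows. This is a minimal illustration, not any lab's actual pipeline: the `Rating` record, its field names, and the 1 to 5 score scale are assumptions made for the example. It turns per-response ratings into (prompt, chosen, rejected) preference pairs, a format commonly used to train reward models.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rating:
    """One human rating of one model response (hypothetical schema)."""
    prompt: str
    response: str
    score: int  # assumed scale: 1 (worst) to 5 (best)


def build_preference_pairs(ratings):
    """Group ratings by prompt and emit (prompt, chosen, rejected) tuples.

    Two responses with equal scores carry no preference signal and are skipped.
    """
    by_prompt = defaultdict(list)
    for rating in ratings:
        by_prompt[rating.prompt].append(rating)

    pairs = []
    for prompt, group in by_prompt.items():
        for a, b in combinations(group, 2):
            if a.score == b.score:
                continue
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append((prompt, chosen.response, rejected.response))
    return pairs


# Illustrative ratings for a single prompt.
ratings = [
    Rating("Explain RAG briefly.", "RAG retrieves documents and feeds them to the model.", 5),
    Rating("Explain RAG briefly.", "RAG is a type of database.", 2),
    Rating("Explain RAG briefly.", "RAG stands for red, amber, green.", 2),
]
pairs = build_preference_pairs(ratings)
```

Skipping tied scores is a design choice for this sketch; a real pipeline might also aggregate ratings from several raters before pairing.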

Synthetic data

Synthetic data is artificially generated data that mimics real-world data. It is a form of automated data generation and can leverage both traditional AI and generative AI in augmenting existing AI training datasets or creating new ones.

Synthetic data is especially useful when you have limited labeled data, as it can improve the model’s accuracy and generalization capabilities.

Is your AI data collection ethical and compliant?

If you have collected web data for AI, you can review our ethical web data research to learn more. For other data sources, there are 3 dimensions to ethical data:

Legal data collection

While collecting public data is legal in most cases, there are exceptions.

Collecting private data can be illegal for many reasons. For example, in most jurisdictions, it is illegal to collect private personally identifiable information (PII) without consent.

Real-life illegal private data collection example:

Cambridge Analytica collected private data belonging to 87 million Facebook users in deceptive ways. As a result:

Meta (then Facebook) was fined $5 billion in the US.
The public outcry led Cambridge Analytica to declare bankruptcy.

Collecting public data that includes copyrighted material can also be illegal.

Real-life public data collection controversy: Meta collected copyrighted books using the LibGen file sharing project.⁴ This led to ongoing litigation by the authors.

Ethical data collection

Legal data collection can be unethical.

Real-life ethical data collection controversy:

Location data brokers track and sell users’ locations via apps that legitimately gain access to users’ data after collecting user consent. However, the collected data can be used for illicit actions like financial scams. Therefore, regulators stepped in to start regulating location data sales.⁵

Data supply chain

Regardless of whether you collect web data yourself or use web data collected by others, unethical data collection can harm your business. In some jurisdictions like Germany, businesses are responsible for their suppliers’ legal conduct. And regardless of jurisdiction, enterprises have suffered reputational damage due to their suppliers’ conduct.

Real-life example of an enterprise’s attempt to reduce reputational damage from its data supplier:

Businesses are aware of the reputational damage their data supply chains can create. When one of Equifax’s data suppliers experienced a data leak, Equifax sent them a cease and desist letter to have references to Equifax removed from the website.⁶

Your business’ risk

If your business relies on web data, it may be at risk.

Even if your business is not collecting web data in an illegal or unethical manner, it is still at risk. Your data providers’ unethical or illegal actions put your business’ reputation at risk.

If your business relies on web data, it is most probably working with a data collection service provider. For any large scale data collection, these companies’ services are indispensable.

Your risk of exposure to unethical or noncompliant data collection is high. This is because most data collection operations in our ethical web data benchmark fell short of the necessary level of transparency to ensure that their services are working in a legal and ethical manner.

What is the cost of unethical or noncompliant AI data collection?

Commercial risk

Data collection challenges can limit an AI model’s revenue through 2 mechanisms:

Performance issues

AI models that fail to deliver results due to data issues are unlikely to gain market share.

Compliance issues

Enterprises review AI models rigorously. For example, data provenance audits cover the entire data collection supply chain. Unethical data collection practices or not providing indemnification can limit enterprise adoption.

Real-life concerns about indemnification: Indemnification shields business users from specific legal risks like copyright infringement. Through private discussions with enterprises, the AIMultiple team identified enterprises that avoided Meta’s Llama because it lacked indemnification.

OpenAI forced to retain user logs: OpenAI has been forced to contradict its existing privacy policy and retain user logs as part of the New York Times lawsuit about its use of public but copyrighted material in model training.⁷ Though enterprise users were excluded, it brings uncertainty to OpenAI’s commitments regarding user privacy.

Legal risk

The level of legal risk depends on how and where the solution is used:

Internal use

This is the most limited case as the only source of litigation can be employees or 3rd parties that are harmed by the model.

Real-world job applicant lawsuit against AI hiring tool:

iTutorGroup settled with the Equal Employment Opportunity Commission (EEOC), agreeing to pay $365,000 to applicants who had been denied an interview because of their age. The company also agreed to improve its anti-discrimination and complaint procedures.

The situation was discovered when an applicant submitted two applications that were identical except that one had a later birth date.⁸
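
The paired-application check described above can be automated as a counterfactual test: score two records that differ only in birth date and flag any change in the decision. The sketch below is illustrative only. `screen_applicant` is a hypothetical stand-in for the screening model under test, and its deliberately biased age cutoff is an assumption made so that the test has something to detect.

```python
from datetime import date


def screen_applicant(applicant):
    """Hypothetical screening model with a deliberate age cutoff (demo only)."""
    age = (date(2025, 1, 1) - applicant["birth_date"]).days // 365
    return applicant["years_experience"] >= 3 and age < 55


def decision_changes_with_birth_date(model, applicant, alt_birth_date):
    """Return True if changing only the birth date changes the model's decision."""
    counterfactual = {**applicant, "birth_date": alt_birth_date}
    return model(applicant) != model(counterfactual)


# Illustrative applicant record; the field names are assumptions.
applicant = {
    "name": "A. Candidate",
    "years_experience": 10,
    "birth_date": date(1965, 6, 1),
}
flagged = decision_changes_with_birth_date(
    screen_applicant, applicant, date(1985, 6, 1)
)
```

A flagged pair does not prove unlawful discrimination on its own, but it is a cheap signal that the model's output depends on a protected attribute and that the training data deserves review.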

Use by customers

Customers can sue model developers for lack of performance.

User’s family sued model developer: Character.AI chatbots have been accused of encouraging self-harm. A teenager died by suicide after conversing with one of Character.AI’s chatbots.⁹

Use by customers that have been provided indemnification

This results in the maximum level of legal liability as your company’s risk includes the risk of all your clients’ work completed using your models.

Indemnification provided by hyperscalers: Google and Microsoft provide indemnification for their models and this indemnification is expected by enterprises.¹⁰

Categories of legal risk

Legal risk takes many forms. It can arise from:

Contractual liabilities, which can cover guarantees of lawful data collection
IP risks due to the use of copyrighted or private datasets in training data
Product liability due to performance issues (e.g. model bias) caused by unethical data collection
Regulatory compliance, especially compliance with data protection regulations like CCPA or GDPR

Regulatory compliance depends on the jurisdiction:

International differences in regulatory approaches

Globally, data privacy regulations vary significantly, showcasing different approaches to individual rights and data handling.¹¹

Notable examples from the world’s largest economies:

The EU General Data Protection Regulation (GDPR) is a comprehensive regulation emphasizing strong individual rights, broad definitions of personal data, and strict consent requirements. As an early and comprehensive law, it shaped the laws that came after it.

The California Consumer Privacy Act (CCPA) has a narrower scope than GDPR and a less centralized enforcement approach. It provides significant privacy rights to California residents, including the rights to know, to opt out of sale, and to deletion.

The Personal Information Protection Law (PIPL) of China is similar to GDPR. PIPL mandates informed consent for personal data processing, with separate explicit consent required for sensitive data, overseas transfers, and public disclosure. Comprehensive privacy notices detailing processing are also mandatory.

Reputational risk

AI companies frequently face backlash due to unethical data practices. So far, boycotts haven’t reached enough scale to slow down most companies’ growth since the market has been growing quite fast. However, reputational risk may become more important as the market matures. Brand damage can make the difference between success and failure as product differentiation diminishes with the commoditization of critical capabilities.

Stability AI’s use of copyrighted material led to numerous leaders leaving the company.¹²
Underpaid knowledge workers in Kenya serving a supplier of OpenAI were exposed to disturbing material.¹³

Other risks

Operational risks, talent churn, and consumer boycotts are all possible due to challenges in data collection.

For example, if an AI research lab needs to remove a certain dataset from its training data, it would need to:

Retire all models that leveraged that dataset
Create a new data pipeline to replace the removed dataset
Retrain models, which can cost up to hundreds of millions of dollars¹⁴
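
The first of these steps depends on knowing which models were trained on which datasets. A minimal sketch of such a lineage lookup is shown below. The registry structure and all model and dataset names are hypothetical, and the sketch assumes that a model fine-tuned from an affected base model is itself affected.

```python
def models_to_retire(datasets_by_model, base_model_of, removed_dataset):
    """Return models trained on the removed dataset, directly or via a base model."""
    affected = {
        model
        for model, datasets in datasets_by_model.items()
        if removed_dataset in datasets
    }
    # Propagate to fine-tuned descendants until no new models are added.
    changed = True
    while changed:
        changed = False
        for model, base in base_model_of.items():
            if base in affected and model not in affected:
                affected.add(model)
                changed = True
    return sorted(affected)


# Hypothetical lineage registry: model name -> datasets used in its own training.
datasets_by_model = {
    "chat-v1": {"web-crawl-2023", "licensed-news"},
    "chat-v1-support": {"support-tickets"},
    "code-v1": {"public-code"},
}
# Hypothetical fine-tuning relationships: model name -> base model name.
base_model_of = {"chat-v1-support": "chat-v1"}

affected = models_to_retire(datasets_by_model, base_model_of, "licensed-news")
```

Keeping this registry up to date during training is what makes the later removal tractable; reconstructing lineage after the fact is much harder.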

Checklist for AI training data

AI training data needs to fulfill these requirements to minimize risk and create an AI product that is ready for enterprise adoption:

Legal

Data in these groups is legal to use:

Licensed
In the public domain
Owned by parties that allow its reuse without attribution in AI model training.

In all cases, data needs to be used according to its licensing terms. Since models can leak their training data, data owners are scouring generative AI models for their copyrighted data and initiating legal action when they find it.
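
One way to apply this requirement in practice is a license gate at ingestion time. The sketch below is an illustration under stated assumptions: the `license` field, the tag values, and the allowlist are hypothetical, and deciding which licenses actually permit AI training is a legal question, not something code can settle.

```python
# Assumed license tags mirroring the three groups listed above.
ALLOWED_LICENSES = frozenset({"licensed", "public-domain", "owner-permitted"})


def split_by_license(records, allowed=ALLOWED_LICENSES):
    """Split records into usable ones and ones quarantined for legal review.

    Records with a missing or unrecognized license tag are quarantined,
    never silently accepted.
    """
    usable, quarantined = [], []
    for record in records:
        if record.get("license") in allowed:
            usable.append(record)
        else:
            quarantined.append(record)
    return usable, quarantined


# Illustrative records; ids and tags are made up for the example.
records = [
    {"id": 1, "license": "public-domain"},
    {"id": 2, "license": "all-rights-reserved"},
    {"id": 3},  # no license information at all
    {"id": 4, "license": "licensed"},
]
usable, quarantined = split_by_license(records)
```

Defaulting to quarantine keeps unknown-provenance data out of training until someone has reviewed it.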

Ethical

Ethical data collection goes beyond legal data collection by ensuring that the data collection does not harm the data owners or other stakeholders.

We identified guidelines for ethical web data which should be followed while using web data for AI.

High quality

Hallucinations of LLMs and AI bias are major impediments to AI adoption. Training data quality is key to reducing hallucinations.

Another component of quality is data freshness. Models trained on outdated data can misinform users.

Real-world example of garbage in, garbage out in LLMs: During its launch, Google’s AI Overviews recommended using glue to bind cheese to pizza based on an old comment on Reddit.
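
The freshness requirement above can be enforced with a simple age filter at ingestion time. This is a minimal sketch, assuming each record carries a timezone-aware `published_at` timestamp; the field name and the 365-day threshold are illustrative choices, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def filter_fresh(records, max_age_days, now=None):
    """Keep only records published within the last `max_age_days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in records if r["published_at"] >= cutoff]


# Illustrative records with a fixed reference time for reproducibility.
reference_time = datetime(2025, 6, 27, tzinfo=timezone.utc)
records = [
    {"id": "recent-post", "published_at": datetime(2025, 5, 1, tzinfo=timezone.utc)},
    {"id": "old-comment", "published_at": datetime(2013, 1, 1, tzinfo=timezone.utc)},
]
fresh = filter_fresh(records, max_age_days=365, now=reference_time)
```

Age is only a proxy for quality: old content can still be correct and recent content can be wrong, so a filter like this complements rather than replaces content review.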

Secure

Security to protect competitive advantage

While aspects like user interface play a critical role in model adoption, data, compute and algorithms are the 3 core ingredients of an AI model. Therefore, data is a source of competitive advantage for AI model builders and it needs to be secured for this competitive advantage to be retained.

Security to prevent reputational or financial harm

A security issue could:

Expose licensed data and harm data owners
Shut down or compromise services and harm customers

Therefore, AI model builders need to invest in AI and LLM cybersecurity to protect their data.

6 steps to AI data collection

1. Identifying the need

Identifying the need is the most crucial initial step in the data collection process. Without a clear focus, data collection efforts can be hard to finalize.

2. Selecting the method

Select the collection method which is most suitable for your project. For example, if conversation data between a patient and a doctor is necessary, AI model builders can contact:

Hospitals and other healthcare providers to license their data
Crowdsourcing companies for simulated conversations

3. Quality assurance

The cost of fixing data quality issues increases as the system matures. Therefore, in production systems, quality assurance should be prioritized during data collection.

4. Storage

A sound storage plan is essential regardless of your chosen method for collecting data. Consider privacy concerns, storage capacity, frequency of access, post-storage data management, etc.

5. Data annotation

Data annotation refers to the process of assigning labels to data to make it understandable for machines. It is an indispensable part of supervised learning. Although this step doesn’t entail collecting the data itself, it plays a crucial role in preparing the dataset for its ultimate application.

6. Verify

Check collected data against the checklist for AI training data with a data review including an audit of data pipelines.
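
A verification pass can be as simple as confirming that every dataset has recorded evidence for each checklist item. The sketch below assumes a hypothetical manifest that maps each dataset to evidence references. The four keys mirror the checklist headings in this article, while the dataset names and evidence values are made up for illustration.

```python
# Checklist items taken from the section headings: legal, ethical, high quality, secure.
CHECKLIST = ("legal", "ethical", "high_quality", "secure")


def find_gaps(manifest):
    """Return, per dataset, the checklist items that have no recorded evidence."""
    gaps = {}
    for dataset, evidence in manifest.items():
        missing = [item for item in CHECKLIST if not evidence.get(item)]
        if missing:
            gaps[dataset] = missing
    return gaps


# Hypothetical manifest: dataset -> checklist item -> evidence reference.
manifest = {
    "licensed-news": {
        "legal": "contract-014",
        "ethical": "review-07",
        "high_quality": "qa-report-3",
        "secure": "encrypted-at-rest",
    },
    "forum-dump": {
        "legal": "",  # empty evidence counts as missing
        "ethical": "review-09",
        "high_quality": "qa-report-4",
    },
}
gaps = find_gaps(manifest)
```

A check like this only confirms that evidence exists; the audit of data pipelines still has to confirm that the evidence is valid.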

FAQ about AI data collection

What is AI data collection?

AI data collection, also known as data harvesting, is the process of extracting data from various sources such as websites, online surveys, user feedback forms, customer social media posts, and ready-made datasets to be used in training and improving AI and machine learning (ML) models.
This process is foundational to creating AI systems, as the performance of these models heavily depends on the accuracy of the data they are trained on.
