
Ethical & Legal AI Data Collection

Cem Dilmegani
updated on Aug 29, 2025

Disruptive technologies, such as AI, ML, the Internet of Things (IoT), and computer vision, require various types of data to operate. This data often includes biometric data, such as facial images and voice recordings. Collecting and managing such data requires multiple ethical and legal considerations, which, if disregarded, can lead to expensive lawsuits and significant reputational damage.

This article covers the data collection ethics and legal practices that organizations must consider when sourcing and gathering data to develop and deploy AI/ML solutions, backed by real-world case studies and current regulatory requirements.

How to achieve data collection ethics? (Best practices)

Extensive research 1 has been conducted on data collection ethics and how to achieve it; however, there is no single formula that guarantees ethical data collection.

Ethics is more of a process and a culture that needs to be adopted by all contributors (data collectors, developers, decision-makers, sales, marketing, executives, etc.) in the development and implementation of an AI/ML solution. 

1. Ethics training

Providing sufficient training about data collection ethics can be beneficial in promoting and adopting the culture. A best practice to ensure the instructions are followed is to use an ethics checklist that staff should tick off whenever they collect data. 
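
As a minimal sketch of what such a checklist could look like in practice (the item names below are illustrative assumptions, not taken from any standard framework):

```python
# Hypothetical ethics checklist; the item names are illustrative, not a standard.
from dataclasses import dataclass, fields


@dataclass
class EthicsChecklist:
    consent_obtained: bool = False
    purpose_explained: bool = False
    data_minimized: bool = False           # only data that is actually needed is collected
    retention_period_defined: bool = False
    risk_assessment_done: bool = False

    def incomplete_items(self) -> list[str]:
        """Return the names of any items that have not been ticked off yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


checklist = EthicsChecklist(consent_obtained=True, purpose_explained=True)
missing = checklist.incomplete_items()
if missing:
    print(f"Collection should not proceed; unchecked items: {missing}")
```

The point of encoding the checklist is that a collection job can be blocked automatically until every item has been confirmed, rather than relying on memory.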

Modern ethics training should also include:

  • Understanding of current AI-specific regulations like the EU AI Act
  • Recognition of algorithmic bias and fairness issues
  • Knowledge of emerging privacy-preserving technologies like federated learning
  • Regular updates on evolving legal requirements and case law

You can also check our article on data collection services.

2. Consent

Obtaining consent is one of the most critical parts of data collection ethics. It is part of the agreement between the data owner and the collector, and it should happen before any data is collected. For instance, if a smart home device gathers voice data from its user, a notification should be displayed during app setup, giving the user the option to provide consent.

However, consent must be:

  • Freely given and not bundled with service access
  • Specific to each type of data processing
  • Informed with clear explanations of AI training purposes
  • Withdrawable at any time without penalty
  • Documented and auditable for regulatory compliance
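
As a minimal sketch of how these properties could be captured in practice (the record structure and field names below are hypothetical, not taken from any specific product or regulation):

```python
# Hypothetical consent record; field names and structure are illustrative only.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ConsentRecord:
    user_id: str
    purpose: str               # specific processing purpose, e.g. "voice model training"
    explanation_shown: str     # the plain-language text the user actually saw
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None

    def withdraw(self) -> None:
        """Record withdrawal instead of deleting the record, so the history stays auditable."""
        self.withdrawn_at = datetime.now(timezone.utc)

    @property
    def active(self) -> bool:
        return self.withdrawn_at is None


# One record per purpose: consenting to analytics does not imply consenting to AI training.
consent = ConsentRecord(
    user_id="user-123",
    purpose="voice model training",
    explanation_shown="Your voice clips may be used to improve speech recognition.",
    granted_at=datetime.now(timezone.utc),
)
consent.withdraw()
print(consent.active)  # False
```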

3. Clarity and understanding

This means that when collectors require user consent, their request should be stated clearly in easily understandable language. Data collectors should ensure that users fully understand what they are permitting.

Best practices include:

  • Using plain language instead of legal jargon
  • Providing examples of how AI models will use the data
  • Offering multi-language support for global audiences
  • Testing comprehension through user studies

4. Trust and consistency

This means that ethical and security practices during data collection should be applied consistently to build trust with data providers. For instance, if there are 500 data providers, all 500 should be subject to the same ethical considerations.

Modern trust-building requires:

  • Regular third-party privacy audits
  • Transparent reporting on data usage and sharing
  • Consistent global privacy standards, not just minimum compliance
  • Clear data governance policies accessible to users

5. Awareness and transparency

The data collection process should be transparent and open. The data provider should be aware of what data is being collected, who will have access to it, and how it will be utilized. 

Additionally, data providers should have control over how their data is used. For instance, if a data provider wants their data to no longer be used or shared in the future, they should be able to opt out easily (a minimal opt-out sketch follows the list below).

Enhanced transparency now includes:

  • AI model cards explaining training data sources
  • Regular transparency reports showing data usage statistics
  • Precise opt-out mechanisms that actually work
  • Notification of any data breaches within 72 hours (GDPR requirement)
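
The opt-out and breach-notification points above could translate into logic like the following rough sketch (the in-memory dataset registry and function names are assumptions made for illustration, not a real system):

```python
# Rough sketch only; the in-memory dataset registry stands in for real storage systems.
from datetime import datetime, timedelta, timezone

GDPR_BREACH_NOTIFICATION_WINDOW = timedelta(hours=72)  # GDPR Art. 33 notification window


def breach_notification_deadline(detected_at: datetime) -> datetime:
    """Latest time the supervisory authority should be notified after a breach is detected."""
    return detected_at + GDPR_BREACH_NOTIFICATION_WINDOW


def handle_opt_out(user_id: str, datasets: dict[str, set[str]]) -> None:
    """Remove the user from every downstream dataset so the opt-out actually takes effect."""
    for name, users in datasets.items():
        if user_id in users:
            users.discard(user_id)
            print(f"{user_id} removed from {name}")


datasets = {
    "training_set": {"user-123", "user-456"},
    "analytics": {"user-123"},
}
handle_opt_out("user-123", datasets)
print(breach_notification_deadline(datetime.now(timezone.utc)))
```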

6. Risk consideration

Another critical point to consider is that the risk of future problems cannot be eliminated. Therefore, the data collector must assess the risk of such unforeseen events and prepare a mitigation plan. Additionally, the data collector should communicate this risk to the data provider.

Modern risk assessment should include:

  • Algorithmic bias testing and mitigation strategies (a minimal check is sketched after this list)
  • Data breach impact assessments
  • Cross-border data transfer risks
  • Evolving regulatory compliance requirements
  • Potential for data re-identification in AI systems
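
As one small illustration of the bias-testing item above, a team might start with a simple demographic parity check. The outcome data, group labels, and threshold below are invented for the example; a real audit needs domain and legal review:

```python
# Illustrative bias check: demographic parity difference between two groups.
# The outcome data and the 0.2 threshold are made up for this example.
def positive_rate(outcomes: list[int]) -> float:
    return sum(outcomes) / len(outcomes)


def demographic_parity_difference(group_a: list[int], group_b: list[int]) -> float:
    """Absolute gap between the groups' positive-outcome rates (0 means parity)."""
    return abs(positive_rate(group_a) - positive_rate(group_b))


# 1 = the model approved the claim/application, 0 = it denied it
group_a = [1, 1, 0, 1, 1, 0, 1, 1]   # e.g. majority group
group_b = [1, 0, 0, 0, 1, 0, 0, 1]   # e.g. protected group

gap = demographic_parity_difference(group_a, group_b)
print(f"parity gap: {gap:.2f}")
if gap > 0.2:
    print("Flag for review: a mitigation plan is needed before deployment")
```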

Major data collection lawsuits: Real cases with financial impact

1. Meta (Facebook) Cambridge Analytica – $725 Million Settlement

One of the largest data privacy settlements in history occurred when Meta, Facebook’s parent company, paid $725 million to settle a class-action lawsuit over the Cambridge Analytica scandal. The lawsuit claimed the social media giant gave third parties access to user data without their consent, and the settlement represents the “largest recovery ever achieved in a data privacy class action and the most Facebook has ever paid.”2

The case involved:

  • 87 million users’ data improperly shared without consent
  • Political profiling and targeted manipulation
  • Data sold to third parties for AI-powered analytics
  • Additional penalties: $5 billion FTC fine and $50 million Australian settlement

2. Clearview AI Biometric Privacy Violations – $51.75 Million Settlement

Clearview AI, which scraped billions of photos from social media to build facial recognition databases, settled for $51.75 million in 2025. The settlement resolved claims that the company’s automatic collection, storage, and use of biometric data violated various privacy laws, including Illinois’ biometric privacy law.3

Key violations:

  • Scraped over 3 billion photos without consent
  • Sold access to law enforcement and private companies
  • Violated the Illinois Biometric Information Privacy Act (BIPA)
  • Failed to obtain proper consent for biometric data collection

3. Healthcare AI Claims Denial – Cigna, Humana, UnitedHealth (2025 Ongoing)

In early 2025, major health insurers, including Cigna, Humana, and UnitedHealth Group, were sued for allegedly using AI algorithms to wrongfully deny medical claims. One filing cited Cigna’s internal process, in which an algorithm reviewed and rejected over 300,000 claims in just two months.4

Allegations include:

  • AI systems denying claims without proper medical review
  • Lack of transparency in algorithmic decision-making
  • Potential violations of patient care standards
  • Discriminatory impact on vulnerable populations

4. AI Training Data Lawsuits – OpenAI, Microsoft, Google (2024-2025)

Multiple newspaper publishers and content creators have sued AI companies for using copyrighted content without permission to train large language models. On April 30, 2024, eight newspapers, including The New York Daily News, Chicago Tribune, Denver Post, Mercury News, and Orange County Register, filed a lawsuit against OpenAI and Microsoft, alleging that the companies had purloined millions of copyrighted news articles to train their AI.5

Current litigation involves:

  • Unauthorized use of copyrighted content for AI training
  • Claims of fair use vs. commercial exploitation
  • Demands for licensing fees and attribution
  • Potential damages in the billions

5. U.S. Immigration and Customs Enforcement (ICE) Facial Recognition

The Washington Post reported that U.S. Immigration and Customs Enforcement unconstitutionally collected facial image data to track the activities of immigrants, without proper legal authority or consent.6

Watch the video to see how JFK Airport only gathers facial images of foreigners:

Key issues:

  • Targeting specific ethnic groups
  • Lack of constitutional protections for non-citizens
  • No opt-out mechanisms
  • Potential for discriminatory enforcement

To learn more about facial recognition, check out this quick read.

6. Voice Data Collection by Smart Home Devices – Amazon Alexa

Similarly, brands that offer smart home devices have also been under scrutiny for unethically collecting voice (biometric) data from their users.

For instance, Amazon faced a lawsuit over Alexa collecting user voice data without consent. The practice was identified in a collaborative study by researchers from the University of Washington and three other institutions, which led to the lawsuit.7

Watch this video to see how smart home devices gather user data:

Latest regulations on data collection and protection

| Regulation | Jurisdiction | Enacted | Scope | Key Requirements |
|---|---|---|---|---|
| EU AI Act | EU | 2024 | High-risk AI systems | Mandatory risk assessments, transparency requirements, human oversight |
| Data Security Law & Cybersecurity Law | China | 2021/2017 | Data localization, critical infrastructure | Data localization, security assessments for “important data,” network operator obligations |
| PIPL | China | 2021 | Personal information of Chinese citizens | Explicit consent, DPIA for critical data, cross-border transfer approvals |
| CCPA & CPRA | California, USA | 2020 | Personal information of residents | Right to opt out of sale, deletion request handling, privacy notice updates |
| GDPR | EU | 2018 | All personal data | Consent, DPIA, breach notification within 72 hrs, data subject rights |
| UK Data Protection Act 2018 | UK | 2018 | Mirrors GDPR | Data protection principles, UK-specific derogations, ICO enforcement powers |
| GINA | USA | 2008 | Genetic data | Prohibits use by insurers/employers, requires written consent |
| BIPA | Illinois, USA | 2008 | Biometric data | Written consent, retention limits, private right of action |
| COPPA | USA | 1998 | Data of children under 13 | Parental consent, clear privacy policy, data minimization |

  • Europe’s General Data Protection Regulation (GDPR) 8 grants individuals the right to delete their data from the systems where it was uploaded and has resulted in fines exceeding €2 billion since 2018.
  • The EU AI Act (2024) introduces the world’s first comprehensive AI regulation, requiring risk assessments, transparency measures, and human oversight for high-risk AI systems.
  • The Children’s Online Privacy Protection Act (COPPA) 8 protects children’s data in the U.S. It includes the dos and don’ts of gathering and using children’s data, such as when to obtain consent from a guardian and where the data must not be used.
  • The Genetic Information Nondiscrimination Act (GINA) 9 in the U.S. protects people’s genetic data from being used by insurance companies, hospitals, and other organizations that might exploit it.
  • The Federal Trade Commission (FTC) Act 10 in the U.S. also protects consumer data, and the FTC has imposed billions of dollars in AI-related fines.
  • The Data Protection Act 2018 11 is the UK’s version of the GDPR.
  • The Illinois Biometric Information Privacy Act (BIPA) has become a model for biometric data protection, resulting in hundreds of millions of dollars in settlements, including the $51.75 million Clearview AI case.
  • 3 main laws regulate data governance in China:
    • The Data Security Law (DSL)12
    • Personal Information Protection Law (PIPL)13
    • China’s Cybersecurity Law14
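
As a rough, non-authoritative illustration of how the table above might feed into an internal compliance triage step (the mapping below is a simplification based only on the table, and the function is hypothetical, not legal advice):

```python
# Simplified mapping from data characteristics to the regulations listed above.
# Assumption-laden sketch based on the table; real triage needs legal review.
def applicable_regulations(data_type: str, jurisdictions: set[str]) -> set[str]:
    regs: set[str] = set()
    if "EU" in jurisdictions:
        regs.add("GDPR")
    if "UK" in jurisdictions:
        regs.add("UK Data Protection Act 2018")
    if "California" in jurisdictions:
        regs.add("CCPA/CPRA")
    if "China" in jurisdictions:
        regs.update({"PIPL", "Data Security Law", "Cybersecurity Law"})
    if data_type == "biometric" and "Illinois" in jurisdictions:
        regs.add("BIPA")
    if data_type == "genetic" and "USA" in jurisdictions:
        regs.add("GINA")
    if data_type == "children_under_13" and "USA" in jurisdictions:
        regs.add("COPPA")
    return regs


print(applicable_regulations("biometric", {"Illinois", "EU"}))
# e.g. {'BIPA', 'GDPR'}
```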


Reference Links

1
Ethical Machine Learning in Healthcare | Annual Reviews
2
CNBC – Meta Settlement
3
Regulatory Oversight – Clearview Settlement
4
Traverse Legal – AI Healthcare Litigation
5
TechTarget – AI Lawsuits Explained
6
FBI, ICE find state driver’s license photos are a gold mine for facial-recognition searches - The Washington Post
7
Lawsuit alleges Amazon uses Alexa interactions for ad targeting without users' knowledge or consent – GeekWire
8
General Data Protection Regulation - Wikipedia
Children’s Online Privacy Protection Act - Wikipedia
9
Genetic Information Nondiscrimination Act - Wikipedia
10
About the FTC | Federal Trade Commission
11
https://www.gov.uk/data-protection
12
Data Security Law of the People's Republic of China - Wikipedia
13
Personal Information Protection Law of the People's Republic of China - Wikipedia
14
Cybersecurity Law of the People's Republic of China - Wikipedia
