Disruptive technologies, such as AI, ML, the Internet of Things (IoT), and computer vision, require various types of data to operate. This data often includes biometric data, such as facial images and voice recordings. Collecting and managing such data requires multiple ethical and legal considerations, which, if disregarded, can lead to expensive lawsuits and significant reputational damage.
This article covers the data collection ethics and legal practices that organizations must consider when sourcing and gathering data to develop and deploy AI/ML solutions, backed by real-world case studies and current regulatory requirements.
How to achieve ethical data collection (best practices)
Extensive research 1 has been conducted on data collection ethics and how to achieve it; however, there is no single formula that guarantees ethical practice.
Ethics is more of a process and a culture that needs to be adopted by all contributors (data collectors, developers, decision-makers, sales, marketing, executives, etc.) in the development and implementation of an AI/ML solution.
1. Ethics training
Providing sufficient training on data collection ethics helps promote and embed this culture. A best practice for ensuring that the instructions are followed is an ethics checklist that staff tick off whenever they collect data.
Modern ethics training should also include:
- Understanding of current AI-specific regulations like the EU AI Act
- Recognition of algorithmic bias and fairness issues
- Knowledge of emerging privacy-preserving technologies like federated learning
- Regular updates on evolving legal requirements and case law
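The ethics checklist mentioned above can be made concrete in code. The sketch below is a minimal illustration; the class name, fields, and checklist items are all assumptions for demonstration, and a real checklist would be tailored to the organization's regulatory environment and data types:

```python
from dataclasses import dataclass, field

# Illustrative checklist items, not an authoritative list.
DEFAULT_ITEMS = [
    "Consent obtained and documented",
    "Purpose of collection explained in plain language",
    "Data minimization applied (only required fields collected)",
    "Retention period and deletion process defined",
    "Biometric data flagged for extra legal review",
]

@dataclass
class EthicsChecklist:
    collector: str
    items: dict = field(default_factory=lambda: {i: False for i in DEFAULT_ITEMS})

    def tick(self, item: str) -> None:
        if item not in self.items:
            raise KeyError(f"Unknown checklist item: {item}")
        self.items[item] = True

    def is_complete(self) -> bool:
        # Collection should only proceed when every item is ticked.
        return all(self.items.values())

checklist = EthicsChecklist(collector="field-team-7")
checklist.tick("Consent obtained and documented")
print(checklist.is_complete())  # False until all items are ticked
```

Gating collection on `is_complete()` turns the checklist from a suggestion into an enforced step in the pipeline.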
You can also check our article on data collection services.
2. Consent
Obtaining consent is one of the most critical parts of data collection ethics. It forms part of the agreement between the data owner and the collector and must be obtained before the data is collected. For instance, if a smart home device gathers voice data from its user, a notification should be displayed during app setup, giving the user the option to provide consent.
However, consent must be:
- Freely given and not bundled with service access
- Specific to each type of data processing
- Informed with clear explanations of AI training purposes
- Withdrawable at any time without penalty
- Documented and auditable for regulatory compliance
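These requirements can also be enforced programmatically before any processing happens. The sketch below assumes a hypothetical `ConsentRecord` format; the field names are illustrative, not a standard:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    # Hypothetical fields mirroring the requirements above.
    user_id: str
    purpose: str               # specific processing purpose, e.g. "voice model training"
    freely_given: bool         # not bundled with access to the service
    informed: bool             # AI training purposes were clearly explained
    granted_at: datetime       # timestamp kept for the audit trail
    withdrawn_at: Optional[datetime] = None

def consent_is_valid(record: ConsentRecord) -> bool:
    """A record authorizes processing only if consent was freely given,
    informed, tied to a specific purpose, and not withdrawn."""
    return (
        record.freely_given
        and record.informed
        and bool(record.purpose.strip())
        and record.withdrawn_at is None
    )

record = ConsentRecord(
    user_id="user-42",
    purpose="voice model training",
    freely_given=True,
    informed=True,
    granted_at=datetime.now(timezone.utc),
)
print(consent_is_valid(record))  # True

# Withdrawal at any time must invalidate further processing.
record.withdrawn_at = datetime.now(timezone.utc)
print(consent_is_valid(record))  # False
```

Storing the timestamps alongside the boolean flags is what makes the record auditable for regulators, not just usable by the application.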
3. Clarity and understanding
This means that when collectors request user consent, the request should be clearly stated in easily understandable terms. Data collectors should ensure that users fully understand what they are permitting.
Best practices include:
- Using plain language instead of legal jargon
- Providing examples of how AI models will use the data
- Offering multi-language support for global audiences
- Testing comprehension through user studies
4. Trust and consistency
This means that ethical and security practices during data collection should be applied consistently to build trust with data providers. For instance, if there are 500 data providers, all 500 should be subject to the same ethical considerations.
Modern trust-building requires:
- Regular third-party privacy audits
- Transparent reporting on data usage and sharing
- Consistent global privacy standards, not just minimum compliance
- Clear data governance policies accessible to users
5. Awareness and transparency
The data collection process should be transparent and open. The data provider should be aware of what data is being collected, who will have access to it, and how it will be utilized.
Additionally, data providers should retain control over how their data is used. For instance, if a data provider wants their data to no longer be used or shared, they should be able to opt out easily.
Enhanced transparency now includes:
- AI model cards explaining training data sources
- Regular transparency reports showing data usage statistics
- Precise opt-out mechanisms that actually work
- Notification of any data breaches within 72 hours (GDPR requirement)
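The 72-hour GDPR notification window (Article 33) can be tracked with a simple deadline calculation. This is a sketch only; actual notification obligations depend on the breach and the supervisory authority:

```python
from datetime import datetime, timedelta, timezone

# GDPR Art. 33: notify the supervisory authority within 72 hours
# of becoming aware of a personal data breach.
GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)

def notification_deadline(breach_detected_at: datetime) -> datetime:
    """Latest time to notify the supervisory authority."""
    return breach_detected_at + GDPR_NOTIFICATION_WINDOW

def hours_remaining(breach_detected_at: datetime, now: datetime = None) -> float:
    """Hours left before the notification deadline (negative if missed)."""
    now = now or datetime.now(timezone.utc)
    return (notification_deadline(breach_detected_at) - now).total_seconds() / 3600

detected = datetime(2025, 3, 1, 9, 0, tzinfo=timezone.utc)
print(notification_deadline(detected))  # 2025-03-04 09:00:00+00:00
print(hours_remaining(detected, now=datetime(2025, 3, 2, 9, 0, tzinfo=timezone.utc)))  # 48.0
```

Wiring a check like this into incident-response tooling helps ensure the deadline is tracked from the moment a breach is detected rather than reconstructed afterwards.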
6. Risk consideration
Another critical point is that the risk of future problems cannot be eliminated entirely. Therefore, the data collector must assess the risk of such unforeseen events and prepare a mitigation plan. The data collector should also communicate this risk to the data provider.
Modern risk assessment should include:
- Algorithmic bias testing and mitigation strategies
- Data breach impact assessments
- Cross-border data transfer risks
- Evolving regulatory compliance requirements
- Potential for data re-identification in AI systems
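One common way to make this assessment concrete is a simple risk register that scores each risk by likelihood times impact. The sketch below uses illustrative entries drawn from the categories above; it is not a compliance tool, and the scoring scale is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    likelihood: int   # 1 (rare) to 5 (almost certain)
    impact: int       # 1 (negligible) to 5 (severe)
    mitigation: str

    @property
    def score(self) -> int:
        # Classic qualitative risk matrix: likelihood x impact.
        return self.likelihood * self.impact

# Illustrative entries matching the risk categories listed above.
register = [
    Risk("Algorithmic bias in training data", 3, 4, "Bias testing before each release"),
    Risk("Data breach", 2, 5, "Encryption at rest, access logging"),
    Risk("Re-identification of anonymized records", 2, 4, "k-anonymity checks"),
]

# Highest-scoring risks should be mitigated and communicated first.
for risk in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{risk.score:2d}  {risk.name}: {risk.mitigation}")
```

Ranking by score gives a defensible order for allocating mitigation effort and a record that can be shared with data providers.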
Major data collection lawsuits: Real cases with financial impact
1. Meta (Facebook) Cambridge Analytica – $725 Million Settlement
One of the largest data privacy settlements in history occurred when Meta, Facebook's parent company, paid $725 million to settle a class-action lawsuit over Cambridge Analytica. The suit claimed that the social media giant gave third parties access to user data without their consent, and the payout represents the "largest recovery ever achieved in a data privacy class action and the most Facebook has ever paid."2
The case involved:
- 87 million users’ data improperly shared without consent
- Political profiling and targeted manipulation
- Data sold to third parties for AI-powered analytics
- Additional penalties: $5 billion FTC fine and $50 million Australian settlement
2. Clearview AI Biometric Privacy Violations – $51.75 Million Settlement
Clearview AI, which scraped billions of photos from social media to build facial recognition databases, settled for $51.75 million in 2025. The approved settlement agreement resolved claims that the company's automatic collection, storage, and use of biometric data violated various privacy laws, including Illinois' biometric privacy law.3
Key violations:
- Scraped over 3 billion photos without consent
- Sold access to law enforcement and private companies
- Violated the Illinois Biometric Information Privacy Act (BIPA)
- Failed to obtain proper consent for biometric data collection
3. Healthcare AI Claims Denial – Cigna, Humana, UnitedHealth (2025 Ongoing)
In early 2025, major health insurers, including Cigna, Humana, and UnitedHealth Group, were sued for allegedly using AI algorithms to wrongfully deny medical claims. One filing cited Cigna's internal process, in which an algorithm reviewed and rejected over 300,000 claims in just two months.4
Allegations include:
- AI systems denying claims without proper medical review
- Lack of transparency in algorithmic decision-making
- Potential violations of patient care standards
- Discriminatory impact on vulnerable populations
4. AI Training Data Lawsuits – OpenAI, Microsoft, Google (2024-2025)
Multiple newspaper publishers and content creators have sued AI companies for using copyrighted content without permission to train large language models. Eight newspapers, including The New York Daily News, Chicago Tribune, Denver Post, Mercury News, and Orange County Register, filed a lawsuit against OpenAI and Microsoft on April 30, 2024, alleging that the companies had used millions of copyrighted news articles to train their AI models.5
Current litigation involves:
- Unauthorized use of copyrighted content for AI training
- Claims of fair use vs. commercial exploitation
- Demands for licensing fees and attribution
- Potential damages in the billions
5. U.S. Immigration and Customs Enforcement (ICE) Facial Recognition
The Washington Post reported that U.S. Immigration and Customs Enforcement unconstitutionally collected facial image data to track the activities of immigrants without proper legal authority or consent.6
Key issues:
- Targeting specific ethnic groups
- Lack of constitutional protections for non-citizens
- No opt-out mechanisms
- Potential for discriminatory enforcement
To learn more about facial recognition, check out this quick read.
6. Voice Data Collection by Smart Home Devices – Amazon Alexa
Similarly, brands that offer smart home devices have also been under scrutiny for unethically collecting voice (biometric) data from their users.
For instance, Amazon faced a lawsuit over Alexa collecting user voice data without consent. The practice was documented in a collaborative study by researchers from the University of Washington and three other institutions, which led to the lawsuit.7
Latest regulations on data collection and protection
| Regulation | Jurisdiction | Enacted | Scope | Key Requirements |
|---|---|---|---|---|
| EU AI Act | EU | 2024 | High-risk AI systems | Mandatory risk assessments, transparency requirements, human oversight |
| Data Security Law & Cybersecurity Law | China | 2021 / 2017 | Data localization, critical infrastructure | Data localization, security assessments for "important data," network operator obligations |
| PIPL | China | 2021 | Personal information of Chinese citizens | Explicit consent, DPIA for critical data, cross-border transfer approvals |
| CCPA & CPRA | California, USA | 2020 | Personal information of residents | Right to opt out of sale, deletion request handling, privacy notice updates |
| GDPR | EU | 2018 | All personal data | Consent, DPIA, breach notification within 72 hrs, data subject rights |
| UK Data Protection Act 2018 | UK | 2018 | Mirrors GDPR | Data protection principles, UK-specific derogations, ICO enforcement powers |
| GINA | USA | 2008 | Genetic data | Prohibits use by insurers/employers, requires written consent |
| BIPA | Illinois, USA | 2008 | Biometric data | Written consent, retention limits, private right of action |
| COPPA | USA | 1998 | Data of children under 13 | Parental consent, clear privacy policy, data minimization |
- Europe’s General Data Protection Regulation (GDPR) 8 is in place in the U.S. to protect children’s data. It includes the dos and don’ts of gathering and using children’s data, such as when to take consent from the guardian, where not to use the data, etc.
- The Genetic Information Nondiscrimination Act (GINA) 9 in the U.S. protects people’s genetic data from being used by insurance companies, hospitals, and other organizations that might exploit it.
- The Federal Trade Commission (FTC) 10 in the U.S. also protects consumer data under the FTC Act and has imposed billions of dollars in AI-related fines.
- The Data Protection Act 2018 11 is the UK’s version of the GDPR.
- The Illinois Biometric Information Privacy Act (BIPA) has become a model for biometric data protection, resulting in hundreds of millions of dollars in settlements, including the $51.75 million Clearview AI case.
- Three main laws regulate data governance in China: the Cybersecurity Law (2017), the Data Security Law (2021), and the Personal Information Protection Law (PIPL, 2021), summarized in the table above.
Further reading
- Data Collection Quality Assurance
- Data Collection Methods
- Crowdsourced AI Data Collection Benefits & Best Practices
Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.
Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.
He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.
Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
