AIMultiple Research
Web Scraping
Updated on Jun 18, 2025

Ethical & Compliant Web Data Benchmark in 2025

As enterprises scale their web data operations, compliance, data, and risk executives increasingly evaluate the associated ethical, reputational, and legal risks.

We benchmarked 7 leading web data collection services across 3 dimensions and tested each service with more than 20 potentially unethical scenarios.

Our work helps you assess the ethical standing of your data collection practices and understand the potential consequences of unethical approaches. We also provide guidelines for ethical web data collection and assess web data collection services from an ethics and compliance perspective:

Assessment of web data collection services

We evaluated leading web data collection services (also called web data providers or web data infrastructure) using our ethical web data checklist. These scores represent maturity levels with 5 being the highest level:

Updated at 06-18-2025

| Provider | Summary | Ethical use by customers | Ethical supply | Organizational maturity | Insurance coverage shared** |
| --- | --- | --- | --- | --- | --- |
| Bright Data | Level 5 | Level 5 | Level 5 | ✅* | |
| Oxylabs | Level 2 | Level 4 | Level 2 | Certified for data security | |
| Apify | Level 1 | Level 1 | Level 1 | Certified for data security | |
| Decodo | Level 1 | Level 3 | Level 1 | | |
| Zyte | Level 1 | Level 1 | Level 0 | Certified for data security | |
| NetNut | Level 1 | Level 1 | Level 0 | Certified for data security | TBD |
| Nimble | Level 1 | Level 1 | Level 0 | Certified for data security | |

* Indicates that the company achieved all external certifications in this category.

** ✅ indicates that the company chose to share its insurance certificates with AIMultiple. ❌ indicates that the company decided not to share its insurance certificates with us and therefore we couldn’t validate their insurance cover. Insurance cover is the only category where we relied on web data services companies’ participation to evaluate them.

Sorting is by summary score.

Scoring model for ethical web data

Below, we outline how these scores are derived. You can also see the rationale for selecting these scoring dimensions.

In the first 2 categories, we identified 5 competencies, and companies received scores based on the number of competencies that they satisfied. Level 5 represents the highest maturity observed in the market, reflecting current best practices rather than perfection.

Capabilities for ethical use by customers

  • Effective processes for ethical use: We assess each provider’s ability to prevent unethical use of their residential proxy services through controlled testing scenarios. If any one of our requests gets blocked by the provider, then this is achieved.
  • Improved processes for ethical use: Similar to “effective processes for ethical use”. However, this capability denotes that the service provider blocked more than one of our attempts to use their services for unethical use cases.
  • Best practice processes for ethical use: Similar to “effective processes for ethical use”. However, this capability denotes that the service provider blocked most of our attempts to use their services for unethical use cases.
  • Abuse management foundation: Publishing an abuse management policy and a method to report abuse.
  • Responsive abuse management: We measured how companies responded to multiple abuse reports. Even if there was no hotline for reporting abuse, we used the emails listed by the company to reach their team. If we didn’t receive any responses to our report within a week, the company is assumed to be unresponsive.
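The five competencies above can be counted mechanically. The following is a minimal sketch, not AIMultiple's actual scoring code: the class and field names are hypothetical, and "most of our attempts" is read here as more than half of the test scenarios, which is an assumption since no exact cutoff is stated.

```python
from dataclasses import dataclass


@dataclass
class EthicalUseEvidence:
    """Hypothetical container for the observations made about one provider."""
    blocked_requests: int           # test scenarios the provider blocked
    total_requests: int             # test scenarios attempted
    abuse_policy_and_channel: bool  # published abuse policy plus a way to report abuse
    replied_within_week: bool       # answered an abuse report within one week


def ethical_use_level(evidence: EthicalUseEvidence) -> int:
    """Return the number of satisfied competencies (0 to 5)."""
    competencies = [
        evidence.blocked_requests >= 1,  # effective processes
        evidence.blocked_requests > 1,   # improved processes
        # Best practice processes: "most" is assumed to mean more than half.
        evidence.blocked_requests * 2 > evidence.total_requests,
        evidence.abuse_policy_and_channel,  # abuse management foundation
        evidence.replied_within_week,       # responsive abuse management
    ]
    return sum(competencies)
```

For instance, a provider that blocks nothing and publishes no abuse policy but answers abuse reports within a week satisfies one competency under this reading.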

Capabilities for ethical supply

Ethical supply involves acquiring IP addresses in an ethical manner. Our market analysis identified the following levels of transparency regarding ethical IP supply: 

  • Level 1: Published IP sourcing policy.
  • Level 2: Disclosed at least one source (e.g. a mobile app) that supplies IPs in an ethical manner. The disclosed sources should have at least 10k reviews in total on third-party platforms, including the Google, Apple and Amazon app stores and Trustpilot.
  • Level 3: Same as Level 2 but with 100k reviews
  • Level 4: Same as Level 3 but with 1M reviews
  • Level 5: Same as Level 4 but with 10M reviews
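The level thresholds can be expressed as a small lookup. This is a sketch under two assumptions that the text does not spell out: the levels are cumulative (a provider without a published sourcing policy stays at Level 0), and "at least" means greater than or equal to. The function and parameter names are hypothetical, and the app-level best practices described below are not modelled.

```python
# Minimum total third-party reviews needed for Levels 5 down to 2, as listed above.
REVIEW_THRESHOLDS = [
    (5, 10_000_000),
    (4, 1_000_000),
    (3, 100_000),
    (2, 10_000),
]


def ethical_supply_level(policy_published: bool, source_disclosed: bool, total_reviews: int) -> int:
    """Map a provider's supply transparency to a level from 0 to 5."""
    if not policy_published:
        return 0  # assumption: without a published policy no level is reached
    if source_disclosed:
        for level, min_reviews in REVIEW_THRESHOLDS:
            if total_reviews >= min_reviews:
                return level
    return 1  # policy published, but no sufficiently reviewed source disclosed
```

With the review counts reported later in this benchmark, 14,617,919 reviews map to Level 5 and 19,224 reviews map to Level 2.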

Reviews are an indicator of the popularity of apps and are an important signal for this assessment. Web data collection services need to work with popular applications to be able to satisfy the IP needs of their customers.

For qualification, the disclosed apps should follow these best practices. We will not check this for every disclosed app, but check it for a few randomly selected ones:

  • Informed consent:
    • Users need to opt in before sharing their internet connection. The opt-in screen should outline:
      • The provider
      • The service
      • How their IP will be used
    • Users should be able to access detailed info on
      • How their internet connection will be used
      • Privacy policy
  • Value: Users must receive some value from the app (e.g. payment, ability to skip ads or some other functionality)
  • Privacy: Limited and transparent user data collection.

Capabilities for organizational maturity

We evaluated organizational maturity based on key certification areas relevant to enterprise-grade security and compliance:

  • PII certification: Demonstrated its capability to manage PII by acquiring ISO 27018
  • Data security certification: Demonstrated its data security practices by acquiring one of these certificates: SOC 2 or ISO/IEC 27001
  • IP source whitelisted: External certification providers like McAfee certify either:
    • Specific 3rd party apps that supply IPs
    • SDK that collects IPs from 3rd party apps

Insurance

We asked vendors to provide us with these insurance documents:

  • Professional liability insurance certificate providing coverage for vendors’ liabilities in case of issues in the service
  • Cyber insurance certificate providing coverage for vendors’ liabilities in case of information security-related issues.

Summary score

This score is the sum of all scores divided by 3. The scores are:

  • 0 to 5 for capabilities for ethical use by customers
  • 0 to 5 for capabilities for ethical supply
  • 0 to 3 for organizational maturity
  • 0 to 2 for insurances
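As a worked example of the arithmetic, the maximum possible total is 5 + 5 + 3 + 2 = 15, so dividing by 3 yields a summary score between 0 and 5. The sketch below uses hypothetical names and returns the raw value, since the text does not state how that value is rounded to a published level.

```python
def summary_score(ethical_use: int, ethical_supply: int, org_maturity: int, insurance: int) -> float:
    """Sum the four category scores and divide by 3, as described above."""
    parts = [
        ("ethical_use", ethical_use, 5),
        ("ethical_supply", ethical_supply, 5),
        ("org_maturity", org_maturity, 3),
        ("insurance", insurance, 2),
    ]
    # Reject scores outside the range defined for each category.
    for name, value, maximum in parts:
        if not 0 <= value <= maximum:
            raise ValueError(f"{name} must be between 0 and {maximum}")
    return sum(value for _, value, _ in parts) / 3
```

A provider with full marks in every category scores 15 / 3 = 5.0.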

Leading web data collection services

AIMultiple selected the 7 largest web data collection services in terms of employees on LinkedIn. We chose this metric since it is public and should be correlated with a company’s revenues and enterprise-readiness. Better metrics such as revenues or the number of employees on payroll are not publicly available for these private companies.

All of the selected companies had more than 100 employees connected to their LinkedIn profile pages in April 2025.

Web data collection products in focus

These companies provide a range of products including proxies, data scraping APIs and datasets. While all products can be examined from an ethical perspective, we initially focused on the product that provides the highest level of flexibility and powers most other products: Residential proxies.

Web data collection products can be considered as a hierarchy where proxies form the core layer upon which all other services are built. This is because proxies allow machines to access the internet through different IP addresses, providing the large and diverse set of internet connections that is crucial for data collection. Proxies are therefore the most capable web data collection product: they can be used to carry out functions that would not be possible with datasets or data scraping APIs.

Among proxies, residential proxies are the hardest for websites to identify as proxies. Other types, such as datacenter proxies, are easy to identify given their location. Therefore, residential proxies power most other web data products like data scraping APIs.

Verify: Is your web data collection compliant & ethical?

Your business is most probably leveraging web data. However, the industry faces limited regulation, making it important to choose an ethical and compliant provider. To achieve that, we prepared a holistic framework to take into account different aspects of web data collection including ethical sourcing, ethical usage and organizational maturity.

Web data is a common operational asset

As an enterprise, your business partially relies on web data because of its numerous use cases like:

  • Dynamic pricing for retail & e-commerce
  • Real-time alt data for investment funds
  • KYC process in commercial banking
  • AI model training or finetuning
  • AI inference or RAG
  • Market research

With AI, web data is now more important

Though web data collection is as old as the web, its importance increased drastically after the rise of generative AI models. Builders of these models, such as OpenAI and Anthropic, started out without any significant content partnerships and mainly used online data to build their initial models, which led to the rise of the trillion-dollar AI industry.

Limited regulatory oversight

Although AI regulation is under the spotlight, the data collection industry remains largely unregulated in most countries. Clearly illegal online activities are well defined. However, there are limited regulatory requirements for industry players to proactively prevent misuse of their services by users.

It is up to the platforms themselves to set best practices and compliance standards to ensure ethical data collection and proxy usage. Therefore, your choice of vendor matters more in data collection compared to heavily regulated industries like banking where every service provider is required to abide by numerous regulations.

Your suppliers’ ethical stance is part of your company’s reputation

Regardless of whether you collect or consume the data, you are responsible for its acquisition process.

Enterprises’ responsibilities for unlawful activities in their supply chain depend on the jurisdiction. For example, in Germany, enterprises are required to carry out Know Your Supplier (KYS) and risk management activities to identify and prevent harms caused by their supply chain. Even when companies are not legally responsible for harms caused by their supply chain, they can suffer reputational damage.

What is the cost of unethical & noncompliant data collection?

Reputational risk

If it becomes public that an enterprise is leveraging a web data collection service which engages in unethical behavior or actions that endanger its data security, this can lead to significant reputational damage such as lost business, customer churn, talent churn and loss of investor confidence.

Real-life examples of suppliers causing reputational loss for enterprises:

  • Nike has suffered reputational damage numerous times due to its suppliers’ unethical labor practices.1
  • Many enterprises like EY lost their customers’ trust when they were affected by the MOVEit managed file transfer software breach.2

Reputational loss, especially when it leads to public outrage, is typically followed by lawsuits from the company’s customers or other stakeholders who have been harmed by the unethical practices.

Real-life example: Starbucks is one of the recent brands to be sued over sourcing from companies with unethical practices.3

Ethical web data checklist

Enterprise web data needs to satisfy 3 requirements to be ethical:

Ethical use by customers

As part of their Know Your Supplier processes, enterprises avoid using services that enable unethical activities. Using such services exposes businesses to reputational harm. Real-world example: In cases where a provider was documented allowing its platform to be used in unethical activities, numerous enterprises distanced themselves from the provider until it improved its practices.4

How this relates to web data: Web data is collected via different IP addresses. These addresses can be used to engage in unlawful activities such as DDoS attacks that disrupt the delivery of digital services, unauthorized collection of non-public data or ad fraud. Bad actors need IPs to power their actions, and web data infrastructure/proxy providers are the largest suppliers of IPs to retail users.

Ethical supply

Services used for ethical purposes can cause unethical and harmful actions during their production. For example, brands like Nike and Nestle suffered reputational harm and faced lawsuits due to their contractors’ use of child labor.

How this relates to web data:

Businesses need access to a large number of diverse bandwidth sources for rapid and global data collection. This requires the use of residential proxies: while collecting public data is legal under many conditions, 5 websites can also choose to block some of their visitors. For example, they can block their competitors’ crawlers. In such cases, businesses need to rely on a large number of connections from retail users or other 3rd parties to collect web data.

Proxy providers collect millions of internet connections from various sources and provide them to businesses which use IP addresses to access these connections. Some of these IPs originate from residential users’ devices. Collecting these connections can be legal or unlawful:

  • Legal: Legally compliant practices involve obtaining informed user consent, providing compensation, and offering opt-out mechanisms in accordance with local regulations. The web data provider should
    • Inform users about how their bandwidth would be used
    • Get their consent digitally
    • Compensate them in return
    • Allow them to opt out at any time
  • Illegal: Bad actors can gain access to users’ devices and use their internet connection without permission or compensation. This can happen through malware apps, masked installations, automatic opt-in and other methods that can put the device owner at risk.

Businesses using illegally obtained proxies can inadvertently pay bad actors for unauthorized access to devices. 

Real-life examples:

  • Routers and IoT devices have been compromised for botnet operations and sold as residential proxies.6 7
  • Certain proxy providers promote their services in forums frequented by bad actors. These IPs are likely to be illegally obtained.8
  • VPN apps on Google Play Store have also been used to acquire residential IPs without user consent.9

Though these operations have been shut down, it is likely that bad actors are still accessing residential IPs without consent via botnets and compromised or malicious applications.

Organizational maturity

Enterprise buyers need secure, enterprise-ready solutions. We identified the ingredients for a mature web data organization:

Data security

Lack of data security in a supplier’s systems can erode an enterprise’s competitive advantage or lead to data loss and system downtime. Loss of system functionality can erode trust and lead to the devaluation of an enterprise.

System intrusion

Data collection services are not as deeply integrated into an enterprise’s systems as core digital services (e.g. a system of record like a CRM). Therefore, their security credentials are not reviewed as thoroughly as those of a core system. However, data security is critical for data collection services’ customers since these services:

  • Are sometimes integrated into more central systems like pricing engines.
  • Can infect enterprise systems even when they are not integrated into such systems. Using a data collection service involves receiving data from that service. Even some of the most secure forms of data transfer include risks.

Real-life vulnerability example:

Receiving CSVs is one of the most secure forms of data transfer; however, infected CSV files have been used to infect spreadsheet software. 10

Technology suppliers that have been compromised, like SolarWinds’ Orion software, have enabled bad actors to gain access to their customers’ data.

Data loss

Without data security, bad actors can gain access to data collected by enterprises to identify their activities and strategies leading to a loss of competitive advantage or business opportunities.

Real-life example:

Though web data is public, businesses can use web data in novel ways for competitive advantage. For example, investors spend up to 10% of their market data budget on alt data,11 but they rarely disclose their strategies since they believe that secrecy helps them gain an advantage over their competitors. A data leak can lead to their strategies being exposed and therefore replicated by their competitors.

PII management

Web data includes private data behind login or PII that may be accidentally or purposefully disclosed on public websites. If web data collection services fail to manage PII correctly, such data can be acquired by bad actors. This can lead to reputational harm for the web data collection service and its customers.

Application security

Applications or intermediate programs like SDKs that source the web data collection services’ IPs can be whitelisted by external certification providers like McAfee. This increases enterprises’ trust in the ethical supply practices of the web data collection service.

Insurance coverage

Enterprises typically require these insurance policies from any digital provider:

  • Professional liability insurance
  • Cyber insurance

Detailed benchmark: Assessment of web data infrastructure providers

Benchmark: Ethical use by customers

Here we aim to answer the question: Does the company ensure that use of its solution is ethical and in line with applicable laws and regulations? Summary of our findings:

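Updated at 06-18-2025

| Vendor | Ethical use by customers | Effective processes | Improved processes | Best practice processes | Abuse management foundation | Responsive abuse management |
| --- | --- | --- | --- | --- | --- | --- |
| Bright Data | Level 5 | | | | | |
| Oxylabs | Level 4 | | | | | |
| Decodo | Level 3 | | | | | |
| Apify | Level 1 | | | | | |
| NetNut | Level 1 | | | | | |
| Nimble | Level 1 | | | | | |
| Zyte | Level 1 | | | | N/A* | |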

* Not applicable: Since Zyte buys proxies from its suppliers and does not directly collect it from residential users, it would not be reached by website owners regarding abuse and therefore it doesn’t need to create a contact form for websites.

First, we reviewed policies:

Acceptable use policy review

All vendors prohibit illegal activities and provide examples like DoS attacks, unsolicited bulk messages, impersonation or spoofing. 

In addition, some vendors also highlight that they prohibit activities which are likely to be illegal. Below, we list the prohibited activities based on the acceptable use policies of each vendor. 

We looked for terms that would prohibit activities that are likely to be illegal and can be identified based on user activity. For example, a significant share of users using proxies to take paid surveys could be using proxies to mislead survey providers about their actual location. Therefore, this activity is both likely to be illegal and can be identified based on user activity (i.e. when a user logs into a paid survey website).

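Updated at 06-08-2025

| Prohibited activity | Bright Data | Oxylabs | Zyte | Apify | Decodo | Nimble | NetNut |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Unauthorized data scraping | | | | | | | |
| Harmful websites | | | | | | | |
| Resale without permission | | | | | | | |
| Ad fraud | | | | | | | |
| Websites for adults | | | | | | | |
| Account creation and management | | | | | | | |
| Automated ticket purchasing | | | | | | | |
| Posting on classifieds and marketplaces | | | | | | | |
| Government websites | | | | | | | |
| Paid surveys | | | | | | | |
| Artificial social engagement | | | | | | | |
| SEO manipulation | | | | | | | |
| Trading virtual assets or currencies | | | | | | | |
| Trading in-game assets | | | | | | | |
| Games of chance for financial gain | | | | | | | |
| Streaming | | | | | | | |
| Malicious code (e.g. malware) | | | | | | | |
| Terrorism | | | | | | | |
| Accessing sensitive data (e.g. health) | | | | | | | |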

Though clearly identifying prohibited activities is beneficial, it is not a requirement and does not impact our scores. Companies may choose to mention that they don’t allow illegal activities rather than mentioning every possible instance of illegal activities.

Mentioning an activity as prohibited doesn’t mean that such activities will be reviewed or blocked. Our scores rely on how these policies are implemented as outlined below:

Processes for ethical use

While some categories outlined in the acceptable use policies are quite broad (e.g. unauthorized data scraping or access), others are specific enough to be converted into preventative actions (e.g. blocking access) that data collection services can implement for users that have not completed their KYC process.

Based on these specific prohibited uses, we prepared an extensive list of uses which are likely to be illegal uses of proxies. For each use case, we identified scenarios including relevant web domains and actions. For example, in the scenario for artificial social media engagement, we attempted to log into a social network using a proxy to like an existing post.

Then, to test whether companies allow unethical use by customers, we created an account on each provider’s service using a non-AIMultiple email address. We did not complete a KYC process with this account and proceeded to use the services to understand what anonymous users can achieve with each service. KYC is a crucial step during which the user submits data to validate the legal entity that they represent. This links user activity to a legal entity:

  • That can be held accountable.
  • Whose rationale for online actions (e.g. using proxies to log into government websites) can be examined. For example, after understanding their use case, a researcher or government agency can be allowed to log in to a government website using a proxy.

We expected these use cases to trigger a KYC process, but with most vendors that didn’t happen. A check mark indicates that the request was blocked for users that had not yet completed the KYC process:

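Updated at 06-08-2025

| Category | Domain | Bright Data | Oxylabs | Decodo | NetNut | Apify | Nimble | Zyte |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ad fraud | google.com | | | | | | | |
| Ad fraud | bing.com | | | | | | | |
| Adult | Can be provided upon request | | | | | | | |
| Adult | Can be provided upon request | | | | | | | |
| Artificial social engagement | facebook.com | | | | | | | |
| Artificial social engagement | instagram.com | | | | | | | |
| Automated ticket purchasing | viagogo.com | | | | | | | |
| Automated ticket purchasing | ticketmaster.com | | | | | | | |
| Classified | craigslist.com | | | | | | | |
| Classified | gumtree.com | | | | | | | |
| Digital currency exchange | binance.com | | | | | | | |
| Digital currency exchange | coinbase.com | | | | | | | |
| Digital currency exchange | coinmarketcap.com | | | | | | | |
| Games of chance | stake.com | | | | | | | |
| Games of chance | betway.com | | | | | | | |
| Gaming | epicgames.com | | | | | | | |
| Gaming | rockstargames.com | | | | | | | |
| Government | https://cms-sgi.cra-arc.gc.ca/gol-ged | | | | | | | |
| Government | https://secure.login.gov/ | | | | | | | |
| Harmful websites | guns.com | | | | | | | |
| Paid surveys | prizerebel.com | | | | | | | |
| Streaming | kick.com | | | | | | | |
| Streaming | twitch.tv | | | | | | | |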

For clarity, data collection services companies have no legal obligation to block these websites and some of these scenarios may be part of legal use. For example, a researcher may want to leverage proxies to run a controlled social media experiment. However, given the abuse potential in these scenarios, we expected data collection services to block them for users that have not completed the KYC process.
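The behaviour we expected, blocking sensitive domains until KYC is complete, could be sketched as follows. This is an illustration only and does not reflect any provider's actual implementation: the restricted list is a hypothetical sample built from a few of the domains in the table above, and all function names are made up for this example.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical restricted list, sampled from the test scenarios above.
RESTRICTED_DOMAINS = {
    "ticketmaster.com": "Automated ticket purchasing",
    "binance.com": "Digital currency exchange",
    "twitch.tv": "Streaming",
}


def restricted_category(url: str) -> Optional[str]:
    """Return the restricted category for a URL, or None if it is unrestricted."""
    host = (urlsplit(url).hostname or "").lower()
    for domain, category in RESTRICTED_DOMAINS.items():
        # Match the domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return category
    return None


def allow_request(url: str, kyc_completed: bool) -> bool:
    """Allow restricted targets only for users that have completed KYC."""
    return kyc_completed or restricted_category(url) is None
```

In this sketch, a user who has completed KYC is allowed through because their use case can be reviewed and tied to an accountable legal entity.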

How brands communicate domains that they block

  • Bright Data lists restricted domain categories in their acceptable use policy.
  • Oxylabs and Decodo share documents separate from their acceptable use policies where they list some of the domains that they block. 12 13 This list was consistent with the blocking that we experienced in their system.

Respecting websites’ preferences regarding automated data collection

What is robots.txt?

robots.txt is a filename for implementing Robots Exclusion Protocol. This protocol is used by websites to indicate portions of the website which the website owner prefers bots not to visit. Adherence to robots.txt is voluntary.
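A crawler that chooses to honor the protocol can check a URL against the rules before requesting it. The sketch below uses Python's standard `urllib.robotparser` module; the sample rules, the `example-bot` user agent and the `example.com` URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; a real crawler would download the site's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /terms
Disallow: /cart
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS_TXT, "example-bot", "https://example.com/terms"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "example-bot", "https://example.com/news"))
```

Because adherence is voluntary, a check like this only has an effect when the crawler or the proxy provider decides to enforce it.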

Pros and cons of adhering to robots.txt

➕ Respects website preferences.

➖ May not have been updated recently and may therefore be outdated.

➖ Typically asks bots not to access certain sections of the website even though those sections are public.

Robots.txt may also provide uneven access to bots. For example, website owners may indicate that they prefer answer engines’ bots not to visit certain URLs that search engines’ bots are allowed to visit.

Robots.txt is not a legal document, and it can request that bots stay away from pages that are legally:

  • allowed to be scraped (e.g. public data) or
  • not allowed to be scraped (e.g. data behind a login where the website owner’s terms and conditions prohibit scraping such data).

Web data collection service providers may request residential proxy users to complete a KYC process and prove that they have a legal and ethical use case before these users can disregard robots.txt.

For testing, we sent requests to pages in subfolders that are requested to be blocked by robots.txt. The domains that we used were aimultiple.com and 5 web domains among the top 100 most visited web domains. Only Bright Data blocked these requests:

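Updated at 06-05-2025

| URL | Bright Data | Oxylabs | Decodo (formerly Smartproxy) | Nimble | NetNut | Zyte | Apify |
| --- | --- | --- | --- | --- | --- | --- | --- |
| https://edition.cnn.com/terms | | | | | | | |
| https://www.bbc.com/search | | | | | | | |
| https://www.samsung.com/us/business/search/ | | | | | | | |
| https://www.imdb.com/registration/signin | | | | | | | |
| https://www.etsy.com/cart | | | | | | | |

CNN example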

CNN’s robots.txt blocks the folder /terms.14 For testing, we navigated to that folder with residential proxies and received HTTP 200 responses with the page’s data from all providers except Bright Data. Bright Data’s response was: “Residential Failed (bad_endpoint): Requested site is not available for immediate residential (no KYC) access mode in accordance with robots.txt. To get full residential access for targeting this site, fill in the KYC form: https://brightdata.com/cp/kyc.”

Abuse management

We outlined a methodology to evaluate abuse management practices of vendors and collected data to fulfill our evaluation criteria:

Updated at 06-18-2025

| Vendor | Level | Dedicated email for reporting | Webform for reporting |
| --- | --- | --- | --- |
| Bright Data | Foundation & responsive | | |
| Oxylabs | Foundation & responsive | | |
| Decodo | Foundation & responsive | | |
| Apify | Responsive | | |
| Zyte | Responsive | N/A* | N/A* |
| Nimble | Responsive | | |
| NetNut | | | |

* Not applicable: Zyte buys proxies from other proxy providers and therefore when Zyte’s service is used for abuse, website owners would reach its proxy providers rather than Zyte.

While all vendors provide means for 3rd parties or their customers to reach them, the following are important for issue resolution:

  • Public abuse policy
  • A dedicated email address to report abuse
  • An alternate contact method (e.g. webform or messaging interface) that allows reporters to reach the company. This is helpful as emails can get filtered and may fail to reach the inbox.
  • Responsiveness to messages

Only 3 providers in the benchmark (Bright Data, Oxylabs and Decodo, formerly Smartproxy) provided an email for reporting abuse. All these providers also outlined their policies in this domain.

We expect all other providers to do the same and this to become a widespread industry practice in the short term.

Finally, we evaluated abuse management responsiveness by emailing abuse reports from third-party domains (i.e. non-AIMultiple) and measuring response times. If we could not find an abuse email address, we sent it to the general contact form. We tested this via 3 batches of emails sent on: 

  • Friday May 2, 2025 from:
    • A ticket sales service with ~30k monthly traffic
    • A law firm with ~1k monthly traffic in
  • May 17, 2025 from the ticket sales service.
  • May 24, 2025 from a social media agency with limited online traffic.

The first emails sent on May 2, 2025 were only sent to companies that provided dedicated emails. Later, we expanded our list and included more general email addresses listed in the contact sections of all benchmarked web data collection services. If a company responded to our emails, we stopped sending them further emails.

In our emails, we mentioned that our websites received suspected bot traffic via proxies and asked for their support in identifying the source of proxies. We were able to get all compliance teams except one to answer us. Almost all responses were received on the same day.

Usage transparency

Historically, website owners (who provide web data) and web data collection services have had no data exchange about data collection activities. To limit crawling activities, website owners could either:

  • Contact web data collection services to report abuse
  • Work with bot management providers like Cloudflare to make crawling more challenging.

Now, there are initiatives for more structured data exchange between these parties. Bright Data launched Bright Data Webmaster console for webmasters to monitor crawling activities on their websites. More transparency is likely to improve web data collection practices.

Our experience with Webmaster console

We signed up by verifying our domain ownership and adding a collectors.txt file on the domain.

We now have access to the bot activity from Bright Data on our website:

Benchmark: Ethical supply

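Updated at 06-08-2025

| Vendor | Ethical supply | Sourcing approach explained | # of publicly disclosed apps that source IPs | Total # of reviews on 3rd party platforms |
| --- | --- | --- | --- | --- |
| Bright Data | Level 5 | | 120 | 14,617,919* |
| Oxylabs | Level 2 | | 1 | 19,224 |
| Decodo | Level 1 | | | |
| Apify | | | | |
| NetNut | | | | |
| Nimble | | | | |
| Zyte | | | | |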

* Reviews on these 3rd party platforms were included: Amazon Appstore, App Store, Google Play Store, Trustpilot. For convenience, this value was calculated for 5 major apps for Bright Data, not all 120 apps featured on their website.

Partner transparency

Bandwidth required by web data infrastructure companies can be supplied in an ethical manner by providing benefits (e.g. payments, features like the ability to skip ads) in exchange for consent to share one’s internet connection. However, it is also possible to gain unauthorized access to retail users’ systems and sell their connections. 

Web data infrastructure providers can formulate policies and processes, run external audits and publish their approach and audit findings to create transparency around how they acquire their internet connections. This can foster trust in the ethical supply of their service.

We created a framework for supply-side transparency in web data and rated vendors using this framework. We applied this framework regardless of whether a web data collection service acquired residential IPs itself or through other proxies. Our aim is to bring transparency to the entire supply chain of IPs since unethical practices can originate at any point in the supply chain.

Here you can find our detailed results:

Bright Data 

Bright Data is classified as Level 5 since they publish:

  • Their sourcing approach and how app developers can work with them via their SDK15 16
  • Details on 120 suppliers. We could check reviews of these suppliers on 3rd party platforms to estimate how popular they are. 17

Review of selected apps

Bright Data lists 120 apps on their website. Apps like Bright VPN are certified by 3rd parties on their disclosure and UX.18 We also downloaded the following apps to see them in more detail:

  • Bright VPN
  • EarnApp
  • Sling Kong

Opt-in form with obligation not to collect personally identifiable data: Consent form with clear explanation from Bright VPN:

EarnApp:

Sling Kong:

  • User is presented with the offer during the game:
  • Opt-in:
  • Additional info during opt-in:
  • Opt-out:

Value provided by apps:

  • Bright VPN: Free VPN service
  • EarnApp: Payments
  • Sling Kong: In-game virtual currency

Oxylabs

Shares its sourcing approach and discloses one app among its sources. 19 This app has ~20k reviews on Trustpilot.

Detailed review of Honeygain

The app:

  • Commits not to collect personally identifiable user data, 20 and outlines collected data in its privacy policy. 21
  • Has a detailed Terms of Use document which outlines how the device will be used and needs to be signed before signing up to the app.
  • Pays users to compensate them for their internet.
  • Outlines its data privacy policy

Decodo

Shares its sourcing approach but doesn’t disclose any apps.22

Others

While most providers are aware of ethics in web scraping and have published on the topic (e.g. 23), we haven’t identified their specific commitments on this front.

We expect this to change and most providers to move to at least Level 1 in the short term.

Organizational maturity

Updated at 06-07-2025

Vendor | Organizational maturity | Data security certification | PII certification | IP source whitelisted
Bright Data | ✅ * | | |
Apify | Certified for data security | | |
Oxylabs | Certified for data security | | |
NetNut | Certified for data security | | |
Nimble | Certified for data security | | |
Zyte | Certified for data security | | |
Decodo | | | |

* Indicates that the company achieved all external certifications in this category

It is crucial for vendors to have the right systems, personnel, and processes to protect clients’ data and secure the apps that supply their IPs. See our organizational maturity measurement methodology for the logic behind our scoring.

GDPR & CCPA compliance

All vendors publicly claim to comply with both data privacy regulations. Therefore, this was not included in scoring.

How we certified organizational maturities

Based on the capabilities that we identified in this domain, we evaluated each provider:

  • Data security certification & PII certification: Public statements. 24 25 26 27 28 29
  • IP source whitelisted: Public statements. 30

Some providers that do not hold ISO 27018 certificates claimed that they should be considered certified since they use cloud service providers that hold ISO 27018 certificates. Our cybersecurity advisor’s opinion was that, while this would facilitate certificate acquisition, these providers would still need to have their own policies and controls certified to acquire the certificate.

Insurance coverage

Three web data collection companies shared their insurance certificates. We do not publish the certificates, but we reviewed the documents to ensure that:

  • They cover these 2 insurance categories.
  • The insurance limit in each category is at least in the multi-million scale in US$.

Disclaimers and recommendations for next steps

All of the providers in this benchmark except Nimble are sponsors of AIMultiple. As always, we followed our ethical commitments during this research.

We have completed an exhaustive review of ethical web data collection, and while we are satisfied with the scope of this benchmark, we would love to increase participation in it. We thank these companies for sharing their insurance coverage: Apify, Bright Data, and Zyte.

We are waiting for responses from NetNut and Nimble. We’ll update the report as soon as we hear from them.

Oxylabs and Decodo have chosen not to participate in this iteration of the benchmark. Therefore, we were not able to add information about their insurance coverage. In the future, we would like to verify their insurance coverage as we did for other vendors.

This is the first report to focus on ethical web data according to our research. We think that this transparency can help the web data industry find creative solutions to its challenges. These solutions will need to balance the interests of web data collectors, web automation users, website owners and residential users that supply their IPs to the industry.

References

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led the commercial growth of deep tech company Hypatos, which went from 0 to a 7-digit annual recurring revenue and a 9-digit valuation within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
