Updated on Apr 4, 2025

Web Data Collection Benchmark with 30M Requests in 2025 

We crawled web pages more than 30 million times while using more than 50 different products from 6 leading web data infrastructure companies. See criteria for enterprise web data & analysis of leading products:

Benchmark results

Last Updated at 02-16-2025
Vendor      | API Coverage* | Unblocking Rate | Dynamic Scraper | Price** | Reliability
Bright Data | 89%           | 98%             |                 | 3.0     | High
Oxylabs     | 37%           | 95%             |                 | 3.9     | High
Zyte        | 32%           | 97%             |                 | 1.5***  | N/A***
Smartproxy  | 53%           | 96%             |                 | 2.8     | Normal
Apify       | 63%           | N/A             |                 | 6.3     | Normal
NetNut      | 11%           | N/A***          |                 | 3.0     | Normal

Leading results in each column are bold. Brands are sorted first by their number of leading results and then alphabetically.

* Number of page types in our benchmark for which a scraping API is available divided by the total number of page types in our benchmark. A scraping API for a page type is deemed available if it has a 90% or higher success rate on a specific type of page.

** Price is in thousands of USD for a package sufficient for an enterprise PoC, which is detailed in the methodology section. Prices are updated monthly. Some of these providers do not offer all of the products included in the enterprise PoC, so assumptions were necessary to prepare an indicative price for them:

  • Zyte does not offer standalone proxies, so we assumed its proxies to be priced like its API.
  • NetNut offers few scrapers and its unblocker pricing was not publicly available, so we assumed these products to be priced like its rotating residential proxies.
  • Apify doesn’t provide a web unblocker or mobile proxies, so these products were assumed to be priced like its residential proxies.

*** NetNut’s unblocker was not available for AIMultiple’s testing. Zyte’s API-based solution was not tested because load testing was for residential proxies.

Learnings from 30M web requests

Since the legality of collecting web data continues to be challenged, many businesses do not yet have a web data strategy and may not be aware of all solutions. Enterprises that need to collect web data typically value receiving structured, high-quality web data with limited technical effort via cost-effective and reliable services. 

To achieve the goals above, enterprises need to:

  • Outline types of pages that they need to crawl
  • Leverage web scraping APIs when they are available, since they minimize technical effort on the client side by providing structured data and are cost-effective: in most cases they cost about the same as residential proxies, even though residential proxies return unstructured data (see the sketch after this list).
  • Our experience: Before this benchmark, we relied on unblockers for our own company’s data collection needs. Our tech team was burdened every time our target websites changed their design. After realizing the scope of web scraping APIs and seeing that they are not more expensive than unblockers, we switched to using scraping APIs in our data collection workflows.

For remaining pages, rely on:

  • Web unblockers for hard-to-scrape pages, since they are the only solution that consistently returns successful results more than 90% of the time without complex configuration. However, they are also the most expensive product in most providers’ toolkits.
  • Datacenter or residential proxies for other pages if the enterprise’s tech team is comfortable with configuring proxies and maintaining these configurations to ensure high success rates.
  • Mobile proxies to get mobile responses and other proxies for more niche use cases.
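
To make the integration-effort difference concrete: a scraping API returns structured records, while a proxy returns raw HTML that the client still has to parse and re-parse whenever the page design changes. Below is a minimal Python sketch of the two styles; the API endpoint, token, target URL and proxy credentials are placeholders, not any specific vendor’s product.

```python
# A minimal sketch, assuming a hypothetical scraping API endpoint and placeholder
# proxy credentials; not any specific vendor's product.
import requests

TARGET = "https://www.example-shop.com/item/123"  # placeholder target page

# Option 1: a web scraping API returns structured data directly.
api_response = requests.get(
    "https://scraping-api.example.com/v1/ecommerce/product",  # hypothetical endpoint
    params={"url": TARGET, "token": "YOUR_TOKEN"},
    timeout=60,
)
product = api_response.json()  # e.g. {"title": ..., "price": ..., "rating": ...}

# Option 2: a residential proxy returns raw HTML that your team must parse,
# and re-parse whenever the target site changes its layout.
proxies = {"https": "http://USER:PASS@proxy.example.com:8000"}  # placeholder credentials
html = requests.get(TARGET, proxies=proxies, timeout=60).text
# ... custom parsing (BeautifulSoup, XPath, etc.) would go here ...
```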

Compare web data providers’ performance, price & reliability

In web-scraping APIs, you can choose:

  • Bright Data for its market-leading range of web scraping APIs at cost-effective prices with detailed results. Many Bright Data SERP and e-commerce APIs return more data points than competitors’ APIs.
  • Apify for its market-leading range of web scraping APIs thanks to its community-driven scraper approach. However, success rates of some of its APIs were below our threshold for a successful API (i.e. below 90% success rate) and it was the most expensive provider in our benchmark.
  • Zyte for its market-leading prices.
  • Others opportunistically (e.g. Smartproxy returned the most data points for Instagram posts).

Learn more about web scraping APIs and see detailed results.

In unblockers, leading products include:

  • Bright Data is slightly more successful than most others in real-world tests and significantly more successful than others in more difficult scenarios, such as scraping websites that regularly present JavaScript challenges. It also provides the second-lowest-priced unblocker in the benchmark.
  • Zyte has the lowest-priced and fastest unblocker, responding within ~2 seconds on average in real-world tests.

Learn more about web unblockers and see detailed results.

Proxies: You can rely on any of the providers based on your technical team’s preferences and pricing. This is because results vary significantly based on:

  • Time: While publishers improve their anti-scraping approaches, web data infrastructure providers constantly receive fresh IPs and improve their own approaches. We used the same proxy type from the same provider on the same website with the same configuration for thousands of URLs in different runs. There were runs where almost all responses were correct and others where the success rate was ~50%; the success rate depended on when the test was run.
  • Request: The success of a request sent via a proxy depends on how the request is sent. For example, the choice of user agent or the delay between requests significantly impacts the success rate (see the sketch below).
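
A simplified sketch of those request-level factors; the URLs, proxy address and user agent strings below are placeholders:

```python
# A simplified illustration of user agent rotation and request spacing;
# the URLs, proxy address and user agent strings are placeholders.
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]
proxies = {"https": "http://USER:PASS@proxy.example.com:8000"}  # placeholder
urls = [f"https://www.example-shop.com/item/{i}" for i in range(1, 51)]

success = 0
for url in urls:
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate user agents
    try:
        resp = requests.get(url, headers=headers, proxies=proxies, timeout=30)
        success += resp.status_code == 200
    except requests.RequestException:
        pass
    time.sleep(random.uniform(1.0, 3.0))  # space out requests

print(f"Success rate: {success / len(urls):.0%}")
```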

However, this recommendation does not apply to niche use cases. For example, a company that is not part of our benchmark could be providing higher-quality mobile proxies in Portugal. For niche cases, we recommend that teams experiment with different providers.

Learn more about proxy providers and see detailed benchmark results.

As for reliability, all benchmarked providers’ services were reliable at 5,000 parallel requests. At 100,000 parallel requests, all services experienced some degradation but Bright Data, Oxylabs and Smartproxy exhibited more reliability, showing limited change in success rate or response times.
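
Reliability here is essentially the success rate measured at a fixed level of concurrency. Below is a small-scale sketch of that measurement, assuming a placeholder URL and a much lower concurrency than the benchmark used; it is an illustration, not the benchmark harness itself.

```python
# A small-scale sketch of measuring success rate under parallel load;
# the URL, request count and concurrency level are placeholders.
import asyncio
import aiohttp

URL = "https://www.example-shop.com/item/123"  # placeholder target
TOTAL_REQUESTS = 2_000
CONCURRENCY = 500  # the benchmark used 5,000 and 100,000 parallel requests

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore) -> bool:
    async with sem:  # cap the number of in-flight requests
        try:
            async with session.get(URL, timeout=aiohttp.ClientTimeout(total=60)) as resp:
                return resp.status == 200
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return False

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, sem) for _ in range(TOTAL_REQUESTS)))
    print(f"Success rate: {sum(results) / len(results):.0%}")

asyncio.run(main())
```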

See detailed benchmark results about large scale web scraping.

How to choose the right data collection solution

1. Enterprise web data requirements:

Enterprises include diverse businesses. For example, businesses with e-commerce operations and hedge funds require high volumes of data to feed their models (e.g. dynamic pricing, stock replenishment). Their requirements include:

  • Buyer-related dimensions
    • High volume
    • Batch
    • Price & quality sensitivity 
    • Want to receive structured data 
  • Website-related dimensions
    • Easy & difficult-to-crawl
    • Static and dynamic 
    • Mixed 

For more on how we derived these requirements see dimensions of web data requirements.

To achieve these requirements, enterprises need:

  • Capabilities to support their requirements: 
    • A wide selection of web scraping APIs that return detailed results with a high success rate, to deliver structured data and satisfy their quality sensitivity. Measurement: share of the types of web pages to be crawled for which a web scraping API is provided. This depends on the types of pages that each enterprise targets.
    • A powerful unblocker for difficult-to-crawl websites. Measurement: the crawler’s success rate across a wide range of web pages, including the most challenging ones.
    • Unblocker integration with browsers to enable interacting with websites for dynamic scraping. Measurement: availability of this browser integration.
  • Cost-effective services to satisfy their price sensitivity. Measurement: the price to crawl a set of web pages.
  • Reliability
    • A resilient web data infrastructure to handle high-volume batch queries. Measurement: how the success rate degrades during load testing. The most resilient networks should not experience drastic success rate declines while answering tens of thousands of parallel queries.

2. Web data requirements for small, highly technical teams whose core focus is data collection:

If your data collection costs are going to decide your company’s profitability and if you are a highly technical team, we recommend relying on proxies to save costs.

Finally, all buyers should pay attention to pricing, so we calculated prices for the same packages across all major web data infrastructure providers:

See pricing methodology for details.

Dimensions of web data requirements

There are a few dimensions to a company’s web data requirements. Understanding your company’s requirements will help you identify the right vendor.

We are not covering every type of web data use case here. There are many web data users that have multiple one-off requests over time. That is not the focus of this report. We have seen that enterprises typically have recurring web data needs to monitor sentiment, prices or other rapidly changing metrics. Therefore, we have only focused on companies continuously using web data.

These dimensions are:

  • Buyer-related dimension
    • Volume:
      • High volume meaning 100 GB/month or more
      • Low volume for any lower volume
    • Time sensitivity
      • Real time: When web data in raw or processed form is to be served to human end users while they are using applications, real-time responses are important.
      • Batch: Response times are not important as long as results are received within tens of seconds. In most use cases, businesses batch process incoming web data to update their systems.
    • Price & quality sensitivity:
      • Quality sensitive: All web data solutions sometimes return empty responses when they get blocked by websites. Companies that want to spend limited time in resending requests would prefer solutions that have higher success rates.
      • Price sensitive: Given their other requirements are satisfied, these businesses want to receive the lowest price and are ready to run their data collection systems multiple times to achieve higher quality results.
      • Price & quality sensitive: Businesses that want the optimal combination of high success rates and price.
    • Technical involvement:
      • Want to build custom scrapers: The technical team is experienced with using proxies to overcome anti-scraping technologies and has the capacity to build a custom internal solution. They are ready to constantly devote effort to overcome evolving anti-scraping approaches.
      • Want to build HTML parsers: The technical team wants to receive HTML data which they will parse themselves. They are ready to constantly devote effort to reparsing web pages when the page design changes.
      • Want to receive structured data: Team wants to receive structured data (e.g. JSON files) to integrate into their applications.
  • Website-related dimension:
    • Difficulty:
      • Difficult-to-crawl websites like Amazon employ numerous anti-scraping technologies. Unblockers are necessary to consistently receive data from them with high success rates.
      • Easy-to-crawl websites can be crawled with proxies
      • Easy & difficult-to-crawl websites
    • Interactivity:
      • Static websites make up most of the web and deliver different data based on the URL alone.
      • Dynamic websites require users to use their mouse or keyboard on the website to disclose additional information (see the sketch after this list).
      • Static and dynamic websites
    • Scraper availability:
      • Available: A custom scraper exists for every target type of webpage.
      • Not available: There are no scrapers for any of the types of target webpages.
      • Mixed: A scraper exists for some target types but not for others.
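
To illustrate the static vs. dynamic distinction above: a static page can be fetched with a plain HTTP GET of its URL, while a dynamic page needs a browser interaction before the data appears. Below is a minimal sketch using Playwright as one possible browser-automation tool; the URL and CSS selectors are hypothetical.

```python
# A minimal dynamic-scraping sketch with a headless browser (Playwright here);
# the URL and CSS selectors are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.example-shop.com/item/123")
    # A dynamic page discloses extra data only after an interaction:
    page.click("button#show-more-reviews")    # hypothetical selector
    reviews = page.inner_text("div#reviews")  # hypothetical selector
    browser.close()

print(reviews)
```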

Methodology

This web data benchmark includes several sub-benchmarks, and the methodology for each is explained on its specific page.

You can see the methodology for the pricing benchmark below:

Pricing methodology

Almost all prices are based on publicly disclosed packages.

However, not all vendors disclose pricing at the same levels: one vendor may provide pricing for 100 GB of residential proxy usage while another only provides pricing for 50 GB. Where pricing was not public, if vendors shared private pricing information with us, we included it in the benchmark as long as it did not change the ranking of vendors.

Our rationale is that we want to share:

  • The most accurate pricing possible with our readers
  • Pricing levels that are in line with publicly available prices, which can be continuously monitored.

Unit conversions

For the same product, vendors may provide pricing in GB or in requests, so we needed to convert these values to one another.
We assume an average page size of ~400 KB, which is what we measured across 1,700 e-commerce URLs. Therefore, we assumed 1 GB to equal 2.5k requests.
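
As a quick sanity check of that conversion (assuming decimal units, since proxy traffic is typically billed per GB):

```python
# Back-of-the-envelope check of the GB-to-requests conversion above,
# using the ~400 KB average page size measured over 1,700 e-commerce URLs.
AVG_PAGE_KB = 400
GB_IN_KB = 1_000_000  # decimal units assumed

requests_per_gb = GB_IN_KB / AVG_PAGE_KB
print(requests_per_gb)  # 2500.0, i.e. 1 GB ≈ 2.5k requests
```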

Packages

We looked into two packages: the enterprise PoC package and the enterprise package. The enterprise PoC package is designed to be broadly representative of an enterprise PoC scope:

  • 100 GB residential proxies
  • 100 GB mobile proxies
  • 500 GB datacenter proxies
  • 500k unblocker requests
  • 500k scraping API requests to Amazon product pages

The enterprise package is the highest-volume package with public prices. In each product category, we identified the highest volumes offered by each provider and took the highest of these as the enterprise package volume for that product:

  • 1,000 GB residential proxies
  • 1,000 GB mobile proxies
  • 5,000 GB datacenter proxies
  • 2.5M unblocker requests
  • 2.5M scraping API requests to Amazon product pages

Limitations

When enterprises procure such services at high volumes, they are likely to get discounts. Such enterprise discounts are not public and are not included in the benchmark.

Vendor-specific assumptions

Some vendors’ pricing is complex, which required certain assumptions:

  • Apify:
    • For datacenter proxies, we assumed that the user buys a $499/month package and pays $0.25/GB for platform usage.
    • For scrapers: We took the average price of these two scrapers: junglee~amazon-crawler and tri_angle~walmart-product-detail-scraper
  • Oxylabs prices its unblocker only on a GB basis. Therefore, we converted its pricing to requests, assuming an average page size of ~400 KB.
  • Zyte: Its 4th pricing tier was recommended for the websites in our benchmark. We used its HTTP response service.

Limitations and next steps

AIMultiple’s experience may differ from an average user’s experience in the following cases. Users can:

  • Receive faster responses due to caching. Our work aimed to bypass caching in all providers to provide a level playing field.
  • Receive fewer successful responses when extracting data from less popular websites since their requests may be blocked due to website health issues.
  • Make configuration mistakes, miss KYC requirements or get blocked when they initially send a high volume of requests. All of these can undermine their experience and success rates. All of these issues can be swiftly resolved by support teams.

Finally, network quality fluctuates over time, and this benchmark is a series of snapshots taken during a month. It should be representative of that month, but network quality can change after the benchmark.

Acknowledgements & disclaimers for transparency

All providers contributed to this benchmark by providing some or all of the credits used in the benchmark. We thank them for their support of our research.

All providers in this benchmark are customers of AIMultiple. Our team ensures the objectivity of our research by following our ethical commitments.

FAQ

What is website data collection?

Website data collection refers to the process of extracting publicly available data from public web sources such as review platforms, social media platforms, and eCommerce sites. This data can be gathered directly or indirectly for a variety of purposes, including personalization, market research, and UX improvement.

Many people believe that collecting publicly available information from the internet is always legally permissible. While you can technically collect any publicly available information, there are still ethical and legal implications to consider.

What are the main differences between data collection, data mining, and data analysis?

Data collection, mining, and analysis denote different stages in the data lifecycle.
Data collection is the process of gathering information from web sources manually or automatically. 
Data mining, also known as knowledge discovery in data (KDD), is a computational process of discovering patterns in large datasets. 
Data analysis is the process of inspecting, cleaning, transforming, and interpreting data to reach conclusions. 

What are the main web data collection methods?

1. Web scraping
1.1. No-code web scrapers: No-code scraping tools provide features like visual point-and-click, where users can select the elements they intend to scrape. The main advantage of no-code web scrapers is that they are suitable for non-technical users.
1.2. In-house web scrapers: In-house web scrapers are developed internally by an organization using scraping libraries. One of the main advantages is the ability to customize self-built web scrapers based on the organization’s particular scraping needs and business requirements.
1.3. Pre-built web scraping APIs: Web scraping APIs (Application Programming Interfaces) allow users to access and collect data from web pages. Instead of writing a scraping script, you can use the scraping API’s pre-written algorithms to navigate and extract data.
1.4. Ready-to-use datasets: Pre-collected datasets allow businesses to bypass the time-consuming process of data collection, cleaning, and pre-processing.

2. Tracking online behavior
Online tracking technologies monitor and record the actions of website visitors as they interact with applications and digital ads. Online tracking involves different methods such as cookies, web beacons, fingerprinting, localStorage, and SessionStorage.
 
3. Qualitative data collection
Qualitative data collection is the process of gathering non-numerical data through online surveys, interviews, and observations to understand intangible aspects of individuals, such as behaviors and motivations.

What are the top best practices for responsible data collection?

1. Limit data collection: Instead of bombarding the target website with rapid requests, space out your connection requests.
2. Read the terms of use of the scraped website: Before conducting any scraping activity, review the target website’s terms of use or terms of service. Many websites explicitly state their stance on automated data extraction in these documents.
3. Follow robots.txt: The robots.txt file indicates which parts of the target website should not be accessed by web scrapers (see the sketch after this list).
4. Use the website’s API: A more responsible way to access data than scraping the website directly.
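
As a sketch of the robots.txt check in practice, using Python’s standard library; the domain, path and user agent string are placeholders:

```python
# A minimal robots.txt check before scraping; domain, path and user agent
# are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example-shop.com/robots.txt")
rp.read()

if rp.can_fetch("MyScraperBot/1.0", "https://www.example-shop.com/item/123"):
    print("robots.txt allows fetching this path")
else:
    print("robots.txt disallows this path; skip it")
```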

What is the future of website data collection?

Advanced algorithms such as AI and ML can anticipate the most valuable data and adjust collection strategies in real time. For example, AI-based data collection tools such as adaptive web scrapers adjust themselves to changes on a website, including minor design changes, rather than relying only on its structure.

Gülbahar is an AIMultiple industry analyst focused on web data collection, applications of web data and application security.
