We benchmarked leading web scraper APIs with 12,500 requests to e-commerce platforms and search engines. Then, we tested the reliability of the underlying services (i.e. residential proxies) with 5,000 and 100,000 parallel requests.
Based on these experiments, we explain how to scrape large-scale data efficiently and ethically. Explore the top vendors, the challenges of large-scale data extraction, and best practices to overcome these hurdles:
How to scrape websites at large scale
Scraping websites at a large scale effectively requires combining a well-planned strategy with automated tools to handle the challenges that arise. There are typically two types of large-scale data scraping goals:
1) Scraping thousands/millions of pages from a few large websites
Large websites typically have complex pagination systems and incorporate anti-scraping techniques. To scrape data from large websites, you can leverage web scraping APIs when they are available. They are cost-effective because they minimize tech effort on the client side by providing structured data.
However, web scraping APIs are not available for every website. You can follow these steps for an optimal approach:
Recommended approach
- Create a list of types of pages to collect. For example, a search page on Amazon is a different type of page than a product page.
- Compare this list with the APIs that each provider supplies to identify which provider lets you retrieve the most pages via APIs. Each additional type of page delivered via API saves technical teams from managing proxies and parsing HTML pages. You can see all web scraping APIs along with benchmarks showcasing the data fields provided by different services.
- Use APIs when they are available; a minimal sketch of such an API call follows this list.
- When scraping APIs are not available, use unblocker services or residential proxies to bypass strict anti-bot measures.
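To illustrate the API-first approach above, here is a minimal sketch of what a call to an e-commerce scraping API typically looks like. The endpoint, API key, parameters, and response fields below are hypothetical placeholders rather than any specific vendor's API; each provider documents its own request format and authentication scheme.

```python
import requests

# Hypothetical endpoint, key, and response fields for illustration only;
# each provider documents its own request format and authentication scheme.
API_URL = "https://api.scraping-provider.example.com/v1/ecommerce/product"
API_KEY = "YOUR_API_KEY"

response = requests.get(
    API_URL,
    params={"url": "https://www.amazon.com/dp/B0EXAMPLE", "parse": "true"},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
response.raise_for_status()

# The value of a scraping API: structured fields instead of raw HTML to parse.
product = response.json()
print(product.get("title"), product.get("price"))
```

The structured response is what saves client-side effort: there is no proxy management or HTML parsing on the caller's side.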
Real-life example
E-commerce companies and retailers scraping their competitors’ websites (e.g. Amazon) for dynamic pricing face this challenge. This is a common use case, and as a result, e-commerce scraping APIs are the most widely available type of scraping API.
If you plan to scrape millions of pages per day, you need to leverage a service that can handle large volumes. To identify such services, we ran a reliability benchmark:
Web data infrastructure reliability benchmark
We measured the success rates and response times of residential proxies to understand how these systems behave under different loads. Since residential proxies underlie all advanced services (e.g. unblockers, web scraper APIs), residential proxy capacity is typically the limiting factor.
All benchmarked providers’ services were reliable at 5,000 parallel requests. At 100,000 parallel requests, all services experienced some degradation, but Bright Data, Oxylabs and Smartproxy were more reliable, showing limited change in success rate or response times. For example, as we scaled parallel requests from 5k to 100k:
- Bright Data residential proxies’ success rate decreased from 96.5% to 93.4% and response time increased from 1 second to 3.6 seconds.
- Oxylabs’ success rate dropped from 97.2% to 93.8% and response time increased from 1.3 to 6.4 seconds.
A power scale is used on the vertical axis to make the differences between products easier to see:
Limitation: This observation is a snapshot. Even though it involved 5 million requests sent to each vendor, vendor performance can change over time.
Methodology
We used each benchmarked residential proxy service to send parallel requests to 50 different URLs hosted on aimultiple.com. These URLs did not employ any anti-scraping services since we turned off all of our website’s security services, such as WAF and DDoS protection at the network layer, during this test.
We ran these tests from 100+ servers, each with a 10GB uplink, hosted in different regions. During our measurements, we ensured that all parallel threads were simultaneously live. In one measurement, we had 5k parallel requests; in another, 100k.
A request was considered successful if it returned a 200 response code and the correct identifier. To ensure that results were not cached, we added a unique identifier to the request header. Then, via a script, the URL printed this identifier to the response body. Finally, we compared the two identifiers (one in the response body and the other in the request header). With this approach, we ensured that the requests reached the target URLs and that the results were not cached (i.e. fresh).
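For readers who want to reproduce a similar check, the sketch below shows the cache-busting validation logic in Python. The proxy endpoint, credentials, header name, and test URL are placeholders, not the exact configuration used in our benchmark.

```python
import uuid
import requests

# Placeholder proxy endpoint, header name, and test URL; the benchmark's
# actual configuration differed and is not reproduced here.
PROXY = "http://username:password@proxy.example.com:8000"
TEST_URL = "https://www.example.com/benchmark-page"

def run_probe() -> bool:
    marker = uuid.uuid4().hex  # unique identifier so cached responses cannot pass
    response = requests.get(
        TEST_URL,
        headers={"X-Benchmark-Id": marker},
        proxies={"http": PROXY, "https": PROXY},
        timeout=30,
    )
    # Success = HTTP 200 and the page echoed our identifier back,
    # proving the request reached the target and the response is fresh.
    return response.status_code == 200 and marker in response.text
```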
Total Price for Enterprise Package
Given the many different products offered by each vendor, it is hard to compare vendors by price. However, an overall price index gives an idea of the affordability of each vendor’s services. For further information, see our benchmark’s pricing approach.
2) Scraping thousands of small websites
This type of large-scale web scraping is challenging because web data infrastructure providers typically do not offer web scraping APIs for small websites, and most small websites have diverse structures.
However, smaller websites typically incorporate lower levels of anti-scraping technology. Therefore, proxies are typically used in such scraping operations. The steps include:
- Identify the target websites or web pages: Determine the websites or web pages from which you want to extract data.
- Send connection requests: The method of sending HTTP requests varies depending on the web scraping method. You can send requests to the target website using an automated tool or an in-house web crawler. Most off-the-shelf web scrapers have built-in anti-scraping technologies to overcome measures such as CAPTCHAs and IP bans. If you extract data with an in-house scraper, you will need to integrate proxy services and IP rotation mechanisms to scrape data effectively at a large scale (a minimal end-to-end sketch follows this list).
- Parse HTML content: Parse the HTML content of the web page to locate the desired data elements. You can use the Beautiful Soup library to parse the HTML source of the web page you are scraping.
- Extract desired data: Once you have parsed the source of the target web page, you can extract the desired data from the HTML content.
- Clean and process extracted data: After scraping all the required data, you must ensure that it is accurate, consistent, and ready for analysis. However, cleaning large datasets is challenging and time-consuming due to their volume and heterogeneity. You may need data science tools and techniques such as parallel processing to handle large datasets.
- Save data: Pre-built web scrapers automatically download and save scraped and processed data in different formats. If you have scraped data using a web scraping library such as Scrapy, you have several storage options to save the collected data.
For instance, you can store data in databases such as SQLite or PostgreSQL. For SQLite, you can use Python’s built-in “sqlite3” module. If you want to store data in a text file format such as CSV, you can use Python’s built-in “csv” module.
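As referenced in the steps above, here is a minimal end-to-end sketch that ties request sending, parsing, and storage together. The proxy credentials, target URL, and CSS selectors are placeholder assumptions; a production scraper would also need retries, politeness delays, and per-site selector logic.

```python
import random
import sqlite3

import requests
from bs4 import BeautifulSoup

# Placeholder proxies, URLs, and CSS selectors for illustration only.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]
TARGET_URLS = ["https://smallsite.example.com/products"]

conn = sqlite3.connect("scraped.db")
conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT, title TEXT, price TEXT)")

for url in TARGET_URLS:
    proxy = random.choice(PROXIES)  # naive IP rotation across the proxy pool
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    if response.status_code != 200:
        continue  # a production scraper would retry or log the failure
    soup = BeautifulSoup(response.text, "html.parser")
    # Selectors are assumptions; adjust them to each site's actual markup.
    for card in soup.select(".product"):
        title = card.select_one(".title")
        price = card.select_one(".price")
        conn.execute(
            "INSERT INTO items VALUES (?, ?, ?)",
            (
                url,
                title.get_text(strip=True) if title else "",
                price.get_text(strip=True) if price else "",
            ),
        )

conn.commit()
conn.close()
```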
New developments: LLMs and AI scrapers
Parsing web pages used to be time-consuming, manual work in which engineers used pattern-matching techniques to convert HTML into structured data.
With generative AI, large language models can be used for parsing. However, LLMs are prone to hallucination, and businesses should test automatically parsed data to ensure that it is parsed correctly.
Web data infrastructure providers are bundling LLMs into their offerings; learn more about this new category: AI web scraping.
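As one way to follow the validation advice above, the sketch below checks LLM-parsed product records against a few basic rules before they are trusted. The field names form a hypothetical schema and would need to match your own output format.

```python
def validate_parsed_product(record: dict) -> list[str]:
    """Return a list of problems found in one LLM-parsed product record."""
    problems = []
    # Hypothetical schema: adjust the required fields to your own output format.
    for field in ("title", "price", "currency"):
        if not record.get(field):
            problems.append(f"missing {field}")
    # Hallucination guard: the price must be a plausible positive number.
    try:
        if float(record.get("price", 0)) <= 0:
            problems.append("non-positive price")
    except (TypeError, ValueError):
        problems.append("price is not numeric")
    return problems

# Example: an empty result signals the record passed these basic checks.
print(validate_parsed_product({"title": "Widget", "price": "19.99", "currency": "USD"}))
```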
What is large-scale web scraping?
Large-scale web scraping is the process of extracting data from websites at a scale of at least hundreds of thousands of requests per month. Although users can perform it manually, the term typically refers to an automated process implemented with web crawlers or scrapers.
The volume and complexity of data involved in large-scale web scraping pose ethical and legal issues, necessitating a comprehensive understanding of the tools, techniques, and web scraping best practices for achieving success.
What are the challenges of web scraping at a large scale?
Large-scale web scraping poses numerous challenges due to the complexity of handling extensive data volumes and the technical components involved. Here are some of the most common challenges of large-scale scraping:
- Dynamic websites: Dynamic websites, as opposed to static websites, use JavaScript to load or display content, which makes it difficult for traditional web scraping methods to collect data from them. Most dynamic websites also require user interactions, such as clicking buttons or filling out forms, so your scraper must be able to simulate these interactions to access the data.
- Rate limiting: Websites employ rate limiting to control the number of requests a client can make within a specific period. This protects websites from malicious bots and prevents their data from being abused or misused (a simple backoff pattern for handling rate limits is sketched after this list).
- Data accuracy: It can be difficult to ensure data accuracy, especially when working with large datasets. For instance, large datasets collected from multiple sources can result in data inconsistencies. Inspecting new data manually, especially in large datasets, can be impractical and tedious. You can employ automated metrics to validate and inspect data, such as leveraging machine learning algorithms or developing scripts.
- Anti-scraping measures: Many websites employ anti-scraping mechanisms, such as CAPTCHAs, JavaScript challenges, and IP blocks, to prevent or restrict web scraping activities.
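As a concrete illustration of handling the rate-limiting challenge above, the sketch below retries politely when a server responds with HTTP 429. The retry count and delays are illustrative defaults, not universal values.

```python
import random
import time

import requests

def fetch_with_backoff(url, max_retries=5):
    """Retry politely when the server signals rate limiting (HTTP 429)."""
    delay = 1.0
    for _ in range(max_retries):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            return response
        # Honor Retry-After when it is given in seconds; otherwise back off
        # exponentially with a little jitter to avoid synchronized retries.
        retry_after = response.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else delay + random.uniform(0, 1)
        time.sleep(wait)
        delay *= 2
    return None  # still rate limited after all retries
```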
How to perform large-scale web scraping effectively
We have compiled the following key guidelines to help you overcome large-scale web scraping challenges, ensuring efficient and legally compliant data extraction. It is important to use these best practices responsibly and in accordance with website terms of service.
- Scraping browsers bring unblocking features to browsers that can be programmatically controlled. This facilitates data collection.
- Headless browsers allow users to extract the data they require from dynamic websites. When scraping dynamic sites, you can use headless browsers to simulate user interactions, such as mouse movements and clicks (a headless-browser sketch follows this list). However, some headless browsers may not correctly render pages that rely heavily on JavaScript.
- Proxies and IP rotation: Most web scraping libraries and tools provide options for using proxy servers. Pre-built web scrapers often include built-in integration with proxy services to help users avoid being blocked by target websites.
For example, rotating proxies allow web scrapers to bypass rate limiting and make more requests without being flagged as suspicious. We advise employing residential IPs, which are widely known for their reliability and speed.
- Web browser automation: Web automation tools like Selenium and Puppeteer enable you to mimic human activities and interact with websites the same way humans do. This can be useful for extracting large amounts of data from dynamic websites without manually navigating the site.
- Distributed computing techniques: A distributed web scraping architecture enables more efficient large-scale web scraping by dividing and spreading scraping tasks across multiple machines (a multiprocessing sketch follows this list). You can build your distributed scraper in whichever language you are most comfortable with to overcome challenges such as rate limiting and handling dynamic content.
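As mentioned in the headless-browser item above, the following sketch renders a JavaScript-heavy page with headless Chrome via Selenium. The URL and CSS selector are placeholders, and it assumes Chrome with a matching driver is installed.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Placeholder URL and selector; assumes Chrome and a matching driver are installed.
URL = "https://dynamic.example.com/listings"

options = Options()
options.add_argument("--headless=new")  # render JavaScript without opening a window
driver = webdriver.Chrome(options=options)
try:
    driver.get(URL)
    driver.implicitly_wait(10)  # give client-side scripts time to populate the DOM
    for row in driver.find_elements(By.CSS_SELECTOR, ".listing"):
        print(row.text)
finally:
    driver.quit()
```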
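For the distributed-computing item, the sketch below spreads fetches across worker processes on a single machine; the same partitioning pattern extends to multiple machines, for example by pulling URL batches from a shared queue. The URL list is a placeholder.

```python
from multiprocessing import Pool

import requests

# Placeholder URL list; in practice, batches would be pulled from a shared
# queue or database so that many machines can work on the same job.
URLS = [f"https://smallsite{i}.example.com/catalog" for i in range(100)]

def scrape_one(url):
    """Fetch a single URL and report its status; parsing is omitted for brevity."""
    try:
        response = requests.get(url, timeout=30)
        return url, response.status_code
    except requests.RequestException:
        return url, -1

if __name__ == "__main__":
    # Spreading requests across worker processes is the single-machine analogue
    # of a distributed architecture; the same pattern scales out to more machines.
    with Pool(processes=8) as pool:
        for url, status in pool.imap_unordered(scrape_one, URLS):
            print(status, url)
```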