Websites track the IP address of every incoming request, and a high volume of traffic from a single IP is a telltale sign of an automated bot. The standard solution is a proxy. A proxy server is an intermediary that stands between your scraper and the target website, forwarding your requests while masking your real IP address.
By rotating through a pool of proxies, you can distribute your requests across hundreds or thousands of different IP addresses, making your traffic appear more organic and reducing the likelihood of detection.
This article examines how proxy servers function, their various types, and their applications in data collection.
How many proxies are needed?
The number of proxy servers needed to achieve the benefits mentioned above can be calculated with the following formula: Number of proxies = number of access requests / crawl rate.
The number of access requests depends on:
- Pages the user wants to crawl: This refers to the total volume of web data that needs to be collected. Are you scraping 10,000 product pages, a million forum posts, or 50,000 articles? This is the foundational number for your calculation.
- Frequency of crawling: This variable defines how often you need to perform the scrape. Do you need the data refreshed hourly, daily, or only once? Frequency has a significant impact on the total number of requests over a given period.
The crawl rate is limited by the number of requests per user that the target website allows within a given time period. For example, most websites cap the number of requests a single user can make per minute to differentiate human visitors from automated ones.
Websites impose rate limits to protect themselves from being overwhelmed and to distinguish between human users and automated bots.
A human might click a few links a minute, but a scraper can send hundreds of requests in the same timeframe. This sudden spike can look like a denial-of-service (DoS) attack or aggressive scraping.
How to determine the crawl rate?
- Check robots.txt: Some websites specify a Crawl-delay in their robots.txt file (e.g., Crawl-delay: 10 means you should wait 10 seconds between requests).
- Website tolerance: The acceptable rate varies greatly.
- High tolerance (e-commerce sites): Might tolerate 100-300 requests per hour per IP.
- Low tolerance (social media, search engines): Can be much stricter, potentially flagging an IP after just 30-60 requests per hour.
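As a rough illustration of the formula and the Crawl-delay check above, here is a minimal Python sketch. The target URL, the 100,000-requests-per-day workload, and the 300-requests-per-hour fallback are assumptions chosen for the example, not measured values:

```python
import math
from urllib import robotparser


def tolerated_crawl_delay(site_url, user_agent="*"):
    """Return the Crawl-delay (in seconds) declared in robots.txt, if any."""
    parser = robotparser.RobotFileParser()
    parser.set_url(site_url.rstrip("/") + "/robots.txt")
    parser.read()
    return parser.crawl_delay(user_agent)  # None when no Crawl-delay is set


def proxies_needed(requests_per_day, crawl_rate_per_ip_per_day):
    """Number of proxies = number of access requests / crawl rate."""
    return math.ceil(requests_per_day / crawl_rate_per_ip_per_day)


delay = tolerated_crawl_delay("https://example.com")  # placeholder target site
if delay:
    rate_per_hour = 3600 // delay  # e.g. Crawl-delay: 10 allows ~360 requests/hour
else:
    rate_per_hour = 300            # assumed tolerance for a lenient e-commerce site

print(proxies_needed(100_000, rate_per_hour * 24))  # 100,000 / 7,200 -> 14 proxies
```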
Is just changing my IP address enough?
While rotating IP addresses is the most critical step, sophisticated websites use a technique called browser fingerprinting to identify and block automated scrapers. If a website receives thousands of requests with an identical fingerprint coming from different IP addresses, it’s a clear red flag.
Key elements of a browser fingerprint:
- User-agent: This is a string that identifies your browser, its version, and the operating system you are using.
- HTTP headers: Real browsers send a specific set of headers in a particular order (e.g., Accept, Accept-Language, Upgrade-Insecure-Requests). Your scraper must accurately replicate these headers.
- JavaScript rendering: Modern websites use JavaScript to probe for more detailed information, such as:
- Screen resolution and color depth
- Installed fonts and browser plugins
- Canvas fingerprinting: the site forces your browser to render a hidden image and uses the tiny variations in the output to create a highly unique ID.
Strategy for successful requests:
- Rotate user agents and headers: Just as you rotate proxies, you must rotate user agents. Maintain a list of realistic, modern user-agents and pair each with a consistent set of HTTP headers (see the sketch after this list).
- Match your stack: Ensure your user-agent, headers, and proxy type align. For example, don’t use an iPhone user-agent if your HTTP headers indicate a Windows desktop.
- Use headless browsers: Tools like Selenium, Playwright, or Puppeteer run a real browser in the background, which executes JavaScript and generates a much more convincing, human-like fingerprint.
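To make the rotation strategy concrete, here is a minimal sketch using the Python requests library. The proxy URLs are placeholders and the two browser profiles are illustrative; in practice you would maintain a larger, regularly updated pool of both:

```python
import random

import requests

# Placeholder proxy pool; replace with your own proxy endpoints.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

# Each profile bundles a user-agent with headers that match it, so the
# fingerprint stays internally consistent from request to request.
BROWSER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/124.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Upgrade-Insecure-Requests": "1",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                      "Version/17.4 Safari/605.1.15",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Upgrade-Insecure-Requests": "1",
    },
]


def fetch(url):
    proxy = random.choice(PROXIES)
    headers = random.choice(BROWSER_PROFILES)  # UA and headers rotate together
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )


print(fetch("https://example.com").status_code)
```

Bundling each user-agent with its own header set keeps the claimed browser and the headers it sends from contradicting each other.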
Why do you need proxies for web scraping?
Businesses utilize web scraping to collect data about industries and gain market insights, enabling them to make data-driven decisions and offer data-powered services. Using a proxy server, which acts as an intermediary between your web scraping API and the target website, offers several key advantages:
1. Avoiding IP bans
Business websites set a limit on how quickly their content can be crawled, known as the crawl rate, to prevent scrapers from making too many requests and slowing the site down.
Using a sufficient pool of proxies for scraping enables the crawler to bypass rate limits on the target website by sending access requests from different IP addresses.
2. Bypassing geo-restrictions
Businesses that use website scraping for marketing and sales purposes may want to monitor websites (e.g., competitors) whose offerings target a specific geographical region, so that they can provide appropriate product features and prices.
Using residential proxies with IP addresses from the targeted region enables the crawler to access all the content available in that region. Additionally, requests coming from the same area appear less suspicious, and are therefore less likely to be banned.
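Geo-targeting is usually configured through the proxy provider’s gateway. The sketch below is hypothetical: the gateway hostname and the country-in-username convention are invented for illustration, so check your provider’s documentation for the actual syntax:

```python
import requests

COUNTRY = "us"
# Hypothetical residential gateway; real providers each have their own
# hostname, port, and syntax for embedding the target country.
PROXY = f"http://customer-country-{COUNTRY}:password@residential-gateway.example:7777"

response = requests.get(
    "https://example.com/products",  # placeholder target URL
    proxies={"http": PROXY, "https": PROXY},
    timeout=20,
)
print(response.status_code)
```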
3. Improved performance and scalability
A website cannot tell with certainty whether it is being scraped programmatically. However, the more activity a scraper generates, the more likely that activity is to be tracked.
For example, scrapers may access the same website too quickly, visit it at the same times every day, or request pages that are not reachable through normal navigation, all of which put them at risk of being detected and banned. Proxies provide anonymity and allow you to run more concurrent sessions against the same or different websites.
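As a minimal sketch of how a proxy pool supports concurrent sessions (the proxy endpoints and page URLs below are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder proxy pool and target pages.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
URLS = [f"https://example.com/page/{i}" for i in range(30)]


def fetch(job):
    url, proxy = job
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
    return response.status_code


# Spread the pages across the proxy pool and fetch several of them in parallel.
jobs = [(url, PROXIES[i % len(PROXIES)]) for i, url in enumerate(URLS)]
with ThreadPoolExecutor(max_workers=5) as pool:
    print(list(pool.map(fetch, jobs)))
```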
In-house vs. outsourcing proxy scraping
In-house proxies ensure data privacy and provide complete control to the engineers involved. You can implement your own encryption and security standards on top of the proxy services. However, building an in-house proxy is time-consuming and requires an experienced engineering team to develop and maintain the proxy pool.
By outsourcing your proxy needs, you get reliable, scalable proxy infrastructure on demand from providers whose sole job is to keep it running, allowing you to focus on what you actually want to do: building your product. Therefore, most businesses choose to use off-the-shelf proxy services.
Web scraping proxy types
Here’s a guide to the different types of proxies available for web scraping projects:
1. Based on the IP source
The most significant distinction between proxy types lies in the origin of their IP addresses.
Datacenter proxies
Datacenter proxies are IP addresses created and hosted on servers in data centers. They are not affiliated with Internet Service Providers (ISPs) and are the most common and affordable type of proxy.
- Pros:
- High speed: Hosted on powerful servers with high-speed connections.
- Affordability: Generally, the cheapest paid proxy option, making them ideal for projects with a limited budget.
- Cons:
- Easily detectable: Because their IPs come from data centers, websites with sophisticated anti-bot measures can easily identify and block them.
Residential proxies
Residential proxies route your web scraping traffic through real desktop and mobile devices using ISP-assigned IP addresses, making your requests appear as if they are coming from a genuine user.
- Pros:
- Lower block rate than datacenter proxies: Residential proxies are better suited to difficult-to-crawl websites because they appear as organic traffic, making them harder to detect and block.
- Geo-targeting accuracy: Better for accessing content that is restricted to specific geographic locations.
- Cons:
- Higher cost: They are significantly more expensive than datacenter proxies.
- Slower speeds: Speeds can be less consistent as they depend on the end-user’s home internet connection.
Mobile proxies
Mobile proxies use IP addresses assigned by mobile carriers to individual mobile devices connected to 3G, 4G, or 5G networks.
- Pros:
- Dynamic IPs: Mobile IPs change frequently, further reducing the risk of detection.
- Cons:
- Most expensive: Due to the difficulty in acquiring them, mobile proxies are the costliest option.
ISP proxies (static residential proxies)
ISP proxies combine the best features of datacenter and residential proxies. They are static IP addresses hosted on servers in a data center, but are officially registered with an ISP.
- Pros:
- High stability: They offer the speed and reliability of datacenter proxies.
- Cons:
- High cost: They are an expensive, premium option.
2. Based on usage
This classification focuses on how you access and use the proxy IPs.
Shared vs. dedicated proxies
- Shared proxies: Multiple users share the same pool of IP addresses simultaneously. They are cheaper, but come with the “bad neighbor” effect: another user’s abusive activity can get the shared IPs blacklisted for everyone.
- Dedicated proxies: An IP address is assigned exclusively to a single user. This provides greater speed, security, and a lower risk of being blacklisted, but at a higher cost.
Static vs. rotating proxies
- Static proxies: You are assigned a fixed IP address that remains the same over time. This is particularly useful for tasks that require a consistent identity, such as managing a social media account.
- Rotating proxies: The proxy network automatically assigns a new IP address from a pool for every request or after a set interval. This is ideal for large-scale web scraping.
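As a brief sketch of the difference in practice, assuming client-side rotation over a small placeholder pool (commercial rotating proxies typically expose a single gateway endpoint that changes the exit IP for you):

```python
from itertools import cycle

import requests

# Static: one fixed IP reused for every request (consistent identity).
STATIC_PROXY = "http://user:pass@static-proxy.example.com:8000"
session = requests.Session()
session.proxies.update({"http": STATIC_PROXY, "https": STATIC_PROXY})

# Rotating: a different IP from the pool for each request.
pool = cycle([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
])


def fetch_rotating(url):
    proxy = next(pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
```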