
Amazon Scraping: Extract Product Data with Proxies & Python

Gulbahar Karatas
updated on Oct 28, 2025

Traditional scraping techniques with tools like Selenium, BeautifulSoup, or headless browsers such as Undetected ChromeDriver are frequently blocked by CAPTCHAs and IP bans, or broken by changes to page structure.

In this tutorial, we’ll extract data from Amazon pages using a review scraper API and proxies instead of scraping the pages directly.

  1. Use proxies to search Google and locate Amazon product URLs.
  2. Send those URLs to the Amazon Reviews API, which handles large-scale scraping.
  3. Extract structured data, including product ratings, review texts, author names, and other details.

How to build an Amazon review scraper in Python

Traditional scraping methods like Selenium, BeautifulSoup, and Undetected ChromeDriver often fail here: Amazon’s strict anti-bot systems trigger CAPTCHAs and IP bans. In our early tests, all of these common approaches were detected and blocked by Amazon, making direct scraping unreliable.

To overcome this, we combined Google Search scraping (via Bright Data’s proxy network) to identify Amazon product URLs, which we then fed into the Amazon reviews dataset to extract structured, bulk review data efficiently.

For example, using the keyword “headphones,” we search for relevant product links, submit those to the API, and retrieve the associated reviews in a single, streamlined process.

Step 1: Imports and configuration

The following section includes all necessary imports and setup for your Amazon review scraper. Make sure to replace the placeholders (YOUR_API_TOKEN, YOUR_DATASET_ID, and proxy credentials) with your account information.
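A minimal sketch of that setup is shown below; the proxy host, port, and credential format are illustrative, so use the values from your own Bright Data zone.

```python
import csv
import re
import time
import urllib.parse
from collections import defaultdict

import requests

# Bright Data API credentials -- replace with your own values.
API_TOKEN = "YOUR_API_TOKEN"
AMAZON_REVIEWS_DATASET_ID = "YOUR_DATASET_ID"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

# Residential proxy settings (illustrative host and port; copy yours
# from the Bright Data dashboard).
PROXY_USER = "YOUR_PROXY_USER"
PROXY_PASS = "YOUR_PROXY_PASS"
PROXY_HOST = "brd.superproxy.io:33335"
PROXIES = {
    "http": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}",
    "https": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}",
}
```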

The API_TOKEN allows targeted access to the reviews dataset using the AMAZON_REVIEWS_DATASET_ID. The proxy settings redirect requests through residential IP addresses to prevent bot detection.

Step 2: Define the Google search function

Here’s the function that queries Google for Amazon product links:
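(A minimal sketch, assuming the configuration from Step 1; the Chrome User-Agent string is illustrative.)

```python
def search_amazon_products(keyword, start=0):
    """Search Google for Amazon product pages matching the keyword."""
    query = f"site:amazon.com/dp {keyword}"
    url = (
        "https://www.google.com/search"
        f"?q={urllib.parse.quote(query)}&num=20&start={start}"
    )
    headers = {
        # Present the request as coming from a real Chrome browser.
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }
    try:
        response = requests.get(url, headers=headers, proxies=PROXIES, timeout=30)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        # Network or HTTP errors are reported instead of crashing the script.
        print(f"Search request failed: {e}")
        return ""
```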

  • site:amazon.com/dp: This query limits Google’s search results to Amazon product pages only. Every Amazon product page includes /dp/ followed by a 10-character ASIN (Amazon Standard Identification Number).
  • urllib.parse.quote(): Encodes special characters in the query, ensuring a valid Google search URL.
  • num=20: Requests 20 results per page for efficient scraping.
  • start parameter: Used for pagination.
  • User-Agent header: Pretends to be a real Chrome browser to help avoid detection as a bot.
  • Error handling: The try…except block ensures that any network or parsing error won’t crash the script.

Step 3: Define the URL extraction function

This function parses HTML from Google Search and extracts valid Amazon URLs:
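(A minimal sketch; the seen_asins set is passed in so duplicates are skipped across multiple result pages.)

```python
def extract_amazon_urls(html, seen_asins):
    """Pull unique Amazon product URLs out of Google search result HTML."""
    cleaned_urls = []
    # Each product URL contains /dp/ followed by a 10-character ASIN.
    for asin in re.findall(r"/dp/([A-Z0-9]{10})", html):
        if asin not in seen_asins:
            seen_asins.add(asin)
            cleaned_urls.append(f"https://www.amazon.com/dp/{asin}")
    return cleaned_urls
```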

Here’s the breakdown:

  • re.findall(r'/dp/([A-Z0-9]{10})', html): searches for any URL segment containing /dp/ followed by 10 alphanumeric characters, which represent the ASIN (Amazon Standard Identification Number).
  • seen_asins: stores already extracted ASINs to avoid duplicates.
  • cleaned_urls: builds a clean list of unique Amazon product URLs in the format https://www.amazon.com/dp/ASIN.

From a messy URL like the following (the product path and query parameters here are illustrative):
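```
https://www.amazon.com/Wireless-Noise-Cancelling-Headphones/dp/B0BZRTHN8B/ref=sr_1_1?crid=EXAMPLE&keywords=headphones&qid=1700000000&sr=8-1
```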

The regex extracts only the ASIN (B0BZRTHN8B) and rebuilds a clean, standardized URL:
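```
https://www.amazon.com/dp/B0BZRTHN8B
```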

Step 4: Create the scraping function

Below is the Python function that initiates the scraping job on the API:
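(A sketch assuming Bright Data’s v3 dataset trigger endpoint; check the current API reference for the exact parameters.)

```python
def trigger_brightdata_scraping(urls):
    """Start an asynchronous scraping job and return its snapshot_id."""
    endpoint = "https://api.brightdata.com/datasets/v3/trigger"
    params = {
        "dataset_id": AMAZON_REVIEWS_DATASET_ID,
        "include_errors": "true",  # return partial results if some URLs fail
    }
    # The API accepts only the "url" field per record.
    payload = [{"url": url} for url in urls]
    response = requests.post(endpoint, headers=HEADERS, params=params, json=payload)
    response.raise_for_status()
    snapshot_id = response.json().get("snapshot_id")
    print(f"Job started, snapshot_id: {snapshot_id}")
    return snapshot_id
```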

This asynchronous function starts the job without waiting for results. We POST the Amazon URLs to the endpoint and specify the dataset. Bright Data responds with a snapshot_id, which acts like a tracking number for your scraping job.

The include_errors parameter allows retrieving results even if some products fail to scrape, rather than failing the entire job. Important: the request body must be in the format [{"url": "…"}, {"url": "…"}]. Initially, we attempted to include parameters such as max_reviews; however, the API rejected them because it accepts only the url field.

Step 5: Poll scraping results

Here’s the Python function to check job status and fetch your scraped Amazon data:
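(A sketch assuming the v3 progress and snapshot endpoints and a "ready" status value; adjust to the current API reference if those differ.)

```python
def get_scraped_data(snapshot_id, max_wait_minutes=10):
    """Poll the job status, then download the snapshot once it is ready."""
    progress_url = f"https://api.brightdata.com/datasets/v3/progress/{snapshot_id}"
    snapshot_url = f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}"
    deadline = time.time() + max_wait_minutes * 60

    while time.time() < deadline:
        status = requests.get(progress_url, headers=HEADERS).json().get("status")
        if status == "ready":
            # Download the finished dataset as JSON (the API also serves NDJSON).
            result = requests.get(snapshot_url, headers=HEADERS,
                                  params={"format": "json"})
            result.raise_for_status()
            return result.json()
        if status == "failed":
            print("Scraping job failed.")
            return None
        print(f"Status: {status}, checking again in 10 seconds...")
        time.sleep(10)

    # Fail-safe timeout so the script exits gracefully on long jobs.
    print(f"Job did not finish within {max_wait_minutes} minutes.")
    return None
```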

Bright Data scraping jobs usually take 30 seconds to 5 minutes, depending on the number of Amazon reviews per product. Polling is the safest way to:

  • Minimize unnecessary API calls during wait times.
  • Automatically detect when a job is completed.
  • Prevent timeouts or hanging processes during the scraping of large datasets.

The max_wait_minutes parameter serves as a fail-safe timeout, ensuring your script exits gracefully if the job exceeds the specified duration.

Step 6: Display scraped reviews by product

Here’s the function that neatly displays your Amazon review results, grouped by product:
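(A sketch; field names such as product_name, review_rating, review_header, and review_text are assumptions, so map them to the columns your dataset actually returns.)

```python
def display_results(reviews):
    """Print a summary of scraped reviews, grouped by product URL."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review.get("url", "unknown")].append(review)

    print(f"Collected {len(reviews)} reviews across {len(grouped)} products\n")

    for url, product_reviews in grouped.items():
        # Pull the ASIN out of the URL for a compact label.
        match = re.search(r"/dp/([A-Z0-9]{10})", url)
        asin = match.group(1) if match else "unknown"
        print(f"Product {asin} ({len(product_reviews)} reviews)")

        for review in product_reviews[:5]:  # preview up to five reviews
            print(f"  Product: {review.get('product_name', 'N/A')}")
            print(f"  Rating:  {review.get('review_rating', 'N/A')}")
            print(f"  Title:   {review.get('review_header', 'N/A')}")
            text = str(review.get("review_text", ""))
            print(f"  Text:    {text[:100]}")  # first 100 characters
        if len(product_reviews) > 5:
            print(f"  ...and {len(product_reviews) - 5} more reviews")
```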

After scraping, reviews are stored as dictionaries, each representing one review. The function organizes these reviews by Amazon product URL using Python’s defaultdict.

Here’s what happens step by step:

  • Group by product: Reviews are organized by product URL, so you can view all reviews for each ASIN (Amazon’s product ID).
  • Extract the ASIN: The regex r'/dp/([A-Z0-9]{10})' makes it simple to pick out the ASIN from each URL, helping you refer to them more easily.
  • Summarize results: Displays the total number of reviews collected and the number of unique products scraped.
  • Preview reviews: Displays up to five sample reviews per product, showing:
    • Product Name
    • Rating
    • Review Title (Header)
    • A short snippet of the review text (first 100 characters for readability).
  • Handle long review lists: If a product has more than 5 reviews, it notes the number of extra reviews.

Step 7: Save reviews to a CSV file

Here’s the Python function that organizes and exports your scraped reviews into a structured CSV:
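(A sketch using the same assumed field names as Step 6.)

```python
def save_reviews_csv(reviews, filename="amazon_reviews.csv"):
    """Flatten the review records and write them to a CSV file."""
    data = []
    for review in reviews:
        url = review.get("url", "")
        match = re.search(r"/dp/([A-Z0-9]{10})", url)
        data.append({
            "asin": match.group(1) if match else "N/A",
            "url": url,
            "product_name": review.get("product_name", "N/A"),
            "product_rating": review.get("product_rating", "N/A"),  # overall
            "review_rating": review.get("review_rating", "N/A"),    # individual
            "review_header": review.get("review_header", "N/A"),
            "review_text": review.get("review_text", "N/A"),
            "author_name": review.get("author_name", "N/A"),
        })

    if not data:
        print("No reviews to save.")
        return

    # utf-8-sig writes a BOM so Excel renders special characters correctly.
    with open(filename, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} reviews to {filename}")
```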

This function converts the raw data into a clean CSV for easy analysis. It extracts the ASIN for filtering and grouping. There are two rating columns: product_rating (overall) and review_rating (individual). encoding='utf-8-sig' ensures special characters are saved correctly, adding a BOM for Excel compatibility. The .get('field', 'N/A') pattern assigns 'N/A' when a field is missing, avoiding KeyError exceptions.

Step 8: Bringing everything together

Here’s the main function that coordinates the entire scraping process:
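(A sketch wiring together the functions from the previous steps; the constants and error handling follow the description below.)

```python
def main():
    KEYWORD = "headphones"
    NUM_PRODUCTS = 10

    # Steps 2-3: collect unique Amazon product URLs from Google.
    product_urls = set()
    seen_asins = set()
    start = 0
    while len(product_urls) < NUM_PRODUCTS and start < 100:
        html = search_amazon_products(KEYWORD, start=start)
        if not html:
            break
        product_urls.update(extract_amazon_urls(html, seen_asins))
        start += 10    # move to the next page of results
        time.sleep(2)  # throttle requests to Google

    urls = list(product_urls)[:NUM_PRODUCTS]
    if not urls:
        print("No product URLs found.")
        return

    # Steps 4-5: trigger the Bright Data job, then poll until it finishes.
    try:
        snapshot_id = trigger_brightdata_scraping(urls)
        reviews = get_scraped_data(snapshot_id)
    except requests.RequestException as e:
        print(f"API request failed: {e}")
        return
    if not reviews:
        print("No reviews returned.")
        return

    # Steps 6-7: display a preview and export everything to CSV.
    display_results(reviews)
    save_reviews_csv(reviews)


if __name__ == "__main__":
    main()
```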

Let’s go through the process step-by-step to clarify what happens behind the scenes:

  1. Set the parameters: The KEYWORD variable specifies the product category to search for (e.g., “wireless headphones”, “gaming mouse”, “Bluetooth speakers”), and the NUM_PRODUCTS variable specifies the number of Amazon products to gather data from.
  2. Search Google for Amazon URLs: The script repeatedly executes search_amazon_products(), incrementing the start parameter by 10 each time to fetch additional search results. The obtained results are then processed through extract_amazon_urls() to isolate clean Amazon product URLs.
  3. Prevent duplicates with sets: Using set() automatically skips duplicate product URLs, even if Google returns the same ASIN multiple times.
  4. Throttle requests: time.sleep(2) avoids overloading Google and prevents blocks, a best practice for ethical scraping.
  5. Send URLs to Bright Data: After gathering enough product URLs, the script batches and sends them to Bright Data’s API via trigger_brightdata_scraping(). Bright Data manages the entire scraping process, bypassing CAPTCHAs, JavaScript hurdles, and Amazon’s anti-bot protections.
  6. Poll for results: The get_scraped_data() function checks the job status every 10 seconds until Bright Data returns the completed dataset. Once ready, the reviews are retrieved in a structured NDJSON format.
  7. Display and save reviews: The results are shown in a clear, easy-to-read format using display_results(), and then saved to a CSV file with save_reviews_csv() for further analysis.
  8. Error handling: Each step fails gracefully. If any function fails (e.g., due to missing data or a connection timeout), the script logs the error and exits rather than crashing.

Example output

Once the script runs, it fetches Amazon reviews and displays them in a structured format. Below is a sample output from save_reviews_csv() with the selected fields.

Available data fields from Bright Data Amazon Reviews API

The Amazon scraper API offers a wide range of data fields that can be extracted. Below is the complete list of available columns along with their descriptions:

To add more columns, include them in the data.append() dictionary within the save_reviews_csv() function.
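For example, to keep an extra column (the field name review_posted_date below is hypothetical; use a name from the dataset’s field list), extend the dictionary built in Step 7:

```python
data.append({
    # ...the existing fields from Step 7...
    "review_posted_date": review.get("review_posted_date", "N/A"),  # hypothetical field
})
```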

How to use proxies for scraping Amazon product data

Amazon employs advanced anti-bot systems that swiftly detect web scrapers. To avoid being blocked, using proxies, preferably rotating proxies, is crucial. This tutorial illustrates how to integrate Bright Data’s residential proxies to conceal the scraper’s actual IP address.

Every request to Google or Amazon is routed through these residential IPs to bypass CAPTCHAs and rate limits. The proxy configuration in Python is shown below.
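(A minimal sketch of that configuration; the gateway host, port, and credential format are illustrative, so copy the real values from your proxy dashboard.)

```python
import requests

# Route both HTTP and HTTPS traffic through the authenticated
# residential gateway (illustrative host and port).
proxies = {
    "http": "http://YOUR_PROXY_USER:YOUR_PROXY_PASS@brd.superproxy.io:33335",
    "https": "http://YOUR_PROXY_USER:YOUR_PROXY_PASS@brd.superproxy.io:33335",
}

# Quick sanity check: httpbin echoes the IP it sees, which should be
# the proxy's IP rather than your own.
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
print(response.json())
```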

Does Amazon allow web scraping?

No, Amazon’s Terms of Service forbid unauthorized scraping of its site. The platform employs robust anti-scraping measures, including CAPTCHA, dynamic content loading, and rate limiting.

However, this tutorial sidesteps direct scraping by using Bright Data’s Amazon Reviews dataset. This compliant and scalable option offers structured product and review data via an API, ensuring both reliability and legal safety.

Understanding Amazon’s product page structure

We will distinguish between two principal types of pages essential for data extraction: Listing Pages (Search/Category Results) and Product Detail Pages (PDPs).

1. Listing pages (search or category results)

These pages are helpful for broad data collection. Data available from listing pages includes:

  • Thumbnail images of products
  • Product titles
  • Ratings and number of reviews
  • Product price
  • Links to the product pages (often containing an ASIN)

You can manually locate data points on Amazon. Here is an example:

  1. Right-click the element you want and select “Inspect” (or press Ctrl+Shift+I).
  2. Highlight the element container to identify the relevant HTML tags and classes for scraping.

Note: Amazon’s search results display both organic and sponsored products, which may have slightly different HTML structures. If your scraper targets only the typical layout of organic listings, sponsored products might be skipped.

2. Product detail pages (PDPs)

Follow the same steps as on the listing pages: open the page, right-click the data you want, choose Inspect, and analyze the HTML to find the relevant tags and attributes.

Challenges of scraping Amazon

Scraping Amazon is tricky because of:

  • CAPTCHAs triggered by high request volumes
  • A dynamic HTML structure that changes frequently
  • IP bans after repeated requests from the same address
  • JavaScript-rendered content that requires sophisticated browser automation

Previous attempts with Selenium, BeautifulSoup, and Undetected ChromeDriver all failed because of Amazon’s strong anti-bot measures. The breakthrough came from using the dataset API, which avoids these issues altogether.

