ChatGPT offers an easy way to bring AI into web scraping, sparing developers much of the manual parsing work that otherwise needs constant updating. Using LLMs is becoming a common web scraping practice.
Below, we show how ChatGPT is used in web scraping, including use cases where combining the two makes data collection easier:
How to scrape websites using ChatGPT
In this tutorial, we will extract product data from an e-commerce website using OpenAI’s GPT-4.
1. Load the HTML File:
Choose the target website you want to extract data from. Press Ctrl + S (or Cmd + S on macOS) to save the page as HTML. If you’d like to automate this step, the following example prompt asks ChatGPT to generate Python code that saves the HTML from the provided URL.
Example Prompt to ChatGPT:
“Please provide a Python script that automates the process of saving an HTML page from the following URL: https://www.walmart.com/browse/electronics/gaming-mouse/3944_1089430_132959_1008621_4842284_9144425 The script should send a GET request to the page, retrieve the HTML content, and save it to a file named walmart_gaming_mouse.html.”
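A script produced from a prompt like this might look roughly like the sketch below. It is a minimal illustration rather than ChatGPT's actual output: it uses Python's standard `urllib` so it runs without extra installs, and the `User-Agent` value and the function names `fetch_html` and `save_html` are assumptions made for this example.

```python
import urllib.request
from pathlib import Path

# Hypothetical browser-like header; some sites reject Python's default user agent.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Send a GET request and return the response body as text."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_html(html: str, path: str) -> Path:
    """Write the HTML text to disk and return the file path."""
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target
```

Calling `save_html(fetch_html(url), "walmart_gaming_mouse.html")` with the URL from the prompt would perform both steps. Large retail sites may answer a plain script with a bot-check page instead of the product listing, so it is worth opening the saved file to confirm it contains the expected content.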

2. Inspecting the structure of the HTML:
Once you have saved the HTML file from the target page, drag and drop it into ChatGPT.
Example Prompt to ChatGPT:
“Please provide a Python script that automates the inspection of the HTML structure from the file walmart_gaming_mouse.html to identify the correct HTML tags and classes that contain the product name, price, and product link. The script should load the saved HTML file, find the elements that contain product names, prices, and links, and print the relevant tag names, classes, and text content.”
Example Python script to automate inspection:
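One way to approach the inspection is to count how often each tag and class combination occurs, since elements that repeat once per product are good candidates for names, prices, and links. The sketch below takes that approach with the standard library's `html.parser`; it is an illustration under that assumption, not the script ChatGPT returned, and the names `TagClassCounter` and `inspect_html` are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Count how often each (tag, class) combination appears in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # An element may carry several classes; count each one separately.
        class_value = dict(attrs).get("class") or ""
        for class_name in class_value.split():
            self.counts[(tag, class_name)] += 1


def inspect_html(html: str, top: int = 20):
    """Return the most frequent (tag, class) pairs, most common first."""
    parser = TagClassCounter()
    parser.feed(html)
    return parser.counts.most_common(top)
```

Feeding it the contents of `walmart_gaming_mouse.html` and printing the result gives a short list of repeated elements to check by hand against the page.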

3. Parsing Data from the HTML:
Example Prompt to ChatGPT:
“Please provide a Python script that automates the parsing of the HTML file walmart_gaming_mouse.html to extract product details such as product name, price, and link. The script should parse the HTML, extract the required details for each gaming mouse on the page, and store them in a structured format such as a CSV file.”
Python script for parsing the data:
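The sketch below shows what such a parsing step could look like. The class names `product-link`, `product-name`, and `product-price` are hypothetical placeholders, not Walmart's real markup, and should be replaced with whatever the inspection step reveals. It also assumes each product's link appears before its name and price, and that the name and price elements contain plain text without nested tags.

```python
from html.parser import HTMLParser

# Hypothetical class names; replace them with the ones found during inspection.
FIELD_CLASSES = {"product-name": "name", "product-price": "price"}
LINK_CLASS = "product-link"


class ProductParser(HTMLParser):
    """Collect name, price, and link for each product on a page."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and LINK_CLASS in classes:
            # Assumption: each product starts with its link element.
            self.products.append(
                {"name": "", "price": "", "link": attrs.get("href") or ""}
            )
        for class_name in classes:
            if class_name in FIELD_CLASSES and self.products:
                self._field = FIELD_CLASSES[class_name]

    def handle_data(self, data):
        if self._field:
            self.products[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        # Assumption: field elements hold plain text, so any end tag closes them.
        self._field = None


def parse_products(html: str):
    """Return a list of dicts with name, price, and link keys."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products
```

Products with a missing price keep an empty string in that field, which makes gaps easy to spot later.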

4. Storing or Displaying the Data:
Example Prompt to ChatGPT:
“Please provide a Python script that stores the parsed product details from the walmart_gaming_mouse.html file into a structured format like CSV. The script should extract the product name, price, and link, and save them to a CSV file named gaming_mouse_products.csv. Additionally, the script should display a confirmation message once the data is saved.”
Python script for storing or displaying the data:
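A storage step along these lines can be sketched with Python's built-in `csv` module. The example assumes the parsed products are already available as a list of dictionaries with `name`, `price`, and `link` keys; the function name `save_products_csv` is an assumption for this illustration.

```python
import csv

FIELDNAMES = ["name", "price", "link"]


def save_products_csv(products, path="gaming_mouse_products.csv"):
    """Write product dicts to a CSV file and print a confirmation message."""
    # newline="" prevents blank lines between rows on Windows.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(products)
    print(f"Saved {len(products)} products to {path}")
    return len(products)
```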

Using ChatGPT as an XPath tool
ChatGPT can help you extract specific elements from the target page using XPath expressions. When you ask ChatGPT how to utilize XPath to extract data, you need to:
- Inspect the HTML structure first.
- Handle edge cases such as missing data or JavaScript-generated content.
- Use flexible XPath expressions to account for small differences in the HTML.
Prompt:
“How can I use XPath to extract all product names, prices, and links from this HTML file?”
ChatGPT Response:
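An answer to this prompt typically pairs a few XPath expressions with code that applies them. The sketch below illustrates the idea with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; for real pages a full XPath engine such as the third-party `lxml` package is the usual choice. The sample markup and its class names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a saved product page.
SAMPLE = """
<html><body>
  <div class="product">
    <a class="link" href="/ip/mouse-1"><span class="name">Mouse One</span></a>
    <span class="price">$19.99</span>
  </div>
  <div class="product">
    <a class="link" href="/ip/mouse-2"><span class="name">Mouse Two</span></a>
  </div>
</body></html>
"""


def extract_products(markup: str):
    """Apply XPath expressions per product, tolerating missing fields."""
    root = ET.fromstring(markup.strip())
    products = []
    for card in root.findall(".//div[@class='product']"):
        link = card.find(".//a[@class='link']")
        products.append({
            # findtext returns the default when the element is absent.
            "name": card.findtext(".//span[@class='name']", default=""),
            "price": card.findtext(".//span[@class='price']", default=""),
            "link": link.get("href", "") if link is not None else "",
        })
    return products
```

The second product has no price element, so its `price` comes back as an empty string instead of raising an error. The predicate `[@class='product']` only matches an exact attribute value; with a full XPath 1.0 engine, `contains(@class, 'product')` is a more flexible alternative when elements carry several classes.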

ChatGPT applications in web scraping
1. Generate code for scraping websites
Language models like ChatGPT can help developers generate code snippets in their preferred programming language and library for web scraping tasks.
Keep in mind that the structures and designs of websites may change, which can impact the HTML elements and attributes you’re targeting. In such a scenario, your code may fail to function properly or extract the desired data. You need to monitor and update your scraping code regularly.
For example, you can use the prompt below to extract product description data from a specific Amazon product page.

Many websites employ anti-scraping measures to prevent automated data collection. Make sure your data collection practices adhere to ethical standards, and check the website’s terms of service and robots.txt file before scraping any data.
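The robots.txt check can be automated with Python's built-in `urllib.robotparser`. The sketch below parses robots.txt content that is passed in as text, so it can be tried without network access; the rules in `EXAMPLE_ROBOTS` and the helper name `is_allowed` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
Allow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For a live site, `RobotFileParser` can also download the file itself through its `set_url` and `read` methods. Passing this check does not replace reading the site's terms of service.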
Residential proxies and web unblockers are highly effective for bypassing stringent anti-bot defenses. Unlike datacenter proxies, residential proxies use IP addresses provided by actual Internet Service Providers (ISPs), making them appear more authentic.
Depending on your requirements, you can use a residential proxy with your web scraping tool, employ unblocker technology, or combine both.

1.1 Provide Python instructions for web scraping
ChatGPT offers step-by-step instructions for scraping data from web sources in various programming languages. In this example, we will use the requests library to fetch the content of a webpage and Beautiful Soup to parse and retrieve the desired data.
- ChatGPT provides the command to install the required libraries. You can run the following command to install the libraries for Python:

- You can use the Python code generated by ChatGPT to import requests and Beautiful Soup.

- The requests library lets you send HTTP requests to the target server and handle the responses. To fetch the content of the product page, run the following code, replacing “https://example.com/product-page” with the URL of the target page:

- After fetching the page, you need to parse the response to extract the desired data. To parse it with the Beautiful Soup library:

If you scrape an e-commerce website to extract product data, such as product titles, you must inspect the product page to locate the tags and attributes that hold that data.
- To save or print the scraped data, use the code generated by ChatGPT:
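The steps above can be sketched end to end as follows. This is an illustrative outline, not ChatGPT's exact output: the tag `h2` and the class `product-title` are hypothetical and must be replaced with the ones found by inspecting the real page, and the function names are assumptions for this example.

```python
# Install the libraries first (run in a terminal):
#   pip install requests beautifulsoup4

import requests
from bs4 import BeautifulSoup


def fetch_page(url: str) -> str:
    """Fetch a page and return its HTML, raising on HTTP error statuses."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def extract_titles(html: str, tag: str = "h2", css_class: str = "product-title"):
    """Return the text of every matching element.

    The default tag and class are hypothetical placeholders.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(tag, class_=css_class)]
```

To print the results, call `extract_titles(fetch_page("https://example.com/product-page"))` with the real URL and loop over the returned list.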

2. Clean extracted data
Once you’ve scraped data, it’s essential to clean the text by removing irrelevant elements and stopwords such as “the” and “and”. ChatGPT can provide guidance and suggestions on cleaning and formatting collected data.
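As a rough illustration of this kind of cleaning, the sketch below strips leftover HTML tags and punctuation and then drops stopwords. The stopword list is deliberately tiny and chosen for this example; real projects usually rely on a fuller list.

```python
import re

# A tiny illustrative stopword list, not a complete one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def clean_text(raw: str) -> str:
    """Lowercase the text and remove tags, punctuation, and stopwords."""
    text = re.sub(r"<[^>]+>", " ", raw)           # drop leftover HTML tags
    text = re.sub(r"[^\w\s]", " ", text.lower())  # drop punctuation
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)
```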
Assume you collected a large amount of data and imported it into Excel. However, you realize that the data is disorganized and messy. For instance, the full names are in column B, and you want to separate the first and last names into two different columns. You can request that ChatGPT provide a formula for separating first and last names.
The formula generated by ChatGPT to extract the first name:

The ChatGPT-generated formula to extract the last name:
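Outside Excel, the same first-name and last-name split can be sketched in Python. The example assumes names are separated by spaces and treats everything after the first space as the last name; the function name `split_full_name` is made up for this illustration.

```python
def split_full_name(full_name: str):
    """Split a full name into (first, last) at the first space."""
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()
```

A single-word name yields an empty last name, so such rows can be flagged for manual review.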


3. Process extracted data
3.1 Conduct sentiment analysis
ChatGPT can perform sentiment analysis on scraped data to generate interpretable insights from unstructured text. Assume you scraped social mentions of your brand from a social media platform to analyze your audience growth. After you have collected and cleaned the data, you can instruct ChatGPT to analyze the text and label it as negative, neutral, or positive (Figure 4).
Figure 4: The process of analyzing and labeling a sample text document

Here’s an example of how you can instruct ChatGPT to perform sentiment analysis:
“Analyze the sentiment of the text: ‘The battery life is also long’.”
ChatGPT’s response to our query:

Note that the accuracy of sentiment analysis can vary with factors such as the complexity of the text and how much its meaning depends on context.
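When many texts need labeling, the same instruction can be wrapped in code. The sketch below builds the prompt and normalizes the reply to one of the three labels. The `ask_model` parameter is a hypothetical callable that sends a prompt to a language model and returns its reply as text; no specific API is assumed, and the prompt wording is only an example.

```python
import re
from typing import Callable, Iterable

LABEL_PATTERN = re.compile(r"\b(negative|neutral|positive)\b")


def build_sentiment_prompt(text: str) -> str:
    """Build a prompt asking for a one-word sentiment label."""
    return (
        "Classify the sentiment of the following text as negative, neutral, "
        f"or positive. Reply with one word only.\n\nText: {text!r}"
    )


def parse_label(reply: str) -> str:
    """Return the first label mentioned in the reply, or 'unknown'."""
    match = LABEL_PATTERN.search(reply.lower())
    return match.group(1) if match else "unknown"


def label_texts(texts: Iterable[str], ask_model: Callable[[str], str]):
    """Label each text using ask_model, a hypothetical prompt-to-reply function."""
    return [
        (text, parse_label(ask_model(build_sentiment_prompt(text))))
        for text in texts
    ]
```

Replies that mention none of the three labels are marked `unknown` so they can be reviewed by hand instead of being silently mislabeled.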
3.2 Categorize scraped content
ChatGPT can help sort scraped data into categories that you define in advance.
As an example, we want to categorize the following content:

The following is the output for categorizing scraped data with ChatGPT:

What are other applications of ChatGPT?
Pre-trained language models like ChatGPT can understand natural language and generate human-like responses. Forbes reported that companies like Meta, Canva, and Shopify already use the technology that powers ChatGPT in their customer service chatbot systems. 1