Pre-trained language models like ChatGPT can understand natural language and generate human-like responses, making them an attractive choice for companies. Forbes reported that companies like Meta, Canva, and Shopify already use the technology that powers ChatGPT in their customer service chatbot systems. 1
There have been similar discussions about using ChatGPT for web scraping. Advanced natural language processing models like ChatGPT can significantly improve the efficiency and effectiveness of web scraping processes.
In this article, we explain how ChatGPT is used in web scraping and walk through use cases where combining the two can unlock new opportunities and streamline processes.
How to scrape websites using ChatGPT
In this tutorial, we will extract product data from an e-commerce website using ChatGPT-4.
Scraping Amazon web pages with ChatGPT
As an example, we will target the Amazon search results page for gaming mice. The target web page contains product details such as titles, images, ratings, and prices. If you use a prompt such as “scrape the product price information from this website: [paste the url]”, ChatGPT will not scrape the data. Instead, it will instruct you to write code to extract data from the target website (Figure 1).
Figure 1: Shows how ChatGPT guides you through writing the code for extracting data.
We aim to extract the product titles displayed in the provided image (Figure 2). We must first examine the web page’s structure. To inspect the elements, right-click on any element of interest and select the “Inspect” option from the context menu. This will allow us to analyze the HTML code and locate the required data for web scraping.
Figure 2: Identifying the desired data on the target web page for web scraping
Then we need to identify the desired data and its attributes. The HTML element that corresponds to the data we want to extract is shown in the image below (Figure 3). The element has a “class” attribute, which we will use in our web scraping library.
Figure 3: Demonstrates how to inspect a web page for the desired data and attributes
It is important to identify the target elements you want to scrape and their attributes. This helps ChatGPT understand what information we require and how to locate it on the target website.
The prompt we used to scrape the product titles from the Amazon search results page:
The code generated by ChatGPT for data extraction:
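The generated code itself is not reproduced here. A minimal sketch of what such a script might look like is shown below, assuming Beautiful Soup is installed. The class name `product-title` and the sample HTML are hypothetical placeholders, not Amazon's actual markup; replace them with the tag and class you found while inspecting the page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a saved search results page.
SAMPLE_HTML = """
<div class="result"><span class="product-title"> Wireless Gaming Mouse </span></div>
<div class="result"><span class="product-title">RGB Wired Mouse</span></div>
<div class="result"><span class="price">$19.99</span></div>
"""


def extract_titles(html: str, css_class: str) -> list[str]:
    """Return the text of every <span> whose class attribute includes css_class."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        tag.get_text(strip=True)
        for tag in soup.find_all("span", class_=css_class)
    ]


# "product-title" is an assumed class name used only for illustration.
titles = extract_titles(SAMPLE_HTML, "product-title")
for title in titles:
    print(title)
```

Passing the class name as a parameter keeps the one detail most likely to change on the target site in a single place.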
ChatGPT applications in web scraping
1. Generate code for scraping websites
Keep in mind that the structures and designs of websites may change, which can impact the HTML elements and attributes you’re targeting. In such a scenario, your code may fail to function properly or extract the desired data. You need to monitor and update your scraping code regularly.
For example, you can use the prompt below to extract product description data from a specific Amazon product page.
It is important to note that many websites employ anti-scraping measures to prevent web scraping activities. You must ensure that your web scraping practices adhere to ethical standards. Check the website’s terms of service and robots.txt file before scraping any data.
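The robots.txt check can also be automated. The sketch below uses Python's standard `urllib.robotparser` module to test whether a URL may be fetched under a given set of rules. The rules, the user agent name `my-scraper`, and the example.com URLs are made up for illustration; robots.txt is only one signal, and it does not replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/products"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/page"))
```

To check a live site instead of a string, `RobotFileParser` also provides `set_url()` and `read()`, which download the file before `can_fetch()` is called.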
You can integrate an unblocking technology with your web crawler to enhance your web scraping projects. Bright Data’s Web Unlocker empowers businesses and individuals to collect data from web sources ethically and legally while avoiding anti-scraping measures.
1.1 Provide Python instructions for web scraping
ChatGPT offers step-by-step instructions for scraping data from web sources in various programming languages. In this example, we will use the requests library to fetch the content of a webpage and Beautiful Soup to parse and retrieve the desired data.
- ChatGPT provides the command to install the required libraries. You can run the following command in the terminal to install the libraries for Python.
- You can use the Python code generated by ChatGPT to import requests and Beautiful Soup.
- The requests library allows you to send HTTP requests to the target server and handle the responses. To fetch the content of the product page, run the following Python code, replacing “https://example.com/product-page” with the URL of the target web page:
- After fetching the content of a web page, you need to parse the fetched data to extract the desired data. To parse the fetched data using the Beautiful Soup library:
If you scrape an e-commerce website to extract product data, such as product titles, you must inspect the product page to locate the necessary tags and attributes corresponding to the data.
- To save or print the scraped data, use the code generated by ChatGPT:
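The steps above can be sketched as one short script. This is an illustration under stated assumptions, not the exact code ChatGPT produced: the `.product-title` selector, the function names, and the CSV layout are hypothetical, and the libraries are assumed to be installed with `pip install requests beautifulsoup4`.

```python
import csv

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML, raising on HTTP error statuses."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_titles(html: str) -> list[str]:
    """Collect text from elements carrying the assumed class 'product-title'."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(".product-title")]


def save_titles(titles: list[str], path: str) -> None:
    """Write a header row followed by one title per row to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title"])
        writer.writerows([title] for title in titles)


def scrape(url: str, path: str) -> list[str]:
    """Fetch, parse, and save in sequence; returns the extracted titles."""
    titles = parse_titles(fetch_html(url))
    if not titles:
        # An empty result often means the page layout or class name changed.
        raise ValueError("No titles found; check the selector against the page.")
    save_titles(titles, path)
    return titles
```

Calling `scrape("https://example.com/product-page", "titles.csv")` would run all three steps once the placeholder URL and selector are replaced with real values.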
2. Clean extracted data
Once you’ve scraped data, it’s essential to clean the text to remove irrelevant elements and stopwords such as “the” and “and”. ChatGPT can provide guidance and suggestions on cleaning and formatting collected data.
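As a rough illustration of this kind of cleaning, the sketch below strips leftover HTML tags, lowercases the text, and drops punctuation and stopwords using only Python's standard library. The stopword list is a small hand-picked sample and the function name is hypothetical; real projects typically rely on a fuller list.

```python
import re

# Small illustrative stopword list, not an exhaustive one.
STOPWORDS = {"the", "and", "a", "an", "of", "to", "is", "in"}


def clean_text(raw: str) -> str:
    """Strip HTML tags, lowercase the text, and drop punctuation and stopwords."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    words = re.findall(r"[a-z0-9']+", no_tags.lower())
    return " ".join(word for word in words if word not in STOPWORDS)


print(clean_text("<p>The mouse is light and the battery life is long.</p>"))
# prints: mouse light battery life long
```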
Assume you collected a large amount of data and imported it into Excel. However, you realize that the data is disorganized and messy. For instance, the full names are in column B, and you want to separate the first and last names into two different columns. You can request that ChatGPT provide a formula for separating first and last names.
The formula generated by ChatGPT to extract the first name:
The ChatGPT-generated formula to extract the last name:
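The Excel formulas are not reproduced here. If the data is processed in Python rather than Excel, the same split can be sketched as below, assuming that the first word is the first name and everything after the first space is the last name. The function name and the sample names are made up for illustration.

```python
def split_full_name(full_name: str) -> tuple[str, str]:
    """Split on the first run of whitespace: (first name, remainder)."""
    parts = full_name.strip().split(maxsplit=1)
    if not parts:
        return "", ""
    first = parts[0]
    last = parts[1] if len(parts) > 1 else ""
    return first, last


# Sample values standing in for column B of the spreadsheet.
column_b = ["Ada Lovelace", "Grace Brewster Hopper", "Plato"]
for name in column_b:
    print(split_full_name(name))
```

Names with middle names or a single word do not fit the two-column assumption cleanly, so the results for such rows are worth checking by hand.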
3. Process extracted data
3.1 Conduct sentiment analysis
ChatGPT can perform sentiment analysis on scraped data to generate interpretable insights from unstructured text data. Assume you scraped social mentions of your brand from a social media platform to analyze your audience growth. After you have collected and cleaned the data, you can instruct ChatGPT to analyze the text and label it as negative, neutral, or positive (Figure 4).
Figure 4: Demonstrates the process of analyzing and labeling a sample text document
Here’s an example of how you can instruct ChatGPT to perform sentiment analysis:
“Analyze the sentiment of the text: ‘The battery life is also long’.”
ChatGPT’s response to our query:
Note that the accuracy of sentiment analysis can vary depending on factors such as the complexity of the text and how heavily its meaning depends on context.
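When many scraped texts need labeling, the same instruction can be generated programmatically and each reply reduced to a single label. The sketch below only builds the prompt and parses a reply; it does not call any API. The helper names and the sample reply are hypothetical and stand in for whatever the model actually returns.

```python
LABELS = ("negative", "neutral", "positive")


def build_sentiment_prompt(text: str) -> str:
    """Compose a prompt that asks for exactly one of the known labels."""
    return (
        "Analyze the sentiment of the text and answer with one word "
        f"({', '.join(LABELS)}): '{text}'"
    )


def parse_label(reply: str) -> str:
    """Return the known label that appears earliest in the reply, or 'unknown'."""
    lowered = reply.lower()
    found = [(lowered.find(label), label) for label in LABELS if label in lowered]
    return min(found)[1] if found else "unknown"


prompt = build_sentiment_prompt("The battery life is also long")
# A made-up reply used in place of a real model response.
label = parse_label("The sentiment of this text is positive.")
print(prompt)
print(label)
```

Mapping anything unexpected to `unknown` makes it easy to route unclear replies to manual review instead of silently mislabeling them.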
3.2 Categorize scraped content
ChatGPT can help categorize scraped data into predefined categories. You can define the categories you want to classify the content into. Here is an example of categorizing content using ChatGPT:
As an example, we want to categorize the following content:
The following is the output for categorizing scraped data with ChatGPT:
- Web Scraping Tools: Data-driven Benchmarking
- 7 Web Scraping Best Practices You Must Be Aware of
- A Comprehensive Guide to Web Scraping Techniques
For guidance on choosing the right tool, check out the data-driven list of web scrapers.