AIMultiple Research

ChatGPT Web Scraping in 2024: Tips & Applications

Pre-trained language models like ChatGPT can understand natural language and generate human-like responses, making them an attractive choice for companies. Forbes reported that companies like Meta, Canva, and Shopify already use the technology that powers ChatGPT in their customer service chatbot systems. 1

There have been similar discussions about using ChatGPT for web scraping. Advanced natural language processing models like ChatGPT can significantly improve the efficiency and effectiveness of web scraping processes.

In this article, we will discuss how ChatGPT is used in web scraping and explore use cases where combining web scraping and ChatGPT can unlock new opportunities and streamline processes.

How to scrape websites using ChatGPT

In this tutorial, we will extract product data from an e-commerce website using ChatGPT-4.

Scraping Amazon web pages with ChatGPT

As an example, we will target the Amazon search results page for gaming mice. The target web page contains product details such as titles, images, ratings, and prices. If you use a prompt such as “scrape the product price information from this website: [paste the url]”, ChatGPT will not scrape the data. It will instead instruct you to write code to extract data from the target website (Figure 1).

Figure 1: Shows how ChatGPT guides you through writing the code for extracting data.


We aim to extract the product titles displayed in the provided image (Figure 2). We must first examine the web page’s structure. To inspect the elements, right-click on any element of interest and select the “Inspect” option from the context menu. This will allow us to analyze the HTML code and locate the required data for web scraping.

Figure 2: Identifying the desired data on the target web page for web scraping

Then we need to identify the desired data and its attributes. The image below (Figure 3) shows the HTML element that corresponds to the data we want to extract. The element has a “class” attribute, which we will use in our web scraping library.

Figure 3: Demonstrates how to inspect a web page for the desired data and attributes

By inspecting the source of the target web page, you can identify the desired data and its attributes for web scraping.

It is important to identify the target elements you want to scrape and their attributes. This helps ChatGPT understand what information we require and how to locate it on the target website.

The prompt we used to scrape the product titles from the Amazon search results page:

The code generated by ChatGPT for data extraction:
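The generated code is not reproduced in this text. As a rough illustration only, a script of this kind might resemble the sketch below. It assumes Beautiful Soup is installed, and both the class name `product-title` and the sample HTML are hypothetical placeholders, not Amazon's actual markup; substitute the class you found while inspecting the page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a fetched search results page.
SAMPLE_HTML = """
<div class="result"><span class="product-title">Wireless Gaming Mouse</span></div>
<div class="result"><span class="product-title">  RGB Wired Mouse </span></div>
<div class="result"><span class="price">$29.99</span></div>
"""


def extract_titles(html, title_class="product-title"):
    """Return the stripped text of every element carrying the given class."""
    soup = BeautifulSoup(html, "html.parser")
    elements = soup.find_all(class_=title_class)
    return [el.get_text(strip=True) for el in elements]
```

Calling `extract_titles(SAMPLE_HTML)` returns only the two title strings and skips the price element, because the lookup is keyed on the class attribute rather than on the tag.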

ChatGPT applications in web scraping

1. Generate code for scraping websites

Language models like ChatGPT can help developers generate code snippets in their preferred programming language and library for web scraping tasks.

Keep in mind that the structures and designs of websites may change, which can impact the HTML elements and attributes you’re targeting. In such a scenario, your code may fail to function properly or extract the desired data. You need to monitor and update your scraping code regularly.
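One possible way to notice this kind of breakage early is to treat an empty result as an error instead of as valid output. The sketch below assumes Beautiful Soup is installed; `select_or_fail` and the selectors shown are hypothetical names chosen for illustration.

```python
from bs4 import BeautifulSoup


def select_or_fail(html, css_selector):
    """Return text for all matches, or raise if the selector matches nothing."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select(css_selector)
    if not matches:
        # An empty result may indicate that the page layout has changed.
        raise LookupError(f"No elements matched {css_selector!r}")
    return [m.get_text(strip=True) for m in matches]
```

A scheduled job could call this function and alert you when `LookupError` is raised, so that a silent layout change does not produce empty datasets.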

For example, you can use the prompt below to extract product description data from a specific Amazon product page.

It is important to note that most websites employ anti-scraping measures to prevent web scraping activities. You must ensure that your web scraping practices adhere to ethical standards. Check the website’s terms of service or robots.txt file before scraping any data.
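As a small sketch of the robots.txt check, the standard library's `urllib.robotparser` can evaluate a URL against a site's rules. The rules and URLs below are hypothetical examples; the helper takes the robots.txt content as text so it can be tried without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Against a live site, you would typically call `set_url()` with the site's robots.txt address followed by `read()` instead of `parse()`. Note that robots.txt does not replace reading the terms of service.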

Residential proxies and web unblockers are highly effective for bypassing stringent anti-bot defenses. Unlike datacenter proxies, residential proxies use IP addresses from actual Internet Service Providers (ISPs), which makes them seem more authentic. Web unblockers enhance this with advanced proxy capabilities like JavaScript rendering, headless browsers, and browser fingerprinting techniques. Depending on your requirements, you can use a residential proxy with your web scraping tool, use an unblocker, or combine both.
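As a hedged sketch of the first option, the requests library accepts a `proxies` mapping on each call. The host, port, and credentials are placeholders you would replace with the values from your proxy provider; `build_proxies` and `fetch_via_proxy` are hypothetical helper names.

```python
import requests


def build_proxies(host, port, username, password):
    """Build the proxies mapping that requests expects."""
    proxy_url = f"http://{username}:{password}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxies, timeout=10):
    """Fetch a page through the configured proxy and return its HTML."""
    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Credentials that contain special characters would need to be URL-encoded before being placed in the proxy URL.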

You can integrate an unblocking technology with your web crawler to enhance your web scraping projects. Bright Data’s Web Unlocker empowers businesses and individuals to collect data from web sources ethically and legally while avoiding anti-scraping measures.


1.1 Provide Python instructions for web scraping

ChatGPT offers step-by-step instructions for scraping data from web sources in various programming languages. In this example, we will use the requests library to fetch the content of a webpage and Beautiful Soup to parse and retrieve the desired data.

  1. ChatGPT provides the command to install the required libraries. You can run the following command to install the libraries for Python:
  2. You can use the Python code generated by ChatGPT to import requests and Beautiful Soup.
  3. The requests library lets you send HTTP requests to the target server and handle the responses. To fetch the content of the product page, use the following code, replacing “https://example.com/product-page” with the target web page URL:
  4. After fetching the content of a web page, you need to parse it to extract the desired data. To parse the fetched data using the Beautiful Soup library:

If you scrape an e-commerce website to extract product data, such as product titles, you must inspect the product page to locate the tags and attributes that correspond to the data.

  5. To save or print the scraped data, use the code generated by ChatGPT:
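The code for these steps is not reproduced in this text. A minimal sketch of how the five steps might fit together is shown below; it is an illustration under stated assumptions, not the output ChatGPT produced. The tag `h2`, the class `product-title`, the User-Agent string, and the output file name are all hypothetical and must be replaced with values from the page you inspected.

```python
# Step 1 (run in a terminal): pip install requests beautifulsoup4
import csv

# Step 2: import the libraries.
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Step 3: download the page and return its HTML as text."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_titles(html, tag="h2", class_name="product-title"):
    """Step 4: parse the HTML and collect the matching product titles."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(tag, class_=class_name)]


def save_titles(titles, path):
    """Step 5: write one title per row to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title"])
        writer.writerows([t] for t in titles)


def scrape(url, path="titles.csv"):
    """Chain the steps: fetch, parse, print, and save."""
    titles = parse_titles(fetch_html(url))
    print(titles)
    save_titles(titles, path)
    return titles
```

To run it, call `scrape("https://example.com/product-page")` with the placeholder replaced by your target URL. Keeping fetching and parsing in separate functions makes it possible to try the parsing logic on saved HTML without sending requests.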

2. Clean extracted data

Once you’ve scraped data, it’s essential to clean the text to remove irrelevant elements and stopwords such as “the” and “and”. ChatGPT can provide guidance and suggestions on cleaning and formatting collected data.
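A minimal sketch of such a cleaning step, using only the standard library, might look like this. The stopword list is a tiny illustrative sample, and the regular-expression tag stripping is a simplification that assumes reasonably simple markup.

```python
import re

# A tiny illustrative stopword list; real projects typically use a larger one.
STOPWORDS = {"the", "and", "a", "an", "is", "of", "to", "in"}


def clean_text(raw):
    """Strip HTML tags, lowercase the text, and drop punctuation and stopwords."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    words = re.findall(r"[a-z0-9']+", no_tags.lower())
    return " ".join(w for w in words if w not in STOPWORDS)
```

For example, `clean_text("<p>The battery life is long, and the mouse is light.</p>")` yields `"battery life long mouse light"`.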

Assume you collected a large amount of data and imported it into Excel. However, you realize that the data is disorganized and messy. For instance, the full names are in column B, and you want to separate the first and last names into two different columns. You can request that ChatGPT provide a formula for separating first and last names.

The formula generated by ChatGPT to extract the first name:

The ChatGPT-generated formula to extract the last name:
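The generated formulas are not reproduced in this text. A commonly used Excel pattern, assuming the full name sits in cell B2 with a single space between first and last name, is `=LEFT(B2, FIND(" ", B2) - 1)` for the first name and `=RIGHT(B2, LEN(B2) - FIND(" ", B2))` for the last name. The same split can be sketched in Python; `split_full_name` is a hypothetical helper that splits on the first space.

```python
def split_full_name(full_name):
    """Split on the first space: the text before it is the first name,
    and everything after it is treated as the last name."""
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()
```

Names with middle names or multiple surnames keep everything after the first space together, so `split_full_name("Ana Maria Silva")` returns `("Ana", "Maria Silva")`. Real datasets may need more careful rules.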

3. Process extracted data

3.1 Conduct sentiment analysis

ChatGPT can perform sentiment analysis on scraped data to generate interpretable insights from unstructured text. Assume you scraped social mentions of your brand from a social media platform to understand how your audience perceives it. After you have collected and cleaned the data, you can instruct ChatGPT to analyze the text and label it as negative, neutral, or positive (Figure 4).

Figure 4: Demonstrates the process of analyzing and labeling a sample text document

Here’s an example of how you can instruct ChatGPT to perform sentiment analysis:

“Analyze the sentiment of the text: ‘The battery life is also long’.”

ChatGPT’s response to our query:

Note that the accuracy of sentiment analysis can vary depending on different factors, such as the complexity of the text and context-dependent errors.
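When labeling many scraped texts, the prompt and the handling of the reply can be scripted. The sketch below is an assumption-laden illustration: `ask_model` stands for any callable you supply that sends a prompt to a language model and returns its reply as a string, and the keyword-based `parse_label` is deliberately naive.

```python
LABELS = ("positive", "neutral", "negative")


def build_sentiment_prompt(text):
    """Ask for exactly one of the three labels to simplify parsing."""
    return (
        "Analyze the sentiment of the text and answer with one word: "
        f"positive, neutral, or negative.\nText: '{text}'"
    )


def parse_label(reply):
    """Return the earliest known label mentioned in the reply, else None."""
    lowered = reply.lower()
    found = [(lowered.find(label), label) for label in LABELS if label in lowered]
    return min(found)[1] if found else None


def label_texts(texts, ask_model):
    """Label each text; ask_model is any callable mapping a prompt to a reply."""
    return [(t, parse_label(ask_model(build_sentiment_prompt(t)))) for t in texts]
```

Because the parser only looks for label words, a reply such as "not negative" would be mislabeled; constraining the model to a one-word answer, as the prompt does, reduces that risk but does not remove it.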

3.2 Categorize scraped content

ChatGPT can help categorize scraped data into predefined categories. You first define the categories you want to classify the content into, then ask ChatGPT to assign each item to one of them.

As an example, we want to categorize the following content:

The following is the output for categorizing scraped data with ChatGPT:
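The example content and output are not reproduced in this text. As a hedged sketch of the idea, the prompt can list the allowed categories explicitly, and the reply can be checked against that list before it is stored. The function names, the fallback label, and the category names used in the usage note are hypothetical.

```python
def build_category_prompt(content, categories):
    """List the allowed categories so the model picks only from them."""
    options = ", ".join(categories)
    return (
        f"Classify the content into exactly one of these categories: {options}.\n"
        f"Reply with the category name only.\nContent: {content}"
    )


def validate_category(reply, categories, fallback="uncategorized"):
    """Map a reply onto a predefined category, ignoring case and whitespace."""
    normalized = reply.strip().strip(".").lower()
    for category in categories:
        if category.lower() == normalized:
            return category
    return fallback
```

With categories such as `["Electronics", "Clothing", "Books"]`, a reply of `" electronics."` maps to `"Electronics"`, while an unexpected reply such as `"Toys"` falls back to `"uncategorized"` so it can be reviewed manually.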


Gulbahar Karatas
Gülbahar is an AIMultiple industry analyst focused on web data collections and applications of web data.

Comments

JayLi
Sep 04, 2023 at 06:28

It’s almost useless. If you are a good coder, you can easily write this code.
I think the better way to extract dynamic or difficult html content, script send html content to chatgpt by api and chatgpt need to return the answer of key content.
If this way work, it will be useful.
Thanks.
