Updated on Jun 16, 2025

Playwright vs Puppeteer in 2025: Scraping & Automation

Playwright and Puppeteer are two of the most widely used open-source tools for controlling headless browsers. The main difference between them lies in cross-browser support and feature richness.

Playwright supports multiple browser engines, whereas Puppeteer is primarily focused on Chromium-based browsers and offers a more straightforward experience.

How to build a price scraper with Playwright and an LLM

Instead of relying solely on Playwright for web scraping, we integrated an agent system powered by a large language model (LLM) via OpenRouter. This setup enabled us to send the HTML and visual content of a webpage to the LLM, which then intelligently interpreted the necessary action, such as identifying and clicking the correct button.

The target Amazon page includes elements like:

  • Product title
  • Price
  • Add to cart button
  • Related products

Step 1: Setting up your environment

We will:

  • Import all the necessary libraries
  • Configure logging
  • Set up a connection to an LLM (via OpenRouter)

1.1 Import required libraries

We need to import the right libraries, configure logging, and connect to an LLM provider.

import os
import json
import time
import requests
import base64
import csv
from openai import OpenAI as OpenRouterOpenAI
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError, Error as PlaywrightError
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import BaseTool
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from typing import Type, Any
from pydantic import BaseModel, Field
import re
import logging
  • requests: HTTP client used for uploading files or calling external APIs.
  • base64: Encodes the HTML and screenshots as text so they can be embedded in the LLM prompt.
  • OpenAI as OpenRouterOpenAI: The OpenAI client pointed at OpenRouter, a gateway for OpenAI-compatible models.
  • playwright.sync_api: Provides a synchronous interface for controlling a browser with Playwright.

1.2 Set up Logging

This sets up logging so we can monitor what the script is doing.

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Simple run metrics; ask_selector_llm() below appends per-call timings here
metrics = {"llm_calls": []}

1.3 Connect to your LLM provider (OpenRouter)

selector_llm_client = OpenRouterOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY", "sk-or-v1-...")
)

Step 2: Configure the language model and browser connection

We will:

  • Set up a browser connection via a WebSocket endpoint
  • Define timeouts and retry logic

2.1 Set your language models

Both models are accessed through OpenRouter.

PRIMARY_MODEL = "openai/gpt-4o-mini"
FALLBACK_MODEL = "anthropic/claude-3-haiku-20240307"

2.2 Connect to a remote browser

We route traffic through a proxy provider; here, the WebSocket endpoint points to Bright Data's Scraping Browser.

DRIVER_URL = os.getenv(
    "DRIVER_URL",
    "wss://brd-customer-...@brd.superproxy.io:9222"
)

2.3 Set timeout values and retry logic

BROWSER_CONNECT_TIMEOUT = 60000      # 60 seconds
NAVIGATION_TIMEOUT = 60000           # 60 seconds
INTERACTION_TIMEOUT = 10000          # 10 seconds
PAGE_LOAD_WAIT = 5                   # seconds
MAX_CDP_RETRIES = 3
CDP_RETRY_DELAY = 5                  # seconds
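
The CDP retry constants above can drive a small connect helper. Below is a minimal sketch of how the full script might wire them up; connect_with_retries is an illustrative name, not part of the original code:

def connect_with_retries(playwright):
    """Connect to the remote browser over CDP, retrying on transient errors."""
    for attempt in range(1, MAX_CDP_RETRIES + 1):
        try:
            return playwright.chromium.connect_over_cdp(
                DRIVER_URL, timeout=BROWSER_CONNECT_TIMEOUT
            )
        except PlaywrightError as e:
            logger.warning(f"CDP connect attempt {attempt}/{MAX_CDP_RETRIES} failed: {e}")
            if attempt == MAX_CDP_RETRIES:
                raise
            time.sleep(CDP_RETRY_DELAY)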

2.4 Helper: Ask the LLM for page selectors

This sends a prompt, along with optional HTML and a screenshot, to the large language model (LLM).

def ask_selector_llm(prompt: str, html_path: str = None, screenshot_path: str = None, model: str = PRIMARY_MODEL) -> str:
    """Ask LLM for selectors with HTML and screenshot context."""
    start_time = time.time()
    file_urls = []

    if html_path and os.path.exists(html_path):
        try:
            file_urls.append(f"HTML_BASE64: {base64.b64encode(open(html_path,'rb').read()).decode()}")
        except Exception as e:
            logger.warning(f"HTML processing failed: {e}")

    if screenshot_path and os.path.exists(screenshot_path):
        try:
            file_urls.append(f"SCREENSHOT_BASE64: {base64.b64encode(open(screenshot_path,'rb').read()).decode()}")
        except Exception as e:
            logger.warning(f"Screenshot processing failed: {e}")

    full_prompt = f"{prompt}\n\nFiles:\n" + "\n".join(file_urls)
    max_prompt_length = 30000
    if len(full_prompt) > max_prompt_length:
        logger.warning(f"Prompt too long ({len(full_prompt)} chars), truncating.")
        full_prompt = full_prompt[:max_prompt_length] + "\n...[TRUNCATED]"

    try:
        resp = selector_llm_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are an expert at extracting CSS selectors or XPaths from provided HTML and screenshots. Respond ONLY with a JSON object containing the requested selectors. Do not add explanatory text."},
                {"role": "user", "content": full_prompt}
            ],
            temperature=0,
            response_format={"type": "json_object"}
        )
        content = resp.choices[0].message.content.strip()

        try:
            json.loads(content)  # Confirm it's valid JSON
            duration = time.time() - start_time
            logger.info(f"Selector LLM call successful. Model: {model}, Duration: {duration:.2f}s")
            metrics["llm_calls"].append({"step": prompt[:50], "duration_s": duration})
            return content

        except json.JSONDecodeError as e:
            logger.warning(f"LLM response not valid JSON. Model: {model}, Response: {content}, Error: {e}")
            if model == PRIMARY_MODEL:
                logger.info(f"Retrying with fallback model: {FALLBACK_MODEL}")
                return ask_selector_llm(prompt, html_path, screenshot_path, model=FALLBACK_MODEL)
            raise RuntimeError(f"Invalid JSON from fallback model. Response: {content}")

    except Exception as e:
        logger.error(f"LLM call failed. Model: {model}, Error: {e}")
        if model == PRIMARY_MODEL:
            logger.info(f"Retrying with fallback model: {FALLBACK_MODEL}")
            return ask_selector_llm(prompt, html_path, screenshot_path, model=FALLBACK_MODEL)
        raise
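
As a usage sketch, once the page context has been saved to disk, a call asking for product selectors looks like this (the file names and the example selectors in the comment are illustrative, not guaranteed Amazon markup):

# Expects a JSON object back, e.g. {"title": "#productTitle", "price": "..."}
selectors = json.loads(ask_selector_llm(
    "Return a JSON object with CSS selectors for the product title, "
    "price, and add-to-cart button on this Amazon product page.",
    html_path="page.html",
    screenshot_path="screenshot.png"
))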

Step 3: LangChain agent and prompt setup

3.1 Define the agent’s system prompt

This prompt helps the LLM determine what to do, such as finding the price.

prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a web automation assistant. You receive page context and must decide what to click or extract."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")  # required by create_openai_tools_agent
])

3.2 Define a tool

class PageContextTool(BaseTool):
    name: str = "get_page_context"
    description: str = "Captures the current page's HTML and screenshot for LLM review."
    page: Any = None  # the Playwright Page this tool drives
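
The article does not show the tool's body. A minimal _run sketch, consistent with how the tool is called later in the script (it receives a label used in the saved file names and returns a string containing "Error" on failure), could look like this:

    def _run(self, label: str = "context") -> str:
        """Save the page HTML and a full-page screenshot; return their paths."""
        try:
            html_path, shot_path = f"{label}.html", f"{label}.png"
            with open(html_path, "w", encoding="utf-8") as f:
                f.write(self.page.content())
            self.page.screenshot(path=shot_path, full_page=True)
            return f"Saved page context: {html_path}, {shot_path}"
        except PlaywrightError as e:
            return f"Error capturing page context: {e}"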

3.3 Create the agent

This combines your model with your prompt and tool(s).

llm = ChatOpenAI(model=PRIMARY_MODEL, temperature=0,
                 base_url="https://openrouter.ai/api/v1",  # route through OpenRouter
                 api_key=os.getenv("OPENROUTER_API_KEY", "sk-or-v1-..."))

# `page` is created in Step 4.1; the agent is built after the browser connects.
page_tool = PageContextTool(page=page)

agent = create_openai_tools_agent(
    llm=llm,
    tools=[page_tool],
    prompt=prompt_template
)

3.4 Prepare the agent executor

agent_executor = AgentExecutor(agent=agent, tools=[page_tool], verbose=True)

Step 4: Playwright browser automation logic

4.1 Connect to Playwright and a remote browser

The following code launches Playwright in synchronous mode and connects it over CDP to Bright Data's Scraping Browser, a proxy-backed remote browser.

with sync_playwright() as playwright:
    browser = playwright.chromium.connect_over_cdp(DRIVER_URL)
    context = browser.new_context()
    page = context.new_page()
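
From here, navigation happens inside this context. A short sketch using the timeout constants from Step 2.3 (the target URL is a placeholder):

    page.set_default_timeout(INTERACTION_TIMEOUT)
    page.goto("https://www.amazon.com", timeout=NAVIGATION_TIMEOUT, wait_until="domcontentloaded")
    time.sleep(PAGE_LOAD_WAIT)  # give dynamic content a moment to render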

4.2 Capture page context (HTML + screenshot)

These lines capture the full HTML of the current page and save a full-page screenshot.

html_content = page.content()
page.screenshot(path="screenshot.png", full_page=True)
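
Note that page.content() only returns the HTML as a string. To feed it to ask_selector_llm, which reads file paths, write it to disk first (the file name page.html is our choice):

with open("page.html", "w", encoding="utf-8") as f:
    f.write(html_content)  # page.html and screenshot.png can now be passed to ask_selector_llm()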

At this point the agent has programmatically captured a full-page screenshot of the product page.

Steps 5 & 6: Running the Agent with Browser and LLM

In this final section, we will:

  • Set up a loop that controls the flow of automation.
  • Provide the agent with natural language instructions and parsed user intent.

5.1 Parse the user’s search intent

def parse_search_query(query: str) -> dict:
    query = query.lower()
    keywords = query.replace("find me", "").strip()
    max_price = 9999
    brand = ""
    
    if "under" in query:
        price_match = re.search(r'under\s+(\d+)\s+dollar', query)
        if price_match:
            max_price = float(price_match.group(1))
            keywords = query.split("under")[0].replace("find me", "").strip()
    
    if "asus" in query:
        brand = "ASUS"

    return {
        "keywords": keywords,
        "max_price": max_price,
        "brand": brand
    }
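
For example, the query below parses as shown in the comment; it also sets up the variables that the prompt in the next step refers to (the URL and query string are illustrative):

website_url = "https://www.amazon.com"
search_string = "find me an asus laptop under 700 dollars"
query_info = parse_search_query(search_string)
# -> {"keywords": "an asus laptop", "max_price": 700.0, "brand": "ASUS"}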

5.2 Construct the initial prompt

initial_input = (
    f"Start the process. The website is {website_url}. The user's request is: '{search_string}'.\n"
    f"Parsed query: Keywords='{query_info['keywords']}', Max Price=${query_info['max_price']}, Brand='{query_info['brand']}'.\n"
    f"Follow the steps:\n"
    f"1. Navigate to {website_url}.\n"
    f"2. Search for '{query_info['keywords']}' using the search bar.\n"
    f"3. Select a product from the HTML with price <= ${query_info['max_price']} and matching brand '{query_info['brand']}' by parsing the HTML, without applying price filters.\n"
    f"4. Add the product to the cart.\n"
    f"5. Navigate to the cart page.\n"
    f"Use tools methodically. Retry failed product extraction with refreshed page context (max 2 retries). Log all steps and errors."
)

5.3 Loop through agent steps
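
The loop relies on a few bookkeeping variables. A minimal initialization (the exact limits are our assumption, except max_product_retries, which mirrors the "max 2 retries" instruction in the prompt):

max_steps = 10                  # hard cap on agent iterations
max_product_retries = 2         # mirrors the prompt's "max 2 retries"
product_extraction_retries = 0
last_step_context = None
chat_history = []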

for i in range(max_steps):
    logger.info(f"Agent Step {i+1}/{max_steps}")
    current_input = initial_input if i == 0 else "Continue with the next step based on previous actions and observations."

    if last_step_context:
        current_input += f"\nUse the latest page context: {last_step_context}"

    response = agent_executor.invoke({
        "input": current_input,
        "chat_history": chat_history
    })

    chat_history.append(HumanMessage(content=current_input))
    chat_history.append(AIMessage(content=response["output"]))

    logger.info(f"Agent Output: {response['output']}")

    if "No valid product found" in response["output"] and product_extraction_retries < max_product_retries:
        product_extraction_retries += 1
        logger.info(f"Product extraction failed, retrying page context ({product_extraction_retries}/{max_product_retries})")
        context_result = PageContextTool(page=page)._run(f"search_results_retry_{product_extraction_retries}")
        if "Error" not in context_result:
            last_step_context = context_result
            chat_history.append(AIMessage(content=f"Retried page context: {context_result}"))
            continue
        else:
            logger.warning(f"Page context retry {product_extraction_retries} failed: {context_result}")

    product_extraction_retries = 0
    last_step_context = None

    if "cart page" in response["output"].lower() or "navigated to cart" in response["output"].lower():
        logger.info("Agent indicates task completion.")
        break

    if i == max_steps - 1:
        logger.warning("Max steps reached.")

5.4 Chat history for context

chat_history = []

# After each step in the loop:
chat_history.append(HumanMessage(content=current_input))
chat_history.append(AIMessage(content=response["output"]))

5.5 Error handling & retry logic

# Retry logic for product extraction failure
if "No valid product found" in response["output"] and product_extraction_retries < max_product_retries:
    product_extraction_retries += 1
    logger.info(f"Product extraction failed, retrying page context ({product_extraction_retries}/{max_product_retries})")

    context_result = PageContextTool(page=page)._run(f"search_results_retry_{product_extraction_retries}")

    if "Error" not in context_result:
        last_step_context = context_result
        chat_history.append(AIMessage(content=f"Retried page context: {context_result}"))
        continue  # Re-enter the agent loop with updated context
    else:
        logger.warning(f"Page context retry {product_extraction_retries} failed: {context_result}")

Main differences between Playwright and Puppeteer

Playwright and Puppeteer are both open-source Node.js libraries commonly used for web automation tasks and web scraping. Both tools support controlling headless browsers, automation via DevTools, and provide APIs for interacting with pages and elements.

Explore the key differences and similarities between Playwright and Puppeteer:

Updated on Apr 18, 2025

Features | Playwright | Puppeteer
Maintainer | Microsoft | Google (Chrome team)
Browser support | Chromium (Chrome, Edge), Firefox, and WebKit (Safari) | Primarily Chromium; limited Firefox support
Programming languages | JavaScript/TypeScript, Python, Java, C# (official) | JavaScript/TypeScript (official); unofficial wrappers
Cross-browser testing | Built-in; the same scripts run on all supported browsers | Limited (mostly Chromium-focused)
Mobile browser emulation | Native support for Chrome Android & Mobile Safari | Primarily Chrome Android emulation
Community & ecosystem | Rapidly growing but newer | Larger, more mature ecosystem
GitHub statistics (April 2025) | 71.8k stars, 4.1k forks | 90.4k stars, 9.2k forks

What is Puppeteer?

Puppeteer is an open-source Node.js library that provides a user-friendly API to control headless Chrome or Chromium browsers over the DevTools Protocol or WebDriver BiDi.

Puppeteer can automate the testing of Chrome extensions and capture timeline traces for performance diagnosis. Users can also take precise screenshots of entire pages or specific UI components.

Advantages of Puppeteer

  • Since Puppeteer is developed and maintained by Google, the tool quickly integrates the latest Chrome developments.
  • Runs Chrome/Chromium in headless mode by default, which keeps scripts fast and lightweight.
  • Offers full control over Chrome’s features, including clicking buttons, form submission, scrolling, and taking screenshots.
  • For Chrome-only tasks, Puppeteer is slightly faster than Playwright.

Disadvantages of Puppeteer

  • Puppeteer does not support WebKit (Safari), and its Firefox support remains limited.
  • The primary language Puppeteer supports is JavaScript (and TypeScript via typings).
  • Puppeteer is tightly coupled with specific versions of Chromium or Firefox. If you want to test on older browser versions, you need to manage the browser binary manually.

What is Playwright?

Playwright is an open-source, cross-browser automation and testing library developed by Microsoft. The tool enables developers to interact with all major browsers like Chromium (Chrome, Edge), Firefox, and WebKit (Safari).

Playwright allows capturing screenshots of entire pages or specific elements, generating PDFs of pages, and recording videos of test sessions.

Advantages of Playwright

  • Cross-browser and cross-language support: Playwright is compatible with multiple browsers and supports various programming languages, including Python, Java, .NET (C#), JavaScript, and TypeScript.
  • Built-in cross-browser testing: Developers can use the same scripts and tests across all supported browsers, both in visible (headed) and headless modes.
  • Native mobile app testing of Chrome for Android and Mobile Safari: Includes predefined device profiles for common mobile devices.
  • Built-in auto-wait: Auto-wait mechanisms ensure that elements become actionable before interactions occur; a short example follows this list.
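
A minimal illustration of auto-wait using Playwright's sync Python API (the URL and selector are placeholders): page.click() automatically waits for the element to be attached, visible, stable, and enabled before clicking.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    # No explicit wait needed: click() auto-waits until the link is actionable
    page.click("a")
    browser.close()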

Disadvantages of Playwright

  • PDF Generation Limitation: Only supported on headless Chromium. Firefox and WebKit do not currently support PDF generation.
  • Resource-intensive: Launching multiple browsers can consume memory and CPU resources.
  • Less mature ecosystem than Puppeteer's: although Playwright has grown quickly since its initial release in early 2020, its community and third-party ecosystem are still smaller.

Combining Scraping and Automation in One Puppeteer Script

In this example, we will:

  • Navigate to the Real Python blog
  • Extract the titles, excerpts, and URLs of the latest articles

Step 1: Create a folder for your project

mkdir realpython-scraper

Then navigate to the folder

cd realpython-scraper

Step 2: Initialize a New Node.js Project

This will hold your project’s dependencies:

npm init -y

After creating a package.json file, install Puppeteer in the folder by running:

npm install puppeteer

Step 3: Create the Scraping Script

  1. Create a new JavaScript file for the scraping script:
touch realpython-scraper.js
  2. Open the file:
nano realpython-scraper.js
  3. Paste the following code, then save and exit (CTRL + O → Enter, CTRL + X):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  await page.goto('https://realpython.com/', {
    waitUntil: 'domcontentloaded'
  });

  await page.waitForSelector('.card.border-0');

  const articles = await page.$$eval('.card.border-0', cards => {
    return cards
      .filter(card => card.querySelector('h2.card-title')) // filter only articles
      .slice(0, 5)
      .map(card => {
        const title = card.querySelector('h2.card-title')?.innerText.trim();
        const excerpt = card.querySelector('p.card-text')?.innerText.trim();
        const url = card.querySelector('a')?.href;
        return { title, excerpt, url };
      });
  });

  console.log('\n📰 Top 5 Articles on Real Python:\n');
  articles.forEach((a, i) => {
    console.log(`${i + 1}. ${a.title}`);
    console.log(`   ${a.url}`);
    console.log(`   ${a.excerpt}\n`);
  });

  await browser.close();
})();

The script will extract:

  • The titles, excerpts, and URLs of the first five articles

Step 4: Run the Script

node realpython-scraper.js

Expected output: a numbered list of the top five Real Python articles, each with its title, URL, and excerpt.

Troubleshooting

If you run a similar script against a job search engine such as Indeed, two failures commonly appear in the console:

  1. Extracted Job Listings: [] means the script didn't find any job listings on the page.
  2. No element found for selector: #text-input-what indicates the form input for the job search couldn't be found.

How to fix the issue:

  • Job listings scraping issue: The selector used for extracting job titles may be outdated or incorrect. Inspect the page and update the script with the current selector.

Many social media platforms, job search engines like Indeed, and e-commerce sites like Amazon use anti-bot measures to prevent automated requests.

For example, Amazon may serve its error ("dog") page instead of the product page, indicating that it has detected the bot and blocked the request. Puppeteer (particularly in headless mode or with default settings) is relatively easy for target websites to detect.
