
How to Scrape X.com (Twitter) with Python and Playwright

Cem Dilmegani
updated on Jul 4, 2025

We used Python and Playwright to test Twitter (X) data collection methods, focusing on its most prominent page types:

Twitter scraping methodology

We performed all tests without logging in, since our aim was to collect only public data.

Pages to be scraped

  1. User profiles to get bio text, follower numbers, and join date (i.e. date when the user joined X.com)
  2. Hashtag pages to see tweets under a specific hashtag
  3. Search result pages to find tweets based on keywords

Scraping methods

A combination of web automation tools and network configurations was used to manage dynamic content and bypass anti-bot protections:

  1. Python + Playwright (sync API): We used Playwright’s synchronous API, which suits JavaScript-heavy sites like Twitter.
  2. Chromium browser (non-headless mode): Twitter’s anti-bot systems are more likely to block headless sessions.
  3. Proxy configurations: We tested three network setups:
    • No proxy: We accessed the site directly over our own internet connection, without any proxy. This is not a scalable approach.
    • Residential proxy
    • Web unblocker
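The three setups above can be sketched with Playwright's sync API. This is a minimal illustration, not our exact harness; the proxy endpoint and credentials are placeholders.

```python
# Sketch of the three network setups, assuming Playwright is installed
# (pip install playwright && playwright install chromium). The proxy
# endpoint and credentials below are hypothetical placeholders.

def proxy_settings(server=None, username=None, password=None):
    """Build the dict Playwright's `proxy` launch option expects."""
    if server is None:
        return None  # "no proxy": use the machine's own connection
    cfg = {"server": server}
    if username:
        cfg["username"] = username
        cfg["password"] = password
    return cfg

def launch_browser(proxy_cfg=None):
    # Imported lazily so proxy_settings() works even without Playwright.
    from playwright.sync_api import sync_playwright
    pw = sync_playwright().start()
    # headless=False: Twitter blocks headless sessions more aggressively.
    return pw.chromium.launch(headless=False, proxy=proxy_cfg)

# Example configurations (endpoints are placeholders):
no_proxy = proxy_settings()
residential = proxy_settings("http://proxy.example.net:8000", "user", "pass")
```

The same `proxy_settings` dict also works for a web unblocker endpoint, since Playwright treats it as just another proxy server.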

Twitter web scraping results by proxy configuration

While scraping Twitter data, we first tried to extract fields like the bio, follower count, and join date from the page’s raw HTML, using methods like page.content() or inspecting network responses. Some information, such as follower counts, was missing: those parts of the page are rendered later by JavaScript.

To solve this, we used CSS selectors to target the data on the fully loaded (rendered) page. This approach is often necessary when scraping Twitter, but not all elements are equally easy to grab: some change often, and others load slowly.
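A minimal sketch of this wait-then-read approach, using Playwright's sync API. The data-testid selectors are illustrative assumptions based on attributes Twitter has used; they are not stable and may need updating.

```python
# Read fields from the rendered page instead of the raw HTML. The
# selectors are assumptions (Twitter changes them); each field is waited
# on individually so one missing field does not sink the whole scrape.

PROFILE_SELECTORS = {
    "bio": '[data-testid="UserDescription"]',
    "join_date": '[data-testid="UserJoinDate"]',
}

def extract_profile(page, selectors=PROFILE_SELECTORS, timeout=15_000):
    """Wait for each field to hydrate, then read its visible text."""
    data = {}
    for field, css in selectors.items():
        try:
            page.wait_for_selector(css, timeout=timeout)
            data[field] = page.locator(css).first.inner_text()
        except Exception:
            data[field] = None  # field never appeared: record the gap
    return data
```

Recording `None` instead of raising keeps partial results usable, which matters on a site where individual fields hydrate at different speeds.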

Using a web unblocker improved consistency by making the page load and behave much as it would for a real user, reducing errors and making the CSS selectors more reliable.

CSS selector reliability on profile pages

Even if your Twitter scraper gets through the platform’s proxy detection methods, it may fail because the data is no longer where you expect it to be. We tested the reliability of different selectors used to extract common profile fields:

1. Profile page scraping

We extracted four common data fields from Twitter profile pages: bio, follower count, following count, and join date. Below is a summary of how each proxy configuration performed:
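One practical wrinkle with follower and following counts: they render as abbreviated strings ("1.2M", "4,321"). A small hypothetical helper like the one below can normalise them to integers once the text has been pulled from the page.

```python
import re

# Multipliers for Twitter's abbreviated count suffixes.
_SUFFIX = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_count(text):
    """Normalise a rendered count: '1.2M' -> 1200000, '4,321' -> 4321.

    Returns None when no number can be found in the text.
    """
    m = re.search(r"([\d.,]+)\s*([KMB]?)", text.strip(), re.IGNORECASE)
    if not m:
        return None
    number = float(m.group(1).replace(",", ""))
    return int(round(number * _SUFFIX.get(m.group(2).upper(), 1)))
```

Abbreviated counts lose precision ("1.2M" could be anything from 1,150,000 to 1,249,999), which is worth keeping in mind when comparing scraped numbers over time.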

2. Hashtag page scraping

Unlike profile pages, Twitter hashtag feeds are harder to scrape reliably. When accessing these pages without a proxy, we were often redirected or faced CAPTCHA challenges. Residential proxies performed slightly better, allowing the page to begin loading, but they still failed to deliver usable data consistently.

The only method that successfully rendered and accessed the hashtag feed was the unblocker.

*Represents a partial result. The hashtag page began loading, but the tweet data didn’t appear.
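This partial-load case can be detected programmatically. A sketch, assuming the `data-testid="tweet"` selector (an assumption that may change) marks a hydrated tweet:

```python
def hashtag_load_status(page, timeout=20_000):
    """Classify a hashtag page load as 'ok', 'blocked', or 'partial'."""
    try:
        # A node with data-testid="tweet" signals the feed hydrated.
        page.wait_for_selector('[data-testid="tweet"]', timeout=timeout)
        return "ok"
    except Exception:
        # A redirect to login or a challenge page means we were blocked;
        # otherwise the shell loaded but tweet data never appeared.
        if "login" in page.url or "challenge" in page.url:
            return "blocked"
        return "partial"
```

Distinguishing "blocked" from "partial" tells you whether to rotate the proxy or simply to wait longer and retry.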

3. Search page scraping

Search result pages on Twitter are the most difficult to access via web scraping, as they include rate-limiting, SSL certificate checks, and dynamic behavior that blocks web scraping tools.

  • Connections without proxies failed.
  • Residential proxies had slightly better page load performance, but the data rarely became available. Attempts often ended in timeouts or HTTP 502 errors.
  • Even unblocker, which successfully handled other Twitter endpoints, failed here. It triggered certificate errors (like ERR_CERT_AUTHORITY_INVALID) and inconsistent server responses.
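Certificate errors from proxy tools that re-sign TLS traffic can be tolerated at the Playwright context level. This disables certificate validation entirely, so it is only appropriate for traffic routed through a proxy you trust:

```python
def make_tolerant_context(browser):
    """New context that accepts proxy-injected certificates.

    ignore_https_errors turns off TLS certificate validation for the
    context; use it only when a trusted unblocker/proxy is re-signing
    traffic, since it removes protection against real MITM attacks.
    """
    return browser.new_context(ignore_https_errors=True)
```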

Technical challenges in Twitter web scraping

We encountered various recurring errors during our scraping tests, especially when accessing more protected pages like search results. These issues often stemmed from SSL problems, slow-loading pages, or aggressive anti-bot protections on Twitter.

Here is a breakdown of the most common errors and what likely caused them:

  • ERR_CERT_AUTHORITY_INVALID: Points to SSL certificate issues, often caused by misconfigured proxies or by tools like a web unblocker that inject their own certificates.
  • Timeout 30000ms exceeded: A common Playwright error indicating that the page took too long to load. This typically happens due to heavy JavaScript rendering or slow proxy connections that delay full page hydration.
  • 502 Server Error or read timeout: These errors suggest the server blocked or dropped the request, especially when accessing search pages. Twitter may be actively denying access to automated traffic.
  • Event loop is closed!: This is a Playwright or asyncio-related error that usually occurs after a crash, an abrupt disconnection, or an incomplete async response. It often requires resetting the browser context or reinitializing the session.
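A generic way to recover from the errors above is a retry wrapper that disposes of and recreates the browser context between attempts. Here `make_context` and `task` are hypothetical callables standing in for your own setup and scraping logic:

```python
import time

def with_retries(make_context, task, attempts=3, backoff=5.0):
    """Run task(context), recreating the context after each failure."""
    last_err = None
    for attempt in range(attempts):
        ctx = make_context()  # fresh context clears any crashed state
        try:
            return task(ctx)
        except Exception as err:  # timeouts, 502s, closed loops, ...
            last_err = err
            time.sleep(backoff * (attempt + 1))  # linear backoff
        finally:
            ctx.close()  # always dispose, to avoid leaking contexts
    raise last_err
```

Recreating the context (rather than reusing it) is what addresses the "Event loop is closed!" class of failures, at the cost of losing cookies and cache between attempts.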

Twitter web scraping findings

While specific errors were more technical (e.g., timeouts, SSL issues), some challenges were inherent to each Twitter page type’s structure and protection level. After testing multiple proxy setups, we found clear differences in reliability across profile, hashtag, and search pages:

  • Unblocker is the most reliable option for scraping profile pages, especially when retrieving follower/following counts.
  • Hashtag and search result pages remain difficult to scrape with any method tested.
  • Running a visible (headful) browser helps reduce bot detection, but it is not enough.
  • JavaScript rendering and hydration delays (i.e., the time it takes for data to fully appear on the page) significantly affect scraping accuracy.


Cem Dilmegani
Principal Analyst
Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.

Comments 1

Jones
Sep 20, 2023 at 12:10

You cannot access tweets for free using the API. Twitter (X) charges developers at minimum $100/month to use the API to access tweets. The free developer option is limited to posting only, which is not what you’d want to scrape Twitter for anyway.

Cem Dilmegani
Nov 01, 2023 at 17:31

Indeed, we updated that section, thank you for the heads up!