We used Python and Playwright to test Twitter (X) data collection methods, focusing on the most prominent page types:
Page Type | Best Method | Reliability |
---|---|---|
Profile | Unblocker | High |
Hashtag | Unblocker | Medium |
Search | Not found | N/A |
Twitter scraping methodology
We performed all tests without logging in; our aim was to collect only publicly available data.
Pages to be scraped
- User profiles to get bio text, follower numbers, and the join date (the date when the user joined X.com)
- Hashtag pages to see tweets under a specific hashtag
- Search result pages to find tweets based on keywords
Scraping methods
We used a combination of web automation tools and network configurations to manage dynamic content and bypass anti-bot protections (a minimal setup sketch follows this list):
- Python + Playwright (sync API): We used Playwright’s synchronous API, which is well suited to JavaScript-heavy sites like Twitter.
- Chromium browser (non-headless mode): Twitter’s anti-bot systems are more likely to block headless sessions.
- Proxy configurations: We tested three network setups:
- No proxy: We connected directly over our own internet connection without any proxy. This approach does not scale.
- Residential proxy
- Web unblocker
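Below is a minimal sketch of this setup. The proxy endpoint, credentials, and target profile are placeholders (assumptions, not our actual configuration):

```python
from playwright.sync_api import sync_playwright

# Hypothetical proxy endpoint and credentials; substitute your provider's details.
PROXY = {
    "server": "http://proxy.example.com:8000",
    "username": "USER",
    "password": "PASS",
}

with sync_playwright() as p:
    # Headful Chromium: Twitter is more likely to block headless sessions.
    browser = p.chromium.launch(headless=False, proxy=PROXY)
    page = browser.new_page()
    page.goto("https://x.com/XDevelopers", timeout=30_000)
    page.wait_for_load_state("networkidle")
    print(page.title())
    browser.close()
```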
Twitter web scraping results by proxy configuration
Page type | No proxy | Residential | Unblocker |
---|---|---|---|
Profile page | Partial (bio & join date only) | Mostly successful | Successful |
Hashtag page | Fails (empty or blocked) | Partial (loads, no usable data) | Succeeds |
Search page | Fails | Fails (502 / timeout) | Fails (cert/auth errors) |
While scraping Twitter data, we first tried to extract fields like the bio, follower count, and join date from the page’s raw HTML, using methods like page.content() or inspecting network responses. However, some information, such as follower counts, was missing: these parts of the page are loaded later by JavaScript.
To solve this, we used CSS selectors to target the data on the fully loaded (rendered) page. This approach is often necessary when scraping Twitter, but not all elements are equally easy to grab; some change more often or load slowly.
Using a tool like an unblocker improved consistency by ensuring the page loads and behaves as it would for a real user, reducing errors and making the CSS selectors more reliable.
CSS selector reliability on profile pages
Even if your Twitter scraper gets past the platform’s proxy detection methods, it may still fail because the data is no longer where you expect it to be. We tested the reliability of different selectors used to extract common profile fields (a usage sketch follows the table):
Field | CSS Selector | Reliability |
---|---|---|
Bio | div[data-testid="UserDescription"] span | High |
Followers | a[href$="/followers"] span span | Medium-low |
Following | a[href$="/following"] span span | Medium-low |
Joined Date | span[data-testid="UserJoinDate"] | High |
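As an illustration, here is a hedged sketch of reading these fields from the rendered page with Playwright. It assumes page is an already-open Playwright page on a profile URL; the scrape_profile name and timeout are ours, not part of any library:

```python
# Sketch only: `page` is assumed to be an open Playwright page on a profile URL.
def scrape_profile(page):
    # Wait for the bio container; the other fields hydrate around the same time.
    page.wait_for_selector('div[data-testid="UserDescription"]', timeout=30_000)
    return {
        "bio": page.locator('div[data-testid="UserDescription"]').inner_text(),
        "joined": page.locator('span[data-testid="UserJoinDate"]').inner_text(),
        # Nested span selectors are fragile (medium-low reliability above).
        "followers": page.locator('a[href$="/followers"] span span').first.inner_text(),
        "following": page.locator('a[href$="/following"] span span').first.inner_text(),
    }
```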
1. Profile page scraping
We extracted four common data fields from Twitter profile pages: bio, follower count, following count, and join date. Below is a summary of how each proxy configuration performed when reading from the fully rendered page:
Proxy type | Bio | Followers | Following | Joined date |
---|---|---|---|---|
No proxy | ✅ | ✅ | ✅ | ✅ |
Residential | ✅ | ✅ | ✅ | ✅ |
Unblocker | ✅ | ✅ | ✅ | ✅ |
2. Hashtag page scraping
Unlike profile pages, Twitter hashtag feeds are harder to scrape reliably. When accessing these pages without a proxy, we were often redirected or faced CAPTCHA challenges. Residential proxies performed slightly better, allowing the page to begin loading, but they still failed to deliver usable data consistently.
The only method that successfully rendered and accessed the hashtag feed was the unblocker.
Proxy type | Page load | Data access |
---|---|---|
No proxy | ❌ | ❌ |
Residential | ⚠️* | ❌ |
Unblocker | ✅ | ✅ |
*Represents a partial result. The hashtag page began loading, but the tweet data didn’t appear.
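For illustration, here is a hedged sketch of this hashtag test: it waits for tweet cards to hydrate before reading anything. The article[data-testid="tweet"] selector reflects X’s markup at the time of testing and is an assumption that may break:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # add proxy=... as shown earlier
    page = browser.new_page()
    page.goto("https://x.com/hashtag/python", timeout=60_000)
    # Tweet cards load late; wait for at least one before reading the feed.
    page.wait_for_selector('article[data-testid="tweet"]', timeout=30_000)
    print(page.locator('article[data-testid="tweet"]').count(), "tweets rendered")
    browser.close()
```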
3. Search page scraping
Search result pages on Twitter are the most difficult to access via web scraping, as they include rate-limiting, SSL certificate checks, and dynamic behavior that blocks web scraping tools.
- Connections without proxies failed.
- Residential proxies had slightly better page load performance, but the data rarely became available. Attempts often ended in timeouts or HTTP 502 errors.
- Even unblocker, which successfully handled other Twitter endpoints, failed here. It triggered certificate errors (like ERR_CERT_AUTHORITY_INVALID) and inconsistent server responses.
Proxy type | Page load | Data access |
---|---|---|
No proxy | ❌ | ❌ |
Residential | ⚠️ | ❌ |
Unblocker | ❌ | ❌ |
Technical challenges in Twitter web scraping
We encountered various recurring errors during our scraping tests, especially when accessing more protected pages like search results. These issues often stemmed from SSL problems, slow-loading pages, or aggressive anti-bot protections on Twitter.
Here is a breakdown of the most common errors and what likely caused them:
- ERR_CERT_AUTHORITY_INVALID: This error points to SSL certificate issues, often caused by misconfigured proxies or by advanced proxy tools like the unblocker, which inject their own certificates.
- Timeout 30000ms exceeded: A common Playwright error indicating that the page took too long to load. This typically happens due to heavy JavaScript rendering or slow proxy connections that delay full page hydration.
- 502 Server Error or read timeout: These errors suggest the server blocked or dropped the request, especially when accessing search pages. Twitter may be actively denying access to automated traffic.
- Event loop is closed!: This is a Playwright or asyncio-related error that usually occurs after a crash, an abrupt disconnection, or an incomplete async response. It often requires resetting the browser context or reinitializing the session, as sketched below.
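A minimal sketch of the retry-and-reset pattern we fell back on for timeouts and closed-session errors (the function name and attempt count are illustrative):

```python
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

def fetch_with_retries(url, attempts=3):
    with sync_playwright() as p:
        for attempt in range(attempts):
            # Launch a fresh browser per attempt: avoids reusing a crashed context.
            browser = p.chromium.launch(headless=False)
            page = browser.new_page()
            try:
                page.goto(url, timeout=30_000)
                page.wait_for_load_state("networkidle", timeout=30_000)
                return page.content()
            except PlaywrightTimeout:
                print(f"Attempt {attempt + 1} timed out; reinitializing session")
            finally:
                browser.close()  # reset before the next attempt
    return None
```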
Twitter web scraping findings
While specific errors were more technical (e.g., timeouts, SSL issues), some challenges were inherent to each Twitter page type’s structure and protection level. After testing multiple proxy setups, we found clear differences in reliability across profile, hashtag, and search pages:
- Unblocker is the most reliable option for scraping profile pages, especially when retrieving follower/following counts.
- Hashtag and search result pages remain difficult to scrape with any method tested.
- Running a visible (headful) browser helps reduce bot detection, but it is not enough.
- JavaScript rendering and hydration delays (i.e., the time it takes for data to fully appear on the page) significantly affect scraping accuracy.
FAQ about web scraping Twitter
What is Twitter data?
When we think of Twitter, we picture a feed of back-to-back tweets, each with a like count and an author. However, you can get more details from Twitter:
Keywords/hashtags: You can pull a set number of tweets that contain a specific keyword or hashtag, or a combination of them. You can refine the search by limiting tweets to a certain number of likes or a date range, narrowing your data down to a particular event or level of influence.
Tweets: You can pull all the tweets of specified profiles, again with the ability to filter your tweet data down to specific tweets, such as tweets that contain a URL or tweets that were retweeted.
Profiles: You can collect all the information about a Twitter user’s public account. Anything you see on their page, such as their bio, number of followers, or tweets, will be reported in a structured format along with the profile owner.
Is it legal to scrape Twitter data?
Though this is not legal advice, in most jurisdictions it is legal to scrape publicly available data (i.e., anything you can see without logging into the website) from Twitter.
For example, if a user’s profile is private, you can’t scrape, share, or use their data for any purpose, even if you follow them and can access their profile. That said, websites like Twitter generally don’t welcome scraping, since it adds traffic to their servers and reduces the scarcity of their data, so they try to block web scrapers.
Is web scraping better than the Twitter API?
Paid API to retrieve tweets
This API is more expensive than other options: the Pro tier with read access starts at $5,000/month.
The most significant advantage of the API is that, since Twitter supports it, there is no risk of being blocked as long as you pull data within their API guidelines. However, the API has limits on how far back in time you can pull data and how many tweets you can pull per minute. These rules can change from year to year and should be double-checked against Twitter’s most up-to-date guidelines.
Free write-only API for developers
Twitter provides free API access for write-only use cases. You need to register your use case on the Twitter Developer website; if your use case is approved, they will share your API key within a few days.