Crunchbase is protected by Cloudflare’s enterprise-grade anti-bot system, which blocks most automated scrapers. Even advanced tools like Selenium often return 403 errors or endless “Just a moment…” pages.
In this guide, you’ll learn how to scrape Crunchbase with Python: setting up your environment, using a web unlocker to bypass restrictions, and extracting data from Crunchbase search results and company pages.
Crunchbase scraper API benchmark result
The chart shows the daily success rate of the Crunchbase scraper APIs.
For details on how these metrics are collected, see the full Crunchbase scraping benchmark methodology.
How to scrape Crunchbase with Python
In this Python scraping tutorial, we’ll show how to collect Crunchbase data, including company names, descriptions, websites, headquarters, employee counts, funding rounds, and growth metrics.
We used Bright Data Web Unlocker to bypass anti-bot challenges and maintain stable access.
Step 1: Configuration
Start by installing the required Python libraries for web scraping and configuring the web unlocker proxy for Crunchbase.
Company slugs are the unique URL identifiers on Crunchbase (for example, if the page URL is crunchbase.com/organization/anthropic, the slug is anthropic).
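A minimal configuration sketch along these lines (the proxy endpoint, credentials, and slug list below are placeholders, not real values):

```python
# Placeholder web unlocker proxy endpoint -- substitute your own credentials.
UNLOCKER_PROXY = "http://USERNAME:PASSWORD@unlocker.example.com:22225"

# Company slugs: the unique identifier at the end of a Crunchbase organization URL.
COMPANY_SLUGS = ["anthropic", "openai", "databricks"]

BASE_URL = "https://www.crunchbase.com/organization/{slug}"

def build_url(slug: str) -> str:
    """Build a full Crunchbase organization URL from a company slug."""
    return BASE_URL.format(slug=slug)

print(build_url("anthropic"))  # https://www.crunchbase.com/organization/anthropic
```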
Step 2: Making requests through web unlocker
Instead of sending direct requests to Crunchbase, we use the web unlocker API to bypass anti-bot systems and ensure consistent results. This method is ideal for Crunchbase scraping at scale, as it returns clean HTML responses while automatically handling CAPTCHAs and JavaScript rendering delays.
Step 3: Parse HTML content
We parse the HTML returned by Crunchbase using BeautifulSoup, extracting text for structured data extraction. This step is essential for any Python Crunchbase scraper, as it allows us to locate elements such as the company name, description, and website URL.
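A small sketch of the parsing step with BeautifulSoup (the sample snippet is a stand-in for a real Crunchbase response):

```python
from bs4 import BeautifulSoup

def parse_html(html: str) -> BeautifulSoup:
    """Parse raw Crunchbase HTML into a BeautifulSoup tree for extraction."""
    return BeautifulSoup(html, "html.parser")

# Minimal demonstration with a stand-in snippet:
sample = "<html><head><title>Anthropic - Crunchbase</title></head></html>"
soup = parse_html(sample)
print(soup.title.get_text())  # Anthropic - Crunchbase
```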
Step 4: Extract the company name
Here, we extract the company name from the <title> tag on the Crunchbase page. The name appears before the first dash, and we use regex to capture and clean it. This ensures our Crunchbase scraper collects only valid company names, not system titles or placeholders.
Step 5: Extract company description
The meta description tag gives us a standardized company summary. It’s an excellent source of consistent business descriptions for building a company data scraper or an enrichment dataset.
Step 6: Extract the company website URL
This block extracts the company’s official website URL from Crunchbase. Since Crunchbase displays domains as visible link text, we filter out Crunchbase internal links and identify valid company websites.
Step 7: Extract headquarters location
We locate the headquarters city or country by targeting Crunchbase links that match known location URL patterns. Extracting this ensures your Crunchbase data includes location metadata useful for regional analysis or market segmentation.
Step 8: Extract employee count
The Crunchbase data scraper attempts to extract the employee count using the structured tags in Crunchbase. If unavailable in link format, it falls back to searching text spans (e.g., “1001–5000 employees”). This ensures reliable company size data for analytics and segmentation.
Step 9: Extract funding information
This part of the Crunchbase scraping tutorial extracts funding round information (e.g., Series A, Seed, Series F) and total raised capital values.
By targeting structured funding fields, this method enables your Python Crunchbase scraper to gather accurate startup investment data for trend and growth analysis.
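A text-based sketch of the funding extraction; both regex patterns are assumptions about how Crunchbase renders round labels and amounts:

```python
import re

ROUND_RE = re.compile(r"\b(Pre-Seed|Seed|Angel|Series [A-K])\b")
AMOUNT_RE = re.compile(r"\$\d+(?:\.\d+)?\s*[KMB]\b")

def extract_funding(text: str) -> dict:
    """Scan page text for a funding round label and a raised amount."""
    round_m = ROUND_RE.search(text)
    amount_m = AMOUNT_RE.search(text)
    return {
        "last_round": round_m.group(0) if round_m else None,
        "total_raised": amount_m.group(0) if amount_m else None,
    }

print(extract_funding("Raised a Series F totaling $3.5B"))
# {'last_round': 'Series F', 'total_raised': '$3.5B'}
```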
Step 10: Extract growth and heat scores
We extract growth and heat scores to measure company momentum. Because Crunchbase doesn’t always provide a consistent HTML structure for these values, the Crunchbase scraper uses regex to detect them directly from text. These metrics are beneficial for AI-based company ranking and startup growth prediction models.
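A sketch of that regex-on-text approach; the exact label wording is an assumption:

```python
import re

SCORE_RE = re.compile(r"(Growth|Heat)\s+Score\D{0,10}(\d+)", re.I)

def extract_scores(text: str) -> dict:
    """Match the score labels directly in page text, since the surrounding
    HTML structure is not stable."""
    return {
        label.lower() + "_score": int(value)
        for label, value in SCORE_RE.findall(text)
    }

print(extract_scores("Heat Score: 87 | Growth Score: 92"))
# {'heat_score': 87, 'growth_score': 92}
```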
Step 11: Build results and save output
Finally, we structure all Crunchbase company data, including name, description, funding, size, and scores, into a dictionary, add a small delay between requests (for safe scraping), and save the output as crunchbase_data.json.
This ensures your Crunchbase data extraction pipeline produces clean, structured results ready for analysis, dashboards, or integration into data pipelines.
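The assembly-and-save step can be sketched as follows; the `None` field values are placeholders for the results of the extraction steps above, which require fetched pages:

```python
import json
import time

def save_results(results: list, path: str = "crunchbase_data.json") -> None:
    """Write all collected records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

results = []
for slug in ["anthropic"]:  # normally the full slug list
    record = {
        "slug": slug,
        "name": None,          # filled in by the extraction steps above
        "description": None,
        "website": None,
        "headquarters": None,
        "employee_count": None,
        "funding": None,
        "scores": None,
    }
    results.append(record)
    time.sleep(1)  # small delay between requests for safe scraping
save_results(results)
```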
Example output
This output demonstrates how the Python Crunchbase scraper structures and exports data.
Each entry includes a company’s name, description, funding, location, employee size, and performance scores, all formatted as JSON for easy integration into analytics tools or databases.
Why Crunchbase scraping is challenging
We tried multiple methods before finding a reliable approach that worked for Crunchbase. Each conventional method failed due to Cloudflare’s advanced anti-bot system. Crunchbase’s protection doesn’t rely on simple IP checks. Cloudflare performs deep browser fingerprinting, analyzing dozens of indicators:
- TLS handshake patterns
- JavaScript execution behavior
- Browser API completeness
- Canvas and WebGL fingerprints
- Mouse movement timing and window focus
Even if you use proxies, Cloudflare can still identify your client fingerprint. Regular scraping proxies only hide your IP; they don’t emulate real browser behavior.
Simple HTTP requests didn’t work
We began with Python’s requests library to send straightforward GET requests to Crunchbase URLs. Every attempt returned 403 Forbidden. Crunchbase’s servers immediately detected the bot signature and refused to serve any content.
Adding browser headers still failed
Next, we tried adding User-Agent strings, Accept headers, and other browser-like metadata to mimic legitimate browser behavior. We tested multiple profiles and combinations, yet every request was blocked. Cloudflare’s system caught them all instantly.
Selenium with Chrome got stuck on Cloudflare
We escalated to Selenium, thinking that automating a real Chrome browser would solve the issue. Instead, we hit Cloudflare’s “Just a moment…” challenge page every time. The loading spinner ran indefinitely, and even when we occasionally passed through, we faced CAPTCHAs that couldn’t be solved programmatically.
Undetected ChromeDriver was unstable
We then tested Undetected-ChromeDriver, which patches Selenium to make it appear more human-like. While it worked briefly, we ran into browser compatibility problems and intermittent Cloudflare challenges. Some pages loaded successfully, but the next ones were blocked without any clear pattern, which made it far too unreliable for production use.
The working solution: Web unlockers
After testing several methods, we found that a web unlocker was the only reliable option for consistent, scalable Crunchbase scraping. Web unlockers solve this problem by running real browsers in the cloud, complete with full fingerprinting, JavaScript execution, and CAPTCHA solving. They:
- Rotate residential IPs automatically
- Randomize browser fingerprints
- Execute full browser rendering (JavaScript, cookies, dynamic content)
- Solve CAPTCHA and Cloudflare challenges in real time
Unlike proxies that only change your network location, web unlockers replicate the behavior of a genuine human user, which is what Cloudflare expects.
💡 Conclusion
Scraping Crunchbase is far from a beginner-level task. The site’s Cloudflare protection effectively blocks almost every standard Python web scraper, including those using advanced libraries like Selenium or Playwright.
By leveraging an unlocker solution, you can overcome these defenses responsibly, maintain stability, respect rate limits, and retrieve clean, structured data.
If you need accurate, scalable Crunchbase data extraction, use a web unlocker-based Python scraper. It’s the most reliable, ethical, and production-grade approach to building your company’s data intelligence pipeline.
Crunchbase scraping benchmark methodology
This benchmark measures the performance of Crunchbase company page scraping: request success, response time, and reliability under consistent conditions.
- Target URLs: 100 Crunchbase company pages (crunchbase.com/organization…)
- Request interval: every 15 minutes
- Timeout limit: 60 seconds
- Evaluation frequency: daily
Each request uses the same configuration to allow direct comparison between runs.
Success criteria:
A request is counted as successful if:
- The HTTP status code is between 200 and 399, and
- The response contains valid Crunchbase company data detected by predefined CSS selectors or content byte checks.
Empty or malformed responses are marked as failures.
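The success criteria above can be sketched as a small check function; the byte marker and size threshold here are illustrative stand-ins for the benchmark's real CSS-selector and content-byte validation:

```python
def is_success(status_code: int, body: bytes) -> bool:
    """Mirror the benchmark's success criteria: a 2xx/3xx status AND a body
    that passes a content check (a simple byte marker stands in here for
    the real CSS-selector validation)."""
    if not (200 <= status_code < 400):
        return False
    if len(body) < 1024:                  # empty or truncated responses fail
        return False
    return b"crunchbase" in body.lower()  # assumed content marker

print(is_success(403, b"Forbidden"))  # False
```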
Error classification:
- Timeouts: >60s, marked failed
- Network errors: logged with details
- Decoding errors: parsing failure
- Empty or malformed responses: missing content
Daily data collection:
At day’s end, results are aggregated to compute the final metrics, which quantify the reliability and performance of Crunchbase scraping:
- Daily success rate
- Average response time
- Error distribution