Python and JavaScript are the two most popular languages for web scraping. In this guide, we’ll not only compare Python and JavaScript for web scraping but also walk through a complete tutorial for each language, from setup to data extraction and saving the results.
You’ll see how Python uses libraries like BeautifulSoup and Selenium with a two-stage parsing approach, while JavaScript with Puppeteer works directly in the browser context using its native async/await model.
Python vs JavaScript: Which Should You Use for Web Scraping?
Pros and cons of Python for web scraping
Pros:
- Modular: import only the libraries you need (time, re, BeautifulSoup, selenium).
- Mature ecosystem with specialized scraping tools.
- Synchronous code is straightforward; no async/await needed.
- Powerful list comprehensions and regex support.
- Easy file handling with open() and direct json.dump.
Cons:
- Requires multiple libraries and setup steps (Service, Options, WebDriver).
- Two-stage parsing: fetch page_source, then parse with BeautifulSoup (slower).
- Higher memory usage (~400–500MB).
- Regex is more verbose (needs re module).
- Async support is available but feels less natural than in JavaScript.
Pros and cons of JavaScript for web scraping
Pros:
- Puppeteer is an all-in-one solution: launch a browser, load a page, and extract data.
- Built-in async/await model makes I/O non-blocking.
- Direct DOM access with page.evaluate() (faster than Python’s two-stage approach).
- Regex is concise with inline /pattern/i literals.
- Functional array methods (filter, map, forEach) are expressive.
- Lower memory usage (~250–350MB).
Cons:
- Async/await is mandatory and adds complexity for beginners.
- Requires explicit .catch() for unhandled promise rejections.
- File saving is a two-step process (JSON.stringify + fs.writeFile).
- More verbose for error handling inside try/catch.
Web scraping in JavaScript vs. Python: step-by-step tutorials
Python scraping tutorial: Setup and example
To get started with Python web scraping, we’ll use a combination of libraries. This first section covers the setup of libraries and basic configurations needed before running the scraper.
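Before writing any code, the libraries used in this tutorial need to be installed. A typical setup (package names inferred from the imports shown below) looks like this:

pip install beautifulsoup4 selenium webdriver-manager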
Section 1: Libraries and setup
import time
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import json
def scrape_craigslist_apartments(url="https://newyork.craigslist.org/search/apa"):
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.add_argument('--window-size=1920,1080')
    options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
    driver = None
    listings = []
- time: used for adding waits.
- re: provides regex support for text parsing.
- BeautifulSoup: helps parse HTML documents.
- selenium: automates the browser to interact with websites dynamically.
- webdriver_manager: automatically downloads and manages the correct ChromeDriver version.
- Options: sets Chrome’s startup options. Here, we run Chrome in headless mode (no visible browser window), disable GPU, and set window size.
- User-Agent string: helps avoid detection by making the scraper behave like a real browser.
- driver and listings start out empty: driver will hold the browser instance and listings will collect the scraped apartments. (A quick sanity-check script follows this list.)
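To confirm Selenium and ChromeDriver are wired up correctly before running the full scraper, a minimal standalone sketch (separate from the tutorial code, the target URL is just a placeholder) could open a page and print its title:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless')

# Launch headless Chrome and confirm a page can be loaded
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://example.com")
print(driver.title)  # Expected output: "Example Domain"
driver.quit()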
Section 2: Browser launch and page loading
    try:
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=options)
        driver.get(url)
        time.sleep(8)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        results = soup.find_all('div', class_='cl-search-result')
- Service(ChromeDriverManager().install()): automatically downloads the correct ChromeDriver and creates a service object.
- webdriver.Chrome(service=service, options=options): initializes the browser instance with the settings from Section 1.
- driver.get(url): opens the target Craigslist URL.
- time.sleep(8): adds a delay to allow JavaScript content on the page to fully render before scraping begins (a more robust explicit-wait alternative is sketched after this list).
- driver.page_source: retrieves the entire HTML of the loaded page.
- BeautifulSoup(driver.page_source, 'html.parser'): parses the HTML into a structure BeautifulSoup can query.
- soup.find_all('div', class_='cl-search-result'): extracts all the apartment listings from Craigslist, identified by the 'cl-search-result' class.
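A fixed time.sleep(8) wastes time on fast loads and may be too short on slow ones. As a hedged alternative (not part of the tutorial code above), Selenium's explicit waits can poll for the listings container instead:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for at least one listing container to appear,
# instead of sleeping for a fixed 8 seconds.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.cl-search-result'))
)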
Section 3: Data extraction and processing
        for item in results:
            try:
                listing = {
                    'title': 'N/A',
                    'price': 'N/A',
                    'bedrooms': 'N/A',
                    'area': 'N/A',
                    'location': 'N/A',
                    'date': 'N/A',
                    'link': 'N/A'
                }
                title_elem = (
                    item.find('a', class_='posting-title') or
                    item.find('a', class_='cl-search-anchor')
                )
                if title_elem:
                    listing['title'] = title_elem.get_text(strip=True)
                    if title_elem.has_attr('href'):
                        href = title_elem['href']
                        listing['link'] = href if href.startswith('http') else f"https://newyork.craigslist.org{href}"
                price_elem = (
                    item.find('div', class_='price') or
                    item.find('span', class_='priceinfo')
                )
                if price_elem:
                    listing['price'] = price_elem.get_text(strip=True)
                meta_elem = item.find('div', class_='meta')
                if meta_elem:
                    meta_text = meta_elem.get_text(strip=True)
                    time_match = re.search(r'(\d+\s*(?:mins?|h|hours?|days?)\s*ago)', meta_text, re.IGNORECASE)
                    if time_match:
                        listing['date'] = time_match.group(1)
                        meta_text = meta_text.replace(time_match.group(1), '', 1)
                    bed_match = re.search(r'(\d+br)', meta_text, re.IGNORECASE)
                    if bed_match:
                        listing['bedrooms'] = bed_match.group(1)
                        meta_text = meta_text.replace(bed_match.group(1), '', 1)
                    area_match = re.search(r'(\d+ft2)', meta_text, re.IGNORECASE)
                    if area_match:
                        listing['area'] = area_match.group(1)
                        meta_text = meta_text.replace(area_match.group(1), '', 1)
                    location = meta_text.strip()
                    if location:
                        listing['location'] = location
                listings.append(listing)
            except Exception as e:
                continue
- The scraper loops through each result and creates a listing dictionary with default values (‘N/A’).
- Title: Looks for anchors with either ‘posting-title’ or ‘cl-search-anchor’. If found, extracts text using .get_text(strip=True). If the link is relative, it appends the Craigslist domain.
- Price: Extracted from <div class="price"> or <span class="priceinfo">.
- Meta information: Comes from the ‘meta’ div. This block can contain information such as date, bedroom count, area, and location.
- re.search() is used to extract patterns such as the posting time ("2 hours ago"), bedroom count ("2br"), and area ("800ft2"); a standalone example of this parsing follows the list.
- Each match is removed from meta_text, so the remaining string becomes the location.
- Finally, each processed listing is appended to the list.
- If an error occurs, the script skips to the next item using the continue statement.
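To make the meta-parsing step concrete, here is a small self-contained sketch that runs the same regular expressions against a made-up meta string (the sample text is hypothetical, not real Craigslist output):

import re

meta_text = "2 hours ago2br800ft2Brooklyn"  # hypothetical concatenated meta text

time_match = re.search(r'(\d+\s*(?:mins?|h|hours?|days?)\s*ago)', meta_text, re.IGNORECASE)
bed_match = re.search(r'(\d+br)', meta_text, re.IGNORECASE)
area_match = re.search(r'(\d+ft2)', meta_text, re.IGNORECASE)

print(time_match.group(1))  # 2 hours ago
print(bed_match.group(1))   # 2br
print(area_match.group(1))  # 800ft2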
Section 4: Cleanup and return
    except Exception as e:
        pass
    finally:
        if driver:
            driver.quit()
    return listings

def save_to_json(data, filename='apartments.json'):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

def print_summary(listings, count=5):
    valid = [l for l in listings if l['title'] != 'N/A']
    for i, listing in enumerate(valid[:count], 1):
        print(f"{i}. {listing['title']} - {listing['price']} ({listing['location']})")

if __name__ == "__main__":
    data = scrape_craigslist_apartments()
    if data:
        save_to_json(data)
        print_summary(data, count=5)
- finally block: always runs, ensuring the browser is properly closed using driver.quit() to free system resources.
- save_to_json(): saves the scraped data into a JSON file (apartments.json).
- Uses with open() to handle file writing safely.
- json.dump() writes the dictionary to a file.
- indent=2 makes the JSON human-readable.
- ensure_ascii=False keeps Unicode characters intact.
- print_summary(): filters out listings without a title and prints a one-line summary for each of the first 5 results.
- if __name__ == "__main__": ensures the script only runs when executed directly (not when imported as a module). A short usage sketch for the saved JSON follows this list.
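Once apartments.json has been written, the results can be reused from any other Python script. A minimal sketch (assuming the file sits in the working directory):

import json

# Load the scraped listings back for further analysis
with open('apartments.json', encoding='utf-8') as f:
    apartments = json.load(f)

# e.g., count listings that include a price
priced = [apt for apt in apartments if apt['price'] != 'N/A']
print(f"{len(priced)} of {len(apartments)} listings have a price")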
JavaScript web scraping tutorial: Setup and example
We’ll use Puppeteer (a Node.js library for controlling headless Chrome) and fs.promises (for handling file operations asynchronously).
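Assuming Node.js and npm are already installed, Puppeteer is the only package that needs to be added (fs.promises ships with Node.js):

npm install puppeteer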
Section 1: Libraries and setup
const puppeteer = require('puppeteer');
const fs = require('fs').promises;

async function scrapeCraigslistApartments(url = 'https://newyork.craigslist.org/search/apa') {
  let browser = null;
  const listings = [];
  try {
    browser = await puppeteer.launch({
      headless: 'new',
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-gpu',
        '--window-size=1920,1080'
      ]
    });
    const page = await browser.newPage();
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)');
- puppeteer: controls a headless browser, letting us scrape dynamic web pages.
- fs.promises: provides async file system methods, making it easier to save results later.
- The function is marked async so we can use await inside.
- puppeteer.launch(): starts the browser in headless mode with specific flags for stability.
- browser.newPage(): opens a new browser tab.
- page.setUserAgent() sets a custom user agent string, so the scraper appears to be a real browser.
Section 2: Page loading and waiting
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
    await page.waitForSelector('div.cl-search-result', { timeout: 15000 });
    await new Promise(resolve => setTimeout(resolve, 3000));
- page.goto(url, …): navigates to the Craigslist page.
- waitUntil: 'domcontentloaded': continues once the initial HTML document has been loaded and parsed, without waiting for every image or stylesheet.
- timeout: 30000: sets a maximum wait time of 30 seconds.
- page.waitForSelector('div.cl-search-result'): waits until the specified CSS selector appears (in this case, the container for listings). This ensures the web scraper only proceeds once results are visible.
- new Promise(resolve => setTimeout(resolve, 3000)): adds a manual 3-second delay so JavaScript-rendered elements have time to finish loading. An alternative wait strategy is sketched after this list.
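As a hedged alternative to the fixed 3-second delay (not part of the tutorial code above), Puppeteer can wait for network activity to settle before continuing:

// Inside the same try block, instead of the fixed delay:
await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
await page.waitForSelector('div.cl-search-result', { timeout: 15000 });
// 'networkidle2' resolves once there have been no more than 2
// network connections for at least 500 ms.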
Section 3: Data extraction (browser context)
    const results = await page.evaluate(() => {
      const items = document.querySelectorAll('div.cl-search-result');
      const data = [];
      items.forEach(item => {
        try {
          const listing = {
            title: 'N/A',
            price: 'N/A',
            bedrooms: 'N/A',
            area: 'N/A',
            location: 'N/A',
            date: 'N/A',
            link: 'N/A'
          };
          const titleElem = item.querySelector('a.posting-title') ||
                            item.querySelector('a.cl-search-anchor');
          if (titleElem) {
            listing.title = titleElem.textContent.trim();
            const href = titleElem.getAttribute('href');
            if (href) {
              listing.link = href.startsWith('http')
                ? href
                : `https://newyork.craigslist.org${href}`;
            }
          }
          const priceElem = item.querySelector('div.price') ||
                            item.querySelector('span.priceinfo');
          if (priceElem) {
            listing.price = priceElem.textContent.trim();
          }
          data.push(listing);
        } catch (e) {
          // skip if error
        }
      });
      return data;
    });
- page.evaluate(): runs the code within the browser context, making the document object directly available.
- document.querySelectorAll('div.cl-search-result'): selects all Craigslist listing elements.
- An empty listing object is created with default values ('N/A').
- querySelector(): fetches specific elements (title, price, etc.).
- The || operator provides a fallback if the first selector matches nothing.
- textContent.trim(): extracts and cleans the text from elements.
- getAttribute('href'): gets the link. If it is relative, the Craigslist domain is prepended.
- Each listing object is pushed into the data array, which is returned at the end.
- Errors are caught with try/catch, so a single bad listing does not stop the scraper. (An example of passing values into page.evaluate() follows this list.)
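One detail worth knowing: code inside page.evaluate() cannot see Node.js variables directly; values must be passed in as serializable arguments. A minimal sketch (the selector variable here is illustrative, not part of the tutorial code):

// Inside the same async function, with `page` already created:
const selector = 'div.cl-search-result';
const count = await page.evaluate(sel => {
  // `sel` arrives serialized from Node.js into the browser context
  return document.querySelectorAll(sel).length;
}, selector);
console.log(`Found ${count} listings`);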
Section 4: Meta parsing and return
          const metaElem = item.querySelector('div.meta');
          if (metaElem) {
            let metaText = metaElem.textContent.trim();
            const timeMatch = metaText.match(/(\d+\s*(?:mins?|h|hours?|days?)\s*ago)/i);
            if (timeMatch) {
              listing.date = timeMatch[1];
              metaText = metaText.replace(timeMatch[1], '');
            }
            const bedMatch = metaText.match(/(\d+br)/i);
            if (bedMatch) {
              listing.bedrooms = bedMatch[1];
              metaText = metaText.replace(bedMatch[1], '');
            }
            const areaMatch = metaText.match(/(\d+ft2)/i);
            if (areaMatch) {
              listing.area = areaMatch[1];
              metaText = metaText.replace(areaMatch[1], '');
            }
            const location = metaText.trim();
            if (location) {
              listing.location = location;
            }
          }
          data.push(listing);
        } catch (error) {}
      });
      return data;
    });
    listings.push(...results);
  } catch (error) {}
  finally {
    if (browser) {
      await browser.close();
    }
  }
  return listings;
}
- Meta parsing happens entirely inside the browser context.
- Regex matching is performed with /pattern/i, where i makes it case-insensitive.
- Collects data points such as date, number of bedrooms, and area in sequence.
- Each match is removed from metaText, leaving the location.
- data.push(listing): adds each listing object into the array.
- evaluate() + return data: sends parsed data back from the browser to Node.js.
- listings.push(...results): merges the returned array into the main results list.
- finally { await browser.close() }: ensures the browser always closes to free resources.
- return listings: final step, sending back all scraped apartment data.
Section 5: Helper functions and main
async function saveToJson(data, filename = 'apartments.json') {
  try {
    await fs.writeFile(filename, JSON.stringify(data, null, 2), 'utf-8');
  } catch (error) {
    console.error('Failed to save file:', error);
  }
}

function printSummary(listings, count = 5) {
  const valid = listings.filter(l => l.title !== 'N/A');
  valid.slice(0, count).forEach((listing, index) => {
    console.log(`${index + 1}. ${listing.title} - ${listing.price} (${listing.location})`);
  });
}

async function main() {
  const data = await scrapeCraigslistApartments();
  if (data && data.length > 0) {
    await saveToJson(data);
    printSummary(data, 5);
  }
}

main().catch(error => {
  console.error(error);
  process.exit(1);
});
- saveToJson(): async function that saves scraped results to a JSON file.
- Converts objects to a JSON string using JSON.stringify().
- Writes it asynchronously with await fs.writeFile().
- printSummary(): sync helper that filters valid listings and prints a preview of the first 5.
- main(): async wrapper function that:
- Calls scrapeCraigslistApartments().
- If results exist, saves them and prints a summary.
- main().catch(): ensures any unhandled promise rejection is caught, logged, and the process exits with a non-zero code. (A run command follows this list.)
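Assuming the script is saved as, say, scraper.js (the filename is illustrative), it can be run directly with Node.js and will write apartments.json next to it:

node scraper.js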
Conclusion
Both Python and JavaScript are excellent choices for web scraping, but the best option depends on your specific scraping project’s needs.
- Python is ideal if you want simplicity, modular libraries like BeautifulSoup and Selenium, and integration with data analysis tools.
- JavaScript (Node.js + Puppeteer) is the stronger choice for scraping JavaScript-heavy websites and single-page applications. Its async/await model and direct DOM access make it faster and more efficient for modern, JavaScript-rendered content.