Data has become the fuel of business growth over the last decade, and the internet is the main source of data, with 5 billion users generating billions of data points every second. Analysis of web data can help businesses uncover insights to achieve their business objectives. However, collecting a large volume of data is not easy for businesses, especially for those that still think an “Export to Excel” button (if there is one) and manual processing are the only options to extract data.
Web scraping enables businesses to automate web data collection processes using bots or automated scripts called web crawlers. This article highlights all important aspects of web scraping, including what it is, why it matters, how it works, its applications, the vendor landscape, and a purchase guide for products and services.
What is web scraping?
Web scraping, also called web data collection/extraction, data/screen scraping, or web/data harvesting, and sometimes referred to as web crawling, is the process of extracting data from websites.
The process of scraping a page involves making requests to the page and extracting machine-readable information from it.
Why is web scraping important?
Increasing reliance on analytics and on automation are two big trends among businesses, and web scraping can enable both. In addition, web scraping has numerous applications that can affect all industries. Web scraping enables businesses to:
- automate data collection processes at scale
- unlock web data sources that can add value to their business
- make data-driven decisions
These factors explain the increasing interest in web scraping, as seen on Google Trends above.
How does it work?
A general web scraping process involves a series of steps:
- Identification of target URLs
- If the website to be crawled uses anti-scraping tools, the scraper may need to choose an appropriate proxy server to get a new IP address to send its requests from.
- Making requests to these URLs to get HTML code
- Using locators to identify the location of data in HTML code
- Parsing the data string that contains information
- Converting the scraped data into the desired format
- Transferring the scraped data to the data storage of choice
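The steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production scraper: the `price` class name, the `example-scraper/0.1` user agent, and the sample HTML are all assumptions made for the example, and the proxy step is left out.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class PriceParser(HTMLParser):
    """Locator step: collects the text of elements whose class list contains 'price'."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_target = True

    def handle_endtag(self, tag):
        # Simplification: assumes the target element has no nested tags.
        self._in_target = False

    def handle_data(self, data):
        if self._in_target and data.strip():
            self.values.append(data.strip())


def fetch_html(url, timeout=10):
    """Request step: download the HTML of a target URL."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_prices(html):
    """Parsing step: turn strings such as '$1,250.00' into floats."""
    parser = PriceParser()
    parser.feed(html)
    return [float(v.lstrip("$").replace(",", "")) for v in parser.values]


def to_json(prices):
    """Conversion step: serialize the result before sending it to storage."""
    return json.dumps({"prices": prices})


# Hypothetical page content, so the sketch can run without network access.
SAMPLE_HTML = (
    '<ul><li><span class="price">$19.99</span></li>'
    '<li><span class="price">$1,250.00</span></li></ul>'
)
print(to_json(extract_prices(SAMPLE_HTML)))
```

Against a live site, `fetch_html(url)` would supply the HTML in place of `SAMPLE_HTML`, and the JSON string would be written to the data storage of choice.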
What are web scraping applications?
Common web scraping applications are listed below. If you want an in-depth analysis of web crawling applications, feel free to check our related article.
- Data analytics and data science
- Collecting machine learning training data
- Enrichment of company databases
- Marketing & sales
- Price comparison (especially relevant in e-commerce)
- Fetching product descriptions
- Monitoring companies’ current status & competitors on search engines as part of SEO efforts
- Lead generation
- Website testing
- Monitoring consumer sentiment
- Aggregating news articles about the company
- Collecting financial data
- Market research
What does the web scraping landscape look like?
To be categorized as a web scraping company, the solution provider should enable extracting data from a variety of web sources and exporting the extracted data in different formats. The web scraping landscape is crowded, and there are various options to handle enterprise web scraping tasks:
Open source solutions
Open-source frameworks are making web scraping cheaper and facilitating personal use. The most widely used tools are Scrapy, Selenium, Beautiful Soup and Puppeteer.
Users can automate crawling with libraries such as Selenium. When a page contains a list, there are often more items than what is shown on the screen at first glance, for example on paginated listings or pages with “infinite scroll”. Suppose you are browsing YouTube and cannot find what you would like to watch among the videos listed on the page you are checking. You then need to click the “next” button at the end of the page. Selenium allows users to automate going to these “next” pages and crawl the desired information about every item in the listing. Users can then create a dataset containing information about each item listed on the website. For example, it is possible to create a dataset from the names, IMDb ratings, actors and rankings of the top 250 IMDb movies by crawling IMDb’s top movies list with open source tools such as Scrapy.
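The idea of following “next” pages can be illustrated with a short sketch. Note that it does not use Selenium or Scrapy: it is a simplified stand-in built on Python's standard library, and the page markup (`<li class="item">`, `<a rel="next">`) and the in-memory `FAKE_SITE` are assumptions invented for the example.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collects <li class="item"> texts and the href of the <a rel="next"> link."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.next_url = None
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self._in_item = True
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.items.append(data.strip())


def crawl_listing(start_url, fetch, max_pages=50):
    """Follow "next" links from start_url and return every item found.

    `fetch` is any callable that maps a URL to its HTML.
    """
    items, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guards against pagination loops
        parser = ListingParser()
        parser.feed(fetch(url))
        items.extend(parser.items)
        url = parser.next_url
    return items


# Hypothetical two-page site held in memory so the sketch runs offline.
FAKE_SITE = {
    "/page/1": '<ul><li class="item">Movie A</li><li class="item">Movie B</li></ul>'
               '<a rel="next" href="/page/2">next</a>',
    "/page/2": '<ul><li class="item">Movie C</li></ul>',
}

all_items = crawl_listing("/page/1", FAKE_SITE.__getitem__)
print(all_items)
```

With a browser automation tool, the `fetch` callable would be replaced by code that loads the page and clicks the “next” control, but the loop structure stays the same.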
Although there are various proprietary solutions on the market, they fall into two types:
- Solutions with code interface where a vendor provides an API key to access data
- No code solutions where no programming skills are necessary and tools aim to democratize crawling.
Though it is easy to run a scraper on your own website, it is harder to run it on websites that try to block all bots other than search engine crawlers. As a result, industrial-grade scrapers use a diverse set of IP addresses and digital signatures, acting like a group of users browsing a website rather than an automated bot. This takes significant effort to set up, and companies like Luminati offer it as a managed cloud service. Users can rely on coding or no-code interfaces to build scrapers that run on the infrastructure provided by the SaaS company.
Fully managed web scraping services, also called data-as-a-service (DaaS), are easier for businesses that need data at scale. These projects’ workflow tends to be:
- Client sends the requirements such as sites to be crawled, fields to be extracted, and frequency of crawls.
- Managed service company checks the feasibility and sets up the crawlers.
- The company applies data cleaning best practices, transforms data into the desired format, and sends it to the client.
Using existing software (open or closed source) and programming skills, any company can build competent web scrapers. As long as the business has technical personnel to handle the task, and the scraping task is for a strategic project, building in-house is the best option.
Which web crawler should you use?
The right web crawler tool or service depends on various factors, including the type of project, budget, and technical personnel availability. To summarize the decision tree we provide above, the thought process when choosing a web crawler should be as follows:
- Are you going to use web crawling for personal use?
  - If yes, choose an open-source or community edition tool that will enable you to crawl data without paying for services.
  - If no, are you a tech company working on strategic projects that will differentiate your products/services?
    - If yes, build an in-house team, which will prevent third-party companies from capturing your data.
    - If no, does your budget allow an investment of more than $10,000?
      - If yes, prefer a managed service because scraping systems require significant maintenance. You may not want your internal team focused on maintaining a non-strategic project.
      - If no, do you have programming talent?
        - If yes, go with open-source solutions or low-cost proprietary solutions. Programming your scraper can be more efficient than using no-code solutions, especially for repetitive scraping tasks, since a programmed solution can offer a higher level of automation.
        - If no, go with no-code solutions. Most of these are proprietary but can be purchased with limited budgets.
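The questions above form a simple decision tree, which can be written down as a small function. This is only a sketch of the list as stated: the function name, parameter names, and return strings are illustrative choices, and the $10,000 threshold is taken directly from the text.

```python
def recommend_scraping_option(personal_use, strategic_tech_project,
                              budget_usd, has_programmers):
    """Walk the decision tree from top to bottom and return a recommendation."""
    if personal_use:
        return "open-source or community edition tool"
    if strategic_tech_project:
        return "in-house team"
    if budget_usd > 10_000:
        return "managed service"
    if has_programmers:
        return "open-source or low-cost proprietary solution"
    return "no-code solution"


# Example: a non-tech business with a small budget and no developers.
print(recommend_scraping_option(
    personal_use=False,
    strategic_tech_project=False,
    budget_usd=3_000,
    has_programmers=False,
))
```

Each `if` corresponds to one question in the list, so the order of the checks matters: an earlier "yes" ends the evaluation.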
Is web scraping legal?
In short, the answer seems to be yes if the scraped data is publicly available, scraping does not harm the scraped company, the scraped data does not include personal data, and any republished data is republished with a citation. However, this is not legal advice; please refer to a legal professional for specific advice.
The legality of scraping was in a gray zone for a long time, but now there is more clarity. Personal data privacy regulations such as the EU’s GDPR and California’s CCPA do not stand against web scraping as long as:
- Publicly available data is scraped.
- Personal data is stored securely and in line with best practices.
- Data is not sold or shared with 3rd parties unless it has been agreed with the individual.
For businesses, the Ninth Circuit Court of Appeals in the US ruled that automated scraping of publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA) after LinkedIn’s lawsuit against hiQ. This decision may be reviewed by the US Supreme Court; however, as of December 2020, it was not clear whether the Supreme Court would review the decision or not.
Limitations still apply when using web scraping.
- Data extraction should not cause any damage to data owners.
- Scrapers cannot publish the data without proper citation. That would be unethical and illegal.
When considering the legality of scraping, also bear in mind that every search result that you see on search engines has been scraped by search engines. In addition, hedge funds are reported to be spending billions on scraping to make better investment decisions. So scraping is not a shady practice only followed by small businesses.
Why do website owners want to stop web scraping?
- Web crawlers may degrade the site’s performance. Bots, excluding those from search engines, make up 24% of web traffic according to cybersecurity vendor Imperva.
- Competitors can crawl their pages for insights. For example, crawling lets competitors learn about the site owner’s new customers, partnerships or features.
- Their nonpublic data can also be scraped by competitors creating substitutes or competing services, reducing the demand for their own services.
- Their copyrighted content can be copied without attribution, leading to a loss of revenue for the content generator.
What are the challenges of web scraping?
- Complex website structures: Most web pages are based on HTML, but page structures diverge widely. Therefore, when you need to scrape multiple websites, you need to build one scraper for each website.
- Scraper maintenance can be costly: Websites change their page design all the time. If the location of the data to be scraped changes, the crawler has to be reprogrammed.
- Anti-scraping tools used by websites: Anti-scraping tools enable web developers to manipulate content shown to bots and humans and also restrict bots from scraping the website. Some anti-scraping methods are IP blocking, CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), and honeypot traps.
- Login requirement: Some information you want to extract from the web may require you to log in first. When a website requires login, the scraper needs to save the cookies it receives and send them with subsequent requests, so the website recognizes the crawler as the same user who logged in earlier.
- Slow/unstable load speed: When websites load content slowly or fail to respond, refreshing the page may help, but the scraper may not know how to deal with such a situation unless it is programmed to.
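Two of the challenges above, keeping login cookies and coping with slow or failing responses, can be sketched with Python's standard library. This is a hedged illustration rather than a complete solution: the `username` and `password` form field names are hypothetical and depend on the target site, and the retry counts and delays are arbitrary defaults.

```python
import time
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


def make_session():
    """Return (opener, jar). The opener stores received cookies in jar
    and sends them back on later requests, like a logged-in browser."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, username, password):
    """POST a login form. Field names here are assumptions for the sketch."""
    form = urlencode({"username": username, "password": password}).encode("utf-8")
    with opener.open(login_url, data=form, timeout=10) as response:
        return response.status


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on a network error (OSError covers URLError and
    timeouts), wait with exponential backoff and try again."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)
```

A typical use would be to call `login(opener, ...)` once and then pass `lambda u: opener.open(u, timeout=10).read()` as the `fetch` argument, so every retried request carries the session cookies.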
What are web scraping best practices?
Common web scraping best practices are:
Use proxy servers
Many large website operators use anti-bot tools that need to be bypassed to crawl a large number of HTML pages. Using proxy servers and making requests through different IP addresses can help overcome these obstacles.
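As a rough sketch of what sending requests through different IP addresses can look like, the snippet below rotates over a list of proxies in round-robin order using Python's standard library. The proxy addresses are placeholders from a documentation-only IP range, not working servers, and `opener_cycle` is a name made up for this example.

```python
from itertools import cycle
from urllib.request import ProxyHandler, build_opener

# Placeholder proxies (192.0.2.0/24 is reserved for documentation).
PROXIES = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]


def opener_cycle(proxies):
    """Yield an endless round-robin of (proxy, opener) pairs.
    Each opener routes its HTTP and HTTPS requests through that proxy."""
    for proxy in cycle(proxies):
        yield proxy, build_opener(ProxyHandler({"http": proxy, "https": proxy}))


rotation = opener_cycle(PROXIES)
proxy, opener = next(rotation)
# With real proxies, opener.open(url, timeout=10) would send the request via `proxy`.
print(proxy)
```

Calling `next(rotation)` before each request spreads the traffic evenly over the configured proxies.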
Use dynamic IP
Changing your IP from static to dynamic can also help you avoid being detected as a crawler and getting blocked.
Make the crawling slower
You should limit the frequency of requests to the same website for two reasons:
- It is easier to detect crawlers if they make requests faster than humans.
- A website’s server may not respond if it gets too many requests simultaneously. Scheduling crawl times to start at the website’s off-peak hours and programming the crawler to interact with the page can also help to avoid this issue.
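One simple way to slow a crawler down is to enforce a minimum gap between consecutive requests, with a little randomness so the timing looks less mechanical. The `Throttle` class below is a minimal sketch under that assumption; the 2-second delay and 1-second jitter are arbitrary example values, not recommended settings.

```python
import random
import time


class Throttle:
    """Enforces a minimum, slightly randomized gap between requests."""

    def __init__(self, min_delay=2.0, jitter=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            gap = self.min_delay + random.uniform(0, self.jitter)
            remaining = gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

A crawler would call `throttle.wait()` immediately before each request to the same website; the `clock` and `sleep` parameters exist only so the behavior can be checked without real waiting.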
Comply with GDPR
Under GDPR, it is illegal to scrape the personally identifiable information (PII) of an EU resident unless you have their explicit consent to do so.
Beware of Terms & Conditions
If you are going to scrape data from a website that requires login, you need to agree to its terms & conditions to sign up. Some T&Cs include web scraping policies that explicitly state that you aren’t allowed to scrape any data on the website.
However, even though LinkedIn’s T&C clearly bans scraping, as mentioned above, scraping LinkedIn has been found to be a legal activity so far. We don’t provide legal advice and we can’t exactly clarify the implications of companies’ T&Cs. If you have expertise on this topic, feel free to share.
What does the future hold for web scraping?
Scraping is turning into a cat & mouse game between content owners and content scrapers with both parties spending billions to overcome measures developed by the other party. We expect both parties to use machine learning to build more advanced systems.
Open source is playing a larger role in software development, and this area is no different. As we mentioned before, the popularity of Python is already quite high and still increasing. We expect open source libraries that work with Python, such as Selenium, Scrapy and Beautiful Soup, to shape web crawling processes in the near future.
Along with open source libraries, the interest in AI makes the future of web scraping bright because AI systems heavily rely on data, and automating data collection can facilitate various AI applications trained on public data.
If you still have questions about the web scraping landscape, feel free to check out the sortable list of web scraping vendors or contact us.