Is Web Scraping Legal? The Definitive Guide

Originally published at CrawlNow

It is a data-driven world. Many businesses depend on sourcing and consuming external data, and for some, leveraging publicly available data is the only way to survive and undercut the competition. While web scraping is the key to unlocking access to web data, there is a lot of confusion, and many myths, around its legality and ethics. This article aims to address those and bring clarity to the topic. It also goes over the best practices you should follow, as well as the legal and ethical boundaries you should respect, to get the best out of web scraping while keeping it safe and legal.

Web scraping is a great way to source useful external data for data-driven businesses around the globe. However, there is a lot of confusion about its legality. If you type the question, “Is web scraping legal?” into Google, you’ll find opposing views on the topic, depending on who is answering it. While data scraping companies will try to paint an optimistic picture to get more business, anti-scraping service providers will equate it with data theft to sell their solutions.

The truth is that almost all big companies use web scraping, one way or the other, to collect data about their competitors and markets. They do not see it as unethical for their own use. However, it may irk them when they find others scraping their own websites.

In this article, I will try to take an unbiased view. Things are not always black and white, though, and some situations are open to interpretation. So, I recommend seeking legal advice when in doubt. This article is not intended as legal advice.

Before we look into whether web scraping is legal or illegal, let’s understand what it is.

What Is Web Scraping?

Web scraping is the use of computers and automation to visit pages on one or more websites and extract information from their HTML and JavaScript source code in a format that software applications, e.g. spreadsheets or databases, can read.

The same operation could be performed by humans, but it would be much slower. An example is downloading product attributes for a few thousand items on Amazon.
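To make the definition concrete, here is a minimal sketch of the "extract from HTML into a spreadsheet-friendly format" step, using only Python's standard library. The markup in SAMPLE_HTML, the class names and the ProductParser helper are invented for illustration; real pages are structured differently and usually need a more robust parser.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical product listing markup; real pages will differ.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">39.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of the name and price spans for each product item."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per product
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def html_to_csv(html):
    """Parse the sample markup and return the extracted rows as CSV text."""
    parser = ProductParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return out.getvalue()


print(html_to_csv(SAMPLE_HTML))
```

The CSV text can be opened directly in a spreadsheet or loaded into a database, which is the "readable for software applications" part of the definition.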

So, Is Web Scraping Legal?

So, is it legal to scrape a website, then?

There is no law in the US, or elsewhere, that says web scraping is illegal. So does that mean web scraping is legal? It depends on what data you are scraping and how you are using it.

Web Scraping is simply a tool to automate what humans can otherwise do manually. A tool itself cannot be legal or illegal. It’s the use of the tool that can be legal or illegal.

Data scraping has been in use for a long time. Search engines use bots to discover and index web pages. Price comparison websites use scraping to inform their consumers before they make purchases. You could even scrape your own website for analytics. At the same time, bad actors may use scraping to conduct fraudulent activities such as data theft or DDoS attacks.

Though web scraping is not illegal, it’s a technology you should use with care. There are boundaries that you would want to respect to make sure you don’t get into legal trouble. If you scrape smartly, abiding by the ethical web scraping practices, it’s highly unlikely for it to be held against you even if the websites you are scraping do not like it.

It comes down to three things that decide legality:

What you scrape
How you scrape it
How you use the data you scrape

The following section will help you evaluate your web scraping use case and determine whether it lies in the safe zone or not.

Questions To Ask Yourself Before You Scrape

Asking yourself the following six questions, which reflect generally accepted web scraping ethics, will help you stay compliant.

Are You Scraping Personal Data?

Personal data scraping could be an unsafe area where you need to be extra cautious. Different jurisdictions have different laws governing access and use of personal data. While it might be okay to scrape personal data in some US states, you may get into trouble for doing the same in California. Wherever you are, check your local regulations before you scrape personal data.

Territorial reach matters too. Even if you are located somewhere that permits scraping personal data, the laws of the EU may apply to you if you scrape the data of a person located in the EU, for example. The EU is very particular about the privacy of people within its borders, so you may want to review the General Data Protection Regulation (GDPR) before scraping their information.

Next, you may ask, what is personal data?

According to the California Consumer Privacy Act (CCPA), personal information is the data that can identify or be linked to an individual or household. It includes, but is not limited to, a person’s name, birthday, contact details, IP address, and audio and video recordings.

On the bright side, you won’t typically need to worry about personal data when scraping for price intelligence or competitive analysis.

However, when scraping reviews and social media data, personal data is often a consideration. Usernames, names and profile pictures, among other things, can be categorized as personal data in this case. In such scenarios, there are multiple ways to avoid legal issues. For example, you can anonymize the data by omitting fields such as usernames and email addresses.
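As a rough sketch of the "omit the identifying fields" approach, the snippet below drops a configurable set of fields from each scraped record before it is stored. The field names in PERSONAL_FIELDS and the anonymize helper are assumptions made for this example; which fields count as personal data depends on your data and your jurisdiction.

```python
# Assumed field names for illustration; adjust to whatever your scraper collects.
PERSONAL_FIELDS = {"username", "name", "email", "profile_picture"}


def anonymize(record, personal_fields=PERSONAL_FIELDS):
    """Return a copy of the record without the fields listed as personal."""
    return {key: value for key, value in record.items() if key not in personal_fields}


review = {
    "username": "jane_d",
    "email": "jane@example.com",
    "rating": 4,
    "text": "Works well.",
}
print(anonymize(review))
```

Dropping fields is only a first step. Free-text fields such as the review body can still contain names or contact details, so they may need separate handling.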

When you’re working with CrawlNow, we carefully review your specific use case and work hand in hand with you to make sure you comply with laws related to personal data, including GDPR, CCPA and your local jurisdictions.

Are You Scraping Non-Public Data?

Before scraping a website, you should know what is public data and what is not. Websites generally keep certain data available to the public. As long as you are scraping only the publicly available content, you should generally be safe. However, there are a few other things to keep in mind that are discussed in the following sections.

Non-public data is data that is not accessible to everyone on the web. You will typically need to log in to view it. If the data is only available after you have logged in, it follows that it is not available for public access. If you scrape non-public content, you may be inviting trouble, but it depends on the context.

Facebook, for instance, may allow you to scrape data in certain conditions, but only after “Facebook’s express written permission”.

Are You Scraping Copyrighted Data?

A lot of the content available on the internet is protected by some kind of copyright. Scraping and using copyrighted material irresponsibly may fall under copyright infringement. Music, news, blogs, research papers, movies, images, databases and logos are some potentially copyrightable data. Even when not explicitly declared a “copyright”, every private, original work is automatically copyrighted for the author under the Berne Convention.

However, not all information on the internet is protected by copyright. Some of it is plain fact, and consequently a safe resource for web scrapers. Product names, product descriptions, price data, and numbers of sales or views, which are the core inputs of price intelligence and competitive analysis, are some examples of plain facts.

Images, videos and databases are some of the content types that may come up in web scraping projects. In such cases, it’s important to look at the use case, since you may be able to scrape copyrighted data in certain situations, depending on how you use it.

Aggregators, for example, typically use snippets from different sources and attach a link that directs the viewer to the original source, i.e. the copyright holder. In many situations, you may want to scrape copyrighted data for analysis. In many jurisdictions, this may be considered ethical web scraping. However, scraping copyrighted data and publishing it as your own is undoubtedly illegal.
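The aggregator pattern described above, a short excerpt plus a link back to the source, can be sketched as follows. The make_snippet helper and its max_chars default are hypothetical; the 200-character limit is an arbitrary illustration, not a legal threshold.

```python
def make_snippet(text, source_url, max_chars=200):
    """Return a short excerpt plus a link back to the original source."""
    collapsed = " ".join(text.split())  # normalize whitespace
    if len(collapsed) > max_chars:
        # Cut at the last whole word that fits, then mark the truncation.
        collapsed = collapsed[:max_chars].rsplit(" ", 1)[0] + "..."
    return {"excerpt": collapsed, "source": source_url}


article = "A long article body would go here, scraped from the original publisher."
print(make_snippet(article, "https://example.com/original-article", max_chars=30))
```

Keeping the source URL next to every excerpt makes it straightforward to attribute the content wherever the snippet is displayed.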

Is The Crawling Rate Tolerable?

Web scrapers are preferred over manual data extraction because they can fetch data in mere seconds. Though web scrapers are efficient tools, you should not hit a website’s server with too many requests in a short interval.

Scraping a website aggressively can overload its server and may even crash it if the website has no rate limiting in place. In this case, you damage the website’s functionality and may be held liable under “Trespass to Chattels” law (more on this later).

Some websites specify a “crawl-delay” directive in their robots.txt file (more on this later, also). “Crawl-delay: 10” means that a bot should wait at least 10 seconds between two consecutive requests.

If the crawl-delay directive isn’t specified by the website, 1 request per 10 to 15 seconds is a reasonable crawl rate in most scenarios. As long as you stay within a reasonable crawl rate, you are unlikely to run into legal issues over server load.
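One simple way to enforce such a delay is a small throttle that sleeps until enough time has passed since the previous request. This is a sketch, not a complete crawler: the Throttle class and its clock and sleep parameters are hypothetical names, and the injectable clock exists only so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Ensures at least delay_seconds elapse between consecutive wait() calls."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Typical use (fetch is whatever download function your scraper uses):
#
#     throttle = Throttle(delay_seconds=10)
#     for url in urls:
#         throttle.wait()
#         fetch(url)
```

Calling wait() before every request keeps the scraper at or below one request per delay_seconds, regardless of how quickly each page is processed.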

Are You Abiding By The Terms Of Service?

Websites can attempt to discourage scraping activities by laying down the conditions in their ToS (“Terms of Service” or “Terms of Use”). While websites can put whatever they want in their ToS, the conditions are not always enforceable. The terms may or may not be contractually binding on web scrapers, depending on how they appear on the website.

Agreements can be either browsewrap or clickwrap. Browsewrap agreements are concluded upon visiting the website. However, in many cases, they either appear inconspicuously at the bottom of the screen or within a drop-down menu. In such cases, they are generally not binding by law. However, if the agreement appears as a pop-up window or the website provides a link to the ToS at a noticeable position, they may be enforceable. You’ll better understand the legal theory behind browsewrap agreements by looking at a summary of related court cases.

In contrast, clickwrap agreements are those that require the user to tick a checkbox or click a button. Near the button or checkbox, there will be wording along the lines of, “By clicking, you agree to our Terms and Conditions”. After you take the required action, the Terms and Conditions are legally binding on you and a court may enforce them.

Are You Complying With The robots.txt File?

If you want to use web scraping tools, you should know about robots.txt. Consider it as an instruction manual that the website places for bots.

The “Disallow” directive tells robots which pages the website owner does not want them to visit. For example, “Disallow: /” asks them to stay away from the entire site. The minimum delay between successive requests may also be specified with the “crawl-delay” directive.

It is generally a good idea to check the website’s robots.txt file before scraping it and to respect the directives laid down in it.
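Python's standard library includes a robots.txt parser, so checking these directives takes only a few lines. The sketch below parses an inline sample rather than a live file; the sample rules, the example.com URLs, the "example-bot" user agent and the load_rules helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)
print(rules.can_fetch("example-bot", "https://example.com/products/1"))
print(rules.can_fetch("example-bot", "https://example.com/private/data"))
print(rules.crawl_delay("example-bot"))
```

Against a live site, you would instead call set_url() with the address of the site's robots.txt and then read() to download and parse it before making the same can_fetch() and crawl_delay() checks.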

Legal Precedent

Let’s look at some important laws governing web scraping and some high-profile judgments that are shaping the present and future of the data collection world.

HiQ vs. LinkedIn

The HiQ vs. LinkedIn case has emerged as a landmark for web scrapers. The dispute began when LinkedIn sent a small data analytics company, HiQ Labs, an official letter demanding that it cease all scraping activities on LinkedIn. The letter also stated that LinkedIn had blocked HiQ Labs from accessing public profiles.

Did HiQ back out?

No. HiQ Labs took the case to court, arguing that scraping publicly available data is not illegal and that blocking it gives big companies like LinkedIn the unfair advantage of hoarding public information.

In September 2019, the US Ninth Circuit gave an unprecedented decision in favor of HiQ, stating that collecting publicly available data was not a violation of the CFAA. In June 2021, the Supreme Court granted LinkedIn’s petition for a writ of certiorari and sent the case back to the Ninth Circuit for further consideration. Though the case is still pending, a decision in favor of HiQ could mean a groundbreaking victory for ethical web scraping.

Facebook vs. Power Ventures

“Facebook vs. Power Ventures” is another well-known dispute in the web scraping community. It began in 2009 when Facebook took legal action against Power Ventures for extracting Facebook’s user information and displaying it on its own website. Facebook alleged violations of the CAN-SPAM Act, CFAA, DMCA and UCL, as well as copyright infringement.

What happened next?

Though the court dismissed the other claims, three went forward to a final decision: violations of the CAN-SPAM Act, the CFAA and the California Penal Code. In the end, the decision went in favor of Facebook, and the court ordered Power to pay Facebook a sum of $79,640.50.

Comparing the two cases, “HiQ vs. LinkedIn” and “Facebook vs. Power Ventures”, makes it easier to understand where data scraping may or may not be legal. Facebook controls access to its data by requiring a login and password. When you scrape its user profiles, you scrape behind the login. Is data scraping legal in this case? Power Ventures was sued for it and lost, so draw your own conclusion.

In contrast, LinkedIn’s public profiles are accessible directly through the browser. You don’t need to log in to view these profiles. Is scraping legal here? According to how the case is turning out in court, there’s a good chance it could be.

Computer Fraud and Abuse Act (CFAA)

The CFAA is another important law that might be relevant when considering the legality of your scraping activity. The act says that intentionally accessing a computer system either without authorization or in excess of authorization may be subject to legal action.

So what does that mean to web scrapers?

Though the HiQ vs. LinkedIn case has been sent back to the Ninth Circuit for reconsideration, the court’s preliminary decision suggests that when a server’s data is publicly available, accessing it may not be a violation of the CFAA. But we’ll have to wait for the final decision on the case to know for sure.

Regardless of how the ruling on the HiQ vs. LinkedIn case turns out, the CFAA may still apply to web scraping in cases where non-public data is involved. Websites that keep certain information behind a login may hold you liable under the CFAA for scraping it.

Trespass To Chattels

Everyone knows that trespassing on someone’s property is illegal. Digital trespass can be illegal too. A website is the property of the website’s owner. Trespass To Chattels is a common-law doctrine covering wrongful interference with someone’s personal property, and courts have applied it to digital property such as web servers.

When you enter a website, which is the personal digital property of the website’s owner, you should behave in a responsible manner. If irresponsible behavior when using a website causes any damage to the website’s condition, quality or value, you may be held liable under Trespass To Chattels. For instance, if a high crawling rate crashes the website’s server, the website’s owner may file a lawsuit under “Trespass To Chattels”.

That being said, as long as you scrape a website responsibly, and make sure no damage is inflicted in any way, you shouldn’t have to worry about violating Trespass To Chattels.

Fair Use in the United States

Fair Use is a legal doctrine in the United States that permits scraping and use of copyrighted content in certain situations. Under this law, certain uses, including criticism, research, teaching, and news reporting, of copyrighted material may be considered “fair use”.

However, there are four factors that govern whether a use case falls under fair use:

“Transformative” uses, in which the user adds something new to extend the purpose of the original content, are typically considered fair use. Aggregators that generate lists for competitive purposes are likely to fall under this category.
The nature of the copyrighted material that was used is also a factor. Scraping factual material, such as news articles and technical writing, is more likely to support a claim of fair use than scraping creative work, such as movies or novels.
Scraping a small portion of the copyrighted material is more likely to be considered “fair use” than using a substantial portion of it.
The court also weighs the extent to which the use of copyrighted material damages the market for the original work, if at all, in deciding whether it may be considered “fair use” or not.

Conclusion

So what does it come down to? Is web scraping legal or not? We firmly believe it is. It is nothing more than the automation of work, done otherwise by humans.

You just have to respect certain legal boundaries and best practices. Respect robots.txt, don’t swamp the website with unreasonably high crawl rates, and be extra cautious with copyrightable content and personal data. Seek professional legal advice whenever in doubt.

Generally, partnering with a professional web scraping service makes it easier to follow these principles.

When conducted in a responsible manner, web scraping is a powerful technology for gathering information, and even creating new information, on the internet. From content aggregation and competitive research to creating datasets for training machine learning models, the use cases for web scraping are endless.

Speak to a CrawlNow data expert today to explore new opportunities for using data to fuel growth for your business.
