12 tips on how to think like a web scraping expert

mantisus

Max Bohomolov

Posted on November 4, 2024


Typically, tutorials focus on the technical aspects, on what you can replicate: "Start here, follow this path, and you'll end up here." This is great for learning a particular technology, but it's sometimes difficult to understand why the author decided to do things a certain way or what guides their development process.

One of our community members wrote this blog as a contribution to the Crawlee Blog. If you want to contribute blogs like this to the Crawlee Blog, please reach out to us on our Discord channel.


In this blog, I'll discuss the general rules and principles that guide me when I work on web scraping projects and allow me to achieve great results.

So, let's explore the mindset of a web scraping developer.

1. Choosing a data source for the project

When you start working on a project, you likely have a target site from which you need to extract specific data. Check what possibilities this site or application provides for data extraction. Here are some possible options:

  • Official API - the site may provide a free official API through which you can get all the necessary data. This is the best option. For example, you can consider this approach if you need to extract data from Yelp.
  • Website - in this case, study the website, its structure, and the way the frontend and backend interact.
  • Mobile application - in some cases, there's no website or API at all, or the mobile application provides more data; in that case, don't forget about the man-in-the-middle approach.

If one data source fails, try accessing another available source.

For example, for Yelp, all three options are available, and if the Official API doesn't suit you for some reason, you can try the other two.

2. Check robots.txt and sitemap

I think everyone knows about robots.txt and sitemap one way or another, but I regularly see people simply forgetting about them. If you're hearing about these for the first time, here's a quick explanation:

  • robots is the established name for crawlers in SEO. Usually, this refers to crawlers of major search engines like Google and Bing, or services like Ahrefs and ChatGPT.
  • robots.txt is a file describing the allowed behavior for robots. It includes permitted crawler user-agents, wait time between page scans, patterns of pages forbidden for scanning, and more. These rules are typically based on which pages should be indexed by search engines and which should not.
  • sitemap describes the site structure to make it easier for robots to navigate. It also helps them scan only the content that needs updating, without creating unnecessary load on the site.

Since you're not Google or another popular search engine, the robot rules in robots.txt will likely not be written in your favor. But combined with the sitemap, robots.txt is a good place to study the site structure, the expected interaction with robots, and non-browser user-agents. In some situations, it simplifies data extraction from the site.
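
If you want to quickly check what a site declares for generic crawlers, Python's standard library already covers the basics. Here's a minimal sketch; the URL assumes the conventional /robots.txt location at the site root.

```python
from urllib.robotparser import RobotFileParser

# robots.txt conventionally lives at the site root (assumed here).
rp = RobotFileParser("https://crawlee.dev/robots.txt")
rp.read()

# Is a generic (non-search-engine) user-agent allowed to fetch this path?
print(rp.can_fetch("*", "https://crawlee.dev/blog"))

# Crawl-delay and request-rate directives, if the site declares any.
print(rp.crawl_delay("*"), rp.request_rate("*"))
```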

For example, using the sitemap of the Crawlee website, you can easily get direct links to posts, both for the entire lifespan of the blog and for a specific period. One simple check, and you don't need to implement pagination logic.
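
As a rough illustration, here's a minimal sketch that pulls post URLs out of a sitemap using requests and the standard library. The sitemap URL and the /blog/ path filter are assumptions based on common conventions; check the site's robots.txt for the real sitemap location.

```python
import requests
from xml.etree import ElementTree

# Assumed location - many sites expose a sitemap at /sitemap.xml
# or reference it in robots.txt.
SITEMAP_URL = "https://crawlee.dev/sitemap.xml"

response = requests.get(SITEMAP_URL, timeout=30)
response.raise_for_status()

# Sitemap entries live in the standard sitemap XML namespace.
root = ElementTree.fromstring(response.content)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]

# Keep only blog post links - the path filter is an assumption.
blog_urls = [url for url in urls if "/blog/" in url]
print(blog_urls)
```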

3. Don't neglect site analysis

Thorough site analysis is an important prerequisite for creating an effective web scraper, especially if you're not planning to use browser automation. However, such analysis takes time, sometimes a lot of it.

It's also worth noting that the time spent on analysis and searching for a more optimal crawling solution doesn't always pay off - you might spend hours only to discover that the most obvious approach was the best all along.

Therefore, it's wise to set limits on your initial site analysis. If you don't see a better path within the allocated time, revert to simpler approaches. As you gain more experience, you'll more often be able to tell early on, based on the technologies used on the site, whether it's worth dedicating more time to analysis or not.

Also, in projects where you need to extract data from a site just once, thorough site analysis can sometimes eliminate the need to write scraper code altogether. Here's an example of such a site - https://ricebyrice.com/nl/pages/find-store.

[Screenshot: the Ricebyrice store finder page]

By analyzing it, you'll easily discover that all the data can be obtained with a single request. You simply need to copy this data from your browser into a JSON file, and your task is complete.

[Screenshot: the Ricebyrice response captured in the browser]
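
In practice that boils down to replaying the request you spotted in the Network tab and dumping the response to disk. A minimal sketch; the endpoint below is a placeholder, not the site's real URL - copy the actual request (and any headers it needs) from Dev Tools.

```python
import json

import requests

# Placeholder endpoint - copy the real request URL (and any required
# headers or cookies) from the Network tab in your browser's Dev Tools.
url = "https://ricebyrice.com/<endpoint-from-dev-tools>"

response = requests.get(url, timeout=30)
response.raise_for_status()

# One request, one file - no scraper code to maintain afterwards.
with open("stores.json", "w", encoding="utf-8") as f:
    json.dump(response.json(), f, ensure_ascii=False, indent=2)
```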

4. Maximum interactivity

When analyzing a site, switch sorting options, flip through pages, and interact with various elements of the site while watching the Network tab in your browser's Dev Tools. This will allow you to better understand how the site interacts with the backend, what framework it's built on, and what behavior can be expected from it.

5. Data doesn't appear out of thin air

This is obvious, but it's important to keep in mind while working on a project. If you see some data or request parameters, it means they were obtained somewhere earlier: perhaps in another request, perhaps they were already present on the page, perhaps they were derived with JS from other parameters. But they always come from somewhere.

If you don't understand where the data on the page comes from, or the data used in a request, follow these steps:

  1. Check, in order, all the requests the site made before this point.
  2. Examine their responses, headers, and cookies.
  3. Use your intuition: could this parameter be a timestamp? Could it be another parameter in a modified form?
  4. Does it resemble any standard hashes or encodings? A few quick checks of this kind are sketched right after this list.
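
Here's a small sketch of the kind of quick probes I mean; the parameter value is made up, and the checks are intentionally crude - they just narrow down the hypotheses.

```python
import base64
import hashlib
import re
from datetime import datetime, timezone

mystery = "1730706000"  # made-up value copied from a request parameter

# Could it be a Unix timestamp?
if mystery.isdigit() and 1_000_000_000 < int(mystery) < 2_000_000_000:
    print("Looks like a timestamp:", datetime.fromtimestamp(int(mystery), tz=timezone.utc))

# Could it be Base64-encoded?
try:
    print("Base64 decodes to:", base64.b64decode(mystery, validate=True))
except Exception:
    pass

# Does it look like a standard hex digest (MD5 = 32 chars, SHA-1 = 40, SHA-256 = 64)?
if re.fullmatch(r"[0-9a-f]{32}|[0-9a-f]{40}|[0-9a-f]{64}", mystery):
    print("Looks like a hash digest")

# Or is it a hash of another parameter you already know?
known = "some-other-parameter-value"
print("MD5 of the known parameter:", hashlib.md5(known.encode()).hexdigest())
```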

Practice makes perfect here. As you become familiar with various frameworks and their expected behaviors, and as you encounter a wider range of technologies, you'll find it easier to understand how things work and how data is transferred. This accumulated knowledge will significantly improve your ability to trace and understand data flow in web applications.

6. Data is cached

You may notice that when you open the same page several times, the requests transmitted to the server differ: possibly something was cached and is already stored on your computer. Therefore, it's recommended to analyze the site in incognito mode and to try switching browsers.

This situation is especially relevant for mobile applications, which may store some data in storage on the device. Therefore, when analyzing mobile applications, you may need to clear the cache and storage.

7. Learn more about the framework

If during the analysis you discover that the site uses a framework you haven't encountered before, take some time to learn about it and its features. For example, if you notice a site is built with Next.js, understanding how it handles routing and data fetching could be crucial for your scraping strategy.
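
For instance, pages built with the Next.js pages router usually embed the data used for server-side rendering in a `__NEXT_DATA__` script tag, which is often easier to read than the rendered HTML. A minimal sketch, with an example URL standing in for any Next.js-based page:

```python
import json

import requests
from bs4 import BeautifulSoup

# Example URL - substitute any Next.js-based page you're analyzing.
url = "https://example.com/some-next-js-page"

response = requests.get(url, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

# Next.js (pages router) serializes the rendering data into this script tag.
script = soup.find("script", id="__NEXT_DATA__")
if script:
    next_data = json.loads(script.string)
    # pageProps usually holds the data the page was rendered from.
    print(next_data.get("props", {}).get("pageProps"))
```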

You can learn about these frameworks through official documentation or by using LLMs like ChatGPT or Claude. These AI assistants are excellent at explaining framework-specific concepts. Here's an example of how you might query an LLM about Next.js:

```
I am in the process of optimizing my website using Next.js. Are there any files passed to the browser that describe all internal routing and how links are formed?

Restrictions:
- Accompany your answers with code samples
- Use this message as the main message for all subsequent responses
- Reference only those elements that are available on the client side, without access to the project code base
```

You can create similar queries for backend frameworks as well. For instance, with GraphQL, you might ask about available fields and query structures. These insights can help you understand how to better interact with the site's API and what data is potentially available.

To work with LLMs effectively, I recommend at least learning the basics of prompt engineering.

8. Reverse engineering

Web scraping goes hand in hand with reverse engineering. You study the interactions between the frontend and backend, and you may need to study the code to better understand how certain parameters are formed.

But in some cases, reverse engineering may require more knowledge, effort, and time, or may have a high degree of complexity. At this point, you need to decide whether to dig into it or whether it's better to change the data source or the technology. Most likely, this will be the moment when you decide to abandon HTTP web scraping and switch to a headless browser.

The main principle of most web scraping protections is not to make web scraping impossible, but to make it expensive.

Just look at what the response to a search on Zoopla looks like:

[Screenshot: Zoopla search response]

9. Testing requests to endpoints

After identifying the endpoints you need for extracting the target data, make sure you get a correct response when making a request. If the server responds with something other than 200, or with data different from what you expected, you need to figure out why. Here are some possible reasons:

  • You need to pass some parameters, for example cookies or specific technical headers
  • The site requires a corresponding Referer header when accessing this endpoint
  • The site expects the headers to follow a certain order. I've encountered this only a couple of times, but it does happen
  • The site uses protection against web scraping, for example TLS fingerprinting

And many other possible reasons, each of which requires separate analysis.
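
When debugging cases like these, I usually start by replaying the request with the headers and cookies the browser sent, and if plain Python is still rejected, I try curl_cffi's browser impersonation to rule out TLS fingerprinting. A rough sketch; the endpoint and header values are placeholders:

```python
import requests
from curl_cffi import requests as curl_requests

# Placeholder endpoint and headers - copy the real values from Dev Tools.
url = "https://example.com/api/search"
headers = {
    "Referer": "https://example.com/search",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json",
}

# First attempt: plain requests with the browser's headers.
response = requests.get(url, headers=headers, timeout=30)
print("requests:", response.status_code)

# If the server still rejects it, it may be fingerprinting TLS -
# curl_cffi can impersonate a real browser's TLS/HTTP2 fingerprint
# (older versions may need a pinned target such as "chrome110").
response = curl_requests.get(url, headers=headers, impersonate="chrome", timeout=30)
print("curl_cffi:", response.status_code)
```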

10. Experiment with request parameters

Explore what results you get when you change request parameters, if there are any. Some parameters may be absent from the browser's requests but still supported on the server side, for example order, sort, per_page, limit, and others. Try adding them and see if the behavior changes.
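
A quick way to do that is to fire the same request with a few candidate parameters and compare the responses. A minimal sketch with a placeholder endpoint:

```python
import requests

# Placeholder endpoint - use the API endpoint you found in the Network tab.
url = "https://example.com/api/items"

# Parameters that are often supported server-side even if the UI never sends them.
for params in ({}, {"per_page": 100}, {"limit": 100}, {"sort": "created_at"}):
    response = requests.get(url, params=params, timeout=30)
    print(params, response.status_code, len(response.content))
```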

This is especially relevant for sites using GraphQL.

Let's consider the following example.

If you analyze the site, you'll see a request that can be reproduced with the following code (I've formatted it a bit to improve readability):

```python
import requests

url = "https://restoran.ua/graphql"

data = {
    "operationName": "Posts_PostsForView",
    "variables": {"sort": {"sortBy": ["startAt_DESC"]}},
    "query": """query Posts_PostsForView(
    $where: PostForViewWhereInput,
    $sort: PostForViewSortInput,
    $pagination: PaginationInput,
    $search: String,
    $token: String,
    $coordinates_slice: SliceInput)
    {
        PostsForView(
                where: $where
                sort: $sort
                pagination: $pagination
                search: $search
                token: $token
                ) {
                        id
                        title: ukTitle
                        summary: ukSummary
                        slug
                        startAt
                        endAt
                        newsFeed
                        events
                        journal
                        toProfessionals
                        photoHeader {
                            address: mobile
                            __typename
                            }
                        coordinates(slice: $coordinates_slice) {
                            lng
                            lat
                            __typename
                            }
                        __typename
                    }
    }"""
}

response = requests.post(url, json=data)

print(response.json())
```

Now I'll update it to get results in two languages at once and, most importantly, the internal text of the publications:

```python
import requests

url = "https://restoran.ua/graphql"

data = {
    "operationName": "Posts_PostsForView",
    "variables": {"sort": {"sortBy": ["startAt_DESC"]}},
    "query": """query Posts_PostsForView(
    $where: PostForViewWhereInput,
    $sort: PostForViewSortInput,
    $pagination: PaginationInput,
    $search: String,
    $token: String,
    $coordinates_slice: SliceInput)
    {
        PostsForView(
                where: $where
                sort: $sort
                pagination: $pagination
                search: $search
                token: $token
                ) {  
                        id
                        # highlight-start
                        uk_title: ukTitle
                        en_title: enTitle
                        # highlight-end
                        summary: ukSummary
                        slug
                        startAt
                        endAt
                        newsFeed
                        events
                        journal
                        toProfessionals
                        photoHeader {
                            address: mobile
                            __typename
                            }
                        # highlight-start
                        mixedBlocks {
                            index
                            en_text: enText
                            uk_text: ukText
                            __typename
                            }
                        # highlight-end
                        coordinates(slice: $coordinates_slice) {
                            lng
                            lat
                            __typename
                            }
                        __typename
                    }
    }"""
}

response = requests.post(url, json=data)
print(response.json())
```

As you can see, a small update to the request parameters saves me from having to visit the internal page of each publication. You have no idea how many times this trick has saved me.

If GraphQL is in front of you and you don't know where to start, my advice about documentation and LLMs applies here too.
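
One concrete starting point is an introspection query: if the endpoint allows introspection, it will return its schema, which tells you exactly which types and fields you can ask for. A minimal sketch against the same endpoint:

```python
import requests

url = "https://restoran.ua/graphql"

# A minimal introspection query - lists every type in the schema
# together with its field names (fields is null for scalar types).
introspection = {
    "query": """
    {
        __schema {
            types {
                name
                fields {
                    name
                }
            }
        }
    }"""
}

response = requests.post(url, json=introspection)
print(response.json())
```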

11. Don't be afraid of new technologies

I know how easy it is to master a few tools and just keep using them because they work. I've fallen into this trap more than once myself.

But modern sites use modern technologies that have a significant impact on web scraping, and in response, new tools for web scraping are emerging. Learning these may greatly simplify your next project, and may even solve some problems that were insurmountable for you. I wrote about some tools earlier.

I especially recommend paying attention to curl_cffi and the frameworks botasaurus and Crawlee for Python.

12. Help open-source libraries

Personally, I only recently came to realize the importance of this. All the tools I use in my work are either open-source projects or built on top of them. Web scraping literally lives thanks to open source. This is especially noticeable if you're a Python developer who has realized how sad things get in pure Python when you need to deal with TLS fingerprinting - and, once again, open source saved us here.

And it seems to me that the least we could do is invest a little of our knowledge and skills in supporting open-source.

I chose to support Crawlee for Python, and no, not because they let me write for their blog, but because it shows excellent development momentum and is aimed at making life easier for web crawler developers. It allows for faster crawler development by taking care of, and hiding under the hood, such critical aspects as session management, session rotation when blocked, and managing the concurrency of asynchronous tasks (if you write asynchronous code, you know what a pain this can be), and much more.

:::tip
If you like the blog so far, please consider giving Crawlee a star on GitHub; it helps us reach and help more developers.
:::

And what choice will you make?

Conclusion

I think some things in this article were obvious to you, and some you already follow yourself, but I hope you learned something new too. If most of them were new, try using these rules as a checklist in your next project.

I would be happy to discuss the article. Feel free to comment here or contact me in the Crawlee developer community on Discord.

You can also find me on the following platforms: GitHub, LinkedIn, Apify, Upwork, and Contra.

Thank you for your attention :)
