How to still use Crawlers in Client-Side Websites

Vasco Abelha

Posted on August 12, 2020

This was originally published on my blog; you can find the publication here!

If you want to discuss anything, feel free to reach me on Twitter.

Introduction

In this post, I will describe a solution I built for an existing client-side React platform whose users wanted to be able to share specific content on their social feeds.

This publication is useful for developers who:

  • have already built a Client-Side Website (it doesn't need to be solely React);
  • want to understand how to interact with the different crawlers.

Technologies used:

  • VPS, where the project was hosted;
  • Nginx;
  • ExpressJS (it doesn't matter what you are using);
  • ReactJS;
  • Facebook SDK - OpenGraph.

Contextualization

Whenever you share a link to a website on Facebook, Twitter, or any other social platform, they spawn a crawler that scrapes your website looking for meta tags that help them understand what they are looking at and how to share it - App, Card, Summary, Large Card, etcetera.

One of the biggest problems of a React client-side website is that everything is rendered through JavaScript. If you use a browser or a crawler that doesn't process JS, you will just be presented with a blank page - "You need to enable JavaScript to run this app." This applies to the Facebook and Twitter crawlers.

Example of Blank Page
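
To make this concrete, here is roughly the markup a non-JS crawler gets back from a typical Create React App build (a sketch based on the default index.html; the bundle path is illustrative):

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>React App</title>
  </head>
  <body>
    <!-- Without running the bundle below, this is all a crawler ever sees -->
    <noscript>You need to enable JavaScript to run this app.</noscript>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>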

In the end, if you share a URL from your website on one of these social platforms, you won't get any kind of card or information from your website.

Note: You can use https://cards-dev.twitter.com/validator to verify and test this yourself.

Twitter Card Validator

On the left, we have a React client-side website. On the right, we have a static website.

Both websites use React-Helmet (which allows modifications to your document head), yet the crawler fetches no meta tags for the left side because the page requires JavaScript to render.
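
For reference, setting these tags on the client side with React-Helmet looks roughly like this (NewsPage and its news prop are illustrative names, not the project's actual component):

import React from "react";
import { Helmet } from "react-helmet";

// Sets the Open Graph / Twitter tags in the document head, but only after the JS bundle runs,
// which is exactly why crawlers that don't execute JavaScript never see them.
const NewsPage = ({ news }) => (
  <>
    <Helmet>
      <title>{news.title}</title>
      <meta property="og:title" content={news.title} />
      <meta property="og:description" content={news.description} />
      <meta name="twitter:card" content="summary" />
    </Helmet>
    {/* ...rest of the page... */}
  </>
);

export default NewsPage;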

Show what the Crawlers want to see

If we are hosting the website on a typical Virtual Private Server, there is a good chance that we are using a web server like Apache, Nginx or lighttpd to process the incoming HTTP requests.
A web server like Nginx is therefore the perfect place to "trick" the crawlers and proxy them to a rendered HTML page containing the information we want them to see.
For this we need to:

  • know which requests come from the crawlers;
  • have a service that renders dynamic HTML content;
  • update Nginx to route crawlers to the new service.

Crawler Identification

After going through the Facebook and Twitter documentation, we can identify the crawlers by the following user-agent strings:

  • facebookexternalhit/1.1 (Facebook)
  • Twitterbot (Twitter)
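
As a quick sanity check, here is the same match we will later hand to Nginx, expressed as a small JavaScript sketch:

// The regular expression mirrors the one used in the Nginx config further below
const CRAWLER_REGEX = /facebookexternalhit|Twitterbot/;

console.log(CRAWLER_REGEX.test("facebookexternalhit/1.1"));                   // true  -> Facebook crawler
console.log(CRAWLER_REGEX.test("Twitterbot"));                                // true  -> Twitter crawler
console.log(CRAWLER_REGEX.test("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")); // false -> regular browser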

Service to render Dynamic HTML

There are other kinds of solutions here; you can use pretty much anything that renders an HTML webpage.

In this case, I already had an established set of services available through ExpressJS, so I stuck with it and created one endpoint that takes params (in this case, a news publication identifier) and returns an HTML page with all the head and meta tags that I want the crawlers to scrape.

Note: The URL path must match the one used to view the news publication on the website, so that the crawler's request can be proxied transparently.

Example of the service:

// routes/social.js -> socialRoutes
const express = require("express");
const router = express.Router();
// getNews is an existing helper that fetches the publication (title, description, cover_image, url) by id
...

router.get("/news/:id", async (req, res) => {
  const { id } = req.params;
  const { news } = await getNews(id);

  res.set("Content-Type", "text/html");
  res.send(`<!DOCTYPE html>
  <html>
    <head>
      <link rel="canonical" href="${news.url}" />
      <meta property="og:title" content="${news.title}" />
      <meta property="og:description" content="${news.description}" />
      <meta property="og:image" content="${news.cover_image}" />
      <meta property="og:url" content="${news.url}" />
      <meta name="twitter:title" content="${news.title}" />
      <meta name="twitter:description" content="${news.description}" />
      <meta name="twitter:image" content="${news.cover_image}" />
      <meta name="twitter:url" content="${news.url}" />
      <meta name="twitter:card" content="summary" />
    </head>
  </html>
`);
});

module.exports = router;

// server.js
...
app.use("/social", socialRoutes);
...
app.listen(3500, () => {
  console.log("started at localhost:3500");
});
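
With the service running, hitting the endpoint should return only that small HTML document. A quick way to check it locally (assuming a publication with id 1 exists, and Node 18+ for the global fetch):

// check-social.js - prints the HTML returned by the social endpoint
const check = async () => {
  const response = await fetch("http://localhost:3500/social/news/1");
  console.log(await response.text()); // should contain the og: and twitter: meta tags
};

check();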

Update NGINX and Send Crawlers to our Service

Knowing the user-agent strings of the crawlers and having already defined our service to generate HTML pages free of JavaScript, we can now "trick" the crawlers with the help of Nginx and send them to our service instead of the real webpage.
Usually, if you are serving a React app under Nginx, your default.conf file will be similar to this:

server {
    root /var/www/html;

    # Add index.php to the list if you are using PHP
    index index.html index.htm index.nginx-debian.html;

    server_name www.example.com example.com;

    location / {
        try_files $uri /index.html;
    }
}

Nevertheless, this isn't enough, because the crawlers will still be served the files located in root and will only see blank pages, since everything depends on JavaScript rendering.

Therefore, we need to add a prior condition that verifies the user-agent before sending the request to our project folder.

server {
    root /var/www/html;

    # Add index.php to the list if you are using PHP
    index index.html index.htm index.nginx-debian.html;

    server_name www.example.com example.com;

    location / {
        # Proxy the request to our API if the user-agent matches one of these regular expressions
        if ($http_user_agent ~ facebookexternalhit|Twitterbot) {
            proxy_pass http://localhost:3500/social$uri$is_args$args;
        }
        try_files $uri /index.html;
    }
}
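
After reloading Nginx, you can simulate a crawler yourself by sending a request with one of those user-agent strings and checking that the meta-tag page comes back instead of the React index.html (example.com stands in for your real domain; Node 18+ for the global fetch):

// simulate-crawler.js - pretends to be the Twitter crawler
const simulate = async () => {
  const response = await fetch("https://www.example.com/news/1", {
    headers: { "User-Agent": "Twitterbot" },
  });
  const html = await response.text();
  console.log(html.includes("og:title") ? "Proxied to the social service" : "Served the React app");
};

simulate();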

Conclusion

Every time a new request matches the user-agents of the Facebook and Twitter crawlers, we proxy it to our service for HTML rendering. This, in turn, allows the crawlers to process our "not-so-real" webpage as if it were the real one and fetch the meta tags needed to share our website.

As long as you have some kind of middleware that can act as a reverse proxy, you can still allow client-side web applications to be scraped by crawlers that don't execute JavaScript.

Nevertheless, if possible, you should take a look at Static Site Generators or Server-Side Rendering frameworks.

This publication is only meant to shed some light on how you can interact with crawlers and to possibly guide or help someone working on something similar.
