The crucial difference between robots.txt and x-robots-tag
Andrew Betts
Posted on November 27, 2023
“Robots” directives control how search engine crawlers are allowed to index your website, and are one of the oldest web standards. But does it matter if you use the meta tag, HTTP header, or the robots.txt file? Yes, very much so.
I hadn’t given this a huge amount of thought until I recently had a conversation with Harry Roberts at the perf.now() conference. Harry made me realise that there is a significant difference between robots.txt and the other approaches to restricting indexing, which, if not correctly understood, might mean your site goes completely unindexed by Google and friends.
Here’s an example robots.txt file telling a search crawler that it is not allowed to crawl anything under the /api path on that domain:
User-Agent: *
Disallow: /api
This means the crawler will not even request anything under that path (the robots.txt file itself is, of course, still fetched). But, where robots.txt allows it, and the crawler does make a request for something on my site, I can still stop that response from being indexed by adding a meta tag to the markup, if the response is an HTML page:
<meta name="robots" content="noindex">
Of course not everything is an HTML page, and Google and other search engines will often show documents like PDFs in search results, so a generally better approach is to use an HTTP header, which has the same effect and can be used on any type of response:
x-robots-tag: noindex
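For example, a PDF download could carry the directive like this (a hypothetical response, trimmed to the relevant headers):
HTTP/1.1 200 OK
content-type: application/pdf
x-robots-tag: noindex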
This brings us to the trade-offs between using robots.txt and these per-response directives.
Sub-resources must be fetchable
Search engines crawl using a real browser, and will therefore download all the images, CSS and scripts needed to render the page. If a script on the page makes API calls, the crawler will make those requests too.
Imagine you have a site that requires an API response to fully render a page (eg as part of a React app that doesn’t do server side rendering). The content inserted into the page from that API response will be indexable if the page itself is indexable, regardless of whether the API response had a noindex directive on it. However, if the API request path is disallowed by robots.txt, the crawler won’t even make the request at all, and the page will be incomplete.
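One way to square this is to keep the API path crawlable in robots.txt, but label its responses as non-indexable at the application layer. Here’s a minimal sketch of that idea using an Express app (the framework, the port and the /api/products endpoint are illustrative assumptions, not part of the original example):
import express from "express";

const app = express();

// Label every /api response as noindex: crawlers can still fetch these
// responses to render indexable pages, but won't index the responses
// themselves.
app.use("/api", (req, res, next) => {
  res.set("x-robots-tag", "noindex");
  next();
});

// Hypothetical endpoint, purely for illustration.
app.get("/api/products", (req, res) => {
  res.json({ products: [] });
});

app.listen(3000);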
Pages need to be discoverable
By using robots.txt to prevent resources from being fetched, you also prevent the crawler from discovering links to other resources. Conversely, a per-resource noindex directive says “don’t index me”, but doesn’t stop the crawler from scanning the page to find things that are indexable (you can separately tell crawlers not to follow links in a page with the nofollow directive).
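For example, to keep a page out of the index and also ask crawlers not to follow any of the links it contains, the two directives can be combined in a single header:
x-robots-tag: noindex, nofollow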
In my experience crawlers have a creepy ability to find things even if you think they’re not linked to anywhere, but it’s certainly possible to make a piece of content accidentally undiscoverable.
Don’t invite unnecessary crawler traffic
You might be getting the sense at this point that you should lean into using resource-level noindex directives instead of robots.txt. But that comes with a different risk: simply creating unnecessary requests to your servers. Crawlers tend to create traffic that’s more expensive to service, because they hit every page of your site. This forces your application stack to generate each page separately, even including obscure long-tail content for which crawlers might be the only regular clients.
Certainly for situations where, for example, you want an entire domain to be off limits to a crawler, having it download every page only to find it has a noindex directive is pretty wasteful.
It would also be easy to forget to include the noindex on some resources and end up having a site indexed by accident. Robots.txt is certainly easier to apply comprehensively.
Getting the balance right
So, taking the above scenarios into account, there are a few rules of thumb that make for a generally pretty good robots strategy:
- Where a whole domain should be off limits to crawlers, for example for staging/preview URLs, use robots.txt (see the example just after this list).
- Where resources should not be indexed but might be necessary to render indexable pages, they should be marked as noindex using an x-robots-tag header.
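For the first case, the staging domain’s robots.txt can simply disallow everything:
User-Agent: *
Disallow: /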
Looking back at the earlier example of using robots.txt to disallow /api: this seems likely to be a risky decision. Maybe responses from /api would be better labelled with a noindex directive instead.
Use an edge network
If you use an edge network like Fastly, it’s often possible to adjust the headers of responses before they are served to the end user. This can be handy for quickly adjusting the indexability of resources. Check out this example to learn how to add and remove headers. You could even consider generating and serving the robots.txt file from the edge too.
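As a rough sketch of the idea, using Fastly’s JavaScript SDK for Compute (the backend name “origin” and the /api path check are assumptions for illustration), an edge program could stamp the header onto matching responses on their way to the client:
/// <reference types="@fastly/js-compute" />

addEventListener("fetch", (event) => event.respondWith(handleRequest(event)));

async function handleRequest(event: FetchEvent): Promise<Response> {
  // Forward the request to the origin backend.
  const originResponse = await fetch(event.request, { backend: "origin" });

  // Copy the headers so they can be modified, then add the robots
  // directive for anything under /api.
  const headers = new Headers(originResponse.headers);
  if (new URL(event.request.url).pathname.startsWith("/api")) {
    headers.set("x-robots-tag", "noindex");
  }

  return new Response(originResponse.body, {
    status: originResponse.status,
    headers,
  });
}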