Mike Fallows
Posted on August 25, 2023
Having set up a fair few sites over the years, I periodically get a chunk of emails from Google's Search Console notifying me of indexing errors. Usually, these are pretty small potatoes caused by things like old products being unpublished and will resolve themselves on the next crawl.
I realised I hadn't yet set up a robots.txt file for this site when a couple of errors popped up in Search Console that I wouldn't have expected. The errors I had were for:
https://mikefallows.com/cdn-cgi/l/email-protection
https://mikefallows.com/admin/
I had already set up a sitemap.xml file but somehow overlooked creating a robots.txt file at the time. The /admin/ URL was easy for me to identify: it belongs to my Forestry integration, which I use as a CMS. That page has a noindex meta tag in the head.
<meta name="robots" content="noindex" />
I was mistaken
I was under the impression that would be enough to signal to Search Console that it should be ignored, but it turns out the page was also being included in my sitemap.xml, so it's (quite rightly) marked as invalid.
The other URL, which started /cdn-cgi/l/email-protection, was more of a mystery. I hadn't added anything in a /cdn-cgi/ folder! The fact that it contained a reference to a CDN was a clue, so I wondered if it was related to Netlify, but I couldn't think of an obvious reason why it would have any reference to emails. After a bit of quick research, I realised this was related to Cloudflare, which I'd recently set up for the site. As I had activated their proxy in front of the site, that explained the unknown folder, and the path appears to be part of their email address obfuscation, which hides addresses on the page from scraper bots.
So to fix these validation errors in Search Console I needed to:
- add a robots.txt file that disallows /admin/ and /cdn-cgi/
- exclude /admin/ from my sitemap.xml
Adding a robots.txt file
This is super-easy in Eleventy. I created a file, src/robots.txt, and added the following to my .eleventy.js config:
// Put robots.txt in root
eleventyConfig.addPassthroughCopy({ 'src/robots.txt': '/robots.txt' });
The addPassthroughCopy method will just copy the file "as is" into the generated _site folder. Great.
My robots.txt file looked like this:
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /cdn-cgi/
Host: https://mikefallows.com/
Sitemap: https://mikefallows.com/sitemap.xml
The important parts were the two Disallow rules, which tell bots that they shouldn't crawl those paths.
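To get a feel for how a well-behaved crawler might read those rules, here is a rough sketch in JavaScript. It assumes the usual convention that the longest matching rule wins and that Allow wins a tie, and it deliberately ignores the * and $ wildcards. The isAllowed function and the rules array are hypothetical names for illustration, not part of any library.

```javascript
// Rules mirroring the "User-agent: *" group in the robots.txt above.
const rules = [
  { type: 'allow', path: '/' },
  { type: 'disallow', path: '/admin/' },
  { type: 'disallow', path: '/cdn-cgi/' },
];

// Returns true when a crawler honouring these rules may fetch urlPath.
// Simplification: plain prefix matching only, no "*" or "$" wildcards.
function isAllowed(urlPath, ruleList) {
  let best = null;
  for (const rule of ruleList) {
    if (!urlPath.startsWith(rule.path)) continue;
    const isLonger = best === null || rule.path.length > best.path.length;
    const isTieWonByAllow =
      best !== null &&
      rule.path.length === best.path.length &&
      rule.type === 'allow';
    if (isLonger || isTieWonByAllow) best = rule;
  }
  // No matching rule means the path is allowed by default.
  return best === null || best.type === 'allow';
}

console.log(isAllowed('/admin/', rules)); // false
console.log(isAllowed('/cdn-cgi/l/email-protection', rules)); // false
console.log(isAllowed('/blog/', rules)); // true
```

Because both Disallow paths are longer than the catch-all Allow: /, they take precedence for anything underneath them.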
You can also view whatever the current version is.
Excluding pages from sitemap.xml
My sitemap is generated by a single sitemap.xml.njk file:
---
permalink: /sitemap.xml
eleventyExcludeFromCollections: true
---
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{%- for page in collections.all %}
{%- set adminUrl = r/^\/admin\//i.test(page.url) %}
{%- set draft = page.data.draft %}
{%- if not adminUrl and not draft %}
{%- set absoluteUrl %}{{ page.url | url | absoluteUrl(metadata.url) }}{% endset %}
<url>
<loc>{{ absoluteUrl }}</loc>
<lastmod>{{ page.date | htmlDateString }}</lastmod>
<changefreq>{{ page.data.changeFreq if page.data.changeFreq else "monthly" }}</changefreq>
</url>
{%- endif %}
{%- endfor %}
</urlset>
This generates an XML file for all pages, excluding any pages where the URL begins /admin/ or that are marked as a draft. I usually default my posts to draft until I'm ready to publish them. Draft posts are excluded by checking if the frontmatter has a draft value set to true.
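If Nunjucks isn't your thing, the template's condition can be sketched in plain JavaScript roughly as follows. The sitemapUrls function and the sample pages are hypothetical; only the url and data.draft fields come from the template.

```javascript
// Hypothetical stand-in for collections.all; only url and data.draft are used.
const pages = [
  { url: '/', data: {} },
  { url: '/admin/', data: {} },
  { url: '/blog/unfinished-post/', data: { draft: true } },
  { url: '/blog/published-post/', data: { draft: false } },
];

// Mirrors the template's condition: keep a page only if it is neither
// under /admin/ nor marked as a draft.
function sitemapUrls(allPages) {
  return allPages
    .filter((page) => {
      const adminUrl = /^\/admin\//i.test(page.url);
      const draft = page.data.draft;
      return !adminUrl && !draft;
    })
    .map((page) => page.url);
}

console.log(sitemapUrls(pages)); // [ '/', '/blog/published-post/' ]
```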
The key bit was writing the regex to test that a URL begins /admin/:
set adminUrl = r/^\/admin\//i.test(page.url)
Just to break that down:
- r/ : regular expressions in Nunjucks need to be prefixed with r
- ^ : indicates we're only matching the start of the string
- \/admin\/ : literally matches /admin/
- /i : makes the test case insensitive (y'know, just in case)
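Apart from the r prefix, this is ordinary JavaScript regex syntax as far as I can tell, so the pattern can be tried out in Node or a browser console. The sample URLs below are made up for illustration.

```javascript
// Same pattern as in the template, minus the Nunjucks "r" prefix.
const adminPattern = /^\/admin\//i;

console.log(adminPattern.test('/admin/')); // true
console.log(adminPattern.test('/Admin/settings/')); // true (case insensitive)
console.log(adminPattern.test('/blog/admin/')); // false (not at the start)
console.log(adminPattern.test('/administrator/')); // false (no slash after "admin")
```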
What I find most enjoyable about having my own site is having the time to tinker and dig through these types of issues. When they're low pressure like this one, it's great to spend a little time polishing and learning in a way that I often miss in client work. For such a small task, I solidified my knowledge just a little bit more about Eleventy, Sitemaps, Regex, and Robots files and that's mostly due to taking the time to write it up.