Optimizing Hugo Sitemaps to prioritize crawling posts over taxonomies

You might have noticed that I’ve been struggling with SEO recently and trying some tricks to improve it, but it still doesn’t feel like enough.

I’ve been thinking about Google’s Crawl Budget¹ and how I could steer it toward new and important content. Adding lastmod should help, but with each new post Hugo also regenerates a ton of taxonomy pages (tags, categories, etc.), and their lastmod gets updated too. Taxonomy pages might catch some traffic eventually, but nowadays they’re mostly useful for humans looking for similar content. Search engines prefer to send people directly to the content.

Initially I was thinking about excluding taxonomy pages from crawling, for example:

tags/*
categories/*
paginated pages like */page/1

This would certainly let Google focus mostly on the pages with content, but since those pages wouldn’t be removed from the sitemap, they would pollute it and I would be left with a ton of errors in Google’s Search Console. Not nice.
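For illustration, here is a minimal sketch of what such an exclusion would do, using Python’s standard-library robots.txt parser. The rules, the example.org URLs, and the is_crawlable helper are assumptions made up for this sketch, not my actual setup. The standard-library parser only matches path prefixes, so the */page/1 wildcard is left out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; the paths mirror the taxonomy list above.
# The stdlib parser only does prefix matching, so the "*/page/1" wildcard
# pattern is intentionally not part of this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tags/
Disallow: /categories/
"""


def is_crawlable(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.org and these paths are placeholders for illustration.
    for url in (
        "https://example.org/posts/2024/some-post/",
        "https://example.org/tags/hugo/",
        "https://example.org/categories/seo/",
    ):
        print(url, "->", "crawl" if is_crawlable(url) else "blocked")
```

The blocked URLs would still be listed in the sitemap, which is exactly the mismatch described above.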

I was also thinking about using the robots meta tag set to “noindex, follow”:

<meta name=robots content="noindex, follow">

which in theory should keep those pages out of the index while still letting Google discover the link structure and the importance of specific URLs. But it looks like that assumption no longer holds, at least for Google. I don’t know about others like Bing, Yandex, or Baidu. Some percentage of my traffic comes from other search engines, so I don’t want to “piss them off”.

I started looking for options to exclude tags and categories from sitemap generation and accidentally found something else: the cascade param in front matter². Combined with the sitemap override³, it pointed me in another direction: maybe I could just set up priority and changefreq for specific groups of pages and that would do the trick. For more details, check the Sitemap protocol docs⁴.

By default, Hugo’s sitemap layout produces a quite basic sitemap without any priorities set. This doesn’t guide Google well on which pages should be crawled first. Let’s chase this idea.

I created a directory and file, tags/_index.md, to match my taxonomy preference. In the file I added:

---
title: Tags
cascade:
  params:
    sitemap:
      # disable: true
      changefreq: weekly
      priority: 0.1
---

There is an option to disable the entries entirely, but tags can actually route some traffic to my site, so I didn’t like it. Instead, I changed the frequency and set the priority to 0.1, near the bottom of the 0.0 to 1.0 range that the Sitemap protocol allows.

Warning

It’s really important to remember the title here! A custom tags/_index.md overrides the defaults, and the title can evaluate to an empty string, which in my case broke the breadcrumbs and caused Google’s Search Console to report critical issues!

I followed the same pattern for categories/_index.md, but with priority: 0.2, as I have fewer categories and they feel more important to me.

Then came posts/_index.md with priority: 0.9 and posts/2024/_index.md with priority: 1.0. Posts are the core of my blog, so I gave them the highest priority, with the most recent ones even higher. To finish, I added a few changefreq values here and there too, based on gut feeling.
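How the two posts files interact can be pictured with a small model. This is a simplified sketch, not Hugo’s implementation: it assumes that settings cascade from a section to its descendants and that the nearest section wins per key. The CASCADES table and the resolve_sitemap function are hypothetical names made up for illustration, filled with the values mentioned in this post.

```python
from pathlib import PurePosixPath

# Sitemap settings cascaded from each section's _index.md. This dict is a
# hypothetical stand-in for the front matter files, not something Hugo reads.
CASCADES = {
    "/tags": {"changefreq": "weekly", "priority": 0.1},
    "/categories": {"changefreq": "weekly", "priority": 0.2},
    "/posts": {"priority": 0.9},
    "/posts/2024": {"changefreq": "weekly", "priority": 1.0},
}


def resolve_sitemap(page_path: str) -> dict:
    """Merge cascaded settings from the root down, so closer sections win."""
    path = PurePosixPath(page_path)
    settings: dict = {}
    # `parents` runs nearest-first; reverse it to apply the farthest first
    # and let nearer sections overwrite individual keys.
    for section in reversed(path.parents):
        settings.update(CASCADES.get(str(section), {}))
    return settings


if __name__ == "__main__":
    # The page paths below are placeholders for illustration.
    for page in ("/posts/2024/new-post/", "/posts/2023/old-post/", "/tags/hugo/"):
        print(page, resolve_sitemap(page))
```

Under this model a 2024 post picks up the values from posts/2024/_index.md, while a 2023 post falls back to the 0.9 set in posts/_index.md.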

This is more or less the final result:

graph LR
    root["/"] --> posts["/posts"]
    posts --> older["/..."]
    older --> older_sitemap["priority: 0.9"]
    posts --> 2023["/2023"]
    2023  --> older_sitemap
    posts --> 2024["/2024"]
    2024  --> newer_sitemap["changefreq: weekly, priority: 1.0"]

    root --> books["/books"]
    books --> book_sitemap["changefreq: monthly, priority: 0.6"]

    root --> tags["/tags"]
    tags  --> tags_sitemap["changefreq: weekly, priority: 0.1"]

    root --> categories["/categories"]
    categories  --> categories_sitemap["changefreq: weekly, priority: 0.2"]

    root --> authors["/authors"]
    authors  --> authors_sitemap["changefreq: yearly, priority: 0.1"]

This should prioritize crawling of freshly created content over automatically generated pages. When I write less, crawlers can go through the taxonomy pages, but otherwise they should prioritize the true content.
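One way to confirm that the settings actually landed in the generated sitemap is to parse it and compare sections. The sketch below is a minimal example under assumptions: the SAMPLE document, the example.org URLs, and the read_priorities and posts_outrank_taxonomies helpers are made up for illustration, and in practice the XML would come from the sitemap file Hugo generates. It only checks what the sitemap says; how a crawler uses these hints is up to the crawler.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional
from urllib.parse import urlparse

# XML namespace defined by the Sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap excerpt, shaped like the structure described above.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/posts/2024/new-post/</loc><changefreq>weekly</changefreq><priority>1.0</priority></url>
  <url><loc>https://example.org/posts/2023/old-post/</loc><priority>0.9</priority></url>
  <url><loc>https://example.org/tags/hugo/</loc><changefreq>weekly</changefreq><priority>0.1</priority></url>
  <url><loc>https://example.org/categories/seo/</loc><changefreq>weekly</changefreq><priority>0.2</priority></url>
</urlset>
"""


def read_priorities(sitemap_xml: str) -> Dict[str, Optional[float]]:
    """Map each URL path to its <priority>, or None when the tag is absent."""
    root = ET.fromstring(sitemap_xml)
    result: Dict[str, Optional[float]] = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        prio = url.findtext("sm:priority", namespaces=NS)
        result[urlparse(loc).path] = float(prio) if prio is not None else None
    return result


def posts_outrank_taxonomies(priorities: Dict[str, Optional[float]]) -> bool:
    """True if every post priority is above every tag/category priority."""
    posts = [p for path, p in priorities.items()
             if path.startswith("/posts/") and p is not None]
    taxonomies = [p for path, p in priorities.items()
                  if path.startswith(("/tags/", "/categories/")) and p is not None]
    return bool(posts) and bool(taxonomies) and min(posts) > max(taxonomies)


if __name__ == "__main__":
    priorities = read_priorities(SAMPLE)
    for path, prio in sorted(priorities.items()):
        print(path, prio)
    print("posts outrank taxonomies:", posts_outrank_taxonomies(priorities))
```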
