Sitemap / Robots.txt Cross-Check: two sources, one truth

Google’s crawler consults two files when deciding whether a URL gets crawled and indexed: the XML sitemap acts as an active invitation list, and robots.txt as an exclusion list. When the two disagree, because a URL sits in the sitemap but is also blocked by robots.txt, Google receives contradictory signals. Search Console surfaces those URLs under “Indexed, though blocked by robots.txt” or “Blocked by robots.txt”.

This scenario is classic after migrations: a new Disallow: /preview/ rule lands in robots.txt, but the sitemap generator keeps exporting preview URLs. Or a Sitemap: entry still points at a file that doesn’t exist after a relaunch — a 404 nobody notices.

Our cross-check accepts a URL or a domain, fetches the robots.txt, parses every Sitemap: declaration, reads each sitemap (including <sitemapindex> files), and tests a sample of its URLs against the robots.txt rules. The compact punch list tells you immediately: what matches, what doesn’t, and what’s missing.
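
The sketch below shows what that pipeline can look like using nothing but the Python standard library. It assumes a flat <urlset> sitemap, checks against the * user-agent group, and samples 25 URLs per sitemap as described above; the function name, timeouts, and error handling are illustrative, not the tool’s actual implementation.

```python
import urllib.request
import urllib.robotparser
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
SAMPLE_SIZE = 25  # per-sitemap sample, as described above

def cross_check(origin: str) -> list[tuple[str, str]]:
    """Return (sitemap_url, blocked_url) pairs: listed in the sitemap, disallowed in robots.txt."""
    robots_url = origin.rstrip("/") + "/robots.txt"

    # Fetch robots.txt once: the raw text yields the Sitemap: declarations,
    # the parser handles the Disallow/Allow matching.
    with urllib.request.urlopen(robots_url, timeout=10) as resp:
        raw = resp.read().decode("utf-8", errors="replace")
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(raw.splitlines())

    sitemap_urls = [line.split(":", 1)[1].strip()
                    for line in raw.splitlines()
                    if line.strip().lower().startswith("sitemap:")]

    findings = []
    for sitemap_url in sitemap_urls:
        with urllib.request.urlopen(sitemap_url, timeout=10) as resp:
            root = ET.fromstring(resp.read())
        locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
        # Test a sample of the listed URLs against the wildcard (*) group.
        for url in locs[:SAMPLE_SIZE]:
            if not parser.can_fetch("*", url):
                findings.append((sitemap_url, url))
    return findings

for sitemap, blocked in cross_check("https://example.com"):
    print(f"{blocked} is listed in {sitemap} but disallowed by robots.txt")
```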

Sitemap ↔ Robots.txt Cross-Check

Enter a domain to audit the consistency between its robots.txt and its XML sitemap(s). Flags URLs that are listed in the sitemap but blocked by robots.txt, sitemap declarations that 404, and missing sitemap declarations.

Heads-up: when running many checks in quick succession, target servers or CDNs (Cloudflare, Imperva, Akamai, Sucuri) may temporarily throttle or block the requests. If a check returns WAF or rate-limit errors, wait a few minutes before retrying. Results are cached for 5 minutes, so re-running the same domain is free.

How to run the cross-check

  1. Enter a domain: either just example.com or any URL on the same domain. The tool extracts the origin and appends /robots.txt automatically (see the sketch after this list).
  2. Click the button: the audit usually takes 2–5 seconds and is near-instant on repeated runs thanks to a 5-minute transient cache.
  3. Read the summary strip: four stat boxes show declared sitemaps, reachable sitemaps, URLs sampled, and URLs disallowed by robots.txt. That last value in red is the action indicator.
  4. Check the findings: results are sorted by severity (high > medium > info). Red = fix immediately. Blue = informational only, not a bug.
  5. Expand sitemap cards: each discovered sitemap gets its own card with status code, type (urlset vs. sitemapindex), sample table, and — where a URL is blocked — the exact rule that blocks it along with its line number in robots.txt.
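
A minimal sketch of step 1, the origin extraction, assuming the user may paste either a bare domain or a full URL; the helper name robots_url_for is made up for illustration.

```python
from urllib.parse import urlsplit

def robots_url_for(user_input: str) -> str:
    # Accept bare domains ("example.com") as well as full URLs.
    if "//" not in user_input:
        user_input = "https://" + user_input
    parts = urlsplit(user_input)
    origin = f"{parts.scheme}://{parts.netloc}"
    return origin + "/robots.txt"

print(robots_url_for("example.com/blog/post-1"))  # -> https://example.com/robots.txt
```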

What the sample size means

For sitemaps with tens of thousands of URLs, the cross-check does not test the full list — it tests a sample of 25 URLs per sitemap. The reason is pragmatic: running 10,000 URLs individually against robots rules would slow the audit down without adding insight. If even one URL collides with the typical /preview/ pattern, the rule conflict shows up in the report. For a full per-URL audit, supplement with Google Search Console → Coverage.

For <sitemapindex> files

Large sites pack their sitemaps into an index file. Our tool recurses one level deep and examines the first three sub-sitemaps. That’s enough to surface typical collisions without hammering 500 sub-sitemaps.
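
One way this can look, sketched with the standard library: the sitemap namespace and the limit of three sub-sitemaps follow the description above, while the function name and the recursion flag are illustrative assumptions.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_CHILD_SITEMAPS = 3

def collect_page_urls(sitemap_url: str, recurse: bool = True) -> list[str]:
    with urllib.request.urlopen(sitemap_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())

    if root.tag == NS + "sitemapindex":
        if not recurse:
            return []  # one level deep only: never follow an index inside an index
        children = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
        urls = []
        for child in children[:MAX_CHILD_SITEMAPS]:
            urls.extend(collect_page_urls(child, recurse=False))
        return urls

    # <urlset>: a flat list of page URLs
    return [el.text.strip() for el in root.iter(NS + "loc") if el.text]
```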

Why do these errors happen?

Migration from staging to production

Staging environments typically ship a catch-all Disallow: / in robots.txt to keep the unfinished site out of search indexes. At go-live the file is swapped, but the sitemap generator keeps exporting staging-specific URLs that either no longer exist after the relaunch or 301-redirect to production. Google pulls those stale URLs from the sitemap and follows the redirects; the old robots rule no longer applies to them, but the sitemap has already sent the wrong signal.

Preview and intranet paths

WordPress installs with page-preview plugins, staging subdomains, or internal client areas commonly declare Disallow: /preview/ or Disallow: /intern/. If the XML sitemap still serves the same URLs (e.g. because they exist as “pages” in WP with no noindex meta), you get exactly the contradiction this tool catches.

Forgetting to declare the sitemap in robots.txt

A classic: the site serves /sitemap.xml, but robots.txt has no Sitemap: line. Google still finds the file through Search Console, but other crawlers (Bing, Seznam, Baidu, AI bots) rely on the declaration. We test /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, and /wp-sitemap.xml as fallbacks — when one hits, a blue info finding surfaces with a remediation tip.
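
A hedged sketch of that fallback probing: the four paths match the list above; the HEAD method, timeout, and helper name are assumptions about how such a check could be written, not the tool’s actual code.

```python
import urllib.error
import urllib.request

FALLBACK_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml", "/wp-sitemap.xml"]

def probe_fallback_sitemaps(origin: str) -> list[str]:
    found = []
    for path in FALLBACK_PATHS:
        request = urllib.request.Request(origin + path, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                if response.status == 200:
                    found.append(origin + path)
        except urllib.error.HTTPError:
            continue  # 404, 403 etc.: not a usable fallback
        except urllib.error.URLError:
            continue  # DNS or connection failure
    return found
```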

Frequently asked questions

What does the status code on each sitemap card mean?

The HTTP status code of the HEAD request we made against the sitemap URL. 200 = reachable. 301/302 = redirect (we follow automatically). 403 combined with a “Blocked by WAF” badge = Cloudflare or similar is blocking automated requests; allow-list the sitemap in your firewall, otherwise Google may not reach it either. 404 = the sitemap file doesn’t exist.

Why is my sitemap entry green but the URL inside marked “Disallowed”?

Because the sitemap file’s HTTP status is green (200), but an individual URL inside it is caught by a Disallow rule in robots.txt. That’s the tool’s headline finding. The card shows the rule and the line number in robots.txt — you know exactly where to look.
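
To illustrate how a finding can be traced back to a rule and its line number, here is a deliberately simplified lookup: it handles plain prefix rules only (no wildcards, no Allow precedence) and picks the longest matching Disallow prefix. It is a sketch, not the matcher the tool actually uses.

```python
from urllib.parse import urlsplit

def blocking_rule(robots_txt: str, url: str) -> tuple[int, str] | None:
    """Return (line_number, rule_text) of the Disallow rule blocking the URL, if any."""
    path = urlsplit(url).path or "/"
    best = None
    best_prefix = ""
    for line_no, line in enumerate(robots_txt.splitlines(), start=1):
        rule = line.split("#", 1)[0].strip()
        if not rule.lower().startswith("disallow:"):
            continue
        prefix = rule.split(":", 1)[1].strip()
        # keep the longest (most specific) matching prefix
        if prefix and path.startswith(prefix) and len(prefix) > len(best_prefix):
            best, best_prefix = (line_no, rule), prefix
    return best

robots = "User-agent: *\nDisallow: /preview/\nDisallow: /intern/\n"
print(blocking_rule(robots, "https://example.com/preview/draft-42"))
# -> (2, 'Disallow: /preview/')
```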

Does the tool test against user-agent *?

Yes, we test against the * group because it’s the default policy for Googlebot and every other standard crawler. For bot-specific testing, use our Robots.txt Validator: there you can test a single URL against a specific user-agent and see the full decision trace line by line.
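
A small illustration of that default behaviour using Python’s urllib.robotparser, which falls back to the * group whenever no more specific user-agent group matches; the robots.txt content and URLs are made up.

```python
import urllib.robotparser

rules = """
User-agent: *
Disallow: /preview/

User-agent: SpecialBot
Disallow: /
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://example.com/preview/draft"))      # False: * group blocks /preview/
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True: falls back to the * group
print(parser.can_fetch("SpecialBot", "https://example.com/blog/post")) # False: specific group wins
```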

Why only a 25-URL sample?

Because it’s enough as an early-warning system. If your sitemap contains even a single typical “staging leftover”, it’s very likely to be in the first 25 entries — sitemap generators mostly sort alphabetically or by lastmod. For a complete indexing audit, Google Search Console remains the more comprehensive tool.

Does the tool check whether the URLs themselves are reachable?

Not in this version. A broken-link check per URL would mean 25 extra HEAD requests per audit and slow the check significantly. For reachability tests of individual sitemap URLs, use our LLMS.txt Validator (even though it checks a different source, it uses the same HEAD-check infrastructure) or our Hreflang Tester (tier 1 liveness check).

Why do I get errors when running many checks in a row?

When you run several cross-checks in quick succession — for example auditing multiple domains back-to-back — an upstream server or a WAF like Cloudflare, Imperva, Akamai, or Sucuri may temporarily flag the requests as bot-like and throttle or block them. You’ll see this either as a 403/503 status with a WAF badge or as an “Unexpected error” response. When that happens, wait 2–5 minutes and try again. Results are cached for 5 minutes, so re-running the same domain costs no extra network requests.
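
Conceptually, that cache can be pictured as a simple TTL map keyed by origin, as in this sketch; only the 5-minute window comes from the description above, while the in-memory store and key shape are illustrative assumptions.

```python
import time

CACHE_TTL_SECONDS = 5 * 60
_cache: dict[str, tuple[float, object]] = {}

def cached_audit(origin: str, run_audit) -> object:
    now = time.monotonic()
    hit = _cache.get(origin)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                 # fresh result: no new network requests
    result = run_audit(origin)        # expensive path: fetch robots.txt + sitemaps
    _cache[origin] = (now, result)
    return result
```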

What does “sitemapindex” vs. “urlset” mean?

Both are valid XML sitemap formats per the sitemaps.org specification. urlset is a flat list of concrete URLs. sitemapindex is a list of other sitemaps — typical for large sites that exceed the 50,000-URL-per-file limit and split their sitemap across multiple files. The cross-check follows an index file one level deep and examines the first three sub-sitemaps.

Changelog

  • Detects URLs listed in the sitemap that are also blocked by robots.txt
  • Flags Sitemap: declarations that return 404 or a WAF block
  • When no declaration exists, tries the conventional paths /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, /wp-sitemap.xml
  • Follows <sitemapindex> files one level deep
  • Pure local rule matching — no 25 extra network calls per audit
  • Three-tier finding severity (high / medium / info)

More tools you should give a try