robots.txt Explained: How to Control What Google Crawls

The robots.txt file lives at the root of your domain and is the first thing most crawlers check. It tells them which parts of your site they may and may not request. Used well, it protects crawl budget; used carelessly, it can hide pages you want ranked.

What robots.txt does

A robots.txt file grants or restricts crawling per user-agent. It does not control indexing, and that distinction is critical.

Crawling is not indexing

Blocking a page in robots.txt stops compliant crawlers from fetching it, but the URL can still appear in search results if other sites link to it. To keep a page out of results, use a meta robots noindex tag and leave the page crawlable: if robots.txt blocks it, Google never fetches the page and so never sees the noindex.
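
The conflict described above (a noindex tag on a page that robots.txt also blocks) can be checked mechanically. The following is a minimal sketch using only the Python standard library; the function name noindex_is_hidden, the sample rules, and the sample URLs are hypothetical, and the check only looks at the meta robots tag, not at HTTP headers.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsFinder(HTMLParser):
    """Records whether the page carries <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True


def noindex_is_hidden(robots_txt, url, html, agent="Googlebot"):
    """Return True when the page asks for noindex but robots.txt blocks the crawl.

    Hypothetical helper: in that situation the crawler never fetches the page,
    so it never gets to read the noindex tag.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = MetaRobotsFinder()
    finder.feed(html)
    return finder.noindex and not rules.can_fetch(agent, url)


# Assumed sample data for illustration only.
SAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"
SAMPLE_HTML = '<html><head><meta name="robots" content="noindex"></head></html>'

print(noindex_is_hidden(SAMPLE_RULES, "https://yoursite.com/private/page", SAMPLE_HTML))
print(noindex_is_hidden(SAMPLE_RULES, "https://yoursite.com/public/page", SAMPLE_HTML))
```

The first call reports a conflict because /private/ is disallowed; the second does not, because the page is crawlable and the noindex can be read.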

Basic syntax

  • User-agent: — which crawler the rules apply to (* = all).
  • Disallow: — paths not to request.
  • Allow: — exceptions inside a disallowed path.
  • Sitemap: — the absolute URL of your XML sitemap.
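
To see the four directives working together, here is a small sketch that feeds an example file to Python's standard urllib.robotparser. The paths, the sitemap URL, and the crawler name ExampleBot are assumptions made for illustration. Note that Python's parser applies the first matching rule, while Google applies the most specific (longest) match; the Allow line is listed before the Disallow line so that both interpretations give the same answer.

```python
from urllib.robotparser import RobotFileParser

# Assumed example file: block /admin/ for everyone, except the help pages.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/help/
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the (hypothetical) crawler may request this path."""
    return parser.can_fetch(agent, "https://yoursite.com" + path)


print(allowed("/blog/post"))       # no rule matches, so crawling is permitted
print(allowed("/admin/users"))     # falls under Disallow: /admin/
print(allowed("/admin/help/faq"))  # the Allow exception applies
print(parser.site_maps())          # sitemap URLs declared in the file (Python 3.8+)
```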

Common mistakes

Build a correct file

Avoid syntax slips with our free Robots.txt Generator — set rules and copy a valid file. To validate your live robots.txt and sitemap together, run a free atlookup audit.

FAQ

Where does robots.txt go?

At the domain root, reachable at https://yoursite.com/robots.txt.
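
Because the file always sits at the root, its location can be derived from any page URL on the same host. A minimal sketch, assuming a hypothetical helper named robots_url:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yoursite.com/blog/post?page=2"))
```

Each subdomain is a separate host, so a page on a subdomain resolves to that subdomain's own robots.txt.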

Does Google honor crawl-delay?

No — Google ignores it, but Bing and some others respect it.
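
For crawlers that do read the directive, the declared value can be inspected with Python's standard urllib.robotparser. This is a sketch with an assumed example file; the parser only reports what the file declares, and whether a given crawler acts on the value is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Assumed example file: a delay for Bing's crawler only.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 5

User-agent: *
Disallow:
"""

delay_parser = RobotFileParser()
delay_parser.parse(ROBOTS_TXT.splitlines())

# Seconds declared for the bingbot group.
print(delay_parser.crawl_delay("bingbot"))
# Other agents fall back to the * group, which declares no delay.
print(delay_parser.crawl_delay("ExampleBot"))
```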