robots.txt is a plain-text file placed at the root of a site, usually at a URL like https://example.com/robots.txt. Its job is to tell compliant crawlers which parts of a site they may or may not crawl.[1][2][3] Google describes it mainly as a crawl-management tool, not as a way to keep pages out of search results entirely. The formal standard, RFC 9309, makes the same point even more directly: robots.txt is not a form of access authorization.[1][2][3]
That distinction matters even more now because AI-related access is no longer one thing. Some bots crawl for search. Some are used for training controls. Some fetch pages in response to user actions. A robots.txt rule can affect one of those functions without affecting the others. That is why robots.txt still matters, but it should be understood as one control layer inside a broader AI visibility strategy rather than the whole strategy itself.[4][5][6][7][8]
What robots.txt Is and What It Is Not
robots.txt controls crawling for bots that choose to honor the Robots Exclusion Protocol (REP). It does not password-protect a page, remove a page from the web, or guarantee that a page will never appear in search interfaces.[1][2][3] Google’s documentation explicitly says robots.txt is mainly used to avoid overloading a site with crawler requests and is not a mechanism for keeping a page out of Google; for that, Google recommends noindex or password protection instead.[2][9]
It is also important not to confuse “disallow crawling” with “disappear everywhere.” Google notes that when a URL is disallowed for crawling, Google may still index the URL itself and show it in results without a snippet if it learns about that URL from elsewhere.[10] OpenAI says something similar for ChatGPT search: if a disallowed page is discovered through a third-party provider or other crawling signals, it may still surface as a link and title even when the page content itself is not available for summaries and snippets.[11]
How robots.txt Works
The file has to live at the root of the site host where it applies, and its rules apply only to that protocol, host, and port. In other words, https://example.com/robots.txt does not automatically control https://blog.example.com/ or http://example.com/. Google’s documentation is explicit on this point, which is why separate subdomains often need their own robots.txt files.[1][12]
At a basic level, robots.txt is made of crawler groups and rules. The core directives most site owners use are:
- User-agent to specify which crawler the rules apply to
- Disallow to block crawling of a path
- Allow to permit a more specific path when broader blocking exists
- Sitemap to point crawlers to XML sitemaps[1][3][12]
Those are the basics that matter most in practice. Google’s documentation and RFC 9309 both support this model, and Google’s documentation also explains how longest-match logic applies when multiple rules could match the same URL.[1][3][10][12]
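Putting those directives together, a minimal robots.txt might look like the following sketch. The host and paths are placeholders, not recommendations:

```
# Applies to every compliant crawler
User-agent: *
Disallow: /admin/
# Longest match wins under RFC 9309, so this subpath stays crawlable
Allow: /admin/public/

Sitemap: https://example.com/sitemap.xml
```

Under longest-match logic, /admin/login would be blocked while /admin/public/docs would remain crawlable, because the Allow rule is the longer match for that URL.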
AI Search Access Is Not One Single Category
One reason robots.txt gets overstated is that people talk about “AI bots” as if they all do the same thing. They do not.
OpenAI separates OAI-SearchBot from GPTBot. OpenAI says OAI-SearchBot is used to surface websites in ChatGPT search features, while GPTBot is used as the training-control bot for generative foundation models. OpenAI also says these settings are independent, so a site owner can allow search visibility while disallowing training use.[4]
Google makes a similar distinction with Googlebot and Google-Extended, but with an important nuance: Google-Extended is not a separate crawler user agent string. It is a standalone product token in robots.txt that publishers can use to manage whether content Google crawls may be used for training future Gemini models and for grounding in Gemini Apps and certain Vertex AI uses. Google also says Google-Extended does not affect inclusion in Google Search and is not a Google Search ranking signal.[5]
Apple draws a similar line. Apple says Applebot-Extended gives publishers additional control over whether website content may be used to train Apple’s foundation models. Apple also states that Applebot-Extended does not crawl webpages and that pages disallowing Applebot-Extended can still be included in search results. That is a strong reminder that not every AI-related robots token is a crawler in the traditional sense.[8]
Perplexity distinguishes between PerplexityBot and Perplexity-User. Perplexity says PerplexityBot is designed to surface and link websites in Perplexity search results, while Perplexity-User supports user actions within Perplexity. Perplexity also states that because Perplexity-User fetches are user-requested, that fetcher generally ignores robots.txt rules.[6]
This is the key strategic point: a robots.txt rule may block training use, search inclusion, or crawler-based retrieval differently depending on the specific user agent involved. That is why “block AI” is not precise enough to guide a good robots.txt policy.[4][5][6][8]
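One way to see this in practice is to evaluate the same file against different user agents. Below is a sketch using Python’s standard-library parser; the agent tokens come from the vendor documentation cited above, and note that urllib.robotparser predates RFC 9309, so its matching may differ from vendor behavior on edge cases:

```python
# Sketch: one robots.txt file, different answers per user agent.
# urllib.robotparser is stdlib; it approximates, not guarantees,
# how any given vendor's crawler will interpret these rules.
from urllib import robotparser

rules = """
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Training-control bot is blocked, search bot is allowed,
# and everyone else is blocked only from /private/.
print(rp.can_fetch("GPTBot", "https://example.com/article"))
print(rp.can_fetch("OAI-SearchBot", "https://example.com/article"))
print(rp.can_fetch("SomeOtherBot", "https://example.com/private/x"))
```

The point of the sketch is the asymmetry: identical URLs, different outcomes, depending entirely on which user agent asks.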
What robots.txt Can Influence for AI Search
robots.txt can influence whether compliant crawlers are allowed to access the pages you want surfaced, cited, indexed, or withheld. For example, OpenAI says publishers who want their content included in ChatGPT search summaries and snippets should make sure they are not blocking OAI-SearchBot.[11] Perplexity says that to ensure a site appears in search results, site owners should allow PerplexityBot in robots.txt and permit requests from Perplexity’s published IP ranges.[6]
For Google, robots.txt influences crawl access for Google’s automated crawlers, but Google also states that the REP is not applicable to Google crawlers that are controlled by users. That nuance matters because it means site owners should not assume that every user-driven fetch or safety-related system works exactly like classical web crawling.[10]
So the clearest way to think about robots.txt in AI search is this:
- it can open or close the door for compliant crawler-based access
- it can help separate search access from training permissions for some vendors
- it does not decide whether your content is high quality, well structured, or likely to be cited once accessed
- it does not function as a security mechanism[2][3][5][8][10][11]
What robots.txt Does Not Do
robots.txt does not secure sensitive data. RFC 9309 says the rules are not access authorization, and Google says robots.txt is not a way to keep a page out of Google.[2][3] If a page truly needs protection, use authentication, paywalls, permissions, or other server-side controls. If the goal is to prevent indexing of a page that may still be crawled, use a noindex meta tag or HTTP header instead.[2][9]
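For reference, the noindex mechanisms Google documents are page-level signals, not robots.txt directives:

```html
<!-- robots meta tag, placed in the page's <head> -->
<meta name="robots" content="noindex">
```

The equivalent HTTP response header form, useful for non-HTML resources such as PDFs, is X-Robots-Tag: noindex.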
Nor is every directive people try to use actually supported. Google has explicitly said unsupported robots.txt rules such as noindex are not recognized in robots.txt, and its current documentation repeats that noindex in robots.txt is not supported.[9][13] Google also does not support crawl-delay, even though Anthropic says its bots do honor the non-standard Crawl-delay extension.[7][9][13] That means crawl-delay is not a reliable universal control across vendors.[7][13]
The Strategic Tradeoff: Allow, Block, or Split the Difference
For many businesses, the real question is not whether robots.txt exists. The question is what the policy should be.
If the goal is broader discoverability in AI-assisted search, blocking search-oriented crawlers such as OAI-SearchBot or PerplexityBot works against that goal because those vendors explicitly say access helps their systems surface and cite content.[4][6][11]
If the goal is to avoid model-training use while still remaining visible in search, some vendors now offer more granular controls. Google says Google-Extended can be disallowed without affecting inclusion in Google Search.[5] Apple says Applebot-Extended can be disallowed without preventing pages from appearing in search results.[8] OpenAI says allowing OAI-SearchBot while disallowing GPTBot is a valid split configuration.[4]
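As a sketch, a split policy built from the vendor tokens described above might look like this. Remember that Google-Extended and Applebot-Extended are product tokens rather than crawlers, so disallowing them gates content use, not fetching:

```
# Keep search surfacing open
User-agent: OAI-SearchBot
Allow: /

# Opt out of training use
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

Whether this is the right policy depends on the business goals discussed below; the sketch only shows that the split is mechanically possible.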
That makes a selective policy more realistic than an all-or-nothing one. Some organizations may choose to allow search access while limiting training access. Others may decide that proprietary or premium content should not be crawlable at all. The right choice depends on business goals, content sensitivity, and how much value the organization expects from AI-driven discovery.
Practical Rules That Usually Matter Most
For most sites, the practical priorities are simpler than the broader discourse suggests.
First, keep the syntax clean and readable. Google recommends using a plain UTF-8 text file at the root of the site and testing the rules you add.[12] A file full of overlapping or contradictory directives is harder to maintain and easier to misread.
Second, focus on rules that genuinely serve a purpose:
- block private admin or staging areas from compliant crawlers
- avoid blocking files Google needs for rendering unless there is a good reason
- allow the search-oriented crawlers you actually want access from
- add explicit AI-related product tokens or user agents only where there is a business reason to do so[1][10][12]
Third, do not rely on robots.txt alone when the real need is stronger control. Google’s documentation points site owners toward noindex, password protection, or other methods when the goal is to prevent appearance in search or restrict access more directly.[2][9]
robots.txt Is One Layer in AI Visibility, Not the Whole Strategy
Even when robots.txt is configured correctly, it only answers one question: can a compliant crawler access the content?
It does not answer the harder questions: Is the page clear? Is the site structured well? Are the headings descriptive? Is the brand credible? Is the content good enough to be surfaced, cited, or trusted once accessed? That is why robots.txt belongs inside a broader AI visibility system rather than replacing one.[1][11]
This is also where the relationship to your other child pages becomes important. What Is an AI-Structured Website? explains how site structure, HTML, schema, accessibility, and clarity support machine readability. This page is narrower. It covers access policy. It explains who can crawl what, and where robots.txt stops being enough.
Practical Priorities Before You Edit the File
If you are reviewing robots.txt with AI search in mind, start with the basics:
- back up the current file
- confirm which host or subdomain the file applies to
- identify which crawlers are relevant to your search and training goals
- decide which of those you want to allow, disallow, or separate
- test the rules before and after deployment
- review the file again whenever new bots or policy needs become relevant[1][4][5][6][8][12]
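The testing step in that checklist can be partly automated. The following is a hedged sketch using Python’s standard-library parser, with example agents, URLs, and expected outcomes; a production check would load your real proposed file and the crawler list that matters to your business, and the stdlib parser may differ from RFC 9309 on edge cases:

```python
# Sketch of a pre-deployment robots.txt check: assert the intended
# outcomes before publishing a new file. Agents, URLs, and paths
# below are illustrative placeholders.
from urllib import robotparser

EXPECTATIONS = [
    # (user agent, URL, should it be crawlable?)
    ("Googlebot",     "https://example.com/blog/post", True),
    ("Googlebot",     "https://example.com/admin/",    False),
    ("PerplexityBot", "https://example.com/blog/post", True),
]

def check(robots_lines):
    """Return the list of expectations the proposed rules violate."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    failures = []
    for agent, url, expected in EXPECTATIONS:
        if rp.can_fetch(agent, url) != expected:
            failures.append((agent, url, expected))
    return failures

proposed = """
User-agent: *
Disallow: /admin/
""".splitlines()

print(check(proposed))  # [] means every expectation holds
```

Running the same check after deployment, against the live file fetched from the site, helps catch cases where a CMS or CDN rewrites or replaces robots.txt without anyone noticing.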
This is not glamorous work, but it is foundational. A weak robots.txt policy can block the very crawlers you want, while an overly open file can expose more than the business intended.
Final Thoughts
robots.txt still matters, but it should be understood clearly.
It is a crawl-control file for compliant bots. It is not a security layer. It is not a complete indexing control system. And it is not a guarantee of AI visibility by itself.[2][3][9][10]
For AI search, its real value is strategic. It lets site owners decide which compliant crawlers can access which parts of the site and, in some ecosystems, whether content may be used for search surfacing, training, or both.[4][5][6][8]
That is useful. But it is only the first layer. Once access is open, the harder work begins: building pages that are clear enough to understand, strong enough to trust, and useful enough to earn visibility.
FAQ
Does robots.txt control AI search?
It can influence AI search when the relevant system depends on compliant crawler access. OpenAI says OAI-SearchBot should be allowed for inclusion in ChatGPT search summaries and snippets, and Perplexity says PerplexityBot should be allowed for appearance in Perplexity search results. But robots.txt does not control every user-triggered fetch, and it does not guarantee that content will be surfaced once accessed.[4][6][10][11]
If I block a bot, will my page disappear everywhere?
Not necessarily. Google says a disallowed URL may still appear in search results without a snippet if Google learns about it from elsewhere. OpenAI says a disallowed page may still surface as a link and title in some cases if relevant signals exist from other sources.[10][11]
Can I block training without blocking search?
Sometimes, yes. Google says Google-Extended can be disallowed without affecting Google Search inclusion. Apple says Applebot-Extended can be disallowed without preventing search result inclusion. OpenAI says OAI-SearchBot and GPTBot can be controlled independently.[4][5][8]
Is robots.txt enough to protect private or sensitive content?
No. RFC 9309 says the protocol is not access authorization, and Google says robots.txt is not a mechanism for keeping a page out of Google. Use authentication, paywalls, permissions, or noindex where appropriate.[2][3][9]
Does noindex belong in robots.txt?
No. Google explicitly says noindex in robots.txt is not supported. Use a robots meta tag or X-Robots-Tag header instead.[9][13]
robots.txt should not be treated like a one-time technical checkbox. It should support a deliberate visibility policy. Get your AI Visibility Assessment to review whether your current robots.txt, crawler access, and site structure support the way your brand should appear in search and AI discovery.
Endnotes
[1] Google Search Central, How to write and submit a robots.txt file. Explains that robots.txt lives at the root of the site, applies only to the host/protocol/port where it is posted, and should be a plain UTF-8 text file.
[2] Google Search Central, Introduction to robots.txt. States that robots.txt tells crawlers which URLs they can access and is mainly used to avoid overloading a site; it is not a mechanism for keeping a page out of Google.
[3] RFC 9309, Robots Exclusion Protocol. Defines robots.txt as a protocol for crawler access requests and states that the rules are not a form of access authorization.
[4] OpenAI, Overview of OpenAI Crawlers. Documents OAI-SearchBot for ChatGPT search features and GPTBot for training control, and notes that these settings are independent.
[5] Google for Developers, Google’s common crawlers (Google-Extended). Explains that Google-Extended is a standalone product token controlling training and grounding uses for Gemini-related products and does not affect Google Search inclusion or ranking.
[6] Perplexity, Perplexity Crawlers. Documents PerplexityBot for search results and Perplexity-User for user actions, and states that Perplexity-User generally ignores robots.txt because the fetch is user-requested.
[7] Anthropic Help Center, Does Anthropic crawl data from the web, and how can site owners block the crawler?. States that Anthropic’s bots honor robots.txt and that Anthropic supports the non-standard Crawl-delay extension.
[8] Apple Support, About Applebot. Explains Applebot-Extended, including that it is used to control how Apple may use website content for foundation-model training and that disallowing it does not prevent pages from being included in search results.
[9] Google Search Central, Block Search indexing with noindex and A note on unsupported rules in robots.txt. States that noindex in robots.txt is not supported by Google and should instead be implemented as a meta tag or HTTP header.
[10] Google for Developers, How Google interprets the robots.txt specification. Explains that Google’s automated crawlers parse robots.txt before crawling, but the REP is not applicable to Google crawlers controlled by users; also notes that disallowed URLs may still be indexed as URLs without content snippets.
[11] OpenAI Help Center, Publishers and Developers FAQ. States that publishers wanting content included in ChatGPT search summaries and snippets should not block OAI-SearchBot, and notes that disallowed pages may still appear as links and titles in some cases.
[13] Google Search Central Blog, A note on unsupported rules in robots.txt. States that unsupported rules such as noindex in robots.txt were never officially supported by Google.