Crawlability for AI Search: How AI Bots Actually Find Your Website

Crawlability for AI search is the foundation of AI visibility. If AI systems cannot read your website, nothing else matters. Here is how it works and what to fix.

Duncan Hotston

Crawlability for AI search is the degree to which AI systems can access and read your website. It is the first step in AI visibility, and without it, none of the other signals matter. A business can have excellent content, a well-structured site, and strong authority signals. If the door is locked to AI crawlers, none of that counts.

This is Layer 1 of the 5-Layer Framework. Everything else builds on it.

Why Crawlability Comes First

Think of your website as a physical shop. Structured data is the signage. Entity signals are your reputation in the neighbourhood. But crawlability is whether the shop is open and whether customers can get through the door.

AI systems like ChatGPT, Perplexity, and Claude do not have a human researcher sitting behind them reading every webpage on demand. They rely on crawlers: automated programs that visit websites, read the content, and pass it back to be processed and stored. If your site prevents that visit, or makes the content difficult to read once the crawler arrives, the AI system simply does not have your information. It cannot recommend what it does not know.

This is not a new problem. It is a familiar one with a new consequence. Blocking search crawlers always cost you visibility. Blocking AI crawlers costs you recommendations.

How AI Crawlers Work

A crawler is a program that visits a URL, reads the HTML, follows links, and moves on. It identifies itself with a user-agent string, a kind of name badge. GPTBot is OpenAI's crawler. ClaudeBot is Anthropic's. PerplexityBot belongs to Perplexity. Google uses its existing Googlebot infrastructure for its AI features.

Each of these crawlers checks your robots.txt file before doing anything else. Robots.txt is a plain-text file at the root of your domain that tells crawlers what they are and are not allowed to access. It is the list of house rules posted on the front door.

If your robots.txt says "disallow GPTBot", OpenAI's systems will not crawl your site. If it says "disallow all", none of them will. Many sites have rules in their robots.txt that were written years ago for reasons nobody remembers, and those rules are now quietly blocking every AI crawler that comes looking.
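For illustration, here is how those rules look in the file itself. The first block below shuts out every crawler; the second shuts out only OpenAI's:

    # Blocks every crawler, AI or otherwise
    User-agent: *
    Disallow: /

    # Blocks only OpenAI's crawler
    User-agent: GPTBot
    Disallow: /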

The Three Most Common Crawlability Problems

1. Accidental bot blocking

This is the most common issue. A developer added a broad disallow rule, a CDN or security layer added bot protection, or a legacy CMS setting was never reviewed. The site owner has no idea. The AI has no access.

The fix is simple: audit your robots.txt and check the user-agent entries specifically for GPTBot, ClaudeBot, PerplexityBot, and any other AI crawlers relevant to your sector.
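If you would rather script the audit, Python's standard library can run the same check a well-behaved crawler does. A minimal sketch, with example.com standing in for your own domain:

    # Check whether common AI crawlers may fetch the homepage per robots.txt.
    from urllib.robotparser import RobotFileParser

    AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    for agent in AI_USER_AGENTS:
        allowed = rp.can_fetch(agent, "https://example.com/")
        print(f"{agent}: {'allowed' if allowed else 'BLOCKED'}")

Individual crawlers can interpret edge cases slightly differently, so treat the output as a strong signal rather than a guarantee.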

2. JavaScript-dependent content

A crawler reads HTML. If your site loads its main content through JavaScript after the initial page request, the crawler may arrive, read a near-empty page, and leave with almost nothing.

This is an increasingly common problem as websites use more dynamic frameworks. The content exists. It is just delivered in a way the crawler cannot access. Server-side rendering, or at minimum static HTML fallbacks, resolves this.
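One rough way to see what a non-rendering crawler sees is to fetch the raw HTML and strip the markup. A sketch, assuming your site is at example.com; the 500-character threshold and the crawl-check user-agent are arbitrary placeholders:

    import re
    from urllib.request import Request, urlopen

    # Fetch the HTML as served, with no JavaScript executed.
    req = Request("https://example.com/", headers={"User-Agent": "crawl-check/0.1"})
    html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")

    # Drop script/style blocks, then strip the remaining tags.
    stripped = re.sub(r"(?is)<(script|style).*?</\1>", "", html)
    text = " ".join(re.sub(r"(?s)<[^>]+>", " ", stripped).split())

    print(f"HTML: {len(html)} chars, visible text: {len(text)} chars")
    if len(text) < 500:
        print("Very little server-rendered text; crawlers may see a near-empty page.")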

3. No sitemap, or a broken one

A sitemap is a file that lists every important page on your site so crawlers can find them without having to follow every link. Without one, a crawler may miss large sections of your site entirely, especially if your internal linking is poor.

A sitemap should be current, submitted correctly, and free of broken links. It should also be referenced in your robots.txt so crawlers know where to find it.
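A minimal sitemap, and the robots.txt line that points crawlers at it, look like this (URLs and dates are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/</loc>
        <lastmod>2025-01-15</lastmod>
      </url>
      <url>
        <loc>https://example.com/services</loc>
      </url>
    </urlset>

And in robots.txt:

    Sitemap: https://example.com/sitemap.xml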

What AI Crawlability Requires That Traditional SEO Does Not

Traditional SEO crawlability has been a focus for two decades. Most businesses have done some work on it, or their platform has handled it for them. AI crawlability is materially similar but has a few specific additions.

The first is explicit permissions for AI user-agents. Many robots.txt files predate GPTBot and ClaudeBot entirely. They may use rules that inadvertently block everything not explicitly listed as allowed. Checking for AI-specific permissions is a separate task from general SEO crawl health.
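If your file uses that allowlist pattern, the fix is to name the AI user-agents explicitly. An illustrative addition:

    User-agent: GPTBot
    Allow: /

    User-agent: ClaudeBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

The set of relevant user-agents grows as vendors launch new crawlers, so this list is a snapshot, not a standard.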

The second is llms.txt. This is a plain-text file, placed at the root of your domain, that gives AI systems a direct, structured summary of your business and content. It does not replace a full crawl but it makes the crawl more useful and gives AI systems a reliable reference point even when the full site is complex or large. The detail on how it works lives on the llms.txt pillar page.
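As a flavour of what the file contains, here is a minimal sketch following the community llms.txt proposal. The business details and URLs are invented placeholders:

    # Example Ltd

    > Example Ltd is a Manchester-based plumbing firm offering boiler
    > installation, servicing, and emergency repairs.

    ## Key pages

    - [Services](https://example.com/services): What we offer and the areas we cover
    - [Pricing](https://example.com/pricing): Standard rates and callout fees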

The third is content structure. AI systems do not just index your pages; they read them. Headings, paragraph structure, and clear declarative sentences all make it easier for a language model to extract accurate information. A page written for human skimming and a page written to be accurately summarised by an AI are similar but not identical.
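In practice that means markup along these lines, where each section opens with a sentence that can stand alone as a fact. The content here is an invented placeholder:

    <h1>Boiler Installation in Manchester</h1>
    <p>Example Ltd installs gas and electric boilers across Greater Manchester.</p>

    <h2>How long installation takes</h2>
    <p>A standard combi boiler swap is usually completed in one day.</p>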

Crawlability and the Rest of the 5-Layer Framework

Crawlability is Layer 1 of the 5-Layer Framework. The remaining layers are Structured Data, llms.txt, Entity Signals, and WebMCP.

Each layer adds something. Structured data tells AI systems what your content means. llms.txt gives them a direct briefing on your business. Entity signals build trust and authority. WebMCP enables live, real-time interactions with AI agents.

None of those layers function if crawlability is broken. A structured data implementation on a page that is blocked from AI crawlers does nothing. An llms.txt file that contradicts a disallow rule achieves nothing. Crawlability is not the most interesting layer to work on. It is simply the one that has to come first.

What to Check on Your Own Site

These are the five things worth reviewing before anything else. A short script covering the mechanical checks follows the list.

Robots.txt. Open yourwebsite.com/robots.txt. Look for disallow rules. Check whether any entries reference specific AI crawlers or use wildcard rules that would catch them.

Sitemap. Check whether yourwebsite.com/sitemap.xml exists and loads correctly. Confirm it is referenced in your robots.txt and contains your current pages.

JavaScript rendering. View the page source of your homepage and a key service or product page. If the source is mostly empty divs and script tags rather than readable content, you have a rendering problem.

Page speed. A page that takes more than three seconds to respond risks incomplete crawling. Core Web Vitals scores are a reasonable proxy.

llms.txt. Check whether yourwebsite.com/llms.txt exists. If it does not, that is a gap in your AI readability regardless of how well the rest of the site performs.
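For the mechanical parts of that list, a short script can do a first pass. A minimal sketch, assuming your site is at example.com and that the sitemap lives at the conventional /sitemap.xml path:

    import time
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError, URLError

    SITE = "https://example.com"

    def fetch(path):
        # Return (HTTP status or None, seconds elapsed) for SITE + path.
        start = time.time()
        try:
            req = Request(SITE + path, headers={"User-Agent": "crawl-check/0.1"})
            with urlopen(req, timeout=10) as resp:
                resp.read()
                return resp.status, time.time() - start
        except (HTTPError, URLError):
            return None, time.time() - start

    # Existence checks for the three files crawlers look for first.
    for path in ("/robots.txt", "/sitemap.xml", "/llms.txt"):
        status, elapsed = fetch(path)
        print(f"{path}: {'OK' if status == 200 else 'missing or blocked'} ({elapsed:.1f}s)")

    # Rough response-time check against the three-second guideline above.
    status, elapsed = fetch("/")
    if status != 200:
        print("Homepage did not return 200; crawlers may be blocked entirely.")
    elif elapsed > 3:
        print(f"Homepage took {elapsed:.1f}s; slow responses risk incomplete crawls.")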

The 16-Probe Scan checks all of these automatically and returns a scored result. It is the most efficient way to establish where your site stands before doing anything else.

Common Mistakes Worth Avoiding

Blocking AI crawlers deliberately because of content scraping concerns is understandable but counterproductive if you want AI recommendations. There are more targeted ways to protect content without forfeiting AI visibility entirely.

Fixing robots.txt and assuming the job is done is the other common mistake. Crawlability gets you in the door. What the crawler finds when it arrives determines whether the visit is useful. A crawlable site with no structure, no entity signals, and no llms.txt is like an open door into an empty room.

Treating crawlability as a one-time fix is also a mistake. Sites change. Platforms update. CDN configurations shift. A crawlability audit should be a periodic check, not a single event.

The Baseline Every Business Needs

Crawlability is not a competitive advantage. It is a baseline requirement. A site that cannot be read by AI crawlers does not exist in AI search, regardless of what the business does, how good its product is, or how long it has been trading.

Most businesses have not checked their AI crawlability. Some are blocked without knowing it. Some have the door open but nothing readable inside.

The 16-Probe Scan tells you where you stand in minutes. That is the right place to start.


Frequently Asked Questions

What is crawlability for AI search?

Crawlability for AI search is the degree to which AI systems can access, read, and index the content on your website. If your site blocks AI crawlers, loads content only through JavaScript, or lacks a sitemap, AI tools cannot find you regardless of how good your content is.

Do AI search tools use the same crawlers as Google?

Some do, some do not. OpenAI uses GPTBot. Anthropic uses ClaudeBot. Perplexity uses PerplexityBot. Google's AI systems use its existing crawler infrastructure. Each has its own user-agent string and can be blocked independently, which is why checking your robots.txt settings matters.

What does robots.txt have to do with AI visibility?

Robots.txt is the file that tells crawlers which parts of your site they can and cannot access. If your robots.txt blocks AI crawlers, intentionally or through legacy rules, those systems will never read your content. Many sites block AI bots without realising it.

How is crawlability different from SEO?

Traditional SEO crawlability focuses on making content readable for Google's ranking algorithm. AI crawlability focuses on making content accessible and understandable for language models that generate answers. The technical requirements overlap but the goals differ: Google ranks pages, AI systems synthesise answers.

What is the fastest way to find out if my site has crawlability problems?

Run the 16-Probe Scan on beknown.world. It checks your site across 16 specific signals including robots.txt permissions, sitemap presence, JavaScript rendering behaviour, and AI-specific access rules. You get a scored result in minutes.

Does page speed affect AI crawlability?

Yes. AI crawlers have resource budgets. A slow or poorly structured page is more likely to be skipped or incompletely indexed. Clean HTML, fast load times, and server-side rendering all improve the likelihood that a crawler reads your full content.

What is llms.txt and does it affect crawlability?

llms.txt is a plain-text file you place at the root of your domain to give AI systems a direct summary of your business, content, and structure. It does not replace crawlability but it dramatically improves how useful a crawl is once access is granted. Full detail is on the llms.txt pillar page.


Check your AI visibility

Find out how AI search engines see your business. Free check, no commitment.

Get your free check