How to handle AI crawlers: bait, keep, or block

Maintained by Unsourced · the reasoning behind every allow/block call in our crawler directory

Not every AI crawler deserves the same treatment — and the User-Agent string won't tell you which is which, because it can be spoofed. So the call is always two steps. First confirm the crawler is who it claims (forward-confirmed reverse DNS, or a match against the operator's published IP ranges); then decide by what the bot actually does with your content. Unsourced sorts every verified crawler into one of three buckets.
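
The first of those two steps, forward-confirmed reverse DNS, can be sketched in a few lines of Python. This is a minimal illustration rather than a production check: the function name, the injectable lookup parameters, and the `crawler.example` suffix are all assumptions made for the example, and the real hostname suffixes have to come from each operator's own documentation.

```python
import ipaddress
import socket


def _reverse(ip):
    # PTR lookup: IP address -> primary hostname
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    # A/AAAA lookup: hostname -> set of IP address strings
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse_lookup=_reverse, forward_lookup=_forward):
    """Return True only if `ip` reverse-resolves to a hostname under one of
    `allowed_suffixes` AND that hostname resolves forward to the same `ip`."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record, or the lookup failed

    # Match on a label boundary so "evilcrawler.example" cannot pass
    # for "crawler.example".
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False

    try:
        target = ipaddress.ip_address(ip)
        return any(ipaddress.ip_address(a) == target
                   for a in forward_lookup(host))
    except (OSError, ValueError):
        return False  # forward lookup failed or returned an unparsable address


# Usage with stubbed lookups (hypothetical hostname, documentation IP range):
fake_reverse = lambda ip: "bot-1.crawler.example"
fake_forward = lambda host: {"192.0.2.10"}
print(forward_confirmed_rdns("192.0.2.10", ["crawler.example"],
                             fake_reverse, fake_forward))
```

The lookups are passed in as parameters so the check can be exercised without touching the network. Left at their defaults, they call `socket.gethostbyaddr` and `socket.getaddrinfo`.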

Bait: court these crawlers (AI Search)

These power the AI answer engines that cite and link their sources: ChatGPT Search, Perplexity, Google's AI Overviews and the like. A page they can read is a page they can quote, credit, and send real referral traffic to. Blocking them protects nothing; it simply drops you out of the answers to questions your audience is already asking. The move here is to be easy to find and easy to cite: bait, not block. Five of the crawlers in our directory sit in this bucket, for example PerplexityBot and OAI-SearchBot.

Keep: don't block these lightly (AI Trainer)

Training crawlers (GPTBot, ClaudeBot, Google-Extended and others) gather text that teaches future models rather than answering a question today. They send no referral traffic, so the trade-off is quieter: opt out and you remove yourself from what the next generation of models learns, which over time can reduce how often they mention you at all. There are honest reasons to opt out (licensing, principle, control), but it's a deliberate trade, not a free win. Note that Google-Extended is a robots.txt control token honoured by Google's existing crawlers rather than a separate User-Agent, so it will not appear under that name in your logs. Nine crawlers fall here.

Block: reasonable to refuse (Scraper)

Bulk extractors and resold-data crawlers take your content with no citation, link, or referral in return, and some honour robots.txt only inconsistently. There's little upside to feeding them, so blocking is a defensible default; just confirm in your server logs that the block actually holds, because the worst offenders ignore the rules. The same bucket catches impostors: anything wearing a trusted bot's name from an IP outside that operator's verified ranges belongs here, whatever its User-Agent claims. Four crawlers land in this bucket, for example Bytespider.
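
The impostor rule can be made concrete with a small Python sketch. Everything in it is illustrative: the crawler names come from this page, but the `BUCKETS` sample, the `PUBLISHED_RANGES` table, and the `decide` function are hypothetical. The CIDR blocks are reserved documentation addresses, not any operator's real ranges. Treating a trusted name with no listed ranges as unverifiable, and therefore blocked, is a design choice made for this example.

```python
import ipaddress

# Small illustrative sample of the directory; bucket names follow this page.
BUCKETS = {
    "PerplexityBot": "bait",
    "OAI-SearchBot": "bait",
    "GPTBot": "keep",
    "ClaudeBot": "keep",
    "Bytespider": "block",
}

# HYPOTHETICAL ranges (RFC 5737 documentation blocks), not real operator data.
PUBLISHED_RANGES = {
    "PerplexityBot": ["192.0.2.0/24"],
    "GPTBot": ["198.51.100.0/24"],
}


def in_published_ranges(name, ip, ranges=PUBLISHED_RANGES):
    """True if `ip` falls inside any CIDR block listed for `name`."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr)
               for cidr in ranges.get(name, []))


def decide(claimed_name, ip):
    """Return 'bait', 'keep', 'block', or None for a name not in the sample."""
    bucket = BUCKETS.get(claimed_name)
    if bucket is None:
        return None      # not a crawler this sample knows about
    if bucket == "block":
        return "block"   # refused regardless of source IP
    if not in_published_ranges(claimed_name, ip):
        return "block"   # trusted name, unverified IP: treat as an impostor
    return bucket


print(decide("PerplexityBot", "192.0.2.44"))   # inside the sample range
print(decide("PerplexityBot", "203.0.113.9"))  # same name, outside the range
```

The point of the design is the order of the checks: the User-Agent only selects which ranges to test against, and the IP decides whether the claimed bucket is granted.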

One rule cuts across all three

robots.txt is honoured, not enforced: well-behaved crawlers obey it, but it can't physically stop anything. A rule only works on the bots that choose to follow it, so real enforcement has to happen at your server or firewall, and that depends on knowing who is actually making the request. That's why verifying identity matters more than maintaining a block list. For the deeper argument and worked examples, read our bait, keep, or block deep dive; to see exactly where a given crawler lands, browse the full AI crawler directory.
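
To see why robots.txt is advisory, it helps to look at it from the crawler's side. The sketch below uses Python's standard `urllib.robotparser` to evaluate a sample policy the way a compliant crawler would. The robots.txt content and the `would_comply` helper are assumptions made for illustration, and `example.com` is a placeholder site.

```python
from urllib.robotparser import RobotFileParser

# Sample policy: refuse one scraper, allow everyone else.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def would_comply(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return what a rule-following crawler would conclude about `url`.

    This is the crawler's own decision: nothing here intercepts or
    rejects a request from a bot that skips the check.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("PerplexityBot", "GPTBot", "Bytespider"):
    print(bot, would_comply(bot, "https://example.com/pricing"))
```

The check runs entirely inside the crawler. A bot that never calls it, or ignores the result, fetches the page anyway, which is why a block on an uncooperative crawler has to be applied server-side against a verified identity.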
