
CCBot: IP ranges, verification & how to handle it

Last verified: 2026-07-28 · maintained by Unsourced

CCBot is the crawler run by Common Crawl, a non-profit whose open web archive is a major training-data source for many AI models. Blocking CCBot keeps your pages out of future snapshots of a dataset many labs train on.

What CCBot does, and how it differs

CCBot is run by Common Crawl, a non-profit that publishes a free, openly downloadable snapshot of the web, and that archive is one of the most widely reused training corpora in the industry. Blocking CCBot therefore has outsized reach: you are opting out not of one company's model but of a dataset that many labs ingest. Because the archive is public, anything it has already captured stays in circulating copies even if you disallow the crawler later.

How to verify CCBot

Common Crawl publishes neither an IP-range feed nor a reverse-DNS footprint for CCBot, so there is no way to confirm it by network identity. Treat every request carrying the CCBot user-agent as an unverified claim, and judge it on what it does rather than the name it gives.
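One way to judge a claimed CCBot by behaviour is to summarise what each client sending that user-agent actually requested. The sketch below is a minimal illustration, not a detection method: the record shape `(client_ip, user_agent, path)`, the function name `claimed_ccbot_activity`, and the `/private/` prefix are all hypothetical, and real access logs need their own parsing first.

```python
def claimed_ccbot_activity(records, disallowed_prefixes=()):
    """Summarise requests whose user-agent claims to be CCBot.

    records: iterable of (client_ip, user_agent, path) tuples (assumed shape).
    disallowed_prefixes: path prefixes your robots.txt disallows for CCBot.
    Returns {client_ip: {"requests": n, "disallowed_hits": m}}.
    """
    summary = {}
    for client_ip, user_agent, path in records:
        # The user-agent is only a claim, so match on it but trust nothing else.
        if "ccbot" not in user_agent.lower():
            continue
        stats = summary.setdefault(
            client_ip, {"requests": 0, "disallowed_hits": 0}
        )
        stats["requests"] += 1
        # Fetching a disallowed path is a sign the client ignores robots.txt.
        if any(path.startswith(prefix) for prefix in disallowed_prefixes):
            stats["disallowed_hits"] += 1
    return summary


if __name__ == "__main__":
    # Illustrative records using documentation-reserved IP addresses.
    sample = [
        ("203.0.113.5", "CCBot/2.0", "/"),
        ("203.0.113.5", "CCBot/2.0", "/private/report"),
        ("198.51.100.7", "Mozilla/5.0", "/private/report"),
    ]
    print(claimed_ccbot_activity(sample, disallowed_prefixes=("/private/",)))
```

A client that claims to be CCBot yet keeps requesting paths you have disallowed is behaving unlike a compliant crawler, whatever name it gives. Request volume and timing could be added to the same summary.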

Should you allow or block CCBot?

Recommended: allow. CCBot feeds Common Crawl's open archive, which many labs reuse, so blocking it has an outsized effect: you remove yourself from future snapshots of one of the most widely ingested datasets on the web, while whatever it has already captured persists in circulating copies.

If you do choose to act in robots.txt (which well-behaved crawlers honour but which nothing enforces):

# CCBot: recommended to ALLOW — blocking can cost you AI visibility
User-agent: CCBot
Disallow:
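
To sanity-check a rule before deploying it, a minimal sketch using Python's standard-library `urllib.robotparser` shows how one compliant parser reads the block above. The `example.com` URL, the helper name `ccbot_may_fetch`, and the `BLOCK_RULES` variant are illustrative assumptions, not part of the recommendation, and other crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rule from this page: an empty Disallow permits everything for CCBot.
ALLOW_RULES = """\
User-agent: CCBot
Disallow:
"""

# Hypothetical opposite rule, shown only for comparison.
BLOCK_RULES = """\
User-agent: CCBot
Disallow: /
"""


def ccbot_may_fetch(robots_txt: str, url: str = "https://example.com/page") -> bool:
    """Return True if these robots.txt rules let a CCBot user-agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    print("allow rules permit CCBot:", ccbot_may_fetch(ALLOW_RULES))
    print("block rules permit CCBot:", ccbot_may_fetch(BLOCK_RULES))
```

This only tells you what the file asks for. Whether a given client obeys it is a separate question.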

Official sources

Common Crawl crawler info: https://commoncrawl.org/ccbot

Common questions about CCBot

If I block CCBot, is my old content already gone?

No. Blocking only stops future captures. CCBot feeds Common Crawl's public archive, and anything it captured before you blocked it stays in the freely downloadable copies that many labs have already ingested.

Why does blocking CCBot matter more than blocking one model?

Because Common Crawl's archive is reused as training data across many AI labs. Disallowing CCBot opts you out of one of the most widely ingested datasets on the web, not just a single company's crawler.

Can I verify CCBot by IP?

No — Common Crawl publishes neither an IP-range feed nor a reverse-DNS host for CCBot, so a request carrying its name can't be confirmed at the network level. Judge it by behaviour.
