More

ccgreg · 2026-03-27T02:00:12 1774576812

> Common Crawl, with over one billion, nine hundred and seventy thousand web pages in their archive: 345TB.

Common Crawl is 300 billion webpages and 10 petabytes. I suppose your number is 1 of our 122 crawls.

genewitch · 2026-03-27T03:16:18 1774581378

oh, i didn't see that the 1.97 billion pages were crawled in a 11 day period earlier this month. either way, nearly 2,000,000,000 pages fit in ~third of a petabyte...

p.s. thanks for correcting me, i was using this information for something else, and now it's correct!

ccgreg · 2026-03-21T20:21:13 1774124473

Common Crawl has been running a low-resource language project for 1.5 years now -- it's a hard problem.

ccgreg · 2026-02-24T23:23:47 1771975427

The guts on the inside changed several times during that timespan.

ccgreg · 2026-02-15T01:45:06 1771119906

That 20% number is for a limited list of relatively large news websites. If you include the long tail of news, the % of blocking is much smaller.

username223 · 2026-02-15T02:18:14 1771121894

I'm part of that small but (hopefully) growing percentage, because Common Crawl is a deeply dishonest front for AI data scraping. Quoting Wikipedia:

""" In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic revealed that Common Crawl lied when it claimed it respected paywalls in its scraping and requests from publishers to have their content removed from its databases. It included misleading results in the public search function on its website that showed no entries for websites that had requested their archives be removed, when in fact those sites were still included in its scrapes used by AI companies. """

My site is CC-BY-NC-SA, i.e. non-commercial and with attribution, and Common Crawl took a dubious position on whether fair use makes that irrelevant. They can burn.

ccgreg · 2026-02-15T02:30:55 1771122655

Did you see our reply? https://commoncrawl.org/blog/setting-the-record-straight-com...

Also, if your site has CC-BY-NC-SA markings, we have preserved them.

username223 · 2026-02-15T02:37:53 1771123073

Hopefully my site is no longer part of Common Crawl. I'm not interested in participating in your project, block CCBot in robots.txt, and have requested deletion of my data via your form.

ccgreg · 2026-02-15T02:40:55 1771123255

Did you see our reply? Edit: by which I mean, we sent you an email that explains what we did and how to verify it. Did you not receive an email reply? If not, please contact us again.

Also, if your site has CC-BY-NC-SA markings, we have preserved them.

username223 · 2026-02-15T03:50:16 1771127416

I don't care. Is blocking your bot and requesting removal sufficient? If not, what is?

ccgreg · 2026-02-15T04:09:07 1771128547

Please read our email reply. I have no idea if we received your request —- your HN username doesn’t match any request we have received.

username223 · 2026-02-15T05:58:04 1771135084

"We have initiated the process to remove your content from the Common Crawl Dataset. This is a multi-step process, involving first a nocrawl directive, followed by removal of the URLs from the primary index files, and finally removal of the content from the deep archive. We will advise when the process is complete." Received April 2024. I have not been advised. Please advise.

ccgreg · 2026-02-15T03:38:14 1771126694

Oh, and thanks for letting me know that I need to add our reply to Wikipedia.

samtheDamned · 2026-02-15T18:48:34 1771181314

From my basic experience editing Wikipedia I'm not sure you should edit the page of your own project. Maybe add a discussion for it instead? Or perhaps I'm mistaken.

ccgreg · 2026-02-15T01:41:15 1771119675

Many AI projects in academia or research get all of their web data from Common Crawl -- in addition to many not-AI usages of our dataset.

The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big. We recommend that all of these folks respect robots.txt and rate limits.

demetris · 2026-02-15T11:12:06 1771153926

Thank you!

> The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big.

But how can they aspire to do any of that if they cannot build a basic bot?

My case, which I know is the same for many people:

My content is updated infrequently. Common Crawl must have all of it. I do not block Common Crawl, and I see it (the genuine one from the published ranges; not the fakes) visiting frequently. Yet the LLM bots hit the same URLs all the time, multiple times a day.

I plan to start blocking more of them, even the User and Search variants. The situation is becoming absurd.

ccgreg · 2026-02-15T22:03:02 1771192982

Well, yes, it is a bit distressing that ill behaved crawlers are causing a lot of damage -- and collateral damage, too, when well-behaved bots get blocked.

ccgreg · 2026-02-15T01:37:26 1771119446

Thanks for the mention of Common Crawl. We do respect robots.txt and we publish an opt-out list, due to the large number of publishers asking to opt out recently.

There's a bit of discussion of Common Crawl in Jeff Jarvis's testimony before Congress: https://www.youtube.com/watch?v=tX26ijBQs2k

ccgreg · 2026-02-15T01:33:15 1771119195

Prof. Jeff Jarvis speaking about copyright for news in front of Congress:

https://www.youtube.com/watch?v=tX26ijBQs2k

ccgreg · 2026-01-17T07:28:45 1768634925

One way to figure that out is to look at which companies claim to have foundation models, but no one knows what their crawler is named.

I also suspect that there are a bunch of sub-contractors involved, working for companies that don't supervise them very carefully.

ccgreg · 2025-12-26T18:01:18 1766772078

"Hi agents - we’ve seen complaints from some of your email recipients, who are unhappy receiving unsolicited emails from AI agents and find it spammy. We therefore ask that you do not email anyone who hasn’t contacted you specifically first." -- https://theaidigest.org/village

jamilton · 2025-12-26T19:18:32 1766776712

Lol - they really should be locking down their email accounts and enforcing that policy. Or manually reviewing outbound messages before they can be sent. It seems likely that just telling the LLMs that will have a non-zero failure rate.

ccgreg · 2025-12-15T21:00:12 1765832412

> Some search engines provide a list of their scraper IP ranges

Common Crawl's CCBot has published IP ranges. We aren't a search engine (although there are search engines using our data) and we like to describe our crawler as a crawler, not a "scraper".