Common Crawl (commoncrawl.org)
397 points by Aissen on March 26, 2021 | 61 comments



I don't think this is the answer to "only Google can crawl the web". This is a huge archive, maybe suitable for building a web search engine.

What if you want to make a simple link previewer? Or an abstract crawler for scientific articles? Most websites are behind Cloudflare, which will block or CAPTCHA you while happily whitelisting only Google and the major social sites. The answer is measures that bring the web back to basics, not this over-SEOed, bot-infested ecosystem. FAANG succeeded in sucking all the information out of the web, but they are terrible at creating interoperable protocols (even Twitter now needs its own tags!).

Incidentally, maybe the next search engine should use a push system, where websites ping it whenever they have updates. If the engine has unique features, it might actually be seen as a measure to reduce the load from bots.
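
To make the idea concrete, here is a minimal sketch of what such a ping receiver could look like (the /ping path, the JSON payload shape, and the Python implementation are all illustrative assumptions, not an existing protocol):

    # Hypothetical "ping me when you update" endpoint for a push-based
    # search engine; the /ping path and payload shape are made up.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class PingHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/ping":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            url = payload.get("url")
            # A real engine would enqueue `url` for re-crawling/re-indexing here.
            print("site reported update:", url)
            self.send_response(202)  # accepted for later processing
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), PingHandler).serve_forever()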


Actually, I was able to search URLs across the entire collection with very little RAM, since they provide a series of indexes you can download.

In theory someone could do something similar with terms, or you could first use the URL index to cut down how much text you download, load that into Elasticsearch or Solr, and do your own custom search that way.

The indexes are really neat though, I highly recommend playing with them.
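
If anyone wants a starting point, here's a minimal sketch of a URL lookup against the public index server (the crawl ID CC-MAIN-2021-10 and the example.com query are assumptions; the same indexes are also downloadable as flat files, which is what the parent describes):

    # Minimal sketch: look up captures of a URL pattern in one Common Crawl
    # index via the public index server. The crawl ID is an assumption; adjust.
    import json
    import requests

    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2021-10-index",
        params={"url": "example.com/*", "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()

    # The response is newline-delimited JSON, one record per capture, with
    # the WARC filename/offset/length needed to fetch the actual content.
    for line in resp.text.splitlines():
        record = json.loads(line)
        print(record["url"], record["filename"], record["offset"], record["length"])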


Can you suggest where I can get them from?



This is mainly used for language modeling research. A filtered CC was used for GPT-3, and I have personally used data from CC for NLP projects.


I recently did a pass through CC; it took me 5 days, and there were 80,000 1 GB WARC files. I had to filter out just the information I wanted and dump the rest, because I can't possibly host that much data on my machines. It's great data but very noisy, like the web itself: lots of repetition and useless junk mixed in with the useful data.
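
For anyone wanting to try a similar filtering pass, here's a rough sketch of streaming a single WARC file with the warcio library (the filename and the keep_record filter are placeholders):

    # Rough sketch: stream one Common Crawl WARC file and keep only the
    # responses you care about (pip install warcio). The filter is a placeholder.
    from warcio.archiveiterator import ArchiveIterator

    def keep_record(url, body):
        return b"example" in body.lower()  # placeholder filter logic

    with open("CC-MAIN-sample.warc.gz", "rb") as stream:  # illustrative filename
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            if keep_record(url, body):
                print(url, len(body))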


>maybe the next search engine should use a push-system

So setting up a new search engine that way would require going to every site and convincing them to notify you of changes. Wouldn't that be even more limited than the current Cloudflare whitelist system? At least there's some chance you can get around the whitelist system.


This... I couldn't get past the irony of the comment. Basically, the problem is that sites only let Google and friends index them. The solution? Sites should only send index data to Google and friends.

I mean, I get it, then sites can send to any number of indexers, but let's be honest, like you say, any new search engine has to get sites to push data out to them. That's just not. going. to. happen.


I don't think we can bring the web back to basics in the sense you're envisioning without kicking most users off of it.

Cloudflare's protection is there to guard against traffic spikes and automated malicious attacks; Google and social sites are allow-listed because they're trusted entities with well-defined online presences and a vested interest (generally) in not breaking sites. "How do we do away with the need to put up an anti-automated-traffic screen plus whitelist?" is in the same category of problem as "How do we change the modern web to address automated malicious attacks?"


Another way to look at this is that if CDNs don't allow Google, nobody is going to want to use them. Their content doesn't get indexed, and anybody doing a search is going to get directed to someone else who doesn't put their content behind a CDN with that level of protection. That, or someone like Google will just solve the problem themselves and be both the CDN and the indexer, bringing them one step closer to complete ownership of finding anything on the web.


> Cloudflare's protection is to guard against traffic spikes and automated malicious attacks

This simply isn't true. By my reading, Cloudflare implies any automated traffic that doesn't openly advertise itself as such is bad. They certainly seem to be happy to assist you in blocking it regardless of whether it was actually malicious or actually causing traffic problems. (https://blog.cloudflare.com/super-bot-fight-mode/)


I'm not sure I understand the distinction you're making. Automated traffic that isn't flagging itself as automated is deceptive, and that's a negative signal for maliciousness right out of the starting gate.


Shouldn't automated traffic be allowed privacy? Denying it means we have to take away privacy from humans in order to verify they are not automated. What do we gain from this? Why should some programmable bot not be allowed to see my content when any human, no matter their intent, is allowed by default?


Bots aren't human. They are identifiable by their behavior. If they aren't, there's no problem.


> What do we gain for this?

More reliable services for those willing to de-pseudonymize and declare themselves in an auditable fashion.


Good resource, admirable intention, great that it simply exists. Good sized index.

I see a lot of people subscribe to the idea of this being the feeder to alternative search engines.

I'd guess part of the problem with doing things this way is 'crawl priority': what the search engine thinks are the next best pages to crawl is totally out of their hands, or at least they'd still need to crawl on top of the Common Crawl data.

The recent UK CMA report into monopolies in online advertising estimated Google's index to be around 500-600 billion pages in size and Bing's to be 100-200 billion pages in size [0]. Of course, what you define as a 'page' is subjective given URL canonicals and page similarity.

At the very least, Common Crawl gets around crawl-rate-limiting problems by being one massive download.

It would be interesting to know whether an appreciable percentage of site owners block it, though going on past data (the UK CMA report has some data on this too), it's not a huge issue.

[0] https://assets.publishing.service.gov.uk/media/5efc57ed3a6f4... (page 89)


The interesting past threads seem to be the following. Others?

Ask HN: What would be the fastest way to grep Common Crawl? - https://news.ycombinator.com/item?id=22214474 - Feb 2020 (7 comments)

Using Common Crawl to play Family Feud - https://news.ycombinator.com/item?id=16543851 - March 2018 (4 comments)

Web image size prediction for efficient focused image crawling - https://news.ycombinator.com/item?id=10107819 - Aug 2015 (5 comments)

102TB of New Crawl Data Available - https://news.ycombinator.com/item?id=6811754 - Nov 2013 (37 comments)

SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data [video] - https://news.ycombinator.com/item?id=6214874 - Aug 2013 (2 comments)

A Look Inside Our 210TB 2012 Web Corpus - https://news.ycombinator.com/item?id=6208603 - Aug 2013 (36 comments)

Blekko donates search data to Common Crawl - https://news.ycombinator.com/item?id=4933149 - Dec 2012 (36 comments)

Common Crawl - https://news.ycombinator.com/item?id=3690974 - March 2012 (5 comments)

CommonCrawl: an open repository of web crawl data that is universally accessible - https://news.ycombinator.com/item?id=3346125 - Dec 2011 (8 comments)

Tokenising the english text of 30TB common crawl - https://news.ycombinator.com/item?id=3342543 - Dec 2011 (7 comments)

Free 5 Billion Page Web Index Now Available from Common Crawl Foundation - https://news.ycombinator.com/item?id=3209690 - Nov 2011 (39 comments)


Love to see this.

As an aside, it always jars me when a site hijacks default browser scrolling functionality. In my experience, making it as fast as possible is a _far_ better use of dev resources than figuring out how to make scrolling unique (no matter what the marketing department says).


> it always jars me when a site hijacks default browser scrolling functionality

I assume you are saying this because this site does it. What do you mean? I can't see any difference from normal scrolling functionality on there.


They've drastically increased the momentum of scrolling. Scroll a tiny bit and it will keep moving the entire length of the page.


In what browser?


I've used this and it's invaluable for all types of things but a feeder for Google killers it is not.

They don't approach the scale of what Google crawls; they state as much. Nor do they do it on the same timeline as Google. This is really nice for research or kick-starting a project, but it isn't a long-term viable solution for alternative search engines. Between breadth, depth, timeline/speed, priority, and information captured, it falls well short.


Well that's to be expected, but if a lot of search engines start using it, it's likely that websites will start allowing it to crawl and index their pages. So there might be potential there.


It's not an issue of being allowed to crawl.


Oh I guess I misunderstood then. Why don't they crawl at Google's scale?


Mostly it comes down to their crawler itself and the cost of storing all of that data; it's sufficiently prohibitive.


How feasible would it be to store all that data on a decentralized system like IPFS or Sia-Skynet etc, instead of Amazon, to add further meaning to the cause?


You may not grasp just how large the Common Crawl dataset is. It's been growing steadily at 200-300 TB per month for the last few years. I'm not certain how large the entire corpus is at this point, but it's almost certainly in the tens to low hundreds of petabytes. (This is significantly larger than the capacity of the entire Sia network, for example.)

Storing a dataset of this size and making it available online is not inexpensive. Amazon has generously donated their services to handle both of these tasks; it would be foolish to turn them down.


(Update: the complete Common Crawl dataset is actually a little smaller than I thought, at 6.4 PB. That's still pretty big, though.)


> Amazon has generously donated their services to handle both of these tasks; it would be foolish to turn them down.

Amazon makes plenty of money from people using AWS to process the data.


That's a win-win.


It's not clear, but it looks like the last crawl was 280 TiB (100 TiB compressed) and contains a snapshot of the web at that point; i.e. you don't need prior snapshots unless you're interested in historical content.

EDIT: the state of the crawls is summarized at https://commoncrawl.github.io/cc-crawl-statistics/.


As best I can gather, the crawl is an ongoing process, not a series of independent "snapshots". There's almost no overlap in URLs between each crawl archive, although it looks as though there's some repetition on a larger scale (roughly every 2 months):

https://commoncrawl.github.io/cc-crawl-statistics/plots/craw...


Blockchain storage is going to cost you a pretty penny if you were to store all of Common Crawl's petabytes, so it's not very feasible.


IPFS doesn't require any blockchain.

In fact, someone could easily set up some IPFS nodes that fetch the data from the current host when it's requested over IPFS.[1] This way people could access it via IPFS and provide an alternate mirror of the data.

[1] https://github.com/ipfs/go-ipfs/blob/master/docs/experimenta...

The main benefits here would be

- Even if the source is unavailable there may be other copies on IPFS which would be transparently used.

- There may be some performance benefits in rare cases.

- If you are accessing this on a bunch of machines your IPFS gateways would handle downloading the source once, then automatically using the local copy from inside your network.

The main downside is that if Amazon is donating their resources, why bother with IPFS?


That's not correct. When it comes to storage and transfer, blockchain alternatives are a fraction of the cost of Amazon. For example, Sia's Skynet is offering storage at $5/month/TB.[1] If you skip Skynet and run your own Sia node, the price can go even lower, to around $2/month/TB depending on market conditions.

[1] https://blog.sia.tech/announcing-skynet-premium-plans-faster...


Amazon is hosting the Common Crawl on S3 for free, so... yes, $2/month/TB is a lot more expensive.
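
For a rough sense of scale (assuming the ~6.4 PB corpus figure mentioned upthread and the quoted $2/month/TB):

    # Back-of-envelope only; both numbers are taken from earlier comments.
    corpus_tb = 6400                      # ~6.4 PB expressed in TB
    usd_per_tb_month = 2                  # quoted Sia price
    print(corpus_tb * usd_per_tb_month)   # ~$12,800/month, versus $0 on donated S3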


It seems that, at least on Sia's current plans, you can host at most 20 TB for $80/month, not even a tenth of a monthly Common Crawl.

Of course, Sia's Skynet plans are package deals right now, and I guess they're currently bootstrapping the network with users. Filecoin has no operational storage yet. Storj quotes $10/TB/month [1], so that would come out expensive.

1. https://www.storj.io/blog/2019/11/announcing-pioneer-2-and-t...


More so than Amazon? From my (limited) experience, blockchain storage solutions are often less expensive, although I've never worked with petabytes of data, so maybe it's different at that scale.


As a European (German), I always wonder about the legal basis for a) making copies of copyrighted material and databases available, and b) processing the personal data they contain.

The Internet Archive seems more like a library, with the corresponding exceptions applying, but Common Crawl also seems to advertise many other purposes that go beyond archiving publicly relevant content.

Would this be possible in Europe, too? My feeling is that US legislation is different here. Do you have to actively claim copyright in the US, or enforce it technically, e.g. via DRM? Can anyone use anything without a license as long as nobody finds out?


Well, they've got a bunch of different data that would fall under different precedents. Metadata would likely fall under the phone book ruling. The raw WARC data could be argued to be transformative, and would certainly fall under fair use, especially because it isn't even close to threatening market replacement. Property abandonment might fit in somewhere as well :)

https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R....

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....


I am curious about this too, from a copyright perspective:

1) Whether it is legal for Common Crawl to collect this data in the first place? Presumably they don't have explicit permission from every site they have crawled?

and

2) Whether it is legal to use this data to train ML algorithms for commercial use (e.g. GPT-3)? They provide a bunch of examples of this on their site but the terms of use are kind of vague about this.


It has to be legal to use legally acquired published information for a commercial purpose. That is not nearly the same thing as selling someone else's work. Training an ML algorithm on published work is no different from me reading a tutorial on the web and using the new skills I learned to do something that makes me money. That is very different from reselling the tutorial as if I had authored it.


The data is open; they don't crawl inside password-protected areas of the internet. Why shouldn't they be able to read data put into the open? If the authors didn't want their data openly accessible, they should have put it in a private place.


Except just because someone published something on the internet doesn't mean you have the legal right to reproduce and redistribute it yourself.


That has nothing to do with reading data that the publisher does distribute.


They're not just reading data, they are redistributing it.


It seems like Common Crawl is doing a lot of awesome stuff, but they're not attacking Google's stranglehold head on.

Presumably this is because they lack the money to do so. Have they attempted to estimate how much it would cost per year to crawl the web as aggressively and comprehensively as Google does? I've checked their site and didn't find anything like that.

If they came up with a number, say $2 or $10 billion per year, it might actually be possible to gather enough donations to dethrone Google.

A lot of Google competitors would love to see them dethroned. And it would be a huge win for virtually everyone else too. There's no one in the world that wants Google to maintain their web search monopoly indefinitely.


This relates closely to the recent "Only Google is allowed to crawl the web"[1][2] post.

[1] https://news.ycombinator.com/item?id=26592635

[2] https://knuckleheads.club/


hope these guys team up w/ archive.org


I was hoping it was going to be a massively multiplayer dungeon crawling game...


whiskey checker pickle box

https://www.google.com/search?q=%22whiskey+checker+pickle+bo...

That doesn’t work with common crawl.


Great resource. Does anyone have a good free source for popular keywords/topics on google/the internet?


https://trends.google.com/trends/ is probably the best resource for the "top" queries, though it doesn't dive too far down the list.

Particularly their "Year in Search" entries, like: https://trends.google.com/trends/yis/2020/US/


I'm kinda surprised this is new to people. It's 10 years old. Is this really the first time it's been talked about here?


It's the 39th time, apparently, as you can see by clicking the domain next to the submission.


Oh I always forget about that feature. Excuse my stupidity.


It'll be new to some people every time it's posted because new people are entering the industry or signing up with HN or developing new interests every day. That's one reason I personally don't mind reposts or old content.


https://xkcd.com/1053/ - Ten Thousand


Love it! Just donated.


Soon, you and your friends can host your own private search engine at modest cost and enjoy total privacy.



