dreamcompiler 11 months ago | on: Marginalia: 3 Years

Viktor, I'm curious whether Common Crawl [0] would be useful to you. It's currently around 100TB and 3.35 billion pages, so it's going to be a long download unless you process it in place on S3. I have no idea what its signal-to-noise ratio is.

[0] https://commoncrawl.org/overview
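
To put a rough number on "a long download", here is a back-of-the-envelope sketch. It assumes decimal terabytes, a link that stays saturated the whole time, and no protocol overhead. The function name `download_days` and the three link speeds are illustrative assumptions of mine, not figures from Common Crawl.

```python
def download_days(size_tb: float, link_gbps: float) -> float:
    """Days to transfer size_tb terabytes over a link sustained at link_gbps."""
    size_bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = size_bits / (link_gbps * 1e9)   # bits / (bits per second)
    return seconds / 86_400                   # seconds -> days


# Illustrative link speeds only (assumed, not measured).
for gbps in (0.1, 1.0, 10.0):
    print(f"{gbps:>5} Gbit/s: {download_days(100, gbps):.1f} days")
```

Under those assumptions, 100TB takes a bit over nine days at a sustained 1 Gbit/s, and real transfers will probably be slower, which is the argument for processing the data where it already sits.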