Viktor, I'm curious whether Common Crawl [0] would be useful to you. It's currently around 100TB and 3.35 billion pages, so it's going to be a long download unless you process it in place on S3. I have no idea what its signal-to-noise ratio is.

[0] https://commoncrawl.org/overview
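
To put "a long download" in rough numbers, here is a back-of-the-envelope sketch. It uses only the two figures above (about 100TB and 3.35 billion pages). The link speeds are my own assumptions, TB is taken as 10^12 bytes, and the function names are made up for illustration. Real throughput would likely be lower than the nominal link rate.

```python
# Rough estimate only: corpus size and page count come from the comment above;
# the link speeds below are assumed values, not measurements.

TOTAL_BYTES = 100e12   # ~100 TB, treating 1 TB as 10**12 bytes
TOTAL_PAGES = 3.35e9   # ~3.35 billion pages


def average_page_size_kb(total_bytes: float, pages: float) -> float:
    """Mean stored size per page in kilobytes (1 kB = 1000 bytes)."""
    return total_bytes / pages / 1e3


def download_days(total_bytes: float, link_mbps: float) -> float:
    """Days needed to transfer total_bytes over a fully saturated link."""
    bits = total_bytes * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 86_400


if __name__ == "__main__":
    print(f"average page size: {average_page_size_kb(TOTAL_BYTES, TOTAL_PAGES):.1f} kB")
    for mbps in (100, 1_000, 10_000):  # assumed link speeds in Mbit/s
        print(f"{mbps:>6} Mbit/s: {download_days(TOTAL_BYTES, mbps):.1f} days")
```

Even at an assumed 1 Gbit/s with no overhead that is more than a week of continuous transfer, which is the argument for processing the data where it already sits on S3.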



