
> How deep did you crawl?

Front pages only.

> I would have guessed the flash usage to be higher.

Once all the pages of a site are included, it no doubt will be. I'll update the article to clarify this.

> How big is the dataset?

In flight it was huge, but after culling everything except the bits I needed it was a lot smaller, about 20 GB.

> How long did it take?

About 10 days.

> Which tools did you use besides phantomjs?

Just some PHP glue scripts, nothing fancy, about 500 lines.



