
He was probably one of the biggest users that day, so that makes sense.

The 2,400 pages, assuming a 50 KB average gzipped size, equate to 120 MB of transfer. I'm assuming CPU usage is negligible due to CDN caching, and so bandwidth is the main cost. 120 MB is orders of magnitude less transfer than the 18.5 GB dump.
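A quick back-of-envelope version of that math (the 50 KB average is the assumption doing the work here):

    # Rough transfer estimate; 50 KB/page gzipped is an assumed average
    pages = 2_400
    avg_gzipped_kb = 50
    crawl_mb = pages * avg_gzipped_kb / 1_000    # ~120 MB
    dump_mb = 18.5 * 1_000                       # the 18.5 GB dump
    print(f"crawl ~{crawl_mb:.0f} MB vs dump {dump_mb:.0f} MB (~{dump_mb / crawl_mb:.0f}x)")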

Instead of the dumps, he could have used the API -- but would that have significantly changed the costs to the Wikimedia Foundation? I think probably not. In my experience, the happy path (serving anonymous HTML) is going to be aggressively optimized for cost: caching, CDNs, negotiated bandwidth discounts.
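For reference, pulling a single page through the MediaWiki Action API looks roughly like this (a sketch only; the title and User-Agent are placeholders, and a real crawler should rate-limit per Wikimedia's API etiquette):

    import requests

    # Fetch rendered HTML for one article via the Action API
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "parse",
            "page": "Python (programming language)",  # example title
            "prop": "text",
            "format": "json",
            "formatversion": 2,
        },
        headers={"User-Agent": "infobox-research-sketch/0.1 (example contact)"},
        timeout=30,
    )
    html = resp.json()["parse"]["text"]
    print(len(html), "characters of rendered HTML")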

If we accept that these kinds of projects are permissible (which no one seems to be debating, just the manner in which he did the project!), I think the way this guy went about doing it was not actually as bad as people are making it out to be.




I don't think I agree. Caching has a cost too.

In theory, you'd want to cache more popular pages and let the rarely visited ones go through the uncached flow.

Crawling isn't user behavior, so the odds are that a large percentage of the crawled pages were not cached.


That's true. On the other hand, pages with infoboxes are likely well-linked and will end up in the cache either due to legitimate popularity or due to crawler visits.

Checking a random sample of 50 pages from this guy's dataset, 70% of them were cached.
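(If anyone wants to reproduce that check: the edge cache reports hit/miss in response headers. A rough sketch, assuming the x-cache-status header is still exposed and that values starting with "hit" mean a CDN hit:)

    import requests

    def looks_cached(title: str) -> bool:
        """Does the edge report a cache hit for this article?"""
        resp = requests.get(
            f"https://en.wikipedia.org/wiki/{title}",
            headers={"User-Agent": "cache-check-sketch/0.1"},
            timeout=30,
        )
        # Wikimedia's CDN typically reports e.g. "hit-front" or "miss" here
        return resp.headers.get("x-cache-status", "").startswith("hit")

    print(looks_cached("Python_(programming_language)"))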


Note: there are several levels of caching at Wikipedia. Even if those pages aren't in the CDN (Varnish) cache, they may be in the parser cache (an application-level cache of most of the page).

This amount of activity really isn't something to worry about, especially when taking the fast path of a logged-out user viewing a likely-to-be-cached page.



