Downloading a file regularly - how hard can it be? (adblockplus.org)
121 points by joeyespo on April 9, 2012 | 42 comments



A common solution to this problem is a two-stage process, where step 1 is a request asking "should I download?", with two possible replies: "no, check again in N time" and "yes, here is a token". Step 2 is presenting the token to the API endpoint and downloading the file.

On the server side, you don't even need per-instance tracking, just a simple decision based on current resource usage and a list of valid tokens (optionally, tokens can expire after a short time to avoid other thundering-herd type issues). Say you set a maximum number of file transfers, or a bandwidth cap, or whatever metric makes sense to you, and you simply reply based on that metric. Further, you can smooth out your load with a bit of intelligence in setting N.
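
For illustration, a minimal sketch of that server-side decision (the limits, TTLs and helper names here are all made up, and the actual download handler is omitted):

    import random
    import time
    import uuid

    # Made-up limits -- tune to whatever resource actually matters to you.
    MAX_ACTIVE_TRANSFERS = 200
    TOKEN_TTL = 60          # seconds before an issued token expires
    RETRY_AFTER = 15 * 60   # base "come back in N seconds" interval

    valid_tokens = {}       # token -> expiry timestamp
    active_transfers = 0    # maintained by the download handler (not shown)

    def handle_check():
        """Step 1: client asks 'should I download?'"""
        now = time.time()
        # Drop expired tokens so they don't count against the limit.
        for tok, expiry in list(valid_tokens.items()):
            if expiry < now:
                del valid_tokens[tok]

        if active_transfers + len(valid_tokens) < MAX_ACTIVE_TRANSFERS:
            token = uuid.uuid4().hex
            valid_tokens[token] = now + TOKEN_TTL
            return {"download": True, "token": token}

        # Busy: tell the client when to ask again; the jitter smooths the next wave.
        return {"download": False, "retry_after": RETRY_AFTER + random.randint(0, 300)}

    def handle_download(token):
        """Step 2: client presents the token and gets the file (serving omitted)."""
        if valid_tokens.pop(token, 0) >= time.time():
            return "200, stream the file"
        return "403, go back to step 1"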

Even better, you get a cool side effect: since the check isn't resource-intensive, you can set the time between checks lower and make the actual updates less regular.

Now that I think of it: this seems like it would be a nice nginx plugin, with a simple client-side library as a reference implementation. Anyone want to collaborate on this over the weekend? It should be relatively straightforward.


> A common solution to this problem is a two-stage process, where step 1 is a request asking "should I download?", with two possible replies: "no, check again in N time" and "yes, here is a token". Step 2 is presenting the token to the API endpoint and downloading the file.

You don't even need two steps, just one step with previously known data. That's how HTTP conditional requests (Last-Modified/If-Modified-Since and ETag/If-None-Match) work: the client states "I want this file, I already have one from such a moment with such metadata", and the server replies either "you're good" (304) or "here's your file" (200).
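
For example, a minimal client-side sketch with Python's requests library, assuming the server honors the usual validators:

    import requests

    def fetch_if_changed(url, etag=None, last_modified=None):
        """Conditional GET: send back the validators from the previous download."""
        headers = {}
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified

        resp = requests.get(url, headers=headers)
        if resp.status_code == 304:
            return None  # "you're good": only headers crossed the wire
        resp.raise_for_status()
        # "here's your file": remember the new validators for next time
        return resp.content, resp.headers.get("ETag"), resp.headers.get("Last-Modified")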

The issue is that this only works when the file changes rarely enough; otherwise you need additional server logic to reply that the file is still good when it isn't.

> Now that I think of it: this seems like it would be a nice nginx plugin, with a simple client-side library as a reference implementation. Anyone want to collaborate on this over the weekend?

I'd be very surprised if nginx didn't support conditional requests already.

edit: according to [0] and [1] (which may be outdated), nginx provides built-in support for Last-Modified on static files. It does not provide ETag support (the developer believes this is not useful for static files, which is usually correct[2]), but the author of [1] has apparently written a module to do so [3]. The module being 4 years old, it might be way out of date.

[0] http://serverfault.com/questions/211637/what-headers-to-add-...

[1] https://mikewest.org/2008/11/generating-etags-for-static-con...

[2] There are two situations in which it is not (keep in mind that this is for static content; dynamic is very different): if somebody willfully touches a file, its Last-Modified changes but its checksum doesn't, triggering a full re-send with Last-Modified alone but not with an ETag; and ETags can be kept coherent across servers (even in CDNs), whereas the chances of Last-Modified being exactly the same on all your servers are far smaller.

On the other hand, no ETag is better than a shitty ETag, and both Apache and IIS generate dreadful ETags by default, which may hinder more than help.

[3] https://github.com/mikewest/nginx-static-etags/


Yes, this works for cache updating, and it is fantastic for that purpose. It does not solve the actual stated problem, which is that periodic checks, even when intended to smooth server load away from peaks, usually drift towards extremely bursty behavior. When the file does change, you still get a large number of clients trying to download the new content all at once. The solution I was suggesting is similar to what you are talking about, but also has the feature of smoothing the load curves.

> The issue is that this only works when the file changes rarely enough; otherwise you need additional server logic to reply that the file is still good when it isn't.

My algorithm is that logic -- albeit implemented with client-side collusion rather than pure server-side trickery (this allows better control should the client ignore the etags).


> The solution I was suggesting is similar to what you are talking about, but also has the feature of smoothing the load curves.

It smooths the load curve no better than using Cache-Control with the right max-age.

> My algorithm is that logic

It is no more "that logic" than doing what I outlined with some proprietary behaviors added.

> this allows better control should the client ignore the etags

by making the whole client use a custom communication channel? I'd expect ensuring the client correctly speaks HTTP would be easier than implementing a custom client from scratch.


You still seem to be missing the point. Cache-Control as commonly implemented, and as you describe it, will serve every request the new file as soon as one is available. It takes into account exactly one variable: file age.

The algorithm I describe takes into account variables which affect current system load, and returns a "no, try again later" even when the file actually is different, because the server is trying to conserve some resource (usually bandwidth in such cases). Like I said, this can be done with etags, but a more explicit form of control is nicer. Which brings us to this:

> this allows better control should the client ignore the etags

by making the whole client use a custom communication channel? I'd expect ensuring the client correctly speaks HTTP would be easier than implementing a custom client from scratch.

A client speaking proper HTTP would be perfect for this. So point your HTTP client at:

domain.com/getlatest

if there is a token available, respond with a:

307 domain.com/reallatest?token=foo

If no token is available and no If-Modified-Since headers are sent, reply with:

503 + Retry-After N

If no token is available and the requestor supplied appropriate If-Modified-Since headers, respond with a:

304 + Cache-Control for some scheduled time in the future (which the client can honor or ignore)

Of course, that last condition is strictly optional and not really required, since it would be abusing Cache-Control rather than using the 503 as intended.

(Also note: a request to domain.com/reallatest with an invalid or missing token could result in a 302 to /getlatest, a 403, or some other form of denial, depending on the specifics of the application.)

edit: Strictly speaking, the multiple-URL scheme above isn't even needed; a smart responder behind the 503 would do. The URL redirect method is there because there may be a larger application context around the system, in which getlatest does more than just serve the file, or in which multiple URLs redirect to reallatest, both easily imaginable situations.
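
For illustration, a rough sketch of that /getlatest responder (the issue_token callback and the retry numbers are made up, and the token bookkeeping from upthread is assumed):

    import random

    def getlatest(tokens_available, issue_token, client_mtime, file_mtime):
        """Returns (status, headers); issue_token is a hypothetical single-use token minter."""
        if tokens_available:
            return 307, {"Location": "/reallatest?token=" + issue_token()}

        if client_mtime is not None and client_mtime >= file_mtime:
            # Optional branch: client already has the file; hint at its next slot.
            return 304, {"Cache-Control": "max-age=%d" % (3600 + random.randint(0, 1800))}

        # No capacity and the client needs the file: plain 503 with a retry hint.
        return 503, {"Retry-After": str(600 + random.randint(0, 300))}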


> If no token is available and no If-Modified-Since headers are sent, reply with:

> 503 + Retry-After N

That's cool. There's still no reason for the second URL and the 307, and you're still getting hit with requests, so you're not avoiding the request load, only the download. You're smoothing out bandwidth, but not CPU & sockets.


This is sort of true. I don't know of a way to simply limit the number of incoming sockets without a lot of ISP-level involvement or just outright rejecting connections. It does limit the number of long-lived sockets for file transfers, though. For static file serving, I am assuming the CPU has plenty of spare capacity for running the algorithm, so I am not worried about that. Finally, I am assuming the limiting factor here is bandwidth, so bandwidth smoothing is the main goal.


This is the more robust solution. The simple solution would be to generate a random number and convert that to a time of the week :)
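
i.e. something like this sketch: each client picks a random offset into the week once, then checks at that offset every week:

    import random

    WEEK_SECONDS = 7 * 24 * 3600

    def pick_weekly_slot():
        """Chosen once per client: a uniformly random second within the week."""
        return random.randrange(WEEK_SECONDS)

    def seconds_until_next_check(now_in_week, slot):
        """How long this client sleeps until its slot comes around again."""
        return (slot - now_in_week) % WEEK_SECONDS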


But if you define a week as 7 days, then you will still experience a Monday peak for work computers. It doesn't solve the problem at all.


I assume changes are usually small; you may want to try serving diffs.

I.e. have the clients poll using the MD5 of their _current_ list version.

On the server, store the diff that will upgrade them to the current version under that filename. If a client requests an unknown MD5 (e.g. because it has no list or its list is corrupted), default it to a patch that contains the full file.

This requires a little logic on both ends (diff/patch), but would probably slash your bandwidth requirements to a fraction.
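
A rough sketch of the server side, assuming the diffs are pre-generated whenever the list changes (DIFF_DIR, CURRENT_LIST and the MD5-keyed filenames are all made up for illustration):

    import hashlib
    import os

    DIFF_DIR = "/var/lists/diffs"            # one pre-built patch per known old version
    CURRENT_LIST = "/var/lists/easylist.txt"

    def patch_for(client_md5):
        """Return whatever bytes get this client to the current version."""
        candidate = os.path.join(DIFF_DIR, client_md5 + ".patch")
        if os.path.exists(candidate):
            with open(candidate, "rb") as f:
                return f.read()              # small diff: the common case
        with open(CURRENT_LIST, "rb") as f:
            return f.read()                  # unknown or corrupt version: the full file

    def current_md5():
        with open(CURRENT_LIST, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()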

A little napkin math:

25 lists × 150 KB × 1M fetches ≈ 3.75 TB

vs

25 lists × 1 KB (patch) × 1M fetches = 25 GB (0.025 TB)


This is probably the Right Way, but it would be more work than minor tweaks to the delay logic.


Call me old school, but a huge peak demand is the perfect application for distributed sourcing, like BitTorrent. I know it is much more complicated to introduce P2P and way more risky if it gets poisoned, but it seems to me this underlying problem of huge peak demand was solved 10 years ago.


But there is a problem with BitTorrent: most schools and workplaces block it. We would need to fall back to HTTP or some other method that works in restricted places.


I wonder if there's a market for BitTorrent over HTTP? Node.js, WebSockets... surely it's possible?


All of those are strictly client-to-server, not P2P. You could in theory proxy BitTorrent over them, but you wouldn't gain anything over just serving the file from the server.

You can probably write a true P2P client as a Firefox extension, since its API gives you very low-level access (raw sockets, for example), but certainly not for e.g. Chrome.


WebRTC[1] seems to be the perfect platform for these sorts of things. It's in Chrome dev channel / Firefox Alpha right now.

[1] http://www.webrtc.org/


Yes, except this is a browser plugin, and no web browsers support BitTorrent, so the download is not going to happen unless the plugin user installs a BitTorrent update engine, which probably isn't going to happen.


> and no web browsers support BitTorrent

Actually, Opera has native support for downloading torrents.

And because Firefox extensions basically have complete and absolute freedom, it should be possible to build a torrent-downloading extension if one does not already exist (it probably does).


I love random numbers for distribution. I had a similar problem with a set of distributed clients that needed to download email, with only one client downloading a given inbox at a time. The email servers also had an issue where a large number of emails in an inbox would cause the server to slow down exponentially (e.g. it didn't matter how many MB of email were in the inbox, but it did matter if there were more than about 1000 emails).

The downloaders would fetch the list of inboxes to be processed, randomize it, and lock an inbox when they started downloading it. Each downloader would then randomly pick a size cutoff for the maximum email size it would download (10 KB, 1 MB, or unlimited), with an inversely proportional maximum email count, so that about 100 MB could be downloaded at any one time.
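
Roughly this kind of thing, with made-up numbers; each (cutoff, count) pair is chosen so a pass stays around 100 MB:

    import random

    # Made-up (size cutoff in bytes, max message count) pairs, each pass roughly 100 MB.
    CUTOFF_CHOICES = [
        (10 * 1024,   10000),   # lots of small messages
        (1024 * 1024, 100),     # a moderate number of medium messages
        (None,        20),      # "unlimited" size, only a handful of messages
    ]

    def pick_download_budget():
        """Randomly choose a size cutoff and the matching message count for this pass."""
        return random.choice(CUTOFF_CHOICES)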

We even had an issue with one server behind an old Cisco router that barfed on window scaling, so a few machines in the pool had window scaling disabled, and that account would naturally migrate to those machines.

It worked wonders for distributing the load and keeping the Inbox counts to a minimum.


I know it's overkill for a browser extension, but wouldn't this be easily solved by having built-in BitTorrent for updates?

The publisher would always be seeding the latest version, and the clients would connect maybe every other day. It would lower the pressure on the publisher's servers and make sure everyone could always have the latest version.

With these fancy magnet links, the publisher would only have to send the magnet and the actual file a couple of times, and then the peer-to-peer swarm would do the rest.


I would just sign it, stick it on S3, and forget it. Did I miss why that wasn't considered?


It is too expensive. 1 TB of bandwidth costs about $120. A project like Adblock Plus will be consuming about 3-4 TB a month, which adds up to around $450 a month.

Adblock list subscriptions are maintained and hosted by individual people who do it in their spare time. They mostly pay for the servers out of their own pockets. As one of the co-authors of a popular adblock list, I wouldn't want to break the bank to pay for S3 hosting. Our current solution works out, and when we reach our bandwidth limit, we can simply buy an additional TB of bandwidth at a much cheaper price than S3.

Btw, I just made a rough calculation using the AWS Simple Monthly Calculator, so correct me if I am wrong about S3 pricing.


Terabytes per month? That's insane. That's a million users (I can believe) downloading a megabyte (I can't quite believe). It appears my patterns.ini file is 600K, or about 150K compressed, so if I download it 30/5 = 6 times a month, that's... a megabyte. Wow.


Wow, that suggestion elsewhere in the thread, to serve diffs instead seems rather important now :)


While this workaround has merit, it doesn't actually solve the underlying problem. I guess even Amazon will eventually pick up the phone and ask you to stop sending them weekly bandwidth spikes when the figures involved get large enough (I've personally seen this with another well known PaaS provider).


> I guess even Amazon will eventually pick up the phone

Why would they?

You'd have to push dozens of GBit/s to even appear on their radar. The only time they'll call you is when they can't charge your CC anymore (a sustained 1 GBit/s will set you back $1000/day at their current rate).
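
Rough check of that figure, assuming roughly $0.12/GB egress (the S3 rate around 2012):

    BYTES_PER_SECOND = 1e9 / 8            # a sustained 1 Gbit/s
    PRICE_PER_GB = 0.12                   # assumed 2012-era egress price, USD

    gb_per_day = BYTES_PER_SECOND * 86400 / 1e9   # ~10,800 GB
    print(round(gb_per_day * PRICE_PER_GB))       # ~1296 USD/day, same ballpark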


Yep, they'll happily bill you. I imagine the budget Adblock Plus has is pretty small and, you know, can't be subsidized with ads.


Didn't they alter it to allow some ads through by default, or was that all reverted?


Why not assign people a day and time, and then if they regularly miss that time, assign them a different one?


> with the effect that people always download on the same weekday

What's so bad about that?


Server load goes really high on that day, and if you get more popular, you'll need more servers and hence more money.


Maybe it wasn't clear, but in this case the load is still distributed evenly across all days. I always download on Monday, you always download on Tuesday. The author suggests that this is not desirable.

FTA,

The “Total” row looks very nice, it’s exactly what we would like to have. However, our download interval matches the number of days now, with the effect that people always download on the same weekday (and those currently downloading on a Monday will continue downloading on a Monday).


Use a CDN? It's not like the list is tailored to anybody.


Isn't that something that nginx/varnish should easily be able to handle? It is just a static file download after all...


CPU and bandwidth are entirely different issues. Sure, nginx can handle the processing. But do you have the piping to match?

A run-of-the-mill dedicated server has a 100 Mbit uplink. Do the math (hint: it's easy to saturate in no time).
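
Doing that math, assuming the ~150 KB compressed list size mentioned elsewhere in the thread:

    UPLINK_BYTES_PER_S = 100e6 / 8        # 100 Mbit/s uplink
    LIST_SIZE = 150 * 1024                # ~150 KB compressed, per figures upthread

    max_per_second = UPLINK_BYTES_PER_S / LIST_SIZE
    print(round(max_per_second))          # ~81 downloads/s with the uplink fully saturated
    # A Monday peak of ~3 million downloads squeezed into business hours is already
    # right at that limit, so the link sits saturated for hours.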



This is just downloading a single static text file, so there's nothing to optimize.


Even serving a static file is a burden on the server when you have millions of requests.


What are the exact numbers?

A quote from http://wiki.nginx.org/Main :

> I currently have Nginx doing reverse proxy of over tens of millions of HTTP requests per day (thats a few hundred per second) on a single server. At peak load it uses about 15MB RAM and 10% CPU on my particular configuration (FreeBSD 6).


https://easylist.adblockplus.org/blog/2011/09/01/easylist-st... is the first thing I found, talking about 11.5 million total users with 80% of them using EasyList, i.e. 9.2 million. According to the blog post, the still-existing Monday peak (I just noticed this was after the update-behaviour change) was 118.5% of the expected value (week total / 7): 73 million downloads in August, so (73 / 4 / 7) * 1.185 = ~3 million for that specific list and ~3.75 million for all of them. Hope I didn't miscalculate ^^ Add to that the growth since then. easylist.txt seems to have a size of 528 KB.


+ SSL


Everybody downloading at once makes for slow servers. The author could probably pay much less in hosting costs if everybody downloaded the file along a uniform distribution.



