How I sent 300k emails through Github's API in a matter of minutes (badlogicgames.com)
196 points by badlogic on Sept 14, 2013 | 47 comments



That took my server down, I'm afraid. Good day today. Here's a "cached" version.

To all watchers of the libgdx repository: I'm terribly sorry and hope I didn't interfere with your work in any way.

This is meant as a cautionary tale about using Github’s API on a repository with quite a few watchers (460 in this case).

Earlier this year we migrated our code from Google Code to Github. We didn’t have a good migration plan for the 1200 or so issues back then, so we kept them on Google Code. We now have about 1700 issues on the tracker

Today I finally wanted to tackle the issue tracker migration, using a Python script [1] I found on Github. The script requires one to specify a Github user account that owns the repository the issues will get migrated to. I did a dry run on a fork of the main repo using my Github account, fixed up some issues in the script, and validated things to the best of my abilities. Things looked good.

Then I ran it on the main repository. Luckily I was watching our IRC channel. After about 4 minutes, people started to scream. They each received 789 e-mails from Github. Every single issue I migrated, and every single comment on each issue, triggered an e-mail notification to all watchers of the main repository.

This wasn't apparent to me during the dry runs, as I used my own Github account. The script posts all issues/comments with the user account I supplied, so naturally, I did not get any notification mails.

I stopped the script after 130 issues (4 minutes), and immediately started sending out apologies and a mail to Github support, to which I haven't received an answer yet. I sent roughly 300k mails through their servers in a matter of minutes. If I hadn't been watching IRC, I'd have sent out about 4 million mails to 460 people within an hour.
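For the curious, here's the back-of-the-envelope arithmetic behind those figures (numbers taken from above, all approximate):

    # Rough arithmetic behind the numbers above (all figures approximate).
    watchers = 460            # watchers of the main libgdx repository
    mails_per_watcher = 789   # notifications each watcher got before I stopped the script

    sent = watchers * mails_per_watcher
    print(sent)               # 362,940 -- the "roughly 300k" mails in about 4 minutes

    # Extrapolated to the full migration of ~1700 issues (roughly an hour of runtime):
    issues_migrated = 130
    total_issues = 1700
    print(int(sent * total_issues / issues_migrated))   # ~4.7 million mails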

Let me assure you that I'm extremely sorry about this incident. I know that things like this can interrupt daily workflows quite a bit, even if getting rid of those mails is not a Herculean task. I'd be rather upset if a repo maintainer pulled something like this on me. Please accept my deepest apologies.

The lesson for Github API users: think hard about the implications of automating tasks through the Github API if you have more than a few watchers.

The lesson for Github/API designers: consider safeguarding against such issues in your API, in case other idiots like me pull off something similar in the future.

[1] https://github.com/tgoyne/google-code-issues-migrator


While you may be the "visible culprit" - I'd hardly say you're at fault. The action you're attempting is not a "send email" action, even if the API provider has decided that it should send email notifications.

The implementation you'd probably want to see is for the API's email notifications to be batched if more than small-n trigger within a short period of time. Then the end user gets a notification that "1700 updates have occurred" instead of getting 1700 emails.
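Something like this, just as a rough illustration (thresholds and names invented here, not anything GitHub actually does):

    import time
    from collections import defaultdict

    BATCH_THRESHOLD = 5    # individual mails allowed per window before switching to a digest
    BATCH_WINDOW = 600     # window length in seconds (10 minutes)

    sent_in_window = defaultdict(int)    # user -> mails delivered individually this window
    buffered = defaultdict(list)         # user -> events held back for a digest
    window_start = defaultdict(float)    # user -> start time of the current window

    def notify(user, event, send_mail):
        """Deliver events immediately until the threshold, then buffer them for one digest."""
        now = time.time()
        if now - window_start[user] > BATCH_WINDOW:
            flush(user, send_mail)       # close out the previous window first
            window_start[user] = now
            sent_in_window[user] = 0

        if sent_in_window[user] < BATCH_THRESHOLD:
            sent_in_window[user] += 1
            send_mail(user, event)       # normal case: one mail per event
        else:
            buffered[user].append(event) # burst: hold back for a single summary mail

    def flush(user, send_mail):
        """Send one digest mail summarising everything that was held back."""
        if buffered[user]:
            send_mail(user, "%d updates have occurred" % len(buffered[user]))
            buffered[user] = []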

The API should also be set up with exceptional event detection. Spikes which are several orders of magnitude above normal should get paused in the task queue and flagged for immediate manual review. Users do things you don't expect.
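Again, only a sketch of the kind of check I mean (the spike factor and hooks are made up):

    def check_for_spike(events_last_minute, baseline_per_minute, pause_queue, alert,
                        spike_factor=100):
        """Pause delivery and flag for manual review when the event rate is orders
        of magnitude above the usual baseline."""
        if baseline_per_minute and events_last_minute > spike_factor * baseline_per_minute:
            pause_queue()   # stop draining the notification queue
            alert("Notification spike: %d events/min vs. a baseline of %d/min"
                  % (events_last_minute, baseline_per_minute))
            return True
        return False

    # e.g.: check_for_spike(60000, 50, lambda: print("queue paused"), print)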


That's indeed a good solution. Here's hoping that Github might adopt something like that in the near future. Manual repo maintenance is quite a burden, so making it harder to shoot yourself in the foot using the API would be much appreciated.


We hit this situation a while back on our product and implemented batching as a result (http://blog.meldium.com/home/2013/4/22/dont-spam-your-users-...)


Wow, bummer. I'm sure the others will understand though - it's really not your fault.

Some years ago the company I was at had a policy of sending an email on every error - to every developer, so nobody actually checked them. One day the 3 main servers went down, all at the same time. It transpired that a) someone had introduced a bug into the code that completely 500'd a site, b) it got through QA onto live, and c) there was a relay chain for the email servers (I have no idea why). By the time we'd worked out what was going on, the hard drives on all the machines had reached capacity and all the live systems were down. Of course, the timeouts that started happening as the load came on just cascaded the issue. Gotta watch those automated emails :)


This. Is an amazing story.


Once upon a time (about 5 years ago), there was a 256MB RAM "server" holding a WordPress-MU (MU as in multi-user) install. There were "agile" methodologies, but management fucked up the splits, and there were multiple sysadmins for the same systems, each working "with their team". I was called in for "performance issues" while I was working on other "stuff". There were 47,000 MySQL tables. There were triplicated "dumps" from cron. There was no "renice" or "ionice" involved in any of them. There was no NTP; the VM time was off by 7 hours. MySQL was running with the "distro defaults" :.(

etc ...

In summary: human failures happen. You were smart enough to realize it and stop it. That puts you above average.


Install W3TC[1], enable all the caches. Test to make sure it doesn't screw up your CSS and JS. Problem solved.

[1]: http://wordpress.org/plugins/w3-total-cache/


Why don't they make this the default? Almost every time I see this happening to a blog, it's a WordPress instance...


Wow, thanks for the tip!


Ouch. Well, I'm not sure you're at fault here so much as the Github API itself, which bears a lot of the blame (I don't understand why kitchen-sink notifications are really necessary for everything). Throttling the event rates or a queue would help in the future.

I'm glad this can serve as a learning experience and no actual damage was done. Sweeping email, while annoying and a little bit disruptive, isn't the end of the world, and those who choose to understand will do so.


It's not your fault.

It's actually a PITA to overcome issues like this on a technical level, because you have to run something akin to a buffer queue that works similarly to "debouncing".

The best approach I have found is to...

- You rate limit events as they happen... So you might let 5 events through (within 10 minutes), and then start rate limiting by adding each item to a queue that you merge down every 10 minutes (with exponential/incremental backoff each time you exceed the 5 items, so the next queue takes 30 minutes before it's popped, then 90 minutes... etc.)

- So, for example, you might have an instant pop from the queue when fewer than 10 events have been triggered within 10 minutes.

- Then, if more than 10 events have been triggered, you add each item to a queue and, after X minutes, pop everything off and send one bulk email (rough sketch below).
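Roughly, in toy code (just a sketch of the above, with made-up thresholds, not anything battle-tested):

    import time

    class DebouncedNotifier(object):
        """Toy version of the scheme above: let a few events straight through,
        then batch the rest and back off exponentially while the flood continues."""

        def __init__(self, send_mail, burst=5, base_delay=600, backoff=3):
            self.send_mail = send_mail
            self.burst = burst            # events delivered instantly per quiet period
            self.base_delay = base_delay  # first batching window: 10 minutes
            self.backoff = backoff        # 10 -> 30 -> 90 minutes, and so on
            self.delay = base_delay
            self.window_start = 0.0
            self.sent_in_window = 0
            self.queue = []
            self.next_pop = 0.0

        def event(self, payload):
            now = time.time()
            if not self.queue and now - self.window_start > self.base_delay:
                # things have gone quiet again: reset the window and the backoff
                self.window_start, self.sent_in_window, self.delay = now, 0, self.base_delay

            if self.sent_in_window < self.burst:
                self.sent_in_window += 1
                self.send_mail(payload)               # instant delivery
            else:
                if not self.queue:
                    self.next_pop = now + self.delay  # schedule the next merge
                    self.delay *= self.backoff        # exponential backoff for the one after
                self.queue.append(payload)            # hold for the bulk mail

        def tick(self):
            """Call periodically (e.g. from a worker/cron loop) to pop the queue."""
            if self.queue and time.time() >= self.next_pop:
                self.send_mail("%d updates merged into one mail" % len(self.queue))
                self.queue = []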

----

It's a real pain to manage such a system, because your "typical" job server, such as Gearman, doesn't let you add a "delay" on jobs...

Ideally, you'd want to make sure that you ignore any new events for at least X minutes... So you are left with the only option of running another pseudo-queue system just to catalogue all of your throttled events.
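One way around the missing "delay" support is a small time-ordered holding pen that a worker polls and then feeds into the real job server (a sketch, nothing Gearman-specific):

    import heapq
    import itertools
    import time

    class DelayQueue(object):
        """Minimal holding pen for throttled events: jobs only become visible once
        their delay has expired, at which point a worker can hand them to the real
        job server as ordinary (immediate) jobs."""

        def __init__(self):
            self._heap = []
            self._counter = itertools.count()   # tie-breaker so jobs never get compared

        def push(self, job, delay_seconds):
            ready_at = time.time() + delay_seconds
            heapq.heappush(self._heap, (ready_at, next(self._counter), job))

        def pop_ready(self):
            """Return every job whose delay has expired."""
            ready, now = [], time.time()
            while self._heap and self._heap[0][0] <= now:
                ready.append(heapq.heappop(self._heap)[2])
            return ready

    # A worker loop would poll pop_ready() every few seconds and submit whatever
    # comes back to the job server as normal jobs.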

Let's talk strategy. How else do you guys handle instant email notifications without this spamming issue? PS: I'm referring to GitHub implementing this strategy, not the OP, in case there was any confusion.


The problem is github makes all notifications STD-level contagious. They broke out "watching" and "starring" a while ago, but they are still very overzealous about notifying everybody about everything.

The poster didn't use a "send email" API; he was just automatically importing things, and every import triggered emails nonsensically.


The problem is that GitHub doesn't really have an import/export API; when you migrate from Google Code or SourceForge, all issues will appear created and commented from your own account, and at the current date/time.

I guess there's a commercial reason for this: http://giovanni.bajo.it/post/60836467126/github-is-missing-i...

I was going to make the same mistake myself (thanks OP!). Is there a workaround?


> I guess there's a commercial reason for this

I doubt that's the case. I bet you that if GitHub were to break out their revenue by plan, the vast majority would come from business plans and GitHub Enterprise. I would be astonished if any of these customers would ever export their dormant repositories for storage.

I'm sure that it's just a matter of them prioritizing this vs. the million other things that are on their backlog. As with most other software companies, your best bet would probably be to start emailing their support with requests for it so that they know it's important to the community at large.


> Is there a workaround?

Not that I know of. I added a note to the README about it sending lots of emails to help others avoid accidentally doing this, though.


I was about to send a PR for the README; that will save a few folks some headaches.


Memcache is a pretty good solution.


I caused a similar issue while running edits on a Confluence Wiki instance as an intern. I was helping our publications department add some macro to every single page of the site, which I found out they were doing by hand! A bit shocked, I told them I could write something to automate that in a matter of minutes.

Sure enough, several minutes later all of the pages were updated. All 50 or so pages in each of the 15 spaces. And everyone who had ever touched one of those pages got an email for that page.

The nice thing about the Confluence API is that you can specify "minor" updates to prevent exactly this scenario from happening.

I guess since GitHub is built on the git foundation, adding some sort of "silent" flag might not be as easy, but it's certainly desirable.
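For reference, the Confluence side looked roughly like this over the old XML-RPC interface (method and option names recalled from memory, so treat them as approximate):

    # Sketch only: method and option names may not match your Confluence version exactly.
    import xmlrpc.client

    server = xmlrpc.client.ServerProxy("https://wiki.example.com/rpc/xmlrpc")  # hypothetical URL
    token = server.confluence2.login("bot-user", "secret")

    page = server.confluence2.getPage(token, "SPACEKEY", "Some Page")
    page["content"] += "\n{my-macro}"   # hypothetical macro being added everywhere

    # The options dict marks this as a "minor edit", which suppresses the usual
    # notification mails to everyone watching the page.
    server.confluence2.updatePage(token, page, {"versionComment": "add macro",
                                                "minorEdit": True})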


I'm surprised that Github has not implemented a feature for repo migration from Google Code and SourceForge.


I don't think they view that as a significant growth vector. Skimming an existing market is rarely a good tactic.

Plus it's up to the developer. We can't have one-click-do-all buttons for everything.


Even if there's no one-click option, it'd be nice if the API exposed a much more flexible set of issue editing options to make it possible for us to write better importers ourselves.


If only they had access to some kind of command line tool which lets you bulk pull and push and merge histories. Do you think they have something like that at github? :-)


There is a command line tool; unfortunately, it's not very usable.


I don’t think the GP was talking about ed scripts.


They say you shouldn't explain your own jokes. My comment was that github is built around a tool, namely git, that makes it really easy and efficient to import and merge histories. It's amusing that the value-add portion of github, the web bits, suffers from problems like this that a tool like git could conceivably be applied to.


Indeed they do say that. My – admittedly rather poor attempt at a – joke was that the ‘command line tool’ to which you were referring is indeed quite usable and user-friendly; as opposed to, say, ed scripts.


Well if you want someone to do something, you should make it as easy as possible. Therefore, if you want someone to migrate their repo to your site, you should make it as easy as possible for them to do so.


Following just one busy repo can take over your inbox (ahem, Docker). So I'm sure your people have good filters in place so that they aren't too distracted by a flood of messages from Github.


Take a look at https://github.com/jpetazzo/gunsub to get the notifications under control.


All the more reason Github should implement a solution like glimcat proposes, which batches e-mails instead of sending one for each update.


Not his fault, and pretty cool that he was in touch with the users of the library enough to catch it and stop it.

Always figured I'd do Cocos2d-x or Unity for any serious game I do next. I've used Cocos2d and written Unity plugins before. I even have a contractor working on a Unity project right now. Will have to give libgdx a few extra points when deciding in the future, though, for having a caring maintainer.

I actually wrote an OpenGL game engine for Android back before any of the later things came out, like Replica Island, AndEngine, the Cocos2D port, etc. Almost makes me wish I'd open-sourced it. It did have some awesome stuff, like batching all the sprites with similar draw states together into one draw call.
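The batching idea, stripped down to just the grouping (no actual GL calls, purely illustrative):

    from collections import defaultdict

    def batch_sprites(sprites, draw_batch):
        """Group sprites that share a draw state (texture, blend mode, ...) so
        each group becomes a single draw call instead of one call per sprite."""
        batches = defaultdict(list)
        for sprite in sprites:
            key = (sprite["texture"], sprite["blend_mode"])   # the shared "draw state"
            batches[key].append(sprite)
        for state, group in batches.items():
            draw_batch(state, group)   # one call: upload all vertices, then draw once

    # e.g.: batch_sprites(my_sprites, lambda state, group: print(state, len(group)))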


I hope there are plenty of other reasons to give gdx a few extra points, at least compared to Cocos2d-x: being able to write your game for desktop, iOS and Android in Scala, or better performance than Cocos2d-x, for example. SCNR

Glad to see other Android "old-timers", would love to see what you came up with back then.


As someone watching Mario and libgdx from the jMonkeyEngine side of the fence, give it a shot :)


Do people actually let GitHub emails go to their inbox? After they got really aggressive about signing me up as a "watcher" to repos, I found it necessary to just route all GitHub email to a dedicated folder that I never read, then only whitelist repositories I actually care about.
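The filtering rule boils down to something like this (the List-Id header format is from memory, so double-check it against a real GitHub notification before relying on it):

    WHITELIST = {"libgdx/libgdx", "myorg/important-repo"}   # repos I actually care about

    def route(headers):
        """Return the folder a message should be filed into."""
        sender = headers.get("From", "")
        list_id = headers.get("List-Id", "")   # e.g. "libgdx/libgdx <libgdx.libgdx.github.com>"
        if "github.com" not in sender and "github.com" not in list_id:
            return "INBOX"                     # not a GitHub notification at all
        repo = list_id.split(" ")[0] if list_id else ""
        return "INBOX" if repo in WHITELIST else "GitHub-Noise"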


> After they got really aggressive about signing me up as a "watcher" to repos.

You may want to turn off auto-subscriptions to repositories you have push access to: https://github.com/watching


Wow, finding that page is a nightmare. I was recently trying to find out how I can stop auto-watching repos I'm given access to (like if someone creates a repo under an organization), and I searched all over the settings for this.

Thank you.


You should always assume many people will operate according to the defaults.


Well, that is a big yikes, but crisis (mostly) averted.


I did this with a company repo, when I wrote a script to migrate issues from a spreadsheet into GitHub. I only sent 50 issues * 15 people, though.


Anyone have a mirror or cached copy of this page? This site hit the front page of hn/proggit twice this week, and it was down both times.


Yes, pretty annoying. We should have something that takes a copy of a page before it is shown here.


He's posted the content as a comment on this page.


Site is majorly foobared


Recently: And a bottle of rum (http://www.amazon.co.uk/And-Bottle-Rum-History-Cocktails/dp/...)

I loved it. A history of rum, including all of the politics around it (like the role it played in the slave trade and American independence), great read :)


Ehh? Is this spam or what?

note: the site is down so I don't know if this is a reference to the original article


The author has posted the article in the comments. Read the third or fourth comment.



