I take it from this blog that any engineer at GitHub can get a copy of anyone's data. I find this vaguely terrifying, especially because Yahoo! was so careful internally about what an engineer could actually get access to. Most data like this had usernames/names/emails/other sensitive info redacted.
Can anyone at GitHub get a copy of my private repo(s)? Because I'm out if they can.
"Move fast and take a copy of the referrers table."
I'm curious: what have you ever seen GitHub publish that would indicate there was any such security around your private repos within their organisation? I would have assumed there was none myself, but you seem to've assumed they'd treat your private repos like financial data or something.
Legit question, in case someone thinks I'm trolling or whatever.
I don't run a company myself but I would think that the enterprise offering would have some security assurances. Does GitHub say that its services aren't appropriate for purposes other than open-source development?
By security, though, I wouldn't expect the guarantees of financial data or whatever, but internal audit and access controls, i.e. not the scenario of "any engineer at GitHub" that the GP has inferred. IIRC, for example, at Google only specific engineers, such as SREs, are given uber-access to user data as part of their duties: https://techcrunch.com/2010/09/14/google-engineer-spying-fir...
I think they'd treat my data like gold, so that not just any engineer can take a copy, because:
a) unrestricted access could threaten GitHub's business (blackmail, disgruntled employee, internal threat, security services, concerned comments like this etc.)
b) real companies store their source code on GitHub; their source code is at least 50% of their business.
c) GitHub isn't really a startup anymore. They have hundreds of employees, and you can't let that many people have unfettered access to data. Someone could lose a laptop...
d) As mentioned, Yahoo! was very careful with customer data
e) It's called a Private Repo
I'm making a big set of assumptions, you are right. I'll move my GitHub to my own server now... installing GitLab which I absolutely hate but it's better than trusting an unknown set of engineers.
I've just realised that probably any Digital Ocean engineer can get root on my box if they want to, right? :-(
Your overall point is valid, but an employee losing a laptop is a much narrower problem to solve: set JAMF policies that enforce full-disk encryption and reasonably strong passwords, and write off any machines that are lost/stolen. Figuring out employee trust models is a much broader/harder problem.
You want to at least have some control over the hardware. Disk encryption on a dedicated server or (even more ideally) a caged, locked, colocated rack.
The system administrators of anything multi-tenant can access your data. The problem is they probably don't care. It's up to you where to draw that line, though.
"Administering a mail host is sort of like being a nurse; there's a brief period at the start when the thought of seeing people's privates might be vaguely titillating in a theoretical sense, but that sort of thing doesn't last long when it's up against the daily reality of mess. Now that I think about it, administering a mail host is exactly like being a nurse, only people die slightly less often."
You're technically correct, which is of course the best kind of correct. But it's not quite so easy.
I can't speak for any other Git hosting service, of course, but if you have a private repository in Microsoft Visual Studio Team Services then:
- all engineers are given a background check before being hired
- very few engineers have access to production servers or databases, only those who have an operational necessity
- engineers that do have production access are given a still stricter background check
Certainly we don't have private repositories just lying around, and yes we do treat them like they're your financial data.
Microsoft hosts all of its source code in VSTS, including Windows. This is data that we absolutely want to protect. Our customers' data is no different.
Super interesting post. Would love to read more detail about their backup and restore infrastructure.
If Tom and/or Shlomi are reading this: you mention taking multiple logical backups per day. What benefit does this bring versus just having one per day and doing a point-in-time restore using binlogs? Is this just a tradeoff between time taken for a restore and storage you're willing to dedicate to backups?
@jivid the logical backups are done per-table, not per-server.
Per-table logical backups are useful to the engineers owning the data. It makes it easy for them to restore data from a single table.
When an engineer loads logical backup data, it loads into a non-production private zone where the engineer has access to the data, and can then make informed decisions on whether there is need to re-apply data changes (due to bug, due to need to review historical data, etc.).
This of course has the advantage of quicker restores (only need a single table), and this happens to cover the vast majority of cases. This doesn't cover the case where we need to restore consistent data for two or more different tables.
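For what it's worth, here is a rough sketch of why the single-table case is easy and the multi-table case is not. It is only an illustration, not GitHub's actual tooling: the function names, the table names, and the idea of representing backups as a sorted list of timestamps (hours) per table are all assumptions made up for the example. Each table is restored from its newest backup at or before the target time, and the result is only consistent across tables if every table happens to land on the same snapshot time.

```python
from bisect import bisect_right


def latest_backup_before(backup_times, target):
    """Return the newest backup timestamp <= target, or None if there is none.

    backup_times must be sorted in ascending order.
    """
    i = bisect_right(backup_times, target)
    return backup_times[i - 1] if i else None


def restore_plan(backups, tables, target):
    """Pick a backup per table and report whether the picks are mutually consistent.

    backups: hypothetical mapping of table name -> sorted backup timestamps.
    Returns (picks, consistent), where consistent is True only if every
    requested table has a backup and all picks share the same timestamp.
    """
    picks = {t: latest_backup_before(backups[t], target) for t in tables}
    times = set(picks.values())
    consistent = None not in times and len(times) == 1
    return picks, consistent


# Hypothetical schedule: each table is dumped on its own cadence.
backups = {"users": [0, 8, 16], "repos": [2, 10, 18]}

print(restore_plan(backups, ["users"], 12))           # ({'users': 8}, True)
print(restore_plan(backups, ["users", "repos"], 12))  # ({'users': 8, 'repos': 10}, False)
```

With one table there is always a single snapshot to load, so the restore is trivially self-consistent. With two tables dumped at different times, the picks disagree (hour 8 versus hour 10), which is the case that per-table logical backups alone don't cover.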
What's the difference between gh-ost and the Percona tool (pt-online-schema-change)? Also, did you try to use a recent version of MySQL that supports live migration?
Great post! Do you use semi-synchronous or asynchronous replication? If you use asynchronous replication, when a server crashes and this triggers the automated failover, do you lose the last transactions?
Tangentially, in that blog post I was impressed by the list of companies running orchestrator.
> orchestrator is actively maintained by GitHub. It manages automated failovers at GitHub. It manages automated failovers at Booking.com, one of the largest MySQL setups on this planet. It manages automated failovers as part of Vitess. These are some names I’m free to disclose, and browsing the issues shows a few more users running failovers in production. Otherwise, it is used for topology management and visualization in a large number of companies such as Square, Etsy, Sendgrid, Godaddy and more.