Yes, we did it at Kenna Security. About 300 paying customers, but over 1000 with...

Yes, we did it at Kenna Security. About 300 paying customers, but over 1000 with trials, and overall about 6B vulnerabilities being tracked (the largest table in aggregate). Some of the tables were business intelligence data accessible to all customers, so they were on a “master” DB that all could access; and some of the tables were fully multi-tenant data, so each customer had their MySQL DB for it.

The motivation was that we were on RDS’s highest instance and growing, with jobs mutating the data taking a less and less excusable amount of time.

The initial setup was using just the Octopus gem and a bunch of Ruby magic. That got real complicated really fast (Ruby is not meant to do systems programming stuff, and Octopus turned out very poorly maintained), and the project turned into a crazy rabbit hole with tons of debt we never could quite fix later. Over time, we replaced as many Ruby bits as we could with lower-level stuff, leveraging proxySQL as we could; the architecture should have been as low-level as possible from the get-go... I think Rails 6’s multi-DB mode was going to eventually help out too.

One fun piece of debt: after we had migrated all our major clients to their own shards, we started to work in parallel on making sure new clients would get their own shard too. We meant to just create the new shard on signup, but that’s when we found out, when you modify Octopus’s in-memory config of DBs, it replaces that config with a bulldozer, and interrupts all DB connections in flight. So, if you were doing stuff right when someone else signs up, your stuff would fail. We solved this by pre-allocating shards manually every month or so, triggering a manual blue-green deploy at the end of the process to gracefully refresh the config. It was tedious but worked great.

And of course, since it was a bunch of Active Record hacks, there’s a number of data-related features we couldn’t do because of the challenging architecture, and it was a constant effort to just keep it going through the constant bottlenecks we were meeting. Ha, scale.

Did we regret doing it? No, we needed to solve that scale problem one way or another. But it was definitely not solved the best way. It’s not an easy problem to solve.