It spends about 2 to 10 ACU per hour in the small VM, and ten times as much on the large one. No credits are spent during sleep or "waiting for response" time, as far as I observed.
1. They mention that the largest tables ran into several TBs, and they would have soon topped the max IOPS supported by RDS. RDS for PostgreSQL peaks at 256,000 IOPS for a 64 TB volume. For a multi-AZ setup, this costs ~$70K/mo.
2. Let's assume the final outcome was a 5-way shard with each shard supporting ~50,000 IOPS and ~12 TB data. For a multi-AZ setup, this costs ~$100K/mo.
3. It took 9 months to shard their first table. Since it required application changes as well, let's assume this was 9mo * 20 work days/mo * (3 DB engineers + 2 app engineers) = 900 work days. Even at $100K avg. annual pay for an engineer, this is ~$400K.
4. A PostgreSQL-compatible NewSQL like YugabyteDB should cost ~$15K/mo to match top-of-the-line RDS performance. So Figma spent ~25x ($400K/$15K) to implement horizontal sharding in-house, and is still on RDS, which costs ~6x as much ($100K/$15K).
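For anyone who wants to check the arithmetic in the list above, here is a minimal sketch. Every input is the commenter's rough estimate, not a measured figure, and the 240-day work year (12 months of 20 work days) is an assumption carried over from step 3.

```python
# All inputs are rough estimates from the comment above, not measured figures.
MONTHS = 9
WORK_DAYS_PER_MONTH = 20
ENGINEERS = 3 + 2                   # 3 DB engineers + 2 app engineers
ANNUAL_PAY_USD = 100_000
NEWSQL_MONTHLY_USD = 15_000         # estimated YugabyteDB bill
SHARDED_RDS_MONTHLY_USD = 100_000   # estimated 5-way sharded RDS bill

work_days = MONTHS * WORK_DAYS_PER_MONTH * ENGINEERS

# Assumption: a work year is 12 * 20 = 240 days.
daily_rate = ANNUAL_PAY_USD / (12 * WORK_DAYS_PER_MONTH)
engineering_cost = work_days * daily_rate

# One-time engineering cost expressed in months of the NewSQL bill.
one_time_ratio = engineering_cost / NEWSQL_MONTHLY_USD
# Ongoing monthly bill compared with the NewSQL bill.
monthly_ratio = SHARDED_RDS_MONTHLY_USD / NEWSQL_MONTHLY_USD

print(work_days, round(engineering_cost), round(one_time_ratio, 1), round(monthly_ratio, 1))
```

Note that the ~25x figure divides a one-time cost by a monthly bill, so it is better read as "about 25 months of the NewSQL bill" than as a like-for-like multiple.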
> A PostgreSQL-compatible NewSQL like YugabyteDB should cost ~$15K/mo to match top-of-the-line RDS performance.
"to match" is doing a lot of work here. It's extremely unwise to assume that a "compatible" database will have the same performance characteristics and no performance cliffs.
> So Figma spent ~25x ($400K/$15K)
They nearly got acquired for $20B, I don't think they give a hoot about 400K if it means keeping their stack the same and getting to keep all of the existing organizational knowledge about how to keep the thing online.
That’s the problem with case studies, isn’t it?
What works for one company in one industry might be a death sentence for another organization with slimmer margins…
Figma, worth $10 billion, was migrating what seems like their core production data. They probably didn't want to bet the company on a comparatively small software vendor like Yugabyte.
Most likely the engineering cost was much much higher than your quotes, but still insignificant compared to the potential risks. And migrating from RDS to not-RDS could easily not have been cheap in engineering time either, depending on how much software they've built around it.
Past a certain data size, migrations are always a nightmare. For a much longer time than you initially estimated, you are managing two systems with all the related operational costs and complexity, as well as all of the IOPS and bandwidth spent migrating the data.
I imagine scaling out RDS instead mitigated a lot of those costs.
Their rationale for this choice is covered in the article somewhat extensively near the top.
> Additionally, over the past few years, we’ve developed a lot of expertise on how to reliably and performantly run RDS Postgres in-house. While migrating, we would have had to rebuild our domain expertise from scratch. Given our very aggressive growth rate, we had only months of runway remaining. De-risking an entirely new storage layer and completing an end-to-end-migration of our most business-critical use cases would have been extremely risky on the necessary timeline. We favored known low-risk solutions over potentially easier options with much higher uncertainty, where we had less control over the outcome.
TL;DR they were working in a short timeline, with a limited team size, and wanted to minimise any risks to the business.
Clearly cost is an issue for Figma, but downtime, or worse, data loss, would have a ginormous impact on their business and potential future growth. If your product is already profitable and your user base is growing fast, along with your ARR, why would you risk that growth and future ARR just to save a few $10Ks a month? A very low risk DB migration that lets you keep scaling and raking in more money is way better than a high risk migration that might save some cash in the long term, but also risks killing your primary business if it goes wrong.
Ok, what risk? CockroachDB is already proven technology and costs marginally more (if you use their serverless setup, it's free until you hit real scale). At the startups I've been at that hit scale, scaling SQL was always a massive undertaking and affected product development every single time.
If you don't want downtime, don't use databases that require downtime to do a migration?
Netflix, Roblox, every single online gambling website all use CockroachDB.
Sounds like their discomfort was in the migration path to 'any other database' alongside not having the experience with another database to mitigate any unknown unknowns.
> During our evaluation, we explored CockroachDB, TiDB, Spanner, and Vitess. However, switching to any of these alternative databases would have required a complex data migration to ensure consistency and reliability across two different database stores.
> ensure consistency and reliability across two different database stores.
This is the main known known, and it is a hard thing to attain.
My favorite story on that is the testing of the Tendermint consensus implementation [1]. The testing process found a way to break the consensus, and the reason was that the protocol implementation and the KV store controlled by the protocol used different databases.
Never used cockroach so pardon my ignorance, but are there no operational challenges with running/using them? Or are they the same challenges? And how compatible is it from an application developer perspective?
The managed service is hassle free and it's auto sharded so you don't have traditional scaling issues. You do need to think about how your index choices spread writes and reads on the cluster to avoid hotspots. It's almost completely compatible with postgres wire protocol but it doesn't support things like extensions for the most part.
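To illustrate the hotspot point, here is a toy sketch of a range-partitioned key space. It is not CockroachDB's actual range-splitting logic. The four fixed ranges, the `range_for` and `hashed` helpers, and the ID values are all hypothetical, chosen only to show why sequential keys pile onto one range while hashed keys spread out.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical cluster: 4 ranges split at fixed boundaries over a 32-bit key space.
KEY_SPACE = 2**32
BOUNDARIES = [KEY_SPACE // 4 * i for i in range(1, 4)]  # 3 split points -> 4 ranges


def range_for(key: int) -> int:
    """Return the index of the range that owns this key."""
    return bisect.bisect_right(BOUNDARIES, key)


def hashed(key: int) -> int:
    """Map a key to a pseudo-random 32-bit position (stable across runs)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big")


# 1,000 recent inserts with auto-increment style IDs.
sequential_ids = range(1_000_000, 1_001_000)

# Indexing on the raw ID: neighbouring IDs share a range, so one range takes every write.
sequential_load = Counter(range_for(k) for k in sequential_ids)

# Indexing on a hash of the ID: writes scatter across all ranges.
hashed_load = Counter(range_for(hashed(k)) for k in sequential_ids)

print("sequential:", dict(sequential_load))
print("hashed:    ", dict(hashed_load))
```

The trade-off in this toy model is that hashing gives up cheap ordered scans over the ID, so which index to pick depends on whether the workload is write-heavy or scan-heavy.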
There are TONS of operational issues running Cockroach. At the last company I was at, Cockroach was probably overused as a magical way to run multiple DCs and keep things consistent without high developer overhead, but it was the #1 source of large outages. So much so that we’d run a Cockroach cluster segmented out for a single microservice to limit the blast radius when it eventually failed.
That, and it's comically more expensive than Postgres. If you think IOPS are expensive, wait till you see the service contract.
In ye olden times I used to stop bosses from throwing away the slowest machine we had, and try to get at least one faster machine.
It’s still somewhat the case, but at the time the world was rotten with concurrent code that only worked because an implicit invariant (almost) always held. One that was enforced by the relative time or latency involved with two competing tasks. Get new motherboards or storage or memory and that invariant goes from failing only when the exact right packet loss happens, to failing every day, or hour, or minute.
Yes, it’s a bug, but it wasn’t on your radar and the system was trucking along yesterday and now everything is on fire.
The people who know this think the parent is a very interesting question. The people who don’t, tend to think it’s a non sequitur.
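A minimal sketch of the kind of timing-dependent invariant described above. The `run` function, the delays, and the "config" state are all made up for illustration. The reader never waits for the writer, so the code is only correct while the reader happens to be the slower of the two.

```python
import asyncio


async def run(writer_delay: float, reader_delay: float) -> str:
    state = {"config": None}

    async def writer() -> None:
        await asyncio.sleep(writer_delay)  # stands in for e.g. a disk load
        state["config"] = "loaded"

    async def reader() -> str:
        await asyncio.sleep(reader_delay)  # stands in for e.g. a network handshake
        # Bug: nothing here waits for the writer. Correctness rests entirely
        # on the reader being slower than the writer.
        if state["config"] is None:
            return "crash: config not loaded"
        return "ok"

    _, result = await asyncio.gather(writer(), reader())
    return result


# Old hardware: the reader path is slow, so the implicit invariant holds.
old_hardware = asyncio.run(run(writer_delay=0.01, reader_delay=0.05))

# "Upgraded" hardware: the reader path got faster and the invariant breaks.
new_hardware = asyncio.run(run(writer_delay=0.01, reader_delay=0.0))

print(old_hardware)
print(new_hardware)
```

The proper fix would be explicit synchronization, for example having the reader await an `asyncio.Event` that the writer sets, rather than relying on relative latency.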
Except for the un-implemented features which they might need.
It also uses serializable isolation, and in their implementation reads are blocked by writes, unlike in Postgres. Those are both significant changes that can have far-reaching application impacts.
IOPS isn’t a linear thing here. The vacuums needed to prevent transaction wraparound (VACUUM FREEZE) can’t be throttled and are much more expensive than regular vacuums. By splitting the tables they are likely reducing the need for those vacuums (by a large margin) and significantly reducing IOPS needs.
Before taking any of the steps, you will have to finalise the approach, i.e., whether you are looking for a longer-term solution or a shorter-term one.
Most startups begin with a short-term plan, i.e., hiring folks as consultants through the US-registered entity. Then, as per the long-term roadmap, they hire a site lead who can start the process of registering an India entity and set up payroll and the rest. After a certain number of people, things like PF become necessary. There are a few consultancy firms that handle all of these things.
The consultant approach works well for initial hires. Also, the India entity registration process takes at least 3 months, so at the start most firms begin with consultants.