Is it necessary to fully replicate the persistent store onto the striped SSD array? I admire such a simple solution, but I wonder if something like an LRU cache would achieve similar speedups while using fewer resources. On the other hand, it could be a small cost to pay for a more consistent and predictable workload.
How does md handle a synchronous write in a heterogeneous mirror? Does it wait for both devices to be written?
I'm also curious how this solution compares to allocating more RAM to the servers and either letting the database software use it for caching, or even creating a ramdisk and putting that in RAID 1 with the persistent storage, since the SSDs are being treated as volatile anyway. I assume it would be prohibitively expensive to replicate the entire persistent store into main memory.
I'd also be interested to know how this compares with replacing the entire persistent disk / SSD system with ZFS over a few SSDs (which would also allow snapshotting). Of course, it is probably a huge feature to be able to have snapshots be integrated into your cloud...
> Is it necessary to fully replicate the persistent store onto the striped SSD array? I admire such a simple solution, but I wonder if something like an LRU cache would achieve similar speedups while using fewer resources. On the other hand, it could be a small cost to pay for a more consistent and predictable workload.
One of the reasons an LRU cache like dm-cache wasn't feasible was that we saw a higher-than-acceptable bad sector read rate, which would cause a cache like dm-cache to bubble a block device error up to the database. The database would then shut itself down when it encountered a disk-level error.
> How does md handle a synchronous write in a heterogeneous mirror? Does it wait for both devices to be written?
Yes, md waits for both mirrors to be written.
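To make that concrete, here's a toy model of what "waits for both" implies for write latency in a mirror like this (the latency figures are made-up placeholders, not measurements of any real device). md does also let you mark a member write-mostly and enable write-behind, which relaxes the synchronous behavior for that member, but with a plain RAID1 setup the slow leg dictates completion:

    # Toy model of RAID1 write latency in a heterogeneous mirror.
    # The figures below are hypothetical placeholders, not measurements.

    LOCAL_SSD_WRITE_MS = 0.4    # assumed fast local (striped SSD) leg
    REMOTE_DISK_WRITE_MS = 5.0  # assumed slow network-attached leg

    def sync_mirror_write_ms(member_latencies_ms):
        """A fully synchronous mirror completes a write only once every
        member has acknowledged it, so the slowest leg governs latency."""
        return max(member_latencies_ms)

    def write_behind_write_ms(fast_ms):
        """Simplified model of write-mostly + write-behind: the write is
        acknowledged once the fast member has it; the slow one catches up."""
        return fast_ms

    print(sync_mirror_write_ms([LOCAL_SSD_WRITE_MS, REMOTE_DISK_WRITE_MS]))  # 5.0
    print(write_behind_write_ms(LOCAL_SSD_WRITE_MS))                         # 0.4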
> I'm also curious how this solution compares to allocating more RAM to the servers and either letting the database software use it for caching, or even creating a ramdisk and putting that in RAID 1 with the persistent storage, since the SSDs are being treated as volatile anyway. I assume it would be prohibitively expensive to replicate the entire persistent store into main memory.
Yeah, we're talking many terabytes.
> I'd also be interested to know how this compares with replacing the entire persistent disk / SSD system with ZFS over a few SSDs (which would also allow snapshotting). Of course, it is probably a huge feature to be able to have snapshots be integrated into your cloud...
Would love to have used ZFS, but Scylla requires XFS.
Is it really a problem to bubble up that error (and kill the database server) if you can just bring up a new database server with a clean cache (potentially even on the same computer without rebooting it) instantly? (I have been doing this kind of thing using bcache over a RAID array of EBS drives on Amazon for around a decade now, and I was surprised you were concerned enough about those read failures to forgo the LRU benefits, when to me your solution sounds absolutely brutal for needing a complete copy of the remote disk on the local "cache" disk at all times for the RAID array to operate at all, meaning it will be much harder to quickly recover from other hardware failures.)
> Is it really a problem to bubble up that error (and kill the database server) if you can just bring up a new database server with a clean cache (potentially even on the same computer without rebooting it) instantly?
Our MTTF estimates for our larger clusters implied a real risk of multiple nodes stopping simultaneously due to bad sector reads. Remediation in that case would basically require cleaning and rewarming the cache, which for large data sets could take an hour or more, and we'd lose quorum availability during that time.
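Roughly, the worry is that with quorum operations at RF=3, losing two replicas of the same range breaks quorum, so the question becomes how likely a second bad-sector stoppage is during a rewarm window. A back-of-the-envelope sketch (every number below is a hypothetical placeholder, not our measured rates):

    # Crude estimate of quorum-loss risk while a node's cache is rewarming.
    # All inputs are hypothetical placeholders, not measured values.

    NODES = 60                                  # hypothetical cluster size
    RF = 3                                      # replication factor
    BAD_SECTOR_STOPS_PER_NODE_PER_YEAR = 2.0    # hypothetical incident rate
    REWARM_HOURS = 1.5                          # hypothetical clean + rewarm time

    HOURS_PER_YEAR = 24 * 365
    per_node_per_hour = BAD_SECTOR_STOPS_PER_NODE_PER_YEAR / HOURS_PER_YEAR

    # While one node is down rewarming, quorum for its ranges is lost if any of
    # the other (RF - 1) replicas of those ranges also stops within the window.
    p_second_stop = 1 - (1 - per_node_per_hour) ** ((RF - 1) * REWARM_HOURS)

    # Expected rewarm windows per year across the cluster, and from that the
    # expected number of quorum-loss events per year.
    windows_per_year = NODES * BAD_SECTOR_STOPS_PER_NODE_PER_YEAR
    print(windows_per_year * p_second_stop)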
> to me your solution sounds absolutely brutal for needing a complete copy of the remote disk on the local "cache" disk at all times for the RAID array to operate at all, meaning it will be much harder to quickly recover from other hardware failures.)
In Scylla/Cassandra, you need to run full repairs that scan over all of the data. An LRU cache doesn't work well with that access pattern: the repair touches everything once and churns the hot set out of the cache.
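To illustrate why, here's a tiny LRU simulation showing a repair-style full scan evicting the entire hot set (the cache and data set sizes are arbitrary, not Scylla's, and this ignores scan-resistant eviction policies that some caches offer):

    from collections import OrderedDict

    # Toy LRU cache: the sizes are arbitrary illustrations.
    CACHE_SIZE, TOTAL_BLOCKS, HOT_SET = 1_000, 100_000, 500

    cache = OrderedDict()

    def access(block):
        """LRU access: refresh on hit, insert and evict the oldest on miss."""
        if block in cache:
            cache.move_to_end(block)
        else:
            cache[block] = None
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)   # drop the least recently used block

    def resident_hot_blocks():
        return sum(1 for b in range(HOT_SET) if b in cache)

    for b in range(HOT_SET):            # warm the cache with the hot set
        access(b)
    print(resident_hot_blocks())         # 500: the whole hot set is cached

    for b in range(TOTAL_BLOCKS):        # a repair-style scan touches everything
        access(b)
    print(resident_hot_blocks())         # 0: the scan evicted the entire hot set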
Curious what the perf tradeoffs are of multiple EBS volumes vs. a single large one? I know of their fixed IOPS-per-GB ratio or whatnot; maybe there is some benefit to splitting that up across devices.
Is it about the ability to dynamically add extra capacity over time or something?
Yeah: I do it when the max operations or bandwidth per disk is less than what I "should get" for the size and cost allocated. The last time I did this was long enough ago that the per-volume maximums were simply much smaller, but I've recently started doing it again: I have something like 40TB of data, and I split it up among (I think) 14 drives so each can be one of the cheapest EBS variants designed more for cold storage (the overall utilization isn't anywhere near as high as its size would imply: it is just a lot of mostly dead data, some of which people care much more about).
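For what it's worth, the arithmetic behind splitting is just that per-volume caps aggregate when you stripe across volumes; the caps below are hypothetical placeholders rather than current EBS limits, and real setups also run into per-instance EBS ceilings that this ignores:

    # Why split one big volume into many: per-volume caps add up when striped.
    # The caps are hypothetical placeholders; check current EBS docs for real
    # numbers, and note that per-instance EBS limits (ignored here) also apply.

    TOTAL_TB = 40
    PER_VOLUME_THROUGHPUT_MBPS = 250   # hypothetical cold-storage volume cap
    PER_VOLUME_IOPS = 250              # hypothetical cold-storage volume cap

    def aggregate(n_volumes):
        return {
            "volumes": n_volumes,
            "tb_per_volume": TOTAL_TB / n_volumes,
            "throughput_mbps": n_volumes * PER_VOLUME_THROUGHPUT_MBPS,
            "iops": n_volumes * PER_VOLUME_IOPS,
        }

    print(aggregate(1))    # one 40 TB volume: capped at a single volume's limits
    print(aggregate(14))   # fourteen ~2.9 TB volumes: roughly 14x the ceiling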
Is the bad sector read rate abnormally high? Are GCE's SSDs particularly error prone? Or is the failure rate typical, but a bad sector read is just incredibly expensive?
I assume you investigated using various RAID levels to make an LRU cache acceptably reliable?
It's also surprising to me that GCE doesn't provide a suitable out-of-the-box storage solution. I thought a major benefit of the cloud was supposed to be not having to worry about things like device failures. I wonder what technical constraints are going on behind the scenes at GCE.
Yes, incredibly error prone. Bad sector reads were observed at an alarming rate over a short period, well beyond what you'd expect if you were to just buy an enterprise NVMe drive and slap it into a server.