EC2 Instance Update – C5 Instances with Local NVMe Storage (amazon.com)
108 points by jeffbarr on May 18, 2018 | 62 comments


Depending on which kernel version you are running, C5 (and M5) instances can be real sources of pain.

The disk is exposed as a /dev/nvme* block device, and as such I/O goes through a separate driver. Earlier versions of the driver capped the I/O timeout at 255 seconds. [0,1,2]

When the timeout triggers, it is treated as a hard failure and the filesystem gets remounted read-only. Meaning: if you have anything that writes intensively to an attached volume, C5/M5 instances are dangerous. We experimented with them for our early prometheus nodes. Not a good idea. Having the alerts for an entire fleet start flapping due to a seemingly nonsensical "out of disk, write error" monitoring node failure is not fun.[ß]

If you run stateless, in-memory only applications on them (preferably even without local logging), then you should be fine.

0: https://bugs.launchpad.net/ubuntu/bionic/+source/linux/+bug/...

1: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729119

2: https://www.reddit.com/r/aws/comments/7s5gui/c5_instances_nv...

ß: We handle nodes dying. The byzantine failure mode of nodes suddenly spewing wrong data is harder to deal with.


Ubuntu has a specific kernel for AWS, and partners with AWS to optimise the kernel for AWS environments. Part of that is fixing issues exactly like this. That issue was fixed as per the bug that you linked.

Some more information from the original announcement: https://blog.ubuntu.com/2017/04/05/ubuntu-on-aws-gets-seriou...

(Disclaimer: I work at Canonical/Ubuntu, if that matters)


Interesting - I’ve been using NVMe devices on Linux for a couple years now and never run into this problem. And an I/O timeout of 255 seconds seems really high to begin with. Is there frequently that much latency in the EBS storage backplane? (We also run c5.9xl instances and have not yet experienced the phenomenon you discuss.)

In any event, the instance storage is unlikely to run afoul of any such timeouts since it’s more or less directly attached (albeit virtualized) and there’s no SAN involved.


The NVMe disks are local, not remote EBS. Latency will be that of the PCI bus.


EBS is exposed as nvme devices on c5 and m5 as well, which is what I assume otterley is talking about.

Presumably the same applies to i3.metal as well, though I have not yet personally checked.


Yes, the root volume on i3.metal is exposed as NVMe and is EBS-backed.


The EC2 User Guide includes documentation on how to avoid these issues:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs...

No need for a kernel with the default set - just put it in your kernel cmdline options in grub. Make an AMI with the change after that, and you're good to go.

Edit: My apologies. I didn't read your comment carefully enough, and didn't realize you had specifically called out needing to make these changes, and that older kernel versions had the 255 maximum. Leaving the above for posterity, and so people can notice I'm a bad reader :)

Did you see these issues even with it set to 255? That seems like it should have been enough timeout if everything is working normally. Perhaps small GP2 volumes that are out of credits with a large queue depth might see this? I can't say I've run into it yet.
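For anyone who wants to check whether an instance was actually booted with such a setting, here is a minimal sketch that looks for an NVMe I/O timeout option in a kernel command line string. The parameter names `nvme_core.io_timeout` and `nvme.io_timeout` are what I understand the linked User Guide page to describe, and the helper name `find_nvme_io_timeout` is made up for illustration, so treat both as assumptions and verify against your own kernel.

```python
import re
from typing import Optional

# Assumed parameter names: "nvme_core" on newer kernels, "nvme" on older ones.
# Check your kernel's documentation before relying on either.
_TIMEOUT_RE = re.compile(r"(?:^|\s)nvme(?:_core)?\.io_timeout=(\d+)(?=\s|$)")


def find_nvme_io_timeout(cmdline: str) -> Optional[int]:
    """Return the NVMe io_timeout (seconds) found in a kernel cmdline, or None."""
    match = _TIMEOUT_RE.search(cmdline)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # /proc/cmdline holds the options the running kernel was booted with.
    try:
        with open("/proc/cmdline") as fh:
            print(find_nvme_io_timeout(fh.read()))
    except OSError:
        print("could not read /proc/cmdline")
```

A result of `None` only means the option was not passed on the command line; the driver default would then apply.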


> Did you see these issues even with it set to 255?

Oh yes we did. To make sure we didn't just suffer from a single fluke, we read through docs and changed the value to 255 after the first time. Then waited. Didn't have to wait for long, the thing broke again in less than 3 weeks.

The workload was a pretty aggressively tuned prometheus. At that point we went into compaction after two weeks, so it would have been doing _very_ heavy I/O for a few days.

Some extra details about our monitoring setup here: https://smarketshq.com/wait-what-is-my-fleet-doing-2e7b1b06f... (We have 10s scraping intervals for everything and 1s for our most critical, highly latency-sensitive services.)


I am curious, what kind of write characteristics can manage to saturate a 255s timeout on a storage device that does 10k+ IOPS and gigabytes per second of throughput? Normally writes slowing down leads to backpressure because the syscalls issuing them take longer to return.

I can imagine some crazy random access, small-record, vectored IO from large thread pools. But that's not exactly common because most software that is IO-heavy tries really hard to avoid these things.


According to the bug here: https://bugs.launchpad.net/ubuntu/bionic/+source/linux/+bug/...

It was actually filed about EBS block disks using NVMe (on these new instances, there is a hardware card that presents network EBS volumes as a PCIe NVMe device). Because this is a network block store, in certain failure cases the volumes can be unavailable for longer than this timeout.

The idea of this change is to ensure once they come back the machine is left in a usable state.

Of course, I would not expect to see this on local NVMe disks, which is what they announced: you can now get such instances with local disk as well as EBS.


Ah yeah, the article was about local NVMe, so the concern is probably not relevant here.


Unrelated: love your pwgen username


EBS probably has really terrible tail latencies.


Sorry for that. The timeout behavior on earlier kernels is a bit of a pain.

There's a lot to love about NVMe, and timeouts are not actually part of the NVMe specification itself but rather a Linux driver construct. Unfortunately, early versions of the driver used an unsigned char for the timeout value and also had a pretty short timeout for network-based storage.

As mentioned elsewhere in the thread, recent AMIs are configured to avoid this problem out of the box.
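To make the unsigned char point concrete, here is a small sketch showing why 255 is the ceiling for a one-byte unsigned value. It uses Python's `struct` module purely as an illustration and is not the driver's code; the 32-bit case is included only for contrast and is an assumption about later drivers, not something stated in this thread.

```python
import struct


def fits_in(fmt: str, value: int) -> bool:
    """Return True if value can be packed into the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


# "<B" is a 1-byte unsigned char, "<I" is a 4-byte unsigned int.
print(fits_in("<B", 255))         # largest value an unsigned char can hold
print(fits_in("<B", 256))         # one more no longer fits
print(fits_in("<I", 4294967295))  # assumed wider type, shown for contrast
```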


This is probably a big deal, if it's anything like the local SSD storage on GCE. The performance of local SSDs on Google Cloud is nearing absurd compared to anything else you can find right now.

That being said, I think one of the less compelling parts of this is that it'll probably vary quite a bit per instance type. With it being limited to C5 to start, if you have a workload that needs far better disk I/O than CPU performance, you might have to pay for CPU you don't use. That's one thing GCP really does have on AWS: better granularity.


> being limited to C5 to start

I3 and F1 instances also have NVMe SSD instance store volumes https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-inst...

Here's the full list of instances with instance storage: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Instance...


Nice, didn't know that.


GCE local SSD was the fastest thing around when first released in 2014, but hasn't improved and has since fallen behind. GCE local SSD 4k read IOPS max out at around 200k/volume and 800k/instance w/4 volumes. I haven't measured c5 local NVMe yet, but on i3, local NVMe volumes run at 100-400k/volume and up to 3.3 million IOPS on the largest instance type (i3.16xl).


Hey, you seem to know a lot about this. Can you explain why the docs seem to claim they max out at 680k/volume read IOPS and 360k/volume write IOPS? They seem to be talking about 4k blocks.

https://cloud.google.com/compute/docs/disks/performance#type...

Is Google being disingenuous? I haven't benchmarked it, but I will say that anecdotally the GCP local NVMe SSDs feel very fast for things like building software compared to other setups I've done on other clouds (AWS, UpCloud.)


The sequential write performance of local NVMe on GCP is... not great.


Would love some benchmarks, if you happen to have them. Frankly, I don't have any good benchmarks directly comparing GCP and AWS, but my current heuristic is that GCP "seems" to have better I/O and AWS "seems" to have higher frequency CPUs. Could be way off.


I did some benchmarks of AWS vs GCP storage [0] (NOT including local storage), and the short version is that Google's slow storage is slower than AWS's, and Google's fast storage is faster than AWS's. And I tested NVMe, but on vSphere, and it was very, very fast.

0: http://engineering.pivotal.io/post/gobonniego_results/


With more instance types and storage classes, EC2 offers a much wider spectrum of performance capabilities for CPU, storage (ephemeral and durable), and networking.


I don't see what this has to do with my question about benchmarks of sequential disk I/O on GCP instances with local NVMe SSDs. Did you respond to the wrong message?

Either way, I use both GCP and AWS but I am finding that I'm spending less on GCP. I don't believe every workload will be cheaper in GCP, but it is clearly competitive at least.


I'm responding to your comment "GCE seems to have better I/O and AWS seems to have higher frequency CPUs" - this isn't true. You can get faster I/O, networking and CPU with EC2 if you select the correct instance/storage and are willing to pay for it.


For seq read/write I've observed up to around 15000/2900 MB/s on i3.16xl and 2750/1400 MB/s on GCE w/4xlocal SSD


This is lacking in details:

- What OS image(s) were used? This would aid in reproducibility.

- What tool was used for benchmarking? Preferably with the configuration so I could reproduce it.

- What GCE instance is used? I assume you will bottleneck on something other than I/O if comparing a very small GCE instance to a very large AWS instance.

I will probably benchmark this on my own at some point now that my interest is piqued, but if you have already done the work to benchmark this I'd love to hear more. So far though it's too vague for me to use meaningfully. Is there a blog post or other material that is publicly accessible? I've Googled for such before but come up surprisingly empty handed.


I've just tried to benchmark, but I'm not sure how. I tried using dd to measure sequential speed and this was giving pretty good results on Google Cloud:

> dd if=/dev/zero of=/dev/disk/by-id/google-local-ssd-0 bs=8192

> 63356186624 bytes (63 GB, 59 GiB) copied, 36.6496 s, 1.7 GB/s

> dd of=/dev/zero if=/dev/disk/by-id/google-local-ssd-0 bs=8192

> 63356186624 bytes (63 GB, 59 GiB) copied, 17.5434 s, 3.6 GB/s

...But I'm perplexed by why it cuts off at 64 GB.

On Amazon I couldn't figure out how to get decent speeds at all. Comparable to my Google Cloud box, I set up an i3.xlarge, which had 4 CPUs and 1 SSD. While the dd operation was working fine, it slowed down to a paltry 430 MB/s quickly and never recovered. I doubt this is the true sequential performance of the drive, so I gave up.

I would've attempted to do multiple SSD configurations, but those seemed even harder. mdadm-based RAIDs were clearly bottlenecking somewhere other than hardware, because they were slower than writing or reading directly.

tl;dr: It turns out these new-fangled NVMe instance store setups are too complicated for me to understand.
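As a sanity check on the dd output quoted above, the rates can be recomputed from the byte count and elapsed times. This is only arithmetic on the numbers already shown; the function name is made up for illustration.

```python
def throughput_gb_per_s(bytes_copied: int, seconds: float) -> float:
    """Decimal gigabytes per second, the unit dd prints."""
    return bytes_copied / seconds / 1e9


BYTES_COPIED = 63356186624  # byte count from the dd output above

write_rate = throughput_gb_per_s(BYTES_COPIED, 36.6496)
read_rate = throughput_gb_per_s(BYTES_COPIED, 17.5434)
size_gib = BYTES_COPIED / 2**30  # binary gibibytes

print(f"write {write_rate:.1f} GB/s, read {read_rate:.1f} GB/s, {size_gib:.0f} GiB")
# prints: write 1.7 GB/s, read 3.6 GB/s, 59 GiB
```

This confirms only that the reported figures are internally consistent, not that they reflect the drive's real performance.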


You'll need to apply some SSD specific methodology to correctly characterize their performance. Writing zeros to some SSDs will trigger special logic in their flash translation layer (FTL) that will prevent the writes from hitting the flash media. They simply note in the metadata that you wrote some zeros, and reading them back will also be impossibly quick.

This isn't the case only on new-fangled NVMe devices...
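A rough way to see how "special" an all-zero buffer is compared with random data is to run both through a general-purpose compressor. This sketch uses `zlib` only as a stand-in; what a given drive's FTL actually does with zeros is not known here and is an assumption. The 1 MiB buffer size is arbitrary.

```python
import os
import zlib

BLOCK_SIZE = 1 << 20  # 1 MiB test buffer; size chosen arbitrarily


def compressed_fraction(data: bytes) -> float:
    """Compressed size divided by original size (smaller = more compressible)."""
    return len(zlib.compress(data)) / len(data)


zeros = bytes(BLOCK_SIZE)              # what `dd if=/dev/zero` would write
random_block = os.urandom(BLOCK_SIZE)  # incompressible data

print(f"zeros:  {compressed_fraction(zeros):.4f}")
print(f"random: {compressed_fraction(random_block):.4f}")
```

The zero buffer shrinks to almost nothing while the random one does not shrink at all, which is why benchmark data should look like the second case rather than the first.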


If I remember correctly, all the old EC2 instance types supported local storage. Then, as EBS grew, they disabled local storage. Now it's back again.


It was less that local storage was disabled, and more that the newer hardware types didn't have any local storage to share.


Amazing! I hope they adopt this for RDS.

EBS volumes are great and all, but not for databases where the dataset is many multiples of the working set.


I've been intrigued by the idea of running databases on EC2 instance storage for a long time. (You couldn't use RDS though, at least not today.) Putting your db on something also called "ephemeral storage" seems risky, but maybe not much riskier than putting it on plain HDDs. The big issue to me is that most instances don't come with much space. If you need more you have to scale up the whole instance (not a separate dimension like with EBS), and if you're already on the biggest instance type you're just out of luck. I guess it could be worthwhile to use separate tablespaces so you could have some data on instance storage and some on EBS. But so far I've gotten acceptable perf by RAIDing over gp2 EBS volumes (12 TB in my case), following the approach here: https://news.ycombinator.com/item?id=13842044


All of Reddit was Postgres on raided EBS up till I left in 2011 and I think still is today but I kinda hope not.

It’s totally safe to use local storage if you build it right. But those raided EBSs caused a lot of problems. In short, when one gets slow the whole volume gets slow because software raid isn’t hardware raid.

The main advantage of RDS is that they take care of the mundane redundancy for you.
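The "one slow volume slows the whole array" effect can be sketched with a deliberately simplified model: a request striped across every member finishes only when the slowest member does. The latency numbers below are invented for illustration, and real software RAID behaviour depends on stripe size, queueing, and RAID level.

```python
from typing import Sequence


def striped_request_latency_ms(member_latencies_ms: Sequence[float]) -> float:
    """A request touching every member completes when the slowest member does."""
    if not member_latencies_ms:
        raise ValueError("need at least one member volume")
    return max(member_latencies_ms)


healthy = [1.0, 1.1, 0.9, 1.0]     # made-up per-volume latencies in ms
one_slow = [1.0, 1.1, 250.0, 1.0]  # same array with one degraded volume

print(striped_request_latency_ms(healthy))
print(striped_request_latency_ms(one_slow))
```

In this model, adding more volumes to the stripe raises the odds that at least one of them is having a bad moment.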


Hey Jeremy, I’m saying they should take care of it for me.

As a database operator I treat safety on i3 similarly: I have multiple hot replicas of my data so that if any fails I’m good to go. Additionally, there isn’t any reason you couldn’t have an EBS replica of an ephemeral node.

What we typically do with i3 is mirror the data locally, replicate it, have an EBS replica, and take backups. This is probably overkill but the data needs to be both accessed quickly and secure so that’s where we are at.


I'm wondering, how do you handle failovers?

Is it automatic or manual?

On infrastructure I handled from top to bottom, I used VIPs with keepalived (only the vrrp part, with a weight linked to success/failure of a check script).

But in AWS, I'm wondering how to do it properly, maybe DNS records with low TTL (like 1 second).


We use instance store for Postgres and Cassandra now on i3 and i2 respectively.


I thought reddit went to cassandra.


Cassandra is used for some things, Postgres for the rest. Unless they went full Cassandra recently, which is possible. But for many years we ran both.


We’ve been running Postgres on i3 instances with their attached SSDs. Performance is solid and it’s cheaper too. Having up to date replicas becomes crucial, along with incremental backups (we use wal-e for that).

As you mentioned, it is limited by instance size, but for a DB that fits it works great and has fewer moving parts. Knowing that your entire database is essentially ephemeral raises the stakes too and forces you to take replication, backups and restore testing seriously.


We've been running databases on ephemeral drives for many years, the key is using a database with good replication and failover.

I don't think you should trust your data to a single disk, whether or not it's a physical device in your own datacenter or an EBS in AWS. Everything fails eventually.


Yup. We run Postgres on i3 instances on their native SSDs and it's way faster for OLAP use cases. Check out avien.io for a hosted RDS style solution that does this.


I noticed the following comment in the article:

> Encryption – Each local NVMe device is hardware encrypted using the XTS-AES-256 block cipher and a unique key. Each key is destroyed when the instance is stopped or terminated.

Does anyone know if the existing i3 EC2 instances NVMe drives are also encrypted in this fashion? I can't find any articles stating this...

Thanks!


i3 and f1 also have encrypted disks. I have found some references on blogs for this, but the only place I've seen it mentioned by AWS directly is in this presentation from re:Invent 2017: https://youtu.be/o9_4uGvbvnk?t=30m20s (at 30:25 the presenter mentions that the nitro cards "protect the underlying flash device and customer data through encryption").


Hi, it's the presenter in the video here.

The documentation at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-inst... will be updated soon with the same information.


Excellent, thank you. It would be very good to have that information in the official documentation. I've wanted to refer to something like that for compliance reasons, for example.


We use i3 instances for some workloads which are great, I just wish the CPU was as powerful as other instance types.


That makes this a perfect fit


I'm keen for 2 vCPU (but faster) with 450GB+ SSD; these new ones are a bit too expensive for that much SSD without going up in vCPU, but a good start.


It looks like i3.metal is also available - seeing them in the console


Great news! I think Amazon should provide more specs on those NVMe drives. What kind of R/W latency, max bandwidth, and low and high QD throughput can I expect?


This is really exciting to see! I had assumed that they would be launching most new instance types with only EBS storage so this is awesome that it looks like this will be coming to even more instance types too.

The bottom of the article mentions "PS – We will be adding local NVMe storage to other EC2 instance types in the months to come, so stay tuned!"


So back to instance storage we go?


Now they just need R5 instances for our RAM hungry apps. Keep the local nvme please.


I'm not seeing the use case here. What would need (very, very) performant temporary storage that you could not achieve with io volumes?


io1 volumes are expensive and not nearly as performant as local NVMe SSDs.

io1 volumes top out at 32K PIOPS per volume, and a 32K PIOPS 225GB volume would cost around $2100/month.

But an entire c5d.2xlarge with 225GB of local SSD will only cost around $283/month, and (based on i3 performance) will give you around 180K write IOPS.
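Putting those figures side by side as cost per thousand IOPS makes the gap easier to see. This uses only the approximate numbers quoted above (the 180K write IOPS is itself an estimate based on i3), and the c5d price also buys the instance's CPU and memory, so it is a rough comparison at best.

```python
IO1_MONTHLY_USD = 2100.0   # 32K PIOPS, 225GB io1 volume, as quoted above
IO1_IOPS = 32_000
C5D_MONTHLY_USD = 283.0    # whole c5d.2xlarge, as quoted above
C5D_WRITE_IOPS = 180_000   # estimate based on i3 performance


def usd_per_thousand_iops(monthly_usd: float, iops: int) -> float:
    """Monthly dollars paid for each 1,000 IOPS of capacity."""
    return monthly_usd / (iops / 1000)


io1_cost = usd_per_thousand_iops(IO1_MONTHLY_USD, IO1_IOPS)
c5d_cost = usd_per_thousand_iops(C5D_MONTHLY_USD, C5D_WRITE_IOPS)
ratio = io1_cost / c5d_cost

print(f"io1: ${io1_cost:.2f} per 1k IOPS per month")
print(f"c5d.2xlarge: ${c5d_cost:.2f} per 1k IOPS per month")
print(f"io1 is roughly {ratio:.0f}x more expensive per IOPS")
```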


Redis flash, memcached with its external storage shim, ScyllaDB, Aerospike, RocksDB, EVCache for Netflix perhaps...there's a bunch of em


Network disks really are terrible in comparison, even the super charged expensive ones (io volumes).

Local SSD is just orders of magnitude faster if you need a fast place to put temporary (or replicated) data.


It was a harebrained idea to not distinguish between local vs remote data needs. For a time Amazon pretended they could overcome physics.


Aerospike


caches


docker layer caches

jenkins workspaces



