This post is fantastic. I wish there was more workstation porn like this for those of us who are not into the RGB light show ripjaw hacksaw aorus elite novelty stuff that gamers are so into. Benchmarks in the community are almost universally focused on gaming performance and FPS.
I want to build an epic rig that will last a long time with professional grade hardware (with ECC memory for instance) and would love to get a lot of the bleeding-edge stuff without compromising on durability. Where do these people hang out online?
Thanks! If you're interested in building a Threadripper Pro WX-based system like mine, AMD apparently starts selling the CPUs independently from March 2021 onwards:
Previously you could only get this CPU when buying the Lenovo ThinkStation P620 machine. I'm pretty happy with Lenovo ThinkStations though (I bought a P920 with dual Xeons 2.5 years ago).
My only quibble with that board is that I worry about how easily the chipset fan can be replaced. In my experience that exact type of fan will inevitably fail in a moderately dusty environment... And it doesn't look like anything you could swap with the common industry-standard 40mm or 60mm 12VDC fans that come in various thicknesses.
Fortunately you can often swap the complete heatsink and fan combo on the chipset with a different one. If the mounting method is strange one can use thermal epoxy or thermal adhesive tape.
Even LinusTechTips has some decent content for server hardware, though they stay fairly superficial. And the forum definitely has people who can help out: https://linustechtips.com/
And the thing is, depending on what metric you judge performance by, the enthusiast hardware may very well outperform the server hardware. For something that is sensitive to memory, e.g., you can get much faster RAM in enthusiast SKUs (https://www.crucial.com/memory/ddr4/BLM2K8G51C19U4B) than you'll find in server hardware. Similarly, the HEDT SKUs out-clock the server SKUs for both Intel and AMD.
I have a Threadripper system that outperforms most servers I work with on a daily basis, because most of my workloads, despite being multi-threaded, are sensitive to clock speed.
No one's using "gamer NICs" for high speed networking. Top of the line "gaming" networking is 802.11ax or 10GbE. 2x200Gb/s NICs are available now.
Gaming parts are strictly single socket - software that can take advantage of >64 cores will need server hardware - either one of the giant Ampere ARM CPUs or a 2+ socket system.
If something must run in RAM and needs TB of RAM, well then it's not even a question of faster or slower. The capability only exists on server platforms.
Some workloads will benefit from the performance characteristics of consumer hardware.
Workstations and desktops are distinct market segments. The machine in the article uses a workstation platform. And the workstation processors available in that Lenovo machine clock slower than something like a 5950X mainstream processor. The RDIMMs you need to get to 1TB in that machine run much slower than the UDIMMs I linked above.
I'm with you on this, I just built a (much more modest than the article's) workstation/homelab machine a few months ago, to replace my previous one which was going on 10 years old and showing its age.
There's some folks in /r/homelab who are into this kind of thing, and I used their advice a fair bit in my build. While it is kind of mixed (there's a lot of people who build pi clusters as their homelab), there's still plenty of people who buy decommissioned "enterprise" hardware and make monstrous-for-home-use things.
Look at purchasing used enterprise hardware. You can buy a reliable X9 or X10 generation Supermicro server (rack or tower) for around a couple hundred dollars.
I've been planning to do this, but enterprise hardware seems to require a completely different set of knowledge about how to purchase and maintain it, especially as a consumer.
The barrier to entry isn't as low as with consumer desktops, but I suppose that's the point. Still, it would be nice if there were a guide to help me make good decisions when starting out.
Downside of buying enterprise for home use is noise - their turbofan coolers are insanely loud, while consumer-grade 120mm coolers (Noctua et al) are nearly silent.
Another downside is power consumption at rest. A Supermicro board with 2x Xeons uses 80 watts at minimum. Add a 10Gbit switch and a few more peripherals and you’re looking at an additional $/€80 per month electricity bill. Year after year, that adds up to roughly $/€10,000 over 10 years.
Of course that is nothing compared to what you’d pay at Google/Azure/AWS for the AMD machine of this news item :-)
12V-only PSUs like OEMs use, or ATX12VO, in combination with a motherboard without IPMI (similar to the German Fujitsu motherboards), have significantly lower power consumption at rest. Somewhere around 8-10 watts without HDDs. Much better for home use IMHO.
In the US, electricity rates are typically much cheaper than in the EU. My rate is roughly €0.08/kWh, for example, and I don't get any subsidies to convert to solar, so there's no way it would pay off for me within 15 years (longer than most people here expect to stay in a home). Other US states subsidize so heavily, or have such high electricity rates, that most people have solar panels (see: Hawaii, with among the highest electricity costs in the US).
Regardless of electricity cost, all that electricity usage winds up with a lot of heat in a dwelling. To help offset the energy consumption in the future I plan to use a hybrid water heater that can act as a heat pump and dehumidifier and capture the excess heat as a way to reduce energy consumption for hot water.
It’s mostly about the chassis though - density is important with enterprise gear, and noise level is almost irrelevant, hence small chassis with small, loud fans.
I’ve got a 16-bay 3.5” Gooxi chassis that I’ve put a Supermicro motherboard + Xeon in.
I got this specific NAS chassis because it has a fan wall with 3x 120mm fans, not because I need the bays.
With a few rather cool SSDs for storage and quiet Noctua fans it is barely a whisper.
Also - vertical rack mounting behind a closet door!
I can have a massive chassis that takes up basically no space at all. Can’t believe I didn’t figure that one out earlier...
Mostly yes, because server chassis are very compact and sometimes use proprietary connectors and fans. Still, many people have done that with good results; have a look on YouTube to see which server models are best suited for that kind of customization.
I've not been successful trying this with HPE servers. Most server fans (Foxconn/Delta) run at 2.8 amps or higher.
I'm not aware of any "silent" gaming-grade fans that use more than 0.38 amps.
That's not even considering the CFM.
Amps * Volts is power. Power is a proxy (a moderately good one) for air movement (a mix of volume/mass at a specific [back-]pressure).
It’s not likely that a silent 2W fan will move a similar amount of air as the stock 14W fans. The enterprise gear from HPE is pretty well engineered; I’m skeptical that they over-designed the fans by a 7x factor.
Operating voltage tells you “this fan won’t burn up when you plug it in”. It doesn’t tell you “will keep the components cool”.
Though I have to wonder.... would these be good gaming systems? Are there any scenarios where the perks (stupid numbers of cores, 8-channel memory, 128 PCI-E lanes, etc) would help?
Check out HardForum. Lots of very knowledgeable people on there helped me mature my hardware-level knowledge, back when I was building 4-CPU, 64-core Opteron systems. Also decent banter.
Happy to help if you want feedback. Servethehome forums are also a great resource of info and used hardware, probably the best community for your needs.
Author here: This article was intended to explain some modern hardware bottlenecks (and non-bottlenecks), but unexpectedly ended up covering a bunch of Linux kernel I/O stack issues as well :-) AMA
I just love this article. Especially when the norm is always about scaling out instead of scaling up. We can have 128-core CPUs, 2TB of memory, and PCI-E 4.0 SSDs (and soon PCI-E 5.0). We could even fit a petabyte of SSD storage in 1U.
I remember WhatsApp used to serve its 500M users with only a dozen large FreeBSD boxes (only to be taken apart by Facebook).
So Thank you for raising awareness. Hopefully the pendulum is swinging back to conceptually simple design.
>I also have a 380 GB Intel Optane 905P SSD for low latency writes
I would love to see that. Although I am waiting for someone to do a review of the Optane SSD P5800X [1]. Random 4K IOPS up to 1.5M with latency below 6µs.
>> I remember WhatsApp used to operate its 500M user with only a dozen of large FreeBSD boxes.
With 1TB of RAM you can have 256 bytes for every person on earth live in memory. With SSD either as virtual memory or keeping an index in RAM, you can do meaningful work in real time, probably as fast as the network will allow.
When I first moved to the bay area, the company that hired me asked me what kind of computer I wanted and gave me a budget (like $3000 or something)... I spent a few days crafting a parts list so I could build an awesome workstation. Once I sent it over they were like "Uh, we just meant which macbook do you want?" and kind of gave me some shade about it. They joked, so how are you going to do meetings or on call?
I rolled with it, but really wondered if they knew I could get 2x the hardware, and have a computer at home and at work, for less money than the MBP... Most people didn't seem to understand that laptop CPUs are not the same as desktop/workstation ones, especially once they hit thermal throttling.
At my last-but-one job, my boss offered me an iMac Pro; I asked if I could just have the equivalent money for hardware and he said sure.
Which is how I ended up with an absolute monster of a work machine, these days I WFH and while work issued me a Macbook Pro it sits on the shelf behind me.
Fedora on a (still fast) Ryzen/2080 and 2x4K 27" screens vs a Macbook Pro is a hilarious no brainer for me.
Upgrading soon, but can't decide whether I need the 5950X or merely want it - realistically, except for gaming I'm nowhere near tapping out this machine (and it's still awesome for that and VR, which is why the step-son is about to get, in his words, a "sick" PC).
I mean it would have been a totally valid answer to say that you intended to use a $600 laptop as effectively a thin client, and spend $2400 on a powerful workstation PC to drive remotely.
I was a MacBook Pro user for a decade+, then dropped laptops for desktop machines: first an iMac Pro, currently a 12-core Ryzen. I no longer understand why I had a laptop for so long. Status, I guess (only talking about me).
You should look at CPU usage. There is a good chance all your interrupts are hitting CPU 0. You can run hwloc to see which chiplet the PCIe cards are attached to and handle interrupts on those cores.
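For example, something along these lines (a rough sketch; the grep pattern depends on your device names):
$ grep -i nvme /proc/interrupts   # one column per CPU - shows which CPUs service the NVMe IRQs
$ lstopo-no-graphics              # from the hwloc package - shows which PCIe devices sit under which chiplet/NUMA node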
Thanks for the "hwloc" tip. I hadn't thought about that.
I was thinking of doing something like that. Weirdly I got sustained throughput differences when I killed & restarted fio. So, if I got 11M IOPS, it stayed at that level until I killed fio & restarted. If I got 10.8M next, it stayed like it until I killed & restarted it.
This makes me think that I'm hitting some PCIe/memory bottleneck, dependent on process placement (which process happens to need to move data across infinity fabric due to accessing data through a "remote" PCIe root complex or something like that). But then I realized that Zen 2 has a central IO hub again, so there shouldn't be a "far edge of I/O" like on current gen Intel CPUs (?)
But there's definitely some workload placement and I/O-memory-interrupt affinity that I've wanted to look into. I could even enable the NUMA-like-mode from BIOS, but again with Zen 2, the memory access goes through the central infinity-fabric chip too, I understand, so not sure if there's any value in trying to achieve memory locality for individual chiplets on this platform (?)
So there are two parts to CPU affinity: a) the CPU assigned to the SSD for handling interrupts, and b) the CPU assigned to fio. numactl is your friend for experimenting with changing fio affinity.
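For example, something like this (a sketch; the node number depends on your NPS/BIOS setting and the device path is an example):
$ numactl --cpunodebind=0 --membind=0 fio --name=randread --ioengine=io_uring --rw=randread --bs=4k --iodepth=32 --direct=1 --filename=/dev/nvme0n1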
The PCIe is all on a single IO die, but internally it is organized into quadrants that can produce some NUMA effects. So it is probably worth trying out the motherboard firmware settings to expose your CPU as multiple NUMA nodes, and using the FIO options to allocate memory only on the local node, and restricting execution to the right cores.
Yep, I enabled the "numa-like-awareness" in BIOS and ran a few quick tests to see whether the NUMA-aware scheduler/NUMA balancing would do the right thing and migrate processes closer to their memory over time, but didn't notice any benefit. But yep I haven't manually locked down the execution and memory placement yet. This placement may well explain why I saw some ~5% throughput fluctuations only if killing & restarting fio and not while the same test was running.
I have done some tests on AMD servers and I found that the Linux scheduler does a pretty good job.
I do however get noticeable (a couple percent) better performance by forcing the process to run on the correct numa node.
Make sure you get as many numa domains as possible in your BIOS settings.
I recommend using numactl with the cpu-exclusive and mem-exclusive flags. I have noticed a slight performance drop when the RAM cache fills beyond the sticks local to the CPUs doing the work.
One last comment is that you mentioned interrupts being "striped" among CPUs. I would recommend pinning the interrupts from one disk to one NUMA-local CPU and using numactl to run fio for that disk on the same CPU.
An additional experiment is to, if you have enough cores, pin interrupts to CPUs local to disk, but use other cores on the same numa node for fio. That has been my most successful setup so far.
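Roughly like this, as a sketch (the IRQ number, CPU and node IDs are placeholders you'd read from /proc/interrupts and sysfs; irqbalance may need to be stopped or it will undo manual affinity):
$ cat /sys/class/nvme/nvme0/device/numa_node          # which NUMA node the disk hangs off
$ grep nvme0q /proc/interrupts                        # find the disk's IRQ numbers
$ echo 4 | sudo tee /proc/irq/123/smp_affinity_list   # pin IRQ 123 to CPU 4 on that node
$ numactl --cpunodebind=0 --membind=0 fio ...         # run fio on other cores of the same node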
I have the same box, but with the 32 core CPU and fewer NVMe drives. I've not poked at all the PCIe slots yet, but all that I've looked at are in NUMA node 1. This includes the on board M.2 slots. It is in NPS=4 mode.
Mine goes only up to 2 NUMA nodes (as shown in numactl --hardware), despite setting NPS4 in BIOS. I guess it's because I have only 2 x 8-core chiplets enabled (?)
I think that in addition to allocating a queue per CPU, you need to be able to allocate a MSI(-X) vector per CPU. That shouldn't be a problem for the Samsung 980 PRO, since it supports 128 queues and 130 interrupt vectors.
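You can check the advertised vector count with lspci (the bus address and the exact output line here are just an example):
$ sudo lspci -vv -s 01:00.0 | grep -i msi-x
	Capabilities: [b0] MSI-X: Enable+ Count=130 Masked-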
Good question. I don't ever read kernel code as a starting point, only if some profiling or tracing tool points me towards an interesting function or codepath. And interesting usually is something that takes most CPU in perf output or some function call with an unusually high latency in ftrace, bcc/bpftrace script output. Or just a stack trace in a core- or crashdump.
As far as mindset goes - I try to apply the developer mindset to system performance. In other words, I don't use much of what I call the "old school sysadmin mindset", from a time where better tooling was not available. I don't use systemwide utilization or various get/hit ratios for doing "metric voodoo" of Unix wizards.
The developer mindset dictates that everything you run is an application. The JVM is an application. The kernel is an application. Postgres and Oracle are applications. All applications execute one or more threads that either run on CPU or do not run on CPU. There are only two categories of reasons why a thread does not run on CPU (is sleeping): the OS put the thread to sleep (involuntary blocking), or the thread voluntarily wanted to go to sleep (for example, it realized it can't get some application-level lock).
And you drill down from there. Your OS/system is just a bunch of threads running on CPU, sleeping and sometimes communicating with each other. You can directly measure all of these things easily nowadays with profilers, no need for metric voodoo.
I have written my own tools to complement things like perf, ftrace and the BPF stuff - as a consultant I regularly see 10+ year old Linux versions, etc. - and I find that sampling thread states from the /proc filesystem is a really good (and flexible) starting point for system performance analysis and even some drilldown - all this without having to install new software or upgrade to the latest kernels. Some of the tools I showed in my article too:
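To give the gist of the /proc sampling approach, here's a minimal sketch (not one of my actual tools - it just counts all threads on the system by state once per second):
$ while true; do grep -h '^State:' /proc/[0-9]*/task/[0-9]*/status 2>/dev/null | awk '{ s[$2]++ } END { for (k in s) printf "%s=%d ", k, s[k]; print "" }'; sleep 1; done
R means runnable/on CPU, D is uninterruptible sleep (often I/O) and S is voluntary sleep - real tools then drill down into stack traces and wait channels from there.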
Excellent article, thank you!
I really like the analysis and profiling part of the evaluation.
I also have some experience with I/O performance in Linux -- we measured 30GiB/s in a PCIe Gen3 box (shameless plug [0]).
I have one question / comment: did you use multiple jobs for the BW (large IO) experiments? If yes, then did you set randrepeat to 0? I'm asking this because fio by default uses the same sequence of offsets for each job, in which case there might be data re-used across jobs. I had verified that with blktrace a few years back, but it might have changed recently.
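For reference, I'd check it with something like this (just a sketch; the device path and sizes are examples):
$ fio --name=bw-test --ioengine=io_uring --rw=randread --bs=1M --iodepth=32 --direct=1 --numjobs=4 --randrepeat=0 --filename=/dev/nvme0n1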
Looks interesting! I wonder whether there'd be interesting new database applications on NVMe when doing I/Os as small as 512 bytes (with a more efficient "IO engine" than Linux bio, which has too high a CPU overhead with such small requests).
I mean, currently OLTP RDBMS engines tend to use 4k, 8k (and some) 16k block size and when doing completely random I/O (or, say traversing an index on customer_id that now needs to read random occasional customer orders across years of history). So you may end up reading 1000 x 8 kB blocks just to read 1000 x 100B order records "randomly" scattered across the table from inserts done over the years.
Optane persistent memory can do small, cache line sized I/O I understand, but that's a different topic. When being able to do random 512B I/O on "commodity" NVMe SSDs efficiently, this would open some interesting opportunities for retrieving records that are scattered "randomly" across the disks.
edit: to answer your question, I used 10 separate fio commands with numjobs=3 or 4 for each and randrepeat was set to default.
At Netflix, I'm playing with an EPYC 7502P with 16 NVMe drives and dual 2x100GbE Mellanox ConnectX-6 Dx NICs. With hardware kTLS offload, we're able to serve about 350Gb/s of real customer traffic. This goes down to about 240Gb/s when using software kTLS, due to memory bandwidth limits.
>we're able to serve about 350Gb/s of real customer traffic.
I still remember the post about breaking the 100Gbps barrier, that was maybe in 2016 or '17? And it wasn't that long ago it was 200Gbps, and if I remember correctly it was hitting a memory bandwidth barrier as well.
And now 350Gbps?!
So what's next? Wait for DDR5? Or moving to some memory controller black magic like POWER10?
Yes, before hardware inline kTLS offload, we were limited to 200Gb/s or so with Naples. With Rome, it's a bit higher. But hardware inline kTLS with the Mellanox CX6-DX eliminates memory bandwidth as a bottleneck.
The current bottleneck is IO related, and it's unclear what the issue is. We're working with the hardware vendors to try to figure it out. We should be getting about 390Gb/s.
> But hardware inline kTLS with the Mellanox CX6-DX eliminates memory bandwidth as a bottleneck.
For a while now I had operated under the assumption that CPU-based crypto with AES-GCM was faster than most hardware offload cards. What makes the Mellanox NIC perform better?
I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
> We're working with the hardware vendors to try to figure it out. We should be getting about 390Gb/s
Something I explained to a colleague recently is that a modern CPU gains or loses more computing power from a 1°C temperature difference in the room's air than my first four computers had combined.
You're basically complaining that you're missing a mere 10% of the expected throughput. But put in absolute terms, that's 40 Gbps, which is about 10x more than what a typical server in 2020 can put out on the network. (Just because you have 10 Gbps NICs doesn't mean you can get 10 Gbps! Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice.)
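E.g. a quick sanity check (host name is a placeholder):
$ iperf3 -s                       # on the receiving machine
$ iperf3 -c testhost -t 30        # single TCP stream from the sender
$ iperf3 -c testhost -t 30 -P 4   # 4 parallel streams - often needed to get anywhere near line rate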
> For a while now I had operated under the assumption that CPU-based crypto with AES-GCM was faster than most hardware offload cards. What makes the Mellanox NIC perform better?
> I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
It may depend on what you're sending. Netflix's use case is generally sending files. If you're doing software encryption you would load the plaintext file into memory (via the filesystem/unified buffer cache), then write the (session-specific) encrypted data into separate memory, then hand that memory to the NIC to send out.
If the NIC can do the encryption, you would load the plain text into memory, then tell the NIC to read from that memory to encrypt and send out. That saves at least a write pass, and probably a read pass. (256 MB of L3 cache on latest EPYC is a lot, but it's not enough to expect cached reads from the filesystem to hit L3 that often, IMHO)
If my guesstimate is right, a cold file would go from hitting memory 4 times to hitting it twice. And a file in disk cache would go from 3 times to once; the CPU doesn't need to touch the memory if it's in the disk cache.
Note that this is a totally different case from encrypting dynamic data that's necessarily touched by the CPU.
> You're basically complaining that you're unable to get a mere 10% of the expected throughput. But put in absolute terms, that's 40 Gbps, which is about 10x more than what a typical server in 2020 can put out on the network. (Just because you have 10 Gbps NICs doesn't mean you can get 10 Gbps! Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice)
I had no problem serving 10 Gbps of files on a dual Xeon E5-2690 (v1; a 2012 CPU), although that CPU isn't great at AES, so I think it only did 8 Gbps or so with TLS; the next round of servers for that role had 2x 10G and 2690 v3 or v4 (2014 or 2016; but I can't remember when we got them) and thanks to better AES instructions, they were able to do 20 G (and a lot more handshakes/sec too). If your 2020 servers aren't as good as my circa 2012 servers were, you might need to work on your stack. OTOH, bulk file serving for many clients can be different than a single connection iperf.
> If my guestimate is right, a cold file would go from hitting memory 4 times to hitting it twice. And a file in disk cache would go from 3 times to once; the CPU doesn't need to touch the memory if it's in the disk cache.
> I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
I assume NF's software pipeline is zero copy, so if TLS is done in the NIC data only gets read from memory once when it is DMA'd to the NIC. With software TLS you need to read the data from memory (assuming it's not already in cache, which given the size of data NF deals with is unlikely), encrypt it, then write it back out to main memory so it can be DMA'd to the NIC. I know Intel has some fancy tech that can DMA directly to/from the CPU's cache, but I don't think AMD has that capability (yet).
> Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice
Easy line rate if you crank the MTU all the way to 9000 :D
> modern CPU gains or loses more computer power from a 1° C temperature difference in the room's air
If you're using the boost algorithm rather than a static overclock, and when that boost is thermally limited rather than current limited. With a good cooler it's not too hard to always have thermal headroom.
> Easy line rate if you crank the MTU all the way to 9000 :D
In my experience jumbo frames provide at best an improvement of about 20% in rare cases, such as ping-pong UDP protocols such as TFTP or Citrix PVS streaming.
Which NICs would you recommend for me to buy for testing at least 1x100 Gbps (ideally 200 Gbps?) networking between this machine (PCIe 4.0) and an Intel Xeon one that I have with PCIe 3.0. Don't want to spend much money, so the cards don't need to be too enterprisey, just fast.
And - do such cards even allow direct "cross" connection without a switch in between?
Great article, I learned! Can you tell me if you looked into aspects of the NVMe device itself, such as whether it supports 4K logical blocks instead of 512B? Use `nvme id-ns` to read out the supported logical block formats.
Doesn't seem to support 4k out of the box? Some drives - like Intel Optane SSDs - allow changing this in firmware (and reformatting) with a manufacturer's utility...
$ lsblk -t /dev/nvme0n1
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
nvme0n1 0 512 0 512 512 0 none 1023 128 0B
$ sudo nvme id-ns -H /dev/nvme0n1 | grep Size
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)
Thanks for checking. SSD review sites never mention this important detail. For some reason the Samsung datacenter SSDs support the 4K LBA format even though they are very similar to the retail SSDs, which don't seem to. I have a retail 970 Evo that only provides 512B.
I just checked my logs, and none of Samsung's consumer NVMe drives have ever supported sector sizes other than 512B. They seem to view this feature as part of their product segmentation strategy.
Some consumer SSD vendors do enable 4kB LBA support. I've seen it supported on consumer drives from WD, SK hynix and a variety of brands using Phison or SMI SSD controllers (including Kingston, Seagate, Corsair, Sabrent). But I haven't systematically checked to see which brands consistently support it.
At least early WD Black models don't really seem to have 4K LBA support. The format option is listed, but it refuses to actually run the command to reformat the drive to the new "sector" size.
Put your system to sleep and wake it back up. (I use `rtcwake -m mem -s 10`). Power-cycling the drive like this resets whatever security lock the motherboard firmware enables on the drive during the boot process, allowing the drive to accept admin commands like NVMe format and ATA secure erase that would otherwise be rejected. Works on both the WD Black SN700 and SN750 models, doesn't seem to be necessary on the very first (Marvell-based) WD Black or the latest SN850.
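After that power-cycle trick, the reformat itself would be something like this (the LBA format index to pass depends on what `nvme id-ns -H` lists for your drive, and this destroys all data on it):
$ sudo nvme id-ns -H /dev/nvme0n1 | grep 'Data Size'   # find the index of the 4096-byte format
$ sudo nvme format /dev/nvme0n1 --lbaf=1               # example index - wipes the drive!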
I think that's the second-gen WD Black, but the first one that had their in-house SSD controller rather than a third-party controller. The marketing and packaging didn't prominently use a more specific model number to distinguish it from the previous WD Black, but on the drive's label it does say "PC SN700". Also, the first-gen WD Black was 256GB and 512GB capacities, while the later generations are 250/500/1000/2000GB. Firmware version strings for the first-gen WD Black were stuff like "B35200WD", while the SN700/720/730/750 family have versions like "102000WD" and "111110WD". So I would definitely expect your drive to require the sleep-wake cycle before it'll let you reformat to 4k sectors.
But this thread gets into details that are more esoteric than what I cover in most reviews, which are written with a more Windows-oriented audience in mind. Since I do most of my testing on Linux and have an excess of SSDs littering my office, I'm well-equipped to participate in a thread like this.
I highly recommend reddit.com/r/NewMaxx as the clearinghouse for consumer SSD news and Q&A. I'm not aware of a similarly comprehensive forum for enterprise storage, where this thread would probably be a better fit.
Regardless of what sector size you configure the SSD to expose, the drive's flash translation layer still manages logical to physical mappings at a 4kB granularity, the underlying media page size is usually on the order of 16kB, and the erase block size is several MB. So what ashift value you want to use depends very much on what kind of tradeoffs you're okay with in terms of different aspects of performance and write endurance/write amplification. But for most flash-based SSDs, there's no reason to set ashift to anything less than 12 (corresponding to 4kB blocks).
There are downsides to forcing the OS/FS to always use larger block sizes for IO. You might simply be moving some write amplification out of the SSD and into the filesystem, while losing some performance in the process. Which is why it really depends on your workload, and to some extent on the specific SSD in question. I'm not convinced that ashift=14 is a sensible one size fits all recommendation, even if we're talking only about recent-model consumer-grade NAND SSDs.
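If you do go with 4kB, a minimal ZFS sketch (pool and device names are placeholders; `zdb` is one way to verify what the pool recorded):
$ sudo zpool create -o ashift=12 tank /dev/nvme0n1
$ zdb -C tank | grep ashift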
3) Why? As a performance troubleshooter consultant+trainer, I regularly have to reproduce complex problems that show up only under high concurrency & load - stuff that you can't just reproduce in a VM in a laptop.
4) Fun - seeing if the "next gen" hardware's promised performance is actually possible!
FYI I have some videos from my past complex problem troubleshooting adventures, mostly Oracle stuff so far and some Linux performance troubleshooting:
What I find interesting about the performance of this type of hardware is how it affects the software we are using for storage.
The article talked about how the Linux kernel just can't keep up, but what about databases or KV stores? Are the trade-offs those types of solutions make still valid for this type of hardware?
RocksDB, and LSM algorithms in general, seem to be designed with the assumption that random block I/O is slow. It appears that, for modern hardware, that assumption no longer holds, and the software only slows things down [0].
I have personally found that making even the most primitive efforts at single-writer principle and batching IO in your software can make many orders of magnitude difference.
Saturating an NVMe drive with a single x86 thread is trivial if you change how you play the game. Using async/await and yielding to the OS is not going to cut it anymore. Latency with these drives is measured in microseconds. You are better off doing micro-batches of writes (10-1000 µs wide) and pushing these to disk with a single thread that monitors a queue in a busy-wait loop (sort of like the LMAX Disruptor but even more aggressive).
Thinking about high core count parts, sacrificing an entire thread to busy waiting so you can write your transactions to disk very quickly is not a terrible prospect anymore. This same ideology is also really useful for ultra-precise execution of future timed actions. Approaches in managed languages like Task.Delay or even Thread.Sleep are insanely inaccurate by comparison. The humble while(true) loop is certainly not energy efficient, but it is very responsive and predictable as long as you don't ever yield. What's one core when you have 63 more to go around?
Isn't the use or non-use of async/await a bit orthogonal to the rest of this?
I'm not an expert in this area, but wouldn't it be just as lightweight to have your async workers pushing onto a queue, and then have your async writer only wake up when the queue is at a certain level to create the batched write? Either way, you won't be paying the OS context switching costs associated with blocking a write thread, which I think is most of what you're trying to get out of here.
Right, I agree. I'd go even further and say that async/await is a great fit for a modern asynchronous I/O stack (not read()/write()). Especially with io_uring using polled I/O (the worker thread is in the kernel, all the async runtime has to do is check for completion periodically), or with SPDK if you spin up your own I/O worker thread(s) like @benlwalker explained elsewhere in the thread.
Very interesting. I'm currently designing and building a system which has a separate MCU just for timing-accurate stuff, rather than taking on the burden of realtime kernel stuff, but I never considered just dedicating a core. Then I could also use that core specifically to handle some IO queues too perhaps, so it could do double duty and not necessarily be wasteful. Thanks... now I need to go figure out why I either didn't consider that - or perhaps I did and discarded it for some reason beyond me right now. Hmm... thought-provoking post of the day for me.
The authors of the article I linked to earlier came to the same conclusions. And so did the SPDK folks. And the kernel community (or axboe :)) when coming up with io_uring.
I'm just hoping that we will see software catching up.
>Latency with these drives is measured in microseconds.
For context and to put numbers around this, the average read latency of the fastest, latest-generation PCIe 4.0 x4 U.2 enterprise drives is 82-86µs, and the average write latency is 11-16µs.
ScyllaDB had a blog post once about how surprisingly little CPU time is available to process packets on the fastest modern networks, like 40Gbit and up.
I can't find it now. I think they were trying to say that Cassandra can't keep up because of the JVM overhead and you need to be close to the metal for extreme performance.
This is similar. Huge amounts of flooding I/O from modern PCIe SSDs really closes the traditional gap between CPU and "disk".
The biggest limiter in the cloud right now is the EBS/SAN. Sure, you can use local storage in AWS if you don't mind it disappearing, but while gp3 is an improvement, it pales next to stuff like this.
Also, this is fascinating:
"Take the write speeds with a grain of salt, as TLC & QLC cards have slower multi-bit writes into the main NAND area, but may have some DIMM memory for buffering writes and/or a “TurboWrite buffer” (as Samsung calls it) that uses part of the SSDs NAND as faster SLC storage. It’s done by issuing single-bit “SLC-like” writes into TLC area. So, once you’ve filled up the “SLC” TurboWrite buffer at 5000 MB/s, you’ll be bottlenecked by the TLC “main area” at 2000 MB/s (on the 1 TB disks)."
I didn't know controllers could swap between TLC/QLC and SLC.
Hi! From ScyllaDB here. There are a few things that help us really get the most out of hardware and network IO.
1. Async everywhere - We use AIO and io_uring to make sure that your inter-core communications are non-blocking.
2. Shard-per-core - It also helps if specific data is pinned to a specific CPU, so we partition on a per-core basis. Avoids cross-CPU traffic and, again, less blocking.
3. Schedulers - Yes, we have our own IO scheduler and CPU scheduler. We try to get every cycle out of a CPU. Java is very "slushy" and though you can tune a JVM it is never going to be as "tight" performance-wise.
4. Direct-attached NVMe > networked-attached block storage. I mean... yeah.
We're making Scylla even faster now, so you might want to check out our blogs on Project Circe:
Yes, a number of articles about these newer TLC drives talk about it. The end result is that an empty drive is going to benchmark considerably differently from one that is 99% full of incompressible files.
Thanks for sharing this article - I found it very insightful. I've seen similar ideas being floated around before, and they often seem to focus on what software can be added on top of an already fairly complex solution (while LSM can appear to be conceptually simple, its implementations are anything but).
To me, what the original article shows is an opportunity to remove - not add.
Reminds me of the Solid-State Drive checkbox that VirtualBox has for any VM disks. Checking it will make sure that the VM hardware emulation doesn't wait for the filesystem journal to be written, which would normally be advisable with spinning disks.
If you think about it from the perspective of the authors of large-scale databases, linear access is still a lot cheaper than random access in a datacenter filesystem.
Plug for a post I wrote a few years ago demonstrating nearly the same result but using only a single CPU core: https://spdk.io/news/2019/05/06/nvme/
This is using SPDK to eliminate all of the overhead the author identified. The hardware is far more capable than most people expect, if the software would just get out of the way.
When I have more time again, I'll run fio with the SPDK plugin on my kit too. And I would be interested in seeing what happens when doing 512B random I/Os.
The system that was tested there was PCIe bandwidth constrained because this was a few years ago. With your system, it'll get a bigger number - probably 14 or 15 million 4KiB IO per second per core.
But while SPDK does have an fio plug-in, unfortunately you won't see numbers like that with fio. There's way too much overhead in the tool itself. We can't get beyond 3 to 4 million with that. We rolled our own benchmarking tool in SPDK so we can actually measure the software we produce.
Since the core is CPU bound, 512B IO are going to net the same IO per second as 4k. The software overhead in SPDK is fixed per IO, regardless of size. You can also run more threads with SPDK than just one - it has no locks or cross thread communication so it scales linearly with additional threads. You can push systems to 80-100M IO per second if you have disks and bandwidth that can handle it.
Yeah, that's what I wondered - I'm OK with using multiple cores; would I get even more IOPS when doing smaller I/Os? Is the benchmark tool you used part of the SPDK toolkit (and easy enough to run)?
Whether you get more IOPs with smaller I/Os depends on a number of things. Most drives these days are natively 4KiB blocks and are emulating 512B sectors for backward compatibility. This emulation means that 512B writes are often quite slow - probably slower than writing 4KiB (with 4KiB alignment). But 512B reads are typically very fast. On Optane drives this may not be true because the media works entirely differently - those may be able to do native 512B writes. Talk to the device vendor to get the real answer.
For at least reads, if you don't hit a CPU limit you'll get 8x more IOPS with 512B than you will with 4KiB with SPDK. It's more or less perfect scaling. There's some additional hardware overheads in the MMU and PCIe subsystems with 512B because you're sending more messages for the same bandwidth, but my experience has been that it is mostly negligible.
The benchmark builds to build/examples/perf and you can just run it with -h to get the help output. Random 4KiB reads at 32 QD to all available NVMe devices (all devices unbound from the kernel and rebound to vfio-pci) for 60 seconds would be something like:
perf -q 32 -o 4096 -w randread -t 60
You can specify to test only specific devices with the -r parameter (by BUS:DEVICE:FUNCTION, essentially). The tool can also benchmark kernel devices. Using -R will turn on io_uring (otherwise it uses libaio), and you simply list the block devices on the command line after the base options.
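Something like this, I believe (device paths are examples):
$ perf -q 32 -o 4096 -w randread -t 60 -R /dev/nvme0n1 /dev/nvme1n1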
Yeah, this has been going on for a while. Before SPDK it was done with custom kernel bypasses and fast InfiniBand/FC arrays. I was involved with a similar project in the early 2000s, where at the time the bottleneck was the shared Xeon bus, and then it moved to the PCIe bus with Opterons/Nehalem+. In our case we ended up spending a lot of time tuning the application to avoid cross-socket communication as well, since that could become a big deal (of course after careful card placement).
But SPDK has a problem you don't have with bypasses and io_uring, in that it needs the IOMMU enabled, and that can itself become a bottleneck. There are also issues for some applications that want to use interrupts rather than poll everything.
What's really nice about io_uring is that it sort of standardizes a large part of what people were doing with bypasses.
Yeah, I used the formally incorrect GB in the title when I tried to make it look as simple as possible... GiB just didn't look as nice in the "marketing copy" :-)
I may have missed using the right unit in some other sections too. At least I hope that I've conveyed that there's a difference!
> For final tests, I even disabled the frequent gettimeofday system calls that are used for I/O latency measurement
I was knocking up some profiling code and measured the performance of gettimeofday as a proof-of-concept test.
The performance difference between running the test on my personal desktop Linux VM versus running it on a cloud instance Linux VM was quite interesting (cloud was worse)
I think I read somewhere that cloud instances cannot use the VDSO code path because your app may be moved to a different machine. My recollection of the reason is somewhat cloudy.
Does anyone have advice on optimizing a Windows 10 system? I have a Haswell workstation (E5-1680 v3) that I find reasonably fast and that works very well under Linux. In Windows, I get lost. I tried to run the UserBenchmark suite, which told me I'm below median for most of my components. Is there any good advice on how to improve that? Which tools give good insight into what the machine is doing under Windows?
I'd like first to try to optimize what I have, before upgrading to the new shiny :).
Have you checked whether using the fio options (--iodepth_batch_*) to batch submissions helps? Fio doesn't do that by default, and I found that it can be a significant benefit.
In particular, submitting multiple requests at once can amortize the cost of ringing the NVMe doorbell (the expensive part, as far as I understand it) across multiple requests.
I tested various fio options, but didn't notice this one - I'll check it out! It might explain why I still kept seeing lots of interrupts raised even though I had enabled the I/O completion polling instead, with io_uring's --hipri option.
edit: I ran a quick test with various IO batch sizes and it didn't make a difference - I guess because thanks to using io_uring, my bottleneck is not in IO submission, but deeper in the block IO stack...
I think on recent kernels, using the hipri option doesn't get you interrupt-free polled IO unless you've configured the nvme driver to allocate some queues specifically for polled IO. Since these Samsung drives support 128 queues and you're only using a 16C/32T processor, you have more than enough for each drive to have one poll queue and one regular IO queue allocated to each (virtual) CPU core.
It's terribly documented :(. You need to set nvme.poll_queues to the number of queues you want before the disks are attached, i.e. either at boot, or you need to set the parameter and then cause the NVMe devices to be rescanned (you can do that in sysfs, but I can't immediately recall the steps with high confidence).
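Roughly something like this, I think (the sysfs remove/rescan dance is one way to re-probe the device; exact steps may vary by kernel):
# at boot: add nvme.poll_queues=8 to the kernel command line, or at runtime:
$ echo 8 | sudo tee /sys/module/nvme/parameters/poll_queues
$ echo 1 | sudo tee /sys/class/nvme/nvme0/device/remove   # detach the PCIe device...
$ echo 1 | sudo tee /sys/bus/pci/rescan                   # ...and re-probe it with poll queues allocated
$ cat /sys/block/nvme0n1/queue/io_poll                    # should now report 1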
> I tested various fio options, but didn't notice this one - I'll check it out! It might explain why I still kept seeing lots of interrupts raised even though I had enabled the I/O completion polling instead, with io_uring's --hipri option.
I think that should be independent.
> edit: I ran a quick test with various IO batch sizes and it didn't make a difference - I guess because thanks to using io_uring, my bottleneck is not in IO submission, but deeper in the block IO stack...
It probably won't get you drastically higher speeds in an isolated test - but it should help reduce CPU overhead. E.g. on one of my SSDs
fio --ioengine io_uring --rw randread --filesize 50GB --invalidate=0 --name=test --direct=1 --bs=4k --numjobs=1 --registerfiles --fixedbufs --gtod_reduce=1 --iodepth 48
uses about 25% more CPU than when I add --iodepth_batch_submit=0 --iodepth_batch_complete_max=0. But the resulting iops are nearly the same as long as there are enough cycles available.
This is via filesystem, so ymmv, but the mechanism should be mostly independent.
Nice follow-up @ttanelpoder to "RAM is the new disk" (2015)[1], which we talked about not even two weeks ago!
I was quite surprised to hear in that thread that AMD's Infinity Fabric was so oversubscribed. There's 256GB/s of PCIe on a 1P, but it seems like this 66GB/s is all the fabric can do. A little under a 4:1 oversubscription!
I'd been going off this link[1] from the previous "RAM is the new disk" thread, but I think last time I read it I'd only counted one Infinity Fabric Inter-Socket link on the 1P diagram (which provides the PCIe). On review, I'm willing to bet that the PCIe lanes aren't all sharing the one IFIS. The diagram is there to give an idea, not to show the actual configuration.
I love articles like these, taking a deep dive into achieving the absolute best on neglected metrics like IO. I am trying to get very high resolution (~ 2 * 10^6 samples/s) voltage and current measurements of a sensor for a control system. Has anyone tried that? Should it be done through PCIe?
When I bought a bunch of NVMe drives, I was disappointed with the maximum speed I could achieve with them, given my knowledge and the time I had available back then. Thanks for making this post; it gives me more points of insight into the problem.
I'm on the same page with your thesis that "hardware is fast and clusters are usually overkill," and disk I/O was a piece that I hadn't really figured out yet despite making great strides in the software engineering side of things. I'm trying to make a startup this year and disk I/O will actually be a huge factor in how far I can scale without bursting costs for my application. Good stuff!
> Shouldn’t I be building a 50-node cluster in the cloud “for scalability”? This is exactly the point of my experiment - do you really want to have all the complexity of clusters or performance implications of remote storage if you can run your I/O heavy workload on just one server with local NVMe storage?
Anyone have a story to share about their company doing just this? "Scale out" has basically been the only acceptable answer across most of my career. Not to mention High Availability.
You can get high availability without a "distributed system"; just an active/passive failover cluster may be enough for some requirements. Even failover (sometimes seamless) on a VMware cluster can help with planned maintenance scenarios without downtime, etc.
Another way of achieving HA together with satisfying disaster recovery requirements is replication (either app-level or database log replication, etc). So, no distributed system is necessary unless you have legit scaling requirements.
If you work on ERP-like databases for traditional Fortune 500-like companies, few people run such "sacred monolith" applications on modern distributed NoSQL databases, it's all Oracle, MSSQL or some Postgres nowadays. Data warehouses used to be all Oracle, Teradata too - although these DBs support some cluster scale-out, they're still "sacred monoliths" from a different era (they are still doing - what they were designed for - very well). Now of course Snowflake, BigQuery, etc are taking over the DW/analytics world for new greenfield projects, existing systems usually stay as they are due to lock-in & extremely high cost of rewriting decades of existing reports and apps.
U.2 form factor drives (also NVMe protocol) can still achieve higher IOPS (particularly writes) than M.2 form factor drives (especially M.2 2280), with higher durability, but you'll need your own controllers, which are sparse on the market for the moment. Throughput (MB/sec, not IOPS) will be about the same, but the U.2 drives can sustain it for longer.
U.2 means more NAND to parallelize over, more spare area (and higher overall durability), potentially larger DRAM caches, and a far larger area to dissipate heat. Plus it has all the fancy bleeding-edge features you aren't going to see on consumer-grade drives.
-- -----
The big issue with U.2 for "end user" applications like workstations is you can't get drivers from Samsung for things like the PM1733 or PM9A3 (which blow the doors off the 980 Pro, especially for writes and $/GB, plus other neat features like Fail-In-Place) unless you're an SI, in which case you also co-developed the firmware. The same goes for SanDisk, KIOXIA and other makers of enterprise SSDs.
The kicker is that enterprise U.2 drives are about the same $/GB as SATA drives, but being NVMe PCIe 4.0 x4, they blow the doors off just about everything. There's also the EDSFF, NF1 and now E1.L form factors, but U.2 is very prevalent. Enterprise SSDs are attractive as that's where the huge volume is (hence the low $/GB), but end-user support is really limited. You can use "generic drivers", but you won't see anywhere near the peak performance of the drives.
The good news is both Micron and Intel have great support for end users, where you can get optimized drivers and updated firmware. Intel has the D7-P5510 probably hitting VARs and some retail sellers (maybe Newegg) within about 60 days. Similar throughput to the Samsung drives, far more write IOPS (especially sustained), lower latencies, FAR more durability (with a big warranty), far more capacity, and not too bad a price (looking like ~$800USD for 3.84TB with ~7.2PB of warrantied writes over 5 years).
-- -----
My plan once Genesis Peak (Threadripper 5XXX) hits is four 3.84TB Intel D7-P5510s in RAID10, connected to a HighPoint SSD7580 PCIe 4.0 x16 controller. Figure ~$4,000 for a storage setup of ~7.3TB usable space after formatting, ~26GB/sec peak reads, ~8GB/sec peak writes, with 2.8M 4K read IOPS, 700K 4K write IOPS, and ~14.3PB of warrantied write durability.
How would a model-specific driver for something that speaks NVMe even work? Is it for Linux? Is it open? Is it just modifications to the stock Linux NVMe driver that take some drive specifics into account? Or is it some stupid proprietary NVMe stack?
I think he may have meant you can't get the drives, not the drivers. Samsung, Kioxia, etc. enterprise NVMe SSDs work fine with standard Linux NVMe drivers and I don't think they offer custom NVMe drivers except possibly for Windows. The problem is that their enterprise drives mostly aren't sold at retail. If you aren't buying them as part of a big B2B deal, you simply can't acquire the hardware.
It was less than a quarter century ago, in 1997, that Microsoft and Compaq launched the TerraServer, which was a wordplay on terabyte -- it stored a terabyte of data and it was a Big Deal. Today that's not storage, that's main RAM, unencumbered by NUMA.
Great article. Did you consider doing Optane tests? I built a 3990X workstation with all Optane drives and I get blazing fast access times, but 3GB/s top speeds. It might be interesting to look at them for these tests, especially in time-sensitive scenarios.
I have 2 Optane 905P M.2 cards and I intend to run some database engine tests, putting their transaction logs (and possibly temporary spill areas for sorts, hashes) on Optane.
When I think about Optane, I think about optimizing for low latency where it's needed and not that much about bandwidth of large ops.
Lovely article, zero fluff, tons of good content and modest to boot. Thank you for this write-up, I'll pass it around to some people who feel that the need for competent system administration skills has passed.
I wonder, is the increasing temperature of the M.2 NVMe disks affecting the measured performance? Or is the P620 cooling system efficient enough to keep the temperature of that many disks low?
Both quad-SSD adapters had fans on them and the built-in M.2 drives had heatsinks, right in front of one large chassis fan & air intake. I didn't measure the SSD temperatures, but the I/O rate didn't drop over time. I was bottlenecked by CPU when doing small I/O tests; I monitored the current MHz from /proc/cpuinfo to make sure that the CPU speeds didn't drop below their nominal 3.9 GHz (and they didn't).
Btw, even the DIMMs have dedicated fans and enclosure (one per 4 DIMMs) on the P620.
The ASUS one doesn't have its own RAID controller nor PCIe switch onboard. It relies on the motherboard-provided PCIe bifurcation and if using hardware RAID, it'd use AMD's built-in RAID solution (but I'll use software RAID via Linux dm/md). The HighPoint SSD7500 seems to have a proprietary RAID controller built in to it and some management/monitoring features too (it's the "somewhat enterprisey" version)
The HighPoint card doesn't have a hardware RAID controller, just a PCIe switch and an option ROM providing boot support for their software RAID.
PCIe switch chips were affordable in the PCIe 2.0 era when multi-GPU gaming setups were popular, but Broadcom decided to price them out of the consumer market for PCIe 3 and later.
This article focuses on IOPS and throughput, but what is also important for many applications is I/O latency, which can be measured with ioping (apt-get install ioping). Unfortunately, even 10x PCIe 4.0 NVMe do not provide any better latency than a single NVMe drive. If you are constrained by disk latency then 11M IOPS won't gain you much.
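E.g. a quick check (the device path is an example; the default mode is reads):
$ sudo ioping -c 10 -D /dev/nvme0n1    # -D = direct I/O, bypassing the page cache
$ sudo ioping -c 10 -D /mnt/somedir    # or against a directory on the filesystem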
> Does this come up in practice? What kind of use cases suffer from disk latency?
One popular example is HFT.
And from my experience on a desktop PC it is better to disable swap and have the OOM killer do its work, instead of swapping to disk, which makes my system noticeably laggy, even with a fast NVMe drive.
I'm somewhat curious what happens to the long-standing 4P/4U servers from companies like Dell and HP. Ryzen/EPYC has really made going past 2P/2U a much rarer need.
Indeed, 128 EPYC cores in 2 sockets (with total 16 memory channels) will give a lot of power. I guess it's worth mentioning that the 64-core chips have much lower clock rate than 16/32 core ones though. And with some expensive software that's licensed by CPU core (Oracle), you'd want faster cores, but possibly pay a higher NUMA price when going with a single 4 or 8 sockets machine for your "sacred monolith".
At least when I was actively looking at hardware (2011-2018), 4 socket Xeon was available off the shelf, but at quite the premium over 2 socket Xeon. If your load scaled horizontally, it still made sense to get a 2P Xeon over 2x 1P Xeon, but 2x 2P Xeon was way more cost efficient than a 4P Xeon. 8P or 16P seemed to exist, but maybe only in catalogs.
I'm not really in the market anymore, but Epyc looks like 1P is going to solve a lot of needs, and 2P will be available at a reasonable premium, but 4P will probably be out of reach.
There always seems to be buyers for more exotic high end hardware. That market has been shrinking and expanding, well since the first computer, as mainstream machines become more capable and people discover more uses for large coherent machines.
But users of 16 socket machines, will just step down to 4 socket epyc machines with 512 cores (or whatever). And someone else will realize that moving their "web scale" cluster from 5k machines, down to a single machine with 16 sockets results in lower latency and less cost. (or whatever).
Would love to see some very dense blade-style Ryzen offerings. The 4x 2P-nodes-in-2U format is great. A good way to share power supplies, fans, and chassis, and ideally a multi-homed NIC too.
Turn those sleds into blades though, put 'em on their side, and go even denser. It should be a way to save costs, but density alas is a huge upsell, even though it should be a way to scale costs down.
You might be able to buy a smaller server, but the rack density doesn't necessarily change. You still have to worry about cooling and power, so lots of DCs would have 1/4 or 1/2 racks.
Sure. I wasn't really thinking of density, just the interesting start of the "death" of 4 socket servers. Being an old-timer, it's interesting to me because "typical database server" has been synonymous with 4P/4U for a long, long time.
I've been thinking about this. Would traditional co-location (e.g. 2x 2U from DELL) in a local data center be cheaper if e.g. you're serving local (country-wise) market?
Depends on how long you need the server, and the ownership model you've chosen to pursue for it.
If you purchase a server and stick it in a co-lo somewhere, and your business plans to exist for 10+ years — well, is that server still going to be powering your business 10 years from now? Or will you have moved its workloads to something newer? If so, you'll probably want to decommission and sell the server at some point. The time required to deal with that might not be worth the labor costs of your highly-paid engineers. Which means you might not actually end up re-capturing the depreciated value of the server, but instead will just let it rot on the shelf, or dispose of it as e-waste.
Hardware leasing is a lot simpler. When you lease servers from an OEM like Dell, there's a quick, well-known path to getting the EOLed hardware shipped back to Dell and the depreciated value paid back out to you.
And, of course, hardware renting is simpler still. Renting the hardware of the co-lo (i.e. "bare-metal unmanaged server" hosting plans) means never having to worry about the CapEx of the hardware in the first place. You just walk away at the end of your term. But, of course, that's when you start paying premiums on top of the hardware.
Renting VMs, then, is like renting hardware on a micro-scale; you never have to think about what you're running on, as — presuming your workload isn't welded to particular machine features like GPUs or local SSDs — you'll tend to automatically get migrated to newer hypervisor hardware generations as they become available.
When you work it out in terms of "ten years of ops-staff labor costs of dealing with generational migrations and sell-offs" vs. "ten years of premiums charged by hosting rentiers", the pricing is surprisingly comparable. (In fact, this is basically the math hosting providers use to figure out what they can charge without scaring away their large enterprise customers, who are fully capable of taking a better deal if there is one.)
> If you purchase a server and stick it in a co-lo somewhere, and your business plans to exist for 10+ years — well, is that server still going to be powering your business 10 years from now? Or will you have moved its workloads to something newer?
Which, if you have even the remotest fiscal competence, you'll have funded by using the depreciation of the book value of the asset after 3 years.
The Linux page cache doesn't use hugepages, but when doing direct I/O into application buffers it would definitely make sense to use hugepages for that. I plan to run tests on various database engines next - and many of them support using hugepages (for shared memory areas at least).
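A quick way to experiment with that in fio, as a sketch (sizes and the device path are arbitrary; the huge pages need to be reserved first):
$ sudo sysctl vm.nr_hugepages=1024     # reserve 1024 x 2MB huge pages
$ fio --name=hugepage-test --ioengine=io_uring --rw=randread --bs=4k --iodepth=32 --direct=1 --iomem=shmhuge --filename=/dev/nvme0n1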