At the last user conference for some iSeries-based software that we run, IBM had...

thoughtpolice · on Aug 17, 2015

I think the horsepower on those machines shouldn't be underestimated, because they are not entirely as equivalent as you think... I was thoroughly surprised when an unoptimized (but correct!) ChaCha20/8 implementation I wrote on a 3.0GHz POWER8 little-endian machine was about as fast as the latest 3.5GHz Xeons @ AES-256 with AESNI (about 1.3cpb vs 1.0cpb IIRC, but the latter has a dedicated hardware unit for it!). On that same Xeon, the ChaCha20 code only hit somewhere around 5cpb - that's software vs silicon!

It also has 170 cores and was actually a QEMU instance (w/ hardware virtualization extensions) vs raw dedicated metal. If you're doing any kind of numerical or analytic workloads (even databases), I wouldn't throw them aside so quickly. You can even get CUDA for them these days, and certain physical addons like CAPI allow you to map and coherently share physical CPU address space with FPGAs or GPUs. If I could get those things in a reasonable workstation configuration, I'd probably go for it tbh.

(I'd be more than willing to repeat this and post some more accurate numbers if anyone cares. I also need to get around to benchmarking AESNI vs that POWER8 machine's _actual_ dedicated AES unit. The benchmark above was only flexing its vector/integer unit capabilities. ;)
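
For readers wondering why ChaCha20 exercises only the integer/vector units: its core is the quarter round, which uses nothing but 32-bit add, xor, and rotate. Below is a minimal scalar sketch in Python following RFC 7539, section 2.1. It is for illustration only and is not the implementation benchmarked above; the names `rotl32` and `quarter_round` are my own.

```python
MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarter_round(a, b, c, d):
    """ChaCha quarter round: four add/xor/rotate steps on four 32-bit words."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


# Quarter-round test vector from RFC 7539, section 2.1.1.
result = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
print([hex(word) for word in result])
# ['0xea2a92f4', '0xcb1cf8ce', '0x4581472e', '0x5881c4bb']
```

A real implementation would run several of these in parallel across SIMD lanes, which is where a wide vector unit pays off.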

ajross · on Aug 17, 2015

If you're getting a 4x difference in IPC using a crypto microbenchmark from compiled C code (i.e. it doesn't sound like you're bandwidth or I/O limited), there has to be something else at work. POWER8 is a nice core, but it's not that wide. Maybe the compiler was recognizing your operations and replacing them with AES primitives?
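
For reference, the "4x" here falls out of the cycles-per-byte figures quoted upthread (1.3 cpb on the 3.0GHz POWER8, about 5 cpb for the same code on the 3.5GHz Xeon, 1.0 cpb for AES-NI). A rough sketch of the arithmetic, assuming those from-memory numbers are accurate and single-core; `throughput_gb_per_s` is just a helper name for illustration:

```python
def throughput_gb_per_s(clock_ghz, cycles_per_byte):
    """Single-core throughput in GB/s: (cycles per ns) / (cycles per byte)."""
    return clock_ghz / cycles_per_byte


# Figures as quoted in the parent comment (stated there as "IIRC").
power8_chacha = throughput_gb_per_s(3.0, 1.3)  # ChaCha20 on POWER8
xeon_aesni = throughput_gb_per_s(3.5, 1.0)     # AES-256 with AES-NI on Xeon
xeon_chacha = throughput_gb_per_s(3.5, 5.0)    # same ChaCha20 code on Xeon

# Per-cycle gap between the two ChaCha20 runs, independent of clock speed.
per_cycle_ratio = 5.0 / 1.3

print(f"POWER8 ChaCha20: {power8_chacha:.2f} GB/s")
print(f"Xeon AES-NI:     {xeon_aesni:.2f} GB/s")
print(f"Xeon ChaCha20:   {xeon_chacha:.2f} GB/s")
print(f"Per-cycle ratio: {per_cycle_ratio:.2f}x")
```

So the per-cycle gap on identical source code is a bit under 4x, which is the number that looks suspicious.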

rdtsc · on Aug 17, 2015

Caches and memory latency/bandwidth can have serious effects as well.

ajross · on Aug 17, 2015

Yes, but at this kind of multiplier only in the case where the entire test is 100% cache-resident on one CPU and spilling on the other. Crypto stuff tends to have small working sets, so my intuition is that it's got to be something else.

throwaway2048 · on Aug 18, 2015

An ASM-optimized ChaCha20 is faster than AES-NI on newer Intel chips.

McGlockenshire · on Aug 17, 2015

> Equivalent x86 Linux servers are 1/4 the price.

You're severely underestimating the cost of dual-proc 16-core Xeons (about $3500 each for the E5-2698v3), and by the time you add memory, storage, I/O, networking, and other necessities, you're easily in the $15-20k range.

Source: I work for an integrator.
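
A quick back-of-the-envelope check of those figures, using only the numbers quoted above (two CPUs at about $3500 each, $15-20k for the complete server); `cpu_share` is a hypothetical helper for illustration:

```python
CPU_PRICE = 3500   # approximate price per E5-2698v3, as quoted above
CPU_COUNT = 2      # dual-processor server

cpu_total = CPU_PRICE * CPU_COUNT


def cpu_share(system_total):
    """Fraction of the full system price that goes to the CPUs alone."""
    return cpu_total / system_total


for system_total in (15000, 20000):
    rest = system_total - cpu_total
    print(f"${system_total}: CPUs ${cpu_total} "
          f"({cpu_share(system_total):.0%}), everything else ${rest}")
```

In other words, the processors alone are $7000, and memory, storage, I/O, and networking account for the remaining $8-13k.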

aus_ · on Aug 17, 2015

Just to clarify: P-series (Power/POWER8) Linux is not the same as what is announced here. LinuxONE runs on the System z (mainframe / s390x) platform.

rodgerd · on Aug 17, 2015

Yes, but they now share the same microarchitecture. s390x is mostly a difference in the microcode.

nickpsecurity · on Aug 17, 2015

It has a custom processor:

https://en.wikipedia.org/wiki/IBM_zEC12_%28microprocessor%29

ska · on Aug 17, 2015

"Equivalent x86 Linux servers are 1/4 the price."

That doesn't sound right to me. What are you considering an equivalent x86 machine?

nickpsecurity · on Aug 17, 2015

Hypervisor is built-in. Up to 2x faster clock per core than x86. Double the cache. Built-in decimal support is great for financial calculations. Security advantage in that just about every piece of malware and every attack tool is written for x86, with some attention shifting to ARM.
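
To illustrate why decimal arithmetic matters for money, here is a small sketch comparing binary floating point with decimal arithmetic. It uses Python's software `decimal` module as a stand-in, not the hardware decimal unit, so it shows only the correctness issue and says nothing about speed.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 or 0.20 exactly,
# so the sum picks up a tiny error.
float_sum = 0.10 + 0.20
print(float_sum == 0.30)              # False

# Decimal arithmetic represents these amounts exactly.
decimal_sum = Decimal("0.10") + Decimal("0.20")
print(decimal_sum == Decimal("0.30"))  # True
```

Errors like the first case are small per operation but add up across many ledger entries, which is why financial code prefers decimal types.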