I own a Talos II (https://www.raptorcs.com/TALOSII/) computer. It actually runs ...

cipherboy · on Aug 20, 2019

Fedora [0], Red Hat [1], Ubuntu [2], and SUSE [3] all have their own ppc64le ports as well so there are lots of choices out there if anyone is interested.

Even Gentoo has one [4][5]!

[0]: https://alt.fedoraproject.org/alt/

[1]: https://access.redhat.com/documentation/en-us/red_hat_enterp...

[2]: https://ubuntu.com/download/server/power

[3]: https://www.suse.com/products/power/

[4]: https://wiki.gentoo.org/wiki/Handbook:PPC64

[5]: https://www.gentoo.org/downloads/

classichasclass · on Aug 20, 2019

Fedora 30 on this Talos II. Works well.

_emacsomancer_ · on Aug 21, 2019

Void Linux as well: https://www.talospace.com/2019/01/void-linux-goes-power9.htm... Although it's not official at this point I don't think.

classichasclass · on Aug 21, 2019

No, though my impression is it's progressing pretty well, so I think it will get there.

voldacar · on Aug 20, 2019

I have drooled over the Talos II for quite some time...

Do you have a particular use case that makes POWER make sense over x86, or do you share my paranoia and love of non-mainstream ISAs?

einpoklum · on Aug 20, 2019

Use of GPUs. Not on the Talos II, it seems (?), but with POWER, GPUs are first-class citizens on the system, with the same NVLink 2 bus access to main memory as the CPU: 150 GB/sec in each direction! (simultaneously!)

bubblethink · on Aug 21, 2019

Judging from their wiki page, actual GPU use on Talos seems to be problematic. The CUDA use case is supported, but that bandwidth seems too high. Or are you quoting some future number? The current bandwidth on a P9 system with NVLink is closer to 30 GB/s. And I don't think Talos supports NVLink.

madez · on Aug 20, 2019

Are all accesses to the memory from the GPU still checked for permissions at the hardware level by an IOMMU?

rrss · on Aug 21, 2019

Yeah, checked for permissions in hardware, but not by an IOMMU. Requests from the GPU are forwarded to the "standard" SMMU. See http://www.ieee-hpec.org/2018/2018program/index_htm_files/13...

classichasclass · on Aug 20, 2019

I don't especially, because my Talos II is "just" my desktop. I want a computer I can trust and that I know what it's doing from the ground up. It was already the best choice for that and today's announcement made the choice even better.

Koshkin · on Aug 20, 2019

> I know what it's doing from the ground up

Do you now? There is not even a hidden embedded micro-core running a "secure operating system"?

classichasclass · on Aug 20, 2019

You can audit the firmware and build it yourself. I did it. Raptor even encourages it: https://wiki.raptorcs.com/wiki/Compiling_Firmware

The biggest problem remaining is whatever blobs are in devices. That's being rapidly worked on.

nickpsecurity · on Aug 21, 2019

It's worse than that:

https://lobste.rs/s/noed0h/day_2_keynote_openpower_blows_doo...

You can't trust any modern computer to not be subverted. So, you have to change how you use them. True secrets should be kept out of computers or rooms with technology. Go old school.

tempguy9999 · on Aug 21, 2019

> Go old school

OK. How?

_delirium · on Aug 21, 2019

Interesting. How do you find it as a desktop? I'd read in reviews that it's incredibly loud, so more suited for datacenter than office or home use, but maybe it's not as bad as I'd gathered?

classichasclass · on Aug 21, 2019

This is a very early unit (#12) and the initial firmware was indeed deafeningly loud. However, the current firmware is whisper quiet, certainly much quieter than the Quad G5 next to it (and the G5 is throttled down), and I also have super-quiet power supplies installed. I find it perfectly liveable.

avhception · on Aug 21, 2019

I've had a system with two quad-core CPUs running at 100% load under my desk for many days, whisper-quiet.

neop1x · on Aug 21, 2019

For me, the biggest problems are the long boot time of Hostboot and the lack of "suspend to RAM".

kop316 · on Aug 20, 2019

I share and respect your paranoia. I have a love of inspecting code and not having backdoors in my processor.

voldacar · on Aug 20, 2019

For sure. It's nice that open source firmware replacements have been making progress, especially since the Intel ME fiasco, but it never ceases to amaze me that right now you can go out and get a modern, ultra high performance workstation with every single chip running auditable firmware. Hopefully we will start seeing more affordable POWER systems now that it is a fully open architecture

acqq · on Aug 20, 2019

> with every single chip running auditable firmware

But disks? Isn’t their firmware closed?

tpearson-raptor · on Aug 21, 2019

What you can do with a trusted CPU domain is use FDE. FDE is standard practice for anyone even remotely concerned about security in the first place.

So the firmware that matters -- the firmware that can subvert the system due to privilege level, etc. -- is open. No other vendor aside from some lower end ARM toy SoCs can say that.

justinclift · on Aug 21, 2019

Maybe OpenSSD would be functional enough to use:

http://openssd.io

http://www.openssd-project.org/wiki/The_OpenSSD_Project

throw0101a · on Aug 21, 2019

> But disks? Isn’t their firmware closed?

Encrypt your data in-memory with a file system feature (or something like LUKS/dm-crypt) before it's sent down the SATA cable to the disk.

The NSA has gone after disk firmware:

* https://www.theregister.co.uk/2015/02/17/kaspersky_labs_equa...

voldacar · on Aug 20, 2019

Ugh I guess there always has to be an exception. Maybe you could run everything in a ramdisk? It supports up to 2TB of ram after all

marmaduke · on Aug 20, 2019

Isn't the price tag pretty amazing too?

voldacar · on Aug 20, 2019

Well yeah it is hardly cheap but for the people who can afford it, I totally get how having a computer you can trust is worth the price tag

Annatar · on Aug 21, 2019

Most people won't pay that kind of money just to tinker with it.

wolfgke · on Aug 21, 2019

> Most people won't pay that kind of money just to tinker with it.

The interesting question rather is: how many of these simply cannot afford it and how many think that this is not worth it?

marmaduke · on Aug 21, 2019

I looked at the Talos website. They're asking 2-4k$ for a 4 core (4 way SMT, so let's say 8 core when comparing to x86_64 just to be nice) dev desktop, with 8 to 16 GB ram. The same spend on a Dell Xeon workstation nets quite a bit more hardware.

avhception · on Aug 21, 2019

While not exactly cheap in absolute terms, there is the Blackbird from Raptor. It's a single-CPU board, cheaper than Talos.

shaklee3 · on Aug 21, 2019

I wouldn't drool over it. See the latest benchmarks comparing it to epyc and Intel. Power9 does pretty poorly throughout almost every test:

https://www.phoronix.com/scan.php?page=article&item=rome-pow...

tpearson-raptor · on Aug 21, 2019

Both of those processors insist on you ceding full system control to the vendor in perpetuity, with a literal "skeleton key" that lets the vendor in and keeps you out (the centrally signed, unremovable ME/PSP). If this doesn't concern you, then why are you looking at a local machine at all when a cloud system may very well be less expensive to lease than to purchase and keep current, not to mention run, local hardware? Unless you're loading the local machine 24/7, you're leaving a resource sitting idle for parts of the day without any real increase in security or control, meaning the cloud vendor can give you a cheaper experience overall by keeping hardware utilization over time high.

And no, ME cleaner does NOT (and cannot) fully remove a modern ME. The PSP "disable" toggle in the UEFI configuration does NOT disable the PSP from running during startup.

wolfgke · on Aug 21, 2019

> why are you looking at a local machine at all when a cloud system may very well be less expensive to lease than to purchase and keep current, not to mention run, local hardware?

Because a cloud machine is rented and not owned. And because of the ping latency: there is a reason why there is for example still hardly any cloud gaming.

tpearson-raptor · on Aug 21, 2019

And what exactly do you call a machine that you are, by design, cryptographically locked out of, but a third party has access to?

Put another way, would you call a car that I kept duplicate keys and retained title for, but said you could use and maintain at your sole expense for a single upfront payment, rented or owned?

Latency is being solved; Google etc. are working on that problem. I'm playing devil's advocate here, but fundamentally if you don't care about actually controlling or being able to modify something, and pricing is cheaper to rent, why own?

wolfgke · on Aug 21, 2019

> And what exactly do you call a machine that you are, by design, cryptographically locked out of, but a third party has access to?

Not a perfect solution, but such a problem can be mitigated by a firewall that blocks such ingoing/outgoing packets.

> I'm playing devil's advocate here, but fundamentally if you don't care about actually controlling or being able to modify something, and pricing is cheaper to rent, why own?

Since I love to tinker with my computers, the answer is obvious to me.

shaklee3 · on Aug 21, 2019

I think this is apples and oranges. Sure, if those kinds of things are that important to you, then POWER9 is your only option. But if performance is important, POWER9 is a long way from being the best. Most companies likely don't care about the things you're suggesting.

beezle · on Aug 21, 2019

My understanding is that it is not really useful to compare Power9 to other cpus in these types of benchmarks, that Power9 is all about computation with massive datasets, not how fast it can zip a file.

Recurecur · on Aug 22, 2019

> My understanding is that it is not really useful to compare Power9 to other cpus in these types of benchmarks, that Power9 is all about computation with massive datasets, not how fast it can zip a file.

Your understanding is wrong. For instance, running Java workloads on servers is a major Power9 use case.

The thing to remember, though, is that the Talos is only two four-core CPUs, for eight total. These benchmarks are comparing it to the Epyc 7742, which is a 64 core chip.

Naturally the Epyc will kill it on most highly threaded benchmarks. The individual cores on Power9 are quite fast, though.

fluffything · on Aug 24, 2019

> The individual cores on Power9 are quite fast, though.

Are there any benchmarks for single-thread performance that I could see?

mshook · on Aug 20, 2019

If I may ask, why did you get it? For which reasons? I like the cool non-x86 factor, but it's quite expensive...

EDIT: Forgot to mention the open argument which is quite amazing as well (I've followed what Talos does).

kop316 · on Aug 20, 2019

I tend to buy server level hardware for my own usage. It tends to last a LOT longer. With that in mind, it was roughly comparable to what I would have paid for a comparable Intel Xeon, and I like the fact that I know all of the code that runs on it (the only code that I can't actually change is the OTP memory that it first executes when it boots up, and even then you can inspect it!).

Avamander · on Aug 20, 2019

Thanks for supporting the development of open hardware and software. Very few of us can afford to do so.

kop316 · on Aug 20, 2019

They actually came out with a Blackbird, and I have been considering getting one of them to replace my server (it runs FreeNAS, but I have gotten my Debian system to run an encrypted ZFS drive).

dragontamer · on Aug 20, 2019

The 18-core has decent performance. If it weren't for AMD EPYC Rome chips coming out a month ago, I would have considered a Talos II.

18-cores with 4x SMT == 72 threads per Power9. That's a lot of threads, no matter how you look at it.

slovenlyrobot · on Aug 20, 2019

That's a whole lot of SMT. Can anyone comment on how it behaves compared to hyperthreading? I'm assuming at 4x each core must have a ton more execution units to go around

dragontamer · on Aug 20, 2019

Power9 is basically "Bulldozer done right". Each SMT4 Power9 core is incredibly fat, with 4x load/store units, 4x ALUs, and 2x vector units. Bulldozer probably would have called each SMT4 core a collection of 4 cores.

But only 1x divider, 1x crypto unit per SMT4 core.

The chief downside to Power9 is that it only supports 128-bit vectors, and these 128-bit vectors are executed by ganging-together the ALU units. (so 4x 64-bit ALUs == 2x 128-bit vectors processed per clock tick). Compared to AMD Zen (4x 128-bit pipelines), AMD Zen 2 (4x 256-bit pipelines), and Intel Skylake-X (3x 512-bit pipelines), Power9's SIMD capabilities are tiny.

Another oddity: most instructions take 2-clock ticks to execute, even simple instructions like XOR or Add. This increased latency is likely the reason why it performs so poorly with Python / PHP code.

But when code is written for Power9, it works quite well. Stockfish chess seems to work extremely well on Power9, likely because Stockfish scales to many "cores" well (fully taking advantage of SMT4), and only has 64-bit operations.

One more wildcard: Power9 has 10MB (!!!) L3 cache for every 2-cores. That's 90MB L3 cache on the 18-core. I presume that real-life database applications would benefit greatly from this oversized L3 cache.

EDIT: It should be noted that the L3 caches serve as victim-caches of other L3 caches. So Power9 core-pair 01 can have its 10MB L3 cache serve as a "L3.1 cache" of core-pair 23. AMD Zen / Zen2 L3 cache CANNOT use this functionality. So AMD Zen2 64-core may have 128MB of L3 cache, but each core only "really" can go up to 16MB of L3 cache (because the other 112MB of L3 cache is only for other cores/module)

EDIT: Also note, Power9 came out a few years ago at 14nm, while Zen2 came out on the 7nm node a month ago. I think a new 7nm Power9 update is planned, but I don't know what its timeframe is.

In effect, you could have 1-program using the entire 90MB L3 cache for itself on Power9. While AMD Zen2 requires (at minimum) 8-programs, each program using only 16MB L3. This design decision is clear in the intended use of the chips: Zen2 is clearly targeted at the cloud-market, while Power9 is big-iron / databases.
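The cache arithmetic above can be written out as a rough sketch. It uses only the figures quoted in this comment (10MB per core pair on an 18-core Power9, 128MB total and 16MB reachable per core on a 64-core Zen2); the function name `l3_reach` is made up for illustration.

```python
def l3_reach(total_l3_mb, reachable_mb):
    """Return (MB of L3 one program can reach, programs needed to use all L3)."""
    programs_needed = total_l3_mb // reachable_mb
    return reachable_mb, programs_needed


# Power9, 18 cores: 10 MB of L3 per core pair, 9 pairs.
power9_total = (18 // 2) * 10

if __name__ == "__main__":
    # With victim caching, one program can in principle reach every slice.
    print("Power9:", l3_reach(power9_total, power9_total))
    # Zen2, 64 cores: 128 MB in total, but a core only reaches 16 MB.
    print("Zen2:  ", l3_reach(128, 16))
```

This is where the "at minimum 8 programs" figure comes from: 128MB divided into 16MB domains.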

--------

Unfortunately, most of the benchmarks these days show that AMD EPYC / Rome is just the better overall processor. Still, 18-core Power9 is relatively cheap: a complete 18-core / 72-thread system for $4000ish: https://secure.raptorcs.com/content/TLSDS3/purchase.html

Cheap for Power9 anyway. AMD EPYC is also relatively cheap. You can get a 16-core / 32-thread / 32MB L3 cache AMD Ryzen 9 3950x for only $700 these days (and maybe a complete system build for only $2500).

phire · on Aug 20, 2019

I don't think "bulldozer done right" is the correct way to describe POWER9.

I see it more as a single big massively wide OoO core with 23 execution units (putting skylake's 10 execution units to shame). The slices are more there for design reasons, to simplify the design process by making it more symmetrical.

Bulldozer is clearly two integer cores sharing some execution units between them, a thread can only exist on one of the two integer units.

In contrast, a thread on POWER9 can simultaneously use all 4 slices, all 23 execution units. The dispatcher can dynamically mix and match which slice it's sending a thread's instruction stream to based on slice utilization.

That single difference puts it in a completely different class of CPU architecture from Bulldozer.

dragontamer · on Aug 20, 2019

> In contrast, a thread on POWER9 can simultaneously use all 4 slices

My reading of the documentation is different.

> The most significant partitioning related to threads occurs when more than two threads are active, placing the core in SMT4 mode. In SMT4 mode, the decode/dispatch pipeline, shown in the blue shaded area in Figure 25-1 on page 321, is split into two pipelines, each pipeline is three iops wide and each pipeline serves two threads. The split decode/dispatch pipes each feed one of the two superslices, shown in the green shaded box in Figure 25-1, providing two execution slices for each pair of threads. The branch slice and LS-slices are shared between all threads.

Page 322 of 496: https://ibm.ent.box.com/s/8uj02ysel62meji4voujw29wwkhsz6a4

-------

The left superslice serves 2 threads, while the right superslice serves the other 2 threads. All 4 threads are "behind" the singular decoder.

It seems very "Bulldozer-esque" to me, especially in SMT4 mode.

---------

You are correct in that there is an SMT1 mode where one-thread could potentially utilize the entire processor. But with 2-latency on even Add / XOR instructions (see Appendix A), I don't foresee SMT1 code to be very useful on Power9. The processor is clearly designed to run most effectively on SMT2 or SMT4 modes.

I'm not even sure how easy or hard it is to switch between SMT1, SMT2, and SMT4 modes. I don't think Linux can switch cores while running, and may need to reboot for instance. Maybe AIX can switch between the modes on the fly?

I guess if your code has enough Instruction Level Parallelism (ILP) available in its code stream, it could benefit from SMT1 mode. But I'd imagine that most 64-bit CPU-code wouldn't have much ILP.

phire · on Aug 20, 2019

It's worth noting that in SMT2 mode, it's still 2 threads dynamically scheduled across all 4 slices.

It's only in SMT4 mode that it starts statically partitioning the threads onto superslices. Even then, it's two threads sharing two slices.

I assume the static partitioning is an optimisation, and that performance increases due to the split L1d caches (and I'm guessing there is a delay cycle when one slice depends on data from another; I haven't read the documentation that closely).

It's the fact that threads can be dynamically scheduled across all four slices which makes it "not bulldozer" in my mind, and I don't think the presence of a mode that does statically partition superslices should make it "like bulldozer", even if that is the most common mode. It's just an optimisation.

> I'm not even sure how easy or hard it is to switch between SMT1, SMT2, and SMT4 modes.

Ideally, the CPU core would dynamically drop down to SMT1 or SMT2 mode whenever the extra threads are executing idle instructions.

dragontamer · on Aug 21, 2019

> It's the fact that threads can be dynamically scheduled across all four slices which makes it "not bulldozer" in my mind, and I don't think the presence of a mode that does statically partition superslices should make it "like bulldozer", even if that is the most common mode. It's just an optimisation.

Well, it's certainly a Bulldozer-like mode of operation :-)

Power9 is obviously a very different chip than Bulldozer. So I guess it all comes down to opinion, whether or not the chip is similar enough to warrant a comparison.

ivl · on Aug 20, 2019

> EDIT: Also note, Power9 came out a few years ago at 14nm, while Zen2 came out on the 7nm node a month ago. I think a new 7nm Power9 update is planned, but I don't know what its timeframe is.

I believe 7nm POWER10 will be the next move, they had announced Samsung as the partner for their next chips back in December if I remember right.

shaklee3 · on Aug 21, 2019

The Power9 deceptively "came out a few years ago". But in reality, it didn't. The only ones available for a year or so were demo units at IBM. The rest were being promoted as part of the Summit supercomputer. Just like AMD's MI50/60 has been available since November 2018. But try to search/buy one. Good luck...

cptnapalm · on Aug 20, 2019

I have a 2009 Mac Pro with dual 3.2 GHz hexcore Xeons (so 24 threads) and 2 older GPUs and 48 GB RAM that cost less than $700 for the whole thing. I'm beginning to think I lucked out on it more than I already thought I did.

dragontamer · on Aug 20, 2019

Each Nehalem hexcore Xeon is (EDIT) ~120 Watts of power, so your computer will be drawing well over 300W under load, maybe over 500W. (I mean, Mac Pro 2009 has a 1200W PSU. I presume it's expecting to use around half of that power)

The Power9 18-core / 72-thread is going to come in at under 150W total.

The main advancement the past decade has been in power-efficiency. Cloud-scale providers keep their computers running at max load as well, so 500W does add up over months / years into a sizable amount of money.

Especially when you consider that 500W computer needs 500W of Air-conditioning, so the "True cost" of a 500W computer is roughly ~1200W or so (500W from the computer, 700W to power an air-conditioner to move 500W of heat)

----------

A 12-core / 24-thread AMD Ryzen 3900x is just $500, with a total system cost under $1500. The big advantage of a Ryzen 3900x would be a max clock-rate of 4.7 GHz, while your Nehalem 2009 computer is... what? 2.5 GHz? Probably? And computers of that age didn't have deep sleep capabilities, wasting even more power than usual. Modern computers idle at 20W, even servers and desktops. Tons of power-saving features these days which add up.

I think a typical $1500 computer these days would be more than twice as fast with 1/4th the power usage. I don't think anybody seriously in this hobby should be using anything as old as Nehalem these days.

IMO, the price/performance sweet spot for "old computers" seems to be Haswell (~2014 era servers), if people want to buy old equipment. But 2009 is definitely too old; there are lots of used servers that are a little bit more expensive but a LOT more power efficient / faster in practice.

gpm · on Aug 20, 2019

> Especially when you consider that 500W computer needs 500W of Air-conditioning, so the "True cost" of a 500W computer is roughly ~1200W or so (500W from the computer, 700W to power an air-conditioner to move 500W of heat)

I thought air conditioners/heat pumps were supposed to be substantially better than 1w of heat moved outside per watt of electricity?

dragontamer · on Aug 20, 2019

Hmm... a typical home Air Conditioner is 15 to 20 SEER, which apparently means 15 to 20 BTU/hr of cooling per Watt of input.

15 BTU/hr is about 4.4 Watts, so that is roughly 4 to 5 Watts of cooling per Watt of input.

So it appears you are correct. To move 500W of heat, you only need roughly 100W of air conditioner power.
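For anyone who wants to redo the numbers, here is a minimal sketch of the conversion. It assumes SEER can be read as BTU/hr of cooling per Watt of input (it is really a seasonal average, so treat the result as a ballpark), and the function names are made up for illustration.

```python
BTU_PER_HOUR_IN_WATTS = 0.29307107  # 1 BTU/hr expressed in watts


def seer_to_cop(seer):
    """Convert a SEER rating (BTU/hr per watt) to watts of heat moved per watt."""
    return seer * BTU_PER_HOUR_IN_WATTS


def cooling_power_watts(heat_watts, seer):
    """Electrical power an air conditioner draws to move heat_watts of heat."""
    return heat_watts / seer_to_cop(seer)


if __name__ == "__main__":
    for seer in (15, 20):
        print(f"SEER {seer}: COP ~{seer_to_cop(seer):.1f}, "
              f"~{cooling_power_watts(500, seer):.0f} W to move 500 W of heat")
```

Under those assumptions a 500W computer costs somewhere between 85W and 115W of extra cooling power, not 700W.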

cptnapalm · on Aug 20, 2019

The Mac Pro is a 4,1 flashed to a 5,1 and uses Westmere 3.3 GHz CPUs, but your point of power consumption is taken. As I can't possibly afford a $1500 PC, I'm still happy with what I've got. A multiseat desktop/server I could afford that is pretty happy with whatever I've thrown at it is a lot better than a bare CPU sitting idly on my desk.

dragontamer · on Aug 20, 2019

If $600 or $700 is your budget, my main point was to look for Haswell (2014-era) systems.

For example, the Dell PowerEdge R630 (2014-era) server is in and around $600 to $1000 on Ebay, and will be more power-efficient and faster than any 2009-era system.

I think 2014-era servers are where the price/performance point is for the home-server enthusiast, especially if we're talking about sub $1000 price points.

https://www.ebay.com/itm/Dell-Poweredge-R630-2x-Xeon-E5-2640...

2x8 core dual socket Intel Xeon E5-2640 v3 (Haswell) with 64GB of RAM. It's an auction, so it will probably go up another $100 or $200 from there, but I would expect it to sell well south of $1000.

2014-era equipment is the current price/performance king for home hobbyists. Obviously, a modern desktop with all the bells and whistles is a bit more expensive at $1500, but for $600 to $700, you can get a pretty good 2014-era system.

-------

My rule of thumb is to buy something 5-years out of date. That's roughly the time when businesses get rid of old equipment and upgrade. So 5-years old equipment tends to win in price/performance.

cptnapalm · on Aug 20, 2019

I did, oddly enough, look at used PowerEdge servers, but I wanted a multiseat desktop too, so the step-son and I could play games together at the same time. Less than $700 bought the Mac Pro, 2 video cards, 48 GB of RAM (3 sticks) and, not included in my original equipment tally, a 4 TB SSD and 24" AOC monitor. The bare Mac Pro was $250. As I got it early last year, the 5 year rule of thumb almost applied as a 2009 and a 2012 Mac Pro are nearly identical, the former being able to just be flashed to the latter. In another couple of years, if I have any cash to spare, I'll likely get a used PowerEdge, though. The cost of those things, for what you get, is exceedingly good.

dragontamer · on Aug 21, 2019

Ah right, multiseat desktop.

Well, I guess the Mac Pro is fine for that, as long as you're fine with the Mac OSX operating system. The Mac Pro line hasn't really had many updates, so maybe the 5-year heuristic doesn't really apply.

cptnapalm · on Aug 21, 2019

Linux all the way! OSX doesn't actually do multiseat. So, have a Linux Mac Pro that I can ssh into, or if that's blocked, get a shell or even my desktop in a web browser among other things. All in all, rather happy with it, though I really would like one of those Raptor Power9 boards for the hell of it.

jsjohnst · on Aug 20, 2019

> The Power9 18-core / 72-thread is going to come in at under 150W total.

The TDP on the 18 core (and 22 core as well) is 190W as listed on Raptor’s website.

CrystalGamma · on Aug 21, 2019

That's an IBM TDP (i. e. maximum ever power), not an Intel TDP (i. e. maximum power at some arbitrary power state declared as 'base clock speed').

zrm · on Aug 20, 2019

It's going to be highly dependent on the workload. For some it's counterproductive because the working set of fewer threads will fit into a given cache level when more threads won't, and then it slows things down -- but then you can turn it off or run fewer threads per core.

Where it's a big win is for pointer chasing workloads or big databases, where the working set isn't going to fit in cache anyway and then it's effectively like having really fast context switches. You have four threads and three of them are waiting on main memory while you keep the core busy with the fourth, then that thread has a cache miss but by then one of the other threads has the data it was waiting on.
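A back-of-the-envelope model of that effect, as a sketch only: it assumes every thread alternates a fixed burst of work with a fixed memory stall and that switching between threads is free, which real cores only approximate. The cycle counts are hypothetical.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles in which the core has some thread ready to run.

    Each thread does compute_cycles of work, then waits stall_cycles on
    main memory; while it waits, another hardware thread can use the core.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


if __name__ == "__main__":
    # Hypothetical pointer-chasing loop: 50 cycles of work per 150-cycle miss.
    for threads in (1, 2, 4):
        print(threads, core_utilization(threads, 50, 150))
```

With those made-up numbers a single thread keeps the core busy a quarter of the time and four threads keep it fully busy, which is the "three waiting, one running" picture. For a cache-friendly workload (small stall) the model saturates at one thread, and extra threads only add cache contention.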

slovenlyrobot · on Aug 20, 2019

That pointer chasing benefit has been my experience on Intel, especially on anaemic low power designs. I'm more curious how/why Intel stops at 2 whereas Sparc/Power can manage much higher numbers. Maybe it's not architectural, but more just about product fit or something

zrm · on Aug 21, 2019

It's probably a combination of target market and trade offs.

To make SMT-4 perform well you want to have larger caches so that cache contention between the threads doesn't become the bottleneck, but that eats a lot of transistors. It's essentially a brute force trade off between performance and manufacturing cost and IBM is more willing to say "damn the cost" than Intel.

There's also the matter of who needs a machine like that. There is a lot of ugly pointer-chasing code in the world, but to take advantage of SMT-4 it has to be well-threaded ugly pointer-chasing code. You basically need a customer that needs their application to scale and is willing to do the bare minimum necessary to make that possible, but not spend a lot of resources actually optimizing the code once they get it to the point that throwing more hardware at it is a viable alternative. That's the enterprise market in a nutshell right there, and that's where IBM lives.

stingraycharles · on Aug 20, 2019

That sounds fascinating. Do you have any examples / study material that describes these programming techniques?

slovenlyrobot · on Aug 20, 2019

My hyperthreading enlightenment came from discovering a parallelized XML parsing task (using libxml2) running on Atom N2800 (2 cores) absolutely trouncing a similar run on a much beefier Xeon with HT disabled. It came very close to a 2x speedup.

This is what the parent comment means when referring to pointer chasing -- XML documents are a big random access graph in memory, CPU cache and prefetch is close to useless in that environment, so when walking the DOM as part of some parsing task, much of the time is spent waiting on memory, with the execution units lying idle.

OTOH many 'genuinely computational' jobs like say, an ffmpeg encode have very noticeable slowdowns with hyperthreading enabled. In those kinds of jobs where the code is already highly optimized to keep the CPU pipeline busy, there will be contention for the single set of execution units shared by both threads, and so the illusion is destroyed.

As to why it results in a measurable slowdown, someone else would need to answer that, but it is at least conceivable that software overheads to manage the increased task partitioning might account for some of it

flukus · on Aug 21, 2019

> This is what the parent comment means when referring to pointer chasing -- XML documents are a big random access graph in memory, CPU cache and prefetch is close to useless in that environment, so when walking the DOM as part of some parsing task, much of the time is spent waiting on memory, with the execution units lying idle.

Bear in mind that this is only true if you parse with the DOM model. If you care about efficiency and it's at all possible, the SAX model is much faster: you won't be bound by pointer chasing as there's very little in memory at once. IME the next big gain comes from replacing string comparisons with hash-value comparisons. By that point XML parsing is entirely limited by how fast you can stream the documents.
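A minimal sketch of the SAX style, using Python's standard-library `xml.sax`. The `<item>` element name, the `ItemCounter` handler, and the `summarize` helper are invented for illustration; the point is only that no tree is ever built.

```python
import xml.sax


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements and sums their integer text, with no tree in memory."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.total = 0
        self._in_item = False
        self._text = []

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect the pieces.
        if self._in_item:
            self._text.append(content)

    def endElement(self, name):
        if name == "item":
            self.count += 1
            self.total += int("".join(self._text))
            self._in_item = False


def summarize(xml_bytes):
    """Stream xml_bytes through the handler and return (count, total)."""
    handler = ItemCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.count, handler.total


if __name__ == "__main__":
    print(summarize(b"<items><item>1</item><item>2</item><item>39</item></items>"))
```

Only the current element's text is held at any time, so memory use stays flat however large the document is.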

slovenlyrobot · on Aug 21, 2019

You can achieve a similar (although I guess not nearly as efficient) effect with DOM, without sacrificing convenience, given a suitable library. For example, the Python lxml library grants access to the tree as it is being constructed. If you are careful not to delete a node the parser will later modify, it's entirely safe to e.g. parse one element at a time from a big serialized array, deleting each element from its parent container afterwards, so memory usage remains constant. By the end of the parse, you're left with a stub DOM describing an empty container.

The advantage is not losing access to lovely tooling like XPath for parsing

(If anyone had not seen this trick before, the key to avoid deleting elements out from under the parser is to keep a small history of elements to be deleted later. For an array, it's only necessary to save the node describing the previous array element)
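Roughly, the trick looks like the sketch below. Two assumptions to flag: it uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for lxml so that it runs anywhere, and it assumes the array elements are direct children of the root. `process_items` and `handle` are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET


def process_items(source, handle, tag="item"):
    """Call handle(elem) for each <tag> child of the root, deleting as we go.

    Deletion lags one element behind the parser: the element just
    completed is only removed once the next one has been completed.
    """
    root = None
    previous = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem  # the first "start" event is the root container
        if event == "end" and elem.tag == tag:
            handle(elem)
            if previous is not None:
                root.remove(previous)  # the parser is finished with this one
            previous = elem
    if previous is not None:
        root.remove(previous)  # drop the last element after parsing ends
    return root


if __name__ == "__main__":
    doc = io.BytesIO(b"<items><item>a</item><item>b</item><item>c</item></items>")
    seen = []
    stub = process_items(doc, lambda e: seen.append(e.text))
    print(seen, len(stub))
```

At most two array elements are alive at once, and what comes back is the empty container.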

sbierwagen · on Aug 21, 2019

I'm not sure I would describe an IO-bound problem as "genuinely computational".

imtringued · on Aug 21, 2019

Video encoding is one of the most CPU intensive problems that your average user will encounter.

ddorian43 · on Aug 20, 2019

https://m.youtube.com/watch?v=j9tlJAqMV7U

This is an extreme version of yield on memory access

jammygit · on Aug 22, 2019

Have you had any issues with it, or has it required any additional configuration? I'm extremely curious about the real world use of Power chips

martin1975 · on Aug 21, 2019

can you run AIX on it?

CrystalGamma · on Aug 21, 2019

No. AFAIK AIX only runs on PowerVM systems, which none of the OpenPOWER systems are.

Annatar · on Aug 21, 2019

Then what's the point if I can't run AIX on it?!?

Annatar · on Aug 21, 2019

"Talos™ II 2U Rack Mount Server TL2SV1 Talos™ II 2U Rack Mount Server Starting at $6,089.00"

Not at $6,089.00; they can forget that. It has to cost no more than $500 USD or this will be a repeat of the same mistake Sun Microsystems made. Will these companies ever learn?

One cannot charge enterprise prices if one wants to build an upward spiral. Intel systems dominate because they are dirt cheap and convenient to buy.

fluffything · on Aug 24, 2019

You can't find a modern Intel Xeon Gold CPU for less than 2000$. If you buy 2, and a motherboard, you are already in the 6000$ ballpark, and then you still need to buy everything else (PSU, RAM, SSDs, GPGPU, etc.).

Annatar · on Sept 2, 2019

I can build (and have) a fully decked-out intel-based 1U server for $1,800 USD, so this Talos thing can't compete: it's not cost-effective no matter how one slices and dices it.

This company is repeating the same mistake IBM, hp, SGI and Sun before it made.

Those who do not learn from history are doomed to repeat mistakes of those who came before them.

Have you bought one of those Talos systems?

fluffything · on Sept 7, 2019

I'll believe you if you can provide a link to two Xeon Gold CPUs that together cost no more than the 1800$ for which you claim you can build a full 1U server with two of them.