> The final result is a chip that lets AWS sell each Graviton 3 core at a lower price, while still delivering a significant performance boost over their previous Graviton 2 chip.
That's not correct. AWS sells Graviton 3 based EC2 instances at a higher price than Graviton 2 based instances!
For example a c6g.large instance (powered by Graviton 2) costs $0.068/hour in us-east-1, while a c7g.large instance (powered by Graviton 3) costs $0.0725/hour [1]. Both instances have the same core count and memory, although c7g instances have slightly better network throughput.
I believe that is pretty unusual as, if my memory serves me right, newer instance family generations are usually cheaper than the previous generation.
Based on the first published benchmarks, even programs that have not been optimized for Neoverse V1, and which do not benefit from its much faster floating-point and large-integer computation abilities, still show a performance increase of at least 40%, which is greater than the price increase.
So I believe that using Graviton 3 at these prices is still a much better deal than using Graviton 2.
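Back-of-envelope with the numbers above: the hourly price goes up by 0.0725 / 0.068 ≈ 1.066, i.e. about 6.6%, while performance goes up by at least 40%, so price/performance improves by roughly 1.40 / 1.066 ≈ 1.31, or about 30% more work per dollar even for unoptimized code.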
I don't follow. You seem to be implying that Amazon would like to reduce their electricity usage. If so, shouldn't they be charging less for the more efficient instance type?
Given the surrounding context I read that sentence to mean that focusing on compute density allowed them to sell each core at a lower price vs focusing on performance, not that Graviton 3 is cheaper than Graviton 2.
It would be irrational to expect a durable lower price on graviton. Amazon will price it lower initially to entice customers to port their apps, but after they get a critical mass of demand the price will rise to a steady state where it costs the same as Intel. The only difference will be which guy is taking your money.
I dunno, this take is a bit weird to me. The work we did to support Graviton wasn't "moving" from Intel to ARM, it was making our build pipeline arch-agnostic. If Intel one day works out cheaper again we'll use it again with basically zero friction.
It's great that your migration worked in a way that made your deployment pipeline arch-agnostic. Expecting that everyone did it that way? Not sure about that.
I bet quite a lot of migrations involved changing hardcoded instance types, Docker image SHAs, etc. That would be hard to revert quickly.
I mean, they just raised their graviton prices between generations.
I don’t think the point was that they would increase the cost of existing instance types, only that over time and generations the price will trend upwards as more workloads shift over.
Considering the blank stares I get when mentioning ARM as a potential cost-saving measure, it will take years, maybe decades, before that happens, by which point you're definitely getting your money's worth as an early adopter.
I honestly think it's because of a shortage of CPUs. If everyone switches over, they won't have enough hardware. Looking at the Phoronix benchmarks [0], it's a great uplift for the slight price increase.
I'm sure once hardware settles it'll go down in price. Given this was announced like 6 months ago, I'm sure they wanted to launch something.
> Literally 6 days ago when they introduced this thing.
Offering a new option is not a price increase. You can still do all the same things at the same prices, plus if the new thing is more efficient for your particular task you have an additional option.
When they introduced c6i they did it at the same price as c5, even though the c6i is a lot more efficient. They're raising the price on c7g vs. c6g to bring it closer to the pricing of c6i, which is pretty much exactly what I suggested?
Universally, everyone understands "raising prices" to mean raising prices without any customer action.
As in you consider your options, take into consideration pricing, design your architecture, you deploy it, and you get a bill. Then suddenly, later, without any action of your own, your bill goes up.
THAT is raising prices, and it is something AWS has essentially never done.
What you're describing is a situation where a customer CHOOSES to upgrade to a new generation of instances, and in doing so gets a larger bill. That is nowhere near the same thing.
I read the original post and didn't consider an interpretation other than Graviton would lose the Intel relative discount generation on generation. It'll take a decade for everyone to port their software after all.
Yes, gcc 10.1 introduced support for the SVE2 intrinsics (ACLE).
Moreover, starting with version 8.1, gcc began to use SVE in certain cases when it succeeded in auto-vectorizing loops (if the correct -march option had been used).
Nevertheless, many Linux distributions still ship with older gcc versions, so SVE/SVE2 does not work with the available compiler or cross-compiler.
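For anyone curious what targeting SVE actually looks like, here is a minimal sketch using the ACLE intrinsics. The function, the array-add example, and the exact -march string are mine, chosen only for illustration; it needs a GCC new enough to ship arm_sve.h (10.1+, per the above):

```c
// add.c - minimal SVE ACLE sketch (illustrative; assumes GCC >= 10.1)
// build, e.g.:  gcc -O3 -march=armv8.2-a+sve -c add.c
#include <arm_sve.h>
#include <stddef.h>

// dst[i] = a[i] + b[i], vector-length agnostic: the same binary works
// whether the hardware SVE vectors are 256-bit (as on Neoverse V1) or any other width
void add_f32(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += svcntw()) {      // advance by one vector of 32-bit floats
        svbool_t pg = svwhilelt_b32_u64(i, n);      // predicate masks off the loop tail
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_z(pg, va, vb));
    }
}
```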
It feels like we've gone badly wrong somewhere when processors have to spend so many of their resources guessing about the program. I am not saying I have a solution, just that feeling.
Why would you consider prediction based on dynamic conditions to be the sign of a dystopian optimization cycle? Isn't it mostly intuitive that interesting program executions are not going to be things you can determine statically (otherwise your compiler would have cleaned them up for you with inlining etc.), or could be determined statically but only at too great a cost to meet execution deadlines (JITs and so on) or resource constraints (you don't really want N code clones specialising each branch backtrace to create strictly predictable chains)?
Or is the worry on the other side; that processors have gotten so out-of-order that only huge dedication to guesswork can keep the beast sated? I don't see this as a million miles from software techniques in JIT compilers to optimistically optimize and later de-optimize when an assumption proves wrong.
I think you might be right to be nervous if you wrote programs that took fairly regular data and did fairly regular things to it. But, as Itanium learned the hard way, programs have much more dynamic, emergent and interesting behaviour than that!
I guess the fear is that the CPU might start guessing wrong, causing your program to miss deadlines. Also, the heuristics are practically useless for realtime computing, where timings must be guaranteed.
I suppose that if you assume in-order execution and count the clock cycles, you should get a guaranteed lower bound of performance. It may be, say, 30-40% of the performance you really observe, but having some headroom should feel good.
Isn't that the whole promise of general purpose computing? That you don't need to find specialized hardware for most workloads? Nobody wants to be out shopping for CPUs that have features that align particularly well with their use case, then switching to different CPUs when they need to release an update or some customer comes along with data that runs less efficiently with the algorithms as written.
Since processors are expensive and hard to change, they do tricks to allow themselves to be used more efficiently in common cases. That seems like a reasonable behavior to me.
I thought this was the reasoning behind Itanium, the idea that scheduling could be worked out in advance by the compiler (probably profile guided from tests or something like that) which would reduce the latency and silicon cost of implementations.
However, it wasn't exactly a raging success; I think the predicted amazing compiler tech never materialised. But maybe it is the right answer and the implementation was wrong? I'm no CPU expert...
Itanium was a really badly designed architecture, which a lot of people skip over when they try to draw analogies to it. It was a worst of three worlds, in that it was big and hot like an out-of-order, it had the serial dependency issues of an in-order, and it had all the complexity of fancy static scheduling without that fancy scheduling actually working.
There have been a small number of attempts since Itanium, like NVIDIA's Denver, which make for much better baselines. I don't think those are anywhere close to optimal designs, or really that they tried hard enough to solve in-order issues at all, but they at least seem sane.
Would Itanium have been better served with bytecode and a modern JIT? Also, doesn't RISC-V kinda get back on that VLIW track with macro-op fusion, using a very basic instruction set and letting the compiler figure out the best way to order stuff to help the target CPU make sense of it?
Those are all very different things. It is probably possible to argue that using a JIT would have solved some of Itanium's compilation issues, like it would make it easier to make compilation decisions about where to do software data miss handling, but I don't think it would have made the hardware fundamentally more sensible, or all that performant. RISC-V isn't really anything like VLIW, it is about as close to a traditional RISC as anything gets nowadays, and macro-op fusion is just a simple front end trick that doesn't overly influence whether the back end is in-order or out-of-order (or whatever else).
I do think a big part of the problem is that people want to distribute binaries that will run on a lot of CPUs that are physically really different inside. But nowadays there's JIT compilation even for JavaScript, so you could distribute something like LLVM, or even (ecch) JavaScript itself, and have the "compiler scheduling" happen at installation time or even at program start.
Yes, you can if you use Bitcode which has been a stable format since around 2015. It is possible to distribute an application as a pure Bitcode binary that can be statically translated into the underlying hardware ISA unless the source code uses inline assembly – see https://lowlevelbits.org/bitcode-demystified/ for details.
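As a concrete (toy) illustration of the idea, using stock LLVM tooling rather than Apple's Bitcode pipeline specifically; the file name and flags are just examples:

```c
/* hello.c - toy illustration of shipping IR and lowering it on the target.
 *
 *   # build machine: emit LLVM bitcode instead of a native object
 *   clang -O2 -c -emit-llvm hello.c -o hello.bc
 *
 *   # target machine: lower the bitcode to the local ISA and link
 *   clang -O2 hello.bc -o hello
 *
 * (Flags are illustrative; a real distribution scheme needs more care,
 * e.g. around inline assembly and target-specific intrinsics.)
 */
#include <stdio.h>

int main(void) {
    puts("hello, compiled from bitcode");
    return 0;
}
```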
Mill would definitely make things more interesting. They were supposed to have their simulator online a while ago, but it sounds like they needed to redo the compiler work (from what I understood). Once that is done, the next step sounds like getting the simulator online for people to play with.
It always did feel like a weird hack to me, a way to keep parts of the CPU from sitting idle. I mean the performance benefits are there, but it's at the cost of power usage in the end.
Can branch prediction be turned off on a compiler or application level? If you're optimizing for energy use that is. Disclaimer: I don't actually know if disabling branch prediction is more energy efficient.
Disabling branch prediction would have such a catastrophic effect on performance, there is no way it would pay for itself. Actually this is true for most parts of a CPU; Apple's little cores are extremely power efficient and yet they are fairly impressive out-of-order designs. It would take a very meaningful redesign of how a CPU works to beat a processor like that, at least at useful speeds.
At the cost of power doesn't necessarily mean at the cost of energy. If it takes a minimum of 1 gazillion electrons to do something useful then it's better to move those electrons quickly and return to idle than it is to eke them out slowly over a longer time despite the latter having a lower power usage.
Clocks and leakage burn most of the power. So the most power efficient thing to do is to get as much work done per clock as possible, and if a small fraction turns out to be wasted, you still come out ahead.
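To put some (made-up, purely illustrative) numbers on the power-vs-energy point: a core that finishes a job in 1 s at 10 W spends 10 J and then idles, while one that crawls through the same job over 10 s at 2 W spends 20 J. The lower-power option used twice the energy.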
Turning off branch prediction sounds like a weird hack that serves no purpose, just underclock and undervolt your CPU if you care about power consumption that much.
Uli Drepper has this tool which you can use to annotate source code with explanations of which optimisations are applied. In this case it would rely on GCC recognizing branches which are hard to predict (eg. a branch in an inner loop which is data-dependent), and I'm not sure GCC is able to do that.
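For what it's worth, the closest compiler-level knob I know of is hinting which way a branch usually goes rather than disabling prediction; a small sketch with GCC's __builtin_expect (the function and the error-path example are made up for illustration):

```c
#include <stddef.h>

/* GCC/Clang builtin: tells the compiler which outcome is expected, so it can
 * lay out the hot path as the fall-through; it does not switch the hardware
 * predictor off. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* hypothetical example: checksum a buffer, bailing out on a rare error flag */
long checksum(const unsigned char *buf, size_t n, int error_flag) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(error_flag))   /* cold path: kept out of the hot loop body */
            return -1;
        sum += buf[i];
    }
    return sum;
}
```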
A majority of the non-deterministic and speculative hardware mechanisms that exist in a modern CPU are required due to the consequences of one single hardware design decision: to use a data cache memory.
The data cache memory is one of the solutions to avoid the extremely long latency of loading data from a DRAM memory.
The alternative to a data cache memory is to have a hierarchy of memories with different speeds, which are addressed explicitly.
The latter variant is sometimes chosen for embedded computers where determinism is more important than programmer convenience. However, for general-purpose computers this variant could be acceptable only if the hierarchy of memories were managed automatically by a high-level language compiler.
It appears that writing a compiler that could handle the allocation of data into a heterogeneous set of memories and the transfers between them is a more difficult task than designing a CPU that becomes an order of magnitude more complex due to having a hierarchy of data cache memories and a long list of other hardware mechanisms that must be added because the data cache memory exists.
Once it is decided that the CPU must have a data cache memory, a lot of other hardware design decisions follow from it.
Because a larger data cache memory inevitably has a longer load latency, the cache memory must be split into a multi-level hierarchy of cache memories.
To reduce the number of cache misses, data cache prefetchers must be added, to speculatively fill the cache lines in advance of load requests.
Now, when a data cache exists, most loads have a small latency, but from time to time there still is a cache miss, when the latency is huge, long enough to execute hundreds of instructions.
There are 2 solutions to the problem of finding instructions to be executed during cache misses, instead of stalling the CPU: simultaneous multi-threading and out-of-order execution.
For explicitly addressed heterogeneous memories, neither of these 2 hardware mechanisms is needed, because independent instructions can be scheduled statically to overlap the memory transfers. With a data cache, this is not possible, because it cannot be predicted statically when cache misses will occur (mainly due to the activity of other execution threads, but even an if-then-else can prevent the static prediction of the cache state, unless additional load instructions are inserted by the compiler, to ensure that the cache state does not depend on the selected branch of the conditional statement; this does not work for external library functions or other execution threads).
With a data cache memory, one or both of SMT and OoOE must be implemented. If out-of-order execution is implemented, then the number of registers needed to avoid false dependencies between instructions becomes larger than it is convenient to encode in the instructions, so register renaming must also be implemented.
And so on.
In conclusion, to avoid the huge amount of resources needed by a CPU for guessing about the programs, the solution would be a high-level language compiler able to transparently allocate the data into a hierarchy of heterogeneous memories and schedule transfers between them when needed, like the compilers do now for register allocation, loading and storing.
Unfortunately nobody has succeeded in demonstrating a good compiler of this kind.
Moreover, the existing compilers frequently have difficulties discovering the optimal allocation and transfer schedule for registers, which is a simpler problem.
Doing the same efficiently for a hierarchy of heterogeneous memories seems out of reach for current compilers.
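To make the idea concrete, here is a toy hand-written version of what such a compiler would have to do automatically: data is copied into a small, fast local buffer (standing in for an explicitly addressed scratchpad) before being worked on, and only the local copy is touched in the inner loop. All names and sizes below are made up for illustration:

```c
/* Toy sketch of explicit staging, done by hand today; the idea above is a
 * compiler performing this placement and copy scheduling automatically. */
#include <string.h>
#include <stddef.h>

#define TILE 256   /* pretend this is the capacity of the fast local memory */

/* sums a large array by copying fixed-size tiles into a small "scratchpad"
 * buffer, then computing only on the local copy */
double sum_staged(const double *big, size_t n) {
    double scratch[TILE];          /* stand-in for explicitly addressed fast memory */
    double total = 0.0;
    for (size_t i = 0; i < n; i += TILE) {
        size_t chunk = (n - i < TILE) ? (n - i) : TILE;
        memcpy(scratch, big + i, chunk * sizeof(double));   /* explicit "DMA" transfer */
        for (size_t j = 0; j < chunk; j++)                  /* compute on the local copy */
            total += scratch[j];
    }
    return total;
}
```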
While not addressing it explicitly, modern languages do already perform cache-size-aware optimizations. Namely, the .NET runtime bases its heap sizes and allocation behavior on the CPU's cache size, so that data is allocated closer together, and further reads and subsequent garbage collection are friendly to the CPU's memory prefetching behavior.
On the other hand, register allocation is a much more complex problem to solve because there is no single heuristic that fits all CPUs (and even operating systems!) sharing the same micro-architecture. As a result, "good enough" is sometimes chosen unless the language relies on LLVM/GCC logic.
We do have these architectures already in the embedded space and as DSPs. I suppose, they would be interesting for supercomputers as well. But for general purpose CPUs they would be a difficult sell. Since the memory size and latency would be part of the ISA, binaries could not run unchanged on different memory configurations, you would need another software layer to take care of that. Context switching and memory mapping would also need some rethinking. Of course, all of this can be solved, but it would make adoption more difficult.
And last but not least, unknown memory latency is not the only source of problems; branch (mis)predictions are another. And they have the same remedies as cache misses: multithreading and speculative execution.
So if you wanted to get rid of branch prediction as well, you could come up with something like the CRAY-1.
You are right that a kind of multi-threading can be useful to mitigate the effects of branch mispredictions.
However, for this, fine-grained multi-threading is enough. Simultaneous multi-threading does not bring any advantage, because the thread with the mispredicted branch cannot progress.
Out-of-order execution cannot be used during branch mispredictions, so, as I said, both SMT and OoOE are techniques useful only when a data cache memory exists.
Any CPU with pipelined instruction execution needs a branch predictor and it needs to execute speculatively the instructions on the predicted path, in order to avoid the pipeline stalls caused by control dependencies between instructions. An instruction cache memory is also always needed for a CPU with pipelined instruction execution, to ensure that the instruction fetch rate is high enough.
Unlike simultaneous multi-threading, fine-grained multi-threading is useful in a CPU without a data cache memory, not only because it can hide the latencies of branch mispredictions, but also because it can hide the latencies of any long operations, like it is done in all GPUs.
Fine-grained multi-threading is significantly simpler to implement than simultaneous multi-threading.
Citation needed. The way I see it, brains have a lot of baked-in behaviour that isn't tuned at runtime; e.g. the length of time that different types of things are kept in different types of memory is remarkably consistent.
How much can SVE instructions help with machine learning?
I've wondered why Apple Silicon made the trade-off decision not to include SVE support yet, given that support for lower-precision FP vectorization seems like it could have made their NVIDIA perf gap smaller.
I don't understand these graphs titled "Branch Predictor Pattern Recognition". What do they mean? Could someone here explain it a bit in detail? Thanks ahead.
It’s showing you how complicated of a branch sequence it can train on. Simple sequences are in the lower left, complex in the top right, and the ideal result would be a flat surface at the bottom. The earlier the surface rises into a mesa, the easier the branch predictor was overwhelmed.
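My rough guess at the kind of microbenchmark behind such a plot (not the article's actual code): execute one branch whose taken/not-taken sequence repeats with a chosen period, count mispredictions with something like `perf stat -e branches,branch-misses`, and sweep the period (and, for the second axis, the number of distinct branches):

```c
/* bp_pattern.c - toy probe of branch-predictor pattern capacity (illustrative).
 * build: gcc -O2 bp_pattern.c -o bp_pattern
 * run:   perf stat -e branches,branch-misses ./bp_pattern 512
 * Once the period exceeds what the predictor can learn, branch-misses jumps
 * toward ~50% of the pattern branches. (Worth checking the disassembly that
 * the if/else stays a real conditional branch and is not if-converted.) */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    long period = (argc > 1) ? atol(argv[1]) : 64;  /* length of the repeating pattern */
    if (period <= 0) period = 64;

    uint8_t *pattern = malloc(period);
    srand(1);                                       /* fixed seed: same pattern every run */
    for (long i = 0; i < period; i++)
        pattern[i] = rand() & 1;

    volatile long taken = 0, not_taken = 0;         /* volatile keeps both paths "live" */
    for (long iter = 0; iter < 200000000L; iter++) {
        if (pattern[iter % period])                 /* the branch under test */
            taken++;
        else
            not_taken++;
    }
    printf("taken=%ld not_taken=%ld\n", taken, not_taken);
    free(pattern);
    return 0;
}
```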
While the article is interesting, I would be more interested in details about carbon footprint and cost reduction. Also, how would this impact more typical Node, Java loads?
AWS is pushing to move its internal services (most of which are in Java) to Graviton, so I would expect it to be excellent for "normal" workloads/languages.
No offense intended to your personal experience, but I don't think "small independent cloud" is terribly important in the global analysis. This paper concludes that TDP and TCO have become the same thing, i.e. power is heat, power is opex.
One key call out re: power from that paper - "Most OpEx cost is for provisioning power and not for electricity use [2], so saving power already provisioned for doesn’t improve TCO as much as one might hope."
I'd peg it closer to 75%. There's some significant opex on the endpoint connectivity side as well. That's why a lot of these providers are buying stakes in some of the transoceanic cable projects (and some terrestrial ones as well). Or they're agreeing not to charge each other (kind of sort of- see open bandwidth alliance).
You know, if you wanted to improve carbon footprint, a better place to look might be at software bloat. The sheer number of times things get encoded and decoded to text is mind boggling. Especially in "typical node, Java loads".
Logging and cybersecurity are bloaty areas as well. I've seen plenty of AWS cost breakdowns where the cybersec functions were the highest percentage of spend. Or desktops where Carbon Black or Windows Defender were using most of the CPU or IO cycles. And networks where syslog traffic was the biggest percentage of traffic.
Norton, Symantec, and McAfee contribute greatly to global warming in the financial services sector. At least half of CPU cycles on employee laptops are devoted to them.
But do they actually work? For years I've been of the opinion that most anti-virus solutions don't actually stop viruses; instead they give you a false sense of security, and their messaging is intentionally alarmist to make individuals and organizations pay their subscription fees.
In my limited and sheltered experience, the only viruses I've gotten in the past decade or so was from dodgy pirated stuff or big "download" button ads on download sites.
Well, a fair number of cybersec-oriented services are essentially a pattern of "sniff and copy every bit of data and do something with it" or "trawl all state". Which is inherently heavy.
This has been looked at; the electricity savings of more efficient programming are too small to be worth it with a fixed compute capacity. However, if you can reduce the amount of machines or machine-time you require, that can become material.
Very interesting! I'm not terribly well versed in ARM vs x86, so it's helpful to see these kinds of benchmarks and reports.
One bit of feedback for the author: the sliding scale is helpful, but the y-axes differ between the visualizations, so you cannot make the apples-to-apples comparison needed. Suggest regenerating those.
I wonder how Graviton 3 compares with M1 Max/Ultra, for the same number of CPU cores.
The GPU and "ML" cores would totally destroy it for some use cases though.
It's likely that it's going to need a post on its own since it's an extremely complicated topic. Someone else wrote an awesome post about this for the Neoverse N2 chips [1] and they found that with LSE atomics, the N2 performs as well as or better than Ice Lake. Given Graviton 3 has a much wider fabric, I would assume this lead only improves.
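For anyone wondering what "LSE atomics" means concretely: on ARMv8.1+ parts (Graviton 2/3 included), a plain C11 atomic increment can compile to a single LSE instruction instead of the older load-exclusive/store-exclusive retry loop, and that loop is what degrades badly under contention. A minimal sketch (the function name is mine):

```c
/* build for an LSE-capable target, e.g.:
 *   gcc -O2 -march=armv8.1-a -c lse_counter.c
 * With LSE the increment below becomes a single LDADD; without it, an
 * LDAXR/STLXR retry loop (or a call into GCC's outline-atomics helpers). */
#include <stdatomic.h>

static atomic_long shared_counter;

void bump_shared_counter(void) {
    atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed);
}
```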
Ah, yes I remember this post, but it reads as pretty cryptic to me. I would like to know what the slowdowns actually become in practice: does it add latency to the execution of other threads, and how will the machine as a whole behave?
I know the M4 had much better multicore shared-memory perf than the M3, but both of those are old now and I don't have users to test anything.
Writing, usually not; but respecting the single-writer principle is usually rule zero of parallel-programming optimisation.
If you mean reading/writing to the same memory bus in general, then yes, the bus needs to be sized according to the needs of the expected loads (i.e. the machine needs to be balanced).
The whole chip in general will be used in aggregate by independent vms/containers etc that do NOT read and write to the same memory. Some kernel datastructures within a given vm are still shared, ditto for within a single process, but good design minimizes that (per cpu/thread data structures, sharded locks, etc etc).
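A tiny sketch of the per-CPU/per-thread pattern mentioned above (all names are made up): each thread writes only its own slot, padded out to a cache line so the writes never bounce a line between cores, and the results are merged once at the end:

```c
/* counters.c - single-writer, per-thread counters (illustrative).
 * build: gcc -O2 -pthread counters.c -o counters */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    10000000L

struct padded_counter {
    long value;
    char pad[64 - sizeof(long)];    /* keep each counter on its own cache line */
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;       /* single writer: only this thread touches this slot */
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    long total = 0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(threads[i], NULL);
        total += counters[i].value; /* merge once, after the writers are done */
    }
    printf("total = %ld\n", total);
    return 0;
}
```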
[1]: https://aws.amazon.com/ec2/pricing/on-demand/