> The final result is a chip that lets AWS sell each Graviton 3 core at a lower price, while still delivering a significant performance boost over their previous Graviton 2 chip.
That's not correct. AWS sells Graviton 3 based EC2 instances at a higher price than Graviton 2 based instances!
For example a c6g.large instance (powered by Graviton 2) costs $0.068/hour in us-east-1, while a c7g.large instance (powered by Graviton 3) costs $0.0725/hour [1]. Both instances have the same core count and memory, although c7g instances have slightly better network throughput.
I believe that is pretty unusual as, if my memory serves me right, newer instance family generations are usually cheaper than the previous generation.
Based on the first published benchmarks, even programs that have not been optimized for Neoverse V1, and which do not benefit from its much faster floating-point and large-integer computation abilities, still show a performance increase of at least 40%, which is greater than the price increase.
So I believe that using Graviton 3 at these prices is still a much better deal than using Graviton 2.
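Back-of-envelope with the numbers above: the hourly price goes up by 0.0725 / 0.068 ≈ 1.066, i.e. about 6.6%, while performance goes up by at least 40%, so price/performance improves by roughly 1.40 / 1.066 ≈ 1.31, or about 30% more work per dollar even for unoptimized code.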
I don't follow. You seem to be implying that Amazon would like to reduce their electricity usage. If so, shouldn't they be charging less for the more efficient instance type?
Given the surrounding context I read that sentence to mean that focusing on compute density allowed them to sell each core at a lower price vs focusing on performance, not that Graviton 3 is cheaper than Graviton 2.
It would be irrational to expect a durable lower price on graviton. Amazon will price it lower initially to entice customers to port their apps, but after they get a critical mass of demand the price will rise to a steady state where it costs the same as Intel. The only difference will be which guy is taking your money.
I dunno, this take is a bit weird to me. The work we did to support Graviton wasn't "moving" from Intel to ARM, it was making our build pipeline arch-agnostic. If Intel one day works out cheaper again we'll use it again with basically zero friction.
It's great that your migration worked in a way that made your deployment pipeline arch-agnostic. Expecting that everyone did it that way? Not sure about that.
I bet quite a lot of migrations involved changing hardcoded instance types, Docker image SHAs, etc. That would be hard to revert quickly.
I mean, they just raised their graviton prices between generations.
I don’t think the point was that they would increase the cost of existing instance types, only that over time and generations the price will trend upwards as more workloads shift over.
Considering the blank stares I get when mentioning ARM as a potential cost-saving measure, it will take years, maybe decades, before that happens, by which point you're definitely getting your money's worth as an early adopter.
I honestly think it's because of a shortage of CPUs. If everyone switches over, they won't have enough hardware. Looking at the Phoronix benchmarks [0], it's a great uplift for the slight price increase.
I'm sure once hardware settles it'll go down in price. Given this was announced like 6 months ago, I'm sure they wanted to launch something.
> Literally 6 days ago when they introduced this thing.
Offering a new option is not a price increase. You can still do all the same things at the same prices, plus if the new thing is more efficient for your particular task you have an additional option.
When they introduced c6i they did it at the same price as c5, even though the c6i is a lot more efficient. They're raising the price on c7g vs. c6g to bring it closer to the pricing of c6i, which is pretty much exactly what I suggested?
Universally, everyone understands "raising prices" to mean raising prices without any customer action.
As in you consider your options, take into consideration pricing, design your architecture, you deploy it, and you get a bill. Then suddenly, later, without any action of your own, your bill goes up.
THAT is raising prices, and it is something AWS has essentially never done.
What you're describing is a situation where a customer CHOOSES to upgrade to a new generation of instances, and in doing so gets a larger bill. That is nowhere near the same thing.
I read the original post and didn't consider an interpretation other than Graviton would lose the Intel relative discount generation on generation. It'll take a decade for everyone to port their software after all.
Yes, gcc 10.1 introduced support for the SVE2 intrinsics (ACLE).
Moreover, starting with version 8.1, gcc began to use SVE in certain cases when it succeeded in auto-vectorizing loops (if the correct -march option had been used).
Nevertheless, many Linux distributions still ship with older gcc versions, so SVE/SVE2 does not work with the available compiler or cross-compiler.
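For anyone curious what targeting SVE actually looks like, here is a minimal sketch using the ACLE intrinsics. The function, the array-add example, and the exact -march string are mine, chosen only for illustration; it needs a GCC new enough to ship arm_sve.h (10.1+, per the above):

```c
// add.c - minimal SVE ACLE sketch (illustrative; assumes GCC >= 10.1)
// build, e.g.:  gcc -O3 -march=armv8.2-a+sve -c add.c
#include <arm_sve.h>
#include <stddef.h>

// dst[i] = a[i] + b[i], vector-length agnostic: the same binary works
// whether the hardware SVE vectors are 256-bit (as on Neoverse V1) or any other width
void add_f32(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += svcntw()) {      // advance by one vector of 32-bit floats
        svbool_t pg = svwhilelt_b32_u64(i, n);      // predicate masks off the loop tail
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_z(pg, va, vb));
    }
}
```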
It feels like we've gone badly wrong somewhere when processors have to spend so many of their resources guessing about the program. I am not saying I have a solution, just that feeling.
Why would you consider prediction based on dynamic conditions to be the sign of a dystopian optimization cycle? Isn't it mostly intuitive that interesting program executions are not going to be things you can determine statically (otherwise your compiler would have cleaned them up for you with inlining etc.), or could be determined statically but only at too great a cost to meet execution deadlines (JITs and so on) or resource constraints (you don't really want N code clones specialising each branch backtrace to create strictly predictable chains)?
Or is the worry on the other side; that processors have gotten so out-of-order that only huge dedication to guesswork can keep the beast sated? I don't see this as a million miles from software techniques in JIT compilers to optimistically optimize and later de-optimize when an assumption proves wrong.
I think you might be right to be nervous if you wrote programs that took fairly regular data and did fairly regular things to it. But, as Itanium learned the hard way, programs have much more dynamic, emergent and interesting behaviour than that!
I guess the fear is that the CPU might start guessing wrong, causing your program to miss deadlines. Also, the heuristics are practically useless for realtime computing, where timings must be guaranteed.
I suppose that if you assume in-order execution and count the clock cycles, you should get a guaranteed lower bound of performance. It may be, say, 30-40% of the performance you really observe, but having some headroom should feel good.
Isn't that the whole promise of general purpose computing? That you don't need to find specialized hardware for most workloads? Nobody wants to be out shopping for CPUs that have features that align particularly well with their use case, then switching to different CPUs when they need to release an update or some customer comes along with data that runs less efficiently with the algorithms as written.
Since processors are expensive and hard to change, they do tricks to allow themselves to be used more efficiently in common cases. That seems like a reasonable behavior to me.
I thought this was the reasoning behind Itanium, the idea that scheduling could be worked out in advance by the compiler (probably profile guided from tests or something like that) which would reduce the latency and silicon cost of implementations.
However, it wasn't exactly a raging success; I think the predicted amazing compiler tech never materialised. But maybe it is the right answer and the implementation was wrong? I'm no CPU expert...
Itanium was a really badly designed architecture, which a lot of people skip over when they try to draw analogies to it. It was a worst of three worlds, in that it was big and hot like an out-of-order, it had the serial dependency issues of an in-order, and it had all the complexity of fancy static scheduling without that fancy scheduling actually working.
There have been a small number of attempts since Itanium, like NVIDIA's Denver, which make for much better baselines. I don't think those are anywhere close to optimal designs, or really that they tried hard enough to solve in-order issues at all, but they at least seem sane.
Would Itanium have been better served with bytecode and a modern JIT? Also, doesn't RISC-V kinda get back on that VLIW track with macro-op fusion, using a very basic instruction set and letting the compiler figure out the best way to order stuff to help the target CPU make sense of it?
Those are all very different things. It is probably possible to argue that using a JIT would have solved some of Itanium's compilation issues, like it would make it easier to make compilation decisions about where to do software data miss handling, but I don't think it would have made the hardware fundamentally more sensible, or all that performant. RISC-V isn't really anything like VLIW, it is about as close to a traditional RISC as anything gets nowadays, and macro-op fusion is just a simple front end trick that doesn't overly influence whether the back end is in-order or out-of-order (or whatever else).
I do think a big part of the problem is that people want to distribute binaries that will run on a lot of CPUs that are physically really different inside. But nowadays there's JIT compilation even for JavaScript, so you could distribute something like LLVM, or even (ecch) JavaScript itself, and have the "compiler scheduling" happen at installation time or even at program start.
Yes, you can if you use Bitcode which has been a stable format since around 2015. It is possible to distribute an application as a pure Bitcode binary that can be statically translated into the underlying hardware ISA unless the source code uses inline assembly – see https://lowlevelbits.org/bitcode-demystified/ for details.
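As a concrete (toy) illustration of the idea, using stock LLVM tooling rather than Apple's Bitcode pipeline specifically; the file name and flags are just examples:

```c
/* hello.c - toy illustration of shipping IR and lowering it on the target.
 *
 *   # build machine: emit LLVM bitcode instead of a native object
 *   clang -O2 -c -emit-llvm hello.c -o hello.bc
 *
 *   # target machine: lower the bitcode to the local ISA and link
 *   clang -O2 hello.bc -o hello
 *
 * (Flags are illustrative; a real distribution scheme needs more care,
 * e.g. around inline assembly and target-specific intrinsics.)
 */
#include <stdio.h>

int main(void) {
    puts("hello, compiled from bitcode");
    return 0;
}
```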
Mill would definitely make things more interesting. They were supposed to have their simulator online a while ago, but it sounds like they needed to redo the compiler work (from what I understood). Once that is done, the next step sounds like getting the simulator online for people to play with.
It always did feel like a weird hack to me, a way to keep parts of the CPU from sitting idle. I mean the performance benefits are there, but it's at the cost of power usage in the end.
Can branch prediction be turned off on a compiler or application level? If you're optimizing for energy use that is. Disclaimer: I don't actually know if disabling branch prediction is more energy efficient.
Disabling branch prediction would have such a catastrophic effect on performance, there is no way it would pay for itself. Actually this is true for most parts of a CPU; Apple's little cores are extremely power efficient and yet they are fairly impressive out-of-order designs. It would take a very meaningful redesign of how a CPU works to beat a processor like that, at least at useful speeds.
At the cost of power doesn't necessarily mean at the cost of energy. If it takes a minimum of 1 gazillion electrons to do something useful then it's better to move those electrons quickly and return to idle than it is to eke them out slowly over a longer time despite the latter having a lower power usage.
Clocks and leakage burn most of the power. So the most power efficient thing to do is to get as much work done per clock as possible, and if a small fraction turns out to be wasted, you still come out ahead.
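To put some (made-up, purely illustrative) numbers on the power-vs-energy point: a core that finishes a job in 1 s at 10 W spends 10 J and then idles, while one that crawls through the same job over 10 s at 2 W spends 20 J. The lower-power option used twice the energy.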
Turning off branch prediction sounds like a weird hack that serves no purpose, just underclock and undervolt your CPU if you care about power consumption that much.
Uli Drepper has this tool which you can use to annotate source code with explanations of which optimisations are applied. In this case it would rely on GCC recognizing branches which are hard to predict (eg. a branch in an inner loop which is data-dependent), and I'm not sure GCC is able to do that.
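For what it's worth, the closest compiler-level knob I know of is hinting which way a branch usually goes rather than disabling prediction; a small sketch with GCC's __builtin_expect (the function and the error-path example are made up for illustration):

```c
#include <stddef.h>

/* GCC/Clang builtin: tells the compiler which outcome is expected, so it can
 * lay out the hot path as the fall-through; it does not switch the hardware
 * predictor off. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* hypothetical example: checksum a buffer, bailing out on a rare error flag */
long checksum(const unsigned char *buf, size_t n, int error_flag) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(error_flag))   /* cold path: kept out of the hot loop body */
            return -1;
        sum += buf[i];
    }
    return sum;
}
```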
A majority of the non-deterministic and speculative hardware mechanisms that exist in a modern CPU are required due to the consequences of one single hardware design decision: to use a data cache memory.
The data cache memory is one of the solutions to avoid the extremely long latency of loading data from a DRAM memory.
The alternative to a data cache memory is to have a hierarchy of memories with different speeds, which are addressed explicitly.
The latter variant is sometimes chosen for embedded computers where determinism is more important than programmer convenience. However, for general-purpose computers this variant could be acceptable only if the hierarchy of memories were managed automatically by a high-level language compiler.
It appears that writing a compiler that could handle the allocation of data into a heterogeneous set of memories and the transfers between them is a more difficult task than designing a CPU that becomes an order of magnitude more complex due to having a hierarchy of data cache memories and a long list of other hardware mechanisms that must be added because the data cache memory exists.
Once it is decided that the CPU must have a data cache memory, a lot of other hardware design decisions follow from it.
Because a larger data cache memory inevitably has a longer load latency, the cache memory must be split into a multi-level hierarchy of cache memories.
To reduce the number of cache misses, data cache prefetchers must be added, to speculatively fill the cache lines in advance of load requests.
Now, when a data cache exists, most loads have a small latency, but from time to time there still is a cache miss, when the latency is huge, long enough to execute hundreds of instructions.
There are 2 solutions to the problem of finding instructions to be executed during cache misses, instead of stalling the CPU: simultaneous multi-threading and out-of-order execution.
For explicitly addressed heterogeneous memories, neither of these 2 hardware mechanisms is needed, because independent instructions can be scheduled statically to overlap the memory transfers. With a data cache, this is not possible, because it cannot be predicted statically when cache misses will occur (mainly due to the activity of other execution threads, but even an if-then-else can prevent the static prediction of the cache state, unless additional load instructions are inserted by the compiler, to ensure that the cache state does not depend on the selected branch of the conditional statement; this does not work for external library functions or other execution threads).
With a data cache memory, one or both of SMT and OoOE must be implemented. If out-of-order execution is implemented, then the number of registers needed to avoid false dependencies between instructions becomes larger than it is convenient to encode in the instructions, so register renaming must also be implemented.
And so on.
In conclusion, to avoid the huge amount of resources needed by a CPU for guessing about the programs, the solution would be a high-level language compiler able to transparently allocate the data into a hierarchy of heterogeneous memories and schedule transfers between them when needed, like the compilers do now for register allocation, loading and storing.
Unfortunately nobody has succeeded in demonstrating a good compiler of this kind.
Moreover, the existing compilers frequently have difficulties discovering the optimal allocation and transfer schedule for registers, which is a simpler problem.
Doing the same efficiently for a hierarchy of heterogeneous memories seems out of reach for current compilers.
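To make the idea concrete, here is a toy hand-written version of what such a compiler would have to do automatically: data is copied into a small, fast local buffer (standing in for an explicitly addressed scratchpad) before being worked on, and only the local copy is touched in the inner loop. All names and sizes below are made up for illustration:

```c
/* Toy sketch of explicit staging, done by hand today; the idea above is a
 * compiler performing this placement and copy scheduling automatically. */
#include <string.h>
#include <stddef.h>

#define TILE 256   /* pretend this is the capacity of the fast local memory */

/* sums a large array by copying fixed-size tiles into a small "scratchpad"
 * buffer, then computing only on the local copy */
double sum_staged(const double *big, size_t n) {
    double scratch[TILE];          /* stand-in for explicitly addressed fast memory */
    double total = 0.0;
    for (size_t i = 0; i < n; i += TILE) {
        size_t chunk = (n - i < TILE) ? (n - i) : TILE;
        memcpy(scratch, big + i, chunk * sizeof(double));   /* explicit "DMA" transfer */
        for (size_t j = 0; j < chunk; j++)                  /* compute on the local copy */
            total += scratch[j];
    }
    return total;
}
```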
While not addressing it explicitly, modern languages do already perform cache-size-aware optimizations. Namely, the .NET runtime bases its heap sizes and allocation behavior on the CPU's cache size, so that data is allocated closer together, and further reads and subsequent garbage collection are friendly to the CPU's memory prefetching behavior.
On the other hand, register allocation is a much more complex problem to solve because there is no single heuristic that fits all CPUs (and even operating systems!) sharing the same micro-architecture. As a result, "good enough" is sometimes chosen unless the language relies on LLVM/GCC logic.
We do have these architectures already in the embedded space and as DSPs. I suppose, they would be interesting for supercomputers as well. But for general purpose CPUs they would be a difficult sell. Since the memory size and latency would be part of the ISA, binaries could not run unchanged on different memory configurations, you would need another software layer to take care of that. Context switching and memory mapping would also need some rethinking. Of course, all of this can be solved, but it would make adoption more difficult.
And last but not least, unknown memory latency is not the only source of problems; branch (mis)predictions are another. And they have the same remedies as cache misses: multithreading and speculative execution.
So if you wanted to get rid of branch prediction as well, you could come up with something like the CRAY-1.
You are right that a kind of multi-threading can be useful to mitigate the effects of branch mispredictions.
However, for this, fine-grained multi-threading is enough. Simultaneous multi-threading does not bring any advantage, because the thread with the mispredicted branch cannot progress.
Out-of-order execution cannot be used during branch mispredictions, so, as I said, both SMT and OoOE are techniques useful only when a data cache memory exists.
Any CPU with pipelined instruction execution needs a branch predictor and it needs to execute speculatively the instructions on the predicted path, in order to avoid the pipeline stalls caused by control dependencies between instructions. An instruction cache memory is also always needed for a CPU with pipelined instruction execution, to ensure that the instruction fetch rate is high enough.
Unlike simultaneous multi-threading, fine-grained multi-threading is useful in a CPU without a data cache memory, not only because it can hide the latencies of branch mispredictions, but also because it can hide the latencies of any long operations, like it is done in all GPUs.
Fine-grained multi-threading is significantly simpler to implement than simultaneous multi-threading.
Citation needed. The way I see it, brains have a lot of baked-in behaviour that isn't tuned at runtime; e.g. the length of time that different types of things are kept in different types of memory is remarkably consistent.
How much can SVE instructions help with machine learning?
I've wondered why Apple Silicon made the trade-off decision not to include SVE support yet, given that support for lower-precision FP vectorization seems like it could have made their NVIDIA perf gap smaller.
I don't understand these graphs titled "Branch Predictor Pattern Recognition". What do they mean? Could someone here explain it a bit in detail? Thanks ahead.
It’s showing you how complicated of a branch sequence it can train on. Simple sequences are in the lower left, complex in the top right, and the ideal result would be a flat surface at the bottom. The earlier the surface rises into a mesa, the easier the branch predictor was overwhelmed.
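My rough guess at the kind of microbenchmark behind such a plot (not the article's actual code): execute one branch whose taken/not-taken sequence repeats with a chosen period, count mispredictions with something like `perf stat -e branches,branch-misses`, and sweep the period (and, for the second axis, the number of distinct branches):

```c
/* bp_pattern.c - toy probe of branch-predictor pattern capacity (illustrative).
 * build: gcc -O2 bp_pattern.c -o bp_pattern
 * run:   perf stat -e branches,branch-misses ./bp_pattern 512
 * Once the period exceeds what the predictor can learn, branch-misses jumps
 * toward ~50% of the pattern branches. (Worth checking the disassembly that
 * the if/else stays a real conditional branch and is not if-converted.) */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    long period = (argc > 1) ? atol(argv[1]) : 64;  /* length of the repeating pattern */
    if (period <= 0) period = 64;

    uint8_t *pattern = malloc(period);
    srand(1);                                       /* fixed seed: same pattern every run */
    for (long i = 0; i < period; i++)
        pattern[i] = rand() & 1;

    volatile long taken = 0, not_taken = 0;         /* volatile keeps both paths "live" */
    for (long iter = 0; iter < 200000000L; iter++) {
        if (pattern[iter % period])                 /* the branch under test */
            taken++;
        else
            not_taken++;
    }
    printf("taken=%ld not_taken=%ld\n", taken, not_taken);
    free(pattern);
    return 0;
}
```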
While the article is interesting, I would be more interested in details about carbon footprint and cost reduction. Also, how would this impact more typical Node, Java loads?
AWS is pushing to move its internal services (most of which are in Java) to Graviton, so I would expect it to be excellent for "normal" workloads/languages.
No offense intended to your personal experience, but I don't think "small independent cloud" is terribly important in the global analysis. This paper concludes that TDP and TCO have become the same thing, i.e. power is heat, power is opex.
One key call out re: power from that paper - "Most OpEx cost is for provisioning power and not for electricity use [2], so saving power already provisioned for doesn’t improve TCO as much as one might hope."
I'd peg it closer to 75%. There's some significant opex on the endpoint connectivity side as well. That's why a lot of these providers are buying stakes in some of the transoceanic cable projects (and some terrestrial ones as well). Or they're agreeing not to charge each other (kind of sort of- see open bandwidth alliance).
You know, if you wanted to improve carbon footprint, a better place to look might be at software bloat. The sheer number of times things get encoded and decoded to text is mind boggling. Especially in "typical node, Java loads".
Logging and cybersecurity are bloaty areas as well. I've seen plenty of AWS cost breakdowns where the cybersec functions were the highest percentage of spend. Or desktops where Carbon Black or Windows Defender were using most of the CPU or IO cycles. And networks where syslog traffic was the biggest percentage of traffic.
Norton, Symantec, and McAfee contribute greatly to global warming in the financial services sector. At least half of CPU cycles on employee laptops are devoted to them.
But do they actually work? For years I've been of the opinion that most anti-virus solutions don't actually stop viruses; instead they give you a false sense of security, and their messaging is intentionally alarmist to make individuals and organizations pay their subscription fees.
In my limited and sheltered experience, the only viruses I've gotten in the past decade or so was from dodgy pirated stuff or big "download" button ads on download sites.
Well, a fair number of cybersec-oriented services are essentially a pattern of "sniff and copy every bit of data and do something with it" or "trawl all state". Which is inherently heavy.
This has been looked at; the electricity savings of more efficient programming are too small to be worth it with a fixed compute capacity. However, if you can reduce the amount of machines or machine-time you require, that can become material.
Very interesting! I'm not terribly well versed in ARM vs x86, so it's helpful to see these kinds of benchmarks and reports.
One bit of feedback for the author: the sliding scale is helpful, but the y-axes differ between the visualizations, so you cannot make the apples-to-apples comparison needed. Suggest regenerating those.
I wonder how Graviton 3 compares with M1 Max/Ultra, for the same number of CPU cores.
The GPU and "ML" cores would totally destroy it for some use cases though.
It's likely that it's going to need a post on its own since it's an extremely complicated topic. Someone else wrote an awesome post about this for the Neoverse N2 chips [1] and they found that with LSE atomics, the N2 performs as well as or better than Ice Lake. Given Graviton 3 has a much wider fabric, I would assume this lead only improves.
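For anyone wondering what "LSE atomics" means concretely: on ARMv8.1+ parts (Graviton 2/3 included), a plain C11 atomic increment can compile to a single LSE instruction instead of the older load-exclusive/store-exclusive retry loop, and that loop is what degrades badly under contention. A minimal sketch (the function name is mine):

```c
/* build for an LSE-capable target, e.g.:
 *   gcc -O2 -march=armv8.1-a -c lse_counter.c
 * With LSE the increment below becomes a single LDADD; without it, an
 * LDAXR/STLXR retry loop (or a call into GCC's outline-atomics helpers). */
#include <stdatomic.h>

static atomic_long shared_counter;

void bump_shared_counter(void) {
    atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed);
}
```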
Ah, yes I remember this post, but it reads as pretty cryptic to me. I would like to know what the slowdowns actually become in practice: does it add latency to the execution of other threads, and how will the machine as a whole behave?
I know the M4 had much better multicore shared-memory perf than the M3, but both of those are old now and I don't have users to test anything.
Writing, usually not; but respecting the single-writer principle is usually rule zero of parallel-programming optimisation.
If you mean reading/writing to the same memory bus in general, then yes, the bus needs to be sized according to the needs of the expected loads (i.e. the machine needs to be balanced).
The whole chip in general will be used in aggregate by independent vms/containers etc that do NOT read and write to the same memory. Some kernel datastructures within a given vm are still shared, ditto for within a single process, but good design minimizes that (per cpu/thread data structures, sharded locks, etc etc).
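A tiny sketch of the per-CPU/per-thread pattern mentioned above (all names are made up): each thread writes only its own slot, padded out to a cache line so the writes never bounce a line between cores, and the results are merged once at the end:

```c
/* counters.c - single-writer, per-thread counters (illustrative).
 * build: gcc -O2 -pthread counters.c -o counters */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    10000000L

struct padded_counter {
    long value;
    char pad[64 - sizeof(long)];    /* keep each counter on its own cache line */
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;       /* single writer: only this thread touches this slot */
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    long total = 0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(threads[i], NULL);
        total += counters[i].value; /* merge once, after the writers are done */
    }
    printf("total = %ld\n", total);
    return 0;
}
```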
[1]: https://aws.amazon.com/ec2/pricing/on-demand/