Graviton 3, Apple M2 and Qualcomm 8cx 3rd gen: a URL parsing benchmark (lemire.me)
156 points by ibobev on May 3, 2023 | 95 comments



A comparison with x86_64 CPUs (e.g. those seen in comparable MacBooks and AWS machines) would be useful.

Also, I'm not sure if "correcting" the numbers for 3 GHz is reasonable and reflects real-life performance. Perhaps some throttling could be applied to test the CPUs using a common frequency?
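
My reading is that the "correction" just rescales the measured time by the ratio of the actual clock to 3 GHz, i.e. it pretends the same number of cycles were run at a 3 GHz clock. A rough sketch of that, with made-up numbers:

    # hypothetical sketch of the 3 GHz normalization as I understand it
    def normalize_to_3ghz(ns_per_url: float, clock_ghz: float) -> float:
        return ns_per_url * (clock_ghz / 3.0)

    print(normalize_to_3ghz(200.0, 3.5))  # ~233 ns/url "at 3 GHz"
    print(normalize_to_3ghz(200.0, 2.6))  # ~173 ns/url "at 3 GHz"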


> I'm not sure if "correcting" the numbers for 3 GHz is reasonable and reflects real-life performance

It's not useful at all. It effectively measures IPC (instructions per clock), which is just chip vendor bragging rights.

Assuming that all the chips meet some baseline performance criteria: for datacenter and portable devices, the real benchmark would be "instructions per joule."

For desktop devices "instructions per dollar" would be most relevant.


> It's not useful at all. It effectively measures IPC (instructions per clock), which is just chip vendor bragging rights.

+1. Moreover the author then seems to conclude from this benchmark:

> Overall, these numbers suggest that the Qualcomm processor is competitive.

This is an odd conclusion to draw from this test and these numbers, given how little this benchmark tests (just string operations). Does this benchmark want to test raw CPU power? Then why "normalize" to 3GHz? Does it want to test CPU capabilities? If so why use such a "narrow" test?

IMO this benchmark makes for a good data point, but it's far from enough to draw much of a conclusion from.


>For desktop devices "instructions per dollar" would be most relevant.

Well, it's one measure, but scaling potential and the absolute performance limit matter as well. In use cases where a desktop is performing work to help one or more humans, the cost of salary and other support may utterly drown out even a very expensive, extremely high-power system. I.e., a 1200W workstation being used at maximum power 10 hours a day, 365 days a year (an absurd utilization ratio) at $0.15/kWh (average US electricity cost) would still only cost around $650/year. If it boosted the productivity of a typical tech worker even 1%, it'd pay for itself no problem. It's easy to lose sight sometimes of how historically incredible the bang for the buck is in tech.
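
Spelling that arithmetic out, since the number sounds low until you run it:

    # 1200 W at full tilt, 10 h/day, 365 days/year, $0.15/kWh
    kwh_per_year = 1.2 * 10 * 365        # 4380 kWh
    print(kwh_per_year * 0.15)           # ~$657/year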

I think it's important to remember because some designs that do incredibly well at small sizes run into challenges scaling. Like with CPUs, the nature of silicon fabrication makes it ever more difficult to grow a monolithic die. Switching to a chiplet-based design does bring challenges in absolute efficiency and minimum power, among others, but dramatically improves scalability at the high end. It's a sort of infrastructure tradeoff. Apple, for example, has done incredibly well with its big silicon SoCs from handhelds to portable Macs, but it's been struggling to do a Mac Pro or even an updated Studio. It's not clear that they physically can do something on the level of a modern Epyc chip any more than Intel could with monolithic Xeons.

So normalized performance/watt and performance/$ and so on do matter, but absolute final density and scale up are another part of the matrix for certain use cases too.


Your tech worker only gets about 4.1 days of not working? Wow.

That aside, I think it would be more important to look at the carbon emissions impact of code/CPUs.

That one person parsing URLs might save a few hundred dollars by not optimising their code, but if it is being used billions of times a day the energy (and therefore emissions) impact can be huge.


It's pretty amusing to me how the Megahertz Myth has been inverted nowadays. It used to be everyone cared about the fastest clock speed and irrationally discounted IPC. Now people care about IPC and irrationally discount clock speed. Pretty weird, in my opinion.


Clock speed is comparatively easier to boost.


Maybe for you? If I’m looking at a new CPU for my desktop the only thing really relevant is how many instructions it can crank out per second.

I’m not trying to optimize for cost or energy efficiency.


Instructions per second is a useful performance metric. Instructions per clock isn't.


> For desktop devices "instructions per dollar" would be most relevant.

For cloud customers as well


> For cloud customers as well

Cloud costs are dominated by power delivery and cooling. Both of those are directly influenced by how much power the chip uses to achieve its performance target.

I guess it does indirectly influence dollar cost, but I was referring to MSRP of the chip. As a simple example: the per-chip cost of Graviton is probably enormous (if you factor R&D into the cost of a chip), but it's still cheaper for Amazon customers. Why? Power and cooling.


Disclosure: I used to work on GCE.

I don't understand where these power and cooling mantras came from, but large-scale cloud providers have very low PUE (Google publishes historical data at https://www.google.com/about/datacenters/efficiency/). That means you can take the basic power from a thread, add some for memory, and multiply by just a bit to get Watts. Take that and plug in your favorite $/kWh guess and get a price.

Ignoring Graviton, which doesn't have published power data to my knowledge, you can look up the TDP for a bunch of chips and see that it's a few watts per thread [1]. Similar calculations can be done for RAM. You end up at 5ish watts per thread with some RAM attached. Let's call it 10 all in with cooling or other stuff. Since 10W is 1% of a kW, we end up with .01 kWh per hour. The top hit for "us power commercial rates" [2] says that we should assume 7c per kWh or so. That means our instance, with cooling and overheads, needs to include .01 x .07 => $.0007/hr of power and cooling costs.

A single core w/ 4 GiB of memory on GCP at 3yr commitment rates (so we're focused on the long-term depreciation price) is .009815 + 4x.001316 => $.015/hr or about 20x as much as the power.
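
Same arithmetic as a small sketch, so you can plug in your own per-thread wattage or $/kWh assumptions:

    # ~10 W per thread all-in (core + RAM + cooling/overheads), commercial power ~$0.07/kWh
    power_cost_per_hr = (10 / 1000) * 0.07              # $0.0007/hr

    # GCE 1 vCPU + 4 GiB at 3yr commitment rates (prices quoted above)
    instance_cost_per_hr = 0.009815 + 4 * 0.001316      # ~$0.015/hr

    print(instance_cost_per_hr / power_cost_per_hr)     # ~21x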

tl;dr: Power costs add up, but they are not even close to dominating the costs of cloud pricing.

[1] https://wccftech.com/amd-epyc-7h12-cpu-64-core-zen-2-280w-td...

[2] https://www.statista.com/statistics/190680/us-industrial-con...


Reaching those PUE numbers is pretty damn impressive though.

I think the misconceptions come from enterprise DC environments with traditional hot/cold aisle designs, servers running way cooler than they need to be and peak power requirements leading to overly expensive power/cooling costs. PUE for these is closer to 2.0 than GCE is to 1.0.

If you have something like GCE where you can control for all of those, i.e. run the DCs hot, eliminate transient peaks, source power cheaply and use super-efficient cooling like evaporative or geothermal pumped water, etc., then yeah, it's a completely different ballgame.

It's pretty safe to assume all the hyperscalers are doing all of these things too, they aren't stupid. :)


Which would probably include the cost of the chips in some way, not just electricity.


Your typical DSP beats your typical CPU on power and instructions/sec/dollar by a pretty wide margin.


You have to measure effective (not theoretical peak) instructions on a specific workload.

DSPs will beat CPUs on DSP workloads, but, as expected, utterly fail for any general purpose workload.


Clearly so.


It feels like a pretty useless benchmark as far as translating to real-world performance, but for the heck of it I did ten runs on my three-year-old Thinkpad with a 4.5GHz Intel i7-9750H running Ubuntu and got a best of ns/url=225.63 and a worst of ns/url=275.118, whatever those scores mean.
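
I assume the score is just total parse time divided by the number of URLs. A toy version using Python's standard-library parser (not the Ada C++ parser the post actually benchmarks, so the absolute numbers won't be comparable) to sanity-check my reading:

    import time
    from urllib.parse import urlparse  # stand-in parser, not the one from the post

    urls = ["https://lemire.me/blog/", "https://news.ycombinator.com/item?id=1"] * 50_000

    start = time.perf_counter_ns()
    for u in urls:
        urlparse(u)
    elapsed = time.perf_counter_ns() - start

    print(f"ns/url={elapsed / len(urls):.2f}")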


On my (not idle) i7-12700K workstation with DDR5-4800 memory, using Proxmox VE 7.4 (based on Debian 11) with our 6.2-based Linux kernel, I get something between 139.44 ns/url and 140.66 ns/url (ran it about 10 times in a row), so 1.35x faster than the M2.


One of the comments on the article notes a 7950x doing "time/url=109ns"

so I tossed it on my 5950X and it reports "ns/url=148.994", which is about right from what I understand (Zen 4 gets a nice ~30% bump on a number of Linux benchmarks over Zen 3); here it's 35% with only a 16% clock advantage.

Which is about the diff between the M2 and my old 5950X, which is on a generation-older process, etc.

So, yeah, the current M2 is slower than both previous-gen Intel and AMD in this benchmark.


I consistently get between 154 and 156 ns/url when running on an M2 Air 2022 while using Asahi Linux instead of macOS. 22 Celsius according to smartmontools :)


So it seems a good part of the speed-up comes from using the Linux kernel, which tbh doesn't really surprise me. It's hard to match the amount and quality of engineering getting poured into it (well, at least for the core stuff; some niche driver will be a mess, like everywhere). The remainder then comes from a modern x86_64/amd64 CPU combined with DDR5 beating the M2, well, at least in this microbenchmark.


In my totally unscientific (but consistent) benchmarks for our CI build servers, m6g.8xlarge compiles our C++ codebase in about 9.5 minutes, whereas m6a.8xlarge takes about 11 minutes. The price difference is about 20% as well IIRC, so it's generally a good deal.

Of course the types of optimisations that a compiler may (or may not) do on aarch64 vs x86_64 are completely different and may explain the difference (we actually compile with -march=haswell for x86_64), but generally Graviton seems like a really good deal.


You're probably leaving a lot of performance on the floor if you're building for haskell and running on skylakeish or newer.

Edit: yes, haswell:-)


*haswell, in case the Haskell people come after you.


"don't poke the endofunctor"


Ah, that makes more sense.


Yeah we know, it’s just that we ship these binaries to customers and need a reasonably old architecture that we know “almost everyone” uses.


Why not use the c-type instances? Is the compilation memory-bound? I’d expect it to be the CPU that’s the limiting factor.


Heavy C++ template metaprogramming, we need up to 4GB of memory per process in some situations.


It's a benchmark of GitHub Actions (Azure) vs a really old MacBook Pro 15, not exactly what you are looking for, but it gives you the vibe already.

https://buildjet.com/for-github-actions/blog/a-performance-r...


This is a big, general problem with CI providers I don't hear talked about enough: because they charge per-minute, they are actively incentivized to run on old hardware, slowing builds and milking more from customers in the process. Doubly so when your CI is hosted by a major cloud provider who would otherwise have to scrap these old machines.

I wish this were only a theoretical concern, a theoretical incentive, but it's not. GitHub Actions is slow, and GitLab suffers from a similar problem; their hosted SaaS runners are on GCP n1-standard-1 machines. The oldest machine type in GCP's fleet, the n1-standard-1 is powered by a variety of dusty old CPUs Google Cloud has no other use for, from Sandy Bridge to Skylake. That's a 12-year-old CPU.


There are workloads where newer CPUs are dramatically faster (e.g. AVX-512), but in general the difference isn't huge. Most of what the newer CPUs get you is more cores and higher power efficiency, which you don't care about when you're paying per-vCPU. Which vCPU is faster, a ten year old Xeon E5-2643 v2 at 3.5GHz or a two year old Xeon Platinum 8352V at 2.1GHz? It depends on the workload. Which has more memory bandwidth per core?

But the cloud provider prefers the latter because it has 500% more cores for 50% more power. Which is why the latter still goes for >$2000 and the former is <$15.


> Which vCPU is faster, a ten year old Xeon E5-2643 v2 at 3.5GHz or a two year old Xeon Platinum 8352V at 2.1GHz? It depends on the workload.

It really does not depend on the workload, when those workloads we're talking about are by-and-large bounded to 1vCPU or less (CI jobs, serverless functions, etc). Ice Lake cores are substantially faster than Ivy Bridge; the 8352V will be faster in practically any workload we're talking about.

However, I do agree with this take, if we're talking about, say, lambda functions. The reason being that the vast majority of workloads built on lambda functions are bounded by IO, not compute; so newer core designs won't result in a meaningful improvement in function execution. Put another way: Is a function executing in 75ms instead of 80ms worth paying 30% more? (I made these numbers up, but it's the illustration that matters).

CI is a different story. CI runs are only bound by IO for the smallest of projects; downloading that 800mb node:18 base docker image takes some time, but it can very easily and quickly be dwarfed by all the things that happen afterward. This is not an uncontroversial opinion; "the CI is slow" is such a meme of a problem at engineering companies nowadays that you'd think more people would have the sense to look at the common denominator (the CI hosts suck) and not blame themselves (though, often there's blame to go around). We've got a project that can build locally, M2 Pro, docker pull and push included, in something like 40 seconds; the CI takes 4 minutes. It's the crusty CPUs; it's slow networking; it's the "step 1 is finished, wait 10 seconds for the orchestrator to realize it and start step 2".

And I think we, the community, need to be more vocal about this when speaking on platforms that charge by the minute. They are clearly incentivized to leave it shitty. It should even surface in discussions about, for example, the markup of Lambda versus EC2. A 4096MB Lambda function would cost $172/mo if run 24/7, back-to-back. A comparable c6i.large: $62/mo; a third the price. That's bad enough on the surface, and we need to be cognizant that it's even worse than it initially appears, because Amazon runs Lambda on whatever they have collecting dust in the closet, and people still report getting Ivy Bridge and Haswell cores sometimes, in 2023; and the better comparison is probably a t2.medium @ $33/mo; a 5-6x markup.
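
For anyone checking the $172 figure: if I have the rate right (~$0.0000166667 per GB-second for x86 in most regions), a 4096MB function running back-to-back for a 30-day month works out to about that, request charges excluded:

    gb = 4096 / 1024                      # 4 GB
    price_per_gb_second = 0.0000166667    # assumed published x86 Lambda rate
    seconds_per_month = 3600 * 24 * 30

    print(gb * price_per_gb_second * seconds_per_month)   # ~$172.8/month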

This isn't new information; Lambda is crazy expensive, blah blah blah; but I don't hear that dimension brought up enough. Calling back to my previous point: Is a function executing in 75ms instead of 80ms worth paying 30% more? Well, we're already paying 550% more; the fact that it doesn't execute in 75ms by default is abhorrent. Put another way: if Lambda, and other serverless systems like it such as hosted CI runners, enable cloud providers to keep old hardware around far longer than performance improvements say it should stick around, then the markup should not be 500%. We're doing Amazon a favor by using Lambda.


> It really does not depend on the workload, when those workloads we're talking about are by-and-large bounded to 1vCPU or less (CI jobs, serverless functions, etc). Ice Lake cores are substantially faster than Ivy Bridge; the 8352V will be faster in practically any workload we're talking about.

If you were comparing e.g. the E5-2667v2 to the Xeon Gold 6334 you would be right, because they have the same number of cores and the 6334 has a higher rather than lower clock speed.

But the newer CPUs support more cores per socket. The E5-2643v2 has 6, the Xeon Platinum 8352V has 36.

To make that fit in the power budget, it has a lower base clock, which eats a huge chunk out of Ice Lake's IPC advantage. Then the newer CPU has around twice as much L3 cache, 54MB vs. 25MB, but that's for six times as many cores. You get 1.5MB/core instead of >4MB/core. It has just over three times the memory bandwidth (8xDDR4-2933 vs. 4xDDR3-1866), but again six times as many cores, so around half as much per core. It can easily be slower despite being newer, even when you're compute bound.
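
Rough per-core numbers, assuming nominal 64-bit channels and peak transfer rates:

    # L3 cache per core
    print(25 / 6)     # E5-2643 v2: ~4.2 MB/core
    print(54 / 36)    # Xeon Platinum 8352V: 1.5 MB/core

    # peak memory bandwidth in GB/s = channels * MT/s * 8 bytes / 1000
    bw_ivb = 4 * 1866 * 8 / 1000    # ~59.7 GB/s  -> ~10.0 GB/s per core (6 cores)
    bw_icl = 8 * 2933 * 8 / 1000    # ~187.7 GB/s -> ~5.2 GB/s per core (36 cores)
    print(bw_icl / bw_ivb, bw_ivb / 6, bw_icl / 36)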

> We've got a project that can build locally, M2 Pro, docker pull and push included, in something like 40 seconds; the CI takes 4 minutes. Its the crusty CPUs; its slow networking; its the "step 1 is finished, wait 10 seconds for the orchestrator to realize it and start step 2".

Inefficient code and slow hardware are two different things. You can have the fastest machine in the world that finishes step 1 in 4ms and still be waiting 10 full seconds if the system is using a timer.

But they're operating in a competitive market. If you want a faster system, patronize a company that provides one. Just don't be surprised if it costs more.


Lambda is good for bursty, typically low-activity applications, where it just wouldn't make sense to have EC2 instances running 24x7. Think of some line-of-business app that gets a couple of requests every minute or so. Maybe once a quarter there will be a spike in usage. Lambda scales up and just handles it. If requests execute in 50ms (unlikely!) or 500ms, it just doesn't matter.


Lambda is not crazy expensive. It’s expensive if you’re running something 24/7 in place of a physical server or VM.


It would be nice if I could still use all the Lambda functionality and tooling but instead have it running in my own VMs with a long-term commitment.


Not quite sure I follow. But I built an ASP.NET API and deployed it to Lambda, and it cost $2/m; when it started to get more traffic and the cost got to $20/m, I moved it to a t4g instance.

When I moved it, I didn’t need to make any code changes :) I just made a systemd file and deployed it.


For this to be true, IPC would have to have stagnated for 10 years, which is not the case. Look at Agner's instruction tables for different uarchs and compare.


Sometimes when I run a lot of builds in a short period of time I feel like I get demoted to the slower boxes.


perf/watt/density would be the best way to measure. Frequency should NOT be normalized because x86 chips are designed to be high frequency parts. That is why measuring IPC is meaningless. A cutting edge ARM chip can possibly beat a modern x86 chip in terms of IPC, but the x86 chip will almost certainly scale to higher frequencies. Certain models of EPYC Genoa (Zen 4) for example, boost to 4GHz, and both Intel and AMD have chips that boost to 5.7+ GHz.

Density is important because if you can only have a max of, for example, 20 cores for an ARM solution, but 96 cores in the case of EPYC Genoa, Genoa is going to win out for any multi-core workload.


The author very clearly and explicitly says in the post that this is "purely to help us think".


Without more context about what the code actually does this doesn't tell me all that much, other than what I could guess from the intended usecases of the chips.

The strength of Apple silicon is that it can crush benchmarks and transfer that power very well to real-world concurrent workloads too. E.g., is this basically just measuring the L1 latency? If not, are the compilers generating the right instructions, etc.? (One would assume they are, but I have had issues with getting good ARM codegen previously, only to find that the compiler couldn't work out what ISA to target other than a conservative guess.)


It does seem the benchmark has its data in cache, based on the timings.

If the benchmark were only measuring L1 latency, what would that imply about the ‘scaling by inverse clock speed’ bit? My guess is as follows. Chips with higher clock rates will be penalised: (a) it is harder to decrease latencies (memory, pipeline length, etc) in absolute terms than run at a higher clock speed to maybe do non-memory things faster; and (b) if you’re waiting 5ns to read some data, that hurts you more after the scaling if your clock speed is higher. The fact that the M1 wins after the scaling despite the higher clock rate suggests to me that either they have a big advantage on memory latency or there’s some non-memory-latency advantage in scheduling or branch prediction that leads to more useful instructions being retired per cycle.
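
To make (b) concrete with invented numbers: model the per-URL time as a frequency-bound chunk plus a fixed memory-latency chunk, then apply the same rescaling:

    def normalized_ns(core_cycles: float, mem_ns: float, clock_ghz: float) -> float:
        measured_ns = core_cycles / clock_ghz + mem_ns   # real time at the real clock
        return measured_ns * (clock_ghz / 3.0)           # rescaled to a pretend 3 GHz

    # identical work (300 cycles of compute + 50 ns of stalls) on two clocks:
    print(normalized_ns(300, 50, 2.6))   # ~143 ns "at 3 GHz"
    print(normalized_ns(300, 50, 3.5))   # ~158 ns "at 3 GHz"

So the faster-clocked chip looks worse after the scaling even though it did the same work in less wall time.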

But maybe I’m interpreting it the wrong way.


Git repo available from post.


I ain't readin' all that (OK maybe I will but lemire does do this quite a lot, his blog is 40% gems 60% slightly sloppy borderline factoids that only make sense if you think in exactly the same way he does)


Yes. Since it's Ada, I'm suspicious of codegen tuning being a major factor here.


"Ada is a fast and spec-compliant URL parser written in C++."

Wouldn't modern C++ compilers have decent codegen tuning for all these platforms?


Is there something specific about this library that makes you suspicious, or are you assuming from the name that this is using the Ada programming language?


Part of what they don't mention is that Graviton 3 and the Snapdragon 8cx Gen 3 have pretty much the same processor core. The Neoverse V1 is only a slightly modified Cortex X1. Hence the same results when you account for clock frequency.


If I remember correctly,

The 8cx 3rd Gen is based on Cortex X1, the same as Snapdragon 888.

The Graviton 3 is based on the Neoverse V1, which in itself is a tweaked version of the Neoverse N1 with relaxed perf / watt / die area and much improved SIMD workloads. And the N1 is based on the Cortex X1.

The current Snapdragon is on Cortex X3. With Cortex X4 coming soon.


FWIW:

- on an AMD Ryzen 5 Pro 4650U (laptop) I get 270 ns/url

- on an Ampere Altra Max M128 I get 340ns/url

So yeah, this shows how great Apple's M2 is


Apple M2 is 190 ns/url for those interested


tried on ~2021 i9-11950H (11th Gen Intel, laptop), got 166 ns/url


To me it would be somewhat more interesting to compare head-to-head mobile CPUs instead of comparing laptops and servers. In this particular microbenchmark, mobile 12th and 13th-generation Core performance cores, and even the efficiency cores on the 13th generation, are faster than the M2.


Intel 12th and 13th gen both use the same efficiency cores.


Well, on the ones I happen to have on hand the 12th gen hits 3300MHz and the 13th gen goes all the way to 4200MHz.


Yeah well, the efficiency cores on a 12900K will be faster than those on an N100, so what is your point?

We're discussing overall compute power differences between CPU architectures, minute differences in performance between identical-architecture CPU cores stemming from higher clock speeds is outside the scope of this discussion.


Not for me. I understand and have had this reaction, but it sort of undervalues the related info. I learned a bit from him and from your response, for example.


Even when they are faster, the M2 is on the same die as the RAM, and the bandwidth and latency are way better. That matters for compilation, or am I mistaken?


You are mistaken. Just like everyone else who has ever repeated the myth that Apple puts large-scale, high-performance logic and high density DRAM on the same die, which is impossible.

Apple uses LPDDR4 modules soldered to a PCB, sourced from the same Korean company that everyone else uses. Intel has used the exact same architecture since Cannon Lake, in 2018.


For the sake of the conversation,

> Apple uses LPDDR4 modules soldered to a PCB, sourced from the same Korean company that everyone else uses.

Apple might have sourced and soldered on LPDDR modules from the same company, but it is not LPDDR4; it is LPDDR5-6400 connected via a 256-bit (Pro), 512-bit (Max) or 1024-bit (Ultra) wide memory bus.

> Intel has used the exact same architecture since Cannon Lake, in 2018.

Other than the fact that LPDDR5-6400 did not exist in 2018, no Intel CPU has ever used a 512-bit, let alone 1024-bit, wide memory bus, even in a server setup. A wide memory bus is conducive to faster concurrent complex builds, and Rust / Haskell builds show significantly faster build speeds as well.
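
If those widths are right, the headline bandwidth figures fall straight out of transfer rate times bus width (nominal peak numbers):

    mt_per_s = 6400   # LPDDR5-6400
    for name, bus_bits in [("Pro", 256), ("Max", 512), ("Ultra", 1024)]:
        print(name, mt_per_s * (bus_bits / 8) / 1000, "GB/s")   # 204.8, 409.6, 819.2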


Case in point

> Rust, cargo build for our "test world binary": 64-core Threadripper 3990x @ PBO level 3, ~400W, with Optane drive - 1 min 37s
> M1 Max, 64 GB, on battery @ 30% charge - 1 min 34s

https://twitter.com/kiratpandya/status/1457438725680480257


Cannon Lake is a weird red herring to bring in to the discussion, because it was only a real product in the narrowest sense possible. It may have technically been the first CPU Intel shipped with LPDDR4 support (or was it one of their Atom-based chips?), but the exact generation of LPDDR isn't really relevant because both Apple and Intel have supported multiple generations of LPDDR over the years and both have moved past 4 and 4x to 5 now.

What is somewhat relevant as the source of confusion here is that Apple puts the DRAM on the same package as the processor rather than on the motherboard nearby like is almost always done for x86 systems that use LPDDR. (But there's at least one upcoming Intel system that's been announced as putting the processor and LPDDR on a shared module that is itself then soldered to the motherboard.) That packaging detail probably doesn't matter much for the entry-level Apple chips that use the same memory bus width as x86 processors, but may be more important for the high-end parts with GPU-like wide memory busses.


Look, I had to go one way or the other: either I said it originated with Ice Lake mobile, and a pedant would come along and say Cannon Lake even though almost nobody physically possesses a Cannon Lake laptop, or I could say Cannon Lake, be pedantically correct, and get your response. Tiger Lake was probably even more widespread and that also predates Apple Silicon.


None of those possibilities get you anywhere closer to being right. LPDDR didn't start with LPDDR4 so it doesn't matter which Intel processor was first to support LPDDR4. There's nothing special about that generation for Intel or Apple.


I don't think that is true.

Last I checked, Apple M1 Max chips have up to 800GB/s throughput, whilst AMD's high-end chips taper out at around ~250GB/s or so, closer to what a standard M2 chip does (not the Max or Pro version). At the top end they've got at least 2x the memory bandwidth of other CPU vendors, and that's likely the case further down too.


It's worth noting that Apple's stated bandwidth numbers aren't really attainable in CPU-only workloads, so for the sake of comparison to discrete CPUs they're not that useful. The M1 Max is advertised as having 400GB/sec bandwidth, but the CPU cores can only use about half of that - still impressive, but not as impressive as you might think from the marketing.

(The M1 Ultra is the one with 800GB/sec bandwidth on paper)


AnandTech has measured about 104 GB/sec per CPU core for the M1 Max and circa 210 GB/sec per CPU cluster. The M1 Max has 3 CPU clusters: 1 power-efficient (2 cores) and 2 performance ones (4 cores each).

Where it gets interesting is how access to the memory is multiplexed in the most extreme case, where 3 CPU clusters, 32 GPU cores and 16 ANE cores are all attempting to fetch memory blocks at different locations at once. It is not unreasonable to presuppose that Apple Silicon contraptions use a switched memory architecture; however, with such a high degree of parallelism it is very intriguing to know how the memory architecture has actually been designed and optimised. High-performance memory access has always been a big deal that usually comes with big money attached to it, naturally via a connection that is not the memory bus itself.


The Mx Max should be compared to a discrete CPU+GPU combination that does have comparable total memory bandwidth. It isn't automatically better to put everything on one chip.


It’s not on die, but isn’t it on package? Thus making signal lengths much shorter and allowing more lines making the memory bandwidth much higher than a (theoretical) equivalent chip with the memory soldered to the motherboard like the Intel machines were?


Yea theoretically, but in memory latency tests the M series chips are well behind Intel and AMD systems.


Thank you for the correction then. Ah yes, the soldered LPDDR4 dies allow higher bandwidths since more pins allow parallel access.


It looks like there's been some good progress on getting Linux running natively on the Windows Dev Kit 2023 hardware[0]. There was a previous discussion here about this hardware back in 2022-11[1].

[0]: https://github.com/linux-surface/surface-pro-x/issues/43

[1]: https://news.ycombinator.com/item?id=33418044


> Note that you cannot blindly correct for frequency in this manner because it is not physically possible to just change the frequency as I did

They're also not the same core architecture? Comparing ARM chips that conform to the same spec won't necessarily scale the same across frequencies. Even if all of these CPUs did have scaling clock speeds, their core logic is not the same. Hell, even the Firestorm and Icestorm cores on the M1 SOC shouldn't be considered directly comparable if you scale the clock speeds.


That's the point. He knows they're different architectures (although X1 and V1 are related) so normalizing frequency exposes the architectural differences.


We've been having good success with Graviton 3 for our small data-processing workloads.

It seems with Apple now focusing on AArch64 the architecture is being taken seriously by developers (because, ya know, phones, Raspberry Pis, etc. weren't enough :)), and there is good support for most software I use day-to-day.

Now it seems Amazon is pushing HARD on Graviton, as the release of 3 has seen many services now being featured running on it. They made great progress with Graviton 3, so it's not hard to see why.

Exciting times :)


It's also due to cost. At scale it's cheaper to produce your own processors than to buy them from a vendor.

Not to mention performance and energy consumption benefits.

Phones had it to begin with due to the low power consumption and simple instruction set. Then Moore's law did the rest, and now they're powerful enough for data centers.


Not that it's for sure, but M3 is probably coming out late this year/early next year and will be on 3nm, so once again having a huge node advantage. Just seems like Apple will have the latest node before everyone else for the foreseeable future.


Apple pays a premium to TSMC to reserve the early runs on the next gen nodes. They can do this because they can charge their users a premium for Apple devices. I am not sure the rest of the players have that much pricing power or margins.


Plus they can guarantee big volumes. Even if only the pros get the new chip this year as rumored that’s still a very large order to make.

Next year it’s likely all iPhones (plus possibly iPads) will also be on the new process.

It looks like Apple sells around 200 million iPhones a year, and the two pro models are somewhat more popular combined than the non-pros.

So even if we assume 2/3rds of Apple sales are older models, the pros would still need around 40 million chips on the new process in the first year.

For comparison it looks like AMD sells about 80 million chips a year, across all CPU models.


Actually, both Apple and AMD have reduced their orders with TSMC due to expected drop in sales.

https://wccftech.com/tsmc-faces-order-cutback-from-major-5nm...


Nvidia does, which is why they bought out the 4nm node (and beat Apple in GPU compute by a country mile).


Apple A16 is on N4 too.


Seems weird to compare the c7g.large vs m2 and not the largest VM sizes.


It's a single threaded test that doesn't use a lot of memory.


I think the performance of my oracle free instance ( arm cpu ) is 10x worse than those results.


my oracle free instance uses mariadb instead of mysql, but i'm guessing you meant that as the free instance provided by oracle instead of an instance not using anything from oracle. =)


Yes I'm talking about: https://docs.oracle.com/en-us/iaas/Content/FreeTier/freetier...

Ampere A1 Compute instances


There's probably some virtualization overhead on the c7g.large, right?


Not a lot, it's using Firecracker: https://firecracker-microvm.github.io/


EC2 has virtually no overhead these days.


For comparison I got ns/url=194.976 on Apple MacBook Pro M1.


That makes sense. The M2 is extremely similar to the M1 for a single core.



