A comparison with x86_64 CPUs (e.g. those seen in comparable MacBooks and AWS machines) would be useful.
Also, I'm not sure if "correcting" the numbers for 3 GHz is reasonable and reflects real-life performance. Perhaps some throttling could be applied to test the CPUs using a common frequency?
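For context, my understanding of the "correction" is just re-expressing the measured time in cycles and then pretending every chip ran at 3 GHz. A toy sketch of that (not the author's actual code, and the numbers below are made up):

```python
# Toy sketch of what I understand the 3 GHz "correction" to be; not the
# author's code, and the example numbers are invented.
def normalize_to_3ghz(ns_per_url: float, measured_ghz: float) -> float:
    cycles_per_url = ns_per_url * measured_ghz   # cycles actually spent per URL
    return cycles_per_url / 3.0                  # time it "would" take at a flat 3 GHz

print(normalize_to_3ghz(150.0, 3.5))   # 175.0 -> a 3.5 GHz chip gets penalized
```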
> I'm not sure if "correcting" the numbers for 3 GHz is reasonable and reflects real-life performance
It's not useful at all. It effectively measures IPC (instructions per clock), which is just chip vendor bragging rights.
Assuming that all the chips meet some baseline performance criteria: for datacenter and portable devices, the real benchmark would be "instructions per joule."
For desktop devices "instructions per dollar" would be most relevant.
> It's not useful at all. It effectively measures IPC (instructions per clock), which is just chip vendor bragging rights.
+1. Moreover the author then seems to conclude from this benchmark:
> Overall, these numbers suggest that the Qualcomm processor is competitive.
This is an odd conclusion to draw from this test and these numbers, given how little this benchmark tests (just string operations). Does this benchmark want to test raw CPU power? Then why "normalize" to 3GHz? Does it want to test CPU capabilities? If so why use such a "narrow" test?
IMO this benchmark makes for a good data point, but it's far from enough to draw much of a conclusion from.
>For desktop devices "instructions per dollar" would be most relevant.
Well, it's one measure, but scaling potential and the absolute performance limit matter as well. In use cases where a desktop is performing work to help one or more humans, the cost of salary and other support may utterly drown out even a very expensive, extremely high-power system. I.e., a 1200 W workstation being used at maximum power 10 hours a day, 365 days a year (an absurd utilization ratio) at $0.15/kWh (average US electricity cost) would still only be around $650/year. If it boosted the productivity of a typical tech worker even 1% it'd pay for itself, no problem. It's easy to lose sight sometimes of how historically incredible the bang for the buck is in tech.
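Spelling out that electricity figure, with the same assumptions as above:

```python
# Worked version of the workstation electricity estimate above.
watts, hours_per_day, days, usd_per_kwh = 1200, 10, 365, 0.15
kwh_per_year = watts / 1000 * hours_per_day * days    # 4380 kWh/year
print(kwh_per_year * usd_per_kwh)                     # ~$657/year
```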
I think it's important to remember because some designs that do incredibly well at small sizes run into challenges scaling. Like with CPUs, the nature of silicon fabrication makes it ever more difficult to grow a monolithic die. Switching to a chiplet-based design does mean absolute-efficiency and minimum-power challenges, amongst others, but it dramatically improves scalability at the high end. It's a sort of infrastructure tradeoff. Apple, for example, has done incredibly well with its big silicon SoCs from handhelds to portable Macs, but it's been struggling to do a Mac Pro or even an updated Studio. It's not clear that they physically can do something on the level of a modern Epyc chip any more than Intel could with monolithic Xeons.
So normalized performance/watt and performance/$ and so on do matter, but absolute final density and scale up are another part of the matrix for certain use cases too.
Your tech worker only gets about 4.1 days of not working? Wow.
That aside, I think it would be more important to look at the carbon emissions impact of code/CPUs.
That one person parsing URLs might save a few hundred dollars by not optimising their code, but if it is being used billions of times a day the energy (and therefore emissions) impact can be huge.
It's pretty amusing to me how the Megahertz Myth has been inverted nowadays. It used to be everyone cared about the fastest clock speed and irrationally discounted IPC. Now people care about IPC and irrationally discount clock speed. Pretty weird, in my opinion.
Cloud costs are dominated by power delivery and cooling. Both of those are directly influenced by how much power the chip uses to achieve its performance target.
I guess it does indirectly influence dollar cost, but I was referring to MSRP of the chip. As a simple example: the per-chip cost of Graviton is probably enormous (if you factor R&D into the cost of a chip), but it's still cheaper for Amazon customers. Why? Power and cooling.
I don't understand where these power and cooling mantras came from, but large-scale cloud providers have very low PUE (Google publishes historical data at https://www.google.com/about/datacenters/efficiency/). That means you can take the basic power from a thread, add some for memory, and multiply by just a bit to get Watts. Take that and plug in your favorite $/kWh guess and get a price.
Ignoring Graviton, which doesn't have published power data to my knowledge, you can look up the TDP for a bunch of chips and see that it's a few Watts per thread [1]. Similar calculations can be done for RAM. You end up at 5ish Watts per thread with some RAM attached. Let's call it 10 all in with cooling or other stuff. Since 10W is 1% of a kW, we end up with .01 kWh per hour. The top hit for "us power commercial rates" [2] says that we should assume 7c per kWh or so. That means our instance, with cooling and overheads, needs to include .01 x .07 => $.0007/hr of power and cooling costs.
A single core w/ 4 GiB of memory on GCP at 3yr commitment rates (so we're focused on the long-term depreciation price) is .009815 + 4x.001316 => $.015/hr or about 20x as much as the power.
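Putting that arithmetic in one place (same assumed inputs as above, nothing measured):

```python
# Sketch of the power-vs-price comparison above; all inputs are the same
# assumptions as in the text, not measured data.
watts_all_in = 10                     # ~5 W thread+RAM, doubled for cooling/overhead
usd_per_kwh = 0.07                    # rough US commercial rate
power_cost_per_hr = watts_all_in / 1000 * usd_per_kwh      # $0.0007/hr

vcpu_hr, gib_hr = 0.009815, 0.001316                       # GCP 3-yr commitment prices above
instance_cost_per_hr = vcpu_hr + 4 * gib_hr                # ~$0.015/hr for 1 vCPU + 4 GiB

print(power_cost_per_hr, instance_cost_per_hr,
      instance_cost_per_hr / power_cost_per_hr)            # roughly 20x
```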
tl;dr: Power costs add up, but they are not even close to dominating the costs of cloud pricing.
Reaching those PUE numbers is pretty damn impressive though.
I think the misconceptions come from enterprise DC environments with traditional hot/cold aisle designs, servers running way cooler than they need to be, and peak power requirements leading to overly expensive power/cooling costs. PUE for these is closer to 2.0 than GCE is to 1.0.
If you have something like GCE where you can control for all of those, i.e. run the DCs hot, eliminate transient peaks, source power cheaply and use super-efficient cooling like evaporative or geothermal pumped water, etc., then yeah, it's a completely different ballgame.
It's pretty safe to assume all the hyperscalers are doing all of these things too, they aren't stupid. :)
It feels like a pretty useless benchmark as far as translating to real-world performance, but for the heck of it I ran ten benchmarks on my three-year-old Thinkpad with a 4.5 GHz Intel i7-9750H running Ubuntu and got a best of ns/url=225.63 and a worst of ns/url=275.118, whatever those scores mean.
On my (not idle) i7-12700K with DDR5 4800 memory Workstation, using Proxmox VE 7.4 (based on Debian 11) with our 6.2 based Linux kernel, I get something between 139.44 ns/url and 140.66 ns/url (ran it about 10 times in a row), so 1.35x faster than the M2.
One of the comments on the article notes a 7950x doing "time/url=109ns"
so I tossed it on my 5950x and it reports "ns/url=148.994", which is about right from what I understand (Zen 4 gets a nice ~30% bump over Zen 3 on a number of Linux benchmarks); here it's 35% with only a 16% clock advantage.
Which is about the diff between the M2 and my old 5950X in a generation older process/etc.
So, yeah, the current M2 is slower than both previous-gen Intel and AMD in this benchmark.
I consistently get between 154 and 156 ns/url when running on an M2 Air 2022 while using Asahi Linux instead of macOS.
22 Celsius according to smartmontools :)
So it seems a good part of the speed-up comes from using the Linux kernel, which, tbh, doesn't really surprise me.
It's hard to match the amount and quality of engineering getting poured into it (well, at least the core stuff; some niche driver will be a mess like everywhere).
The remainder then comes from a modern x86_64/amd64 CPU combined with DDR5 beating the M2, well, at least in this microbenchmark.
In my totally unscientific (but consistent) benchmarks for our CI build servers, an m6g.8xlarge compiles our C++ codebase in about 9.5 minutes, whereas an m6a.8xlarge takes about 11 minutes. The price difference is about 20% as well, IIRC, so it's generally a good deal.
Of course the types of optimisations that a compiler may (or may not) do on aarch64 vs x86_64 are completely different and may explain the difference (we actually compile with -march=haswell for x86_64), but generally Graviton seems like a really good deal.
This is a big, general problem with CI providers I don't hear talked about enough: because they charge per minute, they are actively incentivized to run on old hardware, slowing builds and milking more from customers in the process. Doubly so when your CI is hosted by a major cloud provider who would otherwise have to scrap these old machines.
I wish this were only a theoretical concern, a theoretical incentive, but it's not. GitHub Actions is slow, and GitLab suffers from a similar problem; their hosted SaaS runners are on GCP n1-standard-1 machines. The oldest machine type in GCP's fleet, the n1-standard-1 is powered by a variety of dusty old CPUs Google Cloud has no other use for, from Sandy Bridge to Skylake. That's a 12-year-old CPU.
There are workloads where newer CPUs are dramatically faster (e.g. AVX-512), but in general the difference isn't huge. Most of what the newer CPUs get you is more cores and higher power efficiency, which you don't care about when you're paying per-vCPU. Which vCPU is faster, a ten year old Xeon E5-2643 v2 at 3.5GHz or a two year old Xeon Platinum 8352V at 2.1GHz? It depends on the workload. Which has more memory bandwidth per core?
But the cloud provider prefers the latter because it has 500% more cores for 50% more power. Which is why the latter still goes for >$2000 and the former is <$15.
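Using the TDP figures from Intel's spec pages (my lookups, not numbers from this thread), the "500% more cores for 50% more power" works out:

```python
# Core count and TDP per socket, as listed by Intel (my figures, not the parent's).
old_cores, old_tdp_w = 6, 130     # Xeon E5-2643 v2
new_cores, new_tdp_w = 36, 195    # Xeon Platinum 8352V
print(new_cores / old_cores)      # 6.0 -> 500% more cores
print(new_tdp_w / old_tdp_w)      # 1.5 -> 50% more power
```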
> Which vCPU is faster, a ten year old Xeon E5-2643 v2 at 3.5GHz or a two year old Xeon Platinum 8352V at 2.1GHz? It depends on the workload.
It really does not depend on the workload, when those workloads we're talking about are by-and-large bounded to 1vCPU or less (CI jobs, serverless functions, etc). Ice Lake cores are substantially faster than Ivy Bridge; the 8352V will be faster in practically any workload we're talking about.
However, I do agree with this take if we're talking about, say, Lambda functions. The reason being that the vast majority of workloads built on Lambda functions are bounded by IO, not compute, so newer core designs won't result in a meaningful improvement in function execution. Put another way: Is a function executing in 75ms instead of 80ms worth paying 30% more? (I made these numbers up, but it's the illustration that matters.)
CI is a different story. CI runs are only bound by IO for the smallest of projects; downloading that 800 MB node:18 base Docker image takes some time, but it can very easily and quickly be dwarfed by all the things that happen afterward. This shouldn't be a controversial opinion; "the CI is slow" is such a meme of a problem at engineering companies nowadays that you'd think more people would have the sense to look at the common denominator (the CI hosts suck) and not blame themselves (though, often there's blame to go around). We've got a project that can build locally, M2 Pro, docker pull and push included, in something like 40 seconds; the CI takes 4 minutes. It's the crusty CPUs; it's slow networking; it's the "step 1 is finished, wait 10 seconds for the orchestrator to realize it and start step 2".
And I think we, the community, need to be more vocal about this when speaking on platforms that charge by the minute. They are clearly incentivized to leave it shitty. It should even surface in discussions about, for example, the markup of Lambda versus EC2. A 4096 MB Lambda function would cost $172/mo if run 24/7, back-to-back. A comparable c6i.large: $62/mo; a third the price. That's bad enough on the surface, and we need to be cognizant that it's even worse than it initially appears, because Amazon runs Lambda on whatever they have collecting dust in the closet, and people still report getting Ivy Bridge and Haswell cores sometimes, in 2023; and the better comparison is probably a t2.medium @ $33/mo; a 5-6x markup.
This isn't new information; Lambda is crazy expensive; blah blah blah; but I don't hear that dimension brought up enough. Calling back to my previous point: Is a function executing in 75ms instead of 80ms worth paying 30% more? Well, we're already paying 550% more; the fact that it doesn't execute in 75ms by default is abhorrent. Put another way: if Lambda, and other serverless systems like it such as hosted CI runners, enable cloud providers to keep old hardware around far longer than performance improvements say they should, the markup should not be 500%. We're doing Amazon a favor by using Lambda.
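For what it's worth, the $172/mo figure lines up with the published x86 Lambda GB-second rate (about $0.0000166667/GB-s; per-request charges ignored and a 30-day month assumed):

```python
# Rough sketch of the Lambda-vs-EC2 markup above. The GB-second rate is the
# published x86 Lambda figure; per-request charges are ignored.
gb = 4096 / 1024                       # 4 GB function
gb_second_rate = 0.0000166667          # USD per GB-second (x86 Lambda)
seconds_per_month = 30 * 24 * 3600

lambda_monthly = gb * gb_second_rate * seconds_per_month
print(lambda_monthly)                  # ~$172.8/mo running back-to-back
print(lambda_monthly / 62)             # ~2.8x a c6i.large at ~$62/mo
print(lambda_monthly / 33)             # ~5.2x a t2.medium at ~$33/mo
```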
> It really does not depend on the workload, when those workloads we're talking about are by-and-large bounded to 1vCPU or less (CI jobs, serverless functions, etc). Ice Lake cores are substantially faster than Ivy Bridge; the 8352V will be faster in practically any workload we're talking about.
If you were comparing e.g. the E5-2667v2 to the Xeon Gold 6334 you would be right, because they have the same number of cores and the 6334 has a higher rather than lower clock speed.
But the newer CPUs support more cores per socket. The E5-2643v2 has 6, the Xeon Platinum 8352V has 36.
To make that fit in the power budget, it has a lower base clock, which eats a huge chunk out of Ice Lake's IPC advantage. Then the newer CPU has around twice as much L3 cache, 54MB vs. 25MB, but that's for six times as many cores. You get 1.5MB/core instead of >4MB/core. It has just over three times the memory bandwidth (8xDDR4-2933 vs. 4xDDR3-1866), but again six times as many cores, so around half as much per core. It can easily be slower despite being newer, even when you're compute bound.
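The per-core bandwidth claim, worked out from peak theoretical channel bandwidth (my arithmetic, not benchmark numbers):

```python
# Peak theoretical memory bandwidth per core, old part vs new part.
def peak_gb_s(mt_per_s: int, channels: int) -> float:
    # each 64-bit channel moves 8 bytes per transfer; result in GB/s (decimal)
    return mt_per_s * 8 * channels / 1000

print(peak_gb_s(1866, 4) / 6)     # E5-2643 v2: 4x DDR3-1866, 6 cores      -> ~10.0 GB/s per core
print(peak_gb_s(2933, 8) / 36)    # Platinum 8352V: 8x DDR4-2933, 36 cores -> ~5.2 GB/s per core
```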
> We've got a project that can build locally, M2 Pro, docker pull and push included, in something like 40 seconds; the CI takes 4 minutes. It's the crusty CPUs; it's slow networking; it's the "step 1 is finished, wait 10 seconds for the orchestrator to realize it and start step 2".
Inefficient code and slow hardware are two different things. You can have the fastest machine in the world that finishes step 1 in 4ms and still be waiting 10 full seconds if the system is using a timer.
But they're operating in a competitive market. If you want a faster system, patronize a company that provides one. Just don't be surprised if it costs more.
Lambda is good for bursty, typically low-activity applications, where it just wouldn't make sense to have EC2 instances running 24x7. Think about some line-of-business app that gets a couple of requests every minute or so. Maybe once a quarter there will be a spike in usage. Lambda scales up and just handles it. If requests execute in 50ms (unlikely!) or 500ms, it just doesn't matter.
Not quite sure I follow. But I built an ASP.NET API and deployed it to Lambda and it cost $2/mo, and when it started to get more traffic and the cost got to $20/mo I moved it to a t4g instance.
When I moved it, I didn’t need to make any code changes :) I just made a systemd file and deployed it.
For this to be true IPC would have to have stagnated for 10 years which is not the case. Look at Agner's instruction tables for different uarchs and compare.
perf/watt/density would be the best way to measure. Frequency should NOT be normalized because x86 chips are designed to be high frequency parts. That is why measuring IPC is meaningless. A cutting edge ARM chip can possibly beat a modern x86 chip in terms of IPC, but the x86 chip will almost certainly scale to higher frequencies. Certain models of EPYC Genoa (Zen 4) for example, boost to 4GHz, and both Intel and AMD have chips that boost to 5.7+ GHz.
Density is important because if you can only have a max of, for example, 20 cores for an ARM solution, but 96 cores in the case of EPYC Genoa, Genoa is going to win out for any multi-core workload.
Without more context about what the code actually does, this doesn't tell me all that much, other than what I could guess from the intended use cases of the chips.
The strength of Apple silicon is that it can crush benchmarks and transfer that power very well to real-world concurrent workloads too. E.g., is this basically just measuring the L1 latency? If not, are the compilers generating the right instructions, etc.? (One would assume they are, but I have had issues with getting good ARM codegen previously, only to find that the compiler couldn't work out what ISA to target other than a conservative guess.)
It does seem the benchmark has its data in cache, based on the timings.
If the benchmark were only measuring L1 latency, what would that imply about the ‘scaling by inverse clock speed’ bit? My guess is as follows. Chips with higher clock rates will be penalised: (a) it is harder to decrease latencies (memory, pipeline length, etc) in absolute terms than run at a higher clock speed to maybe do non-memory things faster; and (b) if you’re waiting 5ns to read some data, that hurts you more after the scaling if your clock speed is higher. The fact that the M1 wins after the scaling despite the higher clock rate suggests to me that either they have a big advantage on memory latency or there’s some non-memory-latency advantage in scheduling or branch prediction that leads to more useful instructions being retired per cycle.
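Point (b) is easy to see with toy numbers (entirely invented): give every chip the same "compute" cycles plus a fixed 5 ns stall, and the scaling flips the ordering.

```python
# Toy illustration of point (b): a fixed-latency stall penalizes higher-clocked
# chips once you scale results back to 3 GHz. All numbers are invented.
compute_cycles = 300
stall_ns = 5.0

for ghz in (3.0, 3.5, 4.5):
    actual_ns = compute_cycles / ghz + stall_ns
    normalized_ns = actual_ns * ghz / 3.0     # "re-run" at 3 GHz
    print(f"{ghz:.1f} GHz: actual {actual_ns:.1f} ns, normalized {normalized_ns:.1f} ns")
    # 4.5 GHz finishes fastest in wall-clock time, yet looks worst after scaling.
```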
I ain't readin' all that (OK maybe I will but lemire does do this quite a lot, his blog is 40% gems 60% slightly sloppy borderline factoids that only make sense if you think in exactly the same way he does)
Is there something specific about this library that makes you suspicious, or are you assuming from the name that this is using the Ada programming language?
Part of what they don't mention is that Graviton 3 and the Snapdragon 8cx Gen 3 have pretty much the same processor core. The Neoverse V1 is only a slightly modified Cortex X1. Hence the same results when you account for clock frequency.
The 8cx 3rd Gen is based on Cortex X1, the same as Snapdragon 888.
The Graviton 3 is based on the Neoverse V1, which is itself a tweaked version of the Neoverse N1 with relaxed perf/watt/die-area constraints and much improved SIMD workloads. And the N1 is based on the Cortex X1.
The current Snapdragon is on Cortex X3. With Cortex X4 coming soon.
To me it would be somewhat more interesting to compare head-to-head mobile CPUs instead of comparing laptops and servers. In this particular microbenchmark, mobile 12th and 13th-generation Core performance cores, and even the efficiency cores on the 13th generation, are faster than the M2.
Yeah well, the efficiency cores on a 12900K will be faster than those on an N100, so what is your point?
We're discussing overall compute power differences between CPU architectures; minute differences in performance between identical-architecture CPU cores stemming from higher clock speeds are outside the scope of this discussion.
Not for me. I understand, and have had this reaction too, but it sort of undervalues related info. I learned a bit from him and from your response, for example.
Even when they are faster, the M2 is on the same die as the RAM and the bandwidth and latency are way better. That matters for compilation, or am I mistaken?
You are mistaken. Just like everyone else who has ever repeated the myth that Apple puts large-scale, high-performance logic and high density DRAM on the same die, which is impossible.
Apple uses LPDDR4 modules soldered to a PCB, sourced from the same Korean company that everyone else uses. Intel has used the exact same architecture since Cannon Lake, in 2018.
> Apple uses LPDDR4 modules soldered to a PCB, sourced from the same Korean company that everyone else uses.
Apple might have sourced and soldered on LPDDR modules from the same company, but it is not LPDDR4; it is LPDDR5-6400 connected via a 256-bit (Pro), 512-bit (Max) or 1024-bit (Ultra) wide memory bus.
> Intel has used the exact same architecture since Cannon Lake, in 2018.
Apart from the fact that LPDDR5-6400 did not exist in 2018, no Intel CPU has ever used a 512-bit, let alone 1024-bit, wide memory bus, even in server setups. A wide memory bus is conducive to faster concurrent complex builds, and Rust / Haskell builds show significantly faster build speeds as well.
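For reference, those bus widths translate directly into the advertised bandwidth numbers (simple arithmetic on data rate x bus width; the base M2's 128-bit bus added for comparison):

```python
# Theoretical LPDDR5-6400 bandwidth for the bus widths mentioned above.
def lpddr_gb_s(mt_per_s: int, bus_bits: int) -> float:
    return mt_per_s * bus_bits / 8 / 1000    # bytes per second, in GB/s (decimal)

for name, bits in [("M2", 128), ("Pro", 256), ("Max", 512), ("Ultra", 1024)]:
    print(name, lpddr_gb_s(6400, bits))      # ~102 / 205 / 410 / 819 GB/s
```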
Cannon Lake is a weird red herring to bring in to the discussion, because it was only a real product in the narrowest sense possible. It may have technically been the first CPU Intel shipped with LPDDR4 support (or was it one of their Atom-based chips?), but the exact generation of LPDDR isn't really relevant because both Apple and Intel have supported multiple generations of LPDDR over the years and both have moved past 4 and 4x to 5 now.
What is somewhat relevant as the source of confusion here is that Apple puts the DRAM on the same package as the processor rather than on the motherboard nearby like is almost always done for x86 systems that use LPDDR. (But there's at least one upcoming Intel system that's been announced as putting the processor and LPDDR on a shared module that is itself then soldered to the motherboard.) That packaging detail probably doesn't matter much for the entry-level Apple chips that use the same memory bus width as x86 processors, but may be more important for the high-end parts with GPU-like wide memory busses.
Look, I had to go one way or the other: either I said it originated with Ice Lake mobile, and a pedant would come along and say Cannon Lake even though almost nobody physically possesses a Cannon Lake laptop, or I could say Cannon Lake, be pedantically correct, and get your response. Tiger Lake was probably even more widespread and that also predates Apple Silicon.
None of those possibilities get you anywhere closer to being right. LPDDR didn't start with LPDDR4 so it doesn't matter which Intel processor was first to support LPDDR4. There's nothing special about that generation for Intel or Apple.
Last I checked, Apple M1 Max chips have up to 800 GB/s throughput, whilst AMD's high-end chips taper out at around ~250 GB/s or so, closer to what a standard M2 chip does (not the Max or Pro version). At the top end they've got at least 2x the memory bandwidth of other CPU vendors, and that's likely the case further down too.
It's worth noting that Apple's stated bandwidth numbers aren't really attainable in CPU-only workloads, so for the sake of comparison to discrete CPUs they're not that useful. The M1 Max is advertised as having 400GB/sec bandwidth but the CPU cores can only use about half of that - still impressive, but not as impressive as you might think from the marketing.
(The M1 Ultra is the one with 800GB/sec bandwidth on paper)
AnandTech has measured bandwidths of about 104 GB/s per CPU core for the M1 Max and circa 210 GB/s per CPU cluster. The M1 Max has 3x CPU clusters: 1x power-efficient (2x cores) and 2x performance ones (4x cores each).
Where it gets interesting is how memory access is multiplexed in the most extreme case, where 3x CPU clusters, 32x GPU and 16x ANE cores are attempting to fetch memory blocks at different locations at once. It is not unreasonable to presuppose that Apple Silicon contraptions use a switched memory architecture; however, with such a high degree of parallelism it is very intriguing to know how the memory architecture has actually been designed and optimised. High-performance memory access has always been a big deal that usually comes with big money attached to it via a, naturally, non-memory-bus-associated connection.
The Mx Max should be compared to a discrete CPU+GPU combination that does have comparable total memory bandwidth. It isn't automatically better to put everything on one chip.
It’s not on die, but isn’t it on package? Thus making signal lengths much shorter and allowing more lines making the memory bandwidth much higher than a (theoretical) equivalent chip with the memory soldered to the motherboard like the Intel machines were?
It looks like there's been some good progress on getting Linux running natively on the Windows Dev Kit 2023 hardware[0]. There was a previous discussion here about this hardware back in 2022-11[1].
> Note that you cannot blindly correct for frequency in this manner because it is not physically possible to just change the frequency as I did
They're also not the same core architecture? Comparing ARM chips that conform to the same spec won't necessarily scale the same across frequencies. Even if all of these CPUs did have scaling clock speeds, their core logic is not the same. Hell, even the Firestorm and Icestorm cores on the M1 SOC shouldn't be considered directly comparable if you scale the clock speeds.
That's the point. He knows they're different architectures (although X1 and V1 are related) so normalizing frequency exposes the architectural differences.
We've been having good success with Graviton 3 for our small data-processing workloads.
It seems that with Apple now focusing on AArch64 the architecture is being taken seriously by developers (because, ya know, phones, Raspberry Pis etc. weren't enough :)), and there is good support for most software I use day-to-day.
Now it seems Amazon is pushing HARD on Graviton, as the release of Graviton 3 has seen many services now featured running on it. They made great progress with Graviton 3, so it's not hard to see why.
It's also due to cost. At scale it is cheaper to produce your own processors than to buy them from a vendor.
Not to mention performance and energy consumption benefits.
Phones had it to begin with due to the low power consumption and simple instruction set. Then Moore's law did the rest, and now they're powerful enough for data centers.
Not that it's for sure, but M3 is probably coming out late this year/early next year and will be on 3nm, so once again having a huge node advantage. Just seems like Apple will have the latest node before everyone else for the foreseeable future.
Apple pays a premium to TSMC to reserve the early runs on the next gen nodes. They can do this because they can charge their users a premium for Apple devices. I am not sure the rest of the players have that much pricing power or margins.
my oracle free instance uses mariadb instead of mysql, but i'm guessing you meant that as the free instance provided by oracle instead of an instance not using anything from oracle. =)