Intel has supported such capability via Intel Processor Trace (PT) since at least 2014 [1]. Here is a full trace recorder built by Jane Street feeding into standard program trace visualizers [2].
ARM has supported such capability via the standard CoreSight Program Trace Macrocell (PTM)[3]/Embedded Trace Macrocell (ETM)[4] since at least 2000.
If you pair it with standard data trace, which is less commonly available, then you have the prerequisites for a hardware trace time travel debugger, as originally seen in the early 2000s [5].
You can get similar function tracing entirely in software via software-instrumented instruction trace, and similar debugging information (though with less granular performance information) via record-replay time travel debugger recordings.

[1] https://www.intel.com/content/www/us/en/support/articles/000...
[2] https://blog.janestreet.com/magic-trace/
[3] https://developer.arm.com/documentation/ihi0035/b/Program-Fl...
[4] https://developer.arm.com/documentation/ddi0158/d
[5] https://jakob.engbloms.se/archives/1564
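Concretely, on the software-instrumented end, here is a minimal sketch (a toy of my own, not taken from any of the tools above) using GCC/Clang's -finstrument-functions hooks, which fire on every function entry and exit:

    /* Build with: cc -O0 -finstrument-functions trace.c -o trace */
    #include <stdio.h>

    /* The hooks themselves must not be instrumented, or they would recurse. */
    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *fn, void *call_site) {
        fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *fn, void *call_site) {
        (void)call_site;
        fprintf(stderr, "exit  %p\n", fn);
    }

    static int square(int x) { return x * x; }  /* stand-in workload */

    int main(void) {
        return square(7) == 49 ? 0 : 1;
    }

A real software tracer would append timestamps and addresses to a ring buffer and symbolize them offline; that is roughly the shape of the approach before hardware assistance enters the picture.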
The most useful performance tool on macOS is spindump, which is just a straightforward whole-system CPU sampler. Second most useful is MallocStackLogging.
Other OSes have those too, but they're harder to use and the interfaces aren't as good.
> That's overkill for most things since it relies on being able to patch running kernel code, which is also the definition of an RCE exploit.
The idea of dtrace is that you can do that, yes, but to do that you need to be authorized, and the things you can do are (supposed to be) limited to observing what's happening in the system. You can read arguments, but you can't, for example, change the arguments of a system call or disable authorization checks on system calls.
The reason dtrace doesn't work by default is that kernel memory is locked down so that nothing can write to it. There are no exceptions to this, even if you're authorized.
(Or rather, the exception requires restarting the machine to turn the security off.)
It just doesn't work at all since the M2. I've got a bug report open and confirmed, and Apple officially does not care anymore. If you try to run it, it will crash your whole system.
Windows Performance Analyzer is pretty amazing. Not the simplest tool to use, but not particularly hard. Bruce Dawson's blog (https://randomascii.wordpress.com) has lots of articles using it.
I was going to ask, isn't this basically callgrind? I used that a lot in grad school to optimise our group's code (on an iMac no less). Incredibly useful and there are some nice visualisation / inspection tools.
Something something grug printf is all you need. People who LARP as Unix neckbeards rejoice in having nothing. Not content with that, they actively seek out worse solutions.
Real Scottish craftspeople enjoy having amazing tooling. They even know how to use a debugger!
You're either deep in the shit, amazed anything works and respecting the world about you, so happy any tool at all is trying to make things better…
Or you have the free time to be a loudmouth online and mouth off about how tech's x y and/or z are dumb and how a shitty bash script is more than enough.
Thankfully I think Linux has some really great tooling now. But very few people have calls of duty where they're called to do super serious work. Mostly the job calls for pretty simple shit. Make the dumb app go. Throw more servers at the problem someday. The tension is real. It's unfortunate that the cutting edge is so spread out, so far in advance of the main body of devs.
>> ARM has supported such capability via the standard CoreSight Program Trace Macrocell (PTM)[3]/Embedded Trace Macrocell (ETM)[4] since at least 2000.
Where are the performance tools that wrap those capabilities? IPT has magic-trace; what is the equivalent tool for ARM?
Green Hills Software's Path Analyzer[1] and the newer History[2], which likely invented, or at least popularized, the modern trace/call-stack visualization used by Perfetto, the Firefox Profiler, etc.
> The catch, as usual with new Apple features, is the hardware requirements. This only works on M4 chips and iPhone 16 devices, which means you’re out of luck if you’re still developing on older hardware. It’s frustrating but not surprising. Apple has a habit of using new developer tools to push hardware upgrades.
This seems unfair. Isn't there a pretty good likelihood that the necessary performance counters in the CPU (or whatever) simply don't exist in the production versions of the previous CPUs?
Longer term I sort of dream of doing computing from the inside out, using all this tracing data we've started gathering not just for observability but as a log and engine of compute: the record of what computing has been done as an event source, for an event-sourcing computing architecture.
The security industry is trending this way. Observability has been the name of the game for a couple years now, and a lot of really cool grassroots startups have taken off in the runtime observability space. Think XDR+SIEM+SOAR but unified and way less bloated.
The present opportunity, in my view, is to feed this tracing into the development of superior compilers. This is starting to happen with automated profiling by the compiler, but you can imagine the profiling expanding to an enormous degree, with the compiler tracing the program it is building in great detail.
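To make that concrete, profile-guided optimization already feeds a crude version of this back in. Here is a hand-written stand-in for the kind of fact the compiler learns (a hypothetical function using GCC/Clang's __builtin_expect; PGO via -fprofile-generate/-fprofile-use derives the same bias from recorded runs automatically):

    #include <stddef.h>

    /* Hypothetical hot loop: suppose the traces show the negative-value
     * branch is almost never taken. PGO would discover this from the
     * profile; here the bias is annotated by hand. */
    long sum_nonnegative(const long *v, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (__builtin_expect(v[i] < 0, 0))  /* rare per profile */
                continue;
            sum += v[i];
        }
        return sum;
    }

With branch frequencies the compiler can lay out the common path linearly and keep cold code out of the instruction cache; a detailed hardware trace could teach it far more than that.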
This is true, partially because of the current landscape. With enough pressure (if the optimization is good enough), things might shift to accommodate.
These screenshots look a lot like kcachegrind with a slightly reimagined UI. Is there actually anything new here, or is this another case of Apple finally catching up to the open source world?
As 'GeekyBear' implies in a sibling comment, valgrind works with an emulation of an ideal processor rather than directly on the actual CPU. Sometimes this gives you a good idea of how the program will actually run, and sometimes it doesn't. As processors became more complex, it got farther and farther from the truth. Personally, I started in the Valgrind era and stopped using it as soon as better tools using native instrumentation became available. If Apple's approach works as well as described, it is much better than anything from that era.
I've never found cachegrind inaccurate, but maybe I'm not doing hardcore enough performance work. You can also use perf and get your numbers straight from the hardware if that's what you need. Truth be told, I mainly use cachegrind because I prefer kcachegrind's UI to hotspot.
(I even prefer cachegrind's approach since the numbers will be less distorted by other random background activity on the machine, but that could just be idealism on my part, who knows.)
If perf or the vendor-specific tools like VTune/uProf aren't sufficient for you, then I'm curious: what do you use?
I switched from emulator tools like valgrind to tools with hardware support like perf, pmu-tools, and VTune. I generally found them sufficient, but sometimes buggy and difficult to use.
Cachegrind is occasionally inaccurate due to an inaccurate model, but the greater problem was that cache hit percentages only tell a fraction of the story. To be able to predict performance I often needed to be able to accurately measure things like the number of memory requests in flight.
In general I have much greater faith in the on chip performance registers. That said, other than glancing at news stories like this I haven't been keeping up with recent advances. I guess it's possible that cachegrind and friends have improved since I was using them.
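For what it's worth, the on-chip registers are easy to read directly on Linux. A minimal sketch counting last-level cache misses around a workload via perf_event_open(2) (toy workload, most error handling omitted):

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* glibc has no wrapper for this syscall, so call it directly. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;        /* start stopped; enable around the workload */
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */,
                                 -1, 0);
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile uint64_t sum = 0;              /* stand-in workload */
        for (int i = 0; i < 1000000; i++) sum += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t misses;
        read(fd, &misses, sizeof(misses));
        printf("cache misses: %llu\n", (unsigned long long)misses);
        close(fd);
        return 0;
    }

Counting like this avoids sampling skid entirely, though for something like memory requests in flight you would need vendor-specific raw events rather than the generic ones.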
I've always reached for llvm-mca when I need stuff like that, but again it's all predicted/ideal numbers, not live off the hardware. And you need to start off with another tool first to pinpoint where to look.
I've never come across pmu-tools, thanks for the tip. I'll try it out next time I'm in the trenches.
Apple seems to have modified its core design so that it will stream data to a log file while the code is running.
> Recent Apple silicon devices can capture a processor trace where the CPU stores information about the code it runs, including the branches it takes and the instructions it jumps to. The CPU streams this information to an area on the file system so that you can analyze it with the Processor Trace instrument.
If you need data straight from the hardware you can use e.g. perf+hotspot, although I've heard that perf's tracing (not sampling!) supports fewer CPUs (but still more than just 1)
Is there anything like this for more commodity ARM cores (Neoverse V2), or do we think the insights from Apple silicon cores will generalize well to those other ARM architectures?
Finally, a processor manufacturer defects from the obfuscatory equilibrium. Granted, Apple’s processor people are not saints—I’ve yet to see even a full table of throughputs, latencies, and port loads from them, let alone an accurate CPU model—but I welcome anything that might maybe, hopefully, pretty please start a race of giving more accurate data to people doing low-level optimization.
Is forced obsolescence the right term for a somewhat obscure debug tool built for developers of macOS/iOS software? I don't imagine there are many people who would feel forced to upgrade their machines more quickly just to get access to this.