Intel has supported such capability via Intel Processor Trace (PT) since at least 2014 [1]. Here is a full trace recorder built by Jane Street feeding into standard program trace visualizers [2].
ARM has supported such capability via the standard CoreSight Program Trace Macrocell (PTM)[3]/Embedded Trace Macrocell (ETM)[4] since at least 2000.
If you pair it with standard data trace, which is less commonly available, then you have the prerequisites for a hardware trace time travel debugger, as originally seen in the early 2000s [5].
You can get similar function tracing entirely in software via software-instrumented instruction trace, and similar debugging information (though with less granular performance information) via record-replay time travel debugger recordings.

[1] https://www.intel.com/content/www/us/en/support/articles/000...
[2] https://blog.janestreet.com/magic-trace/
[3] https://developer.arm.com/documentation/ihi0035/b/Program-Fl...
[4] https://developer.arm.com/documentation/ddi0158/d
[5] https://jakob.engbloms.se/archives/1564
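Concretely, on the software-instrumented end, here is a minimal sketch (a toy of my own, not taken from any of the tools above) using GCC/Clang's -finstrument-functions hooks, which fire on every function entry and exit:

    /* Build with: cc -O0 -finstrument-functions trace.c -o trace */
    #include <stdio.h>

    /* The hooks themselves must not be instrumented, or they would recurse. */
    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *fn, void *call_site) {
        fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *fn, void *call_site) {
        (void)call_site;
        fprintf(stderr, "exit  %p\n", fn);
    }

    static int square(int x) { return x * x; }  /* stand-in workload */

    int main(void) {
        return square(7) == 49 ? 0 : 1;
    }

A real software tracer would append timestamps and addresses to a ring buffer and symbolize them offline; that is roughly the shape of the approach before hardware assistance enters the picture.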
The most useful performance tool on macOS is spindump, which is just a straightforward whole-system CPU sampler. Second most useful is MallocStackLogging.
Other OSes have those too, but they're harder to use and the interfaces aren't as good.
> That's overkill for most things since it relies on being able to patch running kernel code, which is also the definition of an RCE exploit.
The idea of dtrace is that you can do that, yes, but to do that you need to be authorized, and the things you can do are (supposed to be) limited to observing what's happening in the system. You can read arguments, but you can't, for example, change the arguments of a system call or disable authorization checks on system calls.
The reason dtrace doesn't work by default is that kernel memory is locked down so that nothing can write to it. There are no exceptions to this, even if you're authorized.
(Or rather, the exception requires restarting the machine to turn the security off.)
It just doesn't work at all since the M2. I've got a bug report open and confirmed, and Apple officially does not care anymore. If you try to run it, it will crash your whole system.
Windows Performance Analyzer is pretty amazing. Not the simplest tool to use, but not particularly hard. Bruce Dawson's blog (https://randomascii.wordpress.com) has lots of articles using it.
I was going to ask, isn't this basically callgrind? I used that a lot in grad school to optimise our group's code (on an iMac no less). Incredibly useful and there are some nice visualisation / inspection tools.
Something something grug printf is all you need. People who LARP as Unix neckbeards rejoice in having nothing. Not content with that, they actively seek out worse solutions.
Real Scottish craftspeople enjoy having amazing tooling. They even know how to use a debugger!
You're either deep in the shit, amazed anything works and respecting the world about you, so happy any tool at all is trying to make things better…
Or you have the free time to be a loudmouth online and mouth off about how tech's x y and/or z are dumb and how a shitty bash script is more than enough.
Thankfully I think Linux has some really great tooling now. But very few people have calls of duty where they're called to do super serious work. Mostly the job calls for pretty simple shit. Make the dumb app go. Throw more servers at the problem someday. The tension is real. It's unfortunate that the cutting edge is so spread out, so far in advance of the main body of devs.
>> ARM has supported such capability via the standard CoreSight Program Trace Macrocell (PTM)[3]/Embedded Trace Macrocell (ETM)[4] since at least 2000.
Where are the performance tools that wrap those capabilities? IPT has magic-trace; what is the equivalent tool for ARM?
Green Hills Software's Path Analyzer[1] and the newer History[2], which likely invented, or at least popularized, the modern trace/call-stack visualization used by Perfetto, the Firefox Profiler, etc.
> The catch, as usual with new Apple features, is the hardware requirements. This only works on M4 chips and iPhone 16 devices, which means you’re out of luck if you’re still developing on older hardware. It’s frustrating but not surprising. Apple has a habit of using new developer tools to push hardware upgrades.
This seems unfair. Isn't there a pretty good likelihood that the necessary performance counters in the CPU (or whatever) simply don't exist in the production versions of the previous CPUs?
Longer term I sort of dream of doing computing from the inside out, using all this tracing data we've started gathering not just for observability but as a log and engine of compute: the record of what computing has been done as an event source, for an event-sourcing computing architecture.
The security industry is trending this way. Observability has been the name of the game for a couple years now, and a lot of really cool grassroots startups have taken off in the runtime observability space. Think XDR+SIEM+SOAR but unified and way less bloated.
The present opportunity, in my view, is to feed this tracing into the development of superior compilers. This is starting to happen with automated profiling by the compiler, but you can imagine the profiling expanding to an enormous degree, with the compiler tracing the program it is building in great detail.
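To make that concrete, profile-guided optimization already feeds a crude version of this back in. Here is a hand-written stand-in for the kind of fact the compiler learns (a hypothetical function using GCC/Clang's __builtin_expect; PGO via -fprofile-generate/-fprofile-use derives the same bias from recorded runs automatically):

    #include <stddef.h>

    /* Hypothetical hot loop: suppose the traces show the negative-value
     * branch is almost never taken. PGO would discover this from the
     * profile; here the bias is annotated by hand. */
    long sum_nonnegative(const long *v, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (__builtin_expect(v[i] < 0, 0))  /* rare per profile */
                continue;
            sum += v[i];
        }
        return sum;
    }

With branch frequencies the compiler can lay out the common path linearly and keep cold code out of the instruction cache; a detailed hardware trace could teach it far more than that.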
This is true, partially because of the current landscape. With enough pressure (if the optimization is good enough), things might shift to accommodate.
These screenshots look a lot like kcachegrind with a slightly reimagined UI. Is there actually anything new here, or is this another case of Apple finally catching up to the open source world?
As 'GeekyBear' implies in a sibling comment, valgrind works with an emulation of an ideal processor rather than directly on the actual CPU. Sometimes this gives you a good idea of how the program will actually run, and sometimes it doesn't. As processors became more complex, it got farther and farther from the truth. Personally, I started in the Valgrind era and stopped using it as soon as better tools using native instrumentation became available. If Apple's approach works as well as described, it is much better than anything from that era.
I've never found cachegrind inaccurate, but maybe I'm not doing hardcore enough performance work. You can also use perf and get your numbers straight from the hardware if that's what you need. Truth be told, I mainly use cachegrind because I prefer kcachegrind's UI to hotspot.
(I even prefer cachegrind's approach since the numbers will be less distorted by other random background activity on the machine, but that could just be idealism on my part, who knows.)
If perf or the vendor-specific tools like VTune/uProf aren't sufficient for you, then I'm curious: what do you use?
I switched from emulator tools like valgrind to tools with hardware support like perf, pmu-tools, and VTune. I generally found them sufficient, but sometimes buggy and difficult to use.
Cachegrind is occasionally inaccurate due to an inaccurate model, but the greater problem was that cache hit percentages only tell a fraction of the story. To be able to predict performance I often needed to be able to accurately measure things like the number of memory requests in flight.
In general I have much greater faith in the on chip performance registers. That said, other than glancing at news stories like this I haven't been keeping up with recent advances. I guess it's possible that cachegrind and friends have improved since I was using them.
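For what it's worth, the on-chip registers are easy to read directly on Linux. A minimal sketch counting last-level cache misses around a workload via perf_event_open(2) (toy workload, most error handling omitted):

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* glibc has no wrapper for this syscall, so call it directly. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;        /* start stopped; enable around the workload */
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */,
                                 -1, 0);
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile uint64_t sum = 0;              /* stand-in workload */
        for (int i = 0; i < 1000000; i++) sum += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t misses;
        read(fd, &misses, sizeof(misses));
        printf("cache misses: %llu\n", (unsigned long long)misses);
        close(fd);
        return 0;
    }

Counting like this avoids sampling skid entirely, though for something like memory requests in flight you would need vendor-specific raw events rather than the generic ones.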
I've always reached for llvm-mca when I need stuff like that, but again it's all predicted/ideal numbers, not live off the hardware. And you need to start off with another tool first to pinpoint where to look.
I've never come across pmu-tools, thanks for the tip. I'll try it out next time I'm in the trenches.
Apple seems to have modified its core design so that it will stream data to a log file while the code is running.
> Recent Apple silicon devices can capture a processor trace where the CPU stores information about the code it runs, including the branches it takes and the instructions it jumps to. The CPU streams this information to an area on the file system so that you can analyze it with the Processor Trace instrument.
If you need data straight from the hardware you can use e.g. perf+hotspot, although I've heard that perf's tracing (not sampling!) supports fewer CPUs (but still more than just 1)
Is there anything like this for more commodity ARM cores (Neoverse V2), or do we think the insights from Apple silicon cores will generalize well to those other ARM architectures?
Finally, a processor manufacturer defects from the obfuscatory equilibrium. Granted, Apple’s processor people are not saints—I’ve yet to see even a full table of throughputs, latencies, and port loads from them, let alone an accurate CPU model—but I welcome anything that might maybe, hopefully, pretty please start a race of giving more accurate data to people doing low-level optimization.
Is forced obsolescence the right term for a somewhat obscure debug tool built for developers of macOS/iOS software? I don't imagine there are many people who would feel forced to upgrade their machines more quickly just to get access to this.