Intel 64 and IA-32 Architectures Performance Monitoring Events [pdf]

strstr · on Dec 12, 2017

Perf counters are super useful. On Linux, the perf tool (and the perf event API) makes these usable: https://perf.wiki.kernel.org/index.php/Main_Page

The counters vary per Intel CPU, though the most useful ones are universal (e.g. cycle counts). AMD has similar counters.
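A minimal sketch of the "universal" counters in practice: the snippet below builds a `perf stat` command line that counts cycles and instructions for a child command, and derives instructions per cycle from two raw counts. The helper names (`build_perf_stat_cmd`, `ipc`) are made up for illustration, and it assumes perf's generic `cycles` and `instructions` event aliases are available on the machine.

```python
def build_perf_stat_cmd(command, events=("cycles", "instructions")):
    """Return an argv list that runs `command` under `perf stat`."""
    # -e takes a comma-separated event list; "--" separates perf's own
    # options from the command being measured.
    return ["perf", "stat", "-e", ",".join(events), "--", *command]


def ipc(instructions, cycles):
    """Instructions per cycle, computed from two raw counter values."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return instructions / cycles
```

For example, `build_perf_stat_cmd(["sleep", "1"])` gives the argv for `perf stat -e cycles,instructions -- sleep 1`, which could be passed to `subprocess.run` if perf is installed and the kernel permits counter access.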

soulbadguy · on Dec 13, 2017

ocperf is a wrapper around perf provided by someone at Intel. On the first run, it downloads a list of counters specific to the detected CPU. Pretty cool: https://github.com/andikleen/pmu-tools

lallysingh · on Dec 13, 2017

Excellent timing. I'd just updated my performance tool (ppt). https://github.com/lally/libmet

Includes really easy to use performance counter support.

CalChris · on Dec 13, 2017

If you're a low-level hacker, a reading knowledge of these events is useful, but really we should be using VTune as our PME tool. Still, it's possible that a particular event may shed light on a particular piece of code, and for that you can use an API like PCM.

https://github.com/opcm/pcm

kev009 · on Dec 13, 2017

VTune is first class, but most people will be using perf on Linux or pmcstat on FreeBSD, so you do need to cross-reference a doc like this occasionally when you want to probe a new counter to look for bottlenecks.

PCM is also quite nice for monitoring what an entire system is doing in terms of memory bandwidth, NUMA link traffic, and other package-level concerns, but it doesn't give any kernel- or application-level tracing like the other tools.

grandmczeb · on Dec 13, 2017

Open question to other commenters: are there hardware performance counters/features that you would like to see implemented but currently aren’t?

_chris_ · on Dec 13, 2017

As a RISC-V core implementer, I'm super interested in answers to this question. Some of the things I've pondered are ways to figure out 1) which branch am I constantly mispredicting and 2) which load is constantly cache-missing. I'm not sure of the best way to expose that to the programmer, particularly in a way that's cheap for most cores.

strstr · on Dec 13, 2017

1) Modern LBR might solve this. LWN has a summary (though I've only skimmed this): https://lwn.net/Articles/680996/

2) Not sure for this, though I can think of some crappy hacks:

--A) Timed LBR mentioned in that LWN article (somewhat indirect, but might get the job done)

--B) Use perf counter overflow interrupts (for cache misses) and set the counter's initial value high, so that it overflows after a small number of misses (which should let you sample the cache miss locations). This can only tell you if a particular load is making up a large fraction of your overall cache misses (which is probably not super useful).

Edit: Forgot about PEBS, which is really what you want for 2).
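The overflow-interrupt idea in B) can be sketched as a small simulation. This is only an illustration, not real PMU code: the 48-bit counter width, the function names, and the instruction addresses are all assumptions, and it ignores interrupt skid (on real hardware the reported IP can land a few instructions after the load that missed, which is part of why PEBS is the better answer).

```python
from collections import Counter

COUNTER_BITS = 48  # assumed counter width; real hardware reports its own


def preload_value(period, bits=COUNTER_BITS):
    """Initial counter value so the counter overflows after `period` events."""
    if not 0 < period < (1 << bits):
        raise ValueError("period must fit in the counter")
    return (1 << bits) - period


def sample_miss_ips(miss_ips, period, bits=COUNTER_BITS):
    """Simulate overflow sampling: record the IP of every `period`-th miss."""
    wrap = 1 << bits
    counter = preload_value(period, bits)
    samples = Counter()
    for ip in miss_ips:
        counter += 1
        if counter == wrap:  # counter overflowed: the "interrupt" fires
            samples[ip] += 1  # attribute the sample to this load's IP
            counter = preload_value(period, bits)  # re-arm for the next period
    return samples
```

Feeding it a made-up trace such as `[0x400a10, 0x400a10, 0x400a10, 0x400b20] * 300` with `period=3` attributes three quarters of the samples to `0x400a10`, which matches the caveat above: you learn each load's share of total misses, not whether a given load misses every time. A period that divides the pattern length (4 here) would sample the same position every time, so sampling periods are usually chosen to avoid lining up with loop structure.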

lallysingh · on Dec 13, 2017

Unless there's information available now, I'd love to know more about CPU port utilization. Can I determine how to reorder my instructions for better scheduling?

wyldfire · on Dec 13, 2017

Are the uncore features well represented with perf counters? I've been out of the loop for a while but that was one area that was challenging to investigate back in the day.

lallysingh · on Dec 13, 2017

Something like the LBR for cache misses. I'd love to know which IP/address values caused an L3 miss.