ocperf is a wrapper around perf provided by someone at Intel. On first run, it downloads a list of counters specific to the detected CPU, pretty cool:
https://github.com/andikleen/pmu-tools
If you're a low-level hacker, a reading knowledge of these events is useful, but really we should be using VTune as our PME tool. Still, it's possible that a particular event may shed light on a particular piece of code, and you can read it using an API like PCM.
VTune is first class, but most people will be using perf on Linux or pmcstat on FreeBSD, so you do need to cross-reference a doc like this occasionally when you want to probe a new counter to look for bottlenecks.
pcm is also quite nice for monitoring what an entire system is doing in terms of memory bandwidth, NUMA link traffic, and other package-level concerns, but it doesn't give any kernel- or application-level tracing like the other tools.
As a RISC-V core implementer, I'm super interested in answers to this question. Some of the things I've pondered are ways to figure out 1) what branch am I constantly mispredicting and 2) what load is constantly cache-missing. Not sure of the best way to expose that to the programmer, particularly in a way that's cheap for most cores.
2) Not sure for this, though I can think of some crappy hacks:
--A) Timed LBR mentioned in that LWN article (somewhat indirect, but might get the job done)
--B) use perf counter overflow interrupts (for cache misses) and set the counter's initial value high, close to overflow, so the interrupt fires after every N misses (which should let you sample the cache miss locations). This can only tell you if a particular load is making up a large fraction of your overall cache misses (which is probably not super useful).
Edit: Forgot about PEBS, which is really what you want for 2).
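To make hack B) concrete, here is a minimal Python sketch of the arithmetic, not real PMU code. It assumes a 48-bit counter (widths vary by CPU), and the names preload_value and simulate_sampling as well as the miss stream are made up for illustration. The counter is preloaded to 2^width minus the sample period, so it overflows after that many misses, and the PC seen at each overflow is tallied.

```python
from collections import Counter

COUNTER_WIDTH_BITS = 48  # assumption: actual counter width depends on the CPU


def preload_value(sample_period, width_bits=COUNTER_WIDTH_BITS):
    """Initial counter value that makes the counter overflow after sample_period events."""
    if not 0 < sample_period < (1 << width_bits):
        raise ValueError("sample_period must fit in the counter")
    return (1 << width_bits) - sample_period


def simulate_sampling(miss_pcs, sample_period, width_bits=COUNTER_WIDTH_BITS):
    """Replay a stream of cache-miss PCs and record the PC at each simulated overflow."""
    wrap = 1 << width_bits
    counter = preload_value(sample_period, width_bits)
    samples = Counter()
    for pc in miss_pcs:
        counter += 1
        if counter == wrap:  # overflow: this is where the interrupt would fire
            samples[pc] += 1
            counter = preload_value(sample_period, width_bits)  # re-arm the counter
    return samples


# Hypothetical miss stream: the load at 0x400a10 misses three times as often
# as the load at 0x400b20.
stream = [0x400A10, 0x400A10, 0x400A10, 0x400B20] * 1000
hist = simulate_sampling(stream, sample_period=7)
total = sum(hist.values())
for pc, n in hist.most_common():
    print(f"{pc:#x}: {n / total:.0%} of sampled misses")
```

As noted above, the histogram only shows each load's share of the sampled misses, not its miss rate. On real hardware the interrupt also lands a few instructions late (skid), which is part of why PEBS is the better fit.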
Unless there's information available now, I'd love to know more about CPU port utilization. Can I determine how to reorder my instructions for better scheduling?
Are the uncore features well represented with perf counters? I've been out of the loop for a while but that was one area that was challenging to investigate back in the day.
The counters vary per Intel CPU, though the most useful ones are universal (e.g. cycle counts). AMD has similar counters.