Technically, if you use jemalloc, which almost everyone should anyway, it comes with built-in heap profiling instrumentation. You have to enable it at compile time though (configure with --enable-prof), and not many people are aware of that.
I think tcmalloc can now output its profiles as protos too, which Google’s pprof tool understands. If you’re using the standard glibc malloc, you’re probably leaving a lot of performance on the table.