One place I had issues was rapidly allocating space I needed temporarily but then discarding it.
The space I needed was too large to be added to the heap, so I used mmap. Because of the nature of the processing (mmap, process, yeet mmap) I put the system under a lot of pressure. Maintaining the set of mapped blocks and reusing them around fixed the issue.
Freeing memory to the OS or doing things like munmap involves a TLB shootdown which by definition is a performance bottleneck. Probably what you ended up experiencing.
The space I needed was too large to be added to the heap, so I used mmap. Because of the nature of the processing (mmap, process, yeet mmap) I put the system under a lot of pressure. Maintaining the set of mapped blocks and reusing them around fixed the issue.