Profiling the JVM in production with something like YourKit or JProfiler is the typical case for me. Ironically, I've found profiling Python in production easier, for the reasons you've mentioned. If something is already down or running slowly, adding another 3%+ of latency is hardly going to be an issue. Architecturally, though, with big monolithic programs that do too many things, attaching a profiler to analyze 1% of the program's responsibilities or surface area unfortunately becomes a risk to the other production operations. In most cases slowdowns happen because of resource saturation, things timing out, or blocking on shared resources. In the saturation case, trying to run a profiler can exacerbate the problem, or the profiler can fail to start at all, so the only way to do forensics there is to emit observability data before the failure point.
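To make "emitting observability data prior to the failure point" a bit more concrete, here is a minimal sketch of a flight-recorder style ring buffer in Python. Everything in it is an assumption for illustration: the `FlightRecorder` class, the `queue_depth` and `pool_in_use` metric names, and the thresholds in `is_saturated` are all hypothetical, not taken from any particular tool. The idea is only that samples are recorded cheaply all the time, so the history already exists when saturation hits and nothing has to be attached afterwards.

```python
import json
import threading
import time
from collections import deque


class FlightRecorder:
    """Keeps the most recent `capacity` samples in memory (hypothetical helper)."""

    def __init__(self, capacity=600, clock=time.time):
        # A bounded deque drops the oldest sample automatically once full.
        self._samples = deque(maxlen=capacity)
        self._lock = threading.Lock()
        self._clock = clock

    def record(self, **metrics):
        # Cheap append, meant to be called per request or on a timer.
        sample = {"ts": self._clock(), **metrics}
        with self._lock:
            self._samples.append(sample)

    def snapshot(self):
        # Copy under the lock so the caller gets a stable list, oldest first.
        with self._lock:
            return list(self._samples)

    def dump(self, path):
        # One JSON object per line, so a truncated file is still mostly readable.
        with open(path, "w", encoding="utf-8") as fh:
            for sample in self.snapshot():
                fh.write(json.dumps(sample) + "\n")


def is_saturated(sample, max_queue_depth=100, max_pool_in_use=0.9):
    # Assumed thresholds; tune them to whichever resource actually saturates.
    return (sample["queue_depth"] >= max_queue_depth
            or sample["pool_in_use"] >= max_pool_in_use)
```

In use, the service would call `recorder.record(queue_depth=..., pool_in_use=...)` as it runs and call `recorder.dump(path)` as soon as `is_saturated` returns true for the latest sample, which leaves a trail covering the minutes before the failure rather than relying on a profiler that may not be able to start.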
Another approach that has been taken is the Erlang-style "let it crash" methodology, which is fine for newer projects but amounts to a rewrite for most existing systems in practice, and is thus far, far beyond a discussion of profiling.