> Traces need to be parsed in their entirety to do anything useful with them, making them impossible to stream.
We started implementing distributed tracing using the precursor to OpenTelemetry with the vendor Lightstep. They have a neat solution that allows streaming OTel spans: use large ring buffers to hold and analyze as many traces as you can afford. It finds relations between spans and surfaces interesting spans. You run this part in your own cluster; it then ships a subset of the total spans and traces to their SaaS, which provides a good UI.
Unfortunately, at our scale, that was really expensive and we still had to resort to sampling.
Trying to understand the document as a Go outsider, but it seems the document assumes the reader knows what Ps, Ms, and Gs are? Does anyone know what they are? They seem to be scheduling resources… but I can't find any further info.
P's are logical processors, M's are OS threads, and G's are goroutines. I don't know who coined these terms or abbreviations, but I first heard of them from Bill Kennedy of ArdanLabs in his Ultimate Go course. So I think these are Go-specific terms. Maybe the Go team introduced them earlier.
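If you want to poke at two of the three from a program, the standard runtime package exposes them directly. A minimal sketch (my own, nothing from the design doc): GOMAXPROCS is the number of Ps, NumGoroutine the number of Gs; Ms (OS threads) are created on demand and only show up indirectly, e.g. in an execution trace.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Ps: logical processors the scheduler can run Go code on.
	fmt.Println("Ps (GOMAXPROCS):", runtime.GOMAXPROCS(0)) // passing 0 queries without changing it
	// Gs: goroutines currently in existence.
	fmt.Println("Gs (goroutines):", runtime.NumGoroutine())
	// For reference: CPU count, which is the default value of GOMAXPROCS.
	fmt.Println("CPUs:", runtime.NumCPU())
}
```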
New to golang, I have recently started looking at some of the open-source projects written in it. Looking at some of the code, you see that there are producers and there are consumers.
The problem is that you won't see them connected in a single stack trace, which makes following the flow of data through the system difficult. I mean, there are profiling tools, but they don't go that far.
Another way is to put print or log statements all over functions at their entry and exit and try to piece together a coherent flow.
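(The entry/exit-logging approach can at least be made less tedious with a one-line defer per function. This is just a sketch with names I made up, not something from any particular project; note the producer and consumer still never appear in the same stack trace.)

```go
package main

import (
	"log"
	"time"
)

// trace logs entry immediately and returns a func that logs exit with the
// elapsed time; use it as `defer trace("name")()` at the top of a function.
func trace(name string) func() {
	start := time.Now()
	log.Printf("enter %s", name)
	return func() { log.Printf("exit  %s (%s)", name, time.Since(start)) }
}

func producer(ch chan<- int) {
	defer trace("producer")()
	for i := 0; i < 3; i++ {
		ch <- i
	}
	close(ch)
}

func consumer(ch <-chan int) {
	defer trace("consumer")()
	for v := range ch {
		log.Printf("got %d", v)
	}
}

func main() {
	ch := make(chan int)
	go producer(ch)
	consumer(ch) // the logs interleave, which is the "coherent flow" you end up reconstructing by hand
}
```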
How do people writing golang every day deal with this problem?
> How do people writing golang every day deal with this problem?
For your development setup, have everything run in a Docker container and use breakpoints in your IDE to step through the program flow, similar to what you would do with a standard/monolithic program.
Is this the kind of thing you're asking about, or something different?
Yes, Application Observability is a lot of the purpose of this.
Instrumentation will be richer, more accurate, lower cost, and useful throughout the application for the purposes of tracing and profiling.
What it will mean is that things like https://github.com/open-telemetry/opentelemetry-go will be able to use this to better instrument Go applications, and when OTel includes profiling (as well as traces) you'll be able to use Pyroscope, Polar Signals, etc. for that in addition to Tempo (or whatever you use for tracing), as well as something like Grafana or Datadog to view all of the above.
(I work for Grafana, but the above is relatively vendor-neutral: yes, we recently acquired Pyroscope and launched Grafana Cloud Profiles, and yes, some of the people cited work for Grafana, but others are doing great work here too, and it all benefits people who write and run applications using Go.)
I am not sure what an APM agent is, but you don't need to inject anything into your code to use the Go execution tracer today. In general, it's exposed through HTTP (though it needs to be explicitly enabled by the programmer).
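Concretely, the standard-library recipe looks roughly like this (a minimal sketch; the port is arbitrary). Importing net/http/pprof for its side effects registers the handlers, including the trace endpoint.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on DefaultServeMux, including /debug/pprof/trace
)

func main() {
	// Serving DefaultServeMux exposes the execution tracer over HTTP.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Then something like `curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5'` followed by `go tool trace trace.out` to view it.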
This design document is about improving the implementation of the Go execution tracer, but IIUC the user-level tooling and behavior will remain the same (except that the wire format will be documented).
I think you and OP are talking about different things (pprof vs OTel spans). OTel spans need to be created in every function you want traced, which sucks.
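For anyone who hasn't seen it, the per-function boilerplate being complained about looks roughly like this (sketched against the opentelemetry-go API; the package, tracer name, and function are made up for illustration):

```go
package payments // illustrative

import (
	"context"

	"go.opentelemetry.io/otel"
)

// Every function you want to appear in the trace needs its own span started
// and ended by hand, with the context threaded through every caller.
func ChargeCard(ctx context.Context, amountCents int) error {
	ctx, span := otel.Tracer("payments").Start(ctx, "ChargeCard")
	defer span.End()

	// ... real work goes here, passing ctx down so child spans attach ...
	_ = ctx
	_ = amountCents
	return nil
}
```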
I think low-overhead fine-grained tracing of programs is a pretty undervalued technique. That is, I think it is valuable but I see little written about it online and few good tools that make it easy. I think it is a little more valued in some tech companies (e.g. I would guess that Google values it based on this stuff in Go, the Fuchsia tracing infrastructure, perfetto, as well as their distributed tracing, chrome tracing stuff, and Dick Sites’s crazy-looking printouts. Not sure about others).
I was surprised a bit by the discussion about using the clock instead of rdtsc equivalents. I think because I’m not familiar with the synchronisation issues between cores. I was also surprised by their timings (rdtsc should take ~20 cycles so I would have expected more like 6ns than 10, and I wasn’t expecting the clock call to be so fast). I know some processors round rdtsc values giving pretty poor precision so maybe the clock call will do better in those cases too. It seems like a reasonable approach.
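(If you want to sanity-check the clock number on your own machine, a quick benchmark of time.Now — which goes through the vDSO clock_gettime path on Linux — gives a rough figure. This is my own sketch, not the methodology from the doc.)

```go
package clockbench

import (
	"testing"
	"time"
)

var sink time.Time

// Run with: go test -bench=TimeNow
func BenchmarkTimeNow(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = time.Now() // store to a package-level sink so the call isn't optimized away
	}
}
```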
Supporting tracing in the compiler (rather than as a library) seems right to me. Being able to mess with the compiler is useful because it is important to keep these traces low overhead and that generally means you would much rather write down a small constant integer than copy in some strings for each event. The compiler can ensure these integers are unique and you have some table of them somewhere. (I think it’s a little possible to do this sort of thing in C with just macros as you can drop into assembly and use more advanced assembler features like pushing/popping different sections and suchlike). And the go runtime does enough scheduling that you’ll want to trace so you’ll want to be able to get that information too.
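(Here is a hand-rolled sketch of that string-vs-integer idea in Go — my own illustration, not how the Go compiler or runtime actually wires it up: names get interned once into a table, and each recorded event carries only the small integer.)

```go
package eventid

import "sync"

// Table interns event names so that hot-path trace events can record a
// compact uint32 instead of copying a string each time.
type Table struct {
	mu    sync.Mutex
	ids   map[string]uint32
	names []string
}

func NewTable() *Table {
	return &Table{ids: make(map[string]uint32)}
}

// ID returns the small constant identifier for name, assigning one if needed.
func (t *Table) ID(name string) uint32 {
	t.mu.Lock()
	defer t.mu.Unlock()
	if id, ok := t.ids[name]; ok {
		return id
	}
	id := uint32(len(t.names))
	t.ids[name] = id
	t.names = append(t.names, name)
	return id
}

// Name resolves an ID back to its string when decoding the trace later.
func (t *Table) Name(id uint32) string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.names[id]
}
```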
Stepping back a bit to look at tracing more generally, the landscape of tools just feels kinda bad to me. There are some powerful things at low levels but they are very hard to use.
- Intel PT and Arm CoreSight offer hardware-assisted tracing of things like jumps/function calls (from which you can mostly figure out control flow, though exceptions and things like goroutines can mess that up), and there is great support for both in perf, but you only get a textual 'script' output, which is obviously going to be a lot of data and hard to deal with. I think these tracing features can be virtualised (in newer chips), but hypervisors often don't do it, so e.g. you can't access these features on a typical cloud VM. You also don't have them on AMD CPUs.
- Linux has eBPF, and such programs can be attached to various events in the kernel to trace them (I'm not sure how to efficiently get data out yet. There's a thing called a 'map' which can be a ring buffer, but I only know about an interface where you read one element per syscall. Someone else told me it's possible to just mmap it and then read it if you know the format). There is a great tool in bpftrace, but it doesn't easily give you fine-grained output, I think.
- Solaris / macOS have DTrace, which I confess I don't know much about.
- strace exists. It gives a lot of data, but the output format isn't exactly machine-readable (I admit I haven't tried very hard). For a busy application built in something like Go or Node.js where many threads are doing many different things, the output is pretty hard to deal with. I think the ptrace(2) API it uses isn't great for low overhead, and I don't know how it will trace io_uring syscalls.
- There is an Xcode tool called Instruments which uses DTrace to offer a visualisation of syscalls and various call stacks, but it is mostly a statistical profiler and I find the UI a bit unpleasant.
- Chrome and Firefox have their own trace recorders/profilers and viewers. You can also (somewhat) view Chrome traces in Perfetto, and some programs will generate the Chrome trace format. I haven't used the Firefox profiler, but it seems more geared towards statistical profiles than traces. There's a CLI tool called samply which outputs a format the Firefox profiler can read.
- There seem to be a bunch of different trace formats and viewers and producers and they aren’t generally very compatible with each other. I’m not sure how that will get better over time.
- There are various tools for windows which I don’t know much about (many targeted at game developers), of which many are proprietary and I think some are well loved.
I hope other language implementations will be able to copy some of the things Go does as the advantages of this kind of tracing become more clear. It might be harder to have a good way of tracing JavaScript, which has promises instead of goroutines (different features in some sense, but often used for similar kinds of concurrency), but the async/await syntax and all the effort that went into improving debugger support for it suggest to me that it might be possible.
I won't say "not interesting" for the average programmer: Go internals are usually well explained, and this doesn't dwell on what's written but on how it's written.