From the Arrakis paper: "To analyze the sources of overhead, we record timestamps at various stages of kernel and user-space processing." You can probably implement this with something like perf dynamic tracing: http://www.brendangregg.com/perf.html#DynamicTracing
"How did you measure the time spent in each section (HW, kernel, app)? how did you get such granularity?"