Hacker News
Introduction to strace (fbarr.net)
151 points by aburan28 on Dec 1, 2016 | 24 comments



No mention of overhead, and finishes with an strace -c and "This option is very useful when trying to find out why a program is running slow." Well, it will run slow if you run "strace -c", because you're running "strace -c".

Because of the overheads, I wouldn't trust the high resolution timestamps also included in the blog post.

Also not mentioned: "strace -e" to select a single syscall doesn't reduce overhead -- you pay the tax for tracing all syscalls anyway.
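A rough way to see this overhead for yourself is to time a syscall-heavy command with and without strace. This is a sketch, assuming Linux, strace, and a `dd` that accepts `bs=1 count=N`; the function names and the `COUNT` value are arbitrary choices for illustration, and absolute timings will vary by machine, kernel, and strace version.

```shell
#!/bin/sh
# Rough overhead comparison: a sketch, not a benchmark.
# dd with bs=1 issues one read() and one write() per byte, so this workload
# is almost all syscalls, close to a worst case for ptrace-based tracing.
COUNT=${COUNT:-100000}

# Baseline: no tracing.
run_plain() {
    dd if=/dev/zero of=/dev/null bs=1 count="$COUNT" 2>/dev/null
}

# strace -c: count syscalls instead of printing each one.
# -o /dev/null discards the summary; only the run time matters here.
run_counted() {
    strace -c -o /dev/null dd if=/dev/zero of=/dev/null bs=1 count="$COUNT" 2>/dev/null
}

# strace -e: only *report* one syscall (openat here). Per the comment above,
# the tracee is still stopped at every syscall, so this is not much cheaper.
run_filtered() {
    strace -e trace=openat -o /dev/null dd if=/dev/zero of=/dev/null bs=1 count="$COUNT" 2>/dev/null
}
```

In bash, prefixing each call with `time` (e.g. `time run_counted`) makes the difference visible; expect both traced runs to be far slower than `run_plain`, by an amount that depends on the system.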

Because of how strace works, there have also been situations in past kernels where it could hang processes that then needed a kill -9. Your application is now paused while you're madly typing at the command line. If that happened on one of my instances at work (although I believe that bug was fixed a long time ago), request timeouts might make the instance fail over before I could finish hitting enter. That's not so bad, but in your environment it could be much worse. You may get such timeouts just from strace's overhead.

It's one thing to write a post that has errors. It's another to write a post that's dangerous.

This is why I wrote http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-...


Brendan, I appreciate all your contributions to the software tool landscape.

> It's another to write a post that's dangerous.

Huh? It's called "introduction to strace". I agree with everything you said, it's easy to bork things up with strace. But I'd hardly call this article "dangerous."

90% of the time I'm running some script or opaque binary that gives a bogus, unclear error or just stops executing (or appears to halt). strace is the perfect tool for finding out which syscall is failing and why.
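For the "which syscall is failing" case, one common pattern is to write the whole trace to a file and then look only at calls that returned an error. A minimal sketch, assuming Linux and strace; the helper name `trace_failures` is made up for this example, and it relies on strace's usual convention of printing failed calls as `= -1 ERRNO (message)`.

```shell
#!/bin/sh
# Hypothetical helper: run a command under strace and print only the
# syscalls that failed.
#   -f  also trace child processes (scripts usually fork/exec helpers)
#   -o  write the trace to a file instead of mixing it with the program's stderr
trace_failures() {
    log=$(mktemp) || return 1
    strace -f -o "$log" "$@" >/dev/null 2>&1
    # Failed syscalls end in "= -1 ERRNO (description)".
    grep ' = -1 ' "$log"
    rm -f "$log"
}
```

For example, `trace_failures ls /no/such/path` should include the lookup that fails with `ENOENT`. Plenty of failed calls are normal (the dynamic loader checking for optional files, for instance), so the interesting lines are usually the last few before the program gives up.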

But, yes, your warnings are good reminders that attaching to a service that's already running can have bad consequences.


It's a tool with an unexpected side effect that it can hurt or kill production. And this post (and others like it) encourages people to use it without a single mention of "overhead" or dangers.

One may not need to belabor the dangers of tools like shutdown, ifconfig, fdisk, etc., since with administration tools one should _expect_ dangers. The worst tools are where the dangers are unexpected. People are going to learn it the hard way if you don't provide a warning.


It looks like this was written in Jan. 2014. Your post is from a few months later.

The content on your site is great and personally I really appreciate it. Thanks for adding some context here.


> Because of the overheads, I wouldn't trust the high resolution timestamps also included in the blog post.

Naive question: why would the timestamps be inaccurate? Surely each step of the program's execution would just be slower. Am I missing something about how a syscall's timestamp would be off?


They're inaccurate in absolute terms for the reason you mentioned, but they're also inaccurate in relative terms, because not all of the program's execution is syscalls. Regular program code runs at (more or less) the same speed, but syscalls are made much slower, so they appear to take more time than they really do.
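The relative distortion can be illustrated with a toy model: suppose tracing adds a fixed cost to every syscall while user-space code runs at normal speed. Every input below is hypothetical, in microseconds, and chosen only to keep the arithmetic easy; real per-syscall overhead depends on the kernel, the CPU, and the strace version.

```shell
#!/bin/sh
# Toy model, all inputs hypothetical (microseconds):
#   $1 user-space time, $2 cost of one syscall, $3 number of syscalls,
#   $4 extra cost tracing adds to each syscall.
# Prints the percentage of wall time spent in syscalls, untraced then traced.
syscall_share() {
    user_us=$1 sys_us=$2 nsys=$3 trace_us=$4
    real_sys=$(( sys_us * nsys ))
    traced_sys=$(( (sys_us + trace_us) * nsys ))
    echo "$(( 100 * real_sys / (user_us + real_sys) )) $(( 100 * traced_sys / (user_us + traced_sys) ))"
}

syscall_share 1000 1 1000 20   # prints "50 95"
```

In this made-up case syscalls account for half of the real run time but about 95% of the traced run time, so per-call timings taken under strace overstate how much the syscalls matter.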


I think the explanation is that ptracing a process and tapping its syscalls will always add some latency, e.g. time(process(open())) < time(ptrace(process(open()))), if that makes sense? It's similar to SELinux: there's a cost to tapping syscalls.


It's too bad no one has written a tool that keeps the familiar strace CLI flags and output format but uses perf/eBPF under the hood. If such a tool existed, I would be much more likely to reach for it than to fall back to strace.


Back in the early aughts I was applying for a junior sysadmin job at one of those up-and-coming LAMP web-hosting companies here in Vienna. The interviewer was a great, all-around nice guy and a very experienced sysadmin. All went rather well until he asked me how I would go about fixing a specific Apache httpd issue. Suffice it to say, I was still green behind the ears and didn't know about strace, and that was pretty much it.

I'm still very grateful that I botched that interview; otherwise, down the line, I might not have gone into systems programming / Unix internals as easily.

Julia Evans has a great strace-related zine on her website; check it out as well:

http://jvns.ca/blog/2015/04/14/strace-zine/

Also, if you are on macOS/BSD, check out the somewhat related dtrace/dtruss; they are very powerful tools.


Good brief intro article.

At some point I tried to dig deeper and understand the ptrace(2) syscall, the technology behind the strace command-line tool. I wrote this piece:

https://idea.popcount.org/2012-12-11-linux-process-states/

The ptrace() articles were never finished, but oh well. I guess ptrace() is doomed to be undocumented and barely understood. Recently I found this gem in the 1983 4.2BSD man page:

> Ptrace is unique and arcane; it should be replaced with a special file which can be opened and read and written. The control functions could then be implemented with ioctl(2) calls on this file. This would be simpler to understand and have much higher performance.

http://www.tuhs.org/cgi-bin/utree.pl?file=4.2BSD/usr/man/man...


I wrote an article explaining how ptrace works, which may interest you: https://blog.packagecloud.io/eng/2016/02/29/how-does-strace-...


ptrace() isn't undocumented, it has a fairly long manpage: http://man7.org/linux/man-pages/man2/ptrace.2.html

It's certainly pretty hairy, but on the other hand only a handful of programs (debuggers, strace, similar debug type tools) are ever going to need to use it.


Pretty hairy is an understatement. There are so many edge cases in how ptrace works (especially interactions with signals) that it's almost impossible to build a truly robust debugger on top of it. Solaris did it right with its procfs debugger interface. I really wish Linux would step up its game here.


You can also use gdb to debug more than just C code.

I used to know a DBA who had symbols installed on the production database server so he could attach a debugger to a running Postgres backend process and use the backtrace to tell you what the query was doing and why it was running slowly.

Also, note that running strace on a production process can change its observed behavior. http://man7.org/linux/man-pages/man2/ptrace.2.html


Maybe I'm not following you, but isn't that just attaching gdb to C code?


There are often macros you can use to help. In that case, sure, it's C code. Another "it's just C code" case is debugging CPython (https://wiki.python.org/moin/DebuggingWithGdb). These macros let you inspect both the C stack and the Python stack, plus some other stuff. Super super handy for rare cases where shit just gets weird.


Well, yes... But it's probably being used to troubleshoot a bottleneck in postgres being triggered by a higher level language application.


Tangentially, there's a funny Easter egg in strace: with some trial and error, you can get it to strace its own PID.

  $ strace -p 957
  strace: I'm sorry, I can't let you do that, Dave.


  let z=$(readlink /proc/self)+1 && strace -p $z

Probably a race. I'm not a real programmer.


  sh -c 'exec strace -p $$'

This is race-free and POSIX-compliant, I think.
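The `exec` version avoids the race because `exec` replaces the shell with strace inside the same process, so strace ends up with the shell's PID, which `$$` already holds; nothing has to be guessed, unlike the `/proc/self` + 1 approach. The sketch below demonstrates that PID-preserving behavior without needing strace, using a second `sh` in its place; the function name is made up for illustration.

```shell
#!/bin/sh
# The outer shell prints its PID ($$), then execs a new sh that prints its own.
# Because exec reuses the process, both lines are the same number.
# (Inside the double quotes, \$\$ reaches the inner sh as a literal $$.)
pid_before_and_after_exec() {
    sh -c 'echo "$$"; exec sh -c "echo \$\$"'
}

pid_before_and_after_exec
```

Swapping the inner `sh -c` for `strace -p $$` gives the one-liner above, which should then hit the "Dave" message rather than attaching.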


Great introductory article, thanks for writing and sharing this!

I wrote an article explaining the inner workings of strace [1], and a detailed article about Linux system calls [2] which others interested in this article may find relevant.

[1]: https://blog.packagecloud.io/eng/2016/02/29/how-does-strace-...

[2]: https://blog.packagecloud.io/eng/2016/04/05/the-definitive-g...


I much prefer Julia Evans's zine on strace:

http://jvns.ca/strace-zine-unfolded.pdf


The problem with strace is that it is not reentrant, i.e. you can't run strace through strace.

So don't use it in scripts unless you are debugging.
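Since a process can have only one ptrace tracer at a time, a script that might itself be run under `strace -f` could check whether it is already being traced before starting its own strace. A Linux-only sketch: it reads the `TracerPid:` field of `/proc/<pid>/status`, which is 0 when no tracer is attached; the helper names and the `trace.log` filename are hypothetical.

```shell
#!/bin/sh
# Linux-only: print the PID of whatever is ptrace-attached to a process
# (default: this shell), or 0 if nothing is.
tracer_pid() {
    awk '/^TracerPid:/ { print $2 }' "/proc/${1:-$$}/status"
}

# Hypothetical wrapper: only start strace when we are not already traced,
# because an outer `strace -f` would already own our child processes.
maybe_strace() {
    if [ "$(tracer_pid)" = 0 ]; then
        strace -f -o trace.log "$@"
    else
        "$@"
    fi
}
```

This is deliberately conservative: it also skips strace when the outer tracer was started without `-f`, where nesting would actually have worked.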


If you're not debugging, why are you using strace?



