Excellent writeup. Computer systems are discoverable. That attitude, along with some of the basic tools (such as strace, ltrace, man pages, debuggers and a compiler) and a willingness to dive into system source code go a long way. If your tracing leads you to a library call and you don't know what's going on inside, find the source code. If it's the kernel, load up lxr (http://lxr.linux.no/).
Very interesting. I spotted two minor problems with the posted code.
Doing this:
#define BUF_SIZE 1024 * 1024 * 5
to define a numerical constant is a bit scary, since depending on how the symbol is used it can break due to dependencies. It's better to enclose the value in parenthesis. Personally I would probably write the right hand side as (5 << 20).
The first time the modification to skip entries with inode 0 is mentioned, the code is wrong:
I did this by adding
if (dp->d_ino == 0) printf(...);
This should use !=, not == (it's correct the second time, but this adds confusion).
On second thought (too late to edit), it's quite likely that I would not even bother with a define for this, but instead just set the size directly in the code, e.g. char buffer[5 << 20], then rely on sizeof buffer in the call where the buffer's size is needed. I prefer sizeof whenever possible.
Using (5 << 20) instead of (5 * 1024 * 1024) is premature optimization. Modern C compilers will take the expression (5 * 1024 * 1024) and turn it into the combination of shifts and additions that are appropriate for your architecture.
And I definitely prefer defined constants for two reasons. One, it's likely you'll have to declare multiple such buffers in different places. Two, if I want to tune the parameter, I'd rather do it at the top of a source file with other such defined constants than hunting for the declaration in the code. I do agree that sizeof() is preferable when it's an option.
Correct, to me it's idiomatic. I realize it's not for everyone, so the ultimate best code is probably the way it was written ... On the other hand, sometimes it's fun to push the envelope just a tiny bit, and hope that readers will actually learn something from the code. But that's a thin edge to be balancing on.