The easy way to do this is:

  find . -maxdepth 1 -mindepth 1
Those arguments remove the need for find to stat each directory entry. Regardless, this is a nice walk-through of low-level details that are often overlooked.



The article discusses a bottleneck in readdir(), not stat(). Running your command has the same problem as running ls:

    open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
    fcntl(5, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
    fchdir(5)                               = 0
    getdents(5, /* 2 entries */, 32768)     = 48
    getdents(5, /* 0 entries */, 32768)     = 0
    close(5)                                = 0
It's only reading 32k at a time, but the author had 500MB of directory data to fetch.
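
For reference, the fix the article arrives at is essentially to skip readdir()'s small buffer and call getdents directly with a much larger one. A minimal Linux-only sketch (the 5MB buffer size here is an arbitrary choice, not necessarily the article's exact figure):

  /* List a huge directory by calling getdents64 with a large buffer. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define BUF_SIZE (5 * 1024 * 1024)   /* vs. the 32k glibc uses */

  struct linux_dirent64 {              /* matches the kernel's layout */
      uint64_t       d_ino;
      int64_t        d_off;
      unsigned short d_reclen;
      unsigned char  d_type;
      char           d_name[];
  };

  int main(void) {
      int fd = open(".", O_RDONLY | O_DIRECTORY);
      if (fd < 0) { perror("open"); return 1; }
      char *buf = malloc(BUF_SIZE);
      for (;;) {
          long n = syscall(SYS_getdents64, fd, buf, BUF_SIZE);
          if (n <= 0) break;           /* 0 = end of directory, <0 = error */
          for (long off = 0; off < n; ) {
              struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
              puts(d->d_name);         /* print the name; no stat() anywhere */
              off += d->d_reclen;
          }
      }
      free(buf);
      close(fd);
      return 0;
  }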


Actually 'find' will also stat each entry no matter what.

Many of the standard tools that most people would intuitively expect to be well optimized (find, rsync, gzip) are embarrassingly inefficient under the hood and turn belly-up when confronted with data of any significant size.

That probably stems from the fact that most of the development on these tools took place at a time when 1GB hard drives were "huge" and SMP was "high end".


The only issue I'm aware of with gzip is actually in zlib, which stored 32-bit byte counters; but those are strictly optional, and it works fine with data that overflows them. The zlib window size may be only 32k, but bzip2 doesn't do that much better with a 900k window and a better algorithm, so I wouldn't consider it embarrassingly inefficient.


I was referring to the lack of SMP support in gzip (see http://www.zlib.net/pigz/).


How do you tell a file from a directory without stat()ing it? The d_type field is not portable. Since find and other tools like it need to recursively descend a directory tree, a stat() for each file to determine its type is unavoidable.
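
In practice the usual compromise is exactly that: trust d_type when the filesystem fills it in, and pay for an lstat() only when it reports DT_UNKNOWN. A hedged sketch of the pattern (not any particular tool's actual code):

  /* Decide file vs. directory, stat()ing only when d_type is unavailable. */
  #define _DEFAULT_SOURCE
  #include <dirent.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/stat.h>

  static int is_dir(const char *path, const struct dirent *de) {
      if (de->d_type != DT_UNKNOWN)    /* the kernel typed it for free */
          return de->d_type == DT_DIR;
      struct stat sb;                  /* this filesystem left d_type unset */
      return lstat(path, &sb) == 0 && S_ISDIR(sb.st_mode);
  }

  int main(void) {
      DIR *dir = opendir(".");
      if (!dir) { perror("opendir"); return 1; }
      struct dirent *de;
      while ((de = readdir(dir)) != NULL) {
          if (strcmp(de->d_name, ".") && strcmp(de->d_name, ".."))
              printf("%s%s\n", de->d_name, is_dir(de->d_name, de) ? "/" : "");
      }
      closedir(dir);
      return 0;
  }
On strictly POSIX systems with no d_type at all, this degenerates to one lstat() per entry, which is the unavoidable case described above.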


But times have changed, and development isn't dead. Why haven't they been updated? The optimizations you're implying are often straightforward and well understood, not major undertakings to implement.


"But times have changed, and development isn't dead. Why haven't they been updated?"

Maybe because listing 8M files is not a common use case, and there just isn't the motivation to update otherwise perfectly working code. It's not an itchy problem.


Compressing/decompressing large quantities of data is a common use case, at the very least.


In this case, since it was a virtualized disk and it was reading in 32K chunks, I'm fairly confident that this wouldn't have helped.

Certainly find . would have been faster without calling stat().

Does os.listdir() stat?


He tried using find, but in this case the libc readdir function was the bottleneck, so find doesn't help.

Often, when ls is being slow, you can speed things up drastically by disabling the options that require a stat (colors, -F, etc.), which are often added by distros' shell startup files (invoking ls with a full path is an easy way to disable shell aliases). Also, sorting is on by default, which slows things down for obvious reasons.

When all you need to do is kill the stat overhead for small dirs on slow file systems, "echo *" is beautifully simple.


  > invoking ls with a full path is an easy way to
  > disable shell aliases
Or doing this:

  'ls'


Or \ls.


And if you don't want to bother sorting:

  \ls -f


Or:

  env ls
Notes:

* IIRC, 'env' is a built-in in csh/tcsh, though, and doesn't behave the way I'd expect it to. You may want to read the manpage in that case.

* This is how the following works:

  #!/usr/bin/env python



