The easy way to do this is:

  find . -maxdepth 1 -mindepth 1
Those arguments remove the need for find to stat each directory entry. Regardless, this is a nice walk-through of low-level details that are often overlooked.



The article discusses a bottleneck in readdir(), not stat(). Running your command has the same problem as running ls:

    open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
    fcntl(5, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
    fchdir(5)                               = 0
    getdents(5, /* 2 entries */, 32768)     = 48
    getdents(5, /* 0 entries */, 32768)     = 0
    close(5)                                = 0
It's only reading 32k at a time, but the author had 500MB of directory data to fetch.
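
For reference, the fix the article arrives at is essentially to skip readdir()'s small buffer and call getdents directly with a much larger one. A minimal Linux-only sketch (the 5MB buffer size here is an arbitrary choice, not necessarily the article's exact figure):

  /* List a huge directory by calling getdents64 with a large buffer. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define BUF_SIZE (5 * 1024 * 1024)   /* vs. the 32k glibc uses */

  struct linux_dirent64 {              /* matches the kernel's layout */
      uint64_t       d_ino;
      int64_t        d_off;
      unsigned short d_reclen;
      unsigned char  d_type;
      char           d_name[];
  };

  int main(void) {
      int fd = open(".", O_RDONLY | O_DIRECTORY);
      if (fd < 0) { perror("open"); return 1; }
      char *buf = malloc(BUF_SIZE);
      for (;;) {
          long n = syscall(SYS_getdents64, fd, buf, BUF_SIZE);
          if (n <= 0) break;           /* 0 = end of directory, <0 = error */
          for (long off = 0; off < n; ) {
              struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
              puts(d->d_name);         /* print the name; no stat() anywhere */
              off += d->d_reclen;
          }
      }
      free(buf);
      close(fd);
      return 0;
  }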


Actually 'find' will also stat each entry no matter what.

Many of the standard tools that most people would intuitively expect to be well optimized (find, rsync, gzip) are embarrassingly inefficient under the hood and turn belly-up when confronted with data of any significant size.

That probably stems from the fact that most of the development on these tools took place at a time when 1GB hard drives were "huge" and SMP was "high end".


The only issue I'm aware of with gzip is actually in zlib, which stored 32-bit byte counters; but those are strictly optional, and it works fine with data that overflows them. The zlib window size may be only 32k, but bzip2 doesn't do that much better with a 900k window and a better algorithm, so I wouldn't consider it embarrassingly inefficient.


I was referring to the lack of SMP support in gzip (see http://www.zlib.net/pigz/).


How do you tell a file from a directory without stat()ing it? The d_type field is not portable. Since find and other tools like it need to recursively descend a directory tree, a stat() for each file to determine its type is unavoidable.
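
In practice the usual compromise is exactly that: trust d_type when the filesystem fills it in, and pay for an lstat() only when it reports DT_UNKNOWN. A hedged sketch of the pattern (not any particular tool's actual code):

  /* Decide file vs. directory, stat()ing only when d_type is unavailable. */
  #define _DEFAULT_SOURCE
  #include <dirent.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/stat.h>

  static int is_dir(const char *path, const struct dirent *de) {
      if (de->d_type != DT_UNKNOWN)    /* the kernel typed it for free */
          return de->d_type == DT_DIR;
      struct stat sb;                  /* this filesystem left d_type unset */
      return lstat(path, &sb) == 0 && S_ISDIR(sb.st_mode);
  }

  int main(void) {
      DIR *dir = opendir(".");
      if (!dir) { perror("opendir"); return 1; }
      struct dirent *de;
      while ((de = readdir(dir)) != NULL) {
          if (strcmp(de->d_name, ".") && strcmp(de->d_name, ".."))
              printf("%s%s\n", de->d_name, is_dir(de->d_name, de) ? "/" : "");
      }
      closedir(dir);
      return 0;
  }
On strictly POSIX systems with no d_type at all, this degenerates to one lstat() per entry, which is the unavoidable case described above.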


But times have changed, and development isn't dead. Why haven't they been updated? The optimizations you're implying are often straightforward and well understood, not major undertakings to implement.


"But times have changed, and development isn't dead. Why haven't they been updated?"

Maybe because listing 8M files is not a common use case, and there just isn't the motivation to update otherwise perfectly working code. It's not an itchy problem.


Compressing/decompressing large quantities of data is a common use case, at the very least.


In this case, since it was a virtualized disk and it was reading in 32K chunks, I'm fairly confident that this wouldn't have helped.

Certainly find . would have been faster without calling stat().

Does os.listdir() stat?


He tried using find, but in this case the libc readdir function was the bottleneck, so find doesn't help.

Often, when ls is being slow, you can speed things up drastically by disabling the options that require a stat (colors, -F, etc.), which are often added by distros' shell startup files (invoking ls with a full path is an easy way to disable shell aliases). Also, sorting is on by default, which slows things down for obvious reasons.

When all you need to do is kill the stat overhead for small dirs on slow file systems, "echo *" is beautifully simple.


  > invoking ls with a full path is an easy way to
  > disable shell aliases
Or doing this:

  'ls'


Or \ls.


And if you don't want to bother sorting:

  \ls -f


Or:

  env ls
Notes:

* IIRC, 'env' is a built-in in csh/tcsh, though, and doesn't behave the way I'd expect it to. You may want to read the manpage in that case.

* This is how the following works:

  #!/usr/bin/env python



