You can list a directory containing 8 million files But not with ls.. (olark.com)
268 points by bcx on Aug 15, 2011 | 89 comments



Excellent writeup. Computer systems are discoverable. That attitude, along with some of the basic tools (such as strace, ltrace, man pages, debuggers and a compiler) and a willingness to dive into system source code go a long way. If your tracing leads you to a library call and you don't know what's going on inside, find the source code. If it's the kernel, load up lxr (http://lxr.linux.no/).


Very interesting. I spotted two minor problems with the posted code.

Doing this: #define BUF_SIZE 1024 * 1024 * 5

to define a numerical constant is a bit scary, since depending on how the symbol is used it can break due to operator precedence at the use site. It's better to enclose the value in parentheses. Personally I would probably write the right-hand side as (5 << 20).
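A trivial illustration of the precedence trap (the offset value is made up for the example):

  #include <stdio.h>

  #define BUF_SIZE 1024 * 1024 * 5          /* as posted */
  #define BUF_SIZE_SAFE (1024 * 1024 * 5)   /* parenthesized */

  int main(void)
  {
      long offset = 6000000;
      /* expands to ((offset % 1024) * 1024) * 5, not offset % (1024*1024*5) */
      printf("%ld\n", offset % BUF_SIZE);       /* prints 1966080 */
      printf("%ld\n", offset % BUF_SIZE_SAFE);  /* prints 757120 */
      return 0;
  }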

The first time the modification to skip entries with inode 0 is mentioned, the code is wrong:

I did this by adding if (dp->d_ino == 0) printf(...);

This should use !=, not == (it's correct the second time, but this adds confusion).


On second thought (too late to edit), it's quite likely that I would not even bother with a define for this, but instead just set the size directly in the code, e.g. char buffer[5 << 20], then rely on sizeof buffer in the call where the buffer's size is needed. I prefer sizeof whenever possible.
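Something like this (a trivial sketch; the static qualifier is mine, just to keep a 5MB array off the stack):

  #include <stdio.h>

  int main(void)
  {
      static char buffer[5 << 20];      /* static: keep 5MB off the stack */
      /* sizeof buffer automatically tracks the declaration */
      printf("%zu\n", sizeof buffer);   /* prints 5242880 */
      return 0;
  }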


Using (5 << 20) instead of (5 * 1024 * 1024) is premature optimization. Modern C compilers will take the expression (5 * 1024 * 1024) and turn it into the combination of shifts and additions that are appropriate for your architecture.

And I definitely prefer defined constants for two reasons. One, it's likely you'll have to declare multiple such buffers in different places. Two, if I want to tune the parameter, I'd rather do it at the top of a source file with other such defined constants than hunt for the declaration in the code. I do agree that sizeof() is preferable when it's an option.


I don't think he meant it as an optimization. It's easy to understand as 5 * 2^20, and quicker to type.


I much prefer 5 * 1024 * 1024. I have to stop and do some reasoning with 5 << 20.


Correct, to me it's idiomatic. I realize it's not for everyone, so the ultimate best code is probably the way it was written ... On the other hand, sometimes it's fun to push the envelope just a tiny bit, and hope that readers will actually learn something from the code. But that's a thin edge to be balancing on.


I suspect the author is incorrect in his claim that reading in 32k chunks is responsible for the slowness. Due to read ahead and buffering, Unix-like systems tend to do reasonably well on small reads. Yes, big reads are better, but small reads are not unreasonably slow.

To test this, he should try "ls | cat". On large directories that often runs many orders of magnitude faster than "ls". This is because, I believe, ls on most people's systems defaults to displaying information such as file modes or type, via coloring or otherwise decorating the file names, and getting that information requires looking at the inode for each file.

It's all those inode lookups that slow things down, as the inodes are likely to be scattered all over the place. When you do "ls | cat", ls notices that its output is not a terminal, turns off the fancy stuff, and just lists the file names. The names can be determined entirely from the directory itself, so performance is much better.


The original impetus for the post was python's os.listdir(), which as far as I know doesn't stat(). ls just made the blog post more interesting :-).

I was surprised that the 32K reads were taking so long. It's possible since it was on a virtualized disk ("in the cloud") that something else was slowing down disk IO (like Xen).

But I can assure you that a larger read buffer performed much better in this given scenario.

I'd welcome more tests though.


This is just a hypothesis based on very little actual knowledge, but perhaps a very long scheduling interval is responsible for the slowness with smaller reads? Consider this scenario: the virtualization hypervisor is on a reasonably loaded system, and decides to block the virtual machine for every single read. Since the physical system has several other VMs on it, whenever the VM in question loses its time slice it has to wait a long time to get another one. Thus, even if the 32K read itself happens quickly, the act of reading alone causes a delay of n milliseconds. If you increase the read size, your VM still gets scheduled 1000/n times per second, but each time it gets scheduled it reads 5MB instead of 32K.


I just tried this on non-virtualized hardware with 10 million files in a single directory using the zfsonlinux native zfs implementation for linux. It took a little over 4 minutes to do a "\ls -f | wc -l" so this might very well be something to do with virtualization.

I'll try an ext3 file system just for giggles and post the results.

Edit:

Didn't have an ext3 file system with enough free inodes handy so I used an ext4 file system. It takes too long to create 10 million files so I cut the test off early. It took about 7 seconds to complete a "\ls -f | wc -l" with 6.2 million files in a single directory.


You are right that 32k buffers should be more than enough to read 5MB in a reasonable time, regardless of disk/VM architecture, but I don't think stat was the problem either. My guess would be readdir or possibly getdents is probably O(n^2) somewhere.

[just noticed it was 500M (oh wow), but same difference]


> To test this, he should try "ls | cat".

Running /bin/ls will bypass the alias.


Prefixing the command with a \ will also disable the alias (e.g. \ls)


Do you know where this is documented? (I just skimmed the bash docs and couldn't find it.) Thanks.


At first glance what you say makes sense, but then why did find and the python call both have similar issues?


Really surprised by all the high-fiving and positive excitement going on about this article.

Putting eight million files in one directory level aside, the whole basis for this event - using the filesystem as a storage layer for a k/v 'database' - is just twisted.

Happy not to be working with devs like this.


Systems people tend to do the simplest, reasonable thing that works, then forget about it until it doesn't work anymore. This means that sometimes you'll make design choices that look ugly, but really, they don't matter.

See this talk by Jonathan Blow, the designer and programmer behind Braid: http://news.ycombinator.com/item?id=2689999


The author points out it was a bug that caused this, not a valid state. I think they should probably be using sqlite anyways though.


Indeed,

Sure, you can find out a lot by doing things you really shouldn't do, like using the directory system as a k/v store.

But in the end, you should still learn the lesson that it is really a bad idea.


getdents stands for "get directory entries", in case anyone else was wondering.


Hark back to the days when 8 character function names were a luxury!

And remember to put all your variable declarations at the top of each block, so the compiler can handle it all in a single pass :)


The author has a great tip for kernel/filesystem developers:

"Perhaps the buffer should be dynamically set based on the size of the directory entry file"

This would eliminate the readdir() bottleneck.


readdir() is implemented in libc, not the kernel. The kernel interface is getdents() and that has a configurable buffer size.
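For anyone curious, here's a rough sketch of what that looks like when you bypass readdir() and call getdents64 yourself with a big caller-chosen buffer. The struct layout is copied from the getdents(2) man page and the 5MB figure is just the one from the post; treat it as illustrative, not as the author's exact program:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define BUF_SIZE (5 * 1024 * 1024)     /* 5 MB instead of libc's ~32k */

  struct linux_dirent64 {                /* layout per getdents64(2) */
      unsigned long long d_ino;
      long long          d_off;
      unsigned short     d_reclen;
      unsigned char      d_type;
      char               d_name[];
  };

  int main(int argc, char *argv[])
  {
      int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
      if (fd < 0) { perror("open"); return 1; }
      char *buf = malloc(BUF_SIZE);
      long nread;
      while ((nread = syscall(SYS_getdents64, fd, buf, BUF_SIZE)) > 0) {
          for (long off = 0; off < nread; ) {
              struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
              if (d->d_ino != 0)         /* skip deleted entries */
                  puts(d->d_name);
              off += d->d_reclen;
          }
      }
      if (nread < 0) perror("getdents64");
      free(buf);
      close(fd);
      return 0;
  }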


>>> "Don’t be afraid to compile code and modify it"

I was a bit thrown by this advice. Are there folks out there that are afraid to compile code and modify it?


Absolutely. The sad reality is that, to a lot of linux users right now (and I'm talking about people that know they're running linux, not android users), it's a black box.

Software is "packages" that you install with "synaptic".

Source code? Compiler? What's that?


This seemed like a developer-targeted article, though. I mean, who is having trouble with even 1 million directory entries, let alone 8 million, who is afraid of a compiler?


Developers who only write in interpreted languages :-)


That's not sad reality, that's success.


Both valid perspectives; it depends on your priorities. Some people want computers to be a black box, because when the black box works, it means less cognitive load to use.


The most interesting thing (in this subthread) is that Linux is able to be a black box. That wasn't the case not too long ago.


Well, once the hardware started working and the graphical toolkit options expanded beyond Tcl/Tk / Perl/Tk...

Ubuntu accelerated the process. Despite being a long time Linux user and programmer, I'd rather know the machine is goin' to work when I haven't done anything to mangle the beast.

I've spent hundreds, if not thousands of hours in the past just getting networking drivers to function. The brave new world of Linux is a good thing, the Interp-Only volken serve only to bolster the ecosystem, not harm it.


Throw a novice Ubuntu user into a FreeBSD system and tell them to install a port, and 9 times out of 10 they'll freak out once they see GCC output on the screen.

Nothing against Ubuntu (RedHat, SuSE, Debian, Arch, et. al.), but source compilation is something they all have been letting their users avoid for a long time. The target audience is different.


That's because linux commands are generally quiet unless something goes wrong. ./configure, make and gcc can produce pages of output even when nothing is wrong.


Yeah, but packaged source code probably shouldn't be riddled with warnings.


Even when there's a compilation step it can be hidden. Homebrew on OS X does compile each package, but produces no actual compiler output. VMware Player regularly recompiles kernel modules when it detects that a kernel upgrade has invalidated them. To the user it looks like a bullet-list installation step.


Speaking as a novice Ubuntu user, I don't even know what "install a port" means. Do you mean "open a port"?


Nope, he does not mean open a port. Ports is the name of the packaging system FreeBSD uses, and a port is the equivalent (more or less) of an RPM or deb source package.

http://www.freebsd.org/ports/


Doesn't that become awfully confusing?


Only out of context.

If your server had a network connection problem and you needed to open up a port, we're already talking about network ports, not software, so you would try to diagnose your network.

If you asked me how you could get the old game `rogue` on your system, I would tell you to go install the freebsd-games port, which has nothing to do with networking, and so it clearly means that you need to look in your ports tree.


Ports tree is the *BSD source package manager.


Arch has a Ports-inspired system called Pkgbuild. Given the level of competence the distro expects of the user to start with, I doubt most Arch users would have a great deal of trouble adapting to FreeBSD.


I've rarely seen macports, FreeBSD ports, or NetBSD ports used as a harness to install modified versions of software. Hack jobs are almost always manual, so this purported benefit is a canard.


Are there folks out there that are afraid to compile code and modify it?

Some developers (myself included) may have a general preference for sticking with the "official" packages to avoid extra work when bringing up a new machine or migrating to a new distro version.


Surely it should be the other way around too. Compiling code then modifying it seems impractical. :)


The easy way to do this is:

  find . -maxdepth 1 -mindepth 1
Those arguments will remove the need for find to stat each directory entry. Regardless, this is a nice walk-through of low-level details that are often overlooked.


The article discusses a bottleneck in readdir(), not stat(). Running your command has the same problem as running ls:

    open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
    fcntl(5, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
    fchdir(5)                               = 0
    getdents(5, /* 2 entries */, 32768)     = 48
    getdents(5, /* 0 entries */, 32768)     = 0
    close(5)                                = 0
It's only reading 32k at a time, but the author had 500MB of data to be fetched.


Actually 'find' will also stat each entry no matter what.

Many of the standard-tools that most people would intuitively expect to be rather optimized (find, rsync, gzip) are embarrassingly inefficient under the hood and turn belly up when confronted with data of any significant size.

That probably stems from the fact that most of the development on these tools took place during a time when 1GB harddrives were "huge" and SMP was "high end".


The only issue I'm aware of with gzip is actually in zlib, where it stored 32-bit byte counters, but those were strictly optional and it works fine with data that overflowed them. The zlib window size may be only 32k, but bzip2 doesn't do that much better with 900k and a better algorithm, so I wouldn't consider it embarrassingly inefficient.


I was referring to the lack of SMP support in gzip (see http://www.zlib.net/pigz/).


How do you tell a file from a directory without stat()ing it? The d_type field is not portable. Since find and other tools like it need to recursively descend a directory tree, a stat() for each file to determine its type is unavoidable.
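For what it's worth, I believe the usual compromise on Linux is to trust d_type when the filesystem fills it in and fall back to a stat only for DT_UNKNOWN. A sketch (the is_dir helper and its full-path argument are made up for illustration):

  #define _GNU_SOURCE
  #include <dirent.h>
  #include <sys/stat.h>

  /* Returns 1 if the entry is a directory, stat()ing only when the
     filesystem didn't fill in d_type (i.e. reports DT_UNKNOWN). */
  static int is_dir(const char *fullpath, const struct dirent *de)
  {
      if (de->d_type != DT_UNKNOWN)
          return de->d_type == DT_DIR;
      struct stat st;
      if (lstat(fullpath, &st) != 0)
          return 0;
      return S_ISDIR(st.st_mode);
  }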


But times have changed, and development isn't dead. Why haven't they been updated? The optimizations you imply are often straightforward and well-understood; not major undertakings to implement.


"But times have changed, and development isn't dead. Why haven't they been updated?"

Maybe because listing 8M files is not a common use case, and there just isn't the motivation to update otherwise perfectly working code. It's not an itchy problem.


Compressing/decompressing large quantities of data is, at the very least.


In this case since it was a virtualized disk and it was reading in 32K chunks, I am fairly confident that this wouldn't have helped.

Certainly find . would have been faster without calling stat().

Does os.listdir() stat?


He tried using find, but, in this case, the libc readdir function was the bottleneck, so find doesn't help.

Often, when ls is being slow, you can speed things up drastically by disabling the things that require a stat (colors, -F, etc.) that are often added by distro's shell files (invoking ls with a full path is an easy way to disable shell aliases.) Also, sorting is on by default, which slows things for obvious reasons.

When all you need to do is kill the stat overhead for small dirs on slow file systems, "echo *" is beautifully simple.


  > invoking ls with a full path is an easy way to
  > disable shell aliases
Or doing this:

  > 'ls'


Or \ls.


And you don't want to bother sorting, so:

\ls -f


Or:

  env ls
Notes:

* IIRC 'env' is a built-in on csh/tcsh though, and doesn't behave like I would expect it to. You may want to read that manpage in that case.

* This is how the following works:

  #!/usr/bin/env python


The man pages really do try to put you off using the syscalls. I only used getdents for the first time when I was writing a syscall interface for Lua https://github.com/justincormack/ljsyscall - not tried it on 8m files but you can set the buffer size so it should work.

The native aio interface has advantages over the posix wrapper too, and some of the other interfaces. On the other hand the futex interface is not nice to use, as it requires architecture specific assembly which seems undocumented...


In a similar fashion, something that reads directories as files on Windows: https://gist.github.com/1148070 - it lets me do some things faster (it's very crude, rough code, written for exploration rather than real usage).

Also to get the NTFS streams (metadata).

http://msdn.microsoft.com/en-us/library/aa364226(v=vs.85).as...


Just curious, did you try 'ls -f' to get unsorted output? I think a lot of your time may have been spent sorting the results for output.


I actually didn't try ls -f, but I did try find . and os.listdir and both hung with very similar straces


I hit this with a log trimming script I wrote in perl a few months back. We had one directory with about 4 million files.

Just in case you're curious, I let it run and it took about 25 hours to get all of the directory entries. One with about 3 million files took 12 hours, so I'm not sure how long it would have taken if you'd let ls run of its own accord.

Only 200k files afterwards though. :D

I think I'll poke around with this after I get a coffee; I remember stracing the interpreter and getting annoyed at its 32k-or-bust behavior. But since I didn't have a time limit I didn't much care about runtime. Thanks for the write-up!


Why aren't files stored in subdirectories based on the first character of the filename to break up the volume?


They were never meant to have anywhere near that number of files in the first place. The article mentions that the cleaning up of old files failed - presumably they didn't bother with subdirectories because under normal circumstances the single directory approach was working fine.


My impulse would be to modify libc's readdir() to use a larger buffer size instead of using my own C program. Would that be much stupider for some reason (besides packaging/dependencies)? Do libc clients expect 32k buffers and die if they receive something else?


I ran into this problem before. One of our engineers had a script that kept some temporary data in his home directory, which went crazy at one point and generated millions of files (no clue the reasoning for this). Anyways, this was HELL on our backup solution, which ended up failing several nights in a row on that area. Luckily, since the files were largely useless, this left our options open. I think the kicker was the 'rm' command not working (same issue as listing? This was a Solaris 8 system I think). I believe we ended up moving all of the other data off of that filesystem and newfs'ing it.


Wouldn't the reason be that `ls` tries to sort and stat the results?

Perhaps

    ls -f1  # (one) disables sorting and just prints filenames
would be fast enough?


Having a fan-out directory instead of putting all files at the root would have helped to avoid this problem altogether.

Here's an example. The .git/objects/ directory can grow to have up to 256 sub-directories. If an object hashes to 0a1b2c3d then it gets written to objects/0a/1b2c3d. Lookups are still fast, and navigating can be done without resorting to writing an 'ls' replacement.
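A tiny sketch of the same idea, assuming hex object names (the hash value is just the one from the example above):

  #include <stdio.h>

  int main(void)
  {
      const char *hash = "0a1b2c3d";    /* example object name */
      char path[64];
      /* the first two hex chars pick one of 256 subdirectories */
      snprintf(path, sizeof path, "objects/%.2s/%s", hash, hash + 2);
      puts(path);                       /* objects/0a/1b2c3d */
      return 0;
  }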


I feel compelled to point out that perl can do this rather easily. But I really did like a little more low level on how a directory is listed.

http://blogs.perl.org/users/randal_l_schwartz/2011/03/perl-t...


Should be fixed completely and fast now; SPW was a new blog so the cache config was set up wrong :-)


What about 'echo *' ? That's usually my last resort when dealing with malfunctioning ls.


The shell will attempt to expand '*' into arguments before it runs 'echo'.

For an external command the expanded list runs into the kernel's argument-length limit (ARG_MAX) and fails with "argument list too long"; even with the echo builtin, the shell still has to read the whole directory and expand and sort millions of names, so it won't be quick.


Yep, this is usually the first point of failure for commands acting on somewhat large directories, that drives people to use find and exec.


Sorry our blog is down now - we're trying to get it back up ASAP!


looks like it's fine now, but worst case you can access the cached version:

http://webcache.googleusercontent.com/search?q=cache:KBsyzf3...



No need for the cache anymore. Word to the wise: make sure supercache is actually caching when copying wordpress installs :-)


The cache is the only way it will display properly on my Android phone.


Wouldn't ReiserFS work better for this?


ext3 and ext4 both support hashed (HTree) directory indexes, which I guess is what you're thinking of, but that would only matter for creating/deleting/looking up a particular file, not for listing them all. The fact that the system didn't slow to a crawl while creating the directory suggests that's not the problem.


It murders your disk though...


Too soon?


Excellent case for not giving root.


What does root have to do with this?


The article illustrates any number of ways of doing things incorrectly.

The premise of the article is a bad precedent for stable environments: let's bend the OS so that it plays nicely with what's clearly misuse and misunderstanding of filesystems.

The only way eight million files should ever end up in a single directory level is by accident, and that's not the case in the scenario outlined here.


Typically there are on the order of 2000 files in a directory. It was a mistake.



