Bfs 3.0: The Fastest Find Yet (tavianator.com)
119 points by rurban on July 21, 2023 | 53 comments



Is there a chance for an Everything alternative? That's the awesome tool that lets you ignore the depth/breadth question completely and find any file literally instantly (the power of instant feedback is unparalleled). Do non-NTFS systems (even the modern ones) not offer this?


Just adding that Everything is ridiculously fast - nothing comes close. https://www.voidtools.com/en-au/support/everything/


Anything that accesses the MFT directly will be faster than Everything, since Everything goes through the OS. That said, Everything has a superb design and indeed is extremely fast.


Everything is using the MFT as well.


How about FSearch in Linux? [http://cboxdoerfer.github.io/fsearch/]


Yes, FSearch is the one I use, but it's not as great, per FSearch's dev:

> However, FSearch doesn't automatically detect changes made to the file system and update its index then. This is on the roadmap (it's called inotify support) but it'll never work as smooth as Everything on Windows, because the Linux kernel isn't particularly good at reporting filesystem changes

https://github.com/cboxdoerfer/fsearch/issues/26

Everything is comprehensive + instant + always up-to-date, that's so awesome a combo it's a pity it's Windows only


Doesn't Everything need to rebuild its database on every reboot? It was a minor annoyance the last time I remember using it.


No, upon start Everything loads its database file into RAM and queries the USN journal for all file system changes since the last time it was running. By applying those changes to its current database it automatically gets a consistent representation of the current state of the file system. This usually is much quicker than a full rebuild.

Only when the USN journal doesn't go back far enough in time will Everything rebuild its database by parsing the MFT (which is also much quicker than traversing the file system).


Okay. My system might have been bugged or misconfigured. It was several years ago.


What about deleted files? Do they eventually get flushed from the cache?


Deleted files get removed from the index immediately; the USN journal only keeps track of the events (including delete events). It's basically a log of all filesystem events (up until a certain point in time, due to size restrictions).


The journal has data about deletions, too.

https://learn.microsoft.com/en-us/windows-server/administrat...:

“As files, directories, and other NTFS objects are added, deleted, and modified, NTFS enters records into the USN change journal, one for each volume on the computer.”


macOS Spotlight has more-or-less instant file search: Command-F in Finder, or mdfind in the terminal.


Except it's neither instant nor comprehensive, and those two features are what make Everything awesome


I think you never actually used the mdfind search, only the bastardized version that is now available as Spotlight. That interface used to be good; now it's so bad and slow that I don't know why they let this happen. But try mdfind with filenames: it is incredibly fast. It is also very fast for file content searches.
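
If you haven't tried it from the terminal, the basic invocations look something like this (standard mdfind flags; the paths and search terms are just examples):

    # filename-only search, scoped to the home directory
    mdfind -onlyin ~ -name "README.md"

    # full-text / metadata search with the same scope
    mdfind -onlyin ~ "io_uring"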


I ran a quick comparison between mdfind, bfs, and find to locate all README.md files in my home directory. bfs and find both found 2930 files and took 8.5s and 47.5s respectively. mdfind took 0.1s but found only 2400 files.


So what's the query to find a file anywhere on my hard drive, including hidden folders/files?


I don’t know what problems people have with Spotlight. It finds everything instantly for me. It’s my launcher, calculator…


For starters, it hides system files by default.


I think that’s a good default.

Most people don't need to be messing with system files; the ones who do can go through the extra trouble of checking a few preferences, using command-line tools, setting plist flags, etc.


Oh, so you do know what problems people have with Spotlight; you just don't care enough about those problems to offer anything useful (your generic list isn't useful, since there are no preferences, CLI tools, or plist flags that would bring Everything to life on a Mac).


No, I learned about this problem just now, and I think most users shouldn't be messing with the Library folder, etc., which is why it is hidden, and rightly so, IMO.


I have not used Everything; is it something similar to SwiftSearch (bypasses the fs and parses the NTFS disk directly, so no indexing and instant results)?


If the parent is talking about the Windows search software Everything (https://www.voidtools.com/), it does index the files at startup (unsure of how; it's relatively quick compared to other Windows tools that do the same, so probably some magic involved?), then continuously indexes (as far as I can tell), but after that the searches are instant.

Edit: Found some answers (https://www.voidtools.com/faq/):

Re indexing:

> How long will it take to index my files?

> "Everything" only indexes file and folder names and generally takes a few seconds to build its database. A fresh install of Windows 10 (about 120,000 files) will take about 1 second to index. 1,000,000 files will take about 1 minute.

Re changes:

> Does "Everything" monitor file system changes?

> Yes, "Everything" does monitor your file systems for all changes. Your search results will update in real-time to reflect any changes. Everything will automatically keep your NTFS indexes up to date with the NTFS USN Journal. Changes will not be missed when Everything is not running as the system maintains the NTFS USN Journal.


Yes, exactly the same principle as SwiftSearch - bypassing the slow fs and getting instant results (but NTFS only), and checking the fs logs directly to keep track of all the updates.


The 'locate' command is faster than Everything but arguably not as convenient (being a command-line tool). It seems to support non-NTFS (meaning it doesn't support NTFS) Linux filesystems.

Edit: Catfish is a GUI front-end available for locate.
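
For anyone who hasn't used it, the workflow is roughly this (the pattern is just an example; updatedb normally runs from a periodic cron/systemd job, so results can lag behind the live file system):

    # (re)build the path database, usually handled by a daily job
    sudo updatedb

    # case-insensitive substring match against the indexed paths
    locate -i readme.md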


locate is much slower than Everything and it also has a much less powerful query language. With Everything you usually get results in way less than 100ms, whereas with locate it's several hundred milliseconds for a database with more than a million files.


There's plocate too, which is a bit faster still according to the authors.


Related: https://mastodon.social/@pervognsen/110739397974530013

But this tests perf for only finding one file. bfs is optimized for finding many files at once, as it defers closing the dirhandle, and uses io_uring for parallel IO callbacks.


`bfs` doesn't actually use io_uring yet, but it is planned. I'm not sure I'd say it's specifically optimized for finding multiple files at once either; I try to make it fast for many different use cases. There are two benchmarks in the blog post and a few more that I run regularly, e.g. https://github.com/tavianator/bfs/pull/107


I am performing a similar file system tree navigation asynchronously in Node.js, which is just a shallow API over the C Linux FS APIs.

I can see you are using the opendir and closedir functions. What is the benefit of using the opendir function[1] when readdir[2] can be called on a location directly? Is the benefit that opendir returns a file descriptor for use in opening a stream to gather directory object descriptors?

[1] https://man7.org/linux/man-pages/man3/opendir.3.html

[2] https://man7.org/linux/man-pages/man3/readdir.3.html

Your project is probably more mature, but if you want an alternate approach to examine, here is how I have been doing it: https://github.com/prettydiff/share-file-systems/blob/master...

I am considering changing my use of readdir to use the withFileTypes option so that it returns a list of directory entries (objects of artifact name and type) instead of a list of conditions to discern types like I am doing on lines 382-432.


If you read the man pages you linked to, you'll see that in the C API, readdir cannot be called with a path but needs a directory handle returned by opendir. Node.js is probably doing this in the background when you call readdir.
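
For what it's worth, Node exposes both shapes; here is a rough sketch as an ES module (the directory path is just a placeholder, error handling omitted):

    #!/usr/bin/env node
    import { opendir, readdir } from "node:fs/promises";

    const dir = "/tmp"; // placeholder

    // Path-based: Node opens and closes the directory handle for you.
    // withFileTypes returns Dirent objects, so no extra stat call is
    // needed just to tell files from directories.
    for (const entry of await readdir(dir, { withFileTypes: true })) {
      console.log(entry.name, entry.isDirectory() ? "dir" : "file");
    }

    // Handle-based: opendir returns a Dir you iterate yourself; the
    // handle is closed automatically when the async iterator finishes.
    const handle = await opendir(dir);
    for await (const entry of handle) {
      console.log(entry.name, entry.isDirectory() ? "dir" : "file");
    }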


Nice to see other alternatives to find. I personally use fd (https://github.com/sharkdp/fd) a lot, as I find the UX much better. There is one thing that I think could be better, around the difference between "wanting to list all files that follow a certain pattern" and "wanting to find one or a few specific files". Technically, those are the same, but an issue I'll often run into is wanting to search for something in dotfiles (for example the Go tools), using the unrestricted mode, and having it find the few files I'm looking for alongside hundreds of files coming from some cache/backup directory somewhere. This happens even more with rg, as it'll look through the files' contents.

I'm not sure if this is me not using the tool how I should, me not using Linux how I should, me using the wrong tool for this job, something missing from the tool or something else entirely. I wonder if other people have this similar "double usage issue", and I'm interested in ways to avoid it.
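
For concreteness, the kind of invocation I mean looks something like this (the pattern and excluded directory names are just examples from my setup):

    # unrestricted search pulls in hidden and ignored files...
    fd --hidden --no-ignore 'go'

    # ...so I end up tacking on excludes by hand every time
    fd --hidden --no-ignore --exclude '.cache' --exclude 'node_modules' 'go'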


Would love to see how bfs compares to fdir[0] for directory traversal. Even though fdir is using Node.js underneath, the comparisons I have done with fd & find are pretty close. Of course, bfs would probably be quite a bit faster...but how much faster exactly?

Disclaimer: I am the developer of fdir.

[0] https://github.com/thecodrr/fdir


From a quick test, about 5x faster:

    tavianator@tachyon $ cat bench.mjs
    #!/usr/bin/env node

    import { fdir } from "fdir";

    console.log(new fdir().withFullPaths().crawl("/home/tavianator/code/android").sync().length);
    tavianator@tachyon $ hyperfine -w1 "node ./bench.mjs" "bfs ~/code/android -false"
    Benchmark 1: node ./bench.mjs
      Time (mean ± σ):      2.073 s ±  0.031 s    [User: 1.372 s, System: 1.260 s]
      Range (min … max):    2.022 s …  2.128 s    10 runs

    Benchmark 2: bfs ~/code/android -false
      Time (mean ± σ):     417.5 ms ±   6.8 ms    [User: 592.3 ms, System: 2487.7 ms]
      Range (min … max):   405.2 ms … 429.7 ms    10 runs

    Summary
      bfs ~/code/android -false ran
        4.97 ± 0.11 times faster than node ./bench.mjs
Is there a way to call it that doesn't require holding all the paths in memory simultaneously?


You are using fdir synchronously. Try this:

    console.log(await new fdir().onlyCounts().crawl("/home/tavianator/code/android").withPromise());


I tried it on my end. Built bfs from source using `make release`:

    $ hyperfine -w1 "NODE_ENV=production node ./fdir.mjs" "./bin/bfs /home/thecodrr/ -false"

    Benchmark 1: NODE_ENV=production node ./fdir.mjs
      Time (mean ± σ):     965.5 ms ±  53.0 ms    [User: 703.0 ms, System: 1220.5 ms]
      Range (min … max):   858.4 ms … 1041.3 ms    10 runs

    Benchmark 2: ./bin/bfs /home/thecodrr/ -false
      Time (mean ± σ):      1.530 s ±  0.127 s    [User: 0.341 s, System: 2.282 s]
      Range (min … max):    1.401 s …  1.808 s    10 runs

    Summary
      'NODE_ENV=production node ./fdir.mjs' ran
        1.58 ± 0.16 times faster than './bin/bfs /home/thecodrr/ -false'

    $ cat fdir.mjs
    #!/usr/bin/env node

    import { fdir } from "fdir";

    console.log(await new fdir().onlyCounts().crawl("/home/thecodrr").withPromise());
For some reason, reducing the UV_THREADPOOL_SIZE to 2 gives the best result on my machine (I have heard the opposite in case of macOS):

    $ hyperfine -w1 "UV_THREADPOOL_SIZE=2 NODE_ENV=production node ./fdir.mjs" "NODE_ENV=production node ./fdir.mjs" "./bin/bfs /home/thecodrr/ -false"
    Benchmark 1: UV_THREADPOOL_SIZE=2 NODE_ENV=production node ./fdir.mjs
      Time (mean ± σ):     355.8 ms ±  16.1 ms    [User: 479.4 ms, System: 356.0 ms]
      Range (min … max):   328.3 ms … 387.5 ms    10 runs

    Benchmark 2: NODE_ENV=production node ./fdir.mjs
      Time (mean ± σ):     935.4 ms ±  52.7 ms    [User: 695.8 ms, System: 1176.5 ms]
      Range (min … max):   850.6 ms … 1031.9 ms    10 runs

    Benchmark 3: ./bin/bfs /home/thecodrr/ -false
      Time (mean ± σ):      1.534 s ±  0.104 s    [User: 0.353 s, System: 2.307 s]
      Range (min … max):    1.428 s …  1.773 s    10 runs

    Summary
      'UV_THREADPOOL_SIZE=2 NODE_ENV=production node ./fdir.mjs' ran
        2.63 ± 0.19 times faster than 'NODE_ENV=production node ./fdir.mjs'
        4.31 ± 0.35 times faster than './bin/bfs /home/thecodrr/ -false'
Another factor to take into account is that I ran all this on a WSL instance, which may or may not affect the performance. However, since both programs are running on WSL, the comparison should still be fair.


At a glance, bfs did 7.6 million files in under 2.xx seconds, and fdir did 1 million in 1 second. So bfs is 2-3 times faster than fdir.


>bfs operates breadth-first, which typically finds the file(s) you're looking for faster.

I find this statement very confusing.

Typically I'm looking for files in subdirs because I'm not aware of the target's path. That's depth-first (though of course there is a chance I'm going to find my target quite close to the starting point).

And if I'm looking in the same dir I'm currently in (or I know which dir to look at), then there is no need for a sophisticated search anyway, as typically you don't have too many files in a directory unless it is some kind of tmp or cache directory.


Breadth-first doesn't mean it's not looking into subdirectories. It is, but it looks into each subdirectory before descending into the subdirectories' subdirectories.

So it enters the first subdir, looks at the files there, then goes back up, enters the second subdirectory, looks at the files there, and so on. After checking each subdir it goes into the first subdir of the first subdir, looks at the files there, then goes to the second subdir of the first subdir...
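
Here's a tiny sketch of that visiting order in Node.js (synchronous and without error handling, just to show the queue; the starting path is a placeholder). Swapping shift() for pop() would turn it into a depth-first walk:

    #!/usr/bin/env node
    import { readdirSync } from "node:fs";
    import { join } from "node:path";

    // Breadth-first: a FIFO queue of directories. Every directory at
    // depth N is listed before any directory at depth N+1 is entered.
    const queue = ["/home/user/code"]; // placeholder starting point
    while (queue.length > 0) {
      const dir = queue.shift();
      for (const entry of readdirSync(dir, { withFileTypes: true })) {
        const path = join(dir, entry.name);
        console.log(path);
        if (entry.isDirectory()) queue.push(path);
      }
    }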


I'd like to see some tests vs fd.


The author gives some fd comparisons in their benchmarks.


You are right.

Update: tavianator is actually one of the fd maintainers as well: https://www.reddit.com/r/linux/comments/153far4/comment/jskx...


This is so much nicer than using find while iterating max depth.
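
By that I mean the manual loop of re-running find with increasing depth limits, roughly:

    # rough sketch of the manual iterative-deepening loop bfs replaces
    for depth in 1 2 3 4 5; do
        find . -mindepth "$depth" -maxdepth "$depth" -name 'pattern'
    done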


There is a mistake in the README of the GitHub repo:

In the 1st part of the "Features" section, the "bfs" and "find" columns' contents are swapped. I was a bit confused while reading it and it took me a while to notice that it was a mistake.


Thanks for the heads up! It's fixed now.


The last time I needed to traverse millions of files, the fastest tool was the 9-year-old https://github.com/MichaelTJones/walk

I wonder how this compares in speed.


I keep coming across examples where io_uring cannot be used. It seems like functionality that needs to be scrapped and replaced by something more robust. It's the fastest thing that no one seems to be able to use.


Or maybe just iterated on until it gets more functionality? Linux has quite a large API surface, I'm not surprised it takes a while to replicate all of it.


I tried it out. If there are a lot of results, it seems to fare worse.

However, as a disclosure, I was running this on my "bedside laptop", which is a Core 2 from like 15 years ago.


As opposed to DFS: https://xkcd.com/761/


The alt text is rather interesting on this one.


I just love that there is a relevant xkcd for pretty much every nerdy topic. Didn't know/remember this one... so thanks!



