Is there a chance for an Everything alternative? That's the awesome tool that lets you ignore the depth/breadth question completely and find any file literally instantly (the power of instant feedback is unparalleled).
Do non-NTFS systems (even the modern ones) not offer this?
Anything that accesses the MFT directly will be faster than Everything, since Everything goes through the OS. That said, Everything has a superb design and indeed is extremely fast.
Yes, FSearch is the one I use, but it's not as great, per FSearch's dev:
> However, FSearch doesn't automatically detect changes made to the file system and update its index then. This is on the roadmap (it's called inotify support) but it'll never work as smooth as Everything on Windows, because the Linux kernel isn't particularly good at reporting filesystem changes
No, upon start Everything loads its database file into RAM and queries the USN journal for all file system changes since the last time it was running. By applying those changes to its current database it automatically gets a consistent representation of the current state of the file system. This usually is much quicker than a full rebuild.
Only when the USN journal doesn't go back enough in time will Everything rebuild its database by parsing the MFT (which is also much quicker than traversing the file system).
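The startup trick described above (persist the index, then replay the journal instead of rescanning) can be sketched roughly like this. This is a hypothetical illustration of the principle only; the event shapes and names are made up and bear no relation to Everything's actual database or the real USN record format.

```javascript
// In-memory "database": a set of indexed paths (illustrative only).
const index = new Set(["/docs/a.txt", "/docs/b.txt", "/src/main.c"]);

// Change-journal entries recorded while the indexer was not running.
const journal = [
  { op: "create", path: "/docs/c.txt" },
  { op: "delete", path: "/docs/b.txt" },
  { op: "rename", path: "/src/main.c", newPath: "/src/app.c" },
];

// Replaying the journal brings the stale index up to date without
// a full filesystem rescan.
function replay(index, journal) {
  for (const e of journal) {
    if (e.op === "create") index.add(e.path);
    else if (e.op === "delete") index.delete(e.path);
    else if (e.op === "rename") {
      index.delete(e.path);
      index.add(e.newPath);
    }
  }
  return index;
}

replay(index, journal);
console.log([...index].sort());
```

Only when the journal doesn't reach back far enough (as with a size-limited USN journal) would a full rebuild be needed.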
Files get deleted immediately, the USN Journal only keeps track of the events (including delete events). It's basically a log of all filesystem events (up until a certain point in time, due to size restrictions).
“As files, directories, and other NTFS objects are added, deleted, and modified, NTFS enters records into the USN change journal, one for each volume on the computer.”
I think you never actually used mdfind itself, only the bastardized version that is now available as Spotlight. That interface used to be good; now it's so bad and slow I don't know why they let this happen. But try mdfind with filenames, it is incredibly fast. File content searches are very fast as well.
I tried a quick search between mdfind, bfs, and find to locate all README.md files in my home directory. bfs and find both found 2930 files and took 8.5s and 47.5s respectively. mdfind took 0.1s but found only 2400 files.
Most people don’t need to be messing with system files, the ones who do, can go through the extra trouble of checking a few preferences, using command line tools, setting up plist flags, etc.
Oh so you do know what problems people have with Spotlight, you just don't care about those problems to offer anything useful (your generic list isn't that since there are no preferences, cli tools, plist flags that would bring Everything to life on a Mac)
No, I learned from this problem just now and think most users shouldn’t be messing with Library folder, etc, which is why it is hidden, and rightly so, IMO.
If the parent is talking about the Windows search software Everything (https://www.voidtools.com/): it does index the files at startup (I'm unsure how, but it's relatively quick compared to other Windows tools that do the same, so probably some magic involved?), then continuously keeps the index updated (as far as I can tell), and after that the searches are instant.
> "Everything" only indexes file and folder names and generally takes a few seconds to build its database. A fresh install of Windows 10 (about 120,000 files) will take about 1 second to index. 1,000,000 files will take about 1 minute.
Re changes:
> Does "Everything" monitor file system changes?
> Yes, "Everything" does monitor your file systems for all changes. Your search results will update in real-time to reflect any changes. Everything will automatically keep your NTFS indexes up to date with the NTFS USN Journal. Changes will not be missed when Everything is not running as the system maintains the NTFS USN Journal.
Yes, exactly the same principle as SwiftSearch: bypassing the slow filesystem APIs to get instant results (but NTFS only), and checking the filesystem logs directly to keep track of all the updates.
The 'locate' command is faster than Everything but arguably not as convenient (being a command-line tool). It seems to support non-NTFS Linux filesystems (meaning it doesn't support NTFS).
edit: catfish is a GUI front-end available for locate.
locate is much slower than Everything and it also has a much less powerful query language. With Everything you usually get results in way less than 100ms, whereas with locate it's several hundred milliseconds for a database with more than a million files.
But this tests perf for only finding one file. bfs is optimized for finding many files at once, as it defers closing the dirhandle, and uses io_uring for parallel IO callbacks.
`bfs` doesn't actually use io_uring yet, but it is planned. I'm not sure I'd say it's specifically optimized for finding multiple files at once either; I try to make it fast for many different use cases. There are two benchmarks in the blog post and a few more that I run regularly, e.g. https://github.com/tavianator/bfs/pull/107
I am performing a similar file system tree navigation asynchronously in Node.js which is just a shallow API over the C Linux FS APIs.
I can see you are using the opendir and closedir functions. What is the benefit of using the opendir function[1] when readdir[2] can be called on a location directly? Is the benefit that opendir returns a directory handle for use in opening a stream to gather directory entries?
I'm considering changing my use of readdir to use the withFileTypes option so that it returns a list of directory entries (objects containing the artifact name and type) instead of discerning types with a list of conditions like I am doing on lines 382-432.
If you read the manpages you linked to you'll see that in the C API, readdir cannot be called with a path but needs a directory handle returned by opendir. Node.js is probably doing this in the background when you call readdir.
Nice to see other alternatives to find. I personally use fd (https://github.com/sharkdp/fd) a lot, as I find the UX much better. There is one thing that I think could be better, around the difference between "wanting to list all files that follow a certain pattern" and "wanting to find one or a few specific files". Technically, those are the same, but an issue I'll often run into is wanting to search something in dotfiles (for example the Go tools), use the unrestricted mode, and it'll find the few files I'm looking for, alongside hundreds of files coming from some cache/backup directory somewhere. This happens even more with rg, as it'll look through the files contents.
I'm not sure if this is me not using the tool how I should, me not using Linux how I should, me using the wrong tool for this job, something missing from the tool or something else entirely. I wonder if other people have this similar "double usage issue", and I'm interested in ways to avoid it.
Would love to see how bfs compares to fdir[0] for directory traversal. Even though fdir is using Node.js underneath, the comparisons I have done with fd & find are pretty close. Of course, bfs would probably be quite a bit faster...but how much faster exactly?
I tried it on my end. Built bfs from source using `make release`:
$ hyperfine -w1 "NODE_ENV=production node ./fdir.mjs" "./bin/bfs /home/thecodrr/ -false"
Benchmark 1: NODE_ENV=production node ./fdir.mjs
Time (mean ± σ): 965.5 ms ± 53.0 ms [User: 703.0 ms, System: 1220.5 ms]
Range (min … max): 858.4 ms … 1041.3 ms 10 runs
Benchmark 2: ./bin/bfs /home/thecodrr/ -false
Time (mean ± σ): 1.530 s ± 0.127 s [User: 0.341 s, System: 2.282 s]
Range (min … max): 1.401 s … 1.808 s 10 runs
Summary
'NODE_ENV=production node ./fdir.mjs' ran
1.58 ± 0.16 times faster than './bin/bfs /home/thecodrr/ -false'
$ cat fdir.mjs
#!/usr/bin/env node
import { fdir } from "fdir";
console.log(await new fdir().onlyCounts().crawl("/home/thecodrr").withPromise());
For some reason, reducing the UV_THREADPOOL_SIZE to 2 gives the best result on my machine (I have heard the opposite in case of macOS):
$ hyperfine -w1 "UV_THREADPOOL_SIZE=2 NODE_ENV=production node ./fdir.mjs" "NODE_ENV=production node ./fdir.mjs" "./bin/bfs /home/thecodrr/ -false"
Benchmark 1: UV_THREADPOOL_SIZE=2 NODE_ENV=production node ./fdir.mjs
Time (mean ± σ): 355.8 ms ± 16.1 ms [User: 479.4 ms, System: 356.0 ms]
Range (min … max): 328.3 ms … 387.5 ms 10 runs
Benchmark 2: NODE_ENV=production node ./fdir.mjs
Time (mean ± σ): 935.4 ms ± 52.7 ms [User: 695.8 ms, System: 1176.5 ms]
Range (min … max): 850.6 ms … 1031.9 ms 10 runs
Benchmark 3: ./bin/bfs /home/thecodrr/ -false
Time (mean ± σ): 1.534 s ± 0.104 s [User: 0.353 s, System: 2.307 s]
Range (min … max): 1.428 s … 1.773 s 10 runs
Summary
'UV_THREADPOOL_SIZE=2 NODE_ENV=production node ./fdir.mjs' ran
2.63 ± 0.19 times faster than 'NODE_ENV=production node ./fdir.mjs'
4.31 ± 0.35 times faster than './bin/bfs /home/thecodrr/ -false'
Another factor to take into account is that I ran all this on a WSL instance, which may or may not affect the performance. However, since both programs were run on WSL, the comparison should still be fair.
>bfs operates breadth-first, which typically finds the file(s) you're looking for faster.
I find this statement very confusing.
Typically I'm looking for files in subdirs because I'm not aware of the target's path. That's depth first (though of course there is a chance I'm going to find my target quite close to the starting point).
And if I'm looking in the same dir I'm currently in (or I know which dir to look at), then there is no need for a sophisticated search anyway, as typically you don't have too many files in a directory unless it is some kind of tmp or cache directory.
Breadth-first doesn't mean it's not looking into subdirectories. It is, but it looks into each subdirectory before descending into the subdirectories' own subdirectories.
So it enters the first subdir, looks at the files there, then goes back up, enters the second subdirectory, looks at the files there, and so on. After checking each subdir it goes into the first subdir of the first subdir, looks at the files there, then goes to the second subdir of the first subdir.
There is a mistake in the README of the github repo:
In the 1st part of the "Features" section, the "bfs" and "find" columns' contents are swapped. I was a bit confused while reading it, and it took me a while to notice that it was a mistake.
I keep coming across examples where io_uring cannot be used. It seems like functionality that needs to be scrapped and replaced by something more robust. It's the fastest thing that no one seems to be able to use.
Or maybe just iterated on until it gets more functionality? Linux has quite a large API surface, I'm not surprised it takes a while to replicate all of it.