ugrep, which is written in C++ and is similar in scope to ripgrep, is 0.9 MB on my machine; ripgrep is 4.4 MB and GNU grep is 0.2 MB. They all depend on libc and libpcre2.
ugrep, however, also depends on libstdc++ and a bunch of libraries for compressed file support (libz, ...).
So yeah, a bit bloated, but we are not at Electron level yet.
It's not clear to me that you're accounting for the difference in size that results from static vs dynamic linking. For example, if I build `ugrep` with `./build.sh --enable-static --without-brotli --without-lzma --without-zstd --without-lz4 --without-bzlib`, then I get a `ugrep` binary that is 4.5MB. (I added all of those `--without-*` flags because I couldn't get the build to work otherwise.) If I add `--without-pcre2`, I get a 3.9MB binary.
ripgrep is only a little bigger here when you do an apples-to-apples comparison. To get a static build without PCRE2, run `cargo build --profile release-lto --target x86_64-unknown-linux-musl`. That gets me a 4.6MB `rg` binary. Running `PCRE2_SYS_STATIC=1 cargo build --profile release-lto --target x86_64-unknown-linux-musl --features pcre2` gets me a fully static `rg` binary with PCRE2 that is 5.4MB.
Popping up a level, a fair criticism is that it is difficult to get ripgrep to dynamically link most of its dependencies. You can make it dynamically link libc and PCRE2 (that's just `cargo build --profile release-lto --features pcre2`) and get a 4.1MB binary, but getting it to dynamically link all of its Rust crate dependencies is an unsupported build configuration for ripgrep. But I don't know how much tools like ugrep or GNU grep rely on that level of granular dynamic linking anyway. GNU grep doesn't seem to do so on my system (only dynamically linking with libc and PCRE2).
Additionally, the difference in binary size may be at least partially attributable to a difference in Unicode support:
    $ echo ♥ | rg '\p{Emoji}'
    ♥
    $ echo ♥ | ugrep-7.5.0 '\p{Emoji}'
    ugrep: error: error at position 6
    (?m)\p{Emoji}
          \___invalid character class
These are grep, ripgrep and ugrep installed on my Debian (bookworm). The mentioned sizes are the executables only, because I think that not taking advantage of dynamic libraries if you can is a downside, though there are arguments going the other way.
Anyways, I still took it into account when calling ripgrep "bloated". Using ldd, I counted 3.6 MB of dependencies for ripgrep and 7.1 MB for ugrep, which coincidentally results in a total of about 8 MB for both ugrep and ripgrep. But ugrep's total counts the entire libstdc++ and other libraries, which include code that ugrep doesn't need (such as compression), so I would have expected ugrep to be smaller. GNU grep has 2.5 MB of dependencies btw: 1.9MB for libc and 0.6MB for libpcre2.
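For anyone who wants to reproduce this kind of count, here is a rough sketch of one way to do it. It is an assumption about the method, not the exact procedure used for the numbers above: it runs `ldd` on a binary, pulls out the resolved library paths, and sums the on-disk sizes. The helper names (`parse_ldd_paths`, `total_size`, `dependency_bytes`) are made up for illustration, and the total includes the dynamic loader itself, which may or may not match how the figures above were counted.

```python
import os
import re
import subprocess
import sys


def parse_ldd_paths(ldd_output):
    """Return the absolute library paths that appear in `ldd` output.

    Lines without a resolved path (for example linux-vdso, or
    "not found" entries) are skipped.
    """
    paths = []
    for line in ldd_output.splitlines():
        # Matches both "libc.so.6 => /lib/.../libc.so.6 (0x...)"
        # and "/lib64/ld-linux-x86-64.so.2 (0x...)".
        match = re.search(r"(/\S+)\s+\(0x[0-9a-f]+\)", line)
        if match:
            paths.append(match.group(1))
    return paths


def total_size(paths, size_of=os.path.getsize):
    """Sum file sizes in bytes; os.path.getsize follows symlinks."""
    return sum(size_of(p) for p in paths)


def dependency_bytes(binary):
    """Run `ldd` on a binary and return the summed size of its libraries."""
    out = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    ).stdout
    return total_size(parse_ldd_paths(out))


if __name__ == "__main__":
    # Usage (hypothetical script name): python depsize.py /usr/bin/rg /usr/bin/grep
    for binary in sys.argv[1:]:
        deps = dependency_bytes(binary)
        own = os.path.getsize(binary)
        print(f"{binary}: {own / 1e6:.1f} MB binary + {deps / 1e6:.1f} MB deps")
```

With the figures quoted above, the totals work out to 4.4 + 3.6 = 8.0 MB for ripgrep and 0.9 + 7.1 = 8.0 MB for ugrep.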
And to make things clear, I don't put ugrep in the lightweight category either. C++ (modern C++ in particular) suffers from some of the same problems as Rust: lots of code generation leading to bloat and slow compile times, but (as you pointed out) it tends to play along better with dynamic libraries with a C interface.
I don't know how much space a size-optimized grep with the same features as ripgrep would take. 4 MB looks like a lot, but sometimes bloat comes from unexpected places. For example, some compression algorithms may include predefined dictionaries, coloring may involve terminal databases, and Unicode support may involve databases too.
> The mentioned sizes are the executables only, because I think that not taking advantage of dynamic libraries if you can is a downside, though there are arguments going the other way.
I addressed and accounted for this in my comment. All three of GNU grep, ugrep and ripgrep can dynamically link libc and PCRE2.
ripgrep doesn't bundle any compression code or terminal databases. ripgrep does bundle a significant number of Unicode tables.
On my system grep is 136kb, dynamically linked to libc. Frankly, I don't use all its features and all its regex engines; I can't remember when I last searched for something more complex than fixed text. It supports coloring and references the TERM variable. fgrep is 80kb and apparently supports all features except for coloring and pcre.
This is a bit of an aside, but I wonder if it'd be possible to combine the advantages of dynamic and static linking, meaning no need for dlsym calls or virtual dispatch to call into libraries: you could have fixed addresses for functions while still sharing dynamic libraries between processes.
How it would work is that the process would tell the OS which libraries it expects to be loaded and at what addresses, and the OS would just mmap that piece of shared read-only memory. This mapping step happens for shared libraries anyway, but the symbol resolution is dynamic.
I imagine this would also harden the application somewhat, as there are fewer dynamic dispatches for hackers to exploit.
If you knew the address of each function in advance, it wouldn't be very dynamic. It would be akin to statically linking the entire system, which makes sense in embedded systems, but probably not on a system where you would use ripgrep.
It seems that what you are suggesting is using position-dependent code, which has performance benefits, especially on 32-bit x86. But we seem to be moving away from it, partly because modern hardware supports position-independent code better, and partly for security.
Having fixed addresses means that hackers know exactly where functions are, making their life easier when writing shellcode. As a result, a common security feature we see more and more is ASLR, where code is relocated to random addresses, even for statically linked code, which is the exact opposite of what you are suggesting.
What you are suggesting could protect against things like DLL injection, which is something I consider more of a feature than a bug. For me, the small improvement in security is not worth it, especially since it would be incompatible with ASLR. Some of these benefits could be achieved by hardening the dynamic linker without changing the executables and libraries themselves.
It doesn't have to be dynamic: 99% of the time, most or all of your dependencies are static in the sense that you know them in advance. Yet you'd still have a smaller memory footprint than static linking, and better startup time than both dynamic linking (no symbol resolution) and static linking (less stuff to load).
Not sure what you're suggesting with position dependent code. All libraries would still be position independent, but libfoo would be loaded at 0x5000000 in executable A and 0x6000000 in executable B.
For ASLR, you'd have the exact same performance and security characteristics as statically linked code. Either you could go position dependent and forgo ASLR, or go position independent and randomly shift your base address, in which case the loaded libs would need to account for that (for example if ASLR decides to load your process at 0x10 then libfoo would be loaded at 0x5000010).
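To make the address arithmetic concrete, here is a toy sketch of the idea. It is not a real loader API: the `resolve` function and the per-executable tables are hypothetical, and the only numbers taken from the discussion are the 0x5000000 / 0x6000000 preferred addresses and the 0x10 slide. Each executable declares where it expects a library, and a single ASLR slide is added to every declared address at load time.

```python
# Hypothetical per-executable tables of preferred library load addresses.
EXECUTABLE_A = {"libfoo": 0x5000000}
EXECUTABLE_B = {"libfoo": 0x6000000}


def resolve(preferred, aslr_slide=0):
    """Return the final load address of each library.

    Every address is shifted by the same slide, so offsets between the
    executable and its libraries stay fixed even though the absolute
    addresses are randomized.
    """
    return {name: addr + aslr_slide for name, addr in preferred.items()}


if __name__ == "__main__":
    # No ASLR: libraries land exactly where each executable asked.
    print({k: hex(v) for k, v in resolve(EXECUTABLE_A).items()})
    # With a slide of 0x10, libfoo moves from 0x5000000 to 0x5000010.
    print({k: hex(v) for k, v in resolve(EXECUTABLE_A, aslr_slide=0x10).items()})
```

Note that the same library ends up at different addresses in A and B, which is why the library code itself would still need to be position independent.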
Also I don't see any reason why you couldn't combine this with dynamic linking, and static linking in the same process.