> "Nice, but why on earth would I want that?" I have no idea.
I know this is referring mostly to the `cat` portion and not the `splice` portion of the article, but I'll throw in a quick shoutout to `splice` for giving me one of the single biggest build performance wins in my time at Zynga (and possibly across most teams at the company at the time).
We had a ruby script which ran the majority of the build, and as the game grew we found that by far the slowest part was a loop which MD5 hashed each individual asset and used that as its filename on our CDN for per-asset-versioning.
At its worst it was taking nearly an hour and a half; the code was basically as inefficient as you could make it - multiple shell calls for each file rather than any sort of inlining of the hashing process.
I wrote a basic C program using splice and an MD5 library which took the whole process to under 10s. A bit overkill, perhaps, but the naive speedup I tried first still took 1-2 minutes, and I figured 99.99% was worth the extra few hours to put it together, knowing how many builds we ran each day.
Definitely gave me a healthy appreciation for the cost of transferring to user space that has stuck with me.
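For anyone curious what the "inlining" step looks like, here is a minimal sketch of hashing every asset inside one process rather than shelling out per file. It is not the original Ruby script or the C program, and it does not use splice; `md5_name` and `versioned_names` are made-up names, and keeping the file extension on the hashed name is my assumption.

```python
import hashlib
import os


def md5_name(path, chunk_size=1 << 20):
    """Return '<md5 hex digest><original extension>' for one file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in fixed-size blocks so large assets are never fully in memory.
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    _, ext = os.path.splitext(path)
    return digest.hexdigest() + ext


def versioned_names(root):
    """Map every file under root to its content-hashed name (hypothetical helper)."""
    names = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            names[full] = md5_name(full)
    return names
```

This still copies every byte into user space, which is exactly the cost the splice version avoids, but it removes the per-file process spawns.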
Last time I posted a link to that, I received quite a few replies where people find it more natural to use `cat file | …` even when unnecessary — so even though I agree with the intent of the page I feel like it's useless to try and evangelise every case. If cat is the bottleneck though, fair game.
First, if cat is slower than redirection from a file (`<file`), then I'd say something is amiss. But more to the point, I think it's really a bug that tools like gzip, grep, awk etc. work on files at all. We do need a tool to feed files to pipes, and I think cat is a fine candidate for that, even when we only con-cat-enate one file (the identity cat, if you will).
Maybe there are cases where a long string of awk|something|other|sort|uniq is not the problem, but forking an extra process for cat is.
And maybe there's a mismatch between pipes, files and mmap today. Splice seems like a reasonable fix (if we splice all the things, awk, grep etc).
> I think it's really a bug that tools like gzip, grep, awk etc work on files at all.
I'd say that is somewhat of a harsh premise, especially since the in-place editing of files available e.g. in many GNU tools (awk, sed, sort) is really useful and based on exactly that possibility.
I do agree that cat often makes pipes easier to read, though. And yes, obsessing over that one additional process seems to be somewhat silly. Unless, of course, it introduces a real bottleneck and the whole thing is time sensitive.
One good reason (IMO) for doing "cat file |" is it's easier to grab the command from your history and change it to something like "grep foo file |" rather than if you had run "cmd < file".
cat file | this | that | other | out
vs
this < file | that | other | out
In the cat example, it's easy to change the head of the pipeline, by adding things before "this" or deleting "this", which is less so in the non-cat example. (The use case I have in mind is experimental commands that take probably <10s to complete, where editing time is a significant fraction of the time you spend.)
My counterargument would be that with a good shell line editor, like zsh in vi mode, command transformations are as cheap as modular grammar. However, I know there are limits to that argument (Java is only writeable in Java IDEs), so I'll grant you that :-)
Newer kernels also have the copy_file_range syscall (with a compatibility shim in glibc), which is supposed to use the most efficient copying approach available between two file descriptors. As far as I know it only works when both descriptors refer to regular files, so it's more of a complement to splice and sendfile than a generalization of them.
There is a Ruby gem for Linux called io_splice that does zero-copy IO. It hasn't been updated in a while, but it has no dependencies other than a modern Linux, and being old doesn't mean it won't work. "Old" code that works still works; novelty, job-securitization and API churn be damned when they don't add value.
The most interesting thing about all this to me, other than the existence of splice (I really should finish The Linux Programming Interface), is that you need a pipe and two splice operations to get the data between other file types. There must be some dirty implementation detail forcing this, right? Right?!