Python is installed practically everywhere. If it's a CLI tool written in Python, the chance that it won't work is slim.
Go has a massive number of problems. For example, I tried to run Keybase's standard "go get" build instructions. It failed with 200 import errors. That was the end of my attempt to install Keybase on my Raspberry Pi. Others had said that it worked.
Rust requires a massive amount of hard drive space and takes a long time to build. You also have to build it. That’s antithetical to rapid development.
I can't wait until the pendulum swings back away from static typing and the next generation of programmers discovers the benefits of literally ignoring everybody and doing their own thing. It'll be painful, but at least it'll be effective. And you won't have to compile anything.
I figured Ubuntu 18.04 didn't have Python preinstalled. Besides, installing all the dependencies with pip is another step to do, and it gets annoying when deploying to many servers.
For something that gets distributed, a single static binary is very welcome.
xsv wouldn't exist if it were written in Python. It would definitely be too slow. If you don't care about performance and would rather not wait a couple of minutes to build the tool on your Pi, then go use csvkit, which is written in Python. The availability of software isn't a zero-sum game.
I would definitely love to be proven wrong, because if I am, I am certain I would learn something new. I am pretty comfortable with Python, so I am also pretty comfortable saying that you could not write a tool as fast as xsv in Python without writing some substantial portion of it in C. Using Python's standard library CSV parser is certainly fair game, but writing more C code on top of that, including the loop over the records in a CSV file, feels like a cheat.

The problem is that you've already lost: xsv's CSV parser is faster than Python's parser at the C level[1]. :-) Assuming I haven't made any silly gaffes in xsv itself, I don't see how you're going to make up the difference. Because of that, I am willing to extend a handicap: if you can get within the same order of magnitude as xsv while doing the same amount of work in a robust way, then I think I might even learn something there too. :-)
I am less certain about PyPy, LuaJIT or Node, but I am certain I'd learn something if you proved me wrong.
Note that a problematic part of this challenge is that your program would need to correctly handle CSV in a robust manner. It is very easy to write a very fast CSV parser that doesn't correctly handle all of the corner cases. Python's CSV parser certainly passes that test, but I don't know if Node's or Lua's CSV parser does because I've never used them.
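To make those corner cases concrete, here is a minimal sketch using the Rust csv crate (the library xsv is built on); the input string is made up for illustration. A quoted field can contain the delimiter, escaped quotes ("" for a literal quote), and even literal newlines, and it is still a single logical field:

use csv::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One header row, then a single record whose second field contains
    // a comma, an escaped quote, and a literal newline.
    let data = "name,quote\nfoo,\"said \"\"hi\"\", then\nleft\"\n";
    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(data.as_bytes());
    for result in rdr.records() {
        let record = result?;
        // Still one record with two fields, despite the embedded newline.
        assert_eq!(record.len(), 2);
        assert_eq!(&record[1], "said \"hi\", then\nleft");
    }
    Ok(())
}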
Not sure whether data.table is in the same domain as xsv, and certainly a lot of it is written in C. But for comparison's sake:
fread("cities.csv") 1.30 s
And then the rest of the computations will be faster, of course:
count -- 0.005 ms
freq -- 0.2 s
sort -- 0.1 s
It's so useful that I often just use CSVs between 10 and 100GB as a database, since the difference in performance between fread and a 'proper' database isn't enough to justify the latter.
Yes. I've used it lightly in the past and have never been impressed by its speed or memory usage. But there could have been user errors. I am not an R expert. :)
In any case, I think R implements all or most of its data table transformations in C, so I don't think it applies well here.
Without indexing, LuaJIT is twice as fast as XSV for 2.6M rows:
$ rm data/*.csv.idx
$ time lumen count.l data/nfl_all_plays_huge.csv
2592975
real 0m1.583s
user 0m1.237s
sys 0m0.311s
$ time xsv count data/nfl_all_plays_huge.csv
2592975
real 0m3.184s
user 0m2.425s
sys 0m0.553s
With indexing, LuaJIT is within an order of magnitude:
$ xsv index data/nfl_all_plays_huge.csv
$ time xsv count data/nfl_all_plays_huge.csv
2592975
real 0m0.019s
user 0m0.009s
sys 0m0.007s
$ time lumen count.l data/nfl_all_plays_huge.csv
2592975
real 0m0.184s
user 0m0.083s
sys 0m0.096s
I'll be implementing test cases to ensure it's catching malformed data.
Nice! Is your `count` command doing CSV parsing? I don't understand how your naive parser takes 400ms to parse nfl_all_plays_small.csv, but counting records is somehow faster. The fact that your `count` program needs to deal explicitly with `\r` makes me very suspicious. :)
Also, counting records with an index isn't particularly interesting, since it's just reading 8 bytes from the index file. I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
In your comment above, you compared your CSV parser to `xsv stats --everything`, but `xsv stats` does a lot more than just CSV parsing. If you want to test how long xsv takes to parse CSV, then `xsv count` without an index is the way to do it. `xsv count` only takes 19ms on my machine to parse nfl_all_plays_small.csv, which is basically within process overhead time.
Also, when you're ready, I would like to be able to run and inspect the code myself. :-)
I warned you above: the key challenge you're going to face is creating a robust CSV parser, and using that to implement every command, including `count`. If that isn't a requirement, then basically all comparisons are unfortunately completely moot.
It's just counting lines and skipping any whose contents are "\r" or blank. I believe this is correct behavior because:
foo,bar,"quux
zap",bang
`xsv count` returns 0 for this.
Is there any situation where CSV fields can contain literal newline characters? (ASCII value 10.)
Will post code fairly soon. There aren't any tricks. I just implemented slice as well.
> Also, when you're ready, I would like to be able to run and inspect the code myself. :-)
Certainly!
EDIT: Ah, user error. CSVs can indeed contain literal newlines, and XSV handles that. I'll switch it to parse doublequoted strings and add some tests.
One simplification: if a line contains N commas, where N matches the number of columns minus one, then there's no need to parse it for count, slice, etc.
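(A rough sketch, written in Rust here just for illustration rather than Lumen, of the state a quote-aware counter has to track. Quoted fields can contain commas and newlines, so the counter has to know whether it is currently inside quotes before treating a newline as a record boundary.)

fn count_records(data: &[u8]) -> u64 {
    let mut count = 0u64;
    let mut in_quotes = false;
    let mut saw_data = false; // any bytes seen in the current record?
    let mut i = 0;
    while i < data.len() {
        match data[i] {
            b'"' if in_quotes => {
                if data.get(i + 1) == Some(&b'"') {
                    i += 1; // "" is an escaped quote: stay inside the field
                } else {
                    in_quotes = false; // closing quote
                }
            }
            b'"' => {
                in_quotes = true;
                saw_data = true;
            }
            b'\n' if !in_quotes => {
                // A newline outside quotes ends a record; blank lines and
                // lone "\r" lines never set saw_data, so they are skipped.
                if saw_data {
                    count += 1;
                    saw_data = false;
                }
            }
            b'\r' if !in_quotes => {} // part of \r\n; ignore
            _ => saw_data = true,
        }
        i += 1;
    }
    if saw_data {
        count += 1; // final record without a trailing newline
    }
    count
}

fn main() {
    // "quux\nzap" is one field, so this input is a single record.
    assert_eq!(count_records(b"foo,bar,\"quux\nzap\",bang\n"), 1);
}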
> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.
Indeed. xsv returned 0 because it interprets the first record as a header row by default.
Counting is interesting, because you don't have to implement unescaping to do it, even though any robust CSV parser does unescaping as a matter of course. So if you write two different versions of a CSV parser, one for normal reading and one just for counting, then the one for counting can go faster and you'll avoid the need to amortize allocation. It's a nice trick though! However, I was using `xsv count` as a proxy for CSV parsing. So if you're not going to do actual CSV parsing, then your solution is much less interesting. :-)
> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
> It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.
Holy moly. Really? Do you know why it takes so long? That's a showstopper...
Agreed, though I'm mainly seeing how quickly I can reimplement everything xsv has to offer without sacrificing performance. I don't consider the challenge finished until, as you say, it handles all of the corner cases.
EDIT: I'm actually not sure what's going on with the startup time, since it's usually fast. I have quite a few windows open, which might be affecting results. (xsv count is suddenly taking ~70ms for simple data, so I think a reboot is in order.)
To clarify, I was mainly just excited to share some WIP. There's still a lot of work left to do to cross the finish line.
> This would be a fun weekend project to reimplement XSV in Python and prove this wrong.
I don't know as much about the internals or performance characteristics of XSV (though it certainly touts performance as a feature), but if you can reimplement ripgrep in Python and get anywhere close to the same performance, I'd certainly be interested to see that.
> Now, to make this a fair comparison, are you excluding pypy? Or is that allowed for our game?
PyPy is not available everywhere, unlike the CPython runtime or the ability to run compiled binaries.
It's perfectly fine to say "this is an impressive project. I don't understand why it couldn't be done in Python. I would love for someone to explain that to me. Thanks!"
It is not necessary to dismissively declare that some substantial piece of work could be implemented better in just a weekend.
It's 4k lines of Rust. Shedding the static typing nonsense will get rid of at least 25% of that. Writing it in Lumen will buy an extra 2x in productivity. And there's nothing to discover; the algorithms are right there, and my claim is that they will run nearly as fast in a non-statically-typed language. I don't think the weekend claim is that outrageous.
You don't like putting on a show for a crowd? It's one of the funnest things.
First of all, take a look at Cargo.toml for the list of dependencies; repeat recursively. Projects like xsv and ripgrep are modular, with many components that others can and do reuse.
Second, lines of code hardly gives any but the roughest idea of how hard something would be to write, and write well.
Third, interesting that you're not counting the test cases; after all, if you're not doing any static typing, surely you'll want more tests...
Fourth, hey, as long as you're getting rid of the "static typing nonsense" you might as well drop the error handling and comments while you're at it. More seriously, though, type signatures and similar are hardly a significant part of the lines of code of the average Rust program.
But in any case, you've already seen the replies elsewhere in the thread inviting you to try if you feel confident you can do so.
> You don't like putting on a show for a crowd? It's one of the funnest things.
You're certainly showing the crowd something about yourself. Whether it's what you're intending is another question.
If you want to write a replacement or alternative for a tool, especially as an exercise in learning something, by all means do; it's a fun pastime. You don't need to dismiss someone else's work or choice of language in the process.
If it sounded like I was dismissing someone else's work, you're reading too far into it. Who would be silly enough to dismiss a tool from the author of ripgrep?
Claiming you can implement a version in a weekend and match the same performance is quite dismissive.
Superficially counting the lines of code in the top-level project (ignoring everything else) and implying that it's "just" 4000 lines of code (as though that's a full description of the effort that went into it) is also quite dismissive.
It wasn't dismissive, it was foolish. The CSV parser is actually a separate project, and is around 15k lines of code. That certainly won't be done in a weekend.
Look, it's stellar, A+ software. All I was saying is that you can write it in a dynamic language without sacrificing performance. The goal wasn't to match the full functionality of XSV; that'd be absurd.
In some cases, LuaJIT is even faster than C. It's not an outlandish claim to say that it could match.
The Python claim was in the spirit of good fun, but that probably didn't come across.
Either way, software is meant to be fun. It's a positive statement to say that a dynamic language can match the performance of a statically typed one. Isn't that a cool idea, worth exploring? Why is it true?
The reason I'm confident in that claim is that LuaJIT has withstood the test of time and has repeatedly proven itself. This reduces to the old argument of static types vs. lack of types. But a lack of typing was exactly why Lisp was so powerful, back in the day, and why a small number of programmers could wipe the floor with large teams.
Either way, I've managed to stir the hive, so I'll leave this for whatever it is. To be clear: XSV is awesome software, and I never said otherwise.
The LuaJIT idea is interesting, I've certainly been impressed by it in the past, and can agree it is to some extent something that dispels myths like "statically typed languages are always faster than unityped languages." But if you instead interpret that as a first approximation, then it's fairly accurate IMO.
In the interest of cutting to the chase, I'll try to explain some of the high level ideas of why the CSV parser is fast, and typically faster than any other CSV parser I've come across.
Firstly, it is implemented by a hand-rolled DFA that is built from an NFA. The NFA is typically what most robust CSV parsers use, and it is quite fast, but it suffers from the overhead of moving through epsilon transitions and handling case analysis that is part of the configuration of the parser (i.e., delimiter, quote, escaping rules, etc.). It seems to me like this concept could be carried over to LuaJIT.
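To give a rough flavor of the idea (a toy sketch, not the csv crate's actual state machine): the quoting rules are compiled into a transition table once, so the per-byte work in the hot loop is a table lookup rather than a chain of branches on the parser configuration.

const IN_FIELD: u8 = 0;        // in unquoted data / at the start of a record
const IN_QUOTED: u8 = 1;       // inside a quoted field
const QUOTE_IN_QUOTED: u8 = 2; // just saw '"' inside a quoted field
const RECORD_END: u8 = 3;      // just consumed a record-terminating '\n'

fn build_table() -> [[u8; 256]; 4] {
    let mut t = [[0u8; 256]; 4];
    for b in 0..=255u8 {
        // Next state when we are *not* inside a quoted field.
        let unquoted = match b {
            b'"' => IN_QUOTED,
            b'\n' => RECORD_END,
            _ => IN_FIELD,
        };
        t[IN_FIELD as usize][b as usize] = unquoted;
        t[RECORD_END as usize][b as usize] = unquoted;
        // Inside quotes, only '"' changes state; '\n' and ',' are data.
        t[IN_QUOTED as usize][b as usize] =
            if b == b'"' { QUOTE_IN_QUOTED } else { IN_QUOTED };
        // After '"' inside quotes: another '"' is an escaped quote,
        // anything else means the quoted field has just closed.
        t[QUOTE_IN_QUOTED as usize][b as usize] =
            if b == b'"' { IN_QUOTED } else { unquoted };
    }
    t
}

fn count_records(data: &[u8], table: &[[u8; 256]; 4]) -> u64 {
    let mut state = IN_FIELD;
    let mut count = 0u64;
    for &b in data {
        // The hot loop: one table lookup per byte, no per-byte branching
        // on delimiters, quotes, or escaping rules.
        state = table[state as usize][b as usize];
        count += (state == RECORD_END) as u64;
    }
    count // a final record with no trailing '\n' is not counted here
}

fn main() {
    let table = build_table();
    assert_eq!(count_records(b"a,b\n\"x\ny\",z\n", &table), 2);
}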
Secondly, the per-byte overhead of the DFA is very low, and even special cases[1] some transitions to get the overhead even lower. If you were doing this in pure Python or Lua or really any unityped language, I would be very skeptical that you could achieve this because of all the implicit boxing that tends to go on in those languages. Now, if you toss a JIT in the mix, I kind of throw my hands up. Maybe it will be good enough to cut through the boxing that would otherwise take place. From what I've heard about Mike Pall, it wouldn't surprise me! If the JIT fails at this, I'm not sure how I'd begin debugging it. I kind of imagine it's like trying to convince a compiler to optimize a segment of code in a certain way, but only harder.
Thirdly, a critical aspect of keeping things fast that bubbles all the way up into the xsv application code itself is the amortization of allocation. Namely, when xsv iterates over a CSV file, it reuses the same memory allocation for each record[2]. If you've written performance sensitive code before, then this is amateur hour, but I personally have always struggled to get these kinds of optimizations in unityped languages because allocation is typically not a thing they optimize for. Can a JIT cut through this? I don't know. I'm out of my depth. But I can tell you one thing for sure: in languages like Rust, C or C++, amortizing allocation is a very common thing to do. It is straight-forward and never relies on the optimizer doing it for you. There are some different angles to take here though. For example, unityped languages tend to be garbage collected, and in that environment, allocations can be faster which might make amortization less effective. But I'm really waving my hands here. I'm just vaguely drawing on experience.
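For what it's worth, here is roughly what that amortization looks like at the application level with the csv crate (a sketch, not xsv's actual code): a single ByteRecord is allocated up front and reused for every record in the file.

use csv::{ByteRecord, ReaderBuilder};

fn total_fields(path: &str) -> Result<u64, Box<dyn std::error::Error>> {
    let mut rdr = ReaderBuilder::new().from_path(path)?;
    // Allocated once; read_byte_record reuses its buffers on every call,
    // growing them only when a record is bigger than any seen so far.
    let mut record = ByteRecord::new();
    let mut fields = 0u64;
    while rdr.read_byte_record(&mut record)? {
        fields += record.len() as u64;
    }
    Ok(fields)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = std::env::args().nth(1).expect("usage: total_fields <file.csv>");
    println!("{}", total_fields(&path)?);
    Ok(())
}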
Anyway, I think it's kind of counterproductive to try to play the "knows better than the hivemind" role here. There are really good, solid reasons why statically typed languages tend to out-perform unityped languages, and just because there is a counterexample in some cases doesn't make those reasons any less important. I think I could also construct an argument around how statically typed languages make it easier to reason about performance, but I don't quite know how to phrase it. In particular, at the end of the day, both cases wind up relying on some magic black box (a compiler's optimizer or a JIT), but I'm finding it difficult to articulate why that isn't the full story.
My productivity doesn't come from writing software. It comes from reading its code and maintaining it. You can pry my types out of my cold dead hands. :-)
How long it takes you to do this largely depends on how much you can leverage your language's ecosystem. If you don't have a robust and fast CSV parser already written for you, then you'd need to sink many weekends into that alone.
You should definitely do this. Personally, I strongly suspect you wouldn't prove him wrong if you did attempt this in any of the languages you mentioned. But if you're right and we're wrong, I'd love it! It would be great and eye-opening to dig into your implementation(s) to see how you pulled it off.
$ time xsv stats --everything /tmp/nfl_all_plays.csv > stats.csv
real 5.723
user 14.390
sys 1.914
$ time csvstat /tmp/nfl_all_plays.csv
^C after 2.5 minutes
$ time csvstat /tmp/nfl_all_plays_small.csv > /tmp/stats.csv
real 1:01.85
user 1:01.70
sys 0.103
$ time xsv stats --everything /tmp/nfl_all_plays_small.csv > /tmp/stats.csv
real 0.308
user 0.576
sys 0.071
Now technically, csvstat is doing more work in that it seems to be computing a frequency table as well. But we can just do the same for xsv and add the time, with the knowledge that it would be faster if it were coupled into `xsv stats`:
$ time xsv frequency /tmp/nfl_all_plays_small.csv > /tmp/frequency.csv
real 0.251
user 0.187
sys 0.063
Now let's see how xsv fares on a much larger sample, which is just nfl_all_plays.csv repeated 10 times and is ~800MB:
$ ls -lh /tmp/nfl_all_plays_huge.csv
-rw-r--r-- 1 andrew users 806M Sep 8 20:34 /tmp/nfl_all_plays_huge.csv
$ time xsv index /tmp/nfl_all_plays_huge.csv
real 2.041
user 1.876
sys 0.163
$ time xsv stats --everything /tmp/nfl_all_plays_huge.csv > /tmp/stats.csv
real 28.336
user 4:36.45
sys 24.212
$ time xsv frequency /tmp/nfl_all_plays_huge.csv > /tmp/frequency.csv
real 6.077
user 1:16.51
sys 1.873
That indexing step lets xsv do its processing in parallel. Good luck doing that in Python without blowing your memory budget. :-) csvkit would either take hours to handle that much data or would more likely run out of memory.
With that said, I was able to write a Python program that just counted records within an order of magnitude of `xsv count`, but it was still a few times slower.
Mm, using someone else's parser would defeat the spirit of the challenge. I think xsv is worthwhile for being a robust parser, not necessarily for its performance. And my claim is that you'd be able to write it faster, without trading any security guarantees, in Lua, without sacrificing much performance.
There's that pesky word, "much" performance. And that's really the interesting part here. How much would you trade away by shedding Rust? My hypothesis is less than 50% additional overhead.
Thanks for providing a dataset. I think LuaJIT will match these stats, and it's a good baseline to start with.
But yes, the CSV parser is around 15k lines. That'd be the trickiest part.
xsv doesn't have its own CSV parser; it uses a Rust library to parse CSV[1], which is almost 4 times the size of xsv itself. I just happen to have written it.
In any case, it would be fun to see an implementation in LuaJIT, especially if you did the CSV parser as well. Although, I think that takes you well outside a weekend project unless you cheat. :-) I don't know the performance characteristics of LuaJIT, but I assume they are better than Python's. I don't know how much better. In any case, this challenge was much more interesting to me when you were talking about Python.
Also, I don't really care about a claim that says you could write it faster. That's borderline meaningless in my work unless you're talking about an order of magnitude difference, and I sincerely doubt that.
> Also, I don't really care about a claim that says you could write it faster. That's borderline meaningless in my work unless you're talking about an order of magnitude difference, and I sincerely doubt that.
Ah, fair point. If there is no benefit to writing software faster, then yes, the discussion is moot.
Apologies if it sounded like I was being a dick. I meant to come across as a “player 2 has entered the game,” but it probably just sounded annoying.
I’ve been reimplementing some C projects in LuaJIT (more specifically, a dialect of Lisp that compiles to LuaJIT), and it certainly feels an order of magnitude less overhead to get work done. Perhaps it would be interesting to bind the CSV crate to LuaJIT, and then do a direct translation of XSV. The original discussion was about CLI tools, which is one area that scripting languages excel in, and isn’t necessarily enhanced by the benefits of static typing.
> I’ve been reimplementing some C projects in LuaJIT (more specifically, a dialect of Lisp that compiles to LuaJIT), and it certainly feels an order of magnitude less overhead to get work done.
It is interesting how perspectives change things, because I wouldn't be altogether surprised by that actually. I've used both C and Lua quite a bit myself (although not too much recently), and I can believe that. But if you substituted C for Rust, then I would look at it with great skepticism. That is, I don't see it as types-vs-not-types, but memory-safety-vs-not-memory-safety in addition to having a ecosystem that is very easy to draw on in a granular way.
And I didn't think you were too annoying. :-) Building an ffi layer over the csv crate would be cool. It would probably be easiest to do it over csv-core, and then build up the convenience abstractions on the C side (or perhaps even at the Lua level) and therefore avoid the csv crate entirely. csv-core will be simpler because it is designed to work in `no_std` environments, which means it never uses platform specific routines and never allocates. Still though, you'll probably need to find a way to amortize allocation on the Lua side.
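If I'm remembering the csv-core API correctly, the Rust side of that wrapper would look roughly like the sketch below. The key property is that the caller supplies (and can reuse) the output and field-ends buffers, since csv-core itself never allocates; an FFI layer would just expose those buffers to Lua.

use csv_core::{ReadRecordResult, Reader};

fn main() {
    let data = b"foo,bar,\"quux\nzap\",bang\n";
    let mut rdr = Reader::new();
    // Caller-owned, reusable buffers: unescaped field bytes for one
    // record, plus the end offset of each field within `output`.
    let mut output = [0u8; 4096];
    let mut ends = [0usize; 64];
    let mut input = &data[..];
    loop {
        let (result, nin, _nout, nend) = rdr.read_record(input, &mut output, &mut ends);
        input = &input[nin..];
        match result {
            ReadRecordResult::Record => {
                let mut start = 0;
                for &end in &ends[..nend] {
                    println!("{}", std::str::from_utf8(&output[start..end]).unwrap());
                    start = end;
                }
            }
            ReadRecordResult::End => break,
            // A real wrapper would handle InputEmpty (feed more input) and
            // OutputFull / OutputEndsFull (grow the buffers) here.
            _ => {}
        }
    }
}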
> and isn’t necessarily enhanced by the benefits of static typing
Yeah, I mean I obviously very very very very strongly disagree. We are clearly polar opposites here. If I never write anything substantial in a unityped language again, it will be too soon.
That has to be one of the most arbitrary benefits in programming. Everyone is different; how do you possibly use it as a comparable metric?
Even so, typing isn't an evil. I write Python for my day job and use the typing libraries extensively, because they end up saving me more time in the long run.