I would definitely love to be proven wrong, because if I am, I am certain I would learn something new. I am pretty comfortable with Python, so I am pretty comfortable saying that you could not write a tool as fast as xsv in Python without writing some substantial portion of it in C. Using Python's standard library CSV parser is certainly fair game, but writing more C code on top of that, including the loop over records in a CSV file, feels like a cheat.

The problem here is that you've already lost: xsv's CSV parser is faster than Python's parser at the C level[1]. :-) Assuming I haven't made any silly gaffes in xsv itself, I don't see how you're going to make up the difference. Because of that, I am willing to extend a handicap: if you can get within the same order of magnitude as xsv while doing the same amount of work in a robust way, then I think I might even learn something there too. :-)
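For concreteness, the kind of Python program I have in mind is roughly the sketch below. It leans entirely on the standard library parser; nothing here is anyone's actual code, and the file name is just a placeholder:

import csv
import sys

def count_records(path):
    # Count CSV records using only Python's stdlib parser, which handles
    # quoting, embedded commas, and embedded newlines correctly.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, matching xsv's default
        return sum(1 for _ in reader)

if __name__ == "__main__":
    print(count_records(sys.argv[1]))

The record loop stays in Python, and that loop is exactly where I'd expect the time to go.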
I am less certain about PyPy, LuaJIT or Node, but I am certain I'd learn something if you proved me wrong.
Note that a problematic part of this challenge is that your program would need to correctly handle CSV in a robust manner. It is very easy to write a very fast CSV parser that doesn't correctly handle all of the corner cases. Python's CSV parser certainly passes that test, but I don't know if Node's or Lua's CSV parser does because I've never used them.
Not sure whether data.table is in the same domain as xsv, and certainly a lot of it is written in C. But for comparison's sake:
fread("cities.csv") 1.30 s
And then the rest of the computations will be faster of course:
count -- 0.005 ms
freq -- 0.2 s
sort -- 0.1 s
It's so useful that I often just use CSVs between 10 and 100 GB as a database, since the difference in performance between fread and a 'proper' database isn't enough to justify the latter.
Yes. I've used it lightly in the past and have never been impressed by its speed or memory usage. But there could have been user errors. I am not an R expert. :)
In any case, I think R implements all or most of its data table transformations in C, so I don't think it applies well here.
Without indexing, LuaJIT is twice as fast as xsv for 2.6M rows:
$ rm data/*.csv.idx
$ time lumen count.l data/nfl_all_plays_huge.csv
2592975
real 0m1.583s
user 0m1.237s
sys 0m0.311s
$ time xsv count data/nfl_all_plays_huge.csv
2592975
real 0m3.184s
user 0m2.425s
sys 0m0.553s
With indexing, LuaJIT is within an order of magnitude:
$ xsv index data/nfl_all_plays_huge.csv
$ time xsv count data/nfl_all_plays_huge.csv
2592975
real 0m0.019s
user 0m0.009s
sys 0m0.007s
$ time lumen count.l data/nfl_all_plays_huge.csv
2592975
real 0m0.184s
user 0m0.083s
sys 0m0.096s
I'll be implementing test cases to ensure it's catching malformed data.
Nice! Is your `count` command doing CSV parsing? I don't understand how your naive parser takes 400ms to parse nfl_all_plays_small.csv, yet counting records is somehow faster. The fact that your `count` program needs to deal explicitly with `\r` makes me very suspicious. :)
Also, counting records with an index isn't particularly interesting, since it's just reading 8 bytes from the index file. I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
In your comment above, you compared your CSV parser to `xsv stats --everything`, but `xsv stats` does a lot more than just CSV parsing. If you want to test how fast xsv takes to parse CSV, then `xsv count` without an index is the way to do it. `xsv` only takes 19ms on my machine to parse nfl_all_plays_small.csv, which is basically within process overhead time.
Also, when you're ready, I would like to be able to run and inspect the code myself. :-)
I warned you above: the key challenge you're going to face is creating a robust CSV parser, and using that to implement every command, including `count`. If that isn't a requirement, then basically all comparisons are unfortunately completely moot.
It's just counting lines and skipping any whose contents are "\r" or blank. I believe this is correct behavior because:
foo,bar,"quux
zap",bang
`xsv count` returns 0 for this.
Is there any situation where CSV fields can contain literal newline characters (ASCII value 10)?
Will post code fairly soon. There aren't any tricks. I just implemented slice as well.
> Also, when you're ready, I would like to be able to run and inspect the code myself. :-)
Certainly!
EDIT: Ah, user error. CSV fields can indeed contain literal newlines, and xsv handles that. I'll switch it to parse double-quoted strings and add some tests.
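A quick sanity check makes the failure mode obvious (sketched in Python here just for brevity; the data is the same record from above):

import csv
import io

data = 'foo,bar,"quux\nzap",bang\n'

# Naive line counting sees two lines...
print(sum(1 for _ in io.StringIO(data)))               # 2
# ...but a quote-aware parser sees a single record.
print(sum(1 for _ in csv.reader(io.StringIO(data))))   # 1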
One simplification: if a line contains N commas, where N matches the number of columns minus one, then there's no need to parse it for count, slice, etc.
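Roughly, the fast path I have in mind looks like the sketch below. Note the caveat (my addition): it is only safe when the line contains no quote characters, since quoted fields can hide commas and newlines, so anything with a quote has to fall back to the real parser.

def is_simple_record(line, num_columns):
    # Fast path: with no quote characters on a line of well-formed CSV,
    # there can be no embedded commas or newlines, so a comma count of
    # num_columns - 1 means this line is one complete record and needs
    # no real parsing. Anything else goes through the full CSV parser.
    return '"' not in line and line.count(",") == num_columns - 1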
> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.
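For what it's worth, in Python the same read would be a couple of lines with struct and no FFI at all (assuming, per the discussion above, that the record count is a big-endian u64 at the end of the .idx file; that layout is my reading of these comments, not something I've verified):

import struct

def read_record_count(idx_path):
    # Assumption: the record count is stored as a big-endian u64
    # in the last 8 bytes of the index file.
    with open(idx_path, "rb") as f:
        f.seek(-8, 2)  # seek to the last 8 bytes
        (count,) = struct.unpack(">Q", f.read(8))
    return count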
Indeed. xsv returned 0 because it interprets the first record as a header row by default.
Counting is interesting, because you don't have to implement unescaping to do it, even though any robust CSV parser will. So if you write two different versions of a CSV parser, one for normal reading and one just for counting, then the one for counting can go faster and you avoid the need to amortize allocation. It's a nice trick though! However, I was using `xsv count` as a proxy for CSV parsing. So if you're just going to skip actual CSV parsing, then your solution is much less interesting. :-)
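To illustrate the idea: a counting-only pass just has to track quoting state and record boundaries, without ever materializing a field. A rough Python sketch of the concept (not what xsv actually does):

def count_records(f, chunk_size=1 << 16):
    # Quote-aware record counter: it tracks whether we are inside a
    # quoted field, but never unescapes or allocates per-field strings.
    # (Header handling is omitted for brevity.)
    count = 0
    in_quotes = False
    saw_data = False
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            if in_quotes:
                if ch == '"':
                    in_quotes = False
            elif ch == '"':
                in_quotes = True
                saw_data = True
            elif ch == "\n":
                if saw_data:
                    count += 1
                saw_data = False
            elif ch != "\r":
                saw_data = True
    if saw_data:
        count += 1  # final record with no trailing newline
    return count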
> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
> It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.
Holy moly. Really? Do you know why it takes so long? That's a showstopper...
Agreed, though I'm mainly seeing how quickly I can reimplement everything xsv has to offer without sacrificing performance. I don't consider the challenge finished until, as you say, it handles all of the corner cases.
EDIT: I'm actually not sure what's going on with the startup time, since it's usually fast. I have quite a few windows open, which might be affecting results. (xsv count is suddenly taking ~70ms for simple data, so I think a reboot is in order.)
To clarify, I was mainly just excited to share some WIP. There's still a lot of work left to do to cross the finish line.
[1] - https://github.com/BurntSushi/rust-csv/blob/master/csv-core/...