
I would definitely love to be proven wrong, because if I am, I am certain I would learn something new. I am pretty comfortable with Python, so I am pretty comfortable saying that you could not write a tool as fast as xsv in Python without writing some substantial portion of it in C. Using Python's standard library CSV parser is certainly fair game, but writing more C code on top of that to handle the loop over records in a CSV file feels like a cheat. The problem here is that you've already lost: xsv's CSV parser is faster than Python's parser at the C level[1]. :-) Assuming I haven't made any silly gaffes in xsv itself, I don't see how you're going to make up the difference. Because of that, I am willing to extend a handicap: if you can get within the same order of magnitude as xsv while doing the same amount of work in a robust way, then I think I might even learn something there too. :-)
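
To make this concrete, the baseline I have in mind is something like the sketch below (the `count_records` name and the CLI argument are just illustrative):

  import csv
  import sys

  def count_records(path):
      # The stdlib parser handles quoting, doubled quotes, and
      # embedded newlines; keeping the per-record loop in Python
      # is what makes the comparison fair.
      with open(path, newline="") as f:
          reader = csv.reader(f)
          next(reader, None)  # skip the header row
          return sum(1 for _ in reader)

  if __name__ == "__main__":
      print(count_records(sys.argv[1]))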

I am less certain about PyPy, LuaJIT or Node, but I am certain I'd learn something if you proved me wrong.

Note that a problematic part of this challenge is that your program would need to correctly handle CSV in a robust manner. It is very easy to write a very fast CSV parser that doesn't correctly handle all of the corner cases. Python's CSV parser certainly passes that test, but I don't know if Node's or Lua's CSV parser does because I've never used them.
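
For the record, here's the sort of corner case I mean, as a quick illustrative check with Python's stdlib parser (quoted separators, doubled quotes as escapes, and embedded newlines are the usual offenders):

  import csv
  import io

  # One record per corner case: a comma inside quotes, a doubled
  # quote as an escape, and a literal newline inside a quoted field.
  tricky = 'a,"b,c"\nd,"e""f"\ng,"h\ni"\n'
  for row in csv.reader(io.StringIO(tricky)):
      print(row)
  # ['a', 'b,c']
  # ['d', 'e"f']
  # ['g', 'h\ni']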

[1] - https://github.com/BurntSushi/rust-csv/blob/master/csv-core/...




Have you seen the R package data.table (https://github.com/rdatatable/data.table)?

Not sure whether data.table is in the same domain as xsv, and certainly a lot of it is written in C. But for comparison's sake:

  fread("cities.csv")  1.30 s
And then the rest of the computations will be faster of course:

  count -- 0.005 ms
  freq  -- 0.2   s
  sort  -- 0.1   s
It's so useful that I often just use CSVs between 10 and 100GB as a database, since the difference in performance between fread and a 'proper' database isn't enough to justify the latter.


Yes. I've used it lightly in the past and have never been impressed by its speed or memory usage. But that could have been user error. I am not an R expert. :)

In any case, I think R implements all or most of its data table transformations in C, so I don't think it applies well here.


I'm curious as to what has impressed you with regard to speed or memory usage (at least at this scale). I realize there's a lot I don't know.


Initial results: a naive parser in LuaJIT, run against your `xsv stats` example:

  > (do (set csv (load "reader.l"))
        (set plays ((get (require 'system) 'read-file) "data/nfl_all_plays_small.csv"))
        (set s ((get csv 'stream) plays))
        nil)
  > (let t1 (seconds)
      (let read (get csv 'read)
        (while (read s)))
      (- (seconds) t1))
  0.80387997627258
I.e., it can parse the whole nfl_all_plays_small.csv dataset in 803ms.

  $ time xsv stats --everything data/nfl_all_plays_small.csv > /dev/null
  real	0m0.323s
  user	0m0.520s
  sys	0m0.064s
The CSV parser thus far is 62 lines: https://gist.github.com/shawwn/4c7f7e93ccf2e241e17e82d353301...

Obviously, this is a naive parser and doesn't handle the corner cases you mention. But it's a useful starting point.
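
In Python terms, "naive" here means roughly the following sketch: split on newlines, then on commas, ignoring quoting entirely.

  def naive_parse(data):
      # Breaks on quoted commas, escaped quotes, and embedded
      # newlines, but it's the obvious fast starting point.
      return [line.split(",") for line in data.splitlines() if line]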

Current CSV reader tests: https://gist.github.com/shawwn/53c06cd30f29b064c1f6c66f7896f...

Will have more results soon. :)

For basic record counting, LuaJIT takes 36ms:

  > (let t1 (seconds) (select 2 ((get string 'gsub) plays "\n" "\n")) (- (seconds) t1))
  0.036473989486694
xsv can do it in 29ms:

  $ time xsv count data/nfl_all_plays_small.csv
  10000

  real	0m0.029s
  user	0m0.018s
  sys	0m0.007s
EDIT: Down to 400ms to parse the dataset.
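
The gsub one-liner above is really just counting newline bytes. The same trick rendered in Python (fast, but not CSV-aware: it overcounts whenever a quoted field contains a literal newline):

  with open("data/nfl_all_plays_small.csv", "rb") as f:
      # Counts every \n, even those inside quoted fields.
      print(f.read().count(b"\n"))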


Ok, the `count` command is implemented. The code is 27 lines: https://gist.github.com/shawwn/e178e597cbf7f3682153e449f6633...

Without indexing, LuaJIT is twice as fast as xsv for 2.6M rows:

  $ rm data/*.csv.idx

  $ time lumen count.l data/nfl_all_plays_huge.csv
  2592975

  real	0m1.583s
  user	0m1.237s
  sys	0m0.311s

  $ time xsv count data/nfl_all_plays_huge.csv
  2592975

  real	0m3.184s
  user	0m2.425s
  sys	0m0.553s
With indexing, LuaJIT is within an order of magnitude:

  $ xsv index data/nfl_all_plays_huge.csv

  $ time xsv count data/nfl_all_plays_huge.csv
  2592975

  real	0m0.019s
  user	0m0.009s
  sys	0m0.007s

  $ time lumen count.l data/nfl_all_plays_huge.csv
  2592975

  real	0m0.184s
  user	0m0.083s
  sys	0m0.096s
I'll be implementing test cases to ensure it's catching malformed data.


Nice! Is your `count` command doing CSV parsing? I don't understand how your naive parser takes 400ms to parse nfl_all_plays_small.csv, but counting records is somehow faster. The fact that your `count` program needs to deal explicitly with `\r` makes me very suspicious. :)

Also, counting records with an index isn't particularly interesting, since it's just reading 8 bytes from the index file. I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?

In your comment above, you compared your CSV parser to `xsv stats --everything`, but `xsv stats` does a lot more than just CSV parsing. If you want to test how long xsv takes to parse CSV, then `xsv count` without an index is the way to do it. `xsv` only takes 19ms on my machine to parse nfl_all_plays_small.csv, which is basically within process overhead time.

Also, when you're ready, I would like to be able to run and inspect the code myself. :-)

I warned you above: the key challenge you're going to face is creating a robust CSV parser, and using that to implement every command, including `count`. If that isn't a requirement, then all comparisons are unfortunately moot.


It's just counting lines and skipping any whose contents are "\r" or blank. I believe this is correct behavior because, given:

  foo,bar,"quux
  zap",bang
`xsv count` returns 0 for this.

Is there any situation where CSV fields can contain literal newline characters (ASCII value 10)?

Will post code fairly soon. There aren't any tricks. I just implemented `slice` as well.

> Also, when you're ready, I would like to be able to run and inspect the code myself. :-)

Certainly!

EDIT: Ah, user error. CSVs can indeed contain literal newlines, and xsv handles that. I'll switch it to parse double-quoted strings and add some tests.
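
For instance, Python's stdlib parser treats the quoted newline as part of the field, so the example above is a single record; a quick check:

  import csv
  import io

  rows = list(csv.reader(io.StringIO('foo,bar,"quux\nzap",bang\n')))
  print(len(rows))  # 1
  print(rows[0])    # ['foo', 'bar', 'quux\nzap', 'bang']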

One simplification: if a line contains N commas, where N matches the number of columns minus one, then there's no need to parse it for count, slice, etc.
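
A sketch of that fast path in Python (hypothetical; the quote check matters, since a quoted comma throws the comma count off and forces a fall back to the real parser):

  def is_simple_record(line, n_columns):
      # A line with no quote characters and exactly n_columns - 1
      # commas is a complete one-line record; anything else goes
      # through the full parser.
      return '"' not in line and line.count(",") == n_columns - 1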

> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?

It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.
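
(For comparison, the same read in Python is a single `struct` call; this is a sketch, assuming the index value really is a big-endian u64:)

  import struct

  def read_be_u64(buf, offset=0):
      # ">Q" = big-endian unsigned 64-bit; no C helper required.
      return struct.unpack_from(">Q", buf, offset)[0]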


Indeed. xsv returned 0 because it interprets the first record as a header row by default.

Counting is interesting, because you don't have to implement unescaping to do it, even though any robust CSV parser implements unescaping for normal reads. So if you write two different versions of a CSV parser, one for normal reading and one just for counting, then the one for counting can go faster and avoid the need to amortize allocation. It's a nice trick though! However, I was using `xsv count` as a proxy for CSV parsing. So if you're just not going to do actual CSV parsing, then your solution is much less interesting. :-)
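
To sketch the counting-only version in Python terms (simplified: plain double-quote quoting, and it assumes the file ends with a newline):

  def count_records(data):
      # Track quote state and count record-terminating newlines;
      # a doubled quote ("") toggles the state twice and cancels
      # out. No field is ever allocated or unescaped.
      # Note: this counts every record, including the header row.
      in_quotes = False
      count = 0
      for ch in data:
          if ch == '"':
              in_quotes = not in_quotes
          elif ch == "\n" and not in_quotes:
              count += 1
      return count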

> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?

> It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.

Holy moly. Really? Do you know why it takes so long? That's a showstopper...


Agreed, though I'm mainly seeing how quickly I can reimplement everything xsv has to offer without sacrificing performance. I don't consider the challenge finished until, as you say, it handles all of the corner cases.

EDIT: I'm actually not sure what's going on with the startup time, since it's usually fast. I have quite a few windows open, which might be affecting results. (`xsv count` is suddenly taking ~70ms for simple data, so I think a reboot is in order.)

To clarify, I was mainly just excited to share some WIP. There's still a lot of work left to do to cross the finish line.


Ah OK. Then to preserve our sanity, I'm going to bow out until you tell me I should go look. There are other things I want to accomplish today. :-)


Sounds good! I'll hopefully have something interesting by the end of the weekend.


rust-csv compiled and ran flawlessly. xsv had an error: https://github.com/BurntSushi/xsv/issues/139

Also loved this:

    // OMG I HATE BYTE STRING LITERALS SO MUCH.
    fn b(s: &str) -> &[u8] { s.as_bytes() }
EDIT: Aha, it was user error. Brew had an old version of Rust installed.



