I would definitely love to be proven wrong, because if I am, I am certain I would learn something new. I am pretty comfortable with Python, so I am pretty comfortable saying that you could not write a tool as fast as xsv in Python without writing some substantial portion of it in C. Using Python's standard library CSV parser is certainly fair game, but writing more C code on top of that, including the loop over records in a CSV file, feels like a cheat.

The problem here is that you've already lost: xsv's CSV parser is faster than Python's parser at the C level[1]. :-) Assuming I haven't made any silly gaffes in xsv itself, I don't see how you're going to make up the difference. Because of that, I am willing to extend a handicap: if you can get within the same order of magnitude as xsv while doing the same amount of work in a robust way, then I think I might even learn something there too. :-)
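For concreteness, the kind of Python program I have in mind is roughly the sketch below. It leans entirely on the standard library parser; nothing here is anyone's actual code, and the file name is just a placeholder:

import csv
import sys

def count_records(path):
    # Count CSV records using only Python's stdlib parser, which handles
    # quoting, embedded commas, and embedded newlines correctly.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, matching xsv's default
        return sum(1 for _ in reader)

if __name__ == "__main__":
    print(count_records(sys.argv[1]))

The record loop stays in Python, and that loop is exactly where I'd expect the time to go.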
I am less certain about PyPy, LuaJIT or Node, but I am certain I'd learn something if you proved me wrong.
Note that a problematic part of this challenge is that your program would need to correctly handle CSV in a robust manner. It is very easy to write a very fast CSV parser that doesn't correctly handle all of the corner cases. Python's CSV parser certainly passes that test, but I don't know if Node's or Lua's CSV parser does because I've never used them.
Not sure whether data.table is in the same domain as xsv, and certainly a lot of it is written in C. But for comparison's sake:
fread("cities.csv") 1.30 s
And then the rest of the computations will be faster of course:
count -- 0.005 ms
freq -- 0.2 s
sort -- 0.1 s
It's so useful that I often just use CSVs between 10 and 100 GB as a database, since the difference in performance between fread and a 'proper' database isn't enough to justify the latter.
Yes. I've used it lightly in the past and have never been impressed by its speed or memory usage. But there could have been user errors. I am not an R expert. :)
In any case, I think R implements all or most of its data table transformations in C, so I don't think it applies well here.
Without indexing, LuaJIT is twice as fast as xsv for 2.6M rows:
$ rm data/*.csv.idx
$ time lumen count.l data/nfl_all_plays_huge.csv
2592975
real 0m1.583s
user 0m1.237s
sys 0m0.311s
$ time xsv count data/nfl_all_plays_huge.csv
2592975
real 0m3.184s
user 0m2.425s
sys 0m0.553s
With indexing, LuaJIT is within an order of magnitude:
$ xsv index data/nfl_all_plays_huge.csv
$ time xsv count data/nfl_all_plays_huge.csv
2592975
real 0m0.019s
user 0m0.009s
sys 0m0.007s
$ time lumen count.l data/nfl_all_plays_huge.csv
2592975
real 0m0.184s
user 0m0.083s
sys 0m0.096s
I'll be implementing test cases to ensure it's catching malformed data.
Nice! Is your `count` command doing CSV parsing? I don't understand how your naive parser takes 400ms to parse nfl_all_plays_small.csv, yet counting records is somehow faster. The fact that your `count` program needs to deal explicitly with `\r` makes me very suspicious. :)
Also, counting records with an index isn't particularly interesting, since it's just reading 8 bytes from the index file. I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
In your comment above, you compared your CSV parser to `xsv stats --everything`, but `xsv stats` does a lot more than just CSV parsing. If you want to test how fast xsv takes to parse CSV, then `xsv count` without an index is the way to do it. `xsv` only takes 19ms on my machine to parse nfl_all_plays_small.csv, which is basically within process overhead time.
Also, when you're ready, I would like to be able to run and inspect the code myself. :-)
I warned you above: the key challenge you're going to face is creating a robust CSV parser, and using that to implement every command, including `count`. If that isn't a requirement, then basically all comparisons are unfortunately completely moot.
It's just counting lines and skipping any whose contents are "\r" or blank. I believe this is correct behavior because:
foo,bar,"quux
zap",bang
`xsv count` returns 0 for this.
Is there any situation where CSV fields can contain literal newline characters (ASCII value 10)?
Will post code fairly soon. There aren't any tricks. I just implemented slice as well.
> Also, when you're ready, I would like to be able to run and inspect the code myself. :-)
Certainly!
EDIT: Ah, user error. CSV fields can indeed contain literal newlines, and xsv handles that. I'll switch it to parse double-quoted strings and add some tests.
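A quick sanity check makes the failure mode obvious (sketched in Python here just for brevity; the data is the same record from above):

import csv
import io

data = 'foo,bar,"quux\nzap",bang\n'

# Naive line counting sees two lines...
print(sum(1 for _ in io.StringIO(data)))               # 2
# ...but a quote-aware parser sees a single record.
print(sum(1 for _ in csv.reader(io.StringIO(data))))   # 1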
One simplification: if a line contains N commas, where N matches the number of columns minus one, then there's no need to parse it for count, slice, etc.
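Roughly, the fast path I have in mind looks like the sketch below. Note the caveat (my addition): it is only safe when the line contains no quote characters, since quoted fields can hide commas and newlines, so anything with a quote has to fall back to the real parser.

def is_simple_record(line, num_columns):
    # Fast path: with no quote characters on a line of well-formed CSV,
    # there can be no embedded commas or newlines, so a comma count of
    # num_columns - 1 means this line is one complete record and needs
    # no real parsing. Anything else goes through the full CSV parser.
    return '"' not in line and line.count(",") == num_columns - 1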
> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.
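For what it's worth, in Python the same read would be a couple of lines with struct and no FFI at all (assuming, per the discussion above, that the record count is a big-endian u64 at the end of the .idx file; that layout is my reading of these comments, not something I've verified):

import struct

def read_record_count(idx_path):
    # Assumption: the record count is stored as a big-endian u64
    # in the last 8 bytes of the index file.
    with open(idx_path, "rb") as f:
        f.seek(-8, 2)  # seek to the last 8 bytes
        (count,) = struct.unpack(">Q", f.read(8))
    return count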
Indeed. xsv returned 0 because it interprets the first record as a header row by default.
Counting is interesting, because you don't have to implement unescaping to do it, even though any robust CSV parser will. So if you write two different versions of a CSV parser, one for normal reading and one just for counting, then the one for counting can go faster and you avoid the need to amortize allocation. It's a nice trick though! However, I was using `xsv count` as a proxy for CSV parsing. So if you're just going to skip actual CSV parsing, then your solution is much less interesting. :-)
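To illustrate the idea: a counting-only pass just has to track quoting state and record boundaries, without ever materializing a field. A rough Python sketch of the concept (not what xsv actually does):

def count_records(f, chunk_size=1 << 16):
    # Quote-aware record counter: it tracks whether we are inside a
    # quoted field, but never unescapes or allocates per-field strings.
    # (Header handling is omitted for brevity.)
    count = 0
    in_quotes = False
    saw_data = False
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            if in_quotes:
                if ch == '"':
                    in_quotes = False
            elif ch == '"':
                in_quotes = True
                saw_data = True
            elif ch == "\n":
                if saw_data:
                    count += 1
                saw_data = False
            elif ch != "\r":
                saw_data = True
    if saw_data:
        count += 1  # final record with no trailing newline
    return count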
> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
> It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.
Holy moly. Really? Do you know why it takes so long? That's a showstopper...
Agreed, though I'm mainly seeing how quickly I can reimplement everything xsv has to offer without sacrificing performance. I don't consider the challenge finished until, as you say, it handles all of the corner cases.
EDIT: I'm actually not sure what's going on with the startup time, since it's usually fast. I have quite a few windows open, which might be affecting results. (xsv count is suddenly taking ~70ms for simple data, so I think a reboot is in order.)
To clarify, I was mainly just excited to share some WIP. There's still a lot of work left to do to cross the finish line.
[1] - https://github.com/BurntSushi/rust-csv/blob/master/csv-core/...