Show HN: uDSV.js – A faster CSV parser

liuliu · on Sept 7, 2023

With multi-threading and some tricks, CSV parsing should be able to reach SSD speed (~2 GiB/s read): https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...

At that point, I think the speed is limited by boxing JS objects.
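A minimal sketch of one common trick for multi-threaded parsing: cut the input into chunks that end on row boundaries so each worker can parse its chunk independently. The function name `chunkBoundaries` is hypothetical, and the sketch assumes no quoted fields contain newlines, which a real parser would have to handle. It is not taken from the linked article.

```javascript
// Hypothetical helper: returns offsets that split `text` into up to `nChunks`
// pieces, each ending right after a "\n".
// Assumption: no quoted field contains a newline.
function chunkBoundaries(text, nChunks) {
  const bounds = [0];
  const approx = Math.ceil(text.length / nChunks);
  for (let i = 1; i < nChunks; i++) {
    const last = bounds[bounds.length - 1];
    // Start looking at the approximate cut point, never before the last boundary.
    const from = Math.max(i * approx, last);
    const nl = text.indexOf("\n", from);
    if (nl === -1) break; // no more row endings
    const next = nl + 1;
    if (next > last && next < text.length) bounds.push(next);
  }
  bounds.push(text.length);
  return bounds;
}

const text = "a,b\nc,d\ne,f\ng,h\n";
const bounds = chunkBoundaries(text, 2);

// Each slice could be handed to a separate worker.
const chunks = [];
for (let i = 0; i + 1 < bounds.length; i++) {
  chunks.push(text.slice(bounds[i], bounds[i + 1]));
}
```

Each chunk can then be parsed on its own thread, and the per-chunk results concatenated in order.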

leeoniya · on Sept 7, 2023

nice, is this something you've tried in WASM or WebGL/WebGPU?

at some point it makes sense to simply recommend using WASM SQLite (or DuckDB) to load the data for OLAP stuff. i'm not gonna encourage anyone to use JS on multi-GB files. even a few hundred MB becomes questionable ;)

texodus · on Sept 7, 2023

The CSV parser for [Perspective](https://github.com/finos/perspective) uses the [Apache Arrow](https://github.com/apache/arrow) C++ CSV parser compiled to WASM. It's not currently multi-threaded but this is possible as well to my understanding.

leeoniya · on Sept 7, 2023

what i learned from benchmarking Dekkai, is that simply "being in WASM" is a poor indicator of actual performance if you ever need to pull the values across that boundary into JS...especially strings.

i'll take a look and see if i can incorporate Perspective's parser into the bench, thanks!
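To illustrate the boundary cost mentioned above, here is a rough sketch that uses a plain `Uint8Array` as a stand-in for WASM linear memory. The byte offsets are assumed to come from a hypothetical WASM parser; the point is only that each field JS wants as a string has to be decoded into a freshly allocated JS string.

```javascript
// Stand-in for WASM linear memory: a byte buffer holding UTF-8 field data.
const encoder = new TextEncoder();
const decoder = new TextDecoder("utf-8");
const memory = encoder.encode("alpha,beta,gamma");

// Field positions as a hypothetical WASM parser might report them:
// [start, end) byte offsets into memory.
const offsets = [[0, 5], [6, 10], [11, 16]];

// Every field that JS needs as a string is decoded into a new JS string,
// so this cost scales with the number of values pulled across the boundary.
const fields = offsets.map(([start, end]) =>
  decoder.decode(memory.subarray(start, end))
);
```

Numeric columns can often stay in typed arrays and avoid this step, which is one reason strings tend to dominate the cost.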

tomcam · on Sept 7, 2023

What does it mean to box an object?

davefol · on Sept 7, 2023

I believe it means allocating memory for objects on the heap. https://stackoverflow.com/a/80113/10654749

tomcam · on Sept 7, 2023

I think it’s more specific, something along the lines of turning a reference (pointer-allocated) object to a value-based object? But I’m woefully ignorant on this.

liuliu · on Sept 11, 2023

Oh, by "boxing objects", I meant creating the in-memory objects that the JavaScript runtime can use, preferably native objects. For example, in the JSC case, something like https://developer.apple.com/documentation/javascriptcore/jsv....

tomcam · on Sept 16, 2023

Ah, thanks

pickledish · on Sept 7, 2023

Ha, the creator of uplot strikes again! Love seeing your stuff man, always so thorough and well-explained, nice work

(https://github.com/leeoniya/uPlot)

MrJohz · on Sept 7, 2023

I was going to write a simpler comment but referencing uFuzzy instead! It's one of my favourite fuzzy search tools, but possibly even better than uFuzzy itself is the fuzzy search comparison tool they built. When I have a project that requires fuzzy search, I load a bunch of representative data into it and try out the different algorithms to see what sort of things I can expect, and which one feels most like the results I want.

So even if I don't end up using uFuzzy, I still end up using uFuzzy!

leeoniya · on Sept 7, 2023

> When I have a project that requires fuzzy search, I load a bunch of representative data into it and try out the different algorithms to see what sort of things I can expect, and which one feels most like the results I want.

be careful with this. the settings for other libs are hardcoded to be as close to uFuzzy as possible, so you can definitely get worse/better results depending on how you config each lib for your needed use case.

MrJohz · on Sept 7, 2023

*simpler -> similar, I can't edit the comment any more unfortunately.

leeoniya · on Sept 7, 2023

ironically, uPlot is a great counter-example to projects that are "well-explained" :D

pickledish · on Sept 7, 2023

Ehehe well your writings always seem to make sense to me at least! I spend an abnormal amount of time thinking about stacked charts and the extremely-opinionated demo you made here has stuck with me for years, lol

https://leeoniya.github.io/uPlot/demos/stacked-series.html

jkrubin · on Sept 7, 2023

Idk uplot, but I use ufuzzy!

couchand · on Sept 7, 2023

The benchmarks are well done, definitely worth the effort you put in. Taking a look at them I'm satisfied with my default choice of d3-dsv for balancing speed and reliability. The impressive performance improvements of this library are worth a look.

leeoniya · on Sept 7, 2023

thanks :)

d3-dsv is one of the better ones for sure. it even slightly beats uDSV in Bun on some benchmarks. maybe it's better optimized for JSC rather than V8.

it doesn't do streaming though, and its typed performance is rather behind since it's BYO conversion rather than compiled for you via `new Function()`
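A minimal sketch of what "compiled for you via `new Function()`" can look like in general: generate the source of a row converter from a column schema once, then call the compiled function per row. The schema shape and the name `compileRowConverter` are hypothetical and are not uDSV's actual API.

```javascript
// Hypothetical schema: column name plus target type.
const schema = [
  { name: "id", type: "number" },
  { name: "name", type: "string" },
  { name: "active", type: "boolean" },
];

// Generates and compiles a function that turns one row of strings
// into a typed object, with the per-column conversions inlined.
function compileRowConverter(schema) {
  const fields = schema.map((col, i) => {
    const key = JSON.stringify(col.name); // safely quoted property name
    switch (col.type) {
      case "number":
        return `${key}: +row[${i}]`;
      case "boolean":
        return `${key}: row[${i}] === "true"`;
      default:
        return `${key}: row[${i}]`;
    }
  });
  return new Function("row", `return { ${fields.join(", ")} };`);
}

const toTyped = compileRowConverter(schema);
const typedRow = toTyped(["42", "Ada", "true"]);
```

The generated body has no loops or type lookups per row, which is the usual motivation for this approach over calling a generic conversion callback for every value.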

knrz · on Sept 7, 2023

Super cool! Great work on making it fast and having clear benchmarks.

I had some fun looking through them and seeing how comma-separated-values (a package I authored ~9 years ago and also posted to Show HN) fared.

leeoniya · on Sept 7, 2023

ha, you win the longest package name!

tyingq · on Sept 7, 2023

The benchmark numbers for Airport2.csv are interesting to me. None of the tools can hit 1 row per second?

leeoniya · on Sept 7, 2023

no, it's just the bench runner. dekkai and udsv finish in about 2.5s.

the runner is designed to read results.length, but we're not accumulating in this case, so the rows/s math is...off :)
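A small sketch of how that kind of skew can arise, with hypothetical names that are not taken from the actual bench runner: if throughput is computed from `results.length` but the parse streams rows without accumulating them, the reported rate collapses even though every row was processed.

```javascript
// Hypothetical throughput helper.
function rowsPerSecond(rowCount, elapsedMs) {
  return rowCount / (elapsedMs / 1000);
}

// Streaming-style parse: rows go to a callback and nothing is accumulated.
function parseStreaming(lines, onRow) {
  for (const line of lines) onRow(line.split(","));
  return []; // empty result set by design
}

const lines = Array.from({ length: 1000 }, (_, i) => `${i},x`);
let counted = 0;
const results = parseStreaming(lines, () => {
  counted++;
});

// Fixed elapsed time for illustration, matching the ~2.5s mentioned above.
const elapsedMs = 2500;
const naive = rowsPerSecond(results.length, elapsedMs); // based on the empty results array
const actual = rowsPerSecond(counted, elapsedMs); // based on rows actually seen
```

Here `naive` is 0 while `actual` is 400 rows/s, so a runner that only reads `results.length` needs a separate row counter for non-accumulating runs.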

jkrubin · on Sept 7, 2023

Was wondering where I had seen this person before… he wrote ufuzzy, which is pretty dope imo!