Show HN: uDSV.js – A faster CSV parser (github.com/leeoniya)
67 points by leeoniya on Sept 6, 2023 | 23 comments
Hey folks!

I know CSV parsers (especially in JS) aren't terribly exciting and someone writes a "better" one every week.

I'm in the middle of my parental leave, and this was a project that came out of me looking for the fastest/smallest CSV parser. It all started so innocently, and then turned into a benchmark-validation-athon; the library itself took ~2 weeks to write, but the performance comparisons took another ~4 weeks (on and off).

The benchmarks were a huge effort, but I think they are the most thorough to date, both in breadth and in depth, so hopefully you find them useful: https://github.com/leeoniya/uDSV/tree/main/bench

Let me know if you have specific concerns / questions / improvements :)

cheers! Leon



With multi-threading and some tricks, CSV parsing should be able to reach SSD speed (~2GiB/s read): https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...

At that point, I think the speed is limited by boxing JS objects.


nice, is this something you've tried in WASM or WebGL/WebGPU?

at some point it makes sense to simply recommend using WASM SQLite (or DuckDB) to load the data for OLAP stuff. i'm not gonna encourage anyone to use JS on multi-GB files. even a few hundred MB becomes questionable ;)


The CSV parser for [Perspective](https://github.com/finos/perspective) uses the [Apache Arrow](https://github.com/apache/arrow) C++ CSV parser compiled to WASM. It's not currently multi-threaded, but to my understanding that is possible as well.


what i learned from benchmarking Dekkai, is that simply "being in WASM" is a poor indicator of actual performance if you ever need to pull the values across that boundary into JS...especially strings.

i'll take a look and see if i can incorporate Perspective's parser into the bench, thanks!


What does it mean to box an object?


I believe allocating memory for objects on the heap. https://stackoverflow.com/a/80113/10654749


I think it’s more specific, something along the lines of turning a reference (pointer-allocated) object to a value-based object? But I’m woefully ignorant on this.


Oh, by "boxing objects", I meant creating the in-memory objects that the JavaScript runtime can use, preferably native objects. For example, in the JSC case, something like https://developer.apple.com/documentation/javascriptcore/jsv....


Ah, thanks


Ha, the creator of uplot strikes again! Love seeing your stuff man, always so thorough and well-explained, nice work

(https://github.com/leeoniya/uPlot)


I was going to write a simpler comment but referencing uFuzzy instead! It's one of my favourite fuzzy search tools, but possibly even better than uFuzzy itself is the fuzzy search comparison tool they built. When I have a project that requires fuzzy search, I load a bunch of representative data into it and try out the different algorithms to see what sort of things I can expect, and which one feels most like the results I want.

So even if I don't end up using uFuzzy, I still end up using uFuzzy!


> When I have a project that requires fuzzy search, I load a bunch of representative data into it and try out the different algorithms to see what sort of things I can expect, and which one feels most like the results I want.

be careful with this. the settings for other libs are hardcoded to be as close to uFuzzy as possible, so you can definitely get worse/better results depending on how you config each lib for your needed use case.


*simpler -> similar, I can't edit the comment any more unfortunately.


ironically, uPlot is a great counter-example to projects that are "well-explained" :D


Ehehe well your writings always seem to make sense to me at least! I spend an abnormal amount of time thinking about stacked charts and the extremely-opinionated demo you made here has stuck with me for years, lol

https://leeoniya.github.io/uPlot/demos/stacked-series.html


Idk uplot, but I use ufuzzy!


The benchmarks are well done, definitely worth the effort you put in. Taking a look at them I'm satisfied with my default choice of d3-dsv for balancing speed and reliability. The impressive performance improvements of this library are worth a look.


thanks :)

d3-dsv is one of the better ones for sure. it even slightly beats uDSV in Bun on some benchmarks. maybe it's better optimized for JSC rather than V8.

it doesn't do streaming tho, and the typed performance is rather behind since it's BYO conversion rather than a converter compiled for you via `new Function()`
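
To make the distinction concrete, here is a minimal sketch of the two approaches. It is not uDSV's or d3-dsv's actual code; the schema and the names `convertGeneric` and `compileConverter` are hypothetical and exist only for illustration. The generic version branches on the column type for every cell, while the compiled version generates straight-line source once and reuses the resulting function for every row.

```javascript
// Hypothetical column schema; names and types are made up for illustration.
const schema = [
  { name: "id", type: "number" },
  { name: "name", type: "string" },
  { name: "active", type: "boolean" },
];

// "BYO" conversion: a generic loop that re-checks the type for every cell.
function convertGeneric(row) {
  const out = {};
  for (let i = 0; i < schema.length; i++) {
    const { name, type } = schema[i];
    const v = row[i];
    out[name] =
      type === "number" ? Number(v) : type === "boolean" ? v === "true" : v;
  }
  return out;
}

// Compiled conversion: build the source of a specialized function once,
// then let the engine optimize it like hand-written code.
function compileConverter(cols) {
  const fields = cols.map((c, i) => {
    const key = JSON.stringify(c.name); // quote the key safely
    if (c.type === "number") return `${key}: +r[${i}]`;
    if (c.type === "boolean") return `${key}: r[${i}] === "true"`;
    return `${key}: r[${i}]`;
  });
  return new Function("r", `return { ${fields.join(", ")} };`);
}

const convertCompiled = compileConverter(schema);
console.log(convertCompiled(["42", "Ada", "true"]));
```

Whether the compiled form is actually faster depends on the engine and the data, so it is worth measuring rather than assuming. Note also that `new Function()` is blocked under strict Content Security Policy settings that disallow `unsafe-eval`.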


Super cool! Great work on making it fast and having clear benchmarks.

I had some fun looking through them and seeing how comma-separated-values (a package I authored ~9 years ago and also posted to Show HN) fared.


ha, you win the longest package name!


The benchmark numbers for Airport2.csv are interesting to me. None of the tools can hit 1 row per second?


no, it's just the bench runner. dekkai and udsv finish in about 2.5s.

the runner is designed to read results.length, but we're not accumulating in this case, so the rows/s math is...off :)
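
A rough sketch of why the number comes out wrong, assuming a runner that derives throughput from `results.length` (the `rowsPerSec` helper, the row count, and the timing below are hypothetical, not the actual bench code):

```javascript
// Hypothetical runner helper: throughput derived from results.length.
function rowsPerSec(results, elapsedMs) {
  return results.length / (elapsedMs / 1000);
}

const elapsedMs = 2500; // assumed, roughly the ~2.5s mentioned above
const totalRows = 1000; // made-up row count for the example

// Accumulating parse: every row is kept, so results.length is the row count.
const accumulated = [];
for (let i = 0; i < totalRows; i++) accumulated.push(["a", "b"]);

// Streaming parse: rows are processed and dropped, so results stays empty.
const streamed = [];
let seen = 0;
for (let i = 0; i < totalRows; i++) seen++;

console.log(rowsPerSec(accumulated, elapsedMs)); // 400
console.log(rowsPerSec(streamed, elapsedMs)); // 0, although 1000 rows were parsed
console.log(seen / (elapsedMs / 1000)); // 400, using an explicit row counter
```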


Was wondering where I had seen this person before… he wrote ufuzzy, which is pretty dope imo!



