XSV: A fast CSV command-line toolkit written in Rust

cube2222 on Sept 8, 2018 | [–]

I like how all those rust/go implemented, fast, cross-platform tools are starting to get mainstream. (ripgrep's another great one).

Really useful if you're a programmer preferring Windows but mainly using Unix tools and developing for Unix OSes.

By the way, I've been using xsv when analysing 8 GB csv's (the Amazon review dataset) and have been nothing but happy with it.

vram22 on Sept 8, 2018 | | [–]

>I like how all those rust/go implemented, fast, cross-platform tools are starting to get mainstream. (ripgrep's another great one).

Both xsv and ripgrep are by BurntSushi, and ripgrep was mentioned twice on HN recently, once in a release note thread, IIRC, and another time in the CLI: Improved thread.

Both were mentioned before too.

coldtea on Sept 9, 2018 | | | [–]

Crazy, I know, but some people don't read HN 24/7.

vram22 on Sept 9, 2018 | | | [–]

I don't either :) I just happened to have read those earlier threads because of interest in the topics, so thought of sharing the info that I did.

et2o on Sept 8, 2018 | | | [–]

Ripgrep is by the same author, actually.

cube2222 on Sept 8, 2018 | | | [–]

Yup, I know. Actually, the official regexp implementation in rust (underlying ripgrep) is written by him, too.

thsowers on Sept 9, 2018 | | | [–]

Hadn't heard of rg before, awesome!

Perfect counterpart to sharkdp's alternative find: https://github.com/sharkdp/fd

da_chicken on Sept 8, 2018 | | | [–]

I'd be curious how it stacks up against Microsoft's LogParser[0].

[0]: https://en.wikipedia.org/wiki/Logparser

domoritz on Sept 8, 2018 | | [–]

This is great. For now, my go to tool is csvkit (Python) which has all kinds of neat tools. In particular loading data into databases (csvsql) is just plain awesome. Check it out at https://csvkit.readthedocs.io/en/1.0.3/.

Ecco on Sept 8, 2018 | | [–]

Seems great at first, but how is that better than piping the whole CSV file into SQLite and then doing the processing there? I think CSV is great for data exchange but not so great for data processing.

By using SQLite (or any other DB actually) you can decide which data to index, and write arbitrarily complex queries in a rather understandable language. I think XSV is kind of reinventing the wheel here.

burntsushi on Sept 8, 2018 | | [–]

You could. But then you need to write a schema. How do you know what schema to write? You could ask csvkit to do it, but it might not be fast enough on, say, a 40GB CSV file. Or maybe it isn't quite accurate enough. xsv might be the tool you use to figure out what the schema should be.

SQLite (or any SQL database) does not cover all use cases. For example, if you need to produce CSV data, and you can fit your transformation into xsv commands, then it might be hard to beat the performance in exchange for the effort you put in.

This is probably an expression problem. Tools don't always neatly fit into orthogonal buckets. If you think in terms of shell pipelines and want to attack the data as it is, then xsv might be good for you. SQLite is, IMO, a pretty massive hammer to apply every single time you want to look at CSV data.

jgord on Sept 9, 2018 | | | [–]

Have to pipe up here and say an enthusiastic / visceral / emotional 'thankyou' to BurntSushi for all your great contributions - xsv, rust-csv, ripgrep etc - not to mention your superb blog articles.

Im using xsv to wrangle 500gb of data, it is phenomenally useful and fast. The main datastore is postgresql and it is great, but there are some things pg just isn't fast enough for [hash joins anyone]. I have to do a lot of pre-processing, and xsv is incredible for that.

Perhaps the best complement of all - I am seriously looking at moving from node.js to Rust as my daily data/systems programming language _because_ of how performant and elegant xsv / rust-csv / ripgrep are, and how readable the rust code is.

I also have renewed respect for the staple of olde unix tools - sort sed grep wc etc

xsv is a brilliant addition to that canon of unix lore.

burntsushi on Sept 9, 2018 | | | [–]

Thanks for your kind words, I really appreciate them! :-) Please reach out if you'd like any Rust advice!

shaklee3 on Sept 9, 2018 | | | [–]

Which of the two available rust books would you recommend for someone who knows c/c++?

burntsushi on Sept 9, 2018 | | | [–]

Hard to say. I own them both but haven't thoroughly read through either. I would probably go with The Rust Programming Language, though, I suspect you can't go wrong with either.

anarazel on Sept 9, 2018 | | | | [–]

> but there are some things pg just isn't fast enough for [hash joins anyone]

If you've a separable benchmark for this, I'd be curious. I work on PG performance, including hashjoins, and real-world (-ish) benchmarks are good!

nabla9 on Sept 9, 2018 | | | | [–]

https://www.sqlite.org/csv.html

>The CSV Virtual Table

>The CSV virtual table reads RFC 4180 formatted comma-separated values, and returns that content as if it were rows and columns of an SQL table.

>The CSV virtual table is useful to applications that need to bulk-load large amounts of comma-separated value content. The CSV virtual table is also useful as a template source file for implementing other virtual tables.

>The CSV virtual table is not built into the SQLite amalgamation. It is available as a separate source file that can be compiled into a loadable extension. Typical usage of the CSV virtual table from the command-line shell would be something like this:

Ecco on Sept 9, 2018 | | | | [–]

For non-lookup csv-to-csv transformation, yes, I guess your tool can easily beat any SQL DB since it can work without any need for an external storage. That also applies to sequential-scan queries too. You make a good point :-)

Regarding the schema now, I’m not sure I get your point. I would argue the queries you want to make already dictate a schema (e.g. you want to compute an average ? Use a numeric column).

burntsushi on Sept 9, 2018 | | | [–]

> Regarding the schema now, I’m not sure I get your point. I would argue the queries you want to make already dictate a schema (e.g. you want to compute an average ? Use a numeric column).

Yeah that's fine. I think we're just missing each other, that's all. I'm not sure what kind of CSV data you've worked with, but from the stuff I've seen, it's rarely clean. `xsv frequency` and `xsv stats` are great ways to look at your data as it is. For example, a column might be named as if it were a numeric field. Maybe it's first hundred rows are even numeric. But there might be a row where it isn't numeric and isn't even empty, but has some kind of weird value in it. Like, say, 'N/A' or 'null' or 'none' or whatever sentinel value someone decided to use to indicate absence.

Data types can be richer than just primitives too. For example, one field might be an enumeration of a fixed set of values, and I can think of at least three different ways you might want to represent that in your schema.

scarejunba on Sept 9, 2018 | | | [–]

I prefer the SQLite option because it's faster when you're doing repeated operations. I load it in once and then run many queries on the same SQLite store. It's very simple and I know I can join etc. with a SQL syntax.

But I usually can figure out a schema for my CSV.

dagenix on Sept 9, 2018 | | | [–]

Faster than xsv?

AIUI, both sqlite and xsv rely on the OS filesystem buffer cache to keep their data in memory.

burntsushi on Sept 9, 2018 | | | [–]

Most definitely, yes. If you setup your indexes correctly in a relational database, then xsv isn't going to come anywhere near performing them as efficiently. The only indexing mechanism xsv has is a record index, which lets it parallelize certain workloads across a CSV file. But it doesn't have any per-field indices like relational databases do, so any kind of join done by xsv requires building the index first in memory.

xsv could grow support for field level indices, but it's more work and I don't know whether it's worth it. There's a line to straddle here, because at some point, xsv isn't going to be good enough and you should instead just use a database.

danso on Sept 9, 2018 | | | [–]

As a daily user of xsv, I can't imagine why a CSV processor would need to be competing with SQLite in terms of performance. For me, xsv hits the right niche in 2 very important and frequent use-cases:

1. One-off analyses or tasks involving relatively small datasets, like trimming the output of a command-line Twitter tool into xargs to, say, unfollow everyone who hasn't tweeted this year, or isn't following me and has fewer than 100 followers. If I were attempting to do a deeper analyses, SQLite would be better (because of the query syntax), but xsv/csvkit works incredibly well for quick explorations/filtering/sorting. Here's an example using csvkit (haven't quite gotten used to all of xsv's syntax) and the t command-line tool: https://gist.github.com/dannguyen/7213b16b7de79ad9c89fc2297d...

2. Prepping data for importing into SQLite; for very large public datasets, I like using xsv/csvkit to trim columns and filter out rows before importing them into SQLite. This is especially convenient since SQLite doesn't allow for dropping columns once a table is created.

If I'm ever working with data big enough to need SQLite-like performance, then I probably need SQLite's query syntax and other features. xsv is already flexible and performant for what I need in my data pipeline.

dagenix on Sept 9, 2018 | | | | [–]

I don't doubt that sqlite will wipe the floor with xsv for certain types of analysis. I also suspect that the opposite would be true for other types of analysis. Although, you'd know much better than I would - I've used sqlite quite a bit while my only usage of xsv was to work with a 100 line CSV file. Xsv was fast, but, I'm sure most tools would be at that size. What was killer, however, is that xsv gave me a lot of tools that helped exactly what I needed to do - sqlite wouldn't have helped at all. Right tool for the right job and all.

What I was more curious about was if the original comment was about a particular case where sqlite beat xsv, or if it was more handwavy. Also, the comment seemed to imply that repeated queries would be faster in sqlite than with xsv but didn't mention anything about indexes. My suspicion would be that without something that can take advantage of an index, xsv and sqlite would perform about the same doing repeated table / file scans.

burntsushi on Sept 9, 2018 | | | [–]

Yeah those are all great points! I think you're right to wonder a bit. Hard to say what's what without more details!

thom on Sept 9, 2018 | | | [–]

That's a lot of faff for ad-hoc jobs. R with dplyr is more likely the direct competition, and even then I assume this has some speedups.

miguelmota on Sept 9, 2018 | | [–]

BurntSushi is a badass; always cranking out awesome tools. I use ripgrep [1] on a daily basis as a grep replacement

[1] https://github.com/BurntSushi/ripgrep

bpicolo on Sept 9, 2018 | | [–]

Their FST library is also awesome

https://github.com/BurntSushi/fst

damageboy on Sept 9, 2018 | | [–]

Would be interesting to see how xsv compared to miller (https://johnkerl.org/miller/doc/index.html) in terms of perf, this tool comes exactly as I am about to munge 1TB of gzipped csv files.

Unfortunately, the main operation I need is not supported by xsv...

notimetorelax on Sept 9, 2018 | | [–]

What is the operation that you need? Can you send a PR?

Dowwie on Sept 9, 2018 | | [–]

I like the pager UI of the VisiData utility, written in Python, for exploring data. Since XSV is written in Rust, it could theoretically be imported and used in VisiData.

https://github.com/saulpw/visidata

radarsat1 on Sept 9, 2018 | | [–]

I'm not too familiar with rust tools, but I wanted to check this out. Can someone explain to me why this worked,

    sudo apt-get install cargo rustc
    cargo install xsv

but this did not:

    git clone <xsv github>
    cd xsv
    cargo build --release

The latter gave me a ton of compilation errors on the package crossbeam-channel. The former installed the program and compiled a bunch of crates, but did it install a pre-built binary for xsv? "downloading" then "installing" of xsv was the first step, then all the crates were downloaded. ldd seems to report that the installed xsv binary has no dependencies so I don't see why it would need to download and compile a bunch of crates in that case. On the other hand if it was compiling xsv then I don't understand why I don't get the same errors as in the latter case.

It seems Ubuntu has the following version:

    $ rustc --version
    rustc 1.25.0

I realize it's probably not optimal to use the Debian-packaged rustc but I didn't feel like figuring out the whole ecosystem just to install one program to test it out.

burntsushi on Sept 9, 2018 | | [–]

Because current master requires a newer version of the Rust compiler than the current release on crates.io. If current master were a release, then `cargo install xsv` would have failed with the same errors.

If you like, you can download static binaries from github: https://github.com/BurntSushi/xsv/releases

radarsat1 on Sept 9, 2018 | | | [–]

Ah, ok thanks. Anyways I got it installed, was just wondering about certain actions that cargo took. I found I could copy the resulting binary and delete my entire .cargo directory and it continues to work, so it seems cargo did some unnecessary things for `install`, but no worries.

burntsushi on Sept 9, 2018 | | | [–]

To add to this: Cargo is more of a development tool for Rust programmers rather than a package manager for end users. Both types of tools have a lot in common, which is why things like `cargo install` exist and are very useful. In the case of `xsv`, it's a useful escape hatch since your distro doesn't package `xsv`, but `cargo install` is not like `apt install`. That is, it downloads all of the source code dependencies of `xsv` and builds them. This also requires sync'ing with crates.io's registry list. None of this is unnecessary in the standard Cargo workflow in order to build xsv, but the build artifacts are certainly unnecessary in order to run xsv.

maccam94 on Sept 9, 2018 | | | [–]

Development of the Rust language is still happening fairly quickly, so many projects won't compile with the 6 month old compiler version in your Linux distribution's package repository. Most developers install it using rustup.

go_prodev on Sept 9, 2018 | | [–]

This looks great, and I'm very keen to try it out.

I have a malformed 115GB CSV which took some work to process with SSIS, but I'm really interested to try again using xsv to split off the bad rows and see if that would have been an easier option.

Very cool!

rookwood102 on Sept 8, 2018 | | [–]

Burntsushi mentions in the readme that it is often the case that people receive large csv files needing analysis but he also mentions that valid criticisms of the tool are that you could use an SQL database. Why can't you use a database in some scenarios? And when blazing in memory speed is useful, why can't you use something like this: https://github.com/jtablesaw/tablesaw

Is XSV faster? Is it the command line convenience (although surely it was not that convenient to write a new tool to do it)

Genuinely curious and I do not mean to belittle the project as it looks well implemented and useful nonetheless.

kunashe on Sept 9, 2018 | | [–]

BurntSushi, dude you're awesome.

colanderman on Sept 8, 2018 | | [–]

OK, dumb question, because there have been a LOT of these types of stories lately:

Why does it matter (beside to the implementor, or for pedagogy) what language a CLI tool is implemented in?

littlestymaar on Sept 9, 2018 | | [–]

The language itself doesn't have much importance, what matters is :

- is it portable ? Or will the users have to install a specific runtime to run the tool ? For Linux it's not a big deal because it comes pre-installed in most distributions, but making Python work on Windows requires some work. It's not too much if you're going to use the tool often, but for merely testing it, the runtime installation burden can deter users.

- is it fast ? For many application, speed doesn't matter, but for some it matters.

- is it stable ? Or will it crash every now and then ? C or C++ software have memory bugs, that leads to crashes (segfault). The most mainstream one don't have too much of it, but as soon as you start using niche software, you'll face crashes more often than not.

For these three reasons, the language matters, but there's not much difference if it's Rust, Go or Haskell.

earenndil on Sept 9, 2018 | | | [–]

* Python is not all that difficult to package into an exe. C/C++ are very cross-platform.

* Python can be reasonably fast with pypy.

* Go and rust maybe don't have segfaults, but they can still crash and burn, or have other kinds of bugs.

cgag on Sept 9, 2018 | | | [–]

I don't want reasonably fast, I want the 4 billion Hertz to rattle my teeth.

TheCoelacanth on Sept 9, 2018 | | | | [–]

Python is not hard to package into an exe, but most Python programs aren't packaged like that.

C and C++ are very cross-platform, if you only use cross-platform dependencies and if you are careful to write portable code. Most C and C++ applications don't do that.

earenndil on Sept 9, 2018 | | | [–]

Huh? What do you have to do to be careful? Just:

* Don't include windows.h

* Use a gui toolkit like gtk and qt (which you probably want to do anyway)

Generally, making a software windows-only is a deliberate choice, not something that happens 'by accident'.

realusername on Sept 9, 2018 | | | [–]

Not really, most of the standard library and basic stuff is incompatible on Windows. So sockets, file system calls, event loops, file configuration & default folders... It's never THAT different but different enough that just compiling your code won't work.

jcelerier on Sept 9, 2018 | | | [–]

> So sockets, file system calls, event loops, file configuration & default folders... It's never THAT different but different enough that just compiling your code won't work.

But if you use something like Qt you've got all of this cross-platform out of the box - I have a multiple hundred lines of code C++ software which does all this and much more ; it has almost zero platform-specific stuff except one function at some point to set real-time priorty to threads.

burntsushi on Sept 9, 2018 | | | [–]

So you think it's appropriate for tools like ripgrep and xsv to bundle a gui toolkit to do its job?

I feel like this whole line of argument is disingenuous. In Rust, you don't need to bundle a GUI toolkit to get cross platform stuff that works. The standard library was built with Windows in mind, and it shows. You don't need any additional POSIX layer. You just use the standard library.

If you go and look at competitors to things like xsv and ripgrep that are written in C or C++, they either require some kind of POSIX thing on Windows to run correctly (like cygwin), or they jump through lots of hoops to behave as a native Windows application.

It really should not be a controversial statement to say that Rust has a better cross platform story (Windows/mac/Linux). It was a design goal.

jcelerier on Sept 9, 2018 | | | [–]

> So you think it's appropriate for tools like ripgrep and xsv to bundle a gui toolkit to do its job?

Qt is not a GUI toolkit, it's a general framework. If you only link to -lQtCore you won't have anything remotely related to GUI in your binary. For instance doxygen is built like this.

burntsushi on Sept 9, 2018 | | | [–]

Why isn't this more common then? I've literally never seen this done in any of the C++ command line tools I've looked at.

jcelerier on Sept 9, 2018 | | | [–]

https://github.com/search?l=QMake&q=%22CONFIG+%2B%3D+console...

common enough for me

dj43nq on Sept 9, 2018 | | | | [–]

Don’t you end up with licensing fees if you use QT in a commercial product? Seems like a downside to me.

dj43nq on Sept 10, 2018 | | | [–]

QT seems interesting.

For those like me who’ve never used QT: As long as you don’t statically link the QT libraries you don’t have to open source your entire commercial application. I’m not a lawyer.

General notes about licensing https://www1.qt.io/licensing/

Comparison of licenses https://www1.qt.io/licensing-comparison/

jcelerier on Sept 11, 2018 | | | [–]

> As long as you don’t statically link the QT libraries you don’t have to open source your entire commercial application

I provided a link from the FSF that shows that this is not the case just below - you can statically link proprietary apps with LGPL qt.

dj43nq on Sept 12, 2018 | | | [–]

“(1) If you statically link against an LGPL'd library, you must also provide your application in an object (not necessarily source) format, so that a user has the opportunity to modify the library and relink the application.”

So for LGPL. As a commercial application developer I’m not sure I’d want to do that. I’d then have to field support requests for that object file.

Still, good to know.

jcelerier on Sept 9, 2018 | | | | [–]

Not at all. You have to respect the license of Qt (e.g. if your customer asks you for the source of the Qt libraries you have to provide them - including any modification you made to them), but you can keep your own app proprietary and ship the whole however you like.

detaro on Sept 9, 2018 | | | | [–]

LGPL isn't that hard to comply with for a desktop application, so no.

littlestymaar on Sept 10, 2018 | | | [–]

With LGPL, you can't statically link the library though. Which requires your user to install it manualy, unless you package your software with some kind of installer (which is not so common on windows for cli executables).

jcelerier on Sept 10, 2018 | | | [–]

> With LGPL, you can't statically link the library though.

no, that's tiresomely false. You can link proprietary code statically with LGPL code. LGPL does not give a shit about static or dynamic libraries because LGPL is not a language-specific license.

Right from the source: https://www.gnu.org/licenses/gpl-faq.html#LGPLStaticVsDynami...

littlestymaar on Sept 10, 2018 | | | [–]

Thanks for the link. The nuance I was missing is that you can distribute statically-linked LGPL library as long as you provide a way for your user to override the chosen library with his own version of the library.

detaro on Sept 10, 2018 | | | | [–]

It's very common to ship windows CLI tools as a .zip file with an exe and a bunch of .dlls, that's not exactly burdensome to install compared to only an executable.

dj43nq on Sept 9, 2018 | | | | [–]

It’s not THAT hard to make a cli C program portable between windows, Mac and Linux. I’ve done plenty of projects that support all three platforms and the amount of work wasn’t excessive. For some projects it was just getting the build environment up that was the biggest issue. Once you use cmake, modular design etc a lot of the issues vanish.

realusername on Sept 10, 2018 | | | [–]

I never said it's complicated, I find it pretty simple, it's just not going to work by default straight away.

earenndil on Sept 9, 2018 | | | | [–]

Mingw?

jaipilot747 on Sept 9, 2018 | | | | [–]

It's not difficult to package, but if your experience is anything like mine, distributing a 400+MB zip containing all the dependencies and the python core+stdlib is no fun.

whyever on Sept 9, 2018 | | | | [–]

> Python can be reasonably fast with pypy.

Until it can't and then you are stuck with reimplementing the hot path in a language that can.

PioNono on Sept 9, 2018 | | | [–]

I’ll share a use case where Rust specifically has been a godsend.

I write software for users who have strong engineering backgrounds but tend to know just enough about computers to be dangerous. They tend to know a smattering if Python, Perl, SQL, SAS, R, or Matlab, and will happily use a well-documented CLI tool, but cannot be relied upon to do basic sysadmin tasks effectively. They run a mix of windows and OSX, and some set up Linux VMs. They are hired for their expertise in their domain, not because they are great IMS technicians. They deal with large, heterogenous, difficult-to-schematize datasets and value speed, but it is a total waste of their time to troubleshoot environment/linking issues that frequently accompany building and using C/C++ utilities for many non-programmer users. I’ve found Rust to be hugely beneficial for writing programs to serve this user base (RF engineers in my case, but I’d imagine people who write software for ag/life science people, banking analysts, civil/environmental engineers and others have similar challenges).

* it’s also nice that it’s hard to segfault in Rust. Many existing tools in my domain are notoriously fragile in this regard.

bigger_cheese on Sept 10, 2018 | | | [–]

Wow that sums up my world precisely - also engineering. Especially the part about people knowing a smattering of Perl, Python, SAS and Matlab. Add in C++ (written in circa early 90's style i.e C with "new" and "delete") and Fortran (Fortan 77 style straight out of the classical Numerical Recipes text book) you could be describing my org.

People here have so many issues managing things like library dependencies. This is especially a problem on Windows. I remember witnessing the guy next to me waste hours trying to build and link an image analysis library written in C++. I could feel his frustration from across the partition "This library needs this dependency, which has to build from scratch but before I can do that I have to build library 'zzz' but to do that I have to download something called "CMAKE". I've finally done all that but then I forgot to set a compiler flag correctly in some library 3 steps up the chain and now I have to start all over again. At the end of all that you better be damn sure you were paying attention and didn't accidentally mix some 32 / 64 bit code together which apparently is pretty easy to do when you are downloading random code from various websites."

I think these were instructions he was trying to follow

(https://docs.opencv.org/2.4/doc/tutorials/introduction/windo...)

Stuff like configuring environment variables, editing make files etc are very difficult for a lot of otherwise smart people to understand. These people have no problem writing code to do quite complex analysis just building it and distributing it is quite challenging.

pests on Sept 8, 2018 | | | [–]

A lot of CLI tools are quick scripts or other programs that will start, execute, and then die quickly.

If a tool is written in an interpreted language or has to load a large VM on each invocation than a large percentage of the total runtime of the CLI tool will be spent on language-specific tasks.

With a non CLI tool such as a server or daemon the expectation is you start it once and it runs for awhile. In these cases startup time does not matter as much.

Furthermore, for interpreted languages like JS/Python/PHP and languages that require a VM such as Java or C#, not everyone will have or want to install each languages runtime.

colanderman on Sept 8, 2018 | | | [–]

Yes, but Rust is far from unique in this regard.

Maybe I shouldn't be surprised that "written in language without onerous runtime and installation overhead" is apparently now rare enough to be a selling point.

wybiral on Sept 9, 2018 | | | [–]

Languages like Rust and Go are just really easy to cross-compile compared to working with lower level OS-dependent APIs in C or something.

There's nothing inherently unique, they're just a good combo of features.

saagarjha on Sept 9, 2018 | | | [–]

> Languages like Rust and Go are just really easy to cross-compile compared to working with lower level OS-dependent APIs in C or something.

You make it seem like C has no standard library. C is really easy to cross compile: you just add a -arch flag to your compiler frontend and, provided you're staying inside POSIX (which you should!) it should "just work".

hsivonen on Sept 9, 2018 | | | [–]

C's standard library is completely broken for text handling (depends on external locale instead of caller-specified locale) and on Windows broken for file system access if your paths contain non-ASCII. (At least until the latest dev version of Windows 10.)

Rust has the best file system abstraction for exposing Windows file paths in a way that allows you to write common file path handling application code for Windows and *nix.

colanderman on Sept 9, 2018 | | | | [–]

For basic load/munge/save stuff (like CSV processing) the C standard library more than suffices, and is cross-platform. You don't need OS-dependent stuff (like POSIX) to do that.

(I'm not arguing C vs. Rust though... just pointing out that "cross-platform" is a much more meaningful descriptor for a CLI tool than "in Rust". The latter just reads like noise to me.)

jaipilot747 on Sept 9, 2018 | | | [–]

Do you know of any libraries in C that give me an easy way to manipulate tabular data by columns, like pandas does?

petre on Sept 9, 2018 | | | [–]

There is miller. But it's a CLI tool, not s library. Maybe it comes bundled with a library?

https://github.com/johnkerl/miller

colanderman on Sept 10, 2018 | | | | [–]

What does that have to do with "lower level OS-dependent APIs"?

jaipilot747 on Sept 10, 2018 | | | [–]

Nothing, honestly. I was just looking to learn if there was a tool in C that could let me be just as productive as Pandas.

paulddraper on Sept 10, 2018 | | | | [–]

Java, C#, Python, Ruby, Perl, PHP, Visual Basic, Scala, Clojure, Dart, Groovy, F#, Kotlin, JavaScript, and Lua are (almost always) interpreted languages.

So that leaves C, C++, Rust, D, Pascal, Haskell, and maybe Go (depending how large the binary gets and if the GC gods are on your side).

Whether or not it is "unique", it is a highly significant choice.

zmmmmm on Sept 10, 2018 | | | [–]

Slightly semantic argument, but most of those languages are normally considered compiled. They have an interpreter, but it operates on a compiled form of the language. Eg: Java is "compiled", Java Byte Code is "interpreted". Slightly semantic but not completely, as a tremendous amount of optimisation can actually be done by that compile step.

paulddraper on Sept 10, 2018 | | | [–]

Yes that is semantic.

And not real distinguishing, given that most languages have bytecodes. This is true even of many "scripting" languages such as Python, PHP, and Ruby.

> as a tremendous amount of optimisation can actually be done by that compile step.

Not really. Look at something like Proguard which (among other things) is the #1 JVM bytecode optimizer. And yet the runtime performance differs very little before and after optimization. As long as you deliver reasonable bytecode, the JIT will capture every optimization and more that you could make to it.

---

At the end of the day, machine code is thing that actually moves the needle on performance, whether JIT machine code or AOT machine code.

shawn on Sept 8, 2018 | | | | [–]

If a tool is written in an interpreted language or has to load a large VM on each invocation than a large percentage of the total runtime of the CLI tool will be spent on language-specific tasks.

This is a myth, FWIW. LuaJIT, python, and node all load within 150ms. And it's usually closer to 70ms: https://gist.github.com/shawwn/544b643bba018fb6bd302a5c46222...

dbaupp on Sept 9, 2018 | | | [–]

With a warm cache, ripgrep takes 80-100ms to search 500MB of files across all the projects that I have checked out (for a regex like /val\w+;/ which matches 500 lines). Python starts (also from a warm cache) in 30ms on my machine. 30-40% seems like a large percentage by most measures.

leshow on Sept 9, 2018 | | | | [–]

In 150ms the other tool would be finished before any of those had even begun. Plus, this is for a print statement, you'd have to factor in dependencies as well to get a real idea of how long it took.

hegz on Sept 9, 2018 | | | [–]

That 150ms is especially important if the tool is running in a loop in a bash command.

lifthrasiir on Sept 9, 2018 | | | | [–]

While it us technically possible to optimize dependency relations to minimize the startup cost for minimal tasks, it is very hard to do in reality. In the "oxidation" plan of the Mercurial SCM [1] this problem was directly addressed and one of the first suggested targets to switch to Rust was an alternative CLI frontend.

[1] https://www.mercurial-scm.org/wiki/OxidationPlan

pests on Sept 9, 2018 | | | | [–]

A single print statement? That is extremely artificial. I'd like to to see numbers wheb you're loading dependencies and on a real workload.

That's barely even making use of those languages runtimes as well.

shawn on Sept 9, 2018 | | | [–]

Why load dependencies when the task is text transformation? It’s not artificial.

Importing the regex libs won’t affect the results by more than a rounding error.

londt8 on Sept 9, 2018 | | | | [–]

The comment was about loading a large VM at each invocation, so only minimal example is sufficient.

pests on Sept 9, 2018 | | | [–]

Sure, technically.

We were talking about start up time when using a CLI tool. Loading a VM is only part of that.

arcticbull on Sept 9, 2018 | | | | [–]

That is a long time, though. An eternity in terms of compute.

cube2222 on Sept 8, 2018 | | | [–]

Rust and Go are languages with ecosystems, in which writing everything in a cross-platform way is the default. They also compile to a single statically linked binary.

You don't need an additional runtime + load of dependencies like node or python.

They are lightweight and fast to start up as opposed to the jvm.

colanderman on Sept 8, 2018 | | | [–]

Everything you said applies to many other languages, including C, which is probably the most common language used to write CLIs if you rank by hours used. Why not advertise the unique properties of this tool (fast, easy to install) than its implementation language?

Does this ultimately boil down to "go install" (or whatever the Rust equivalent is) is easier to deal with than "./configure; make"?

Maybe I'm just a curmudgeon and don't understand why it seems to be in vogue to ignore OS package managers, which hide all this complexity anyway.

21 on Sept 9, 2018 | | | [–]

> Does this ultimately boil down to "go install" (or whatever the Rust equivalent is) is easier to deal with than "./configure; make"?

Well, yeah. For once, "./configure; make" doesn't work on Windows. So you need CMake, or SCons, or Boost Jam, or any other cross platform build system. Then you have different ways of getting dependent libraries, like zlib. In short, the C/C++ cross platform "packaging" is just a mess.

TheCoelacanth on Sept 9, 2018 | | | | [–]

> OS package managers

Already ruled out the most common OS with that phrase.

ptman on Sept 9, 2018 | | | [–]

Well there is the windows app store these days. Probably no cli software offered, and no third party repositories, though. And there's chocolatey, which may be of use

mkl on Sept 9, 2018 | | | [–]

The Windows store now includes entire Linux distributions, which come with all their CLI and repository goodness. I don't know about native Windows CLI stuff though - WSL is so good that I've switched pretty much entirely to that (Windows commands can run from Linux Bash when they're occasionally needed).

leshow on Sept 9, 2018 | | | | [–]

I think the tools speak for themselves. They get use and then they get included in repos, not just by virtue of the fact they were written in lang X.

colanderman on Sept 9, 2018 | | | [–]

Then why advertise that? Why not advertise their benefits instead (fast, cross-platform, easy install)? "Written in X" is only useful to advertise pedagogy or integration.

mmirate on Sept 9, 2018 | | | [–]

Shorthand. Compare the lengths of these two strings:

    > fast, cross-platform, easy install

    > Written in Rust

Also note how the second statement expresses a single concept; whereas the first statement is a list of concepts that must be memorized and reproduced by rote, and is therefore constantly at-risk of being stated incompletely or inaccurately.

(And it is incomplete: there are many more reasons why "Written in Rust" might be important, such as correctness, security, and ease of maintenance/refactoring.)

* * * * *

Alternatively: why convey a set of vague descriptors when you can instead convey precisely why they are applicable, and let the audience's pre-existing memories fill in the details for them?

colanderman on Sept 10, 2018 | | | [–]

Like I said, "written in X" conveys pedagogy and integration. (Or, if we're being cynical, hip language clickbait.) It's not a very good shorthand.

leshow on Sept 10, 2018 | | | | [–]

Because it's a post for hackernews, and we're mostly engineers and developers so we care to know what it's written in. Rust is a awesome new language with growing popularity and it gets hits whenever it's posted.

shawn on Sept 8, 2018 | | | | [–]

Python is installed literally everywhere. If it’s a CLI tool, and it’s written in python, the chance that it won’t work is slim.

Go has a massive number of problems. For example, I tried to run Keybase’s standard “go get” build instructions. It failed with 200 import errors. That was the end of my attempt to install keybase on my raspberry pi. Others had said that it works.

Rust requires a massive amount of hard drive space and takes a long time to build. You also have to build it. That’s antithetical to rapid development.

I can't wait until the pendulum swings back away from static typing and the next generation of programmers discover the benefits of literally ignoring everybody and doing your own thing. It’ll be painful, but at least it’ll be effective. And you won’t have to compile anything.

h1d on Sept 9, 2018 | | | [–]

> Python is installed literally everywhere.

I figured Ubuntu 18.04 didn't have Python pre installed. Besides, installing all the dependencies with pip is another step to do and gets annoying when deploying to many servers.

For something that gets distributed, a single static binary is very welcomed.

ptman on Sept 9, 2018 | | | [–]

I think it has python 3, but not python 2

cube2222 on Sept 8, 2018 | | | | [–]

You don't need to go get or install the rust sdk if you just download the small, statically linked binary.

shawn on Sept 8, 2018 | | | [–]

There is no such binary for the raspberry pi. If it was written in python, it would have worked. It likely would have turned out smaller, too.

burntsushi on Sept 9, 2018 | | | [–]

xsv doesn't exist if it was written in Python. It would definitively be too slow. If you don't care about performance and would rather not wait a couple minutes to build the tool on your Pi, then go use csvkit, which is written in Python. The availability of software isn't a zero sum game.

shawn on Sept 9, 2018 | | | [–]

Ohhh, you wanna throw down eh? Hmmm. This would be a fun weekend project to reimplement XSV in Python and prove this wrong. :)

Now, to make this a fair comparison, are you excluding pypy? Or is that allowed for our game?

How about LuaJIT or Node? Are those fair too?

burntsushi on Sept 9, 2018 | | | [–]

I would definitely love to be proven wrong, because if I am, I am certain I would learn something new. I am pretty comfortable with Python, so I am pretty comfortable saying that you could not write a tool as fast as xsv in Python without writing some substantial portion of it in C. Using Python's standard library CSV parser is certainly fair game, but writing more C code on top of that which included the loop over records in a CSV file feels like a cheat. The problem here is that you've already lost because xsv's CSV parser is faster than Python's parser at the C level[1]. :-) Assuming I haven't made any silly gaffes in xsv itself, I don't see how you're going to make up the difference. Because of that, I am willing to extend a handicap: if you can get within the same order of magnitude as xsv while doing the same amount of work in a robust way, then I think I might even learn something there too. :-)

I am less certain about PyPy, LuaJIT or Node, but I am certain I'd learn something if you proved me wrong.

Note that a problematic part of this challenge is that your program would need to correctly handle CSV in a robust manner. It is very easy to write a very fast CSV parser that doesn't correctly handle all of the corner cases. Python's CSV parser certainly passes that test, but I don't know if Node's or Lua's CSV parser does because I've never used them.

[1] - https://github.com/BurntSushi/rust-csv/blob/master/csv-core/...

hugh-avherald on Sept 9, 2018 | | | [–]

Have you seen the R package data.table (https://github.com/rdatatable/data.table)?

Not sure whether data.table is in the same domain as xsv, and certainly a lot of it is written in C. But for comparison's sake:

  fread("cities.csv")  1.30 s

And then the rest of the computations will be faster of course:

  count -- 0.005 ms
  freq  -- 0.2   s
  sort  -- 0.1   s

It's so useful that I often just use csvs between 10 and 100GB as a database as the difference in performance between fread and a 'proper' database aren't enough to justify the latter.

burntsushi on Sept 9, 2018 | | | [–]

Yes. I've used it lightly in the past and have never been impressed by its speed or memory usage. But there could have been user errors. I am not an R expert. :)

In any case, I think R implements all or most of its data table transformations in C, so I don't think it applies well here.

hugh-avherald on Sept 9, 2018 | | | [–]

I'm curious as to what has impressed you with regard to speed or memory usage (at least over this scale). I realize there's a lot I don't know.

shawn on Sept 9, 2018 | | | | [–]

Initial results: A naive parser in LuaJIT of your xsv-stats example:

  > (do (set csv (load "reader.l"))
        (set plays ((get (require 'system) 'read-file) "data/nfl_all_plays_small.csv"))
        (set s ((get csv 'stream) plays))
        nil)
  > (let t1 (seconds)
      (let read (get csv 'read)
        (while (read s)))
      (- (seconds) t1))
  0.80387997627258

i.e. it can parse the whole nfl_all_plays_small.csv dataset in 803ms.

  $ time xsv stats --everything data/nfl_all_plays_small.csv > /dev/null
  real	0m0.323s
  user	0m0.520s
  sys	0m0.064s

The CSV parser thus far is 62 lines: https://gist.github.com/shawwn/4c7f7e93ccf2e241e17e82d353301...

Obviously, this is a naive parser and doesn't handle the corner cases you mention. But it's a useful starting point.

Current CSV reader tests: https://gist.github.com/shawwn/53c06cd30f29b064c1f6c66f7896f...

Will have more results soon. :)

For basic record counting, LuaJIT takes 36ms:

  > (let t1 (seconds) (select 2 ((get string 'gsub) plays "\n" "\n")) (- (seconds) t1))
  0.036473989486694

xsv can do it in 29ms:

  $ time xsv count data/nfl_all_plays_small.csv
  10000

  real	0m0.029s
  user	0m0.018s
  sys	0m0.007s

EDIT: Down to 400ms to parse the dataset.

shawn on Sept 9, 2018 | | | [–]

Ok, the `count` command is implemented. The code is 27 lines: https://gist.github.com/shawwn/e178e597cbf7f3682153e449f6633...

Without indexing, LuaJIT is twice as fast as XSV for 2.6M rows:

  $ rm data/*.csv.idx

  $ time lumen count.l data/nfl_all_plays_huge.csv
  2592975

  real	0m1.583s
  user	0m1.237s
  sys	0m0.311s

  $ time xsv count data/nfl_all_plays_huge.csv
  2592975

  real	0m3.184s
  user	0m2.425s
  sys	0m0.553s

With indexing, LuaJIT is within an order of magnitude:

  $ xsv index data/nfl_all_plays_huge.csv

  $ time xsv count data/nfl_all_plays_huge.csv
  2592975

  real	0m0.019s
  user	0m0.009s
  sys	0m0.007s

  $ time lumen count.l data/nfl_all_plays_huge.csv
  2592975

  real	0m0.184s
  user	0m0.083s
  sys	0m0.096s

I'll be implementing test cases to ensure it's catching malformed data.

burntsushi on Sept 9, 2018 | | | [–]

Nice! Is your `count` command doing CSV parsing? I don't understand how your naive parser takes 400ms to parse nfl_all_plays_small.csv, but that counting records is somehow faster. The fact that your `count` program is needing to deal explicitly with `\r` makes me very suspicious. :)

Also, counting records with an index isn't particular interesting, since it's just reading 8 bytes from the index file. I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?

In your comment above, you compared your CSV parser to `xsv stats --everything`, but `xsv stats` does a lot more than just CSV parsing. If you want to test how fast xsv takes to parse CSV, then `xsv count` without an index is the way to do it. `xsv` only takes 19ms on my machine to parse nfl_all_plays_small.csv, which is basically within process overhead time.

Also, when you're ready, I would like to be able to run and inspect the code myself. :-)

I warned you above: the key challenge you're going to face is creating a robust CSV parser, and using that to implement every command, including `count`. If that isn't a requirement, then basically all comparisons are unfortunately completely moot.

shawn on Sept 9, 2018 | | | [–]

It's just counting lines and skipping any whose contents are "\r" or blank. I believe this is correct behavior because:

  foo,bar,"quux
  zap",bang

`xsv count` returns 0 for this.

Is there any situation where csv fields can contain literal newline characters? (ascii value 10.)

Will post code fairly soon. There aren't any tricks. I just implemented slice as well.

Also, when you're ready, I would like to be able to run and inspect the code myself. :-)

Certainly!

EDIT: Ah, user error. CSVs can indeed contain literal newlines, and XSV handles that. I'll switch it to parse doublequoted strings and add some tests.

One simplification: if a line contains N commas, where N matches the number of columns minus one, then there's no need to parse it for count, slice, etc.

I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?

It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.

burntsushi on Sept 9, 2018 | | | [–]

Indeed. xsv returned 0 because it interprets the first record as a header row by default.

Counting is interesting, because you don't have to implement unescaping to do it, but any robust csv parser will do it. So if you write two different versions of a csv parser, one for normal reading and one just for counting, then the one for counting can go faster and you'll avoid the need to amortize allocation. It's a nice trick though! However, I was using `xsv count` as a proxy for CSV parsing. So if you're just going to not do actual CSV parsing, then your solution is much less interesting. :-)

> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?

> It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.

Holy moly. Really? Do you know why it takes so long? That's a showstopper...

shawn on Sept 9, 2018 | | | [–]

Agreed, though I'm mainly seeing how quickly I can reimplement everything xsv has to offer without sacrificing performance. I don't consider the challenge finished until, as you say, it handles all of the corner cases.

EDIT: I'm actually not sure what's going on with the startup time, since it's usually fast. I have quite a few windows open, which might be affecting results. (xsv count is suddenly taking ~70ms for simple data, so I think a reboot is in order.)

To clarify, I was mainly just excited to share some WIP. There's still a lot of work left to do to cross the finish line.

burntsushi on Sept 9, 2018 | | | [–]

Ah OK. Then to preserve our sanity, I'm going to bow out until you tell me I should go look. There are other things I want to accomplish today. :-)

shawn on Sept 9, 2018 | | | [–]

Sounds good! I'll hopefully have something interesting by the end of the weekend.

shawn on Sept 9, 2018 | | | | [–]

rust-csv compiled and ran flawlessly. xsv had an error: https://github.com/BurntSushi/xsv/issues/139

Also loved this:

    // OMG I HATE BYTE STRING LITERALS SO MUCH.
    fn b(s: &str) -> &[u8] { s.as_bytes() }

EDIT: Aha, it was user error. Brew had an old version of Rust installed.

JoshTriplett on Sept 9, 2018 | | | | [–]

> This would be a fun weekend project to reimplement XSV in Python and prove this wrong.

I don't know as much about the internals or performance characteristics of XSV (though it certainly touts performance as a feature), but if you can reimplement ripgrep in Python and get anywhere close to the same performance, I'd certainly be interested to see that.

> Now, to make this a fair comparison, are you excluding pypy? Or is that allowed for our game?

PyPy is not available everywhere, unlike the CPython runtime or the ability to run compiled binaries.

dagenix on Sept 9, 2018 | | | | [–]

It's perfectly fine to say "this is an impressive project. I don't understand why it couldn't be done in Python. I would love for someonet to explain that to me. Thanks!"

It is not necessary to dismissively declare that some substantial piece of work could be implemented better in just a weekend.

shawn on Sept 9, 2018 | | | [–]

  $ files | narrow \.rs$ | narrow '!tests' | xargs cat | nlines
  4251

It's 4k lines of Rust. Shedding the static typing nonsense will get rid of at least 25% of that. Writing it in Lumen will buy an extra 2x in productivity. And there's nothing to discover; the algorithms are right there, and my claim is that they will run nearly as fast in a non-statically-typed language. I don't think the weekend claim is that outrageous.

You don't like putting on a show for a crowd? It's one of the funnest things.

JoshTriplett on Sept 9, 2018 | | | [–]

First of all, take a look at Cargo.toml for the list of dependencies; repeat recursively. Projects like xsv and ripgrep are modular, with many components that others can and do reuse.

Second, lines of code hardly gives any but the roughest idea of how hard something would be to write, and write well.

Third, interesting that you're not counting the test cases; after all, if you're not doing any static typing, surely you'll want more tests...

Fourth, hey, as long as you're getting rid of the "static typing nonsense" you might as well drop the error handling and comments while you're at it. More seriously, though, type signatures and similar are hardly a significant part of the lines of code of the average Rust program.

But in any case, you've already seen the replies elsewhere in the thread inviting you to try if you feel confident you can do so.

> You don't like putting on a show for a crowd? It's one of the funnest things.

You're certainly showing the crowd something about yourself. Whether it's what you're intending is another question.

If you want to write a replacement or alternative for a tool, especially as an exercise in learning something, by all means do; it's a fun pastime. You don't need to dismiss someone else's work or choice of language in the process.

shawn on Sept 9, 2018 | | | [–]

If it sounded like I was dismissing someone else's work, you're reading too far into it. Who would be silly enough to dismiss a tool from the author of ripgrep?

JoshTriplett on Sept 9, 2018 | | | [–]

Claiming you can implement a version in a weekend and match the same performance is quite dismissive.

Superficially counting the lines of code in the top-level project (ignoring everything else) and implying that it's "just" 4000 lines of code (as though that's a full description of the effort that went into it) is also quite dismissive.

shawn on Sept 9, 2018 | | | [–]

It wasn't dismissive, it was foolish. The CSV parser is actually a separate project, and is around 15k lines of code. That certainly won't be done in a weekend.

Look, it's stellar, A+ software. All I was saying is that you can write it in a dynamic language without sacrificing performance. The goal wasn't to match the full functionality of XSV; that'd be absurd.

In some cases, LuaJIT is even faster than C. It's not an outlandish claim to say that it could match.

The Python claim was in the spirit of good fun, but that probably didn't come across.

Either way, software is meant to be fun. It's a positive statement to say that a dynamic language can match the performance of a statically typed one. Isn't that a cool idea, worth exploring? Why is it true?

The reason I'm confident in that claim is because LuaJIT has withstood the test of time and has repeatedly proven itself. This reduces to the old argument of static types vs lack of types. But a lack of typing was exactly why Lisp was so powerful, back in the day, and why a small number of programmers could wipe the floor vs large teams.

Either way, I've managed to stir the hive, so I'll leave this for whatever it is. To be clear: XSV is awesome software, and I never said otherwise.

burntsushi on Sept 9, 2018 | | | [–]

The LuaJIT idea is interesting, I've certainly been impressed by it in the past, and can agree it is to some extent something that dispels myths like "statically typed languages are always faster than unityped languages." But if you instead interpret that as a first approximation, then it's fairly accurate IMO.

In the interest of cutting to the chase, I'll try to explain some of the high level ideas of why the CSV parser is fast, and typically faster than any other CSV parser I've come across.

Firstly, it is implemented by a hand-rolled DFA that is built from an NFA. The NFA is typically what most robust CSV parsers use, and it is quite fast, but it suffers from the overhead of moving through epsilon transitions and handling case analysis that is part of the configuration of the parser (i.e., delimiter, quote, escaping rules, etc.). It seems to me like this concept could be carried over to LuaJIT.

Secondly, the per-byte overhead of the DFA is very low, and even special cases[1] some transitions to get the overhead even lower. If you were doing this in pure Python or Lua or really any unityped language, I would be very skeptical that you could achieve this because of all the implicit boxing that tends to go on in those languages. Now, if you toss a JIT in the mix, I kind of throw my hands up. Maybe it will be good enough to cut through the boxing that would otherwise take place. From what I've heard about Mike Pall, it wouldn't surprise me! If the JIT fails at this, I'm not sure how I'd begin debugging it. I kind of imagine it's like trying to convince a compiler to optimize a segment of code in a certain way, but only harder.

Thirdly, a critical aspect of keeping things fast that bubbles all the way up into the xsv application code itself is the amortization of allocation. Namely, when xsv iterates over a CSV file, it reuses the same memory allocation for each record[2]. If you've written performance sensitive code before, then this is amateur hour, but I personally have always struggled to get these kinds of optimizations in unityped languages because allocation is typically not a thing they optimize for. Can a JIT cut through this? I don't know. I'm out of my depth. But I can tell you one thing for sure: in languages like Rust, C or C++, amortizing allocation is a very common thing to do. It is straight-forward and never relies on the optimizer doing it for you. There are some different angles to take here though. For example, unityped languages tend to be garbage collected, and in that environment, allocations can be faster which might make amortization less effective. But I'm really waving my hands here. I'm just vaguely drawing on experience.

Anyway, I think it's kind of counter productive to try to play the "knows better than the hivemind" role here. There are really good solid reasons why statically typed languages tend to out-perform unityped languages, and just because there is a counter example in some cases doesn't make those reasons any less important. I think I could also construct an argument around how statically typed languages make it easier to reason about performance, but I don't quite know how to phrase it. In particular, at the end of the day, both cases wind up relying on some magic black box (a compiler's optimizer or a JIT), but I'm finding it difficult to articulate why that isn't the full story.

[1] - https://github.com/BurntSushi/rust-csv/blob/546291a0095a2537...

[2] - https://github.com/BurntSushi/xsv/blob/9574d89634031259802dd...

baq on Sept 9, 2018 | | | [–]

Just wanted to say that you ought to be paid for your comments in threads about your tools, they're so good. Thanks!

burntsushi on Sept 9, 2018 | | | | [–]

My productivity doesn't come from writing software. It comes from reading its code and maintaining it. You can pry my types out of my cold dead hands. :-)

How long it takes you to do this largely depends on how much you can leverage your language's ecosystem. If you don't have a robust and fast CSV parser already written for you, then you'd need to sink many weekends into that alone.

smittywerben on Sept 9, 2018 | | | | [–]

I hope this is a joke because I expected Python but got unix pipes.

sanderjd on Sept 9, 2018 | | | | [–]

You should definitely do this. Personally, I strongly suspect you wouldn't prove him wrong if you did attempt this in any of the languages you mentioned. But if you're right and we're wrong, I'd love it! It would be great and eye-opening to dig into your implementation(s) to see how you pulled it off.

burntsushi on Sept 9, 2018 | | | | [–]

Here's an example using csvkit:

    $ time xsv stats --everything /tmp/nfl_all_plays.csv > stats.csv
    real    5.723
    user    14.390
    sys     1.914
    
    $ time csvstat /tmp/nfl_all_plays.csv
    ^C after 2.5 minutes

Here's the data: https://burntsushi.net/stuff/nfl_all_plays.csv

It's only 74MB. Let's take a smaller slice to see how long csvstat really takes. This is just the first 10,000 records, which is ~3MB: https://burntsushi.net/stuff/nfl_all_plays_small.csv

    $ time csvstat /tmp/nfl_all_plays_small.csv > /tmp/stats.csv
    
    real    1:01.85
    user    1:01.70
    sys     0.103
    
    $ time xsv stats --everything /tmp/nfl_all_plays_small.csv > /tmp/stats.csv
    
    real    0.308
    user    0.576
    sys     0.071

Now technically, csvstat is doing more work in that it seems to be computing a frequency table as well. But we can just do the same for xsv and add the time, with the knowledge that it would be faster if it were coupled into `xsv stats`:

    $ time xsv frequency /tmp/nfl_all_plays_small.csv > /tmp/frequency.csv
    real    0.251
    user    0.187
    sys     0.063

Now let's see how xsv fairs on a much larger sample, which is just nfl_all_plays.csv repeated 10 times and is ~800MB:

    $ ls -lh /tmp/nfl_all_plays_huge.csv
    -rw-r--r-- 1 andrew users 806M Sep  8 20:34 /tmp/nfl_all_plays_huge.csv
    
    $ time xsv index /tmp/nfl_all_plays_huge.csv
    
    real    2.041
    user    1.876
    sys     0.163
    
    $ time xsv stats --everything /tmp/nfl_all_plays_huge.csv > /tmp/stats.csv
    
    real    28.336
    user    4:36.45
    sys     24.212
    
    $ time xsv frequency /tmp/nfl_all_plays_huge.csv > /tmp/frequency.csv
    
    real    6.077
    user    1:16.51
    sys     1.873

That indexing step lets xsv do its processing in parallel. Good luck doing that in Python without blowing your memory budget. :-) csvkit would either take hours to handle that much data or would more likely run out of memory.

With that said, I was able to write a Python program that just counted records within an order of magnitude of `xsv count`, but it was still a few times slower.

shawn on Sept 9, 2018 | | | [–]

Mm, using someone else's parser would defeat the spirit of the challenge. I think xsv is worthwhile for being a robust parser, not necessarily for its performance. And my claim is that you'd be able to write it faster, without trading any security guarantees, in Lua, without sacrificing much performance.

There's that pesky word, "much" performance. And that's really the interesting part here. How much would you trade away by shedding Rust? My hypothesis is less than 50% additional overhead.

Thanks for providing a dataset. I think LuaJIT will match these stats, and it's a good baseline to start with.

But yes, the CSV parser is around 15k lines. That'd be the trickiest part.

burntsushi on Sept 9, 2018 | | | [–]

xsv doesn't have its own csv parser, it uses a Rust library to parse csv[1], which is almost 4 times the size of xsv itself. I just happen to have written it.

In any case, it would be fun to see an implementation in LuaJIT, especially if you did the CSV parser as well. Although, I think that takes you well outside a weekend project unless you cheat. :-) I don't know the performance characteristics of LuaJIT, but I assume they are better than Python's. I don't know how much better. In any case, this challenge was much more interesting to me when you were talking about Python.

Also, I don't really care about a claim that says you could write it faster. That's borderline meaningless in my work unless you're talking about an order of magnitude difference, and I sincerely doubt that.

[1] - https://crates.io/crates/csv

shawn on Sept 9, 2018 | | | [–]

Also, I don't really care about a claim that says you could write it faster. That's borderline meaningless in my work unless you're talking about an order of magnitude difference, and I sincerely doubt that.

Ah, fair point. If there is no benefit to writing software faster, then yes, the discussion is moot.

Apologies if it sounded like I was being a dick. I meant to come across as a “player 2 has entered the game,” but it probably just sounded annoying.

I’ve been reimplementing some C projects in LuaJIT (more specifically, a dialect of Lisp that compiles to LuaJIT), and it certainly feels an order of magnitude less overhead to get work done. Perhaps it would be interesting to bind the CSV crate to LuaJIT, and then do a direct translation of XSV. The original discussion was about CLI tools, which is one area that scripting languages excel in, and isn’t necessarily enhanced by the benefits of static typing.

Creationix has done a lot of excellent work in popularizing LuaJIT’s benefits: https://github.com/luvit/luvit

burntsushi on Sept 9, 2018 | | | [–]

> I’ve been reimplementing some C projects in LuaJIT (more specifically, a dialect of Lisp that compiles to LuaJIT), and it certainly feels an order of magnitude less overhead to get work done.

It is interesting how perspectives change things, because I wouldn't be altogether surprised by that actually. I've used both C and Lua quite a bit myself (although not too much recently), and I can believe that. But if you substituted C for Rust, then I would look at it with great skepticism. That is, I don't see it as types-vs-not-types, but memory-safety-vs-not-memory-safety in addition to having a ecosystem that is very easy to draw on in a granular way.

And I didn't think you were too annoying. :-) Building an ffi layer over the csv crate would be cool. It would probably be easiest to do it over csv-core, and then build up the convenience abstractions on the C side (or perhaps even at the Lua level) and therefore avoid the csv crate entirely. csv-core will be simpler because it is designed to work in `no_std` environments, which means it never uses platform specific routines and never allocates. Still though, you'll probably need to find a way to amortize allocation on the Lua side.

> and isn’t necessarily enhanced by the benefits of static typing

Yeah, I mean I obviously very very very very strongly disagree. We are clearly polar opposites here. If I never write anything substantial in a unityped language again, it will be too soon.

rustyboy on Sept 9, 2018 | | | | [–]

> writing faster

Has to be one of the most arbitrary benefits to programming, everyone is different how do you possibly use this as a comparable metric?

Even if, typing isn't an evil. I write Python for my day job and extensively use the typing libraries because it ends up saving me more time in long run.

GordonS on Sept 9, 2018 | | | | [–]

> Python is installed literally everywhere

Nope, not on Windows.

mmirate on Sept 9, 2018 | | | | [–]

Which Python? 2 or 3?

coldtea on Sept 9, 2018 | | | [–]

(1) It matters to those who follow the language's progress, because it points to increased adoption.

(2) It matters for tools that, like csvkit and ripgrep, make parts of their implementation available as libraries.

(3) It matters because e.g. Rust is both fast as C/C++ and safe-by-default for more things than C/C++, so one can expect xsv to be faster than e.g. the Python based csvkit.

(4) Startup costs, static binary, etc, things that are strongly influenced by the language used, matter.

(5) It matters for those wanting to contribute to the tool, since it's also open source.

(6) It matters because we are programmers here on HN, and we want to know about the internals of the programs we use, check their implementation, learn from their coding style, etc., not just to use them as mere consumers.

Should I go on?

lilydjwg on Sept 9, 2018 | | | [–]

Because:

* Some languages don't run on some platforms, e.g. C# doesn't work well on Linux.

* Some languages are hard to build (for real projects), e.g. C or C++ projects, Python with C extensions on Windows. I don't have a good time to build Haskell projects (on Linux) either.

* Some languages require a huge runtime (either install seperately or bundled), e.g. Java. They take a lot of time to download and install, and occupy a lot of disk space and memory.

mwcampbell on Sept 9, 2018 | | | [–]

> C# doesn't work well on Linux.

Not true anymore. .NET Core, including ASP.NET Core, works just fine on Linux.

k__ on Sept 9, 2018 | | | | [–]

I just read good about Mono.

xnyan on Sept 9, 2018 | | | [–]

Mono is great. .NET core is also underrated to the detriment of all imo.

nickm12 on Sept 9, 2018 | | | [–]

The language tells you a great deal about whether the tool will fit in with your other tools in just a word or two. "command-line" here tells you a bunch as well, but there are command-line tools that require a runtime env that you might not want to deploy and maintain just for one tool.

sullyj3 on Sept 9, 2018 | | | [–]

Well for starters you know it's much less likely to leak memory, segfault, or experience buffer overflows. Not necessarily relevant to xsv specifically, but speaking of rust tools more generally.

_ph_ on Sept 9, 2018 | | | [–]

From the perspective of the program: you might get a hint about the speed and robustness, knowing it is written in Rust.

But what is more important: for all, who are interested in programming languages, there is no better showcase for a tool, in this case a programming language, than to have real-world examples of useful things created with it.

groestl on Sept 9, 2018 | | | [–]

Besides the ops-story and culture a language implies, if you rally for contributions to your free software'd tool, you can use the language as part of the marketing message to make sure your users also like your language (and therefore are more likely to contribute).

sh4z on Sept 9, 2018 | | | [–]

Well, my guess is it matters for the upvotes the post receives on HN - hence you're seeing a lot of these posts. Leave out "written in X" and you'll be missing out on upvotes from the X community.

shawn on Sept 8, 2018 | | | [11 more]

It doesn’t. It’s one of the most annoying aspects of Rust that “Written in Rust” is considered an attractive quality.

But, whatever. There’s no reason to be gratuitously negative. I’m just a bit salty that lesser known languages are often excluded on the basis that they’re lesser known.

pests on Sept 8, 2018 | | | [–]

It does matter. Not only for the reasons my sibling comment states but others as well.

Such as the safety/security guarantees Rust makes by default.

I would trust any binary decoder that is written in Rust more than the equivalent written in C.

leshow on Sept 9, 2018 | | | [–]

You shouldn't, and I write Rust all the time. Don't implicitly trust it because it's written in Rust. There could still be errors, there could be uses of unsafe, etc.

pests on Sept 9, 2018 | | | [–]

I just said more than C not an absolute trust level or trust just because it's Rust. I do know Rust very well myself.

It does have some gotchas for those who think it means instant safety.

For example the interface between safe and unsafe. If you change an established constraint that safe Rust depends on inside an unsafe block then all safe code has to be aware of this change and take it into account. In a way anything could be undefined behavior if you don't.

Or the hole that allowed RAII guard subversion.

leshow on Sept 10, 2018 | | | [–]

> Or the hole that allowed RAII guard subversion.

?

shawn on Sept 8, 2018 | | | | [–]

I’d trust Tarsnap with my life, and it’s written in C. You have to evaluate the merits of an individual project, not blindly put your faith in their security claims.

I say this as a former pentester.

sullyj3 on Sept 9, 2018 | | | [–]

https://www.mckinsey.com/business-functions/strategy-and-cor...

In general, when you write C, there are a multitude of ways to introduce memory related vulnerabilities if you don't know what you're doing, and historically and empirically speaking, even if you do (heartbleed comes to mind, but I'm sure 5 minutes of research would turn up hundreds of examples).

Safe rust prevents you from double freeing, dereferencing null, leaking memory, and a whole bunch of memory related bugs/potential vulnerabilities.

Now, programmers are human. We make mistakes. Why, a priori, should anyone trust the programmers of one particular project to be superhuman and never screw up? That seems much more like blind faith to me than trusting a language which has been specifically designed to eliminate these kinds of errors.

shawn on Sept 9, 2018 | | | [–]

This question reduces to "Why trust Tarsnap?"

Colin is a renowned security expert and it's known how he handles vulnerabilities: http://www.daemonology.net/blog/2011-01-18-tarsnap-critical-...

After that incident, the probability goes way up that it won't happen again.

Trust credibility, not tools.

Also, Tarsnap has had a long-standing bug bounty program, which is another reason to trust it.

TheDong on Sept 9, 2018 | | | [–]

Why can't I trust credibility and tools?

I do trust tarsnap, but if Colin wrote tarsnap in safe rust, I'd trust it even more.

Similarly, I can trust a library or program is less likely to exhibit crashes and memory issues if it's written in rust, and that allows the developers to be less renowned security experts while still producing usable code.

Your argument that C programs can be trusted to be memory safe and contain minimal bugs IFF they're written by experts doesn't prove the point that language's are meaningless for gaining trust.

I trust a combination of credibility and tools. If someone tells me they ran a fuzzer over their http server, I'll trust it more than if they say they didn't. That's not credibility, that's tools.

Both credibility and tools have their place. This isn't some black and white issue like you're portraying it.

lifthrasiir on Sept 9, 2018 | | | | [–]

> Also, Tarsnap has had a long-standing bug bounty program, which is another reason to trust it.

This is not necessarily a reason to trust Tarsnap. It is just an (rather weak) indication of the security-oriented process and nothing more. Probably you would want to say security track records (inferred from the bug bounty program), which would be a good reason to trust it.

pests on Sept 9, 2018 | | | | [–]

I trust taranap with my life as well.

I did not claim C was unsafe or that safe programs could not or have not been written in C.

I'd just trust it more if it was in Rust.

adamnemecek on Sept 8, 2018 | | [–]

Burntsushi is like the most productive programmer.

person_of_color on Sept 8, 2018 | | [–]

BurntSushi will soon be BurntOut ;)

safgasCVS on Sept 9, 2018 | | [–]

Thank you for this! I've been looking for windows equivalent to csvkit and this is just the ticket

convivialdingo on Sept 8, 2018 | | [–]

Really nice!

I think I’ve written at least a couple of these in some basic form myself. At some point in a business someone is going to hand you CSV or excel dataset and you end up having to deal with it.

sunpazed on Sept 9, 2018 | | [–]

Have been using xsv (in combination with any-json, and jq) for a few months wrangling big csv files at work. Found it to be a better / faster process than rolling my own code.

fithisux on Sept 9, 2018 | | [–]

I have used the simple and beautiful tool

https://github.com/Clever/csvlint

It is in golang.

adiusmus on Sept 9, 2018 | | [–]

Good to have another tool for csv debris management. Especially those multigigabyte gifts that need to be in the database “yesterday”. This has happened several days or weeks in a row more times than I can count. And no, people won’t provide such things in sqlite or something sane.

I feel like I should also recommend this: https://digital-preservation.github.io/csv-validator/

I think something like this rewritten in go would be great.

burntsushi on Sept 9, 2018 | | [–]

> I think something like this rewritten in go would be great.

Why not use it how it is? There are static binaries provided on github.

dj43nq on Sept 9, 2018 | | | [–]

Isn’t that Java? Not everyone wants java on their server, I’d like something in c or rust. Good project for a few weekends work.

gregw2 on Sept 9, 2018 | | [–]

no CSV diff capability? find which columns are different, which rows? ability to ignore spaces or leading/trailing 0s or quote marks?

burntsushi on Sept 9, 2018 | | [–]

I imagine you could get pretty far with

    diff <(xsv input old.csv) <(xsv input new.csv)

In particular, xsv prints valid CSV, and quoting rules will be applied consistently so that they can be diffed. Ignoring spaces and 0s might not work though.

If you'd like to file an issue with a specific use case (sample data from a real world problem would be great), then that would be appreciated!

rixed on Sept 9, 2018 | | [–]

Also relevant, this forgotten gem:

http://www.rseventeen.com/

jancsika on Sept 9, 2018 | | [–]

Ok, I'm curious and may have follow-ups:

What exactly is being checksummed in cargo.lock? Is it the source code, a binary, something else?

e12e on Sept 9, 2018 | | [–]

https://doc.rust-lang.org/cargo/guide/cargo-toml-vs-cargo-lo...

dbaupp on Sept 9, 2018 | | | [–]

That doesn't seem to talk about the series of "checksum ..." entries in the [metadata] section at the end of the Cargo.lock.

e12e on Sept 10, 2018 | | | [–]

Hm, appears to be an alternate (new? Old?) syntax for:

"Cargo will take the latest commit and write that information out into our Cargo.lock when we build for the first time. That file will look like this:

[[package]] name = "hello_world" version = "0.1.0" dependencies = [ "rand 0.1.0 (git+https://github.com/rust-lang-nursery/rand.git#9f35b8e439eeed..., ]

[[package]] name = "rand" version = "0.1.0" source = "git+https://github.com/rust-lang-nursery/rand.git#9f35b8e439eeed...

You can see that there’s a lot more information here, including the exact revision we used to build."

dbaupp on Sept 10, 2018 | | | [–]

No, the git hashes are still there in the same syntax as described in that document. In fact, the checksum entry of a git dependency seems to be "<none>".

lerax on Sept 9, 2018 | | [–]

Awesome tool. Pretty useful to manage big-scary csv files.

tigrezno on Sept 9, 2018 | | [–]

What about using awk?

burntsushi on Sept 9, 2018 | | [–]

awk doesn't support csv.

tajen on Sept 8, 2018 | | [–]

XML-separated values? Fortunately no: command line program for indexing, slicing, analyzing, splitting and joining CSV files.

ubernostrum on Sept 8, 2018 | | [–]

This comment shouldn't have been downvoted/killed; at the time it was made, the title of the post was completely uninformative, leaving people to guess what on earth "XSV" might be.

watmough on Sept 9, 2018 | | | [–]

XML-separated values? It sounds like some kind of (easily detected) side-band, or an xkcd cartoon.

mastrsushi on Sept 9, 2018 | [–]

I started to write a similar application in C++. Have up because I didn't think people would find it useful. After seeing this in the front page, I feel it might be a good idea to get back in. https://github.com/tfili001/line If I get back into this, CSV parsing will be my next step.

burntsushi on Sept 9, 2018 | [–]

Check out csvmonkey for a really fast C++ csv parser: https://github.com/dw/csvmonkey