>I like how all those rust/go implemented, fast, cross-platform tools are starting to get mainstream. (ripgrep's another great one).
Both xsv and ripgrep are by BurntSushi, and ripgrep was mentioned twice on HN recently, once in a release note thread, IIRC, and another time in the CLI: Improved thread.
This is great. For now, my go-to tool is csvkit (Python), which has all kinds of neat tools. In particular, loading data into databases (csvsql) is just plain awesome. Check it out at https://csvkit.readthedocs.io/en/1.0.3/.
Seems great at first, but how is that better than piping the whole CSV file into SQLite and then doing the processing there? I think CSV is great for data exchange but not so great for data processing.
By using SQLite (or any other DB actually) you can decide which data to index, and write arbitrarily complex queries in a rather understandable language. I think XSV is kind of reinventing the wheel here.
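For reference, the "pipe it into SQLite" route is only a few commands in the sqlite3 shell (a minimal sketch; data.csv and the table name t are placeholders):

$ sqlite3 mydata.db
sqlite> .mode csv
sqlite> .import data.csv t
sqlite> SELECT count(*) FROM t;

If the table doesn't already exist, .import creates it using the CSV's header row as column names, with every column typed as TEXT, which is part of why a hand-written schema (and indexes) can still be worth the extra effort.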
You could. But then you need to write a schema. How do you know what schema to write? You could ask csvkit to do it, but it might not be fast enough on, say, a 40GB CSV file. Or maybe it isn't quite accurate enough. xsv might be the tool you use to figure out what the schema should be.
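As a concrete first pass (the file name is just a placeholder), something like:

$ xsv headers big.csv                        # column names and positions
$ xsv stats --everything big.csv | xsv table # inferred types, min/max, lengths and other per-column summaries

is usually enough to see what a sane schema would look like, and it works on files far larger than you'd want to speculatively load into a database.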
SQLite (or any SQL database) does not cover all use cases. For example, if you need to produce CSV data, and you can fit your transformation into xsv commands, then it might be hard to beat the performance in exchange for the effort you put in.
This is probably an expression problem. Tools don't always neatly fit into orthogonal buckets. If you think in terms of shell pipelines and want to attack the data as it is, then xsv might be good for you. SQLite is, IMO, a pretty massive hammer to apply every single time you want to look at CSV data.
Have to pipe up here and say an enthusiastic / visceral / emotional 'thank you' to BurntSushi for all your great contributions - xsv, rust-csv, ripgrep, etc. - not to mention your superb blog articles.
I'm using xsv to wrangle 500GB of data, and it is phenomenally useful and fast. The main datastore is PostgreSQL and it is great, but there are some things pg just isn't fast enough for (hash joins, anyone?). I have to do a lot of pre-processing, and xsv is incredible for that.
Perhaps the best compliment of all - I am seriously looking at moving from node.js to Rust as my daily data/systems programming language _because_ of how performant and elegant xsv / rust-csv / ripgrep are, and how readable the Rust code is.
I also have renewed respect for the staple olde unix tools - sort, sed, grep, wc, etc.
xsv is a brilliant addition to that canon of unix lore.
Hard to say. I own them both but haven't thoroughly read through either. I would probably go with The Rust Programming Language, though; I suspect you can't go wrong with either.
>The CSV virtual table reads RFC 4180 formatted comma-separated values, and returns that content as if it were rows and columns of an SQL table.
>The CSV virtual table is useful to applications that need to bulk-load large amounts of comma-separated value content. The CSV virtual table is also useful as a template source file for implementing other virtual tables.
>The CSV virtual table is not built into the SQLite amalgamation. It is available as a separate source file that can be compiled into a loadable extension. Typical usage of the CSV virtual table from the command-line shell would be something like this:
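(The usage example on that page looks roughly like this, paraphrasing the SQLite documentation; 'thefile.csv' is its placeholder:)

.load ./csv
CREATE VIRTUAL TABLE temp.t1 USING csv(filename='thefile.csv');
SELECT * FROM t1;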
For non-lookup csv-to-csv transformation, yes, I guess your tool can easily beat any SQL DB since it can work without any need for external storage. That applies to sequential-scan queries too. You make a good point :-)
Regarding the schema now, I'm not sure I get your point. I would argue the queries you want to make already dictate a schema (e.g. if you want to compute an average, use a numeric column).
> Regarding the schema now, I'm not sure I get your point. I would argue the queries you want to make already dictate a schema (e.g. if you want to compute an average, use a numeric column).
Yeah that's fine. I think we're just missing each other, that's all. I'm not sure what kind of CSV data you've worked with, but from the stuff I've seen, it's rarely clean. `xsv frequency` and `xsv stats` are great ways to look at your data as it is. For example, a column might be named as if it were a numeric field. Maybe its first hundred rows are even numeric. But there might be a row where it isn't numeric and isn't even empty, but has some kind of weird value in it. Like, say, 'N/A' or 'null' or 'none' or whatever sentinel value someone decided to use to indicate absence.
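As a made-up illustration (the column name, file name and output values are all hypothetical), a sentinel like that shows up immediately in a frequency table:

$ xsv frequency -s price orders.csv | xsv table
field  value  count
price  19.99  9813
price  24.99  8772
price  N/A    47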
Data types can be richer than just primitives too. For example, one field might be an enumeration of a fixed set of values, and I can think of at least three different ways you might want to represent that in your schema.
I prefer the SQLite option because it's faster when you're doing repeated operations. I load it in once and then run many queries on the same SQLite store. It's very simple and I know I can join etc. with a SQL syntax.
Most definitely, yes. If you set up your indexes correctly in a relational database, then xsv isn't going to come anywhere near matching that efficiency. The only indexing mechanism xsv has is a record index, which lets it parallelize certain workloads across a CSV file. But it doesn't have any per-field indices like relational databases do, so any kind of join done by xsv requires building the index first in memory.
xsv could grow support for field level indices, but it's more work and I don't know whether it's worth it. There's a line to straddle here, because at some point, xsv isn't going to be good enough and you should instead just use a database.
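For reference, a join in xsv looks something like this (column and file names are hypothetical); it just rebuilds its in-memory index on every run instead of persisting per-field indices the way a database does:

$ xsv join id users.csv user_id visits.csv > joined.csv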
As a daily user of xsv, I can't imagine why a CSV processor would need to be competing with SQLite in terms of performance. For me, xsv hits the right niche in 2 very important and frequent use-cases:
1. One-off analyses or tasks involving relatively small datasets, like trimming the output of a command-line Twitter tool and piping it into xargs to, say, unfollow everyone who hasn't tweeted this year, or isn't following me and has fewer than 100 followers. If I were attempting a deeper analysis, SQLite would be better (because of the query syntax), but xsv/csvkit works incredibly well for quick explorations/filtering/sorting. Here's an example using csvkit (haven't quite gotten used to all of xsv's syntax) and the t command-line tool: https://gist.github.com/dannguyen/7213b16b7de79ad9c89fc2297d...
2. Prepping data for importing into SQLite; for very large public datasets, I like using xsv/csvkit to trim columns and filter out rows before importing them into SQLite. This is especially convenient since SQLite doesn't allow for dropping columns once a table is created.
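A typical prep pipeline for that second case looks something like this (the column names and filter pattern are placeholders):

$ xsv select date,state,amount big.csv | xsv search -s state 'CA|NY' > trimmed.csv

and then the usual .mode csv / .import trimmed.csv into SQLite as before.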
If I'm ever working with data big enough to need SQLite-like performance, then I probably need SQLite's query syntax and other features. xsv is already flexible and performant for what I need in my data pipeline.
I don't doubt that sqlite will wipe the floor with xsv for certain types of analysis. I also suspect that the opposite would be true for other types of analysis. Although, you'd know much better than I would - I've used sqlite quite a bit while my only usage of xsv was to work with a 100-line CSV file. xsv was fast, but I'm sure most tools would be at that size. What was killer, however, is that xsv gave me a lot of tools that helped with exactly what I needed to do - sqlite wouldn't have helped at all. Right tool for the right job and all.
What I was more curious about was if the original comment was about a particular case where sqlite beat xsv, or if it was more handwavy. Also, the comment seemed to imply that repeated queries would be faster in sqlite than with xsv but didn't mention anything about indexes. My suspicion would be that without something that can take advantage of an index, xsv and sqlite would perform about the same doing repeated table / file scans.
Would be interesting to see how xsv compares to miller (https://johnkerl.org/miller/doc/index.html) in terms of perf. This tool arrives just as I am about to munge 1TB of gzipped csv files.
Unfortunately, the main operation I need is not supported by xsv...
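(For the operations xsv does support, it reads from stdin, so gzipped files can be streamed through it without being decompressed to disk first; a rough sketch with placeholder file and column names:)

$ zcat part-0001.csv.gz | xsv select ts,user_id,event | xsv sample 100000 > sample.csv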
I like the pager UI of the VisiData utility, written in Python, for exploring data. Since XSV is written in Rust, it could theoretically be imported and used in VisiData.
git clone <xsv github>
cd xsv
cargo build --release
The latter (building from the git clone) gave me a ton of compilation errors in the crossbeam-channel package. The former (`cargo install xsv`) installed the program and compiled a bunch of crates, but did it install a pre-built binary for xsv? "downloading" then "installing" of xsv was the first step, then all the crates were downloaded. ldd seems to report that the installed xsv binary has no dependencies, so I don't see why it would need to download and compile a bunch of crates in that case. On the other hand, if it was compiling xsv then I don't understand why I don't get the same errors as in the latter case.
It seems Ubuntu has the following version:
$ rustc --version
rustc 1.25.0
I realize it's probably not optimal to use the Debian-packaged rustc but I didn't feel like figuring out the whole ecosystem just to install one program to test it out.
Because current master requires a newer version of the Rust compiler than the current release on crates.io. If current master were a release, then `cargo install xsv` would have failed with the same errors.
Ah, ok thanks. Anyways I got it installed, was just wondering about certain actions that cargo took. I found I could copy the resulting binary and delete my entire .cargo directory and it continues to work, so it seems cargo did some unnecessary things for `install`, but no worries.
To add to this: Cargo is more of a development tool for Rust programmers rather than a package manager for end users. Both types of tools have a lot in common, which is why things like `cargo install` exist and are very useful. In the case of `xsv`, it's a useful escape hatch since your distro doesn't package `xsv`, but `cargo install` is not like `apt install`. That is, it downloads all of the source code dependencies of `xsv` and builds them. This also requires sync'ing with crates.io's registry list. None of this is unnecessary in the standard Cargo workflow in order to build xsv, but the build artifacts are certainly unnecessary in order to run xsv.
Development of the Rust language is still happening fairly quickly, so many projects won't compile with the 6 month old compiler version in your Linux distribution's package repository. Most developers install it using rustup.
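A minimal sketch of the usual route (the rustup one-liner is the standard installer at the time of writing):

$ curl https://sh.rustup.rs -sSf | sh   # installs rustup plus a current stable toolchain
$ cargo install xsv                     # fetches and builds the latest release from crates.io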
This looks great, and I'm very keen to try it out.
I have a malformed 115GB CSV which took some work to process with SSIS, but I'm really interested to try again using xsv to split off the bad rows and see if that would have been an easier option.
BurntSushi mentions in the readme that people often receive large csv files needing analysis, but he also mentions that a valid criticism of the tool is that you could use an SQL database. Why can't you use a database in some scenarios? And when blazing in-memory speed is useful, why can't you use something like this: https://github.com/jtablesaw/tablesaw
Is XSV faster? Is it the command line convenience (although surely it was not that convenient to write a new tool to do it)?
Genuinely curious and I do not mean to belittle the project as it looks well implemented and useful nonetheless.
The language itself doesn't have much importance; what matters is:
- Is it portable? Or will users have to install a specific runtime to run the tool? On Linux it's not a big deal because the runtime comes pre-installed in most distributions, but making Python work on Windows requires some work. That's not too much if you're going to use the tool often, but for merely testing it, the runtime installation burden can deter users.
- Is it fast? For many applications, speed doesn't matter, but for some it does.
- Is it stable? Or will it crash every now and then? C and C++ software has memory bugs that lead to crashes (segfaults). The most mainstream programs don't have too many of them, but as soon as you start using niche software, you'll face crashes more often than not.
For these three reasons, the language matters, but there's not much difference whether it's Rust, Go or Haskell.
Python is not hard to package into an exe, but most Python programs aren't packaged like that.
C and C++ are very cross-platform, if you only use cross-platform dependencies and if you are careful to write portable code. Most C and C++ applications don't do that.
Not really, most of the standard library and basic stuff is incompatible on Windows. So sockets, file system calls, event loops, file configuration & default folders... It's never THAT different but different enough that just compiling your code won't work.
> So sockets, file system calls, event loops, file configuration & default folders... It's never THAT different but different enough that just compiling your code won't work.
But if you use something like Qt you've got all of this cross-platform out of the box - I have a C++ program of multiple hundreds of lines of code which does all this and much more; it has almost zero platform-specific stuff except one function at some point to set real-time priority on threads.
So you think it's appropriate for tools like ripgrep and xsv to bundle a gui toolkit to do its job?
I feel like this whole line of argument is disingenuous. In Rust, you don't need to bundle a GUI toolkit to get cross platform stuff that works. The standard library was built with Windows in mind, and it shows. You don't need any additional POSIX layer. You just use the standard library.
If you go and look at competitors to things like xsv and ripgrep that are written in C or C++, they either require some kind of POSIX thing on Windows to run correctly (like cygwin), or they jump through lots of hoops to behave as a native Windows application.
It really should not be a controversial statement to say that Rust has a better cross platform story (Windows/mac/Linux). It was a design goal.
> So you think it's appropriate for tools like ripgrep and xsv to bundle a gui toolkit to do its job?
Qt is not a GUI toolkit, it's a general framework. If you only link to -lQtCore you won't have anything remotely related to GUI in your binary. For instance doxygen is built like this.
For those like me who’ve never used Qt: as long as you don’t statically link the Qt libraries, you don’t have to open source your entire commercial application. I’m not a lawyer.
“(1) If you statically link against an LGPL'd library, you must also provide your application in an object (not necessarily source) format, so that a user has the opportunity to modify the library and relink the application.”
So for LGPL. As a commercial application developer I’m not sure I’d want to do that. I’d then have to field support requests for that object file.
Not at all. You have to respect the license of Qt (e.g. if your customer asks you for the source of the Qt libraries you have to provide them - including any modification you made to them), but you can keep your own app proprietary and ship the whole however you like.
With LGPL, you can't statically link the library though. Which requires your user to install it manually, unless you package your software with some kind of installer (which is not so common on Windows for CLI executables).
> With LGPL, you can't statically link the library though.
no, that's tiresomely false. You can link proprietary code statically with LGPL code. LGPL does not give a shit about static or dynamic libraries because LGPL is not a language-specific license.
Thanks for the link. The nuance I was missing is that you can distribute a statically-linked LGPL library as long as you provide a way for your user to override the chosen library with their own version of the library.
It's very common to ship windows CLI tools as a .zip file with an exe and a bunch of .dlls, that's not exactly burdensome to install compared to only an executable.
It’s not THAT hard to make a CLI C program portable between Windows, Mac and Linux. I’ve done plenty of projects that support all three platforms and the amount of work wasn’t excessive. For some projects, just getting the build environment up was the biggest issue.
Once you use CMake, modular design, etc., a lot of the issues vanish.
It's not difficult to package, but if your experience is anything like mine, distributing a 400+MB zip containing all the dependencies and the python core+stdlib is no fun.
I’ll share a use case where Rust specifically has been a godsend.
I write software for users who have strong engineering backgrounds but tend to know just enough about computers to be dangerous. They tend to know a smattering of Python, Perl, SQL, SAS, R, or Matlab, and will happily use a well-documented CLI tool, but cannot be relied upon to do basic sysadmin tasks effectively. They run a mix of Windows and OSX, and some set up Linux VMs. They are hired for their expertise in their domain, not because they are great IMS technicians. They deal with large, heterogeneous, difficult-to-schematize datasets and value speed, but it is a total waste of their time to troubleshoot the environment/linking issues that frequently accompany building and using C/C++ utilities for many non-programmer users. I’ve found Rust to be hugely beneficial for writing programs to serve this user base (RF engineers in my case, but I’d imagine people who write software for ag/life science people, banking analysts, civil/environmental engineers and others have similar challenges).
* it’s also nice that it’s hard to segfault in Rust. Many existing tools in my domain are notoriously fragile in this regard.
Wow, that sums up my world precisely - also engineering. Especially the part about people knowing a smattering of Perl, Python, SAS and Matlab. Add in C++ (written in circa-early-90s style, i.e. C with "new" and "delete") and Fortran (Fortran 77 style, straight out of the classic Numerical Recipes textbook) and you could be describing my org.
People here have so many issues managing things like library dependencies. This is especially a problem on Windows. I remember witnessing the guy next to me waste hours trying to build and link an image analysis library written in C++. I could feel his frustration from across the partition "This library needs this dependency, which has to build from scratch but before I can do that I have to build library 'zzz' but to do that I have to download something called "CMAKE". I've finally done all that but then I forgot to set a compiler flag correctly in some library 3 steps up the chain and now I have to start all over again. At the end of all that you better be damn sure you were paying attention and didn't accidentally mix some 32 / 64 bit code together which apparently is pretty easy to do when you are downloading random code from various websites."
I think these were instructions he was trying to follow
Stuff like configuring environment variables, editing makefiles, etc. is very difficult for a lot of otherwise smart people to understand. These people have no problem writing code to do quite complex analysis; it's building and distributing it that is quite challenging.
A lot of CLI tools are quick scripts or other programs that will start, execute, and then die quickly.
If a tool is written in an interpreted language or has to load a large VM on each invocation, then a large percentage of the total runtime of the CLI tool will be spent on language-specific tasks.
With a non-CLI tool such as a server or daemon, the expectation is that you start it once and it runs for a while. In these cases startup time does not matter as much.
Furthermore, for interpreted languages like JS/Python/PHP and languages that require a VM such as Java or C#, not everyone will have or want to install each language's runtime.
Maybe I shouldn't be surprised that "written in language without onerous runtime and installation overhead" is apparently now rare enough to be a selling point.
> Languages like Rust and Go are just really easy to cross-compile compared to working with lower level OS-dependent APIs in C or something.
You make it seem like C has no standard library. C is really easy to cross compile: you just add a -arch flag to your compiler frontend and, provided you're staying inside POSIX (which you should!) it should "just work".
C's standard library is completely broken for text handling (depends on external locale instead of caller-specified locale) and on Windows broken for file system access if your paths contain non-ASCII. (At least until the latest dev version of Windows 10.)
Rust has the best file system abstraction for exposing Windows file paths in a way that allows you to write common file path handling application code for Windows and *nix.
For basic load/munge/save stuff (like CSV processing) the C standard library more than suffices, and is cross-platform. You don't need OS-dependent stuff (like POSIX) to do that.
(I'm not arguing C vs. Rust though... just pointing out that "cross-platform" is a much more meaningful descriptor for a CLI tool than "in Rust". The latter just reads like noise to me.)
Slightly semantic argument, but most of those languages are normally considered compiled. They have an interpreter, but it operates on a compiled form of the language. Eg: Java is "compiled", Java Byte Code is "interpreted". Slightly semantic but not completely, as a tremendous amount of optimisation can actually be done by that compile step.
And not really distinguishing, given that most languages have bytecode. This is true even of many "scripting" languages such as Python, PHP, and Ruby.
> as a tremendous amount of optimisation can actually be done by that compile step.
Not really. Look at something like Proguard which (among other things) is the #1 JVM bytecode optimizer. And yet the runtime performance differs very little before and after optimization. As long as you deliver reasonable bytecode, the JIT will capture every optimization and more that you could make to it.
---
At the end of the day, machine code is the thing that actually moves the needle on performance, whether JIT machine code or AOT machine code.
> If a tool is written in an interpreted language or has to load a large VM on each invocation, then a large percentage of the total runtime of the CLI tool will be spent on language-specific tasks.
With a warm cache, ripgrep takes 80-100ms to search 500MB of files across all the projects that I have checked out (for a regex like /val\w+;/ which matches 500 lines). Python starts (also from a warm cache) in 30ms on my machine. 30-40% seems like a large percentage by most measures.
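(Easy to reproduce on your own machine, assuming python3 and ripgrep are installed and caches are warm:)

$ time python3 -c 'pass'          # interpreter startup and exit, nothing else
$ time rg --version > /dev/null   # a compiled tool's startup, for comparison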
In 150ms the other tool would be finished before any of those had even begun. Plus, this is for a print statement; you'd have to factor in dependencies as well to get a real idea of how long it took.
While it is technically possible to optimize dependency relations to minimize the startup cost for minimal tasks, it is very hard to do in reality. In the "oxidation" plan of the Mercurial SCM [1] this problem was directly addressed, and one of the first suggested targets to switch to Rust was an alternative CLI frontend.
Rust and Go are languages with ecosystems, in which writing everything in a cross-platform way is the default. They also compile to a single statically linked binary.
You don't need an additional runtime + load of dependencies like node or python.
They are lightweight and fast to start up, as opposed to the JVM.
Everything you said applies to many other languages, including C, which is probably the most common language used to write CLIs if you rank by hours used. Why not advertise the unique properties of this tool (fast, easy to install) than its implementation language?
Does this ultimately boil down to "go install" (or whatever the Rust equivalent is) is easier to deal with than "./configure; make"?
Maybe I'm just a curmudgeon and don't understand why it seems to be in vogue to ignore OS package managers, which hide all this complexity anyway.
> Does this ultimately boil down to "go install" (or whatever the Rust equivalent is) is easier to deal with than "./configure; make"?
Well, yeah. For one, "./configure; make" doesn't work on Windows. So you need CMake, or SCons, or Boost Jam, or any other cross-platform build system. Then you have different ways of getting dependent libraries, like zlib. In short, C/C++ cross-platform "packaging" is just a mess.
Well, there is the Windows app store these days. Probably no CLI software offered, and no third-party repositories, though. And there's Chocolatey, which may be of use.
The Windows store now includes entire Linux distributions, which come with all their CLI and repository goodness. I don't know about native Windows CLI stuff though - WSL is so good that I've switched pretty much entirely to that (Windows commands can run from Linux Bash when they're occasionally needed).
Then why advertise that? Why not advertise their benefits instead (fast, cross-platform, easy install)? "Written in X" is only useful to advertise pedagogy or integration.
Shorthand. Compare the lengths of these two strings:
> fast, cross-platform, easy install
> Written in Rust
Also note how the second statement expresses a single concept; whereas the first statement is a list of concepts that must be memorized and reproduced by rote, and is therefore constantly at-risk of being stated incompletely or inaccurately.
(And it is incomplete: there are many more reasons why "Written in Rust" might be important, such as correctness, security, and ease of maintenance/refactoring.)
* * * * *
Alternatively: why convey a set of vague descriptors when you can instead convey precisely why they are applicable, and let the audience's pre-existing memories fill in the details for them?
Because it's a post for Hacker News, and we're mostly engineers and developers, so we care to know what it's written in. Rust is an awesome new language with growing popularity and it gets hits whenever it's posted.
Python is installed literally everywhere. If it’s a CLI tool, and it’s written in python, the chance that it won’t work is slim.
Go has a massive number of problems. For example, I tried to run Keybase’s standard “go get” build instructions. It failed with 200 import errors. That was the end of my attempt to install keybase on my raspberry pi. Others had said that it works.
Rust requires a massive amount of hard drive space and takes a long time to build. You also have to build it. That’s antithetical to rapid development.
I can't wait until the pendulum swings back away from static typing and the next generation of programmers discover the benefits of literally ignoring everybody and doing your own thing. It’ll be painful, but at least it’ll be effective. And you won’t have to compile anything.
I figured Ubuntu 18.04 didn't have Python pre-installed. Besides, installing all the dependencies with pip is another step to do and gets annoying when deploying to many servers.
For something that gets distributed, a single static binary is very welcome.
xsv wouldn't exist if it were written in Python. It would definitely be too slow. If you don't care about performance and would rather not wait a couple minutes to build the tool on your Pi, then go use csvkit, which is written in Python. The availability of software isn't a zero-sum game.
I would definitely love to be proven wrong, because if I am, I am certain I would learn something new. I am pretty comfortable with Python, so I am pretty comfortable saying that you could not write a tool as fast as xsv in Python without writing some substantial portion of it in C. Using Python's standard library CSV parser is certainly fair game, but writing more C code on top of that which included the loop over records in a CSV file feels like a cheat. The problem here is that you've already lost because xsv's CSV parser is faster than Python's parser at the C level[1]. :-) Assuming I haven't made any silly gaffes in xsv itself, I don't see how you're going to make up the difference. Because of that, I am willing to extend a handicap: if you can get within the same order of magnitude as xsv while doing the same amount of work in a robust way, then I think I might even learn something there too. :-)
I am less certain about PyPy, LuaJIT or Node, but I am certain I'd learn something if you proved me wrong.
Note that a problematic part of this challenge is that your program would need to correctly handle CSV in a robust manner. It is very easy to write a very fast CSV parser that doesn't correctly handle all of the corner cases. Python's CSV parser certainly passes that test, but I don't know if Node's or Lua's CSV parser does because I've never used them.
Not sure whether data.table is in the same domain as xsv, and certainly a lot of it is written in C. But for comparison's sake:
fread("cities.csv") 1.30 s
And then the rest of the computations will be faster of course:
count -- 0.005 ms
freq -- 0.2 s
sort -- 0.1 s
It's so useful that I often just use CSVs between 10 and 100GB as a database, as the difference in performance between fread and a 'proper' database isn't enough to justify the latter.
Yes. I've used it lightly in the past and have never been impressed by its speed or memory usage. But there could have been user errors. I am not an R expert. :)
In any case, I think R implements all or most of its data table transformations in C, so I don't think it applies well here.
Without indexing, LuaJIT is twice as fast as XSV for 2.6M rows:
$ rm data/*.csv.idx
$ time lumen count.l data/nfl_all_plays_huge.csv
2592975
real 0m1.583s
user 0m1.237s
sys 0m0.311s
$ time xsv count data/nfl_all_plays_huge.csv
2592975
real 0m3.184s
user 0m2.425s
sys 0m0.553s
With indexing, LuaJIT is within an order of magnitude:
$ xsv index data/nfl_all_plays_huge.csv
$ time xsv count data/nfl_all_plays_huge.csv
2592975
real 0m0.019s
user 0m0.009s
sys 0m0.007s
$ time lumen count.l data/nfl_all_plays_huge.csv
2592975
real 0m0.184s
user 0m0.083s
sys 0m0.096s
I'll be implementing test cases to ensure it's catching malformed data.
Nice! Is your `count` command doing CSV parsing? I don't understand how your naive parser takes 400ms to parse nfl_all_plays_small.csv, but counting records is somehow faster. The fact that your `count` program needs to deal explicitly with `\r` makes me very suspicious. :)
Also, counting records with an index isn't particularly interesting, since it's just reading 8 bytes from the index file. I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
In your comment above, you compared your CSV parser to `xsv stats --everything`, but `xsv stats` does a lot more than just CSV parsing. If you want to test how fast xsv takes to parse CSV, then `xsv count` without an index is the way to do it. `xsv` only takes 19ms on my machine to parse nfl_all_plays_small.csv, which is basically within process overhead time.
Also, when you're ready, I would like to be able to run and inspect the code myself. :-)
I warned you above: the key challenge you're going to face is creating a robust CSV parser, and using that to implement every command, including `count`. If that isn't a requirement, then basically all comparisons are unfortunately completely moot.
It's just counting lines and skipping any whose contents are "\r" or blank. I believe this is correct behavior because:
foo,bar,"quux
zap",bang
`xsv count` returns 0 for this.
Is there any situation where csv fields can contain literal newline characters? (ascii value 10.)
Will post code fairly soon. There aren't any tricks. I just implemented slice as well.
> Also, when you're ready, I would like to be able to run and inspect the code myself. :-)
Certainly!
EDIT: Ah, user error. CSVs can indeed contain literal newlines, and XSV handles that. I'll switch it to parse doublequoted strings and add some tests.
One simplification: if a line contains N commas, where N matches the number of columns minus one, then there's no need to parse it for count, slice, etc.
> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.
Indeed. xsv returned 0 because it interprets the first record as a header row by default.
Counting is interesting, because you don't have to implement unescaping to do it, but any robust csv parser will do it. So if you write two different versions of a csv parser, one for normal reading and one just for counting, then the one for counting can go faster and you'll avoid the need to amortize allocation. It's a nice trick though! However, I was using `xsv count` as a proxy for CSV parsing. So if you're just going to not do actual CSV parsing, then your solution is much less interesting. :-)
> I would definitely be curious to know why your program is taking 184ms though. That isn't startup time, is it?
> It's actually the time it takes to load in a C function to swap big-endian uint64 to little-endian.
Holy moly. Really? Do you know why it takes so long? That's a showstopper...
Agreed, though I'm mainly seeing how quickly I can reimplement everything xsv has to offer without sacrificing performance. I don't consider the challenge finished until, as you say, it handles all of the corner cases.
EDIT: I'm actually not sure what's going on with the startup time, since it's usually fast. I have quite a few windows open, which might be affecting results. (xsv count is suddenly taking ~70ms for simple data, so I think a reboot is in order.)
To clarify, I was mainly just excited to share some WIP. There's still a lot of work left to do to cross the finish line.
> This would be a fun weekend project to reimplement XSV in Python and prove this wrong.
I don't know as much about the internals or performance characteristics of XSV (though it certainly touts performance as a feature), but if you can reimplement ripgrep in Python and get anywhere close to the same performance, I'd certainly be interested to see that.
> Now, to make this a fair comparison, are you excluding pypy? Or is that allowed for our game?
PyPy is not available everywhere, unlike the CPython runtime or the ability to run compiled binaries.
It's perfectly fine to say "this is an impressive project. I don't understand why it couldn't be done in Python. I would love for someone to explain that to me. Thanks!"
It is not necessary to dismissively declare that some substantial piece of work could be implemented better in just a weekend.
It's 4k lines of Rust. Shedding the static typing nonsense will get rid of at least 25% of that. Writing it in Lumen will buy an extra 2x in productivity. And there's nothing to discover; the algorithms are right there, and my claim is that they will run nearly as fast in a non-statically-typed language. I don't think the weekend claim is that outrageous.
You don't like putting on a show for a crowd? It's one of the funnest things.
First of all, take a look at Cargo.toml for the list of dependencies; repeat recursively. Projects like xsv and ripgrep are modular, with many components that others can and do reuse.
Second, lines of code hardly gives any but the roughest idea of how hard something would be to write, and write well.
Third, interesting that you're not counting the test cases; after all, if you're not doing any static typing, surely you'll want more tests...
Fourth, hey, as long as you're getting rid of the "static typing nonsense" you might as well drop the error handling and comments while you're at it. More seriously, though, type signatures and similar are hardly a significant part of the lines of code of the average Rust program.
But in any case, you've already seen the replies elsewhere in the thread inviting you to try if you feel confident you can do so.
> You don't like putting on a show for a crowd? It's one of the funnest things.
You're certainly showing the crowd something about yourself. Whether it's what you're intending is another question.
If you want to write a replacement or alternative for a tool, especially as an exercise in learning something, by all means do; it's a fun pastime. You don't need to dismiss someone else's work or choice of language in the process.
If it sounded like I was dismissing someone else's work, you're reading too far into it. Who would be silly enough to dismiss a tool from the author of ripgrep?
Claiming you can implement a version in a weekend and match the same performance is quite dismissive.
Superficially counting the lines of code in the top-level project (ignoring everything else) and implying that it's "just" 4000 lines of code (as though that's a full description of the effort that went into it) is also quite dismissive.
It wasn't dismissive, it was foolish. The CSV parser is actually a separate project, and is around 15k lines of code. That certainly won't be done in a weekend.
Look, it's stellar, A+ software. All I was saying is that you can write it in a dynamic language without sacrificing performance. The goal wasn't to match the full functionality of XSV; that'd be absurd.
In some cases, LuaJIT is even faster than C. It's not an outlandish claim to say that it could match.
The Python claim was in the spirit of good fun, but that probably didn't come across.
Either way, software is meant to be fun. It's a positive statement to say that a dynamic language can match the performance of a statically typed one. Isn't that a cool idea, worth exploring? Why is it true?
The reason I'm confident in that claim is because LuaJIT has withstood the test of time and has repeatedly proven itself. This reduces to the old argument of static types vs lack of types. But a lack of typing was exactly why Lisp was so powerful, back in the day, and why a small number of programmers could wipe the floor vs large teams.
Either way, I've managed to stir the hive, so I'll leave this for whatever it is. To be clear: XSV is awesome software, and I never said otherwise.
The LuaJIT idea is interesting, I've certainly been impressed by it in the past, and can agree it is to some extent something that dispels myths like "statically typed languages are always faster than unityped languages." But if you instead interpret that as a first approximation, then it's fairly accurate IMO.
In the interest of cutting to the chase, I'll try to explain some of the high level ideas of why the CSV parser is fast, and typically faster than any other CSV parser I've come across.
Firstly, it is implemented by a hand-rolled DFA that is built from an NFA. The NFA is typically what most robust CSV parsers use, and it is quite fast, but it suffers from the overhead of moving through epsilon transitions and handling case analysis that is part of the configuration of the parser (i.e., delimiter, quote, escaping rules, etc.). It seems to me like this concept could be carried over to LuaJIT.
Secondly, the per-byte overhead of the DFA is very low, and even special cases[1] some transitions to get the overhead even lower. If you were doing this in pure Python or Lua or really any unityped language, I would be very skeptical that you could achieve this because of all the implicit boxing that tends to go on in those languages. Now, if you toss a JIT in the mix, I kind of throw my hands up. Maybe it will be good enough to cut through the boxing that would otherwise take place. From what I've heard about Mike Pall, it wouldn't surprise me! If the JIT fails at this, I'm not sure how I'd begin debugging it. I kind of imagine it's like trying to convince a compiler to optimize a segment of code in a certain way, but only harder.
Thirdly, a critical aspect of keeping things fast that bubbles all the way up into the xsv application code itself is the amortization of allocation. Namely, when xsv iterates over a CSV file, it reuses the same memory allocation for each record[2]. If you've written performance sensitive code before, then this is amateur hour, but I personally have always struggled to get these kinds of optimizations in unityped languages because allocation is typically not a thing they optimize for. Can a JIT cut through this? I don't know. I'm out of my depth. But I can tell you one thing for sure: in languages like Rust, C or C++, amortizing allocation is a very common thing to do. It is straight-forward and never relies on the optimizer doing it for you. There are some different angles to take here though. For example, unityped languages tend to be garbage collected, and in that environment, allocations can be faster which might make amortization less effective. But I'm really waving my hands here. I'm just vaguely drawing on experience.
Anyway, I think it's kind of counter productive to try to play the "knows better than the hivemind" role here. There are really good solid reasons why statically typed languages tend to out-perform unityped languages, and just because there is a counter example in some cases doesn't make those reasons any less important. I think I could also construct an argument around how statically typed languages make it easier to reason about performance, but I don't quite know how to phrase it. In particular, at the end of the day, both cases wind up relying on some magic black box (a compiler's optimizer or a JIT), but I'm finding it difficult to articulate why that isn't the full story.
My productivity doesn't come from writing software. It comes from reading its code and maintaining it. You can pry my types out of my cold dead hands. :-)
How long it takes you to do this largely depends on how much you can leverage your language's ecosystem. If you don't have a robust and fast CSV parser already written for you, then you'd need to sink many weekends into that alone.
You should definitely do this. Personally, I strongly suspect you wouldn't prove him wrong if you did attempt this in any of the languages you mentioned. But if you're right and we're wrong, I'd love it! It would be great and eye-opening to dig into your implementation(s) to see how you pulled it off.
$ time xsv stats --everything /tmp/nfl_all_plays.csv > stats.csv
real 5.723
user 14.390
sys 1.914
$ time csvstat /tmp/nfl_all_plays.csv
^C after 2.5 minutes
$ time csvstat /tmp/nfl_all_plays_small.csv > /tmp/stats.csv
real 1:01.85
user 1:01.70
sys 0.103
$ time xsv stats --everything /tmp/nfl_all_plays_small.csv > /tmp/stats.csv
real 0.308
user 0.576
sys 0.071
Now technically, csvstat is doing more work in that it seems to be computing a frequency table as well. But we can just do the same for xsv and add the time, with the knowledge that it would be faster if it were coupled into `xsv stats`:
$ time xsv frequency /tmp/nfl_all_plays_small.csv > /tmp/frequency.csv
real 0.251
user 0.187
sys 0.063
Now let's see how xsv fares on a much larger sample, which is just nfl_all_plays.csv repeated 10 times and is ~800MB:
$ ls -lh /tmp/nfl_all_plays_huge.csv
-rw-r--r-- 1 andrew users 806M Sep 8 20:34 /tmp/nfl_all_plays_huge.csv
$ time xsv index /tmp/nfl_all_plays_huge.csv
real 2.041
user 1.876
sys 0.163
$ time xsv stats --everything /tmp/nfl_all_plays_huge.csv > /tmp/stats.csv
real 28.336
user 4:36.45
sys 24.212
$ time xsv frequency /tmp/nfl_all_plays_huge.csv > /tmp/frequency.csv
real 6.077
user 1:16.51
sys 1.873
That indexing step lets xsv do its processing in parallel. Good luck doing that in Python without blowing your memory budget. :-) csvkit would either take hours to handle that much data or would more likely run out of memory.
With that said, I was able to write a Python program that just counted records within an order of magnitude of `xsv count`, but it was still a few times slower.
Mm, using someone else's parser would defeat the spirit of the challenge. I think xsv is worthwhile for being a robust parser, not necessarily for its performance. And my claim is that you'd be able to write it faster, without trading any security guarantees, in Lua, without sacrificing much performance.
There's that pesky word, "much". And that's really the interesting part here. How much would you trade away by shedding Rust? My hypothesis is less than 50% additional overhead.
Thanks for providing a dataset. I think LuaJIT will match these stats, and it's a good baseline to start with.
But yes, the CSV parser is around 15k lines. That'd be the trickiest part.
xsv doesn't have its own CSV parser; it uses a Rust library to parse CSV[1], which is almost 4 times the size of xsv itself. I just happen to have written it.
In any case, it would be fun to see an implementation in LuaJIT, especially if you did the CSV parser as well. Although, I think that takes you well outside a weekend project unless you cheat. :-) I don't know the performance characteristics of LuaJIT, but I assume they are better than Python's. I don't know how much better. In any case, this challenge was much more interesting to me when you were talking about Python.
Also, I don't really care about a claim that says you could write it faster. That's borderline meaningless in my work unless you're talking about an order of magnitude difference, and I sincerely doubt that.
> Also, I don't really care about a claim that says you could write it faster. That's borderline meaningless in my work unless you're talking about an order of magnitude difference, and I sincerely doubt that.
Ah, fair point. If there is no benefit to writing software faster, then yes, the discussion is moot.
Apologies if it sounded like I was being a dick. I meant to come across as a “player 2 has entered the game,” but it probably just sounded annoying.
I’ve been reimplementing some C projects in LuaJIT (more specifically, a dialect of Lisp that compiles to LuaJIT), and it certainly feels an order of magnitude less overhead to get work done. Perhaps it would be interesting to bind the CSV crate to LuaJIT, and then do a direct translation of XSV. The original discussion was about CLI tools, which is one area that scripting languages excel in, and isn’t necessarily enhanced by the benefits of static typing.
> I’ve been reimplementing some C projects in LuaJIT (more specifically, a dialect of Lisp that compiles to LuaJIT), and it certainly feels an order of magnitude less overhead to get work done.
It is interesting how perspectives change things, because I wouldn't be altogether surprised by that actually. I've used both C and Lua quite a bit myself (although not too much recently), and I can believe that. But if you substituted C for Rust, then I would look at it with great skepticism. That is, I don't see it as types-vs-not-types, but memory-safety-vs-not-memory-safety, in addition to having an ecosystem that is very easy to draw on in a granular way.
And I didn't think you were too annoying. :-) Building an ffi layer over the csv crate would be cool. It would probably be easiest to do it over csv-core, and then build up the convenience abstractions on the C side (or perhaps even at the Lua level) and therefore avoid the csv crate entirely. csv-core will be simpler because it is designed to work in `no_std` environments, which means it never uses platform specific routines and never allocates. Still though, you'll probably need to find a way to amortize allocation on the Lua side.
> and isn’t necessarily enhanced by the benefits of static typing
Yeah, I mean I obviously very very very very strongly disagree. We are clearly polar opposites here. If I never write anything substantial in a unityped language again, it will be too soon.
Has to be one of the most arbitrary benefits in programming; everyone is different, so how do you possibly use this as a comparable metric?
Even so, typing isn't an evil. I write Python for my day job and extensively use the typing libraries because it ends up saving me more time in the long run.
(1) It matters to those who follow the language's progress, because it points to increased adoption.
(2) It matters for tools that, like csvkit and ripgrep, make parts of their implementation available as libraries.
(3) It matters because e.g. Rust is both as fast as C/C++ and safe-by-default for more things than C/C++, so one can expect xsv to be faster than e.g. the Python-based csvkit.
(4) Startup costs, static binary, etc, things that are strongly influenced by the language used, matter.
(5) It matters for those wanting to contribute to the tool, since it's also open source.
(6) It matters because we are programmers here on HN, and we want to know about the internals of the programs we use, check their implementation, learn from their coding style, etc., not just to use them as mere consumers.
* Some languages don't run on some platforms, e.g. C# doesn't work well on Linux.
* Some languages are hard to build (for real projects), e.g. C or C++ projects, or Python with C extensions on Windows. I don't have a good time building Haskell projects (on Linux) either.
* Some languages require a huge runtime (either installed separately or bundled), e.g. Java. They take a lot of time to download and install, and occupy a lot of disk space and memory.
The language tells you a great deal about whether the tool will fit in with your other tools in just a word or two. "command-line" here tells you a bunch as well, but there are command-line tools that require a runtime env that you might not want to deploy and maintain just for one tool.
Well for starters you know it's much less likely to leak memory, segfault, or experience buffer overflows. Not necessarily relevant to xsv specifically, but speaking of rust tools more generally.
From the perspective of the program: you might get a hint about the speed and robustness, knowing it is written in Rust.
But what is more important: for all, who are interested in programming languages, there is no better showcase for a tool, in this case a programming language, than to have real-world examples of useful things created with it.
Besides the ops-story and culture a language implies, if you rally for contributions to your free software'd tool, you can use the language as part of the marketing message to make sure your users also like your language (and therefore are more likely to contribute).
Well, my guess is it matters for the upvotes the post receives on HN - hence you're seeing a lot of these posts. Leave out "written in X" and you'll be missing out on upvotes from the X community.
It doesn’t. It’s one of the most annoying aspects of Rust that “Written in Rust” is considered an attractive quality.
But, whatever. There’s no reason to be gratuitously negative. I’m just a bit salty that lesser known languages are often excluded on the basis that they’re lesser known.
You shouldn't, and I write Rust all the time. Don't implicitly trust it because it's written in Rust. There could still be errors, there could be uses of unsafe, etc.
I just said more than C, not an absolute trust level or trust just because it's Rust. I do know Rust very well myself.
It does have some gotchas for those who think it means instant safety.
For example the interface between safe and unsafe. If you change an established constraint that safe Rust depends on inside an unsafe block then all safe code has to be aware of this change and take it into account. In a way anything could be undefined behavior if you don't.
I’d trust Tarsnap with my life, and it’s written in C. You have to evaluate the merits of an individual project, not blindly put your faith in their security claims.
In general, when you write C, there are a multitude of ways to introduce memory related vulnerabilities if you don't know what you're doing, and historically and empirically speaking, even if you do (heartbleed comes to mind, but I'm sure 5 minutes of research would turn up hundreds of examples).
Safe Rust prevents you from double freeing, dereferencing null or dangling pointers, using memory after it's freed, and a whole bunch of memory-related bugs/potential vulnerabilities.
Now, programmers are human. We make mistakes. Why, a priori, should anyone trust the programmers of one particular project to be superhuman and never screw up? That seems much more like blind faith to me than trusting a language which has been specifically designed to eliminate these kinds of errors.
I do trust tarsnap, but if Colin wrote tarsnap in safe rust, I'd trust it even more.
Similarly, I can trust a library or program is less likely to exhibit crashes and memory issues if it's written in rust, and that allows the developers to be less renowned security experts while still producing usable code.
Your argument that C programs can be trusted to be memory safe and contain minimal bugs IFF they're written by experts doesn't prove the point that languages are meaningless for gaining trust.
I trust a combination of credibility and tools. If someone tells me they ran a fuzzer over their http server, I'll trust it more than if they say they didn't. That's not credibility, that's tools.
Both credibility and tools have their place. This isn't some black and white issue like you're portraying it.
> Also, Tarsnap has had a long-standing bug bounty program, which is another reason to trust it.
This is not necessarily a reason to trust Tarsnap. It is just a (rather weak) indication of a security-oriented process and nothing more. Probably what you want to point to is the security track record (inferred from the bug bounty program), which would be a good reason to trust it.
I think I’ve written at least a couple of these in some basic form myself. At some point in a business, someone is going to hand you a CSV or Excel dataset and you end up having to deal with it.
Have been using xsv (in combination with any-json, and jq) for a few months wrangling big csv files at work. Found it to be a better / faster process than rolling my own code.
Good to have another tool for csv debris management. Especially those multigigabyte gifts that need to be in the database “yesterday”. This has happened several days or weeks in a row more times than I can count. And no, people won’t provide such things in sqlite or something sane.
In particular, xsv prints valid CSV, and quoting rules will be applied consistently so that they can be diffed. Ignoring spaces and 0s might not work though.
If you'd like to file an issue with a specific use case (sample data from a real world problem would be great), then that would be appreciated!
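A rough sketch of that normalize-then-diff workflow (the file names and sort key are hypothetical, and it assumes a shell with process substitution):

$ diff <(xsv sort -s id old.csv) <(xsv sort -s id new.csv)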
No, the git hashes are still there in the same syntax as described in that document. In fact, the checksum entry of a git dependency seems to be "<none>".
This comment shouldn't have been downvoted/killed; at the time it was made, the title of the post was completely uninformative, leaving people to guess what on earth "XSV" might be.
I started to write a similar application in C++. Gave up because I didn't think people would find it useful. After seeing this on the front page, I feel it might be a good idea to get back into it.
https://github.com/tfili001/line
If I get back into this, CSV parsing will be my next step.
Really useful if you're a programmer preferring Windows but mainly using Unix tools and developing for Unix OSes.
By the way, I've been using xsv when analysing 8 GB CSVs (the Amazon review dataset) and have been nothing but happy with it.