If you work with CSV a lot, then also check out xsv, "a command line program for indexing, slicing, analyzing, splitting and joining CSV files", from Andrew Gallant (burntsushi of Go and Rust fame): https://github.com/BurntSushi/xsv
And check out "Motivation" in xsv's readme (at the end) to understand why it might be a better choice than most other tools suggested here, particularly those that go through Python and SQLite.
Also, while I like/applaud Miller (also referenced), xsv has a simpler UI that only handles CSV.
My main hesitation:
It's another DSL to learn; for me there is a big benefit to keeping this kind of work within the SQL world (or R, Python, etc., depending on the user).
On the other hand, getting table stats and frequencies right from the shell is a huge time saver over SQL.
Yes, I use q all the time for slicing and dicing delimited files. The only problem I have with it is that the name can make it a little harder to find if you don't remember the repo.
Since q will read stdin and write CSV to stdout, you can chain several queries on the command line, or use it in series with other commands such as cat, grep, sed, etc.
Highly recommended if you like SQL and deal with delimited files.
Was wondering the same. I've been using csvkit a lot lately. This looks like a subset of its functionality using more or less the same approach. csvsql uses SQLite underneath, and you can do some nice things with that, like joins, SQL functions, etc. There is also stuff like Amazon Athena that lets you do similar things in S3 at scale.
csvkit is great with pipes, and you can also easily convert between CSV, TSV, and even formats like JSON.
Although csvsql is great, the real reason I love csvkit is that most of the tools feel like standard *nix tools that just operate on CSV files (csvcut, csvgrep, etc.)
I’ve worked on a similar tool in the past called Charlatan [1] (no I didn’t choose that name), but it implemented its own SQL-like language and was extensible to support any file type.
Its key difference from TextQL and similar alternatives is that it works on streams instead of importing everything into SQLite and then querying it. It seems strange to read a whole file into memory and then perform a `SELECT` query on it when you could just run that query while reading the file. That means a much lower memory footprint and faster execution, but on the other hand you can only use the subset of SQL that's implemented.
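To illustrate the streaming idea, here is a minimal Python sketch (not Charlatan's actual implementation; the column names and the hard-coded query are made up for the example):

```python
import csv
import sys

# Streaming evaluation of something like:
#   SELECT name, amount FROM stdin WHERE amount > 100
# Rows are read, filtered, and written one at a time, so memory use stays
# constant no matter how large the input is. A real tool would parse the
# SQL and build the projection/predicate dynamically; here they are fixed.
reader = csv.DictReader(sys.stdin)
writer = csv.writer(sys.stdout)
writer.writerow(["name", "amount"])
for row in reader:
    if float(row["amount"]) > 100:                     # WHERE amount > 100
        writer.writerow([row["name"], row["amount"]])  # SELECT name, amount
```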
If you are on Windows you may consider using Log Parser Studio [1]. It is a GUI over logparser.exe, which understands a wide variety of file formats and uses SQL to query them [2].
This is one of those problems that is just hard enough that everyone feels it is useful to solve and just easy enough that everyone can solve it. Hence the dozens or even hundreds of independent solutions. For some reason I never end up using any of them!
Apache Drill (https://drill.apache.org/) is also a great alternative. Although advertised as a query engine for Hadoop, it works perfectly fine on local CSV and JSON files.
I usually appreciate the animated gif screenshot (especially for curses-based programs), but this particular animated gif would have been better off as plain text: there's too much interesting data that can't be copied, it's really hard to go back and look at something you missed, and it's frustrating to have to wait while fake typing happens. I really wish there were just a transcript of the gif that I could read at my own pace.
There is also BigBash [0], which converts an SQL statement into a bash one-liner (using sed, grep, awk, ...) that you can then run to execute the query.
The advantage is that you can run it on very large files because of the streaming nature of the underlying Linux tools.
I like VisiData[1][2] for CSV/TSV data exploration. textql is a good command-line follow-up to run quick direct queries against the data without needing to see visual/spatial structure.
Yawn. We've had this in Perl for decades using the DBI Shell (dbish). DBI is a common database interface. To connect to a data source, one uses an appropriate driver (by convention, in the DBD namespace, such as DBD::Pg or DBD::mysql) to get a database handle that can run SQL commands. There are a lot of drivers for things like CSV files, in-memory databases, etc.
Plus you can access all this programmatically and not just through the dbish command. Since this is a central part of Perl infrastructure, there are a ton of tools that extend the DBI, from a variety of ORMs like DBIx::Class, to data export and even DDL delta management.
Ha. Shameless plug, but just about a week ago I started working on a somewhat analogous thing: executing SQL on XML [1]. I came up with this idea after trying to use the Stack Overflow data dump and finding out it was stored as XML. I wanted to run queries with the LIKE operator on it.
This is a great idea. There are so many APIs and text dumps from websites and databases that are XML, and it would be nice to be able to query them with regular relational operations. I'll be following your project! I will say that I believe a streaming backend would be more suitable, since most of the XML dumps I'm thinking of are humongous and won't always fit in memory.
Thanks! About fitting into memory: the way the library currently works, it should technically handle even very big files, as the XML is iteratively converted into a SQL database. You can choose whether you want that database to be in a file or in memory. The only thing that could break right now is synchronizing back to XML after having made changes with UPDATE or INSERT statements, because I'm not paging results.
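For readers curious what "iteratively converted" can look like, here is a rough, hypothetical Python sketch (not the library's actual code) that streams <row> elements, as found in the Stack Overflow dumps, into SQLite; the table and column names are assumptions:

```python
import sqlite3
import xml.etree.ElementTree as ET

def xml_to_sqlite(xml_path, db_path=":memory:", columns=("Id", "Title", "Body")):
    """Stream <row .../> elements into an SQLite table without ever holding
    the whole XML document in memory."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE posts (%s)" % ", ".join(columns))
    insert = "INSERT INTO posts VALUES (%s)" % ", ".join("?" for _ in columns)
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "row":
            conn.execute(insert, [elem.get(col) for col in columns])
            elem.clear()  # discard the parsed element to keep memory bounded
    conn.commit()
    return conn

# Example: run a LIKE query against the imported dump.
# conn = xml_to_sqlite("Posts.xml")
# for (title,) in conn.execute("SELECT Title FROM posts WHERE Body LIKE '%sqlite%'"):
#     print(title)
```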
[1] https://www.lnav.org -- a feature-rich, powerful, flexible "mini-ETL" with an embedded sqlite engine -- is one of my all-time favorite CLI tools that somehow remains under the radar
I'm a big fan of SQL Notebook [1]. It is open source and has a great UI to quickly work with a number of importable formats (CSV, Excel, JSON) as well as server-based DBs.
Having to define the schema is the major hangup for me in my workflow. I would prefer a wrapper that could run within unix pipelines around arbitrary text data files, but they all use SQLite.
I personally have been using harelba’s q (the un-googleable utility), which is just fine.
It would be great not to shift gears into SQLite syntax and date formatting all the time. Does anyone know of a similar tool that runs over postgres?
For queries, it does use SQLite under the hood, but before querying your "arbitrary text data" files, you can use a regex to define a custom format (e.g. with named groups and back references), providing structure that standard SQL is then an ideal tool for querying.
It's simpler than I'm probably making it sound. Highly recommended.
One thing that I would like to add to my projects is this kind of animated GIF of a shell session. I've looked at several ways to create them, but they all seem really clunky and force me to type correctly the first time. What do people use to create these helpful animations?
I'm pretty sure many people have written or attempted to write similar tools at least once before. It's as simple as programmatically importing into an in-memory SQLite instance and then executing SQL queries.
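For the curious, a bare-bones Python sketch of that approach (the file name, the table name `t`, and the command-line usage are illustrative, not any particular tool's interface):

```python
import csv
import sqlite3
import sys

def query_csv(path, sql):
    """Load a CSV into an in-memory SQLite table named t, then run a query."""
    conn = sqlite3.connect(":memory:")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        conn.execute("CREATE TABLE t (%s)" % ", ".join('"%s"' % col for col in header))
        conn.executemany(
            "INSERT INTO t VALUES (%s)" % ", ".join("?" for _ in header),
            reader)
    return conn.execute(sql).fetchall()

if __name__ == "__main__":
    # e.g. python csvq.py data.csv "SELECT city, COUNT(*) FROM t GROUP BY city"
    for row in query_csv(sys.argv[1], sys.argv[2]):
        print(row)
```

Of course, real tools also deal with type inference, delimiter detection, quoting edge cases, and so on.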
I was asking because the topic of this discussion is specifically running SQL queries (that's what TextQL does).
But, there certainly are good reasons for using SQL in R. Packages like sqldf exist, so their authors would probably be able to give the best answer to your question.
Some reasons:
- SQL is a very widely-known DSL for working with tabular data, so it may make sense to make that interface available within R.
- For certain operations (e.g. specific types of joins) it may be more natural / easier to express the operation in SQL.
- For large data sets, some operations in R have large memory footprints, and doing the operations in a database may have lower memory requirements.
As a command-line tool it could serve as a more powerful replacement for unix utilities that I often use for simple analysis jobs: sort, uniq, wc, awk, sometimes cut and sed. I'm going to give it a try.
It looks like textql loads everything into an in-memory SQLite instance, whereas I'd really like to see an approach that uses SQLite's virtual table mechanism (https://sqlite.org/vtab.html), which would avoid the loading step and perhaps make streaming processing possible.
What you describe would be a super nice utility. Great for prototyping and development, as the code could be copy & pasted from such a tool into application code.
The streaming would make it memory efficient, and possibly able to handle some big data - maybe not true "Big Data", but certainly 10s of gigabytes.
Anyone want to take this idea into a GoFundMe site?
SQL is set-oriented. How would that work on a potentially indefinite stream, other than as a simple filter, which you could just do with a tool such as awk?
Many relational operations don't require a whole stream to compute, and many of those that do don't need it all at once.
Projection (mapping), filtering, and a join against a fully loaded other side all work.
Aggregation can consume an indefinite stream with a limited working set if the cardinality of the grouping key isn't large (see the sketch below).
And of course you can combine these in nested and unioned operations, computing across multiple indefinite streams concurrently and with limited working set.
It would be tricky to make work effectively without hinting for things like joins, for sure; join order is one of the hardest bits a query engine optimizes.
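To make the aggregation point concrete, here is a small Python sketch of a streaming GROUP BY (column names are hypothetical): the working set is one accumulator per distinct key, regardless of how long the input stream runs.

```python
import csv
import sys
from collections import defaultdict

# Streaming equivalent of:
#   SELECT key, SUM(value) FROM stdin GROUP BY key
totals = defaultdict(float)
for row in csv.DictReader(sys.stdin):
    totals[row["key"]] += float(row["value"])

writer = csv.writer(sys.stdout)
writer.writerow(["key", "sum_value"])
for key, total in totals.items():
    writer.writerow([key, total])
```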
In this case, it's what it does that matters, not how it does it. I don't care if SQLite is powering it. Maybe the author would want to swap out the SQLite engine for something else in the future too. Then it's even better that they didn't advertise the link to SQLite as a selling point.
SQL dialects have significant differences. You can't simply change the underlying engine without breaking backward compatibility. It's also good to point documentation at the official SQLite docs instead of writing a 1000-page manual yourself.
It does actually matter what's running underneath, so you know what to expect from its behavior; and if you don't need the few features textQL offers on top, you might just use SQLite directly without extra layers.
That's a silly complaint. It's a tool powered by SQLite. Under your extended definition, tools like web browsers (which use SQLite to power various things) would be SQLite wrappers.
Nah, people say such ridiculous garbage that they need to be called out. Especially here where people seem to take delight in disparaging your work in the tiniest way they can think of.
I don't think OP needs to pussyfoot around some stranger on an internet message board. It's free shit.
Help with what?
If you want to import a csv file into SQLite, just follow the "obsolete" docs which probably still work.
If you want help finding a similar page in the docs, just read all the docs or download them and grep. It may or may not exist.
It is a negative comment. We should make alternatives, not kill off technologies. Even the Commodore 64 has its uses with the right modern add-ons; now that is hacking.