TextQL: Execute SQL Against CSV or TSV (github.com/dinedal)
658 points by TAForObvReasons on April 7, 2018 | 91 comments



If you work with CSV a lot, then also check out xsv, "a command line program for indexing, slicing, analyzing, splitting and joining CSV files", from Andrew Gallant (burntsushi of Go and Rust fame): https://github.com/BurntSushi/xsv


Another great tool to wrangle CSV files is Miller.

http://johnkerl.org/miller/doc/


You can also use Log Parser Lizard to parse CSV files.


And check out "Motivation" in xsv's readme (at the end) to understand why it might be a better choice than most other tools suggested here, particularly those that go through Python and SQLite.

Also, while I like/applaud Miller (also referenced), xsv has a simpler UI that only handles CSV.


`xsv` looks great, I will try it out.

My main hesitation: it's another DSL to learn; there is a big benefit to keeping this kind of work within the SQL world for me (or R, Python, etc., depending on the user).

On the other hand, getting table stats and frequencies right from the shell is a huge time saver over SQL.
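
For example, a quick profile from the shell might look like this (data.csv and its city column are made up for illustration):

  xsv stats data.csv | xsv table
  xsv frequency -s city data.csv | xsv table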


Seems very similar to q.

q has been out there for years, with a very large community.

q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).

http://harelba.github.io/q/


Yes, I use q all the time for slicing and dicing delimited files. The only problem I have with it is that the name can make it a little harder to find if you don't remember the repo.

Since q will read stdin and write CSV to stdout, you can chain several queries on the command line, or use it in series with other commands such as cat, grep, sed, etc.

Highly recommended if you like SQL and deal with delimited files.
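
For example, something like this (file and column names made up) mixes q with ordinary shell tools, reading the table from stdin as "-":

  grep -v '^#' data.csv | q -d , -H "select city, count(*) from - group by city"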


This looks very simple and easy to use.

Are there any differences between using this and csvkit?

https://csvkit.readthedocs.io/en/1.0.3/

It includes a tool called csvsql.

Example usage -

  csvsql --query "select name from data where age > 30" data.csv > new.csv


Was wondering the same. I've been using csvkit a lot lately. This looks like a subset of that functionality using more or less the same approach. csvsql uses SQLite underneath, and you can do some nice things with this, like joins, SQL functions, etc. There is also stuff like Amazon Athena that allows you to do similar things in S3 at scale.

csvkit is great with pipes, and you can also easily convert between CSV, TSV, and even stuff like JSON.
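
For example (file names made up):

  in2csv data.xlsx > data.csv    # Excel (and other formats) to CSV
  csvjson data.csv > data.json   # CSV to JSON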


Python's pandas library can also do SQL-like queries against CSV data.

https://pandas.pydata.org/pandas-docs/stable/comparison_with...
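
For anyone who hasn't used it, a minimal sketch of that mapping (the file and column names here are invented):

  import pandas as pd

  df = pd.read_csv("data.csv")
  adults = df[df["age"] > 30]                 # WHERE age > 30
  counts = adults.groupby("city").size()      # GROUP BY city, COUNT(*)
  print(counts.sort_values(ascending=False))  # ORDER BY count DESC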


Although csvsql is great, the real reason I love csvkit is that most of the tools feel like standard *nix tools that just operate on CSV files (csvcut, csvgrep, etc.)
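
e.g. a typical pipeline (made-up file and columns):

  csvcut -c name,email people.csv | csvgrep -c email -r '@example\.org$' | csvlook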


I’ve worked on a similar tool in the past called Charlatan [1] (no I didn’t choose that name), but it implemented its own SQL-like language and was extensible to support any file type.

Its key difference from TextQL and similar alternatives is that it works on streams instead of importing everything into SQLite and then querying it. It seems strange to read a whole file into memory and then perform a `SELECT` query on it when you could just run that query while reading the file. That means a much lower memory footprint and faster execution, but on the other hand you can only use the subset of SQL that's implemented.

[1]: https://github.com/BatchLabs/charlatan#charlatan


If you are on Windows you may consider using Log Parser Studio [1]. It is a GUI over logparser.exe which understands a lot of various file formats and uses SQL to query them [2].

[1] https://gallery.technet.microsoft.com/office/Log-Parser-Stud...

[2] https://en.m.wikipedia.org/wiki/Logparser


Came here to say this. Log parser is amazing. Check out Log Parser Lizard for a good IDE.


This is one of those problems that is just hard enough that everyone feels it is useful to solve and just easy enough that everyone can solve it. Hence the dozens or even hundreds of independent solutions. For some reason I never end up using any of them!


Apache Drill (https://drill.apache.org/) is also a great alternative. Although advertised as a query engine for Hadoop, it works perfectly fine on local CSV and JSON files.
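
For example, from Drill's embedded shell (bin/drill-embedded), querying a local CSV might look roughly like this; dfs is the local-filesystem storage plugin, and without header extraction the fields come back in a columns array (the path and fields here are invented):

  SELECT columns[0] AS name, columns[1] AS age
  FROM dfs.`/tmp/people.csv`
  WHERE CAST(columns[1] AS INT) > 30;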


How is this different from https://github.com/harelba/q


For anyone looking to do this with plain SQLite, one can import a CSV by running (in the REPL)

    .mode csv
    .headers on
    .import my.csv tablename
This does look extremely convenient, though; being able to use UNIX pipes will be a huge improvement to my workflow.


To get an idea of what can be done with SQLite, look at this solution to Hash Code 2018, done with only the sqlite3 command line: https://github.com/mingodad/sqlite3-hashcode-2018


I usually appreciate the animated gif screenshot (especially for curses-based programs), but this particular animated gif would have been better off as plain text: there's too much interesting data that can't be copied, it's really hard to go back to look at something you missed, and it's frustrating to have to wait while fake typing happens. I really wish there were just a transcript of the gif that I could read at my own pace.


There is also BigBash [0], which converts a SQL statement to a bash one-liner (using sed, grep, awk, ...) that you can then run to execute the query. The advantage is that you can let it run on very large files because of the streaming nature of the Linux command-line tools.

[0] http://bigbash.it or the corresponding Github repo.
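
I haven't checked what BigBash actually emits, but the idea is that a query like "SELECT user, COUNT(*) FROM data GROUP BY user" can become a streaming pipeline along these lines (hand-written illustration, not BigBash output):

  awk -F',' '{ n[$1]++ } END { for (u in n) print u "," n[u] }' data.csv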


Sqawk [1] is an Awk-like program that uses SQL and can combine data from multiple files. It is powered by SQLite.

[1] https://github.com/dbohdan/sqawk


I like VisiData[1][2] for CSV/TSV data exploration. textql is a good command-line follow-up to run quick direct queries against the data without needing to see visual/spatial structure.

[1] https://jsvine.github.io/intro-to-visidata/

[2] Previous HN VisiData thread https://news.ycombinator.com/item?id=16515299


Yawn. We've had this in Perl for decades using the DBI Shell (dbish). DBI is a common database interface. To connect to a data source, one uses an appropriate driver (by convention, in the DBD namespace, such as DBD::Pg or DBD::MySQL) to get a db handle that can run SQL commands. There are a lot of drivers for things like CSV files, in-memory databases, etc.

Plus you can access all this programmatically and not just through the dbish command. Since this is a central part of Perl infrastructure, there are a ton of tools that extend the DBI, from a variety of ORMs like DBIx::Class, to data export and even DDL delta management.

Generalized Database Interface: https://metacpan.org/pod/DBI

DBI Shell program: https://metacpan.org/pod/DBI::Shell

CSV driver: https://metacpan.org/pod/DBD::CSV


Ha. Shameless plug, but just about a week ago I started working on a somewhat analogous thing: executing SQL on XML [1]. I came up with this idea after trying to use the Stack Overflow data dump and finding out it was stored as XML. I wanted to run queries with the LIKE operator on it.

[1]: https://github.com/kamac/AskXML


This is a great idea. There are so many APIs and text dumps from websites and databases that are XML, and it would be nice to be able to query them with regular relational operations. I'll be following your project! I will say that I believe a streaming backend would be more suitable, since most of the XML dumps I'm thinking of are humongous and won't always fit in memory.


Thanks! About fitting into memory: the way the library currently works should technically work even on very big files, as the XML is iteratively converted into a SQL database. You can choose whether you want that database to be in a file or in memory. The only thing that could break right now is synchronizing back to XML after having made changes with UPDATE or INSERT statements, because I'm not paging results.
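
Not the AskXML API, just a rough sketch of that iterative approach in Python, assuming Stack Overflow-style <row Id="..." Body="..."> elements:

  import sqlite3
  import xml.etree.ElementTree as ET

  db = sqlite3.connect("posts.db")  # or ":memory:"
  db.execute("CREATE TABLE IF NOT EXISTS posts (id TEXT, body TEXT)")

  # stream element by element so the whole dump never has to fit in memory
  for _, elem in ET.iterparse("Posts.xml"):
      if elem.tag == "row":
          db.execute("INSERT INTO posts VALUES (?, ?)",
                     (elem.get("Id"), elem.get("Body")))
          elem.clear()
  db.commit()

  print(db.execute("SELECT count(*) FROM posts WHERE body LIKE '%sqlite%'").fetchone())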


ClickHouse ships with a command line tool which does this (without the actual database server):

    ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' | \
        clickhouse-local -S "user String, mem Float64" \
            -q "SELECT user, round(sum(mem), 2) as memTotal FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"

    
    ┏━━━━━━━━━━┳━━━━━━━━━━┓
    ┃ user     ┃ memTotal ┃
    ┡━━━━━━━━━━╇━━━━━━━━━━┩
    │ clickho+ │      0.7 │
    ├──────────┼──────────┤
    │ root     │      0.2 │
    ├──────────┼──────────┤
    │ netdata  │      0.1 │
    ├──────────┼──────────┤
    │ ntp      │        0 │
    ├──────────┼──────────┤
    │ dbus     │        0 │
    ├──────────┼──────────┤
    │ nginx    │        0 │
    ├──────────┼──────────┤
    │ polkitd  │        0 │
    ├──────────┼──────────┤
    │ nscd     │        0 │
    ├──────────┼──────────┤
    │ postfix  │        0 │
    └──────────┴──────────┘
Has the advantage of being really fast.


The additional requirement to set up the schema is kind of onerous.


Shameless plug: Sqawk can do nearly the same without you defining a schema.

  $ ps aux | sqawk -output table \
                   'select user, round(sum("%mem"), 2) as memtotal
                    from a
                    group by user
                    order by memtotal desc' \
                   header=1
  ┌────────┬────┐
  │dbohdan │67.1│
  ├────────┼────┤
  │  root  │3.5 │
  ├────────┼────┤
  │ avahi  │0.0 │
  ├────────┼────┤
  │ daemon │0.0 │
  ├────────┼────┤
  │message+│0.0 │
  ├────────┼────┤
  │ nobody │0.0 │
  ├────────┼────┤
  │  ntp   │0.0 │
  ├────────┼────┤
  │ rtkit  │0.0 │
  ├────────┼────┤
  │ syslog │0.0 │
  ├────────┼────┤
  │ uuidd  │0.0 │
  ├────────┼────┤
  │whoopsie│0.0 │
  └────────┴────┘
Link: https://github.com/dbohdan/sqawk


Looks pretty good!

Just for historical interest, similar things have been done before.

e.g. http://quisp.sourceforge.net/shsqlhome.html

See also Facebook's osquery, of course. It turns even complex structures into SQL tables -- helped by Augeas.net.


Does this have any benefits over lnav[1]?

[1] https://www.lnav.org -- a feature-rich, powerful, flexible "mini-ETL" with an embedded sqlite engine -- is one of my all-time favorite CLI tools that somehow remains under the radar



Thanks for the link - looks cool!


I'm a big fan of SQL Notebook [1]. It is open source and has a great UI to quickly work with a number of importable formats (CSV, Excel, JSON) as well as server-based DBs.

[1] https://sqlnotebook.com/


PostgreSQL does this out of the box with:

  CREATE EXTENSION file_fdw;
  CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
  CREATE FOREIGN TABLE foo (
    col1 text,
    col2 text,
    ...
  ) SERVER import 
  OPTIONS ( filename '/path/to/foo.csv', format 'csv' );
  SELECT col1 FROM foo WHERE col2='x';


... and with dozens of other data sources:

https://wiki.postgresql.org/wiki/Foreign_data_wrappers


However, this is something you can put in /usr/local/bin and use alongside other command-line tools with pipes.


Having to define the schema is the major hangup for me in my workflow. I would prefer a wrapper that could run within unix pipelines around arbitrary text data files, but they all use SQLite.

I personally have been using harelba’s q (the un-googleable utility), which is just fine.

It would be great not to shift gears into SQLite syntax and date formatting all the time. Does anyone know of a similar tool that runs over postgres?


> "prefer a wrapper that could run within unix pipelines around arbitrary text data files"

That sounds like [lnav](https://www.lnav.org).

For queries, it does use SQLite under the hood, but before querying your "arbitrary text data" files, you can use regex to define a custom format (eg with named groups and back references), providing structure against which standard SQL is an ideal tool to query.

It's simpler than I'm probably making it sound. Highly recommended.


Which is exactly why a convenient alternative was desired.


These responses are so HackerNews. Everyone asking 'why not use x project?', and each x is different...


In the past, I've used fsql[1] when I needed this. This looks interesting though.

1: https://metacpan.org/pod/distribution/App-fsql/bin/fsql


One thing that I would like to add to my projects is an animated GIF of a shell session. I've looked at several ways to create them, but they all seem really clunky and force me to type correctly the first time. What do people use to create these helpful animations?


I'm pretty sure many people have written or attempted to write similar tools at least once before. It's as simple as programmatically importing into an in-memory SQLite instance and then executing SQL queries.
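
A minimal sketch of that approach in Python, for the curious (the script name and query below are made up; it reads CSV from stdin into an in-memory table called t and runs whatever query you pass as the first argument):

  import csv, sqlite3, sys

  rows = csv.reader(sys.stdin)
  header = next(rows)

  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE t (%s)" % ", ".join('"%s"' % h for h in header))
  db.executemany("INSERT INTO t VALUES (%s)" % ", ".join("?" * len(header)), rows)

  for row in db.execute(sys.argv[1]):
      print(row)

Invoked as something like: cat data.csv | python3 csvq.py "select count(*) from t"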


Apache Spark can do this too, similar to the Apache Drill comment. This thread is starting to feel like Stack Overflow a bit.


If only this comments page would show up first when googling "sql on csv".


This is cool. How does it work? I'd guess cleaning the data and copying it into SQLite?


It imports everything into an in-memory SQLite database.


Apache MetaModel is great for this in Java.


While dealing with CSV files using SQL statements is cool and all, SQL queries are still SQL queries.



I use R for this.


Can you show an example of how to use R to read tabular data on stdin (e.g. in a UNIX pipe line) and perform SQL queries?


Using ideas contained in this script you can execute tidyverse commands with R on the command line:

https://github.com/jeroenjanssens/data-science-at-the-comman...

I was not able to commit its options to long term memory, so I've built a bash script that contains a pipe of commands, including Rio.


Why would you be running SQL queries in R, a language with built-in support for handling tabular data?


I was asking because the topic of this discussion is specifically running SQL queries (that's what TextQL does).

But, there certainly are good reasons for using SQL in R. Packages like sqldf exist, so their authors would probably be able to give the best answer to your question.

Some reasons:

- SQL is a very widely-known DSL for working with tabular data, so it may make sense to make that interface available within R.

- For certain operations (e.g. specific types of joins) it may be more natural / easier to express the operation in SQL.

- For large data sets, some operations in R have large memory footprints, and doing the operations in a database may have lower memory requirements.


Rather than SQL queries, you would be using the R language, unless you use a package like sqldf. Standard input can be read just like any other file.


For what it's worth, if we define Rsql as the following script:

  #!/usr/bin/env Rscript
  library(sqldf, quietly=TRUE)
  statement <- commandArgs(TRUE)
  a <- read.table(file("stdin"), header=TRUE)
  print(sqldf(statement))
then we can call

  ps -Ao user | Rsql 'select USER, count(*) as nprocesses from a group by USER limit 5'
to get the output

          USER nprocesses
  1      avahi          2
  2     colord          1
  3     daemon          1
  4     marcle        136
  5 messagebus          1


Cool! Thanks.


In SQLite you can just do

    sqlite> .import myfile.csv mytable
    sqlite> select ...
What does this tool give you that SQLite doesn't do out of the box?


As a command-line tool it could serve as a more powerful replacement for unix utilities that I often use for simple analysis jobs: sort, uniq, wc, awk, sometimes cut and sed. I'm going to give it a try.

It looks like textql loads everything into an in-memory SQLite instance, whereas I'd really like to see an approach that uses SQLite's virtual table mechanism (https://sqlite.org/vtab.html), which would avoid the loading step and perhaps make streaming processing possible.
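
For what it's worth, SQLite ships a CSV virtual-table extension in ext/misc/csv.c that does scan the file at query time instead of importing it; if I remember the arguments correctly, using it looks roughly like this:

    sqlite> .load ./csv
    sqlite> CREATE VIRTUAL TABLE temp.t USING csv(filename='data.csv', header=true);
    sqlite> SELECT count(*) FROM t;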


What you describe would be a super nice utility. Great for prototyping and development, as the code could be copy & pasted from such a tool into application code.

The streaming would make it memory efficient, and possibly able to handle some big data - maybe not true "Big Data", but certainly 10s of gigabytes.

Anyone want to take this idea into a GoFundMe site?


SQL is set-oriented. How would that work on a potentially indefinite stream, other than as a simple filter, which you could just do with a tool such as awk?


Many relational operations don't require a whole stream to compute, and many of those that do don't need it all at once.

Projection (mapping), filtering, and a join against a fully loaded other side all work.

Aggregation can consume an indefinite stream with limited working set if the cardinality of the grouping key isn't large.

And of course you can combine these in nested and unioned operations, computing across multiple indefinite streams concurrently and with limited working set.

It would be tricky to make work effectively without hinting for things like joins, for sure; join order is one of the hardest bits a query engine optimizes.
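
For example, a streaming GROUP BY with SUM only needs working memory proportional to the number of distinct grouping keys; an awk one-liner makes the shape of that obvious:

  awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }'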


I think this might be helpful context:

https://calcite.apache.org/docs/stream.html


Sort of a good idea: play with a small dataset to see how things would work if it were in a SQL database.

I usually have to develop databases in Windows and Access. One more tool for working in Linux is a good thing.

I used to use awk and sed before.


From the README:

Key differences between textql and sqlite importing

sqlite import will not accept stdin, breaking unix pipes. textql will happily do so.

textql supports quote escaped delimiters, sqlite does not.

textql leverages the sqlite in memory database feature as much as possible and only touches disk if asked.


But it is SQLite under the covers? A better title then would be an SQLite wrapper for use in shell pipelines.


In this case, it's what it does that matters, not how it does it. I don't care if SQLite is powering it. Maybe the author would want to swap out the SQLite engine for something else in the future, too. Then it's even better that they didn't advertise the link to SQLite as a selling point.


SQL dialects have significant differences. You can't simply change the underlying engine without breaking backward compatibility. It's also good to redirect documentation to the official SQLite docs instead of writing a 1000-page manual yourself.


It does actually matter what's running underneath, so you know what to expect from its behavior; and if you don't exactly need the few features TextQL offers exclusively, you might just use SQLite without extra layers on top.


Sure, but it doesn’t need to be in the title


My favorite C wrapper for scripting is Perl.

The README is describing the differences in importing. I'm not sure how your distinction makes this any clearer.


That's a silly complaint. It's a tool powered by SQLite. Under your extended definition, tools like web browsers (which use SQLite to power various things) would be SQLite wrappers.


If I wrote 10 lines of JS and claimed credit for the whole web browser, more like


I don't see it as a complaint but as a piece of feedback and if you're the author then you may want to revisit your attitude.


Nah, people say such ridiculous garbage that they need to be called out. Especially here where people seem to take delight in disparaging your work in the tiniest way they can think of.

I don't think OP needs to pussyfoot around some stranger on an internet message board. It's free shit.


"SQLite in your shell" would be a catchy title :-)


You skipped the line to create the bare table first in SQLite: https://www.sqlite.org/cvstrac/wiki?p=ImportingFiles


The link is to a page that says it's been obsolete since 2009. I wasn't able to find a similar page in the docs; can you help?


Help with what? If you want to import a CSV file into SQLite, just follow the "obsolete" docs, which probably still work. If you want help finding a similar page in the docs, just read all the docs or download them and grep. It may or may not exist.


It's discussed in the README: https://github.com/dinedal/textql#key-differences-between-te...

> sqlite import will not accept stdin, breaking unix pipes. textql will happily do so.

> textql supports quote escaped delimiters, sqlite does not.

> textql leverages the sqlite in memory database feature as much as possible and only touches disk if asked.


    #!/usr/bin/env rc
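    # Read CSV from stdin into a temporary SQLite table and run the given SQL statement against it.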
    
    row_headers=''
    output_mode=column
    
    fn usage {
      echo $0 usage
      echo   -o <sqlite output mode> '#' which output mode to use
      echo   -r                      '#' if present, output row headers
      echo   -h                      '#' this message
    }
    
    while(~ $1 -*) {
      switch($1) {
        case -o
          output_mode=$2
          shift
        case -r
          row_headers='.headers on'
        case -h
          usage
          exit 0
        case *
          echo Bad args: $*
          usage
          exit 1
       }
       shift
    }
    
    test $#* -ne 1 && echo Wrong number of arguments && usage && exit 1
    
    sql_statement=$1
    sqlite_bin=''
    { which sqlite > /dev/null >[2=1] && sqlite_bin=sqlite } || { which sqlite3 >/dev/null >[2=1] && sqlite_bin=sqlite3 }
    
    table_name=atable
    stdin_file=/proc/$pid/fd/0
    {
      echo .mode csv
      echo .import $stdin_file $table_name
      echo $row_headers
      echo .mode $output_mode
      echo $sql_statement
    } | $sqlite_bin


Erm, people don't have rc installed though. I had completely forgotten it existed, to be honest.


OK, thanks for the heads up.



we should be aiming to kill SQL as a language


Why? I'm not surprised you got downvoted, but I am curious.


It is a negative comment. We should make alternatives, not kill off technologies. Even the Commodore 64 has uses with the right modern add-ons; now that is hacking.


What would you suggest we replace it with?



