TextQL: Execute SQL Against CSV or TSV (github.com/dinedal)
658 points by TAForObvReasons on April 7, 2018 | 91 comments



If you work with CSV a lot, then also check out xsv, "a command line program for indexing, slicing, analyzing, splitting and joining CSV files", from Andrew Gallant (burntsushi of Go and Rust fame): https://github.com/BurntSushi/xsv


Another great tool to wrangle CSV files is Miller.

http://johnkerl.org/miller/doc/


You can also use Log Parser Lizard to parse CSV files.


And check out "Motivation" in xsv's readme (at the end) to understand why it might be a better choice than most other tools suggested here, particularly those that go through Python and SQLite.

Also, while I like/applaud Miller (also referenced), xsv has a simpler UI that only handles CSV.


`xsv` looks great, I will try it out.

My main hesitation: it's another DSL to learn; there is a big benefit to keeping this kind of work within the SQL world for me (or R, Python, etc., depending on the user).

On the other hand, getting table stats and frequencies right from the shell is a huge time saver over SQL.
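
For example, a quick profile from the shell might look like this (data.csv and its city column are made up for illustration):

  xsv stats data.csv | xsv table
  xsv frequency -s city data.csv | xsv table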


Seems very similar to q.

q has been out there for years, with a very large community.

q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).

http://harelba.github.io/q/


Yes, I use q all the time for slicing and dicing delimited files. The only problem I have with it is that the name can make it a little harder to find if you don't remember the repo.

Since q will read stdin and write CSV to stdout, you can chain several queries on the command line, or use it in series with other commands such as cat, grep, sed, etc.

Highly recommended if you like SQL and deal with delimited files.
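
For example, something like this (file and column names made up) mixes q with ordinary shell tools, reading the table from stdin as "-":

  grep -v '^#' data.csv | q -d , -H "select city, count(*) from - group by city"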


This looks very simple and easy to use.

Are there any differences between using this and csvkit?

https://csvkit.readthedocs.io/en/1.0.3/

It includes a tool called csvsql.

Example usage -

  csvsql --query "select name from data where age > 30" data.csv > new.csv


Was wondering the same. I've been using csvkit a lot lately. This looks like a subset of that functionality using more or less the same approach. csvsql uses SQLite underneath, and you can do some nice things with this, like joins, SQL functions, etc. There is also stuff like Amazon Athena that allows you to do similar things in S3 at scale.

csvkit is great with pipes, and you can also easily convert between CSV, TSV, and even stuff like JSON.
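
For example (file names made up):

  in2csv data.xlsx > data.csv    # Excel (and other formats) to CSV
  csvjson data.csv > data.json   # CSV to JSON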


Python's pandas library can also do SQL-like queries against CSV data.

https://pandas.pydata.org/pandas-docs/stable/comparison_with...
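
For anyone who hasn't used it, a minimal sketch of that mapping (the file and column names here are invented):

  import pandas as pd

  df = pd.read_csv("data.csv")
  adults = df[df["age"] > 30]                 # WHERE age > 30
  counts = adults.groupby("city").size()      # GROUP BY city, COUNT(*)
  print(counts.sort_values(ascending=False))  # ORDER BY count DESC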


Although csvsql is great, the real reason I love csvkit is that most of the tools feel like standard *nix tools that just operate on CSV files (csvcut, csvgrep, etc.)
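
e.g. a typical pipeline (made-up file and columns):

  csvcut -c name,email people.csv | csvgrep -c email -r '@example\.org$' | csvlook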


I’ve worked on a similar tool in the past called Charlatan [1] (no I didn’t choose that name), but it implemented its own SQL-like language and was extensible to support any file type.

Its key difference from TextQL and similar alternatives is that it works on streams instead of importing everything into SQLite and then querying it. It seems strange to read a whole file into memory and then perform a `SELECT` query on it when you could just run that query while reading the file. That means a much lower memory footprint and faster execution, but on the other hand you can only use the subset of SQL that's implemented.

[1]: https://github.com/BatchLabs/charlatan#charlatan


If you are on Windows you may consider using Log Parser Studio [1]. It is a GUI over logparser.exe which understands a lot of various file formats and uses SQL to query them [2].

[1] https://gallery.technet.microsoft.com/office/Log-Parser-Stud...

[2] https://en.m.wikipedia.org/wiki/Logparser


Came here to say this. Log parser is amazing. Check out Log Parser Lizard for a good IDE.


This is one of those problems that is just hard enough that everyone feels it is useful to solve and just easy enough that everyone can solve it. Hence the dozens or even hundreds of independent solutions. For some reason I never end up using any of them!


Apache Drill (https://drill.apache.org/) is also a great alternative. Although advertised as a query engine for Hadoop, it works perfectly fine on local CSV and JSON files.
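
For example, from Drill's embedded shell (bin/drill-embedded), querying a local CSV might look roughly like this; dfs is the local-filesystem storage plugin, and without header extraction the fields come back in a columns array (the path and fields here are invented):

  SELECT columns[0] AS name, columns[1] AS age
  FROM dfs.`/tmp/people.csv`
  WHERE CAST(columns[1] AS INT) > 30;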


How is this different from https://github.com/harelba/q


For anyone looking to do this with plain SQLite, one can import a CSV by running (in the REPL)

    .mode csv
    .headers on
    .import my.csv tablename
This does look extremely convenient, though; being able to use UNIX pipes will be a huge improvement to my workflow.


To get an idea of what can be done with SQLite, look at this solution to Hash Code 2018, done with only the sqlite3 command line: https://github.com/mingodad/sqlite3-hashcode-2018


I usually appreciate the animated gif screenshot (especially for curses-based programs), but this particular animated gif would have been better off as plain text: there's too much interesting data that can't be copied, it's really hard to go back to look at something you missed, and it's frustrating to have to wait while fake typing happens. I really wish there were just a transcript of the gif that I could read at my own pace.


There is also BigBash [0], which converts a SQL statement to a bash one-liner (using sed, grep, awk, ...) that you can then run to execute the query. The advantage is that you can let it run on very large files because of the streaming nature of the Linux command-line tools.

[0] http://bigbash.it or the corresponding Github repo.
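
I haven't checked what BigBash actually emits, but the idea is that a query like "SELECT user, COUNT(*) FROM data GROUP BY user" can become a streaming pipeline along these lines (hand-written illustration, not BigBash output):

  awk -F',' '{ n[$1]++ } END { for (u in n) print u "," n[u] }' data.csv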


Sqawk [1] is an Awk-like program that uses SQL and can combine data from multiple files. It is powered by SQLite.

[1] https://github.com/dbohdan/sqawk


I like VisiData[1][2] for CSV/TSV data exploration. textql is a good command-line follow-up to run quick direct queries against the data without needing to see visual/spatial structure.

[1] https://jsvine.github.io/intro-to-visidata/

[2] Previous HN VisiData thread https://news.ycombinator.com/item?id=16515299


Yawn. We've had this in Perl for decades using the DBI Shell (dbish). DBI is a common database interface. To connect to a data source, one uses an appropriate driver (by convention, in the DBD namespace, such as DBD::Pg or DBD::MySQL) to get a db handle that can run SQL commands. There are a lot of drivers for things like CSV files, in-memory databases, etc.

Plus you can access all this programmatically and not just through the dbish command. Since this is a central part of Perl infrastructure, there are a ton of tools that extend the DBI, from a variety of ORMs like DBIx::Class, to data export and even DDL delta management.

Generalized Database Interface: https://metacpan.org/pod/DBI

DBI Shell program: https://metacpan.org/pod/DBI::Shell

CSV driver: https://metacpan.org/pod/DBD::CSV


Ha. Shameless plug, but just about a week ago I started working on a somewhat analogous thing: executing SQL on XML [1]. I came up with this idea after trying to use the Stack Overflow data dump and finding out it was stored as XML. I wanted to run queries with the LIKE operator on it.

[1]: https://github.com/kamac/AskXML


This is a great idea. There are so many APIs and text dumps from websites and databases that are XML, and it would be nice to be able to query them with regular relational operations. I'll be following your project! I will say that I believe a streaming backend would be more suitable, since most of the XML dumps I'm thinking of are humongous and won't always fit in memory.


Thanks! About fitting into memory: the way the library currently works should technically work even on very big files, as the XML is iteratively converted into a SQL database. You can choose whether you want that database to be in a file or in memory. The only thing that could break right now is synchronizing back to XML after having made changes with UPDATE or INSERT statements, because I'm not paging results.
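
Not the AskXML API, just a rough sketch of that iterative approach in Python, assuming Stack Overflow-style <row Id="..." Body="..."> elements:

  import sqlite3
  import xml.etree.ElementTree as ET

  db = sqlite3.connect("posts.db")  # or ":memory:"
  db.execute("CREATE TABLE IF NOT EXISTS posts (id TEXT, body TEXT)")

  # stream element by element so the whole dump never has to fit in memory
  for _, elem in ET.iterparse("Posts.xml"):
      if elem.tag == "row":
          db.execute("INSERT INTO posts VALUES (?, ?)",
                     (elem.get("Id"), elem.get("Body")))
          elem.clear()
  db.commit()

  print(db.execute("SELECT count(*) FROM posts WHERE body LIKE '%sqlite%'").fetchone())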


ClickHouse ships with a command line tool which does this (without the actual database server):

    ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' | \
        clickhouse-local -S "user String, mem Float64" \
            -q "SELECT user, round(sum(mem), 2) as memTotal FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"

    
    ┏━━━━━━━━━━┳━━━━━━━━━━┓
    ┃ user     ┃ memTotal ┃
    ┡━━━━━━━━━━╇━━━━━━━━━━┩
    │ clickho+ │      0.7 │
    ├──────────┼──────────┤
    │ root     │      0.2 │
    ├──────────┼──────────┤
    │ netdata  │      0.1 │
    ├──────────┼──────────┤
    │ ntp      │        0 │
    ├──────────┼──────────┤
    │ dbus     │        0 │
    ├──────────┼──────────┤
    │ nginx    │        0 │
    ├──────────┼──────────┤
    │ polkitd  │        0 │
    ├──────────┼──────────┤
    │ nscd     │        0 │
    ├──────────┼──────────┤
    │ postfix  │        0 │
    └──────────┴──────────┘
Has the advantage of being really fast.


The additional requirement to set up the schema is kind of onerous.


Shameless plug: Sqawk can do nearly the same without you defining a schema.

  $ ps aux | sqawk -output table \
                   'select user, round(sum("%mem"), 2) as memtotal
                    from a
                    group by user
                    order by memtotal desc' \
                   header=1
  ┌────────┬────┐
  │dbohdan │67.1│
  ├────────┼────┤
  │  root  │3.5 │
  ├────────┼────┤
  │ avahi  │0.0 │
  ├────────┼────┤
  │ daemon │0.0 │
  ├────────┼────┤
  │message+│0.0 │
  ├────────┼────┤
  │ nobody │0.0 │
  ├────────┼────┤
  │  ntp   │0.0 │
  ├────────┼────┤
  │ rtkit  │0.0 │
  ├────────┼────┤
  │ syslog │0.0 │
  ├────────┼────┤
  │ uuidd  │0.0 │
  ├────────┼────┤
  │whoopsie│0.0 │
  └────────┴────┘
Link: https://github.com/dbohdan/sqawk


Looks pretty good!

Just for historical interest, similar things have been done before.

e.g. http://quisp.sourceforge.net/shsqlhome.html

See also Facebook's osquery, of course. It turns even complex structures into SQL tables -- helped by Augeas.net.


Does this have any benefits over lnav[1]?

[1] https://www.lnav.org -- a feature-rich, powerful, flexible "mini-ETL" with an embedded sqlite engine -- is one of my all-time favorite CLI tools that somehow remains under the radar



Thanks for the link - looks cool!


I'm a big fan of SQL Notebook [1]. It is open source and has a great UI to quickly work with a number of importable formats (CSV, Excel, JSON) as well as server-based DBs.

[1] https://sqlnotebook.com/


PostgreSQL does this out of the box with:

  CREATE EXTENSION file_fdw;
  CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
  CREATE FOREIGN TABLE foo (
    col1 text,
    col2 text,
    ...
  ) SERVER import 
  OPTIONS ( filename '/path/to/foo.csv', format 'csv' );
  SELECT col1 FROM foo WHERE col2='x';


... and with dozens of other data sources:

https://wiki.postgresql.org/wiki/Foreign_data_wrappers


However, this is something you can put in /usr/local/bin and use alongside other command-line tools with pipes.


Having to define the schema is the major hangup for me in my workflow. I would prefer a wrapper that could run within unix pipelines around arbitrary text data files, but they all use SQLite.

I personally have been using harelba’s q (the un-googleable utility), which is just fine.

It would be great not to shift gears into SQLite syntax and date formatting all the time. Does anyone know of a similar tool that runs over postgres?


> "prefer a wrapper that could run within unix pipelines around arbitrary text data files"

That sounds like [lnav](https://www.lnav.org).

For queries, it does use SQLite under the hood, but before querying your "arbitrary text data" files, you can use regex to define a custom format (eg with named groups and back references), providing structure against which standard SQL is an ideal tool to query.

It's simpler than I'm probably making it sound. Highly recommended.


Which is exactly why a convenient alternative was desired.


These responses are so HackerNews. Everyone asking 'why not use x project?', and each x is different...


In the past, I've used fsql[1] when I needed this. This looks interesting though.

1: https://metacpan.org/pod/distribution/App-fsql/bin/fsql


One thing that I would like to add to my projects is an animated GIF of a shell session. I've looked at several ways to create them, but they all seem really clunky and force me to type correctly the first time. What do people use to create these helpful animations?


I'm pretty sure many people have written or attempted to write similar tools at least once before. It's as simple as programmatically importing into an in-memory SQLite instance and then executing SQL queries.
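
A minimal sketch of that approach in Python, for the curious (the script name and query below are made up; it reads CSV from stdin into an in-memory table called t and runs whatever query you pass as the first argument):

  import csv, sqlite3, sys

  rows = csv.reader(sys.stdin)
  header = next(rows)

  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE t (%s)" % ", ".join('"%s"' % h for h in header))
  db.executemany("INSERT INTO t VALUES (%s)" % ", ".join("?" * len(header)), rows)

  for row in db.execute(sys.argv[1]):
      print(row)

Invoked as something like: cat data.csv | python3 csvq.py "select count(*) from t"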


Apache Spark can do this too, similar to the Apache Drill comment. This thread is starting to feel like Stack Overflow a bit.


If only this comments page would show up first when googling "sql on csv".


This is cool. How does it work? I'd guess cleaning the data and copying it into SQLite?


It imports everything into an in-memory SQLite database.


Apache MetaModel is great for this in Java.


While dealing with CSV files using SQL statements is cool and all, SQL queries are still SQL queries.



I use R for this.


Can you show an example of how to use R to read tabular data on stdin (e.g. in a UNIX pipe line) and perform SQL queries?


Using ideas contained in this script you can execute tidyverse commands with R on the command line:

https://github.com/jeroenjanssens/data-science-at-the-comman...

I was not able to commit its options to long term memory, so I've built a bash script that contains a pipe of commands, including Rio.


Why would you be running SQL queries in R, a language with built-in support for handling tabular data?


I was asking because the topic of this discussion is specifically running SQL queries (that's what TextQL does).

But, there certainly are good reasons for using SQL in R. Packages like sqldf exist, so their authors would probably be able to give the best answer to your question.

Some reasons:

- SQL is a very widely-known DSL for working with tabular data, so it may make sense to make that interface available within R.

- For certain operations (e.g. specific types of joins) it may be more natural / easier to express the operation in SQL.

- For large data sets, some operations in R have large memory footprints, and doing the operations in a database may have lower memory requirements.


Rather than SQL queries, you would be using the R language, unless you use a package like sqldf. Standard input can be read just like any other file.


For what it's worth, if we define Rsql as the following script:

  #!/usr/bin/env Rscript
  library(sqldf, quietly=TRUE)
  statement <- commandArgs(TRUE)
  a <- read.table(file("stdin"), header=TRUE)
  print(sqldf(statement))
then we can call

  ps -Ao user | Rsql 'select USER, count(*) as nprocesses from a group by USER limit 5'
to get the output

          USER nprocesses
  1      avahi          2
  2     colord          1
  3     daemon          1
  4     marcle        136
  5 messagebus          1


Cool! Thanks.


In SQLite you can just do

    sqlite> .import myfile.csv mytable
    sqlite> select ...
What does this tool give you that SQLite doesn't do out of the box?


As a command-line tool it could serve as a more powerful replacement for unix utilities that I often use for simple analysis jobs: sort, uniq, wc, awk, sometimes cut and sed. I'm going to give it a try.

It looks like textql loads everything into an in-memory SQLite instance, whereas I'd really like to see an approach that uses SQLite's virtual table mechanism (https://sqlite.org/vtab.html), which would avoid the loading step and perhaps make streaming processing possible.
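
For what it's worth, SQLite ships a CSV virtual-table extension in ext/misc/csv.c that does scan the file at query time instead of importing it; if I remember the arguments correctly, using it looks roughly like this:

    sqlite> .load ./csv
    sqlite> CREATE VIRTUAL TABLE temp.t USING csv(filename='data.csv', header=true);
    sqlite> SELECT count(*) FROM t;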


What you describe would be a super nice utility. Great for prototyping and development, as the code could be copy & pasted from such a tool into application code.

The streaming would make it memory efficient, and possibly able to handle some big data - maybe not true "Big Data", but certainly 10s of gigabytes.

Anyone want to take this idea into a GoFundMe site?


SQL is set-oriented. How would that work on a potentially indefinite stream, other than as a simple filter, which you could just do with a tool such as awk?


Many relational operations don't require a whole stream to compute, and many of those that do don't need it all at once.

Projection (mapping), filtering, and a join against a fully loaded other side all work.

Aggregation can consume an indefinite stream with limited working set if the cardinality of the grouping key isn't large.

And of course you can combine these in nested and unioned operations, computing across multiple indefinite streams concurrently and with limited working set.

It would be tricky to make work effectively without hinting for things like joins, for sure; join order is one of the hardest bits a query engine optimizes.
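
For example, a streaming GROUP BY with SUM only needs working memory proportional to the number of distinct grouping keys; an awk one-liner makes the shape of that obvious:

  awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }'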


I think this might be helpful context:

https://calcite.apache.org/docs/stream.html


Sort of a good idea: play with a small dataset to see how things would work if it were in a SQL database.

I usually have to develop databases in Windows and Access. One more tool for working in Linux is a good thing.

I used to use awk and sed before.


From the README:

Key differences between textql and sqlite importing

sqlite import will not accept stdin, breaking unix pipes. textql will happily do so.

textql supports quote escaped delimiters, sqlite does not.

textql leverages the sqlite in memory database feature as much as possible and only touches disk if asked.


But it is SQLite under the covers? A better title then would be an SQLite wrapper for use in shell pipelines.


In this case, it's what it does that matters, not how it does it. I don't care if SQLite is powering it. Maybe the author would want to swap out the SQLite engine for something else in the future, too. Then it's even better that they didn't advertise the link to SQLite as a selling point.


SQL dialects have significant differences. You can't simply change the underlying engine without breaking backward compatibility. It's also good to redirect documentation to the official SQLite docs instead of writing a 1000-page manual yourself.


It does actually matter what's running underneath, so you know what to expect from its behavior; and if you don't exactly need the few features TextQL offers exclusively, you might just use SQLite without extra layers on top.


Sure, but it doesn’t need to be in the title


My favorite C wrapper for scripting is Perl.

The README is describing the differences in importing. I'm not sure how your distinction makes this any clearer.


That's a silly complaint. It's a tool powered by SQLite. Under your extended definition, tools like web browsers (which use SQLite to power various things) would be SQLite wrappers.


If I wrote 10 lines of JS and claimed credit for the whole web browser, more like


I don't see it as a complaint but as a piece of feedback and if you're the author then you may want to revisit your attitude.


Nah, people say such ridiculous garbage that they need to be called out. Especially here where people seem to take delight in disparaging your work in the tiniest way they can think of.

I don't think OP needs to pussyfoot around some stranger on an internet message board. It's free shit.


"SQLite in your shell" would be a catchy title :-)


You skipped the line to create the bare table first in SQLite: https://www.sqlite.org/cvstrac/wiki?p=ImportingFiles


The link is to a page that says it's been obsolete since 2009. I wasn't able to find a similar page in the docs; can you help?


Help with what? If you want to import a CSV file into SQLite, just follow the "obsolete" docs, which probably still work. If you want help finding a similar page in the docs, just read all the docs or download them and grep. It may or may not exist.


It's discussed in the README: https://github.com/dinedal/textql#key-differences-between-te...

> sqlite import will not accept stdin, breaking unix pipes. textql will happily do so.

> textql supports quote escaped delimiters, sqlite does not.

> textql leverages the sqlite in memory database feature as much as possible and only touches disk if asked.


    #!/usr/bin/env rc
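    # Read CSV from stdin into a temporary SQLite table and run the given SQL statement against it.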
    
    row_headers=''
    output_mode=column
    
    fn usage {
      echo $0 usage
      echo   -o <sqlite output mode> '#' which output mode to use
      echo   -r                      '#' if present, output row headers
      echo   -h                      '#' this message
    }
    
    while(~ $1 -*) {
      switch($1) {
        case -o
          output_mode=$2
          shift
        case -r
          row_headers='.headers on'
        case -h
          usage
          exit 0
        case *
          echo Bad args: $*
          usage
          exit 1
       }
       shift
    }
    
    test $#* -ne 1 && echo Wrong number of arguments && usage && exit 1
    
    sql_statement=$1
    sqlite_bin=''
    { which sqlite > /dev/null >[2=1] && sqlite_bin=sqlite } || { which sqlite3 >/dev/null >[2=1] && sqlite_bin=sqlite3 }
    
    table_name=atable
    stdin_file=/proc/$pid/fd/0
    {
      echo .mode csv
      echo .import $stdin_file $table_name
      echo $row_headers
      echo .mode $output_mode
      echo $sql_statement
    } | $sqlite_bin


Erm, people don't have rc installed though. I had completely forgotten it existed, to be honest.


OK, thanks for the heads up.



we should be aiming to kill SQL as a language


Why? I'm not surprised you got downvoted, but I am curious.


It is a negative comment. We should make alternatives, not kill off technologies. Even the Commodore 64 has uses with the right modern add-ons; now that is hacking.


What would you suggest we replace it with?



