Friendlier SQL with DuckDB (duckdb.org)
366 points by hfmuehleisen on May 12, 2022 | 133 comments



Lots of great additions. I will just highlight two:

Column selection:

When you have tons of columns, these become very useful. ClickHouse takes it to the next level and supports APPLY and COLUMNS in addition to the EXCEPT and REPLACE that DuckDB supports (rough sketch below the link):

    - APPLY: apply a function to a set of columns
    - COLUMNS: select columns by matching a regular expression (!)
Details here: https://clickhouse.com/docs/en/sql-reference/statements/sele...
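For the record, those look roughly like this in ClickHouse (table and column names invented here; see the link above for the exact syntax):

    -- sum every column whose name matches the regex
    SELECT COLUMNS('^price_') APPLY(sum) FROM sales;
    -- drop one column, rewrite another in place
    SELECT * EXCEPT (internal_id) REPLACE (round(total, 2) AS total) FROM sales;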

Allow trailing commas:

I can't count how many times I've run into a problem with a trailing comma. There's a whole convention that developed to overcome this: the prefix-comma convention, where you'd write:

    SELECT
      first_column
      ,second_column
      ,third_column
which lets you easily comment out a line without worrying about trailing comma errors. That's no longer necessary in DuckDB. Allowing for trailing commas should get included in the SQL spec.
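For reference, DuckDB now happily accepts the trailing comma itself (table name invented):

    SELECT
      first_column,
      second_column,
      third_column,   -- trailing comma is fine in DuckDB
    FROM some_table;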


> Allowing for trailing commas should get included in the SQL spec.

Yep! That would be my #1 request for SQL. Seems ridiculous that it's not supported already.


I agree, though you can always use a dummy value as a workaround:

  SELECT
      first_column,
      second_column,
      third_column,
      null


BigQuery also supports trailing commas!


Allowing references to columns defined previously in the same query would make DuckDB competitive for data analytics. Without that, one has to chain WITH statements for even the tiniest operations.

  select 1 as x, x + 2 as y, y/x as z;
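For comparison, the chained-WITH workaround being described looks something like this today (a minimal sketch):

    WITH step1 AS (SELECT 1 AS x),
         step2 AS (SELECT x, x + 2 AS y FROM step1)
    SELECT x, y, y / x AS z FROM step2;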


There is an issue filed for that, and it looks like someone is even working on it. https://github.com/duckdb/duckdb/issues/1547


There's also no need to restrict it to left-to-right usage, as long as it's acyclic:

  select y-2 as x, 3 as y, y/x as z;


Would this be compiled into a graph of subqueries and window statements?


I'm not following, the original could be written as

  select x + 2 as y, 1 as x, y/x as z;
with the same column values in a different order. All I was saying is that the order of the expressions shouldn't matter.


Yes, good thought! That is listed at the bottom of the article as something we are looking at for the future.


Yep! Agreed!


(nothing to do with DuckDB but..) SQL is complex enough, and allowing this (and acyclically as mentioned below) would do my $%^& nut implementing it.

But I know a user requirement when I hear one, so can you give me a large, real example of where allowing this would make things easier? That would be mega helpful, ta


> SQL is complex enough

No, it is not. I mean it is, but not in parts where that could be seen as useful and/or convenient. [A]cyclic graph traversal/etc is one of the basic tests in a modern interview at any CRUD studio. How come it could do $%^& to any part of yours?


> How come it could do $%^& to any part of yours?

Because just implementing the standard stuff nearly did my $^&% nut. Also I know about graphs & posets, and it's potentially a little more complex than it seems. The variables

    select x * x as y, 1 as x
is meh, but what about

    select 
        (select tbl.z from tbl where tbl.y = y) as subq, 
        x * yy as y,
        xx + 1 as x,
        subq + yy as zzz
    from ( 
        select xx, yy
        from ... )
I just don't fancy supporting that.


Note you can already reference select list items in GROUP, HAVING, and ORDER BY so it's not that big of an extension.

I've implemented the ability to reference select-list aliases before; it's not that hard to do if implemented basically like a macro expansion. The main problem is user confusion due to ambiguous references, e.g.

    select 2 as x, x as `which x?`
    from (select 1 as x) t;
we ended up adding a warning for the case where a select list alias shadowed a table column, suggesting using a fully-qualified table name if they actually wanted the table column (t.x in the above example).

IMO only allowing references to previous select list items is a perfectly reasonable restriction; loosening it isn't worth the implementation headache or user confusion. Though we did allow using aliases in the WHERE clause.
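For concreteness, a tiny sketch of the alias-in-WHERE case being described (table and column names made up):

    SELECT price * quantity AS revenue
    FROM sales
    WHERE revenue > 1000;   -- effectively expands to: WHERE price * quantity > 1000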


> Note you can already reference select list items in GROUP, HAVING, and ORDER BY so it's not that big of an extension.

There you're just looking up symbols in the symbol table; I think it's a big difference!

> IMO only allowing references to previous select list items is a perfectly reasonable...

agreed, see my other post where I say the same.

> Though we did allow using aliases in the WHERE clause

And the SQL standards people didn't go for this, and I'm sure they were very far from stupid. And nobody's asking why they didn't allow this, which really bothers me.


Oh, was your objection specifically to allowing references to following (not just preceding) select list items? Then we're in violent agreement. That would be complicated to implement and confuse users. Definitely not worth it.


I'm against doing anything without checking beforehand whether it's actually going to be worth the effort.

But yes, I'd be far happier doing left-to-right dependencies only, which I can believe (though I still need evidence) would be of some value.


> suggesting using a fully-qualified table name if they actually wanted the table column (t.x in the above example).

I just realised why this was bothering me. That means 'x' and 't.x' are actually different variables. In standard SQL it's always the case (right?) that an unqualified column ('x') is just a convenient shorthand for the fully qualified one ('t.x', or more fully I suppose, '<db>.<schema>.t.x'), and you just broke that.


> what about

That’s no different than the first snippet, if you aren’t parsing it with regexps, of course. The resulting AST identifiers would simply refer not only to column names but also to other defined expressions. This is the case for both snippets. It’s either cyclic or not, and when not, it is easy to substitute/CSE/etc as usual. The complexity of these expressions is irrelevant.



@wruza, @wenc: These are both very good answers, and you are of course both right. Check the symbol table; anything you can't find should be defined in the same context (in the select list, as a new expr). In that case, match each symbol use (e.g. x in x * x as y) to its definition (e.g. 1 as x) to establish a set of dependencies, then do a partial-order sort, then spit out the results.

I can do that I just don't fancy it, and more to the point nobody is giving me an example of where it would be particularly helpful. So if anyone can, I'm interested.

(also, consider human factors; although an acyclic definition could be extracted from an unordered expression set, a consistent left-to-right order (in the western world anyway, matching textual layout), with dependencies introduced on the right and depending only on what came before on the left, might actually be better for us meatsacks)


My examples are from boring enterprise, not from what we love to create at home. I’ve read and patched literally meters-long queries in analytics, which could be reduced dramatically by being self-referential and by other approaches discussed ITT. Of course these could be refactored into something “create view/temp/cte”, but that requires full control of the DDL, special access rights, and code ownership. Most of the space was used by similar case-when-then constructs and permutations of the values these produced. The original code was under official support, so we couldn’t just rewrite it, because migrating to the next update would then cost a week instead of an hour.

I could reach out and post a lengthy example, but it’s nothing but boring reshuffles really, spiced with 3-level joins of “modelling db in db to allow user columns”.

I agree on the LTR idea, because reading a symbol not yet defined may lead to confusion.


It’s not trivial, but as someone who has implemented something similar (for an equation-based modeling language), it’s not super complicated if you use the right abstractions. It’s basically traversing the AST and doing substitutions.



The thing that makes SQL simple for me is that I can think in set operations, devoid of proceduralness. The more sequential we make things, the more it becomes like programming rather than a formula.


What does "do my $%^& nut" even mean? (looks like Perl ;))


:-) English idiom. Nut = head. Doing my head in, basically.


for example

    select id, count(...something complicated) as complicated_count
    from ....
    order by complicated_count
would help


'ORDER BY 2' would work here, but using the named column is a lot nicer.
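i.e. referring to select-list items by position, something like (names invented):

    SELECT id, count(*) AS complicated_count
    FROM events
    GROUP BY id
    ORDER BY 2;   -- 2 = the second item in the select list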


Wow, TIL. Great tip for those random one-off queries you have to bash out when investigating a problem.


Please never let this vile shortcut work its way into your production code.


I've seen quite a few production queries that use indexes in GROUP BY and ORDER BY; it's quite common. Probably partially because linters/code review/etc are lightweight to nonexistent amongst the analysts/data science types that I tend to work with.


Indexes are used all over for grouping and ordering; I was objecting only to the syntax of ORDER BY <number>


SQL already supports "order by complicated_count". Did you mean group by?

This isn't really the large, convincing example I was looking for btw.


Many dialects already support using aliases in GROUP BY and HAVING too, btw.

IMO it's most useful (though somewhat more difficult to implement) to be able to use the aliases with window functions or large case/when statements, something like

    SELECT
      page,
      SUM(clicks) AS total_clicks,
      100. * total_clicks / (SUM(total_clicks) OVER ()) AS click_pct,
      100. * SUM(total_clicks) OVER (ORDER BY total_clicks DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / (SUM(total_clicks) OVER ()) AS cumulative_click_pct
    FROM weblog
    GROUP BY page;


Interesting, ta. My code rarely looks like that, so thanks for the insight. Was exactly what I was looking for.


Yes, trailing commas should work everywhere!

JSON is the other one where it annoys me, but luckily I rarely hand-write any JSON anymore (and there are semi-solutions for this like json5).

In code I always add trailing commas to anything comma-separated. It makes editing simpler (you can shuffle lines without thinking about commas). In a diff or blame it doesn't show adding a comma as a change.

SQL is the one spot where this doesn't work, and it's a constant foot-gun as I often don't remember until I run it and get a syntax error.


JSONC allows comments and trailing commas, but adoption seems to be low.

VSCode uses it for configuration, but when I wanted to use it in Python (to add context to source-controlled Elasticsearch schemas) there were only a couple of old, barely-maintained libraries for parsing.


You can often find better maintained libraries for json5, which is a superset of jsonc


> there were only a couple old barely-maintained libraries for parsing.

Do they work, though? If it’s a mostly stable standard, doesn’t seem like you’d need a frequently updated parser.


EXCEPT columns would get my vote for ansi standard SQL adoption. So much time is spent selecting all but a few columns.


You can do the same thing with your WHERE clause and ANDs by always starting it with WHERE 1=1 as well.
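i.e. something like this (table and column names invented), so that every real predicate starts with AND and any of them can be commented out freely:

    SELECT *
    FROM orders
    WHERE 1=1
      AND status = 'shipped'
      AND region = 'EU';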

>> Allowing for trailing commas should get included in the SQL spec.

So there is no "SQL spec" per se; there's an ANSI specification with decades of convention and provider-specific customizations piled on top. This support for trailing commas is the best you're going to get.


Thank you for the feedback! I will check those ClickHouse features out. I totally agree on the trailing commas, and I use comma-first syntax for that same reason! But maybe not anymore... :-)


> Allowing for trailing commas should get included in the SQL spec

Not just SQL; trailing commas are stupidly useful and convenient, so as far as I'm concerned every language should have them. To be fair, a decent number of languages have implemented them (I was pleasantly surprised by GCC C), but there are still notable holdouts (JSON!).


Are leading commas allowed? Because otherwise, you've just traded out the inability to comment out the last element for the inability to comment out the first. I never understood this convention.


I agree that it's ugly and don't use it myself, but I find that I modify the last item in a list far more frequently than the first. Probably because the grouping columns tend to go first by convention, and these change less.


Matching columns by regular expression sounds like a terrible feature. Talk about bug-prone!


Wow, this was definitely a pessimistic click for me; I was thinking "trying to replace SQL? How stupid!" But it just looks like SQL with all the stuff you wish SQL had, and some more stuff you didn't even know you wanted.


I was just yesterday exploring DuckDB and it looked very promising but I was very surprised to find out that indexes are not persisted (and I assume that means they must fit in RAM).

> Unique and primary key indexes are rebuilt upon startup, while user-defined indexes are discarded.

The second part with just discarding previously defined indexes is super surprising.

https://duckdb.org/docs/sql/indexes

This was an instant showstopper for me, and I assume for most people whose databases grow to a bigger size, at which point an OLAP DB becomes interesting in the first place.

Also, the numerous issues on GitHub regarding crashes make me hesitant.

But I really like the core idea of DuckDB being a very simple codebase with no dependencies and still providing very good performance. I guess I just would like to see more SQLite-esque stability/robustness in the future and I'll surely revisit it at some point.


Hey eis, you are correct. We do not support index storage in our latest release. I am currently implementing this, and it is in a fairly advanced stage. So it should be featured in the next release.

This took a little while because we use a fairly modern index structure with no literature describing how to buffer-manage it.


It's good to hear that persistent indexes are coming soon. I saw it was on the roadmap but didn't know how far out this feature was. Do you have an idea when that release could be out?

BTW Do you have some kind of code/docs one can take a look at regarding the index structure? I'm a part-time data structure nerd :)


Here is a paper that was used when building the DuckDB approach!

https://db.in.tum.de/~leis/papers/ART.pdf

There are some other papers and details about the architecture here: https://duckdb.org/why_duckdb


Will the persistent indices still be built on ART indexes? ART is a great choice of data structure. Fast lookups, range queries and prefix queries.

Typesense DB (Typesense.org) also uses ART under the hood as the base data structure for fast full-text search queries.

So I assume it would be straightforward to build a text search engine on top of DuckDB and utilize ART indices to do even more lovely things.


Persistent indexes are being actively worked on! Stay tuned. As for the crashes - DuckDB is very well tested and used in production in many places. The core functionality is very mature! Let us know if you test it out! Happy to help if I can.

(disclaimer - on the DuckDB team)


Hi, good to hear that you guys care about testing. One thing, apart from the GitHub issues, that led me to believe it might not be super stable yet was the benchmark results on https://h2oai.github.io/db-benchmark/ which make it look like it couldn't handle the 50GB case due to an out-of-memory error. I see that the benchmark and the versions used are about a year old, so maybe things have changed a lot since then. Can you chime in regarding the current story of running bigger DBs, like 1TB, on a machine with just 32GB or so of RAM? Especially regarding data mutations and DDL queries. Thanks!


Yes, that benchmark result is quite old in Duck years! :-)

We actually run that benchmark as a part of our test suite now, so I am certain that there is improvement from that version.

The biggest DuckDB I've used so far was about 400 GB on a machine with ~250 GB of RAM.

There is ongoing work that we are treating as a high priority for handling larger-than-memory intermediate results within a query. But we can handle larger than RAM in many cases already - we sometimes run into issues today if you are joining 2 larger than RAM tables together (depending on the join), or if you are aggregating a larger than RAM table with really high cardinality in one of the columns you are grouping on.

Would you be open to testing out your use case and letting us know how it goes? We always appreciate more test cases!


Often efforts and articles like this feel like minor affordances which don’t immediately jump out as a big deal, even if they eventually turn out to be really useful down the road. Seeing the article title, that’s what I expected. I did not expect to read through the whole thing with my inner voice, louder and more enthusiastically, saying “yes! this!” Very cool.


How does DuckDB compare to SQLite (e.g. which workloads are a good fit for what? Would it be a good idea to use both?)

I found https://duckdb.org/why_duckdb but I'm sure someone here can share some real world lessons learned?


Excellent question! I'll jump in - I am a part of the DuckDB team though, so if other users have thoughts it would be great to get other perspectives as well.

First things first - we really like quite a lot about the SQLite approach. DuckDB is similarly easy to install and is built without dependencies, just like SQLite. It also runs in the same process as your application just like SQLite does. SQLite is excellent as a transactional database - lots of very specific inserts, updates, and deletes (called OLTP workloads). DuckDB can also read directly out of SQLite files as well, so you can mix and match them! (https://github.com/duckdblabs/sqlitescanner)

DuckDB is much faster than SQLite when doing analytical queries (OLAP) like when calculating summaries or trends over time, or joining large tables together. It can use all of your CPU cores for sometimes ~100x speedup over SQLite.

DuckDB also has some enhancements with respect to data transfer in and out of it. It can natively read Pandas, R, and Julia dataframes, and it can also read Parquet files directly (meaning without inserting first!).
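For example, querying a Parquet file in place looks roughly like this (filename invented):

    SELECT page, count(*) AS hits
    FROM 'weblog.parquet'
    GROUP BY page;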

Does that help? Happy to add more details!


One of SQLite's most appealing aspects to me is using it as an application file format, as described in this article: https://www.sqlite.org/appfileformat.html

How does DuckDB compare in that aspect? Does it have the same kind of guarantees of robustness, incorruptibility and performance (especially reading/writing binary blobs) that SQLite does?

In any case: DuckDB looks great, nice work! Good to have more players in this space!


It is a goal of ours to become a standard multi-table storage format! However, today we are still in beta and have made some breaking changes in the last few releases. (Exporting from the old version, then re-importing your DB into the new one lets you upgrade!) Those should happen less often as we move forward (the storage format was genericized a bit and is more resilient to future enhancements now), and locking in our format and guaranteeing backwards compatibility will happen when we go to 1.0!


Thanks, it does help! I understand SQLite might be better/ideal for OLTP (?) but would DuckDB also work for use cases where I query for specific records (e.g. based on primary key) or would I rather use SQLite for OLTP stuff and then read SQLite from DuckDB for analytical workloads?

Basically I'm wondering: if I go all in on DuckDB instead of SQLite would I notice? Do I have to keep anything in mind?

I know, probably difficult to answer without a concrete example of data, schema, queries and so on.

The SQL query features in the article seem really neat. Kudos @ shipping.


Good questions! You are correct that it depends. We do have indexes to help with point queries, but they are not going to be quite as fast as SQLite because DuckDB stores data in a columnar format. (Soon they will be persistent - see comments above!) That columnar format is really great for scanning many items, but not optimal for grabbing all of a single row.

With DuckDB, bulk inserts are your friend and are actually super fast.
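A sketch of that bulk-load pattern (filename invented):

    CREATE TABLE events AS SELECT * FROM read_csv_auto('events.csv');
    INSERT INTO events SELECT * FROM read_csv_auto('more_events.csv');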

Definitely let us know what you find! Just open up a discussion on Github if you'd like to share what you find out: https://github.com/duckdb/duckdb/discussions


I should add that we can read/write Apache Arrow as well!


one thing I love about DuckDB is that it supports Parquet files, which means you can get great compression on the data. Here's an example getting a 1 million row CSV under 50MB and interactive querying in the browser: https://observablehq.com/@observablehq/bandcamp-sales-data?c...

the other big thing is better native data types, especially dates. With SQLite, if you want to work with time series you need to do your own date/time casting.
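e.g. with DuckDB's native date/timestamp types, something like this works directly, with no casting gymnastics (table and column names invented):

    SELECT date_trunc('month', sale_date) AS month, sum(amount) AS total
    FROM sales
    GROUP BY month
    ORDER BY month;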


Yes, it is always difficult to use dates in SQLite... DuckDB makes dates easier - like they should be!


DuckDB: embedded OLAP, SQLite: embedded OLTP. For small datasets (<1M rows let's say) either would be similar in performance.

I need to do the benchmarks to substantiate this but this is my intuition.


I haven't used DuckDB yet, but the biggest differences I've discovered, if you aren't working on huge datasets where the column/row thing makes a difference (you're probably not), are:

1. SQLite has a great GUI and is really really widely supported.

2. DuckDB is properly statically typed with a much wider range of types than SQLite, which is dynamically typed and only just added support for any kind of type checking at all.


If you'd like to work with DuckDB in a SQL IDE/GUI, we recommend DBeaver! It uses the DuckDB JDBC connector. A quick how to guide is here: https://duckdb.org/docs/guides/sql_editors/dbeaver


> huge datasets where the column/row thing makes a difference (you're probably not)

For programmers, it’s a tossup.

For most people working in data (databases, data engineering, ML etc) the column vs row thing makes a difference for datasets as small as a few hundred k records.


SQLite's SQL is severely limited.

If you want to do something a bit more complex, you will have a bad time. Hello, WITH RECURSIVE.


`EXCLUDE`

Extremely useful; is there a reason why this wasn't implemented in SQL in the first place? I often find myself writing very long queries just to select basically all columns except for two or three of them.


Because columns can be added to tables in production databases, so any time you use SELECT * you run the risk of the number of columns changing and breaking anything you wrote.


This seems reasonable on its own, but then you can add a compound index and forget to join on its second part, or refactor a column into two and only collect one value into an aggregation. This spotted babysitting is just stupid. If you’re anxious about query integrity, get some tooling and check your SQLs/DDLs against some higher-level schema.

Even if that turns out to be a constant source of trouble worth not having, then why can’t SQL at least provide column sets, so that queries could include, group, or join on these predefined sets of columns instead of repeating tens of columns and/or expressions and/or aggregations many times across a single query? You had employees.bio_set=(name, dob); now you add `edu` to it and it just works everywhere, because you think in sets rather than in specific columns. Even group by bio_set works. Heck, I bet most ORMs partially exist only to generate SQL, because it’s sometimes unbearable as is.


That's a problem with select * in general, not a problem with using EXCLUDE with select *. So that still doesn't explain why it's not in SQL to begin with.


I've always viewed SELECT * as a convenience for schema discovery and a huge bonus for subqueries - our shop excludes its use at the top level in production due to the danger of table definitions changing underneath... but we happily allow subqueries to use SELECT * so long as that column list is clearly defined before we leave the database.

Worse, by far, than a column you didn't expect being added is a column you did expect being removed. Depending on how thorough your integration tests are (and ideally they should be pretty thorough), you could suddenly start getting strange array-key-access (or object-key-not-found) errors somewhere on the other side of the codebase.


Yeah I tend to use "select *" in interactive queries when I'm working out what I want, but then write explicit column names in anything going into production. This helps with the column-being-removed case, as the query will fail immediately selecting a nonexistent column, whereas "select *" will not fail and the error will happen somewhere else.


"A traditional SQL SELECT query requires that requested columns be explicitly specified, with one notable exception: the * wildcard. SELECT * allows SQL to return all relevant columns. This adds tremendous flexibility, especially when building queries on top of one another. However, we are often interested in almost all columns. In DuckDB, simply specify which columns to EXCLUDE:"

It appears that this works by selecting all columns and then excluding only the columns specified. The reason this doesn't exist in standard SQL is that it is a terrible idea: it is something that will break at many companies with large technical debt if it is ever used.


It can definitely be misused, but SELECT * is pretty handy for ad-hoc queries and to succinctly get all (or almost all) of the columns for a subquery or CTE.


That's why good database wrappers support referencing columns in result sets by column name. It's good practice.


If anyone is interested in improvements to SQL, checkout PRQL https://github.com/prql/prql, a pipelined relational query language.

It supports:

- functions,

- using an alias in the same `select` that defined it,

- trailing commas,

- date literals, f-strings and other small improvements we found unpleasant with SQL.

https://lang.prql.builders/introduction.html

The best part: it compiles into SQL. It's under development, though we will soon be releasing version 0.2 which would be "you can check it out"-version.


This is neat. Have you found this newfound capability at odds with "good SQL"? E.g., I run a fairly large application that has a huge DB schema, and more often than not, when the SQL gets huge and ugly it means we're asking too much of the DB. "Too much" meaning it's easier to run into poor indexes, and there are more chances for it to pull in an unexpectedly large number of rows, etc.

My fear with PRQL is that I'd more easily ask too much of the DB, given how easy it looks to write larger and more complex SQL. Thoughts?


That's true - when you hit your 4th CTE you are probably doing something wrong.

But not always. Some analytical queries may actually need such complexity. Also, during development, you would sometimes pick only the first 50 rows before joining and grouping, with the intention of not overloading the DB. To do this you need a CTE (or nested select), but in PRQL you just add a `take 50` transform to the top.
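Roughly the SQL shape being described (names invented):

    SELECT d.name, count(*) AS n
    FROM (SELECT * FROM big_table LIMIT 50) t
    JOIN departments d ON d.id = t.dept_id
    GROUP BY d.name;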


Since the DuckDB people are here, just want to say that what you're doing is going to be a complete game-changer in the next few years, much like SQLite changed the game. Thanks for making it open source!


That 750KB PNG can probably be a 50KB PNG. Even without resizing it compresses to less than half its size.

https://duckdb.org/images/blog/duck_chewbacca.png


Thanks! Can you tell that my SQL-fu is stronger than my HTML-fu? :-) Much appreciated!


In this case it should probably be a JPEG? (Unless it has a transparent background and the site responds to the user's dark-mode setting? :) Also, this image looks like it almost certainly was a JPEG, at some point!)


The color palette is pretty limited, so using something that can have a specific/limited palette (say 48-64 colors in this case) is probably going to give a better result than JPEG. Also, optimizing for display size would take it further still. Alpha transparency support is also a bonus for PNG over JPEG.


Came across this a few times but never got to try it out because the only Golang binding is unofficial and I can't get CGO to work as expected...

It would be really neat to have an official one. This article makes me want to try it even more.


Here is a solved Github Issue related to CGO for the Go bindings! If you have another issue, please feel free to post it on their Github page!

https://github.com/marcboeker/go-duckdb/issues/4


Thanks! I didn't see that, I'll give it a try again!


Ran into some similar issues with those bindings. We switched to using the ODBC drivers and it's been great. Those are official, and we just use an ODBC library in Go to connect.


I love love love DuckDB. When I can use DuckDB + pyarrow and not import pandas, it makes my day.


I'm enjoying experimenting with DuckDB from Python; it's a promising product and has a large list of data formats it can read, including Pandas dataframes in memory with zero-copy. However, it's still quite the moving target, with a number of things not at maturity yet, e.g. the TimestampZ column type isn't implemented yet [1], although it is in the documentation.

Edit: I came across it via the podcast: https://www.dataengineeringpodcast.com/duckdb-in-process-ola...

Latest release notes: https://github.com/duckdb/duckdb/releases/tag/v0.3.3

[1] Error message: Not implemented Error: DataType TIMESTAMPZ not supported yet...


The correct type is TIMESTAMPTZ. Is the type TimestampZ mentioned anywhere in our documentation? If so, that looks like a typo.

I fully agree that error message needs to improve, however. I will have a look at that.


You're right, I was using 'timestampz' when I should have been using 'timestamptz' (or 'timestamp with time zone'), thanks for that.


On the topic of Friendlier SQL, I have extended the similarity search to types in this PR [1], so the system will now offer you this correction as well :)

[1] https://github.com/duckdb/duckdb/pull/3633


This is fantastic. Column aliases are super helpful in reducing verbose messiness.

DuckDB has all but replaced Pandas for my use cases. It’s much faster than Pandas even when working with Pandas data frames. I “import duckdb as db” more than I “import pandas as pd” these days.

The only thing I need now is a parallelized APPLY syntax in DuckDB.


Fugue has a DuckDB backend, and I believe they can use Dask and DuckDB in combination for something similar to what you are looking for! There is also a way to map Python functions in DuckDB using the relational (dataframe-like) API.

https://fugue-tutorials.readthedocs.io/tutorials/integration...

https://github.com/duckdb/duckdb/pull/1569


  SELECT * EXCLUDE (jar_jar_binks, midichlorians) FROM star_wars
Error: columns not found

On further investigation, it seems that someone had maliciously injected lots of bogus data into the production database. We tried to clean up by truncating tables and dropping columns, but in the end it was easier to just restore from a backup prior to 1999.

There still seems to be some residual corruption, most predominantly around mos_eisley and jabbas_palace data, and we had to truncate the end of Return of the Jedi, but not much was lost there.


What are some potential long-term liabilities we might see in choosing to adopt duckdb today?

Obviously there will be a desire to monetize this project, if only to subsidize the cost of its development and maintenance. I love everything I hear and see about this project, but it makes me nervous to recommend it internally, not only because it is at such an early stage, but also because of any unforeseen costs and liabilities it might introduce in the future.


Let me see if I can assuage some of your concerns!

First off - DuckDB is MIT licensed, so you are welcome to use and enhance it essentially however you please!

DuckDB Labs is a commercial entity that offers commercial support and custom integrations. (https://duckdblabs.com/). If the MIT DuckDB works for what you need, then you are all set no matter what!

However, much of the IP for DuckDB is owned by a foundation, so it is independent of that commercial entity. (https://duckdb.org/foundation/)

Does that help? Happy to answer any other questions!


Absolutely. I think DuckDB's future is bright, and I'm excited to hopefully work with it in the near future.


I find the examples very confusing because they use names that sound like rows or tables (jar_jar_binks, planets) as fields.


Ah, well, that was a risk that I took... Thank you for the feedback though! The Star Wars puns were too hard to resist...

If you have a specific question, definitely post it here and I will clarify!


I'd agree. Took me a couple of reads to make sense of it.

Keep the puns, I just think with a bit of adjustment the examples would be easier to understand.


These are great features! I wish I had them in every database. Hmm, I wonder if Babelfish could support that...

PS those examples were so good! really good writing :)


I love the attitude towards ergonomics over standard compliance. And you'll see why SQL has never been really portable across databases ;-)


Maybe if SQL was a better language at the start, there would be more incentive to follow the spec.


IIRC there was an ANSI/FIPS standard which was dropped under the Clinton administration, which is when things started to diverge.

(Info from memory, may be wrong or a bit mangled)


SQL is an amazingly good language for what it does. It has been with us coming up on 5 decades.


Minus everyone having different versions of it to facilitate the missing functionality/needs of users


Well, it's not portable, but you don't have to learn a new language every time you encounter a new DBMS (unlike MongoDB or InfluxDB).


I love these updates. It would be great to see some of the major data warehouse vendors (Snowflake, BigQuery, Redshift) follow suit.


This is awesome, and I would love to chat about building an integration with the low-code platform Budibase: https://github.com/Budibase/budibase


I would go even further and say that "GROUP BY ALL" and "ORDER BY ALL" should be implied if not provided in the query.

EDIT: Typo


We thought about that, but particularly with `GROUP BY ALL` the problem is that we would get different results from SQLite. When a column is not mentioned in the `GROUP BY` clause in SQLite it automatically pushes a `FIRST` aggregate over that column. So for example, the following query:

  SELECT city, COUNT(*)
  FROM customers
In SQLite is transformed into:

  SELECT FIRST(city), COUNT(*)
  FROM customers
In our experience this is not a good default since it is almost never what you want, and hence we did not copy this behavior and instead throw an error in this situation. However, if we were to add an implicit `GROUP BY ALL` our transformed queries would now diverge, i.e. we would transform the above query to:

  SELECT city, COUNT(*)
  FROM customers
  GROUP BY city
Having diverging query results from SQLite on quite basic queries would confuse a lot of newcomers in DuckDB, and potentially cause silent problems when query semantics change when switching databases.

We could definitely add a flag to enable this behavior, however.


I vaguely remember that this is what MySQL does?


I've been experimenting with DuckDB using a modified Mondrian OLAP engine and it looks very promising so far, performance-wise.

A question I have for the author, or anyone using it: is there an easy way to transfer a whole Postgres DB into DuckDB so I can do some tests with actual client data? I could export each table by hand and reimport it, but that is kind of painful.


Interesting thought! I have not tried this yet so I only have a guess as an answer. Could you export the data as SQL statements and then run those statements on DuckDB? That may be easier to set up, but may take longer to run...

DuckDB also has the ability to read Postgres data directly, and there is a Postgres FDW that can read from DuckDB!

https://github.com/duckdblabs/postgresscanner

https://github.com/alitrack/duckdb_fdw


Trailing commas and “GROUP BY ALL” are such a huge improvement. Some of this should start making it to other databases.


Though many of the queries don’t make complete sense, the mapping of queries to Star Wars is :chefkiss:


On the topic of friendlier SQL, there was a feature, LINQ to SQL, added to (and I believe removed from) .NET.

It was basically syntactic sugar for a persistence API.

Instead of "select bar from foo" it used a "from foo select bar" type of syntax.

This was rather nice from a code completion perspective.


Friendlier SQL is MySQL's "insert into ... set".

Normal insert, hard to read:

  INSERT INTO table1 ( field1, field2, field3, ... )
  VALUES ('value1', 'value2', 'value3', ... );

vs

Easier to read:

  INSERT INTO table1
  SET field1='value1', field2='value2', field3='value3', ...


This looks a little odd

   SELECT age, sum(civility) as total_civility
   FROM star_wars_universe
   ORDER BY ALL
   -- ORDER BY age, total_civility
there's no GROUP BY?

edit: (removed edit, I blew it, sorry)


Good catch! Fixed! Needed a group by all in there.


Does it support a syntax for recursive queries? In T-SQL we use recursive CTEs which are ugly as hell.

This is very cool though. There are a lot of features here that would make my life easier. GROUP BY ALL is noice.


It does, DuckDB supports recursive CTEs


What do you use recursive queries for?


Interesting additions! On using column aliases in predicates: what if my alias exists in the source as well, which takes precedence? I feel like this can become a bit confusing either way.


For compatibility, the original column in the data source takes precedence. That's how other DBs handle things so we wanted to stay standard where it made sense!


Wow so many nice database-related news recently - feels like database week or something! :)



