SQL as an API language (simonwillison.net)
201 points by craigkerstiens on Oct 4, 2018 | 157 comments



Slightly on topic: when giving unsolicited tech career advice, my favorite trope is "learn SQL". Don't have any computer background? Start with SQL. Already write compilers for a living but don't know SQL? Of course you should learn SQL. A www dev who is tied to ORMs? Learn SQL. A manager who wants to understand the IT that runs your department? Yep, SQL. Etc, etc. SQL is probably the most ubiquitous language under the hood today.


I agree. I used to dislike SQL, but after making some effort to better learn it, I now much prefer “raw” SQL over any wrapper or ORM for anything but the most trivial use cases. My favourite Clojure DB library is HugSQL[1] because it allows you to write pure SQL in a composable way.

I’ve had to fix, optimise and diagnose transaction/locking problems with complex ORM-queries in the past and it was very painful. The best I could do was have the DB log the raw statements and then inspect those. If the queries were already in SQL, it would have been a lot easier.

SQL is pretty damned good at what it does and is battle tested.

[1] https://www.hugsql.org


It powers the best-tested technology on the planet.

We've been writing relational databases for 45 years, and it has worked.


Well, sort of. Every decade it bolts on support for the new technology craze, and usually that proves insufficient, so vendors add their own features, and even that tends to be pretty painful. Definitely a useful skill set, but a painful one to keep up with.


> Every decade it bolts on support for the new technology craze

You mean they got better? I've been using Postgres, MySQL, and SQL Server professionally for over 20 years (Postgres for only 15), but they've only become better & faster over time. I don't see this "technology craze" you mention, but I have seen things like parallel processing, bitmap indexes, selective indexes, query plans that use multiple indexes, online index creation (no locking), support for complex data types like JSON or XML, partitioning, vastly improved resource usage (CPU+memory+disk) just to name a few.
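For instance, a couple of those features, in Postgres syntax (table and column names are hypothetical):

    -- a partial ("selective") index, built online without blocking writes
    CREATE INDEX CONCURRENTLY idx_orders_open
        ON orders (customer_id)
        WHERE status = 'open';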

This doesn't even include some of the cool vendor-specific features like foreign data wrappers (FDW) in Postgres, embedded language runtime support (SQL Server can do .NET, Python, R) and so on.

> so vendors add their own features

Of course they do; RDBMSs are products.


> I don't see this "technology craze" you mention, but I have seen things like parallel processing, bitmap indexes, selective indexes, query plans that use multiple indexes, online index creation (no locking), support for complex data types like JSON or XML, partitioning, vastly improved resource usage (CPU+memory+disk) just to name a few.

Those all seem like pretty ad-hoc features, some of which were influenced by prevailing tech fashion trends at the time. For instance, the fact that you need to manually specify so many details about indices sounds like a failure to have a good theory around indexing which can be integrated into query plans. There are a lot of flaws like this in SQL stores as a whole.


How are they 'ad-hoc'? With the exceptions of JSON and XML support (cf. the 'NoSQL' fad), they're all valuable enhancements to core functionalities, no?

> the fact that you need to manually specify so many details about indices sounds like a failure to have a good theory around indexing which can be integrated into query plans

I'm no SQL guru, but this smacks of a variant on the 'sufficiently smart compiler fallacy' to me.

That the system can't always, uh, optimally optimise, isn't necessarily an indication that the system is fundamentally flawed.

GCC permits inline assembly, but that doesn't mean GCC is a failure.


If GCC required inline assembly everywhere, it would definitely be a failure.


Of course.

Given that this feature was added to the DBMS pretty late in the game, we can infer that it's only rarely worthwhile.


See my other comment [1] if you want further details as to SQL's limitations.

[1] https://news.ycombinator.com/item?id=18147388


> For instance, the fact that you need to manually specify so many details about indices sounds like a failure to have a good theory around indexing

Real life isn't about theory, it's about what works and doesn't work in production. These are not flaws, they're features. I don't need a bitmap index on my primary key, nor does it need to be selective because it will include every row. But having these features and the ability to customize based on the needs of my product is paramount.

If you were not using an RDBMS to store your data, you'd still have these same problems and you'd still end up with optimized secondary indexes based on the lookup criteria.


A lot of SQL databases can definitely determine ad hoc whether indices are necessary; it's just not the default, because predictable performance and explicit indices tend to be preferred.

PostgreSQL has a handful of plugins that can tell you, for an index you haven't created yet, whether it would be used, and you can see query plans before/after without actually having to add indices until you're comfortable. The same plugin could easily automate index creation, but most people don't do that.
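A sketch of that workflow, assuming the HypoPG extension is the kind of plugin being described (table and column names hypothetical):

    -- register a hypothetical index: nothing is built on disk
    CREATE EXTENSION IF NOT EXISTS hypopg;
    SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');

    -- plain EXPLAIN (without ANALYZE) will consider the hypothetical index,
    -- so you can check whether it would be used before creating it for real
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;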


> You mean they got better?

Well, I mean they support new features, which is perhaps "better", though when you continue to tack on features for decades, perhaps something was wrong with the relational algebra model to begin with. Also, you're conflating "SQL" with advancements in individual implementations of a relational database. Relational databases are great; I just think we'll find a model that is more suitable for application development than "relational algebra + the kitchen sink".


On the contrary, a model that is solid enough that you can continue to improve and build features on it for decades without changing the fundamentals is an extremely rare triumph in this industry. The signal that “something is wrong” is that you have to reinvent the foundation every five years (e.g., the tottering stack of false starts inside all web browsers).


SQL has stood the test of time, but we've also figured out how to extend SQL databases and the relational model so that they work efficiently in a lot more scenarios: there are column stores that do incredible compression, there are in-memory row stores, and you can put JSON in a SQL database and query it efficiently.
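For example, the JSON part in Postgres terms (hypothetical table):

    CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb);
    CREATE INDEX ON events USING gin (payload);

    -- containment query that can use the GIN index
    SELECT * FROM events WHERE payload @> '{"type": "signup"}';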


Features aren’t being “built on it”, they’re being bolted on because the model is insufficiently powerful, which is my point. To be clear, relational databases are great and important, but the relational model will have to be replaced.


There are two issues I know of with SQL where it isn't super great IMHO.

1) Hierarchical data. The relational model is fine here, but the query syntax isn't awesome.

2) Understanding remote data. This is outside the scope of the data model, but it does affect the way software is built.

Neither of these show any evidence that the relational model should be replaced.


> There are two issues I know of with SQL where it isn't super great IMHO.

There are many more issues, you've just gotten used to eating SQL turds. Hierarchical data is solved by CTEs, so that's not a big deal.
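For readers who haven't seen it, a recursive CTE is the standard way to walk a hierarchy these days; a minimal sketch with a hypothetical employees table:

    WITH RECURSIVE subordinates AS (
        SELECT id, manager_id, name
        FROM employees
        WHERE id = 1                     -- root of the subtree
        UNION ALL
        SELECT e.id, e.manager_id, e.name
        FROM employees e
        JOIN subordinates s ON e.manager_id = s.id
    )
    SELECT * FROM subordinates;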

The bigger deal is that relations themselves are impoverished second-class citizens. This means you can't write a query, bind it to a variable and then reuse that query to build another query or store that query in a table column. Something like:

    var firstQuery = select * from Foo where ...
    var compositeQuery = select * from firstQuery where ...

    create table StoredQuery(id int not null primary key, query relation)
    insert into StoredQuery values (0, firstQuery)
That's something like relations as first-class values. This replaces at least 3 distinct concepts in SQL: views, CTEs, and temporary tables, all of which were added to address SQL's expressiveness limitations. Relations as first-class values would not only improve the expressiveness of SQL beyond those 3 constructs, they would also solve many annoying domain problems that require a lot of boilerplate at the moment.
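For comparison, a sketch of how the same reuse has to be spelled out today with those constructs (table and column names hypothetical):

    -- CTE: reusable, but only within one statement
    WITH first_query AS (
        SELECT * FROM Foo WHERE amount > 100
    )
    SELECT * FROM first_query WHERE region = 'EU';

    -- view: persistent, but a separate top-level object, not a value you can pass around
    CREATE VIEW first_query_v AS SELECT * FROM Foo WHERE amount > 100;
    SELECT * FROM first_query_v WHERE region = 'EU';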

Then there's pervasive NULL, inconsistent function and value semantics across implementations, and a few other issues.


I like the first-class-relations idea.

As for pervasive NULLs, isn't that more the fault of schema design?

There are a lot of things wrong with SQL. Little things like the fact that INSERT and UPDATE have different syntax for no good reason.
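For illustration, the two syntaxes in question (hypothetical table):

    INSERT INTO users (id, name) VALUES (1, 'Ada');       -- column list, then a VALUES tuple
    UPDATE users SET name = 'Ada Lovelace' WHERE id = 1;  -- column = value pairs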

A few of its syntax quirks annoy me as they're nonstandard in today's languages, but it's only a shallow complaint: using '<>' for its inequality operator, and using single-quotes for strings.

I'm also disappointed in the implementations in various ways: the way Microsoft SQL Server sends query text over the wire unencrypted, in its default configuration. The way Firebird SQL has such a basic wire protocol that you can see significant performance enhancements by invoking TRIM on text-type fields. The way the optimisers are so damn primitive, especially compared to the baffling wizardry that goes on in today's (programming language) compilers.

Somewhat off topic further ranting:

But the core relational model makes good sense. I see little general value in the freeform graph-databases calling themselves 'NoSQL'. (Do we call functional programming languages "No-assignment"?)

Perhaps some of them can scale well, but can't SQL do that? Google Cloud Datastore, for instance - it can scale marvellously, but only because it imposes considerable constraints on its queries. Can't we do the same thing with an SQL subset?


> As for pervasive NULLs, isn't that more the fault of schema design?

I think NULL defaults are widely regarded as a bad thing by now. You should have to declare what's nullable, not declare what's not null.

> There are a lot of things wrong with SQL. Little things like the fact that INSERT and UPDATE have different syntax for no good reason.

Moreover, SELECT should be at the end not the beginning. Query comprehensions and LINQ did this right.

And yes, the implementation inconsistencies are seriously irritating as well.

> Google Cloud Datastore, for instance - it can scale marvellously, but only because it imposes considerable constraints on its queries. Can't we do the same thing with an SQL subset?

If you extend Map-Reduce with a Merge phase, then you can implement the relational algebra with joins [1]. That scales pretty well.

[1] http://cs.brown.edu/courses/cs295-11/mapreducemerge.pdf


> Moreover, SELECT should be at the end not the beginning. Query comprehensions and LINQ did this right.

I'd one-up this and argue that all SQL statements should be in order of execution (within reason). Moving SELECT to the end is definitely a good start.
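For reference, the order you write the clauses in versus the order they are logically evaluated, which is roughly the reordering being suggested (hypothetical table):

    SELECT dept, count(*) AS n   -- 5. project / aggregate the output
    FROM employees               -- 1. pick the source tables
    WHERE active                 -- 2. filter rows
    GROUP BY dept                -- 3. group
    HAVING count(*) > 10         -- 4. filter groups
    ORDER BY n DESC              -- 6. sort
    LIMIT 10;                    -- 7. limit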


Mixing execution concerns and the language seems like a bad idea.

It's pretty much impossible to decide what an optimised query should look like on a different database with different data.

Slow network? Fast disks? GPU optimised joins? No way to know what execution order should be, and that's a strength of the relational model.


Interesting paper, thanks.


The schema aspects of this idea are hard for me to grasp.

How would the database engine efficiently resolve any references to the embedded relation ?

I'm also having trouble understanding how your INSERT statement would even work - is the firstQuery being inserted being treated as hierarchical data where every single relation attached to every single row has its own schema/structure ? Or is the structure fixed in the CREATE TABLE statement so that you can't use just any relation, but rather have to use a relation that conforms to the structure defined in the CREATE TABLE statement ?


I played a little fast and loose to convey the basic idea. To elaborate since you asked, introduce typed tuples/records as first-class entities, then "tables" are just a sequence of records declared with a particular lifetime.

Relations are functions over these sequences. The embedded relation is then a stored function over a set of nested sequences. So to add the concrete types:

    var firstQuery = select Id, Name, Payment from Foo where ...
    var compositeQuery = select Id, Name, Total from firstQuery where ...

    create table StoredQuery(id int not null primary key, query relation{Id int, Name Text, Payment decimal})
    insert into StoredQuery values (0, firstQuery)
Again eliding a few details, but hopefully you get the basic idea.


I think what you're describing is already present as of SQL 2003 in the form of multisets:

http://farrago.sourceforge.net/design/CollectionTypes.html

However, good luck finding a database engine that actually supports such features. Some support array types, but I haven't seen too many that support multisets.


What you're asking for is an independently evolving logical schema on top of the physical schema - this is a great idea, but performance guarantees are highly tied to the physical layout of the data.

Apache Spark is the best example I've seen of this in that you can compose schemas on the fly and it will unroll them at the end based on physical layout.


This particular use case sounds like a MATERIALIZED VIEW (and views in general) for doing what you're asking for...
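A minimal example of that, in Postgres syntax (names hypothetical):

    CREATE MATERIALIZED VIEW monthly_totals AS
        SELECT customer_id,
               date_trunc('month', created_at) AS month,
               sum(amount) AS total
        FROM payments
        GROUP BY customer_id, date_trunc('month', created_at);

    -- refreshed explicitly, e.g. from a scheduled job
    REFRESH MATERIALIZED VIEW monthly_totals;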


A materialized view would be an optimization technique used when compiling a schema containing first-class relations (essentially, memoization). First-class relations are more general though.

But first-class relations would also present some optimization challenges, so a subset of a relational system with a restricted form of first-class relations corresponding to materialized views that can be stored in table columns and used in queries would get pretty close.


What do you think of perhaps a logical (e.g. Prolog or Datalog) model for databases?


I’m not familiar with those. What do _you_ think about them? :)


They are what came to mind when you said perhaps the relational algebra was not sufficient as a theoretical model of databases.


Good point, I was conflating SQL & RDBMS.


> Already write compilers for living ...

Please pardon my ignorance, I'm really interested in this. What kind of job has you writing compilers for a living? The ones I know of are along the lines of Mozilla's, Google's, or Apple's devs working on the Rust/Go/Swift languages. Are there other kinds of jobs in this writing-compilers-for-a-living category?


Larger chip manufacturers (those with the resources to be designing their own CPUs/GPUs/similar) will likely have a dedicated team maintaining reference compilers for their designs, especially the more experimental or secret designs that are not yet proven and/or publicly available.

Those creating specialist processing chips more dynamically in FPGAs may also have someone working in that area if they intend to control the result via a higher level language.


Well, there are other compilers people are paid to write, e.g. Oracle and IBM have Java (JIT) compilers. There are also JavaScript compilers in VMs.

Plenty of companies have internal languages which need to be compiled. One would be Facebook with Hack, but in a normal company it isn’t totally improbable to go from “we need to configure some rules” to “we need to add more features to our rule language” to “we need to run our rules faster” to “oops, we are now making a compiler.” And there are specialised consultancies to help with these compilers.

However, I think GP was really alluding to working on programs that don’t really have anything to do with databases, rather than to writing compilers specifically.


There should be an international Learn SQL day, with free all day workshops by volunteers, etc. Maybe on the date of one of the first big Codd papers or something...


How does one learn SQL, then? Any good resources?



Check out SQL zoo :)


Amen to SQL


You could get a more complete solution with PostgreSQL + PostgREST. Regarding the concerns:

> Security: we don’t want people messing up our data

Row level security + roles/grants should be enough to cover this one.
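A minimal sketch of that setup (table, column and role names hypothetical):

    -- a dedicated role for API access
    CREATE ROLE api_user NOLOGIN;
    GRANT SELECT ON messages TO api_user;

    -- each row carries its tenant; the policy limits what the role can see
    ALTER TABLE messages ENABLE ROW LEVEL SECURITY;
    CREATE POLICY tenant_isolation ON messages
        USING (tenant_id = current_setting('app.tenant_id')::int);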

> On performance: SQLite has a mechanism for canceling queries that take longer than a certain threshold.

With only a timeout, all you'd need to DDoS the service is to fire the expensive queries in parallel. It would be better to filter by the query plan; in PostgreSQL there's an extension for this: https://github.com/pgexperts/pg_plan_filter. Though pg also has a `statement_timeout`.
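For the timeout side, one way to cap it per role in Postgres (role name hypothetical, from the sketch above):

    -- any single statement from the API role is cut off after 2 seconds
    ALTER ROLE api_user SET statement_timeout = '2s';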

> Hiding implementation details

You can expose a dedicated schema for your API that doesn't contain a single table but only views and stored functions. Migrating this schema should be easy thanks to transactional DDL.
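Sketched out, that dedicated schema might look like this (schema, table and role names hypothetical):

    CREATE SCHEMA api;
    CREATE VIEW api.messages AS
        SELECT id, sender, body, created_at FROM internal.messages;
    GRANT USAGE ON SCHEMA api TO api_user;
    GRANT SELECT ON api.messages TO api_user;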

That being said, I agree with the sentiment about GraphQL, besides the extra query language, there's an additional impedance mismatch since now you have to map your DB schema to the GraphQL schema.


I'm always surprised that PostgREST isn't the industry standard for creating an API. It looks like it would save so much work and the corner cases can be handled without writing code in a different language.

What am I missing?


I tried it, and still have it running on a little toy service I built. It's cool, but IMO, it doesn't really seem production-ready for a large-scale service. Last time I updated it, I had to spend a full day sorting out the changes they made to how the server config and user authentication works. Last I checked, they don't even have a suggested script to run it as a service. Going by the contributor graph, it's still essentially a 1-person project.


IIRC, this was what the connection poolers (pgpool, pgbouncer), and perhaps all now-popular postgres extensions/enhancements, looked like in their initial years, as well.

In the context of a "large-scale service", my guess is that there's a chicken-and-egg problem, where an operator of such a service has to decide they want it to be production-ready enough to throw money and/or engineering resources at the project, which can attract yet more contributors.


> a chicken-and-egg problem

The way I see it, large corporations have the golden-egg-laying hen, and plebeian projects like PostgREST have little to no chance.


If I had to guess, this is due to a combination of only being a few years old and marketing (initially, the major version number was 0).


PostgraphQL


I don’t think mapping your DB is necessarily a disadvantage.

It provides a lot of benefits in modern web- and mobile-development which is where you’d use something like GraphQL.

JavaScript works better when it knows the DB schemas: not only is it more efficient for your developers, but it’s also higher quality because it allows fewer fuckups. If you’re building your DB schemas anyway, you may as well take advantage of GraphQL’s other features.

And it’s in a familiar language: any web developer can do a GraphQL query, and it’ll be efficient too. I think most developers can do SQL queries too, but not always efficiently. You can get a long way by going to W3 or Google and hacking your way through a query to get what you want. Our business analytics and our LEAN process people do it, but I’ve seen some real horrors built that way by people who knew SQL but not databases.

Build what you want, but there is a reason GraphQL is popular.


But how do you solve a typical multi-tenant SaaS with users and teams? You can’t create roles for every login user, because they may have special access-control rights for every project they participate in.


Can you give a good example? I've found that the combination of row level security and role inheritance in Postgres can handle some fairly complex cases.
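For example, a hedged sketch where per-project rights live in a membership table rather than in database roles (names hypothetical):

    ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
    CREATE POLICY project_member ON documents
        USING (EXISTS (
            SELECT 1
            FROM project_members m
            WHERE m.project_id = documents.project_id
              AND m.user_id = current_setting('app.user_id')::int
              AND m.can_read
        ));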


Note that this is solely in the context of a static data set. That makes this a relatively limited idea, since most datasets are either dynamic or small enough to just copy over. I write this not to dismiss Datasette, which looks like an excellently designed tool for the problem it's meant to tackle, but mostly as a response to the title of this particular HN submission, "SQL as an API language" which suggests maybe more than "just" querying static data.

Thing is, I really want to use/expose SQL as an API language for dynamic data.

E.g. the author writes:

> On hidden implementation details: since we are publishing static data rather than maintaining an evolving API, we can mostly ignore this issue. If you are really worried about it you can take advantage of canned queries and SQL view definitions to expose a carefully selected forward-compatible view into your data.

At my company (https://talkjs.com) we explored making our public API SQL-based for the chat & message data that our customers host with us. It would solve lots of customer requests in one go, much more so than adding a new REST resource for every kind of question a customer might want to ask our API. What's been holding us back is exactly this: Even if we'd carefully design a set of database views to guarantee backward compatibility, doing so will severely limit the kinds of database refactorings we'd be able to do. Especially when it comes to adding redundancy, denormalizing stuff, etc.

If anyone else has experience with exposing queryable (SQL or otherwise) dynamic APIs directly to customers, I'd love to hear your insights. How did you tackle this problem?


I've done it before. Some day I'll blog about it, but in the meantime here are a few places I've brought it up in HN comments:

https://news.ycombinator.com/item?id=14464929

https://news.ycombinator.com/item?id=15335915

https://news.ycombinator.com/item?id=17820410

I agree with the concern about schema flexibility, although since you'd be exposing only a derived, flattened, OLAP-oriented database, you can still change your OLTP schema without breaking compatibility. You just have to maintain the mapping from OLTP->OLAP. You could even version the OLAP schemas so that you have a controlled way of getting customers onto newer structures. If I were taking that approach I'd consider what Stripe has said about how they build "mappings" from one API version to another.

My email is in my profile if you'd like to talk more about this! SQL-as-an-API is a big interest of mine.


The Stripe article in question, for others' reference: https://stripe.com/blog/api-versioning .

Your comments strike very close to home! In one project, to support Tableau analysis for business analytics, I did the opposite of "wrapping a JSON API around a relational database" - I used https://www.torodb.com/ to wrap a derived Postgres database around our Mongo source of truth! (Laugh if you want - it got us to launch!)

GraphQL, for all its power, can't support arbitrary aggregates the way SQL can. And when working with web-scale data, you want a server to be doing those calculations, not the client.

I continue to work on systems with constantly-reprocessing data pipelines feeding the primary records that users interact with; per the above, they can feed derived databases that support SQL queries as well, whether or not we give that full flexibility to the API user.

What I should write a blog post about is how we think of optimistic updates; just like in a rich client-side app where you want to optimistically update client-side data with how you might expect it to appear after receiving confirmation of your transaction, you also want to optimistically update derived-database data with how you might expect it to appear after it flows fully through an asynchronous data pipeline that overwrites that derived database. It's optimistic-updating all the way down, with the deepest layer (a more-or-less immutable event stream, incorporating things pulled in from the outside world) operating as the source of truth.

The trick with a system like this is to build the right abstractions (and they're very domain-specific) to make it a joy to program in - more Django than PySpark - but it's definitely possible.


Maybe I am a little biased, but once you define what SQL interface you want to provide to your users (so the schema of the tables) you have basically decoupled your internal data structure from the view of the client.

At this point the only thing you need to do is to update the user facing database whenever necessary and keep it consistent with your view of the world.

Yes, you have a new problem since you now have to keep the customer view consistent with your view, but it may be a price worth paying.

The way I would test this implementation is with RediSQL (https://RediSQL.com), which I wrote (so I am definitely biased), but it may be worth testing it.


What you're describing is exactly what GraphQL sets out to solve. Most of the flexibility and declarativeness of SQL, but still sitting a layer above the raw data itself.


GraphQL is just a query language. It has exactly the same problems in this respect as SQL - once a certain query from a customer that you never imagined they'd write works, it has to keep working, even if your underlying data model totally changes.


This has a greatly reduced effect. What you have to keep updated is the accessible properties.

As I said, it's similar to writing a SQL view and giving access to that. Difference being, GraphQL is a lot more appropriate especially for web REST endpoints


> Most of the flexibility and declarativeness of SQL,

does GraphQL support aggregate functions like count, sum, average and co?


The answer is muddier than what the other comments say. It depends on how you design your schema. It can support whatever you need it to support. Aggregations aren't impossible, but they have to be designed into the schema. A very quick example query with aggregates _may_ look like this:

    {
      searchFriends(first_name: {startsWith: "John"} age: {gte: 30}) {
        count
        aggregate {
          total_age: sum(field: age)
          avg_age: average(field: age)
          name_count: distinct(field: first_name)
        }
        nodes {
          first_name
          last_name
          age
          gender
          location
        }
      }
    }
But there are many different ways to model this.


No, that's why I said most; and that's also what makes it more appropriate for APIs -- you don't want uncontrolled performance hitting you like this; if you wish to allow for aggregates, you expose them as properties.


> you don't want uncontrolled performance hitting you like this

On RDBMSes with multitenancy support, you can simply apply resource constraints and quotas (in CPU time, CPU percentage, wall-clock time, memory, disk usage, etc.) to each user and to each connection. (You just have to ensure that each separate customer ends up hitting the DBMS as a separate DB user.)

The entire point of the OLAP model is to allow clients to ask their own questions. Some of those questions may be too expensive to answer—modern DBMSes are smart enough to realize which queries those are, and allow the sysadmin to restrict users from doing them. But that restriction should always be a matter of scale, not of kind. Users should be able to do a very complex query over a small data-set, or a very simple query over a huge data-set, or a combination, as long as they aren't actually impacting the performance of the cluster for other users.


Sure. And if that's a model that is appropriate for your API users, go for it.

But usually, it won't be appropriate. And by usually, I mean it will almost never be appropriate to adopt a model like this for your users. And I say this having been in the very situation where it was appropriate and we did expose it.


Alternately, what's wrong with just tracking the expense of the user's query and then passing the costs of it back to them? Like most "scale-free" DBaaSes (Dynamo, BigQuery, etc.) do.

This is assuming your users are paying for your API, mind you. I have no idea why you'd want to provide an online relational data-access service for free. (The "free" option for OLAP is, and has always been, "we give you a .zip of CSV files, and you ETL them into your own DB to look at them." That's how things like government GIS data or Amazon's and Google's hosted "public datasets" work—you pay for online access, or you get free offline snapshot dumps.)


You are the only one who has been mentioning OLAP so far. It's just one use case among many.


OLAP is the only use-case where queries that require complex aggregate computations with potentially-unbounded runtimes are relevant. When you subtract GraphQL from SQL, the difference is "arbitrary OLAP reporting capabilities." So I think it's fair to focus on this use-case when talking about what exactly having an SQL API gets you (over having a GraphQL API.)


I'm not really sure what you're trying to communicate here. Arbitrary aggregates are not usually an appropriate thing to expose to users as part of an API. Where they're actually relevant doesn't change that.


Maybe you don't understand the original point of SQL as a language? It's a language where humans—usually, employees who are granted access to data, or customers who are paying for access to data—type up descriptions of reports they want to receive, in a formal language; send it off to a remote system; and then those reports, magically, come back. In between, there can be anything—maybe a bunch of interns manually collating data from thousands of Excel spreadsheets. But usually there's a DBMS, an automated system whose job is to do whatever is necessary to turn the data it has, into the reports people need—no matter how complex a task that turns out to be. Because it's a complex task that needs doing.

I don't think it matters whether you call that "an API"; I think your implicit point here is that "an API" is something that you expose to the general public, and this is ...not that. But if we're talking about the benefit people get from having SQL-shaped access to your data model, then this is what it is: the ability to specify complete, arbitrarily-complex reports that they want your system to calculate out for them.

To me, the entire point of "an SQL API", as opposed to any other kind of API, is that it allows the data in your system to serve the needs of people who need those reports. Other users can be satisfied with pre-defined reports. The point of SQL is to serve the needs of users with arbitrary moment-to-moment needs. Users who need to explore and decompose your data in ways you never considered. Users who need to experiment inside your data model; to investigate it, to audit it.

With any other kind of API, such people would be too constrained by your pre-built possible queries to satisfy their needs; and so would instead need a whole ETL pipeline built for them, to take your raw data out of your system, and put it into their own (OLAP) system, where they could then ask that system to generate their reports.

If you expose an SQL API, you obviate all of that—the customers or employees who need to generate complex reports from your data can just ask your system to go do that.

---

Consider a government agent tasked with auditing Facebook's GDPR compliance. How would they be enabled to do that?

Facebook certainly can't just take an offline dump of the entirety of Facebook's data and give it to the auditor, in order for them to play with it in their own OLAP systems. It won't fit, and that's a GDPR violation in-and-of-itself besides.

Alternately, dumping a random sampling of their data might miss important low-frequency edge-cases.

That leaves online examination of their data.

It's not already in relational form—most of it is in Cassandra, a NoSQL database. So Facebook can't just give them access "to their RDBMS"—they don't have one.

But none of these other systems directly support the kinds of arbitrarily-complex queries the auditor needs to do.

What would Facebook need to build, for this use-case?

Why, an SQL API gateway server, of course—one that wraps their entire data warehouse, allowing the auditor to issue arbitrary SQL queries that reach out to join together and process data from all of Facebook's services.

That's when you have "an SQL API."

---

And there are more use-cases than just "auditors."

If, for example, the reason you're collecting data is to do analytics to it (this is e.g. advertising data), then the point of your business is selling your customers the ability to arbitrarily analyze that data. You might have an online GUI for doing basic BI-style break-downs, but the only "fallback" interface that your customers will be truly satisfied with is an SQL API.


This is a lot of text to essentially say what I've been repeating: If it's the right tool, use it. I never said it's never the right tool. What you are describing isn't the usual scenario for API consumers.


No, largely because it can make query performance harder to evaluate: https://github.com/graphql/graphql-js/issues/855


I would say yes, but you have to add in the resolvers. I've built these resolvers once and they are dynamically added to the GraphQL schema for all types that support them (different number types etc), and aren't explicitly disabled or enabled.


SPARQL, part of the Linked Data / Semantic Web toolset, does support aggregate functions.


True, but wow SPARQL is a terrible language.


So why not pure SQL, which everyone knows already?


Complexity?

1) Parsing SQL is complex, and full of decades of warts in expectation (dialectal differences between databases, among other things). There are few "drop in" SQL parser libraries that are trustworthy and support much more than a bare subset of the language.

2) Query Planning/Optimization is the "secret sauce" of most SQL databases, and is extremely hard in general. To use SQL as a general API outside of an implementation tied to a specific SQL vendor you'd presumably need a "general" query planner/optimizer that becomes a mini-version of an SQL database in itself.

GraphQL's faults aside, it is quite straight-forward to parse (with multiple well known implementations of the parser), and for the most part query planning/optimization in GraphQL follows a simpler, more straight-forward map/reduce-esque "resolver" pattern than the Relational Calculus and general complexity of SQL query planners/optimizers.


> There are few "drop in" SQL parser libraries that are trustworthy and support much more than a bare subset of the language.

Libraries, no, but you don't need a parser library to parse SQL; SQL [even in its DBMS-specific dialects] is a regular language, and regular languages can be entirely formalized using a declarative grammar. You just need a grammar file in a standard format, and a parser-generator like GNU Bison to throw it through.

Conveniently, existing FOSS RDBMSes like Postgres already generate their parsers from grammars, so they have grammar files just laying around for you to reuse (or customize to suit your needs): https://github.com/postgres/postgres/blob/master/src/backend...

> Query Planning/Optimization is the "secret sauce" of most SQL databases, and is extremely hard in general. To use SQL as a general API outside of an implementation tied to a specific SQL vendor you'd presumably need a "general" query planner/optimizer that becomes a mini-version of an SQL database in itself.

Definitely the bigger problem of the two.

Personally, I would suggest reversing the whole problem formulation. You don't need to take your existing system and make it "speak SQL." Instead, it's much simpler (and, in the end, more performant) to take an existing, robust RDBMS and hollow it out into this "mini-version" that queries through to your service.

I've personally done this before: I created a Postgres Foreign Data Wrapper for my data service, then stood up a Postgres instance which has my data-source mounted into it as a set of foreign tables. I then created indices, views, stored procedures, etc. within Postgres, on top of these foreign tables; and then gave people the connection information for a limited user on the Postgres server, calling that "my service's SQL API."

In this setup, all the query planning is on the Postgres side; my data source only needs to know enough to answer simple index range queries with row tuples. I don't really have to do much thinking about SQL now that it's set up; I just needed to understand enough, once, to write the foreign data wrapper.
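The wrapper in that setup was custom, but the shape is the same as with the stock postgres_fdw extension; a rough sketch (server, role and table names hypothetical):

    CREATE EXTENSION postgres_fdw;
    CREATE SERVER data_service FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'data.internal', dbname 'warehouse');
    CREATE USER MAPPING FOR api_reader SERVER data_service
        OPTIONS (user 'readonly', password 'secret');
    CREATE FOREIGN TABLE events (id bigint, kind text, created_at timestamptz)
        SERVER data_service OPTIONS (table_name 'events');

    -- views, functions, etc. are then layered on top of the foreign table
    CREATE VIEW recent_events AS
        SELECT * FROM events WHERE created_at > now() - interval '7 days';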


> a regular language

SQL varies somewhere between a context-free language and a context-sensitive language, depending on the dialect and/or edge cases.

Pedantry aside, certainly there are open source grammars for particular dialects. But it is the whole package of "drop in" library including at least some basic semantic modelling that doesn't exist, certainly not to the extent of say an XML or a GraphQL parser library.

> Personally, I would suggest reversing the whole problem formulation. You don't need to take your existing system and make it "speak SQL." Instead, it's much simpler (and, in the end, more performant) to take an existing, robust RDBMS and hollow it out into this "mini-version" that queries through to your service.

Which again, was complexity that I mentioned trying to avoid ("outside of an implementation tied to a specific SQL vendor"). You might not think it a big deal to tie your application to a single database vendor, but that's not always a good option. GraphQL as a language/platform is largely agnostic, supports a large variety of options for backend storage and makes it very easy to support heterogeneous data sources, because it is almost always as simple as "write a function to get the data" (and maybe "write a function to guess how expensive it would be to get the data", if you are optionally using a basic quota system). It does sound like Postgres Foreign Data Wrappers make some of that possible in that very specific database environment, but it's certainly a more complex option, which gets back to my root problem statement that "complexity" is most probably the reason SQL isn't getting as much traction as an API option as, for instance, GraphQL.


> "complexity" is most probably the reason SQL isn't getting used as much traction as an API option as for instance GraphQL.

There's a much more obvious reason: there's no such thing as an "ANSI SQL wire protocol." If you want to take advantage of an existing ecosystem of client libraries that speak SQL to your server (and why wouldn't you?), you literally have to choose a specific vendor's RDBMS wire protocol to mimic, and then tell your clients to configure their client library as if your service was an instance of that RDBMS.

Amazon's Aurora? You need to speak the MySQL wire protocol.

CockroachDB? It speaks the Postgres wire protocol.

If there was such a thing as an "ANSI SQL wire protocol"—and especially if there was such a thing as a "packed binary ANSI-SQL wire protocol"—then I think you'd see adoption of SQL as a lingua franca for things that are not, themselves, SQL RDBMSes take off. As it is, SQL can't be a lingua franca of anything–because there is no "SQL standard" in the sense of an IETF RFC networking standard you can just build your server around. "ANSI SQL" is a language standard (specifically, a mapping from a grammar to a semantics); not a protocol standard.

There isn't even an SQL-wire-protocol standardization committee. Not even a working group. Not even a random FOSS project to attempt to create a "reference ANSI-SQL protocol specification." It's just... not there. (It's kind of bizarre, actually.)


But ODBC exists. Or am I missing something?


Would you use ODBC to communicate to a web API?

Also, ODBC is not quite the abstraction layer I think parent was referring to, because ODBC only (barely) abstracts getting SQL from a client to a server, it doesn't abstract any of the semantics of the communication because it entirely punts that to a "driver" model. ODBC is slightly better than RS-232, but it's about the same abstraction layer from what I've seen.


GraphQL has a lot of advantages over SQL, API-wise; namely it's a lot more natural to describe what you want to get back. But without getting into that, the point is that you usually don't want to give direct data access, with all that entails.

You still want to be declaring the fields and properties of the data that users will be querying, and that usually sits a layer above your raw data.

You can make it a SQL view if you like, and give access to those instead. But it's just a lot of hassle to solve a solved problem in an unfamiliar way.

_______

Edit: Clarifying my stance on SQL views. I'm not dismissing it as a "bad tool", but there's good reasons it's not a popular one today. Writing SQL is not easy for everyone, and it's especially not easy for potential API consumers.

The best APIs are those you grok. If you grok SQL, congratulations, but most devs don't, and APIs are usually designed for most people.

Obviously, if SQL is the right tool for the job, use SQL. But don't start doing stuff like this for ideological reasons.


I mean SQL views have been a thing for literally decades longer than GraphQL. If any technology should be accused of causing a 'lot of hassle to solve a solved problem', it's graphql, not SQL.


If your original data is flat and your app can consume it flat, no worries. But SQL is a worse fit for structured data, so you end up with an ORM on top, which is what GraphQL sort of is. Typically web apps are written with web technologies that work with structured objects, and you want something like {title: "Stapler", price: 1230, category: {id: "123", name: "Office supplies"}, mainPhoto: {url: "https://...", width: 600, height: 300}} back to feed directly in the UI.

Some databases (Postgres comes to mind) are better at letting you express this as a query which builds the structure using JSON functions in the SELECT part of the query. But it's not even close to ideal.
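For example, in Postgres the nested shape above can be built in the query itself (hypothetical tables):

    SELECT json_build_object(
        'title', p.title,
        'price', p.price,
        'category', json_build_object('id', c.id, 'name', c.name),
        'mainPhoto', json_build_object('url', ph.url, 'width', ph.width, 'height', ph.height)
    ) AS product
    FROM products p
    JOIN categories c ON c.id = p.category_id
    JOIN photos ph ON ph.id = p.main_photo_id;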


"Some databases (Postgres comes to mind) are better at letting you express this as a query which builds the structure using JSON functions in the SELECT part of the query. But it's not even close to ideal."

Depending on which version of Postgres you have (pre-9.5, I want to say), it can be a pain. There are various functions that produce JSON output and others to make it human-readable. With regards to performance I couldn't say, but I imagine it would not be as fast as GraphQL.


In this context, what do you think of Datomic? I wish they'd make it the usual open source with a paid version with more features.


I don't know Datomic very well, but everything I've read makes it sound a bit like a modern take on an object database. Certainly interesting, but as long as it's commercial software I'm not going to touch it. Its close coupling (AFAIK) to Clojure is also not interesting to me.


You don’t need an ORM, just a big JOIN.


JOINs also return flat data structures, no?


Just iterate through the results and recover the hierarchical structure. It’s not exactly rocket science:

    rows = sql("SELECT * FROM parent LEFT JOIN child ON parent.id = child.parent_id")
    results = []
    for row in rows {
        if row["parent.id"] != results.last.id {
            results.add({ id: row["parent.id"], children: [], ... })
        }
        if row["child.id"] != null {
            results.last.children.add({ id: row["child.id"], ... })
        }
    }


You have to handle arbitrarily nested hierarchies.

You also need to support referential equality — so that objects with the same type and ID map to the same in-memory JS instance — which requires building an identity map that can live across multiple paginated result set pages.

You have to support untangling the object trees for UPDATE statements, which will require tracking dirtiness to avoid unnecessary requests. Then something similar to deconstruct new objects into multiple INSERTs. Both require a knowledge of foreign-key relationships unless you run all constraints in DEFERRED mode.

Now you're building an ORM.

(Your query also makes assumptions about order and structure that you cannot make in the general case. The devil is in the details.)


But now it's no longer a declarative query. What if you want to include grandchildren? Or if you want to fetch just one child but include its parent and its siblings of another type, like a product, its category and the category's tags:

   {id: 23, name: 'Trackpad', category: {id: 42, name:'Equipment', tags: [{id:...}, {id:...}]}}
Now instead of just adding two joins you have to build a custom loop with hand-crafted conditions. Sure, you can find a way to generalize that loop and put it into a library, but still the query itself is separated from some kind of post-processing that has to be kept in sync.


Nice bit of code.

Just to clarify, wouldn't you need to order by parent_id for the row["parent.id"] != results.last.id to work? Otherwise, say you're ordering by child.date_modified, you could end up with two identical parent_ids in your array?


Yeah, I forgot the ORDER BY clause, but you get the idea.


> You can make it a SQL view if you like, and give access to those instead. But it's just a lot of hassle to solve a solved problem in an unfamiliar way.

It's kind of odd to use that to dismiss the solution that was established and widely available for 30 years longer, and which more people are probably familiar with.


See my edit. I'm not dismissing them, but the range of cases where they're the correct tool over a REST API, GraphQL, or something else... is very narrow.


I’m a huge proponent of SQL; for me it’s virtually impossible to beat as a query language.

We played with wrapping our DB layer with GraphQL recently and I would say that the experience was pretty good. In part it’s because you’re working a layer up where you can have a more pluggable model. In the first run you can model your local DB structure, but it’s easy to glue more stuff into it from other sources and generally evolve the structure.


HAVING is not the same thing as WHERE. WHERE filters the query to produce a result set. HAVING filters the result set.

    select id, count(*) - 1 as nduplicates
       from parts where color = 'red'
       group by id
       having count(*) > 1;
WHERE cannot filter on the count(*) because WHERE is applied during the query before the count is known.

I agree SQL is kind of warty, and it would be great if it were easier to compose from programs, but HAVING is not redundant to WHERE.


But that's equivalent to the following query, which doesn't use HAVING, only a plain WHERE on a nested SELECT:

    select * from (select id, count(*) - 1 as nduplicates
       from parts where color = 'red'
       group by id) _inner
       where nduplicates > 0;

Now obviously that doesn't read as easily, but WHERE can filter on the count(*) perfectly well.


I believe this should have been in response to its sibling. The set theorist in me agrees - the two operations are working on 2 different sets at 2 different levels of abstraction.


Yes, replied to the off by one post.


I find it very easy to imagine a better SQL.

Two things I'd fix:

* make it explicit when a join is expected to have 1:1 semantics vs 1:n, because it's very easy to end up with duplicated rows otherwise

* increase composability of the syntax; why do we need HAVING, why isn't WHERE enough? We have relational algebra: projections, filters, joins, folds, union, intersect, etc. But the syntax of SQL is so idiosyncratic, and at times antagonistic to the stream- and set-like intuitions behind relational algebra.


Regarding your first point, I’m trying to imagine what that would look like. So you mean something you could put on there to say (join x [1:1] on blah)? In general the mistake that’s easiest to make is to get duplicate rows from another join later, which this wouldn’t help with - but maybe I’m misunderstanding.


I don't really understand your comment about it always being join x+n that causes the problem instead of join x. What is different about later joins that is not true when the later join is the current join, when you get around to writing it? If joins for tables T1::T2 and T2::T3 are 1:1, they don't magically become 1:n when you do T1::T2::T3.

To do it right, you'd need to mark up the relations with expected cardinality (effectively unique constraints), so that the DB would be able to verify ahead of time whether a join is going to be 1:1 or 1:n. That would be a better solution than a join working fine up until the cardinality expectation is violated and it suddenly stops working.

If we had to stick with SQL syntax, it might be something like:

   select a.*, b.*
   from a
   join one b on b.id = a.b_id
(Unique constraint from primary key on b.id. It may still filter down the set of rows in a, but it won't duplicate.)

Or:

   select a.*, b.*
   from a
   join many b on b.some_key = a.some_key
(No unique constraint on b.some_key)

I think the interesting cardinality distinctions are 1:1, 1:n and 1:0.

(The above syntax is ambiguous with aliases, so it wouldn't fly as is. But it gives a flavour.)


So the idea is that when you write the queries with these explicit join requirements, you'll want the query to fail if someone changes the schema later in such a way that the requirement is no longer met? If so, it seems like something that comes up so infrequently as to not warrant the noise, but maybe I'm still misunderstanding.

What I was saying was that often people have T1::T2 (1-1) and then throw in a later join T2::T3 (1-many) without considering the implications for the T1 (maybe they were doing a count(T1.id)). At least that seems to be an aspect that confuses developers newer to sql (that the many join is duplicating other parts of the data).


It comes up quite a bit if you have repeated data in batches where the batches are a first-class concept. Any relations need to be qualified by both the normal FK id and the batch id, or else you bring back rows across all batches.

Think something like measurement samples that come in an envelope with metadata, or account updates that come in a statement, that kind of thing.


Why is writing SQL so hard for "most devs"? I was moved to an Oracle-based project (the management system for the UK's core internet) and only had a week's training on PL/SQL.

I got a high-performance award after that project was completed, and I don't consider myself a rockstar developer.


You have a week's worth more experience of training than "most devs".

"Most devs" don't directly deal with SQL these days, because ORMs are good enough for day-to-day operations (and the non-day-to-day ends up being handled by people who actually know SQL). In fact, "most devs" rarely interact with databases. A lot of data sourcing is API driven these days.

Now, I don't claim to like or dislike that particular state. But you have two choices: You can innovate, which carries risks (of being wrong; of not selling your idea correctly; of reinventing the wheel; of making the same mistakes others did; ...). Or, you can play it safe, and create APIs that most people will know how to use without having to go through a 1-week course, because short as that may sound, that's 167.5 more hours than anyone will reasonably be willing to invest in understanding your API.

So at the end of the day: Use the right tool for the job. SQL for APIs is usually not the right tool.


Out of curiosity (I haven't had the time to get up to date on it yet), how does GraphQL let you describe your returning data better? Is it because it has hierarchical types (types and collections inside types), or is it something else?


With GraphQL, you are creating what is essentially a 1:1 mapping of the structure of the data you want to get back. It's a lot more intuitive than joins and such.


A join in GraphQL looks like this:

    { getOrganisation(args) { name, employees { Name, Position } } }

(I added commas to make it easier to read, but they are not supposed to be there)

I know SQL, so I don’t think it’s necessarily easier, but what is easier, is non-sql-savvy people making efficient queries because GraphQL does a lot of the heavy lifting for you.


What about a stable materialized view (either materialized by SQL that is kept up to date with your source-of-truth tables as they're migrated... or, if necessary, by a custom real-time data pipeline to feed it) to service reads? And an event-based, insert-only schema to service writes?

I'm always inspired by Martin Kleppmann's talk about the architecture his team built at LinkedIn (and subsequently spun off as a separate business): https://www.confluent.io/blog/turning-the-database-inside-ou... . An event-sourced system like that, one that can denormalize and transform data in any way that's needed for security and performance, could theoretically support a SQL-like API. It could even support exposing SQL-like triggers for real-time updates to untrusted clients (or, in LinkedIn's case, practically limitless internal teams).

With chat being your domain, it's possible that event sourcing into a SQL database would be a great paradigm match for you. I've seen it work wonders.


Never put query APIs on your operational databases. Either use a read-only slave, or another database filled through an ETL process. This allows you to refactor your main database without impacting your API. You also avoid the performance impact of the APIs on your main database.

As an alternative to SQL you can consider OData: a standardized, REST-compatible generic query language, similar to GraphQL, but supported in BI tooling like Power BI, Tableau and Excel.


You could create the views, but add a deprecation period, log queries and automatically remind users still querying the to-be-deprecated views. A customer that runs SQL queries directly will probably tolerate having to rewrite their SQL queries from time to time.


We give people a series of dropdowns on columns on tabular data, and they can do things like IN and BETWEEN queries for date and numeric data, as well as literal equality for string data.

The catch is that they can do this filtering on columns that are, behind the scenes, retrieved via joins. And the data set being searched may be many millions of rows. And it needs to come back interactively, i.e. no more than a second or two. And the schema is customer-defined.

I implemented a syntax tree that is composed from the various columnar dropdown filters and gets passed as a big AND to the API. The API essentially receives a full syntax tree for a predicate, but each field referenced in the predicate might be available only through a series of joins, or via the main table - it requires some analysis.

The database backing this is MySQL. MySQL has a fairly simple query optimizer, and it is reliably poor at being given complex predicates over queries with lots of joins. So, we split up the query: we try and tear off ANDed predicate parts that relate to the core table, push them into a derived table (i.e. nested query) along with the sorting predicate, if possible. We then join in just the tables required for evaluating the rest of the predicate, in the next level out. And finally, in the outermost query, we do all the joins required for the final page of results, joins that are necessary to retrieve all the columns that haven't been used by filters or sorts.
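Roughly, the rewritten query shape ends up looking something like this (tables and predicates hypothetical):

    SELECT page.*, photo.url, label.name AS label_name
    FROM (
        -- middle level: join only what the remaining predicate needs
        SELECT core.*
        FROM (
            -- innermost: predicates that touch the core table only
            SELECT t.*
            FROM measurements t
            WHERE t.created_at BETWEEN '2018-09-01' AND '2018-10-01'
        ) AS core
        JOIN batches b ON b.id = core.batch_id
        WHERE b.status = 'complete'
        ORDER BY core.created_at DESC
        LIMIT 50
    ) AS page
    -- outermost: joins needed only to fill in the displayed columns
    LEFT JOIN photos photo ON photo.measurement_id = page.id
    LEFT JOIN labels label ON label.id = page.label_id;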

With this level of analysis of the predicate being executed, we could do things like reject overly complex queries. As it is, our underlying data model is limited somewhat in the expected volume on a time period basis, and we have some defaults over the maximum time window that can be searched, so it's mostly under control.

Occasionally new devs add in extra complex joins with derived tables that are a little bit too complex for interactive querying, and staying on top of that is a challenge. It's needing a bunch more performance testing before releases, whereas we've been mostly able to get by with retrospective performance monitoring until now. But the team is bigger, not every query modification gets sufficiently expert eyes, and so we need to beef up our processes.


If anyone is looking for APIs for MySQL: xmysql will be really helpful.

https://github.com/o1lab/xmysql/

Xmysql is just one command to generate REST APIs for any mysql database.


> Note that this is solely in the context of a static data set. That makes this a relatively limited idea

Not if you use the event sourcing approach.


> If anyone else has experience with exposing queryable (SQL or otherwise) dynamic APIs directly to customers, I'd love to hear your insights. How did you tackle this problem?

This is something I've been experimenting with lately. The problem with APIs is that we conceive of them as opaque abstractions. This is reassuring when you need to tweak the background and avoid breaking the informal contract a public API has with its clients. But it also means you need to start over if you want to implement joins or filters, which is what both REST and GraphQL do over SQL, which already has them. I'd also argue that SQL is a more specified and more broadly accepted standard than REST or GraphQL. So here is how it goes:

1. First, I use a SQL DSL that is written in a data-based language rather than being string based as SQL is by default. Since I write in Clojure, Clojure being a Lisp, Code being Data, this DSL is just EDN, a clone of JSON.

2. I add an abstraction layer on top of this DSL to make SQL more malleable and more lispy with respect to certain things (in particular you can't nest arbitrary SQL like you can with Lisp code).

3. I add another abstraction layer on top of the INSERT/CREATE/UPDATE/DELETE verbs (or actually the DSL's equivalents) so that any record inside any table can have its own set of custom, optional attributes/columns and can be queried, filtered and joined like any other field. Any SQL type is supported and EDN/JSON data is supported as well.

We're done with the DB, now let's turn to the API.

4. I add new verbs implemented by complex queries (point 2) that decompose into the enhanced INSERT/CREATE/UPDATE/DELETE verbs (point 3). Since the Clojure SQL DSL I'm extending supports transforming ingoing and outgoing DB data, I also use that, i.e. not everything is run in the database, and there can be multiple round trips between the server and the DB for the same verb -> this is how I blend database "code" (queries) with the service's business code.

These verbs are used to make the API deterministic (it's a critical service, you don't want the same stuff happening twice by mistake).

Example: CONJURE my_record BY {linked_client_id: 123} VALUES {field: "abc"}. Like ActiveRecord's find_or_create_by.

This verb is used as both a replacement for CREATE and SELECT (but receives/returns only one record).

This implies two things:

- The client doesn't have to store any data from my service anymore, not even IDs. It just needs to send the service data that allows it to uniquely identify the operation in question (like one of its own IDs).

- Forcing the client into this deterministic mindset allows it to think less about the state of my service, which is particularly useful against timeouts (when you're left not knowing whether the operation was carried out by the service), and it simplifies tying the client's business logic to that of the service.

This is also why we need to be able to accept arbitrary data from the client in a flexible way (point 3).
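In plain Postgres SQL, a rough equivalent of the CONJURE verb above (a hypothetical table, assuming a unique constraint on linked_client_id) might look like:

  -- Insert if the identifying key is new, otherwise return the existing row,
  -- so the client can safely retry after a timeout.
  INSERT INTO my_record (linked_client_id, field)
  VALUES (123, 'abc')
  ON CONFLICT (linked_client_id) DO UPDATE
    SET linked_client_id = EXCLUDED.linked_client_id  -- no-op update so RETURNING always yields the row
  RETURNING *;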

5. I just expose the DSL directly (remember it's JSON-like in essence, since it's Clojure code): I read the payloads sent by the client, make sure the code is restricted to the DSL (not actual evil Clojure code), then eval it.

6. As for the client, I just find the best SQL ORM in the target language and implement an adapter for my service's exposed SQL. At this point I basically just have to convert the ORM's AST into the Clojure-like form required by the service. Querying the service just feels like querying a local database.

7. The service also needs to handle basic retries against timeouts. As for "pagination", I use PostgreSQL's streaming abilities in synchrony with HTTP streaming plus lazy sequences on the client side. This also allows very large datasets to be processed. As for joins between the client's DB and my service, these can be done using a WHERE ID IN (...) clause, which might be supported by the ORM natively. This might also be the occasion to implement some caching to alleviate the network load.

This is what I tried. Having partially achieved what's sketched on this roadmap, I think the overall approach here, with respect to the concept of abstraction, is to use it as a transparent structure throughout the levels involved (DB -> server business logic -> API -> client), in a bet whose goal is not to lose the expressive power of the underlying relational model - since that's what is already used in the DB, it's how the API is designed, and most likely it's what the client is using to model its own business.

I also like how the concept of API just melts away, hidden behind the client's ORM.


A really neat project (written in haskell) for turning a postgres database into a restful API:

https://github.com/PostgREST/postgrest


The thing I disliked about it is that the APIs were very ugly. We ended up building our own tool that allows for much nicer APIs to be built: <https://github.com/KayEss/fostgres>


I wish it gave a little "hello world" example of what it lets you do.



That's just what I wanted. What kind of search skills did you use to find it? Is there a convention?


This is an amazing project. It has many of the features I always lose time writing myself for APIs.


We do this heavily in astronomy:

http://ivoa.net/documents/TAP

Although it's a bit crusty, it is heavily used. We store additional metadata about columns and we annotate the results we return.
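For anyone who hasn't met it, TAP queries are written in ADQL, an SQL dialect with astronomy-specific geometry functions; a typical cone search looks roughly like this (treat the table and column names as illustrative):

  -- ADQL: everything within 0.5 degrees of RA=180, Dec=0 (ICRS).
  SELECT TOP 20 obs_id, s_ra, s_dec
  FROM ivoa.obscore
  WHERE 1 = CONTAINS(POINT('ICRS', s_ra, s_dec),
                     CIRCLE('ICRS', 180.0, 0.0, 0.5))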

You can easily build something like this in Java with the Presto parser, including adding authorization to systems that can't/don't support it.

If Go is more of your thing, there's a parser in Vitess:

https://github.com/vitessio/vitess/blob/master/go/vt/sqlpars...

I had done some work on porting the presto parser to python (github.com/slaclab/lacquer) to also build an API, but I had some problems in my parser grammar I never figured out. I was going to revisit this again now that Antlr4 can deal with Python properly and I can reuse the presto grammar, but I haven't had the chance.

You can couple all of this with csvw as well, and get some additional nice features out + JSON-LD.


I'd rather see "SPARQL[1][2] as an API language" myself. SPARQL is criminally under-appreciated.

BTW, if anybody wants to play around with SPARQL and see some of what it can do, the Wikidata public SPARQL endpoint[3] is very nice and comes preloaded with some great examples[4].

[1]: https://en.wikipedia.org/wiki/SPARQL

[2]: https://www.w3.org/TR/sparql11-overview/

[3]: https://query.wikidata.org/

[4]: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...


This may just be down to Wikidata's implementation, but the reliance on hardcoded unique IDs everywhere is... bleh

The following query uses these:

Items: hospital (Q16917). Properties: instance of (P31), coordinate location (P625).

  #defaultView:Map
  SELECT * WHERE {
    ?item wdt:P31*/wdt:P279* wd:Q16917;
          wdt:P625 ?geo .
  }


You always need unique identifiers for things, that's pretty much core to the whole Semantic Web idea. But what makes the Wikidata approach frustrating to me is that they use these identifiers that are totally opaque. I can sort of understand why they did it that way, but it does make for ugly queries.


SQL could actually be an API language without needing to be wrapped in JSON/GraphQL. Building a GraphQL backend makes you deal with a GraphQL AST: parse it, process it and so on. GraphQL backends are hard to build because naive implementations can't "optimise" the AST and end up making individual data calls for each node in the AST. But considering the tooling around GraphQL on the server and client side, it would have been amazing to have that API tooling around SQL! Given an SQL parser, hooks into the AST to run the business logic or to add auth rules, and great client-side libraries like jooq [1] and massivejs [2] to construct SQL queries programmatically, SQL as an API could have been a very feasible and neat idea!

The similarities between GraphQL and SQL are many on the surface. Another nice parallel that the blogpost doesn't mention is the idea of query variables in GraphQL which is reminiscent of prepared statements + variables in Postgres.
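For the query-variables parallel, this is roughly what it looks like in plain Postgres (hypothetical table and columns):

  -- Prepared statement with a parameter, playing a role similar to
  -- GraphQL query variables on the client side.
  PREPARE user_by_id (integer) AS
    SELECT id, name, email FROM users WHERE id = $1;

  EXECUTE user_by_id (42);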

We believe in a similar thesis at Hasura[3], where the idea is to give app developers GraphQL on SQL (Postgres) with minimal abstraction. However, we've had to do a lot of work to support "application level" authorization so that read/write permissions are controlled through the application's auth system, making it a more complete solution.

In fact, our initial version was basically a JSON DSL very similar to datasette, and we switched to GraphQL primarily because the client-side tooling is quite amazing.

[1] https://www.jooq.org/

[2] https://github.com/dmfay/massive-js

[3] https://github.com/hasura/graphql-engine


I really want to use SQL everywhere. I'm sad that they deprecated WebSQL, and I'm looking forward to the WebAssembly implementations (IIRC there already was a project doing that).


I'll step up the challenge: use the relational model everywhere. I'm working on it:

http://tablam.org

Eventually, if this project moves forward, I intend to provide a nicer version of SQL on top of it, so it becomes possible to write:

from customers where id=1

instead of

select * from customers where id=1


This sounds unintuitive, but the most lightweight API for analytics work I've come across is a SQL view.

It provides one level of indirection from the underlying data structures (if the data changes, you can just rewrite the view), and most db's provide very good access control. The view's schema is your interface contract, so users can rely on its output.

I was all about building REST interfaces to our backend databases until I realized that SQL views were a simpler, more efficient abstraction for analytics/machine learning work.

You may be like, "SQL views aren't REST APIs! It doesn't seem kosher!"

But if you think about it, REST APIs were designed for retrieving small amounts of data quickly using simple parameters, which is useful for low-latency web applications.

For analytics/machine learning work, we generally don't want that. We usually want to retrieve large chunks of pre-computed tabular data. And there's already a language for that... SQL. And SQL databases are naturally fast for large chunks of data (with optimizations like indexing, compression, etc. your results can get even faster... plus no JSON serialization cost).

If you build a REST API to SQL datasets, you have to first (inefficiently) parse the returned JSON, and reconstruct the dataframe for downstream analytics/machine learning work. Why do that, when you can use SQL (with all of its attendant data aggregation/manipulation keywords) to efficiently retrieve data via a binary wire protocol (native drivers or ODBC/JDBC) into a dataframe-like object?
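Concretely, the "interface contract" can be as small as a view plus a grant (all names below are made up):

  -- The view is the published interface; the underlying tables can change
  -- as long as the view keeps returning the same columns and types.
  CREATE VIEW analytics.daily_orders AS
  SELECT order_date,
         region,
         COUNT(*)         AS order_count,
         SUM(total_cents) AS revenue_cents
  FROM orders
  GROUP BY order_date, region;

  -- Access control is handled by the database itself.
  GRANT SELECT ON analytics.daily_orders TO analyst_role;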


I've started using a Rails model backed by a SQL view any time I need to do something more complicated than book.author.name with Active Record.

Then I can treat it just like any other Rails model. If you add a controller and a route, you can expose it just like you would any other RESTful resource, with the caveat that it's read-only.

As an added bonus for analytics that don't need to be real-time I can turn the view into a materialized view that updates periodically for easy caching.

I'm using the Scenic gem to handle view versioning, which lets me write all of my complicated queries in a dedicated SQL file instead of an embedded string without syntax highlighting.
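For anyone who wants the flavour of the materialized-view trick mentioned above, the raw SQL side of it is just this (Postgres syntax, made-up names):

  -- Materialized view as a cheap cache for analytics that don't need to be real-time.
  CREATE MATERIALIZED VIEW monthly_sales AS
  SELECT date_trunc('month', ordered_at) AS month,
         SUM(total) AS revenue
  FROM orders
  GROUP BY 1;

  -- Refresh periodically, e.g. from a scheduled job.
  REFRESH MATERIALIZED VIEW monthly_sales;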


Can you please explain what the interface looks like a little more concretely?


It's exactly as it sounds: a SQL view on a database server. It manifests as a SQL table backed by a query. It's mind-bendingly simple.

You can connect to it via a native db connector or ODBC. It's also accessible to a wide range of tools including Excel, Tableau, or any number of programming languages.

In Python you can pull data from it into a Pandas dataframe via TurbODBC (which recently added Apache Arrow in-memory support). In R, you can connect to it via dplyr. Apache Spark can get data from it via JDBC. You can experiment with it via Jupyter Notebooks.

SQL support is ubiquitous.


Salesforce actually exposes a SQL variant in their bulk export API called SOQL, https://developer.salesforce.com/docs/atlas.en-us.soql_sosl..... It made it surprisingly easy for me to build an SF ETL tool a few years ago.
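For the curious, SOQL reads like a trimmed-down SQL where parent relationships are reached with dot notation instead of explicit joins (treat the exact objects and fields below as illustrative):

  -- SOQL: no joins; Account fields are reached through the Contact's relationship.
  SELECT Id, Name, Account.Name
  FROM Contact
  WHERE Account.Industry = 'Banking'
  LIMIT 100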


I've been thinking about this in terms of GDPR. Most companies will just give a CSV dump to the user, but what if we exported it as a SQLite database instead (or in addition to)?


Ha! We do offer powerful "personal information analytics" in terms of GDPR, https://pii-tools.com :)

It's more for auditors and DPOs though.

The challenge with GDPR is more about simplicity and taming the mess, rather than adding more options. What sort of use-case scenarios do you have in mind? What need would SQLite exports solve?


That would be nice, but companies don't really have any incentive to make it easy for you to use your data elsewhere.


Reading the title, I expected something like

  update screen
  set red=3, green=200, blue=125
  where (row-100)*(row-100) + (col-200)*(col-200) < 12345
That could be a wonderful hack, certainly if combined with triggers that generate or even react to screen updates. Carefully done joins could easily blit in character bitmaps and icons.

If you have time on your hands, feel free to implement this idea by adding a storage engine that writes to the screen to your favorite SQL database. Bonus points if you manage to run many of the updates on the GPU.


I actually implemented something like this a few years ago on top of Postgres for the CKAN Data Portal Software. Check out the docs at https://docs.ckan.org/en/2.8/maintaining/datastore.html#ckan....

CKAN is used by data.gov, data.gov.uk and many other governments to publish open data.


The problem with SQL is that most products that implement it have subtle (and not-so-subtle) differences: custom functions, types, syntax for all sorts of things, etc. SQLite is actually rather limited and lacks a lot of stuff you'd find in e.g. Postgres or MySQL, both of which tend to implement the same things very differently. That's aside from things like joins, sub-selects, etc. ANSI SQL only gets you so far.
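Even something as basic as string concatenation differs between the major engines, which is exactly the kind of thing that bites you the moment you treat "SQL" as one language (hypothetical table):

  -- Standard SQL / Postgres / SQLite
  SELECT first_name || ' ' || last_name FROM people;

  -- MySQL (|| is logical OR unless PIPES_AS_CONCAT is enabled)
  SELECT CONCAT(first_name, ' ', last_name) FROM people;

  -- SQL Server
  SELECT first_name + ' ' + last_name FROM people;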

I'd argue that decoupling how you distribute data from how you store it and how you're going to query it is probably a good thing.

CSV is indeed not a great format because it lacks structure and types (everything is a string). Flattening complex data in csv format can get ugly quickly. I recently had to deal with some proprietary object database dumped out as csv. I had some great fun trying to reverse engineer their schema from the raw data.

XML has the same problem unless you dream up some schema that explicitly states the type. JSON has some basic types for strings, booleans, and numbers. This makes it a bit more useful than XML out of the box. Both XML and JSON objects represent trees of information, whereas SQL databases represent tables of rows. Of course, some databases can store JSON (or even XML) and allow you to do some queries over that. E.g. Elasticsearch does a decent job of guessing schemas for JSON documents. It will figure out things like dates, GeoJSON, numbers, strings, etc.

So, distributing data as SQLite files is not necessarily a horrible idea, but it doesn't really solve the problem of how to move data between different data tools. What's lacking is a standard way to distribute strongly typed tabular data. Most databases can import/export CSV and native inserts for their flavor of SQL, but not much else. Most (open) data gets shipped using some custom schema on top of XML, CSV, JSON or whatever else seemed fashionable at the time.


The author is missing a few important notions:

- a GraphQL server is storage-agnostic. It allows you to mix databases, APIs, anything, into one cohesive queryable model. It acts as a hub for several data sources. Consumers don't need to care

- queries are easily composable, reusable, and can be aggregated for batching, deduplication and optimisation (can be done with a query planner, but is way more complex)

- it’s safe enough for queries to be sent directly from the client on user devices


I am all for it. Also, Postgres FDWs for network (TCP/HTTP/UDP) sockets would be great to further enhance or push the capabilities of PG.


See also

Programming in the URL string https://apenwarr.ca/log/20121218

Previously (not :-) discussed much at https://news.ycombinator.com/item?id=4939674


> On security: the data is read-only, using SQLite’s immutable mode. You can’t damage it with a query—INSERT and UPDATEs will simply throw harmless errors.

In times of data breaches, the naivety of this statement is alarming. It is not too difficult to extract (all) data from an API that allows for passing arbitrary SQL statements.

Be careful here!



The best "API language" is the one constructed to work with the app's data model. SQL constructs are relational - fine for Rails, Django, or Salesforce apps, which map a screen to a table (or table join).

Wouldn't work for Facebook, Google's knowledge graph, or the larger enterprise app vendors today. Theirs is a typed graph.


Don't use JSON, XML has been able to do that since 1996!

I actually agree with the premise – my startup is http://www.hyperfiddle.net/ – but SQL is too complicated and abstraction-resistant in a dozen different ways to be a suitable foundation.


Use PostgREST (written in Haskell) and PostgreSQL. No SQL injection, but you get a lot of the power of SQL anyways.
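Roughly, PostgREST translates URLs into parameterised SQL for you; the generated query differs in detail, but the mapping is along these lines (illustrative table):

  -- GET /films?select=title,year&year=gte.2000
  -- is answered with a query along the lines of:
  SELECT title, year
  FROM films
  WHERE year >= 2000;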


In my experience, REST APIs devolve into thin wrappers over SQL statements anyway. Might as well embrace it.


> On performance: SQLite has a mechanism for canceling queries that take longer than a certain threshold.

Can't it consider the complexity of the query and the actual database (indices, table sizes etc.) to guess how heavy the request is going to be in advance?
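For what it's worth, SQLite will at least tell you how it plans to execute a query before you run it, which is the kind of information such a guess could start from (hypothetical table):

  -- Ask SQLite for its query plan without executing the query.
  EXPLAIN QUERY PLAN
  SELECT * FROM orders WHERE customer_id = 42;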


Sure, but aren't there some things that can't be calculated in advance, like I/O, locked tables or priority queries, just to name a few?


> Sure, but aren't there some things that can't be calculated in advance

I don't mean precise calculation but a reasonable guess

> like i/o

It can be benchmarked and estimated.

> locked tables

Why would tables be locked if we work in read-only mode?


How did the data get there?


The page at the link says it's about the immutable mode, and that INSERTs and UPDATEs raise errors. For me that's a very common use case: I pre-populate a database with a huge data set (some GiBs) in a single batch, then work with the data (data science stuff) in read-only mode; some time later I add another batch of data (usually much smaller than the initial one), but database writing and reading never happen simultaneously in this scenario. In fact, Datasette seems like something I've always wanted, as it will probably let me access my SQLite databases over a network this way.


https://www.geopackage.org/ is worth a look too - although it's more "SQLite as file format" than SQLite as API.


Are there any better alternatives to SQL for requesting data?



