Database expert on why NoSQL mattered – and SQL still matters (medium.com/s-c-a-l-e)
132 points by jasondc on July 29, 2015 | hide | past | favorite | 37 comments



It's pretty clear that SQL won, although ideas from “NoSQL” have been assimilated, i.e., storing and querying semi-structured or unstructured data in an otherwise relational data model.


SQL did win. Sorta.

NoSQL brought a lot of new ideas. Redis might seem superficially like NDB (MySQL Cluster), but underneath, it's extremely different. What kind of sane person could design a single-threaded, single-process database? @antirez built something radical that delivers unbelievable performance.
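To make the single-threaded idea concrete, here is a toy sketch (my own illustration, not Redis code): one process, one dispatch loop, a plain dict. Because commands are applied strictly one at a time, there are no locks and no races, which is a big part of why the design is so fast and so simple.

```python
class TinyStore:
    """Toy single-threaded key-value store in the spirit of Redis."""

    def __init__(self):
        self.data = {}

    def execute(self, command, *args):
        # Commands are dispatched one at a time in a single thread,
        # so reads and writes can never race and no locking is needed.
        if command == "SET":
            key, value = args
            self.data[key] = value
            return "OK"
        if command == "GET":
            return self.data.get(args[0])
        if command == "DEL":
            return 1 if self.data.pop(args[0], None) is not None else 0
        raise ValueError(f"unknown command: {command}")

store = TinyStore()
store.execute("SET", "greeting", "hello")
print(store.execute("GET", "greeting"))  # hello
```

The real thing adds an event loop multiplexing thousands of sockets onto that one thread, but the core insight is the same: serialize everything, and concurrency control disappears.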

More innovation and invention has happened in data in the last decade than in the previous thirty, and it's because people experimented and weren't afraid to fail.

Most NoSQL datastores swap limitations in one area for advances in another. Riak, Redis, DynamoDB... S3 is probably one of the most successful examples of NoSQL, and its very limitations are crucial to its success. Each has clearly defined use cases.

Some people say "Big Data" or "NoSQL" when they really just mean Hadoop or MapReduce, but NoSQL just literally means "everything that's not SQL."

SQL is still great. Awesomeness like PostgreSQL supporting every data type imaginable, Redshift and Aurora, file-based DBs like SQLite...

So, yes, SQL won. But so did NoSQL. And all of the rest of us too.


I would say that people are too quick to jump on the new trend and don't evaluate what they actually need. In my company we use PouchDB/CouchDB because of its exceptionally good sync engine: we have around 20 nodes that are offline at least 50% of the time, and no SQL database could compete with PouchDB/CouchDB for this non-standard use case. People should choose their DB according to what they need.


Same here, we're using CouchDB to get offline data on mobiles.


SQL, as a language, will always win. There is no more elegant way to work with sets, and you can't work with data without thinking about sets.
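As a small illustration of that set-based elegance (using Python's stdlib `sqlite3` with made-up tables): UNION, INTERSECT, and EXCEPT operate on whole result sets at once, with no explicit loops anywhere.

```python
import sqlite3

# Hypothetical tables to show SQL's set operators in action.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE editors (name TEXT);
    CREATE TABLE writers (name TEXT);
    INSERT INTO editors VALUES ('ann'), ('bob'), ('cat');
    INSERT INTO writers VALUES ('bob'), ('cat'), ('dan');
""")

# Who is both an editor and a writer? One declarative set intersection,
# no iteration logic in sight.
both = conn.execute("""
    SELECT name FROM editors
    INTERSECT
    SELECT name FROM writers
    ORDER BY name
""").fetchall()
print(both)  # [('bob',), ('cat',)]
```

The same query in loop-and-dict code is easy enough to write, but the SQL version says *what* you want rather than *how* to compute it, which is exactly the point being made here.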

I really like his point that the original breakthrough of SQL was the ability to abstract the way the data is stored from the application, and that the language itself was just a natural outgrowth.


> There is no more elegant way to work with sets,

There are, but SQL is good enough and has so much inertia that it's hard to see any alternative getting much traction.

(It's kind of like JS in the browser.)


SQL is actually a pretty terrible syntax, though. Relational algebra is beautiful and elegant, but SQL is a dated, '70s-y syntax on top of that.


how/why?

To the person who only knows SQL from feeding and care of servers, this seems like complaining about the color of a car that gets 1000 miles per gallon.


The problem is people shoehorning relational problems into non relational databases. Then no one wins.


The same is true for the converse, as in both cases you get to be your own query planner and add client<->server latency on every join.


I wouldn't say it is as bad. As you scale relational databases, you end up sharding and lose a lot of the relational features, and basically end up with NoSQL anyway. In the majority of cases I see, people are not dealing with "big data" but think they are.


You can go really far with most relational database engines by adding hardware to a single instance instead of adding more instances.

> In the majority of cases I see, people are not dealing with "big data" but think they are.

Spot on. The amount of stuff you can do with a single $2,000 box these days is astounding.


I'm curious about this stance and the proliferation of [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_A...).

I'm working on a simple note taking app, using IndexedDB and offline backups. It's been interesting working with IndexedDB -- but the primary problems I'm addressing are simple with SQL (foreign keys, join tables). I'm grateful to learn about dealing with the issue (and approaching it with very different solutions) in NoSQL -- but I'm on the fence about whether I'm spending my time on something useful. I'd certainly be further along the path to working software if I could rely on something like SQL (via SQLite) being widely adopted in the browser.
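For contrast, here is roughly what those two pain points look like in SQLite (a hypothetical notes/tags schema of my own, not the poster's app): foreign keys and a many-to-many join table are each a few declarative lines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
    CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, label TEXT);
    -- The join table: each row links one note to one tag.
    CREATE TABLE note_tags (
        note_id INTEGER REFERENCES notes(id),
        tag_id  INTEGER REFERENCES tags(id),
        PRIMARY KEY (note_id, tag_id)
    );
    INSERT INTO notes VALUES (1, 'buy milk');
    INSERT INTO tags  VALUES (1, 'todo');
    INSERT INTO note_tags VALUES (1, 1);
""")

# Notes with their tags: the kind of query that takes manual cursor
# bookkeeping in IndexedDB is a single JOIN here.
rows = conn.execute("""
    SELECT n.body, t.label
    FROM notes n
    JOIN note_tags nt ON nt.note_id = n.id
    JOIN tags t       ON t.id = nt.tag_id
""").fetchall()
print(rows)  # [('buy milk', 'todo')]
```

In IndexedDB you would model this as object stores plus an index on the linking store, and walk cursors yourself to materialize the join.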


I am not sure how you can say "SQL won" - what I see is that client-accessible distributed hash tables have won (even PostgreSQL has them now). For low-latency work, SQL is completely dead now.


You are half correct.

SQL is finding itself in vogue again. Not full ANSI SQL, but derivatives/half-implemented equivalents, e.g. CQL, HiveQL, Spark SQL, Presto -- the type this article refers to (AMPLab are the original creators of Spark).

But SQL databases themselves are definitely less popular. Mainly because (a) data grows beyond their capability, (b) data becomes more streaming/real-time and (c) microservices push data into smaller disparate databases. The one big vertically scaled database at the hub of everything is definitely disappearing.


> The one big vertically scaled database at the hub of everything is definitely disappearing.

Maybe in Startupville, CA. But I think you're forgetting that it's a big world out there and there are lots of systems that are built on vertically scaled relational engines that practically print money. The vast majority of companies out there do not have Twitter-scale engineering problems to solve.


I work exclusively in Fortune 500 size enterprises.

And the big EDW that you used to find powering everything has been broken up over the years into unintegrated silos, e.g. ERP, Web, Salesforce, Payroll etc. The big trend now is to reintegrate all this data and do analytics on it. Doing this requires (a) major ETL work between completely different schemas and then (b) your data science/analytics work. In semi real time.

This article is referring to this type of workload, since this is Spark's bread and butter. You land the data in HDFS, use Spark SQL to run ETL/analytics jobs and then output the results in a single enterprise view for reporting, marketing etc. And yes, this is identical to what Twitter's analytics team would be doing.
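Stripped of the Spark machinery, the reintegration step being described is just this (a minimal, library-free sketch with hypothetical ERP and web-analytics extracts; all field names are made up): normalize each silo's schema onto a shared key, then join into one view.

```python
# Extract: two silos with completely different schemas.
erp_orders = [  # hypothetical ERP extract
    {"order_no": "A1", "cust": "C-7", "amount": 120.0},
    {"order_no": "A2", "cust": "C-9", "amount": 80.0},
]
web_sessions = [  # hypothetical web-analytics extract, different field names
    {"customer_id": "C-7", "pages_viewed": 14},
    {"customer_id": "C-9", "pages_viewed": 3},
]

# Transform: map both schemas onto the shared customer key.
sessions_by_customer = {s["customer_id"]: s for s in web_sessions}

# Load: emit a single integrated view per order, ready for reporting.
view = [
    {
        "customer": o["cust"],
        "amount": o["amount"],
        "pages_viewed": sessions_by_customer[o["cust"]]["pages_viewed"],
    }
    for o in erp_orders
]
print(view[0])
```

In the real pipeline this join is a Spark SQL statement over tables landed in HDFS, so it scales across the cluster, but the shape of the work is the same.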

With cloud tools from Azure, IBM and Amazon, this sort of analytics is going to become much more commonplace. All using SQL the language, but not SQL the database.


The enterprise I work for won't touch cloud with a 10 foot pole, and I know this because we literally got told to quit asking about it. :)

So yes, even we are building out a pretty beefy internal Hadoop cluster, so I would never say that it will be all-relational-all-the-time. But my point was more that there will be copious amounts of SSAS cubes and Oracle warehouses for the foreseeable future. They work great for their use cases and they have well known problems with well known solutions. Doing what Twitter's team is doing when you aren't Twitter might not be the best idea for everyone, after all.

In our case, we use Teradata for our work and it's quite capable of handling very large workloads, and thus we currently have no plans to spin it down in favor of the new hotness. (Even though the new Hadoop cluster positively dwarfs our TD appliance.) I'd say we have a mixture of both on the horizon, if only because our DBAs are less than cooperative about Java UDFs, so Hadoop is the easiest way for us to do complex processing against our fairly large data set.


For EDW, yes. You might see smaller federated data marts or even separately managed relational DBs all over the place. But for the OLTP systems of the vast majority of enterprises out there, the vertical single-instance big hunk of a database is still the big dawg.


> separately managed relational dbs all over the place

On the BI side, this is overwhelmingly the outcome for large companies. The business units get siloed, they build their fiefdoms, a consolidation project gets kicked off and fails, rinse, repeat. Even if the consolidation succeeds, it takes extremely strong leadership to keep it from devolving right back into silos. The tech is not the cause of this problem, so I don't foresee it being the solution to it either.


How real-time are the analytics with these implementations? When I think ETL I think daily cron jobs. Have there been advances in this space which would let me instantaneously see a lead created in Salesforce in these new reports?


We have a POC running where we stream web hit data onto HDFS in near real time (several seconds of latency perhaps). There's no reason to think you couldn't do it with other streams of data as well.

edit: Not sure about Salesforce specifically, sorry if this is too far off topic.


The company I work for is building a near-real-time setup (web real time, not hard real time; a second or so of delay) using AWS SQS and Redshift with a custom message consumer. If you keep the message consumer as stateless as possible, it's super scalable and reliable.
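The "stateless" property being described can be sketched like this (my own illustration, not the poster's code; `queue.Queue` stands in for SQS and a list stands in for Redshift): each message carries everything needed to process it, so no state survives between calls and you can run as many identical workers as you like.

```python
import json
import queue

def consume_one(q, sink):
    """Pull one message, transform it, append the result to the sink.

    The function holds no state of its own between calls, which is what
    makes it trivially safe to scale out to many parallel workers.
    """
    msg = q.get_nowait()
    event = json.loads(msg)                       # each message is self-contained
    sink.append({"user": event["user"],
                 "event": event["type"].upper()}) # trivial stand-in transform

# Stand-ins: an in-memory queue for SQS, a list for the Redshift load.
q = queue.Queue()
q.put(json.dumps({"user": "u1", "type": "click"}))
sink = []
consume_one(q, sink)
print(sink)  # [{'user': 'u1', 'event': 'CLICK'}]
```

With real SQS the same loop polls with `ReceiveMessage` and deletes on success, and at-least-once delivery means the transform should also be idempotent.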


Those derivatives, apart from not really being SQL (SQL-ish at best), target only an incredibly small subset of what SQL is.

When they say "we implement a subset of SQL" they really mean "we have a language similar to what used to be a small subset of SQL-92". And let me stress that's 92, from the times of Windows 3.1. It cannot be compared to today's Modern SQL (http://use-the-index-luke.com/blog/2015-02/modern-sql) found in modern relational databases like PostgreSQL.
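One concrete example of post-92 SQL (shown here through SQLite via Python's stdlib, since SQLite supports it too): recursive common table expressions, standardized in SQL:1999, which the SQL-92-subset dialects went without for a long time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A recursive CTE generates the numbers 1..10 and sums them -- a query
# shape (iteration expressed declaratively) that simply does not exist
# in SQL-92.
total = conn.execute("""
    WITH RECURSIVE nums(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM nums WHERE n < 10
    )
    SELECT sum(n) FROM nums
""").fetchone()[0]
print(total)  # 55
```

The same mechanism handles genuinely useful cases like walking an org chart or a bill-of-materials tree in a single query, which is the kind of thing the "subset of SQL" engines historically pushed back into application code.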


I wouldn't say that... as much as they disgust me, Oracle isn't going anywhere, and neither are MS-SQL, MySQL/MariaDB, PostgreSQL, SQLite, Firebird, DB2, and a bunch of others.

SQL allows very generalized tools to be used to query data, aside from whatever application front ends get developed. This is simply harder with other database engines. Not to mention that it's simply a good-enough tool in many use cases. Not everyone has Twitter/Facebook/Google-sized problems. Most applications can do just fine scaling up. And often too much importance is placed on hitting five nines of availability.

I really like NoSQL databases, flexible schemaless systems and horizontal scaling. That said, you can go a long way with SQL; it's just that some problems can't be solved across multiple systems, and some can't be solved on a single system... it depends on your needs.


I definitely am not suggesting they are disappearing. Only that they're becoming less popular.

And I wouldn't group SQLite in that list. It has come from nowhere to be the most popular database in use today (courtesy of iOS/Android?) and I believe it is part of an emerging trend of microdatabases.

What is happening is that SQL is becoming more popular, but on top of NoSQL systems, e.g. HDFS, Cassandra.


Which is pretty nice... beyond the language of SQL, relational databases that support full ACID compliance are important for a lot of scenarios. Usually you have to sacrifice a portion of that in order to scale horizontally, or when a single system can no longer keep up.

There's been enormous progress in databases over the past decade or so, as more companies have needed to reach the scale of Facebook, Twitter, Google, Amazon and others... Not every system needs that level of scale, but many businesses are also trying to reach very high levels of availability, which raises similar issues.


Why are the concepts of a query language different from SQL, loose typing, flexible schemas, horizontal scalability, and a lack of joins all conflated as the same thing -- "NoSQL"?


To me, the major important thing about NoSQL is the predictable behavior (because all queries are simple and there is no "query optimizer").

If I could ensure SQL never uses index/table scans and make it fail without proper indexes, it would be a major help.
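That wished-for behavior can be approximated today, at least in SQLite: inspect the plan before running and refuse anything that would fall back to a full table scan. A sketch (my own, with a hypothetical `users` table; note that SQLite's `EXPLAIN QUERY PLAN` text is informational and its exact wording can change between versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")

def run_indexed_only(sql, params=()):
    """Run a query only if its plan avoids full table scans."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is a human-readable step like
    # 'SCAN users' (full scan) or 'SEARCH users USING INDEX ...'.
    if any("SCAN" in step[-1] and "INDEX" not in step[-1] for step in plan):
        raise RuntimeError("query would table-scan; add an index")
    return conn.execute(sql, params).fetchall()

try:
    run_indexed_only("SELECT * FROM users WHERE email = ?", ("a@b.c",))
except RuntimeError as e:
    print(e)  # query would table-scan; add an index

conn.execute("CREATE INDEX idx_email ON users(email)")
print(run_indexed_only("SELECT * FROM users WHERE email = ?", ("a@b.c",)))  # []
```

It's a wrapper rather than an engine guarantee, but it turns the "silently degrade to a scan" failure mode into a hard error at development time.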


How does not having a query optimizer make queries more predictable? It makes them slower for sure, but not more predictable.

Indeed, I find NoSQL very unpredictable. When your query hits an index, the query goes well. When it doesn't, you usually end up doing a whole database scan. Plus you usually cannot use more than one index for a given query, and some NoSQL systems require an index just for ordering...


I want my queries to always behave the same and fail without an index.

(You can't have a reliable system with random behavior, where the system drops to table scans when statistics are incomplete or something.)

And yes, ordering requires an index.


I understand you want to be in "control" of how exactly a query is being executed (which indices it uses). There is a lot to be said about query optimization, but it is not "random". My guess is that most queries can be optimized automatically, and it turns out that sometimes indices are not even needed (see https://robots.thoughtbot.com/why-postgres-wont-always-use-a...) whereas the "manual" approach would always use an index.

In PostgreSQL, if you want to know which indices a query uses, just ask the system with an EXPLAIN ANALYZE query, and if it does not use any indices, create them (or live with the performance).

Also there are "Index-only scans" which I do not know much about but may well fit your approach: https://wiki.postgresql.org/wiki/Index-only_scans

So to summarize: automatic index selection is not random, and slow queries can be identified pretty easily.


> I want my queries to always behave the same and fail without an index.

So I take it you are not using any kind of cache in your memory hierarchy?


This is exactly where I observe MSSQL dropping to table scans too soon. If indexes are cached (or when using an SSD), they will be much faster than a table scan for many, many records (a table scan will always require disk access in a realistic scenario).

I agree that table scan will be faster if I need a large percentage of the table (more than can be cached).


There are many queries for which using an index is a worse query plan than running without one. Indexes are not a silver bullet.

Plus, having queries fail if they don't use an index... seems worse than just having slower (if that would even be the case) queries.


I do think it matters where and how you store your data, but I really like how he puts it simply: it's not a battle between two opposing sides. There is simply a need for a data store that is not that structured. It is much simpler to prototype and achieve an MVP with NoSQL storage, but once you are in the real world you need to provide some type of guarantee to your users in terms of data integrity.

In my opinion everybody needs to understand the differences and use each where it is most convenient: the right tool for the job.

But maybe we will start to see some type of mixup between these two systems, because as he mentions in the article, open source is driving the future of databases, and due to this, the databases of the future will be more dynamic and more synchronized with the needs of the developer.


It doesn't matter where you store your data, as long as you always remember that the data you store is the value of your enterprise.



