Does the Database Community Have an Identity Crisis? (bailis.org)
93 points by luu on May 2, 2016 | 90 comments



RDBMSs are as powerful as ever and are still improving rapidly. If you look at "what's new" in Oracle 12c [1], SQL Server 2016 [2], VoltDB [3], and Vertica [4], you can see they're not in maintenance mode. I think calling them fossils that should be burned down to fuel the next thing is way off.

IMO the main thing RDBMSs have failed to do is to capture the new generation of developers who don't have the desire or the patience to learn SQL, relational concepts, various modeling techniques, and whatever sharding or shared-nothing architectural fundamentals are needed just to scale out. This generation wants to "npm install magic-database", write everything in JavaScript, and push to the cloud without ever having to think about scale-out. This isn't necessarily a problem with RDBMSs, but RDBMSs could address it.

[1] https://docs.oracle.com/database/121/NEWFT/chapter12102.htm
[2] https://msdn.microsoft.com/en-us/library/bb510411.aspx
[3] https://docs.voltdb.com/ReleaseNotes
[4] https://community.dev.hpe.com/t5/Vertica-Blog/What-s-New-in-...


>IMO the main thing RDBMSs have failed to do is to capture the new generation of developers who don't have the desire or the patience to learn [...]

As one of this 'new generation of developers' (who happens to also work on industry database internals), I would say that I've been enormously dismayed by the amount of specialized knowledge needed to implement an optimized, portable, flexible, scalable database. Do you know what tables in your DB need indexes and why? Do you know how to optimally set up your partitions? Do you know why your one query runs fast for a week and then suddenly runs slow, and what you can do about it?

The database API is just too low-level for most people's comfort zone. Most people who want to use a database are just trying to stuff organized bits somewhere and be able to get them out quickly. Databases are smart, but the admin still has to learn and learn and learn, all this esoteric, domain-specific knowledge, just to store some damn bits.

Learning to use a database feels like learning machine code in that if you suck at it, you're going to pay the price of not having the best performance. Databases market themselves as data storage applications, but they are complex programming APIs at heart. People turn to databases to quickly solve data storage problems, not because they want to interface with some unsexy, cryptic API.


Knowing what indexes you need and why is far from being some kind of esoteric specialized knowledge of database internals. That and some basic relational modeling techniques are the very minimum prerequisites to being able to leverage an RDBMS competently. Learning these things is not as difficult as trying to use an RDBMS without learning them.

I can't say I get the comparison to machine code. SQL is probably the highest level language there is.
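
To make the index point concrete: the reasoning usually goes no deeper than matching an index to a query's WHERE and ORDER BY clauses. A minimal sketch, with hypothetical table and column names:

    SELECT * FROM orders
    WHERE customer_id = 42
    ORDER BY created_at DESC;

    -- without an index this is a full table scan; with this one,
    -- it's an index range scan that returns rows already in order
    CREATE INDEX orders_customer_created_idx
        ON orders (customer_id, created_at DESC);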


I agree.

>>Learning these things is not as difficult as trying to use an RDBMS without learning them.

Very aptly put. But I guess mastering SQL is a rather difficult task, as SQL is a very high-level language (maybe even higher-level than Haskell, minus Haskell's great type system). It may be because of this high level of abstraction that SQL seems difficult to many people. No wonder the ORM crap sells so well: it sells near-impossible dreams (read: snake oil) to naive people who don't understand much about SQL, data modeling, database systems, and the OS.

Relational modeling also requires a lot of deep thinking to arrive at an effective schema. The art of getting indexes right relies, to a large extent, on your understanding of the particular domain and the basics of relational modeling; database internals don't come into the picture much. Of course, squeezing out time/space gains here and there requires some knowledge of database internals, but once you reach that level of optimization, NoSQL-based solutions may well require you to know just as much about their implementations.

The take-home lesson I learned: these things (very high performance, efficiency, scaling) don't come for free.


> SQL is a very high level language (may be even higher than Haskell minus the great type system of Haskell).

I have a solution for that :)

https://github.com/tomjaguarpaw/haskell-opaleye/


Looks great, will try it out.


The database should already know what indexes you need just by looking at how you access the data. Explicitly choosing indexes means explicitly making time-space tradeoffs, and you have to do it at a very low level: which columns do you build indexes on? How many do you build?

It really isn't that far off from choosing the layout of a struct in C and having to know how it's packed and aligned.

>Knowing what indexes you need and why is far from being some kind of esoteric specialized knowledge of database internals.

I'll concede that, but the issue still exists. Does your database have DROP TABLE IF EXISTS or not? Some do, some don't. If it doesn't, how do you achieve the same functionality? A given database offers thousands of operations, and the syntax, performance, and semantics of each vary quite a bit.

My point is, you have to learn the specifics of the database you're using to do general purpose data storage stuff.
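
To stick with that exact example, here is a hedged sketch of the same "drop the table if it's there" intent across three dialects (table name hypothetical):

    -- Postgres, MySQL, SQL Server 2016+:
    DROP TABLE IF EXISTS staging_orders;

    -- older SQL Server (no IF EXISTS support):
    IF OBJECT_ID('staging_orders', 'U') IS NOT NULL
        DROP TABLE staging_orders;

    -- Oracle (no IF EXISTS either; catch the ORA-00942 error instead):
    BEGIN
        EXECUTE IMMEDIATE 'DROP TABLE staging_orders';
    EXCEPTION
        WHEN OTHERS THEN
            IF SQLCODE != -942 THEN RAISE; END IF;
    END;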


Yep, you can learn the basics of decent SQL in a week. If you can't even hack that, get a job at Starbucks; working in tech is not for you.


>Do you know what tables in your DB need indexes and why? Do you know how to optimally set up your partitions? Do you know why your one query runs fast for a week and then suddenly runs slow, and what you can do about it?

Yes. And I'm absolutely not an expert.

But I think you are missing an important goal of RDBs: data integrity. This is why people like to use this storage method over a NoSQL method.

Need to store subscriptions or news items? Then NoSQL might be fine. But when you need to store data relations, well, that's why it's called a Relational Data Base.

I advise you to never stop learning, even when stuff seems difficult.


Hopefully this won't come across as an ad hominem, but

you basically said: this difficult thing is difficult.

Why are restaurants so difficult? The menu is too low-level for most people's comfort zone. Most people just want to reduce their hunger level and obtain flavors.


Everything you highlight is basic level knowledge to work with databases. If you don't know that, you should either learn or not go near them.

However I will agree that "scalable" is starting to get a little complicated with all the various scale up/out options and strategies out there.


You can buy, off the shelf from any number of vendors now, a single box with 512 GB or more of main memory, 48 or more processor cores, and several terabytes of storage, with the option of SSDs. Like, click click click and a courier delivers it a couple of days later, it's that easy. Running Postgres, say, the workload these things can support is insane. Far, far more than 99% of people worrying about "scalability" will ever actually need. This is a solved problem.


True, most corporations have under 1 TB of data (per project), so it's not a big deal.

Scalability is about more than just a big box, though. Scale-up has always been easy, and should be the first thing to do, but there's still a new class of companies/projects that need HA for 100% uptime and scale-out so they can have relational access across PBs of data.

This is usually done with Hadoop or other "big data" tech, but there's no reason regular relational databases can't scale up to this and offer all the benefits and tooling they come with. Closing that gap is still very tough these days without third-party vendors or custom extensions.


Everyone even contemplating Hadoop should read this first http://aadrake.com/command-line-tools-can-be-235x-faster-tha...


That's a pointless article. When you have a few GB of data, you can use anything. Command line tools or SQLite or anything in between would've worked fine.

Realistically, anyone contemplating Hadoop or anything bigger than traditional relational databases is dealing with the hundreds of TBs to PBs range which is not going to work with some unix tools.


No, what's pointless is the typical Hadoop workload. Sure there are some people who need it but I'll wager they're not even 1% of the people using it.


OK, but this is a random tangent. People will always use things they shouldn't; that's their problem.

The point of the thread is that there's still no easy scale-out solution for relational databases to serve the data that typically gets put into proprietary data warehouses or Hadoop installations. Citus Data and MemSQL might get close, but the whole industry is still far behind where it should be on this.


I also would guess that for 95% of databases it would be enough to just run a query/performance analyzer and follow the advice it gives you.
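
In Postgres, for instance, the low-tech version of that is just asking the planner directly (hypothetical query):

    EXPLAIN ANALYZE
    SELECT * FROM orders WHERE customer_id = 42;
    -- a "Seq Scan on orders" in the output is the classic hint
    -- that an index on customer_id is missing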


I'm not sure I would consider DML and DDL statements complex programming APIs... To tune a database, you need to be able to look at what the query is trying to retrieve, know a bit about what you are storing, and know how to look at an execution plan.

You can usually add an index, but it's often good to know that if you have a correlated subselect that uses an equality match in the WHERE clause, you may be able to influence the database engine by using an EXISTS clause instead; it's probably more appropriate anyway, because that's what you are actually attempting to find.
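
Something like this, sketched on hypothetical tables:

    -- filtering through a subquery with IN:
    SELECT c.name
    FROM customers c
    WHERE c.id IN (SELECT o.customer_id FROM orders o WHERE o.total > 1000);

    -- rewritten with EXISTS, which states "at least one matching row
    -- exists" directly and can let the engine stop at the first match:
    SELECT c.name
    FROM customers c
    WHERE EXISTS (SELECT 1
                  FROM orders o
                  WHERE o.customer_id = c.id
                    AND o.total > 1000);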

The issue I take with your comment, though, is that "just store some damn bits" is not really what you are doing. You are putting actual data into a system from which it needs to be retrieved effectively.

The truth is, knowing the basics of a relational database should come easily to a developer, or at least one that has paid a bit of attention to set theory and relations.

It's sort of funny in that I don't know if those who struggle with database concepts would have much more luck with a NoSQL engine like MongoDB. Each storage technology has its own unique challenges, and eventually you are going to need to get to grips with something that seems "complex".


I've had no trouble learning the basics of relational databases. The problem is that the basics don't transfer between implementations all that conveniently.

It's like how I know the basics of processor architectures. There's registers and caches and interrupts. But now ask me to apply that to a specific machine's code? Now I've got to read the whole manual for that machine to figure out how the 'basics' apply to this particular machine.


Everyone has their stumbling blocks, so I'm not judging you; it's just that I don't think it's that hard, and it's an essential skill for nearly every app developer outside of niche industries.

But I guess I'm curious what your stumbling block is. Is it the database servers themselves, or is it something like SQL that gets you? SQL was odd to me when I first looked at it, due to its declarative nature and having to think in sets, but it was quite intuitive once I'd used it for a little bit. Indexes... I knew about them from when I was learning about C++ data structures, so that should be a transferable skill.


SQL is probably only 1/50th of what a modern database does.

What is the proper method for returning a result set from a stored procedure? Is that a basic DB skill or not? Because that's the kind of thing I end up needing to know on day 2 of the (conceptually simple) queries that I need to implement. "Oh, this query didn't work well? I've done a pseudo-close?"

The problem is that databases solve one very constrained subset of data storage needs elegantly, and then throw on a mountain of hacks to get the rest of the way there. Basic database knowledge gets you as far toward modeling real data in the same manner that basic arithmetic gets you an understanding of modern physics.


You can use a filing cabinet to quickly solve your paper storage problems, but it knows nothing of the type of papers you are stuffing into it.

When it comes time to retrieve your papers, you need to have had them organized in a fashion that makes it easy to look them up.

You're the only one who knows your data and how you need to look it up, and it's up to you to decide what goes in a file folder and how to label and sort them.

You kind of need to know ahead of time how you are going to look your files up; otherwise you are going to have a hard time finding them.

RDBMSs are like a toolset of best archiving practices implemented digitally, and they come with data integrity and a query language built in.

Some other types of storage may look sexier, but they come with trade-offs. In the end you have to live with your tool choices.


I'm not complaining that you have to build a schema, just that, once you've done so, it should be relatively hands-off, and it often isn't. But maybe I'm biased because I'm subject to a constant stream of customers' issues.


Is any of that really less the case with NoSQL?


Yes. The need for indices is easier to identify in NoSQL: if you want to grab a key in O(1), then you need an index.

However, this simplicity comes not from eschewing SQL but from the fact that the queries you typically demand are simpler.


I'm not sure I follow that line of argument. If plain SQL is too low-level for you there's a wealth of ORMs that abstract away all the nitty-gritty details.


Up front: I'm thinking about leaving NoSQL because I hate its memory model.

SQL requires joins to get a "document" back. To properly optimize that process, you'll want an index. Most DBs don't create one for you automatically (if I understand them correctly). See [1] for details about PG.

NoSQL stores like Mongo lack joins. As a result you pack most of your joinable data into a document. When you need some manner of linked relation, you can store either the key into the other collection, or a URI that goes through your API to that collection. In either case, you're hitting an indexed key (probably the PK).

[1] http://dba.stackexchange.com/questions/53809/need-for-indexe...
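
For what it's worth, here is roughly what "getting a document back" looks like on the SQL side, assuming a hypothetical posts/comments schema and Postgres's JSON functions; the correlated subselect is exactly where an index (here, on comments.post_id) earns its keep:

    SELECT json_build_object(
             'id',       p.id,
             'title',    p.title,
             'comments', (SELECT json_agg(json_build_object(
                                    'author', c.author,
                                    'body',   c.body))
                          FROM comments c
                          WHERE c.post_id = p.id)
           ) AS document
    FROM posts p
    WHERE p.id = 1;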



It's not generational. I don't think developers these days are any lazier or less gifted than developers at any other times. We're at a point in history where it's much, much cheaper at a certain point to double the number of computers you have than it is to make all of your computers twice as fast. SQL databases, especially the ones we have right now, were built with the idea that you'd have one large database server -- replication and such exists but it's an extension of the single-server model. Things like NoSQL databases and Hadoop are meant to provide a model that supports horizontal scaling better than existing relational databases do. As they mature, they are starting to take on more and more features from relational databases. And SQL databases are taking on more and more features that let them scale horizontally. I suspect at some point in the future we'll see a lot more convergence.


I didn't mean to imply that the new generation of devs are lazy and stupid. They (and you apparently) are influenced by the perception that RDBMSs are relics not designed with modern hardware in mind. This perception is wrong. Vertica and VoltDB were both designed from the beginning to be distributed. SQL Server and Oracle were not designed that way in the beginning but incorporated those capabilities over time. All of these are incredibly sophisticated, modern, high performance RDBMSs. None of them can be used effectively without some effort to acquire relational concepts, modeling, SQL, etc. This is where the "database community" has lost ground to the prevailing NoSQL solutions IMO, by not making it easier for new devs to plug and play.

One example of the sort of thing that could make RDBMSs more plug-and-play is Vertica's flex tables - a schema-free, super flexible blob of a table that lets you move forward without having to completely define your relational model.


SQL is easy. I believe that in the life of a developer you will run into SQL here and there; it's inevitable.

But we have a lot of other tools now; the RDBMS is just one of them. Plus, the problems everybody has are vastly different these days, and the solutions vary with them. Can an RDBMS be used to address them? Probably. But RDBMSs are beasts; the effort to optimize one for your specific requirement might not look worthwhile compared to just setting up a NoSQL/NewSQL solution purpose-built to perform under certain scenarios.

I believe the new gen of devs are just being more pragmatic and unconstrained about the data storage solutions they have in their toolkits, picking whatever seems to fit.


The relational model, by being based directly on set theory and first order logic, is the most robust method for managing data today.

Yes, SQL is not the best implementation; theoretically we could try out Tutorial D/D4/whatever, but it's good enough especially in the Postgres flavour which is built on decades of academic and commercial research. PostgreSQL is almost entirely declarative (the RM is declarative) - how do you propose to improve on that?

As a quick reminder, these are the things you have to build manually into your application if you're not using a relational DBMS (or check are included in your new black-box solution for managing data):

- data being shared (concurrent access, writes, etc.);

- avoiding redundancy, inconsistency (which you get for free by centralising the data into a single copy instead of having one copy per thread/program/user);

- transaction (= logical unit of work) atomicity (all or nothing): say you want to transfer money from A to B; decrease A, then increase B. What if "increase B" fails? A relational database will ensure your transaction does not half go through (see the sketch after this list);

- integrity: impossible things are prevented by constraints, anything from "an employee logging 400 hours of work this week" to parsing different date formats (because the date is stored as a string, because it's an Entity-Attribute-Value antipattern, because "the schema has to be flexible");

- easily enforced security ("nobody but finance accesses payroll tables");

- and obviously data independence, such as freedom from having to specify the physical representation of data and access techniques in your application code. (OK, this is one place where SQL is not perfect; but it's pretty good)

(Thanks to C. J. Date for the list.)
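
The transfer example from the list, as a minimal sketch (assuming a hypothetical accounts table):

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
    UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
    COMMIT;
    -- if the second UPDATE fails, a ROLLBACK (or a crash) leaves both
    -- balances untouched; the transfer never half-happens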

The sad thing is that this stuff does not seem to be taught anymore; I get a lot of business from the fact that most frameworks encourage antipatterns by design (Bill Karwin's book on the subject [1] is a great, easy read for those who can't stomach C. J. Date's 1000 page "Introduction" [2]).

[1] http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic...

[2] http://www.amazon.com/Introduction-Database-Systems-8th/dp/0...


SQL is easy. ORM is not, except through a framework that makes you forget about the SQL altogether.


ORMs are a trap. They are like some sugary drink that makes you fat, and you only realize it when it's too late.

They prevent harnessing the power of SQL by providing seemingly convenient shortcuts. Only later (when it's too late and you've already invested heavily in it) does it become clear that one would have been better off not using an ORM but using SQL directly.


ORMs are bad if you look at things from a DB perspective.

If you're an application developer, you've seen what the code dissolves into when developers roll their own queries all over the place and do manual model-to-object mapping. ORMs have appeal in environments where they provide tooling on top and where developers are expected to know how to use them (Entity Framework, Django admin). In other situations, like Clojure, I would just use something like yesql.


One trap I've seen with big ORMs is some coders stop "thinking database", which can lead to performance problems.

The "N+1 selects" problem ( http://use-the-index-luke.com/sql/join/nested-loops-join-n1-... -- running for loops against nested selects) is probably the most notorious case of this. When you learn SQL, you pretty much learn from the get go that loops are bad (eg, you avoid cursors, and process things in batch, as much as possible). I think it's a lot less obvious that this also is the case in ORM land, that for loop doesn't instinctively look too dangerous when you first start programming using an ORM. (Because in your regular code, it really wouldn't be.)

Personally, I quite like micro-ORMs like Dapper. Dapper's great at mapping query results to objects, but it still allows you to take advantage of SQL performance.


The thing with ORMs is that they buy you a lot of fancy stuff as well: scaffolding/CRUD templates, and they integrate into your development workflow (under source control, in the same language as the rest of your code base).

I'm not really a fan of ORMs. I think it's fundamentally an inferior approach, and I think Clojure/yesql or (presumably) Dapper-like micro-ORMs, where you treat query results as values instead of objects, are much better. But at the same time, an enormous amount of work went into ORMs, and a lot of developers are familiar with them to the point where they can do their job in 90% of cases; for the other 10%, a senior dev or a DBA can step in. This, IMO, is a big value proposition that shouldn't be ignored.


That's a great analogy, I'm totally with you on this one.

ORMs are painful. I only use solutions that help me reduce the amount of boilerplate code needed to map data to domain objects (Dapper being a great example in .NET land), but I really prefer writing my queries and seeing them at a glance (however ugly they may look when inlined) without having to actually run the app. Plus, the SQL produced by things like Entity Framework is not really readable, with all those "extent1", "extent2" and so on.

In addition, ORMs tie you to the lowest common denominator so if you want to take full advantage of, say, jsonb in PostgreSQL you're out of luck.

I mean, in principle they're great, and they will theoretically allow you to switch database engines, but, according to my limited experience, that is not such a frequent occurrence.

The thing is, they come with too many hidden costs, and I never bought the "no need to know any SQL" argument to begin with, as I really think a decent command of SQL is part of the fundamental abilities of a developer.


"if you want to take full advantage of, say, jsonb in PostgreSQL you're out of luck"

I don't know if you can take 'full advantage', but with the Django ORM you get special PostgreSQL fields to take advantage of Postgres jsonb. There is nothing stopping someone from developing more features specific to other RDBMSs either... and if you need to write some queries in SQL, the Django ORM lets you do that easily alongside more traditional ORM queries.


Every widely used ORM allows direct SQL. Like many other things, knowing when and where to use it is key. An ORM will be great for 90% of an app. When it is not a good fit, go direct to the DB/datastore. The point of enlightenment is not rejection of the ORM but rejection of the idea that the ORM is good for everything.


They do, but that is not the problem. The problem is that they usually try to impose a database design based on the class structure of the application, which will be something you would never design by hand and which is not optimized to take advantage of the relational model. Now you have a convoluted database structure and queries filling several screens, and it becomes hard to reason about. At that point, the fact that you can handcraft queries doesn't help.

In addition, your data usually lives longer than the class structure of the application, so basing the database design on the object structure means you can't change your class structure, because you have several terabytes of data following it that you want to keep.


If you create your database schema straight from your class structure, then you are probably going to have problems, I think that's pretty much a given.

But if you query from an ORM against a reasonably normalized database, then with a bit of thought you can almost certainly dodge many of the traps an ORM can lead the unwary into.


Then don't use bad ORMs. What you are describing is caused by devs with poor understanding using ORMs as a crutch. That same dev would probably make a crappy schema without the ORM.


ORMs are good for writing simple queries and preventing naive mistakes, like misspelling columns. They also narrow the window for certain attacks, like SQL injection.

And Django has a backdoor for you to use raw SQL.

An ORM is not a bad thing if it's not being abused.


I agree, in a twisted way. On the one hand, I like the idea of ORMs in general. But on the other hand, I never really trust them and tend to write my own SQL, unless a framework like Django pretty much makes me use its ORM (to be fair, the few projects I built on Django had a sufficiently simple data model to make the ORM rather painless).



Indeed. As far as those all-inclusive frameworks go, I really like Django, it is - for my needs, at least - fairly comprehensive without being overwhelming.


Disagree. For starters, it's clear the traditional (legacy?) RDBMS community has yet to come to terms with the realities of the CAP theorem. Google, Facebook, etc., do not, nor will they ever, use an Oracle database to run their businesses. That's a huge pile of data being ignored.


Facebook has one of the largest Vertica installations in the world.


All while being the birthplace of Cassandra.


I bet their backend billing/revenue systems do.


I'd like to point out http://www.memsql.com as one of the better NewSQL implementations with a solid distributed design that actually works and scales.


The central tension alluded to here is "database people" vs "big data people". Which is a real dichotomy: as a "big data person" I've gotten a lot of leverage out of saying "c'mon, we don't need Hadoop, we just need a relational database here, let's use Postgres", an option that tends to get dismissed culturally rather than on technical grounds. Me, I'm happy that I get to do less work and look clever while doing it, but...

Marketing and cultural positioning matter. That really isn't news. But when people have identity invested in denying that, on both sides, it's difficult to overcome.


In almost all applications, using Postgres (or virtually any other decent RDBMS) is the right technical choice.

But (and this may be cultural) wearing my "where are we in 2 years" hat makes me push for Hadoop. The why is this: data and access to data now trump applications for business value. Whereas I can run almost any of my applications on Postgres (almost), and many of them can cohabit on one server relatively happily, I cannot run all of them on it. This means that I will end up moving data from one of them to the others, but which?

Alternatively, I will shift the data from the RDBMS to Hadoop (or another data warehouse solution if I am working in oil-and-gas-three-years-ago or a hedge-fund-or-something now), where the insight that I am being kicked for daily can be extracted by combining the data product of 20 systems.

Or we can use Hadoop as the core of the data system and feed RDBMS slaves (or sometimes use Impala). This has the benefit that the data is managed and clean (ho ho ho) on the core system, and we can move at the speed of security checking and data understanding rather than the speed of budgets for writing and running data extracts.


> But (and this may be cultural) wearing my "where are we in 2 years" hat makes me push for Hadoop.

Google's F1 paper makes it clear that they were running AdWords, their most (maybe only) important product, on MySQL on Jan 1, 2012. GFS and Bigtable were not enough, and until Hadoop provides a robust ACID-compliant data store, you should probably keep your pgsql books around.


Agreed; there's no doubt that where transactions are needed, you need something that can manage transactions (in the particular way you need them managed).

I actually think that transactions on Hadoop should be strictly limited to data import à la Hive, but my argument is that the enterprise data master is best kept on Hadoop rather than in the slave systems.


>But (and this may be cultural) wearing my "where are we in 2 years" hat makes me push for Hadoop

This has been my experience as well. I worked on a data warehousing project that eventually had to be moved to Hadoop because management underestimated the scale of the data. We could have saved a lot of pain if we had just started on Hadoop.


As for how this cultural divide came to be, part of me thinks that blanket rejection of "what works" is a sound strategy for an individual.

You don't want to work on a successful project. Successful projects are the ones you have been cramming 190 man-years of business rules into, and which did not explode.

You can thank Java's accessors and refactoring capabilities; you can thank SQL's ability to produce rich views without changing the data storage. You're still stuck with having to deal with at least all the essential complexity that is there.

On the other side, preach that the future is Go-Mongo-on-the-cloud and you've got yourself a greenfield project. Rarely have I heard of a Go project that is there only because it needs Go's unfair advantage (namely hugely high concurrency that you have to cram into one single box).

Same thing goes for microservices. Microservices are: I don't want to deal with other people's code.


> The central tension alluded to here is "database people" vs "big data people".

Then there's the group who think they're big data people, but actually fall within the domain of SQLite.


That seems to be the majority of self proclaimed big data people in my experience.


Which, in fairness to and out of love for SQLite, is an enormous domain.


I may have perpetrated things like this. (20GB working set for a recommender? No problem.)


Or Excel, even


It seems like NoSQL could be left behind except for the .001% of use cases that actually require it and can't easily be replaced with extensions or (hopefully) automatable configurations of Postgres. But that would require application-level abstractions, and the database community doesn't value those enough, as evidenced by SQLAlchemy [1] not being highlighted on the homepage of every RDBMS project despite the awesome power and flexibility it gives the developer.

Specifically, a JSON column should be used to store everything other than primary keys and foreign keys, with views and indexes automatically created based on the schema defined in the application (i.e., get the schema from the ORM at deploy time and post the data to a schema/migration management system), using something like https://github.com/mwhite/JSONAlchemy
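
A hand-rolled sketch of the shape I mean, in Postgres (all names hypothetical); the tooling's job would be to generate the index and view from the application-defined schema:

    CREATE TABLE events (
        id  bigserial PRIMARY KEY,
        doc jsonb NOT NULL
    );

    -- an expression index so lookups on a field inside the JSON stay fast
    CREATE INDEX events_user_id_idx ON events ((doc->>'user_id'));

    -- a typed relational view layered over the JSON for querying
    CREATE VIEW events_typed AS
    SELECT id,
           doc->>'user_id'                    AS user_id,
           (doc->>'created_at')::timestamptz  AS created_at
    FROM events;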

It is entirely possible to implement the CouchDB or MongoDB API on top of Postgres JSON, for instance.

[1] http://www.sqlalchemy.org/


I have been interested in a CouchDB API for Postgres, but thus far I could not find anything.

One option is PouchDB, with levelDOWN* to push it to levelUP* to store it in Postgres. But that level of abstraction feels like just too much.

I also found another project that basically keeps a copy in CouchDB and syncs it over to PG, but stuff like attachments does not work.

* I'm not 100% sure about the projects and how to accomplish it.


>> It is entirely possible to implement the CouchDB or MongoDB API on top of Postgres JSON, for instance.

Sure, but that's just because someone has gone to the trouble of building a Mongo equivalent as a Postgres module.


I was hoping this would be about Place-Oriented Programming (PLOP), which confuses identity with state. It's a problem for OOP in general, where the identity of objects is predicated on their particular state. The talk http://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hic... goes over the idea in more detail.


Huh? The identity of an object allows for mutable state; it doesn't change with its state. I am the same person I was yesterday even if I'm in a different place or I cut my hair. State is predicated on identity, not the other way around. (And given identity, you can have any kind of mutable state even in a language that doesn't support mutation, since an identity can be used as a key in an immutable map; changing the map changes the state, while identity provides a constant frame of reference.)


The problem is when we have only the last version (state) of an identity, when we can't remember "how your haircut was yesterday".

In today's RDBMS world this is on your shoulders (anyone else creating columns with timestamps and PK hell?) or is left out entirely.


True, but that doesn't change what identity is. It is only a key into a table; it is up to the table to contain values indexed by time, if that is what is required.


I agree with you there. Nowadays the identity points to a single thing without any notion of time; in a perfect world it would point to a set of things with their respective dates (and the database would have operations to deal with that).

If you want to get philosophical, you can argue that in the real world everything mutates, even identities. Heck, you can even say that identity comes out of the value (the river is just in your mind; there's only water flowing). But in our digital/fuzzyless/mechanical computing models those abstractions are just unrealistic.


Identity, or objectness, is a concept distinct from being a value that is defined purely by its form. So in that sense, identity is immutable because it is defined that way; it isn't a property that falls out of the universe. At least, it's been that way since the days of Greek philosophy.

A ball is an object; it has state; if its state changes, we can recognize it as being the same ball. A point/position is a value. It doesn't make sense for it to have state; a position with a different value is simply a different position. Getting around the arrow of time is something else entirely.


"I am the same person I was yesterday even if I'm in a different place or I cut my hair."

Think so?


My identity is constant even if my state isn't.


lol, can't stop laughing. It's too bad the comments that reveal truths about the world are so often the ones that get downvoted.

I'm sure they will eventually run through enough yesterdays to realize the error of their assertion.


The only constant is that we are slaves of our future selves.


That's why the distinction is so lovely. Same identity, different value. State is a snapshot of an identity at some point in time.


That still isn't right. Identity is immutable; state is something that is... identified by identity. The key of a map entry is different from the value of the entry, and in general the key can access the value, but not vice versa.


'The article titled “Architecture of Database System” should be considered harmful'

Not sure why that article is considered harmful. I read through all of it; it was a fantastic database resource.


Traditional database design is considered unfashionable these days. You're meant to use more modern designs that eschew old-fashioned ideas like SQL and not causing massive data loss.


"And it works on real problems, like combating human trafficking"

Well, since he put it that way, why not?!


I have a counter argument.

There is now too much data-system innovation: there are scores of projects, each of which implements a particular idea, and very few of which have a large enough community to move beyond version 0.3.

We can't put our pipelines onto these because the cost of understanding and adopting them overwhelms the benefit of ditching our internal code. We are in the same situation as the NPM / Javascript folks, but our data is an enterprise asset and we have to consider the impact of a project just going away.

The vendors have held us to ransom for years, and we've had to break out because it had gotten to an industry-destroying level. Really, the bills are material to the stock price, and time and again it turns out that despite handing over $100m, the features and machines you need are not included and you need to send just another $5m over. The money stopped being used for R&D when the consolidation of the '00s happened. The vendors set themselves up as vertically integrated solution providers, with the technology as a lock-in factor rather than a competitive differentiator. This would have worked if nothing in the economy ever changed again, but oddly it turns out that our needs are radically different when competing/collaborating with Facebook and Google vs. "old corp who has gone bust now", who was our traditional enemy.

So we're at a fork. The old route of using a trusted technology partner has gone; they have all betrayed their customers every quarter for the last 20 years, and not only can't we trust them, we can absolutely predict when and how they will screw us. We flag it for every project to our execs; it's a built-in assumption. On the other hand, the big hope of open source alternatives is not arriving in the way that Linux arrived; or maybe it's arriving in the way that Linux on the desktop arrived.

The solution is that the middle tier of corporates has to get more real about open source. At the moment there's a handful of companies in the Forbes 500 who are significantly involved. CIOs and CFOs don't see the benefit; there is no case at the moment. But if there were 500 >$1m open source programs running in the corporate world rather than 20, things would be different.

There needs to be a standardization effort and a coordination layer. We also need to think through the value chain. At the moment, leaving it to the market and "revenues from professional services" is not working so well, and I am very cautious about the trend toward paid-for bits and bobs on top of open source.

The worst outcome would be to go through all this and find that we are back in vendor hell, but chained up in a different basement, wondering where the bastards in suits have gone. Instead we're going to be guessing that the guy in a cool t-shirt and designer jeans who's fiddling with a blowtorch is going to want something soon.

It's up to us to work out a new model, but what?


The problem is anti-competitive bundling. If you're big enough on the MS side, you'll get SQL Server and all its doodads for free.

Great for my job but an impossible fight to get anything else in there.


Agreed; it's a notional form of free in which your CFO pays through the nose without realizing it.

The Office up-sell strategy is really good for MS. I've always wondered why they haven't made more of Exchange as a data platform, given that they have a near corporate monopoly on it and there is vast latent value in the data inside it.

Or maybe that's why!


RDBMSs are great for a lot of situations. Alternatives like NoSQL (which stands for Not Only SQL) can be handy for other situations, like how easy it is there to recursively find relations n levels deep, which is way more annoying to write in T-SQL. The identity crisis might be because the RDBMS is not the only approach anymore?
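
(For reference, the T-SQL version of that traversal is a recursive CTE; workable, but you can see the ceremony. A sketch on a hypothetical self-referencing table:)

    WITH relations AS (
        SELECT id, parent_id, 1 AS depth
        FROM people
        WHERE id = 42                            -- start from one person
        UNION ALL
        SELECT p.id, p.parent_id, r.depth + 1
        FROM people p
        JOIN relations r ON p.parent_id = r.id   -- walk one level deeper
    )
    SELECT * FROM relations;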


I read the link.

As a working DBA, I find it breathless drivel.

Unless you're an academic trying to get a grant or tenure, of course.


What a singularly useless response.

Disagreeing is fine, but "breathless drivel"? Come on, this isn't reddit.


I couldn't have put it better myself, and it's sad others are so shortsighted that you got instantly shut down simply for not elucidating the glaringly obvious.

We have a researcher saying, hey, you DBAs who are busy building on existing systems should be making your own systems; and in the same breath acknowledging that takes 25+ people and a never-ending bucket of money. Riiiiight.

Boo hoo, SQL + DocumentDB isn't enough? I never saw any justification for the premise.


Very cool article. Tbh, I think instruction is not nearly as helpful as action. We all can complain about what other people should do, but what can we do to show what we do? Nonetheless, I think it's an important article, since databases come and go with few sticking.



