Specific technology choices aside, this was an incredible write-up of their migration process: thorough, organized, readable prose about a technical topic. It is helpful to read how other teams handle these kinds of processes in real production systems. Perhaps most refreshing is the description of the choices made for the various infrastructure pieces, because they are reasonable and real-world. Blog posts so often describe building a system from scratch where the latest-and-greatest software can be used at every layer. Here, though, is a more realistic mix: on the one hand, swapping out the DB for an entirely new (and better) one; on the other, finding new tools within their existing primary language to extend the API and proxy.
This was what was particularly interesting to me - that they went to the effort of writing a purely technical article on the particulars of how parts of their environment operate, and to publish it on their platform even when that's not the sort of content they're known for.
> A blog by the Guardian's internal Digital team. We build the Guardian website, mobile apps, Editorial tools, revenue products, support our infrastructure and manage all things tech around the Guardian
Hi! Thanks for your comments. I'm one of the authors of this post. It is the same platform at the moment (just not tagged with editorial tags, so it stays away from the fronts), though sometimes the team that approves non-editorial posts to the site can be concerned about us writing about outages and the like, as it might carry a 'reputational risk'. So we may end up migrating to a different platform in the future so we can publish more quickly - we'll see!
In our era of deplatforming, a publisher publishing on something like Medium seems antithetical, even for a dev team that just wants to get words out. Should you spend some cycles on the dev blog? Probably, but you should also split-test the donation copy, get the data warehouse moved forward for 2019 initiatives, and fix 1200 other issues. Thanks for sharing a great post. I shared it with my team and we all learned a lot.
It's a bit disturbing to me that they seem to be using AWS for confidential editorial work.
> Due to editorial requirements, we needed to run the database cluster and OpsManager on our own infrastructure in AWS rather than using Mongo’s managed database offering.
In a happy world the Guardian wouldn't rely on a company we spend a lot of time reporting on for unethical practices (tax avoidance, worker exploitation, etc.) - but we decided it was the only way to compete. One of the big drivers was a denial of service attack on our datacentre on Boxing Day 2014 - not an experience any of us want to have to deal with again.
>Since all our other services are running in AWS, the obvious choice was DynamoDB – Amazon’s NoSQL database offering. Unfortunately at the time Dynamo didn’t support encryption at rest. After waiting around nine months for this feature to be added, we ended up giving up and looking for something else, ultimately choosing to use Postgres on AWS RDS.
Exactly. As I read the original article, which mentions "encryption-at-rest", there was a voice in my head crying: "No, what they need is E2EE". That would enable the authors to write confidential drafts of the articles, no matter where the data is stored (and AWS would be perfectly fine of course).
Disclaimer: The voice in my head does not come out of nowhere. I am building a product which addresses this: https://github.com/wallix/datapeps-sdk-js is an API/SDK solution for E2EE. A sample app integration is available at https://github.com/wallix/notes (you can switch between the master and datapeps branches to see the changes from the E2EE integration)
In which case they could've just used a separate encryption layer on top of any database, including DynamoDB. The HSM security keys available from all the clouds make this rather simple.
Encryption at rest is still important as it closes off a few attack/loss vectors: mis-disposed hard drives, re-allocated hosts. I'm probably missing a few others.
Sadly we don't trust our security practices anywhere near enough for that! Secret investigations happen in an air-gapped room on computers with their network cards removed, then get moved across to the main CMS when they're ready to publish.
Probably not, no, until they were about to be published. I imagine that the choice between "run an entire data centre ourselves, store everything there" and "use AWS, but keep high sensitivity stories on local machines" is an easy one.
After all, the client computer that connects to the CMS is just as, or more likely to be compromised. I wouldn't be surprised if the coverage (or at least parts of it) were edited on airgapped laptops.
> the choice between "run an entire data centre ourselves, store everything there"
If those were the only two choices, you might be right. But the resources needed for the actual CMS functionality sound modest enough to run independently of the main website.
> the client computer that connects to the CMS is just as, or more likely to be compromised
They're using AWS VPC (Virtual Private Cloud), which isn't open to the world (you use a VPN to bridge the VPC into your internal network) and in which you can spin up dedicated instances that don't share underlying hardware with other AWS customers.
This is pretty much how all Guardian articles are formatted. Some of their regular pieces could be called "blog posts" - Felicity Cloake's cooking series comes to mind.
Guess it makes sense to reuse the platform that already has the templates rather than use another platform and reimplement the design.
Totally agreed. This is pretty much the definitive guide on how to perform a high stakes migration where downtime is absolutely unacceptable. It's extremely tempting, particularly for startups, to simply have a big-bang migration where an old system gets replaced by something else in one shot. I have never, ever seen that approach work out well. The Guardian approach is certainly conservative but it's hard to read that article and conclude anything other than that they did the right thing at every step along the way.
Well done and congratulations to everyone on the team.
Yeah it did take a long time! Part of this though was due to people moving on/off the project a fair bit as other more pressing business needs took priority. We sort of justified the cost due to the expected cost savings from not paying for OpsManager/Mongo support (as in the RDS world support became 'free' as we were already paying for AWS support) - which took the pressure off a bit.
Another team at the Guardian did a similar migration but went for a 'bit by bit' approach - migrating a few bits of the API at a time - which worked out faster, in part because stuff was tested in production more quickly. Our approach with the proxy, whilst imitating production traffic, didn't actually serve Postgres data to users until 'the big switch' - so it wasn't really a continuous delivery migration!
The article mentions several corner cases that weren’t well covered by testing and caused issues later. What sort of test tooling did you use, Scalacheck?
Agreed! I don't think enough engineering orgs appreciate the value of a great narrative on any technical topic.
Part of my duties at work requires me to deal with "large" issues. While a solution to them is usually necessary, and tends to be delivered quickly and at high quality, I've seen the analyses that come after them vary in quality.
Good writeups tend to stick around in people's memories and become company culture, and drive everyone to do better. Bad writeups are forgotten, and thus the lessons learned from them are forgotten as well.
This particular article stands out for me. English is not my first language, and I've spent most of my life dealing with very fundamental technical details, so most of my writeups aren't the best. I'm going to bookmark this one and come back to it to learn how to write accessible technical narratives.
I was actually a little confused by the article - it seems to go up and down in terms of technical depth. It feels like it was written by several people. The hyperlink to “a screen session” was odd as well ... ammonite hyperlink I get but... screen is a pretty ancient tool... people either know it or can find out about it. Like you link to screen but not “ELK” stack?
I like the article but it was a bit hard for me to consume with multiple voices in different parts.
But later versions were rock solid, and I've maintained Mongo installations at many startups and SMEs. Once you set up alerts for disk/memory usage, off you go. Works like a charm 99% of the time.
I'd posit it's more a matter of maturing a new paradigm. There are a lot more edge cases you have to cover as NoSQL becomes more popular for production at scale.
SQL has decades of production maturation behind it, and wider domain knowledge.
I'm sure there's some of that. But a lot of the early problems were a bit more weighted towards poor engineering in general, IIRC. For example, I seem to recall an early problem was truncating large amounts of data on crash occasionally.
I think you're asking the wrong question. The question should be: How did MongoDB become so successful?
IMO, the reason is that newer developers faced the choice of learning SQL or learning to use something with a Javascript API. MongoDB was the natural choice because they excelled at being accessible to devs who were already familiar with Javascript and JSON.
Not only that, their marketing/outreach efforts were also aimed at younger developers. When was the last time you saw a Postgres rep at a college tech event?
I think you'll enjoy the series then; I spent several months investigating and made the same point about JSON and the Javascript-like CLI (plus great Node support, plus savvy marketing). For example:
> 10gen's key contributions to databases — and to our industry — was their laser focus on four critical things: onboarding, usability, libraries and support. For startup teams, these were important factors in choosing MongoDB — and a key reason for its powerful word of mouth.
> I think you're asking the wrong question. The question should be: How did MongoDB become so successful?
Marketing, marketing, and more marketing. Mongo was written by a couple of adtech guys.
> Not only that, their marketing/outreach efforts were also aimed at younger developers. When was the last time you saw a Postgres rep at a college tech event?
I remember being underwhelmed by two things at the one MongoConf I went to earlier this decade:
1.) My immediate boss was an unfathomable creep who was there mostly to pick up women
2.) Mongo was focused on how to work around the problems (e.g. aggregate framework) rather than how to solve them.
I can't recall ever seeing a Postgres rep, but I can recall having worked out a PostGIS bug with a fantastically tight feedback loop. The Postgres documentation and community are nothing short of amazing.
Meanwhile with Mongo I watched as jawdropping bugs languished. IDGAF what the reps say, anyone with even a few years experience should've been able to see through the bullshit that Mongo/10gen was/is selling.
> IMO, the reason is that newer developers faced the choice of learning SQL or learning to use something with a Javascript API.
The thing I dislike about this type of comment – although I now notice yours doesn't explicitly say this – is the implication that devs don't like SQL because they're lazy or stupid. Well, sometimes that is probably true! But there are some tasks where you need to build the query dynamically at run time, and for those tasks MongoDB's usual query API, or especially its aggregation pipeline API, are genuinely better than stitching together fragments of SQL in the form of text strings. Injection attacks and inserting commas (but not trailing commas) come to mind as obvious difficulties. For anyone not familiar, just look at how close to being a native Python API pymongo is:
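Something along these lines (an illustrative sketch, with a made-up collection and fields):

    # Build an aggregation pipeline out of plain Python dicts and lists (pymongo);
    # the "orders" collection and its fields here are made up for illustration.
    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

    pipeline = [
        {"$match": {"status": "shipped"}},            # filter documents
        {"$unwind": "$items"},                        # one output doc per array element
        {"$match": {"items.qty": {"$gt": 10}}},       # filter the sub-records
        {"$group": {"_id": "$customer_id",
                    "total": {"$sum": "$items.qty"}}},
        {"$sort": {"total": -1}},
    ]

    # Because the pipeline is just data, stages can be appended, removed or
    # parameterised at run time - no string stitching, no injection worries.
    for doc in coll.aggregate(pipeline):
        print(doc["_id"], doc["total"])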
Of course you could write an SQL query that does this particular job and is probably clearer. But if you need to compose a bunch of operations arbitrarily at runtime then using dicts and lists like this is clearly better.
Of course pipelines like this will typically be slow as hell because arbitrary queries, by their nature, cannot take advantage of indices. But sometimes that's OK. We do this in one of our products and it works great.
With JSONB and replication enhancements, Postgres is close to wiping out all of MongoDB's advantages. I would love to see a more native-like API like Mongo's aggregation pipeline, even if it's just a wrapper for composing SQL strings. I think that would finish off the job.
Elixir's primary database wrapper, Ecto [0], lets you dynamically build queries at runtime, and also isn't an ORM. Here are two examples directly from the docs:
    # Query all rows in the "users" table, filtering for users whose age is > 18, and selecting their name
    "users"
    |> where([u], u.age > 18)
    |> select([u], u.name)

    # Build a dynamic query fragment based on some parameters
    dynamic = false

    dynamic =
      if params["is_public"] do
        dynamic([p], p.is_public or ^dynamic)
      else
        dynamic
      end

    dynamic =
      if params["allow_reviewers"] do
        dynamic([p, a], a.reviewer == true or ^dynamic)
      else
        dynamic
      end

    from "posts", where: ^dynamic
Across all the different means of interacting with a database I have experience with (from full-fledged ORMs like ActiveRecord, to sprocs in ASP.NET), I've found that it offers the best compromise between providing an ergonomic abstraction over the database, and not hiding all of the nitty-gritty details you need to worry about in order to write performant queries or use database-specific features like triggers or window functions.
My main point, though, is that you don't need to reach for NoSQL if all you need is a way to compose queries without string interpolation.
As I said to a sibling response, this is not a substitute for Mongo's aggregation pipeline unless it can do analogous things with Postgres's JSONB fields. For example, can it unwind an array field, match those subrecords where one field (like a "key") matches a value and another field (like a "value") exceeds an overall value, and then apply this condition to filter the overall rows in the table?
Also, one of the benefits of Mongo's API is that it has excellent native implementations in numerous languages (we already use C++ and Python), so a suggestion to switch language entirely is not really equivalent.
> As I said to a sibling response, this is not a substitute for Mongo's aggregation pipeline
Huh? The aggregation framework is a solution to a mongo-only problem. Most other databases are performant, but Mongo suffers wildly from coarse locking and slow performance putting things into and retrieving things from the javascript VM.
> For example, can it unwind an array field, match those subrecords where one field (like a "key") matches a value and another field (like a "value") exceeds an overall value, and then apply this condition to filter the overall rows in the table?
This sounds suspiciously like a SQL view.
Edit: But if you actually need an array in a cell, Postgres has an array type that's also a first-class citizen with plenty of tooling around it.
The "this" was referring to dynamically building queries (the GP comment by me) in Ecto (the parent comment by QuinnWilton). What you've said is a non-sequitur in the context of this little discussion. My whole original point is that raw SQL isn't right in all situations, and you appear to be arguing that I just use SQL instead.
I can't speak to every ORM or database interface in existence, but ActiveRecord will happily handle Postgres arrays and let you use the built-in array functions without having to write queries by hand. Ecto is less elegant, but you can still finagle some arrays with it.
As far as views are concerned, I don't know what to tell you. Sure, you'll probably have to craft the view itself by hand. The result is that you can then use most abstractions of your choosing on top of it though.
There's also the possibility of using automation to create, update, and manage views. That lets your app be 'dynamic' with regards to new data and new datatypes, but also preserves the performance, debugging, segregation, and maintenance benefits of the underlying DB.
> Across all the different means of interacting with a database I have experience with (from full-fledged ORMs like ActiveRecord, to sprocs in ASP.NET), I've found that it offers the best compromise between providing an ergonomic abstraction over the database, and not hiding all of the nitty-gritty details you need to worry about in order to write performant queries or use database-specific features like triggers or window functions.
Ahh Elixir. My favorite language that really just tries so hard to shoot itself in the foot. I'm currently in the protracted process of trying to upgrade a Phoenix app to the current versions. Currently I'm at the rewrite it in Rust and try out Rocket + Diesel stage.
Diesel is... interesting and makes me long for Ecto (which is often used as an ORM although the model bits got split off into a different project).
Love the downvotes instead of comments. I've walked away from Elixir as the best practice deployment methodology (Distillery) is non-op on FreeBSD[1] and has been for a few months while the Distillery author is mum. All of this despite the vast love that the Elixir community seems to heap on FreeBSD.
Erlang and Elixir have plenty of promise but there simply is no good story for production deployments. Distillery and edeliver approximate capistrano, and that sounds great when it works (although I'd just as soon skip edeliver). But when it doesn't I'd much rather dig into the mess of ruby that is Capistrano than the mess of shell scripts, erlang, and god knows what else goes into a Distillery release.
Elixir is a really interesting language, but Phoenix seems to still be pretty wet behind the ears and very much in flux. Ecto too to a much smaller extent.
1: Some of the distillery scripts can communicate with epmd, some just give up.
Well... you can also use a modern ORM. I think "stitching ... text strings" is definitively not the way to go when interfacing with a SQL database. My go-to ORM is Sequel[1]. I think their API is one of the best I've seen: you can choose to use models, but you can also work directly with "datasets" (tables or views, or queries) and compose them as you like. It's really powerful and simple.
> genuinely better than stitching together fragments of SQL in the form of text strings. Injection attacks and inserting commas (but not trailing commas) come to mind as obvious difficulties.
You're using the Pymongo library as an example. Someone can just as easily use SQLAlchemy and not have to worry about those things.
I'm confused by the implication that someone doing things like the above would be writing in SQL. SQL is a little like assembly language in a game: You may need to drop down to it for some key highly-optimized areas, but you rarely need to directly use it for most tasks. While it's true that you should understand how it works so you don't generate queries that suck performance-wise, the same goes for Mongo's intricacies too.
Every language I know of has great ORMs which do this for whatever SQL flavors people tend to use on that platform. I write things like this all the time, and it gets turned into SQL for Postgres:
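For instance, a sketch of the kind of thing I mean, here with SQLAlchemy (the model and fields are invented):

    # Illustrative SQLAlchemy sketch; the User model and its fields are invented.
    from sqlalchemy import Column, Integer, String, create_engine, select
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        name = Column(String)
        age = Column(Integer)

    engine = create_engine("postgresql+psycopg2://localhost/app")

    # Compose the query as Python objects; the ORM emits parameterised SQL.
    query = select(User.name).where(User.age > 18).order_by(User.name)

    with Session(engine) as session:
        names = session.scalars(query).all()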
When using an ORM correctly (and indeed, the less I'm using any of my own bits of SQL the more this is true) I am also protected against injection attacks.
I'm not saying NoSQL has no value, but I believe it to be the wrong tool for data that lends itself to an RDBMS. If you have a bunch of documents that have deeply nested or inconsistent structures and where it makes no sense that you'd want to query by something other than the primary key, sure, it's a no-brainer to use a NoSQL system. For a CMS, which has been implemented thousands of times in RDBMSs, it is madness though. I cringe at realizing that apparently there are developers out there who have avoided learning SQL entirely in their career out of fear, and as a result have to use Mongo for every application because that's the only thing they know how to do. I'm sure they're out there, but I wouldn't hire one.
The jsonb_array_elements function is roughly similar to Mongo’s $unwind pipeline op. It explodes a JSON array into a set of rows. From there it’s pretty simple aggregates to achieve what you’re looking for.
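Roughly this shape, run from Python with psycopg2 (table, column and field names are made up):

    # jsonb_array_elements explodes a JSONB array into one row per element,
    # much like $unwind; the "orders" table and its fields are made up.
    import psycopg2

    conn = psycopg2.connect("dbname=shop")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT o.id, count(*)
            FROM orders o,
                 jsonb_array_elements(o.items) AS item  -- one row per array element
            WHERE item->>'sku' = %s                     -- match a field on the sub-record
              AND (item->>'qty')::int > %s              -- and a numeric threshold
            GROUP BY o.id
        """, ("ABC-123", 10))
        rows = cur.fetchall()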
I was evaluating Mongo a couple months back to solve roughly the same problems. Eventually discovered Postgres already had what I was looking for.
More the point Postgres has an actual array data type (and has for a while). You don't need to shove everything into a JSON/JSONB blob unless you absolutely cannot have any sort of schema.
Not only arrays, you can, with some limitations, create proper types with field names, if your ORM supports that you should use that over JSONB if it fits.
It was supposed to be clear from the context that this meant:
> Does building queries programmatically with SQLAlchemy do that?
Maybe I'm misreading your comment, but you seem to just be talking about writing queries directly in SQL.
If not, could you give an example/link of how to programmically build a query in SQLAlchemy that dynamically makes use of jsonb_array_elements? It would be hugely useful if I could do that.
I was speaking of SQL, but if you can write it in SQL you can usually map it to SQLAlchemy. If worse comes to worst, you can use text() to drop down to raw SQL for just a portion of the query.
SQLAlchemy's Postgres JSONB type allows subscription, so you can do Model.col['arrayfield']. You can also manually invoke the operator with Model.col.op('->')('arrayfield').
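A small sketch of both forms, assuming a made-up model with a JSONB column named col:

    # Both spellings of the -> operator in SQLAlchemy; the Model class is made up.
    from sqlalchemy import Column, Integer, select
    from sqlalchemy.dialects.postgresql import JSONB
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Model(Base):
        __tablename__ = "models"
        id = Column(Integer, primary_key=True)
        col = Column(JSONB)

    # Subscription renders as: col -> 'arrayfield'
    q1 = select(Model).where(Model.col["arrayfield"].isnot(None))

    # The same operator, invoked explicitly
    q2 = select(Model).where(Model.col.op("->")("arrayfield").isnot(None))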
To add to the pile of responses: in Scala, Slick is a great library that lets you compose SQL queries and fragments of queries quite effectively. (http://slick.lightbend.com/)
At my company we built a UI on top of Slick that lets users of our web app define complex triggers based on dynamic fields and conditions which are translated to type-safe SQL queries.
From my POV the rise of 'NoSQL' some years back was tied into a number of things:
- Misunderstanding by most developers of the relational model (I heard a lot of blathering about 'tabular data', which is missing the point entirely).
- The awkwardness and mismatchiness of object-relational mappers -- and the insistence of most web frameworks on object-oriented modeling.
- The fact that Amazon & Google etc. make/made heavy use of distributed key-value stores with relatively unstructured data in order to scale -- and everyone seemed to think they needed to scale at that level. (Worth pointing out that since then Google & Amazon have been able to roll out data stores that scale but use something closer to the relational model). This despite the fact that many of the hip NoSQL solutions didn't even have a reasonable distribution story.
- Simple trending. NoSQL was cool. Mongo had a 'cool' sheen by nature of the demographic that was working there, the marketing of the company itself.
I remember going to a Mongo meet-up in NYC back in 2010 or so, because some people in the company I was at at the time (ad-tech) were interested in it. We walked away skeptical and convinced it was more cargo-cult than solution.
I'm _very_ glad the pendulum is swinging back and that Postgres (which I've pretty much always been an advocate of in my 15-20 year career) is now seeing something of a surge of use.
I remember a hyperbolic readme or other such txt file for Postgres in the far-away long-ago time when everyone was on Slashdot. The author had written one of the most enthusiastic lovenotes to software I'd ever read, and that includes Stephenson's "In The Beginning Was The Commandline." It was a Thomas Wolfe level of ejaculatory keenness. I'd love to read it again if anyone else knows where I can find the file. So, even if there aren't actual Postgres reps, there are most assuredly evangelists.
Saying "I don't know SQL so I will just use JSON" really misses the point though. SQL is easy. Data is hard. NoSQL products offer to get rid of SQL which includes an implication that SQL itself was the challenge in the first place. The problem then is that you have lost one of the best tools for working with data.
I dunno that SQL is exactly easy, though. It's one thing to say "select statements are essentially identical to Python list comprehensions", but in practice I still have to look up the Venn diagram chart every time I need to join anything, and performance optimization is still a dark art. I'd say SQL is easy in the same way that Git is easy: you can get away with using just 5% of it, but you'll still need to consult an expert to sort things out when things go sideways.
You could solve that by altogether dropping the Venn diagram metaphor when reasoning about joins. This is the number one problem I see with junior devs who have a hard time grokking SQL. If you think about a join as a cartesian product with a filter, where the type of join defines the type of filter, the reasoning is extremely easy.
The hard parts of "SQL" are the hard parts of data. Joins aren't easier in Mongo. The performance optimizations you reference are tuning of a relational database, not SQL itself.
If you want to work with databases a domain specific language like SQL really provides a lot of value in solving these hard data problems.
The idea is, in relational databases, that the vast majority of the time you shouldn't have to do it. Because you're writing your queries in a higher level (nay, functional) language, the query planner can understand a lot more about what you're trying to do and actually choose algorithms and implementations that are appropriate for the shape and size of your data. And in 6 months time when your tables are ten times the size, it is able to automatically make new decisions.
More explicit forms of expressing queries have no hope of being able to do this and any performance optimization you do is appropriate only for now and this current dataset.
> I'd say SQL is easy in the same way that Git is easy: you can get away with using just 5% of it, but you'll still need to consult an expert to sort things out when things go sideways.
Mongo and Javascript don't solve that either. In fact you get additional problems by virtue of not being able to do a variety of joins. For extra points, you're going to need to go well beyond javascript with mongo if you want performance. 10gen invented this whole "aggregation framework" to sidestep the performance penalty that javascript brings to the table.
On the other side, the postgresql documentation is second to none. SQL isn't necessarily easy but the postgres documentation gives you an excellent starting point.
> You make it sound like learning SQL is like learning Assembler
It's not that learning SQL is hard. It's that people are inherently lazy. "Learn another thing on top of the thing it already took me a couple of years to learn? No thanks."
You seem like the kind of person ready and willing to learn the right tool for the job. From my experience a few years ago on an accredited computing course that covered database admin and programming, this attitude is not representative of most software engineering students _unless_ there's a specific assignment that requires particular knowledge.
Cs get degrees. And for plenty of developers out there, knowing one language (not even particularly well) gets jobs.
> It's not that learning SQL is hard. It's that people are inherently lazy. "Learn another thing on top of the thing it already took me a couple of years to learn? No thanks."
And that's a big fat mistake. There are so many ways to shoot yourself in the foot with mongo such that simply knowing the language mongo uses for most of its queries while not actually knowing the particulars of how mongo uses that language… well that's just a road to a world of hurt.
For example, when I first inherited a mongo deployment I noticed the queries were painfully slow. Ah hah says me, let's index some shit. Guess what? Creating an index on a running system with that version of mongo = segfault.
After a bunch of hair pulling I got mongo up and running and got the data indexed. But the map reduce job was STILL running so slowly that we couldn't ingest data from a few tens of sensors in real time. So I made sure to set up queues locally on the sensors to buy myself some time.
Even in my little test environment with nothing else hitting the mongo server, mongod was still completely unable to run its map reduce nonsense in a performant manner. Mongo wisdom was: shard it! wait for our magical aggregation framework! Here's the thing: working at a dinky startup we can't afford to throw hardware at it especially that early in the game. Sharding the damn thing would also bring in mongo's inflexible and somewhat magical and unreliable sharding doohickey.
So I thought back to previous experience with time series data. BTDT with MySQL, you're just trading one awful lock (javascript vm) for another (auto increment). So I set up a test rig with postgres. Bam. I was able to ingest the data around 18x faster.
And that's the thing. Mongo appeals to people who are comfortable with javascript and resistant to learning domain specific knowledge. All that appealing javascript goodness comes with a gigantic cost. If you're blindly following the path of least resistance you're in for a bad time.
P.S. plv8 is a thing, and you can script postgres in javascript if you really wanted to.
I think what happens (and I have this attitude too) is that "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language, and the nuances of the specific dialect, and which of the integration tools will work well with your workflow and pipeline. So while "sure I'll just learn SQL" is great for a personal or school project, when you've got to get something done next week, it's better to take maximal advantage of the tools/skills/workflow that you already have.
IOW, it's not just laziness, it's a kind of professional conservatism. which is partly what gets older engineers stuck in a particular mindset, but it's also a very effective learned skill. The opposite is being a magpie developer, which results in things like MongoDB taking off :)
> I think what happens (and I have this attitude too) is that "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language, and the nuances of the specific dialect, and which of the integration tools will work well with your workflow and pipeline.
You have to do the exact same things with Mongo+JS (e.g. learning when to avoid the JS bits like the plague).
learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language,
SQL is a skill that rewards investment in it 1000x over, in terms of longevity. It has spanned people’s entire careers! What’s the shelf life of the latest JS framework, 18 months at most...
Yes, I know that, and that's why I know and use SQL instead of MongoDB. But that's a very similar reason to why I've resisted learning Rust, and Ruby, and React, and Docker, and Scala, and many more. I know I could learn the utter basics in a weekend, but I also know that those basics are utterly useless in a real-world context, and I would prefer to spend the weekend hacking on my open-source project in Python or C, which I've already invested the years into. And that's how engineers age into irrelevance..
Well, that and SQL has a somewhat undeserved reputation for being easy to learn, but also easy to screw up. Like you write a simple-looking query and it turns out to have O(n^2) complexity and your system ends up bogged down in the database forever.
In practice people who fall into complexity traps are usually asking a lot more of their database engine than any beginner. It's usually not that hard to figure out the approximate cost of a particular query.
> Like you write a simple-looking query and it turns out to have O(n^2) complexity
Or you have a simple fast query with a lovely plan until the database engine decides that because you now have 75 records in the middle table instead of 74, the indexes are suddenly made of lava and now you're table-scanning the big tables and your plan looks like an eldritch horror.
> Not only that, their marketing/outreach efforts were also aimed at younger developers.
I do remember a lot of MongoDB t-shirts, cups and pens around every office I was in around 2011-2013. When I would ask they would tell me that a MongoDB developer flew halfway across the world to give them all a workshop on it.
> The question should be: How did MongoDB become so successful?
The ability to store Algebraic Data Types and values with lists, without the hassle of creating a ton of tables and JOINs. Postgres has since added JSON support, plus there are now things like TimescaleDB, which didn't exist previously.
ORMs have existed for decades so developers can use a SQL database just fine without knowing the language. So it's definitely not this.
It's more likely because Mongo is (a) extremely fast, (b) the easiest database to manage and (c) has a flexible schema, which aligns better with the dynamic languages that are more popular amongst younger developers.
Postgres is faster at JSON than Mongo. Also, the pipeline query strategy of Mongo is terrible to deal with. A schema should not be flexible; now I have to write a bunch of code to handle things that should have been enforced by the database. Postgres is incredibly easy to manage, with actual default security. I know the Mongo tutorial says not to run the default configuration - but then why is it the default configuration? It's so easy to manage that anyone can take it over for ransom.
At "large financial news company" we had a "designed for the CV" tag that applied to stupid architectural decisions (of which there were many)
One of the biggest and most expensive was using Cassandra to store membership details. Something like 4 years of work, by a team of 40, wasted by stupid decisions.
They included:
o Using Cassandra to store 6 million rows of highly structured, mostly readonly data
o hosting it on real tin, expensive tin, in multiple continents (looking at >million quid in costs)
o writing the stack in java, getting bored, re-writing it as a micro service, before actually finishing the original system
o getting bored of writing micro services in java, switching to scala, which only 2/15 devs knew.
o writing mission critical services in elixir, of which only 1 dev knew.
o refusing to use other teams' tools
o refusing to use the company wiki, opting for their own Confluence instance, which barred access to anyone else, including the teams they supported
I think the worst place I ever worked was like that. This was going back quite a number of years now, but it was a startup fired up by one of the lesser MBAs to utilise a legal loophole to slice cash off the public via a web app and a metric ton of marketing to desperate people.
Step one was hire everyone the guy had worked with at his previous company. They were all winforms / excel / SQL / sharepoint / office developers from big finance and had no idea where to go really. None of them had even touched asp.net.
Cue "what's popular". Well that was Ruby on Rails back then on top of MySQL and Linux. 4 people with zero experience pulled this stack in and basically wrote winforms on top of it. Page hits were 5-8 seconds each. Infrastructure was owned by SSH worms and they hadn't even noticed.
I think I lasted two days there before I said "I'm done".
I once saw a CEO who wrote his own Web Framework and forced the entire company to use it.
At the time, under the influence of React, the idea was to "build web applications solely based on Functional Programming". Since after years of trying no one could figure out what that meant, the company ditched the CEO and ended up wasting a couple of years of work.
It's not that you don't write anything in a new thing. It's that you start with small, less critical projects. Get your feet wet, give people a chance to get a feel for the pros and cons, that sort of thing.
Jumping right into writing "mission critical services" in the brand new language that few people at the company know well is asking for trouble.
There were no problems of scale, speed or latency. It was migrating from one terrible system to something that should be smaller, simpler, cheaper and easier to run.
The API is/was supposed to do precisely four things:
o provide an authentication flow for customers
o provide an authentication flow for enterprises to allow SSO
o handle payment info
o count the number of articles read for billing/freemium/premium
That is all. It's a solved problem. Don't innovate, do.
Spend the time instead innovating on the bits that are useful to the business and provide an edge: CRM, pattern recognition and data analytics.
This actually resonates with me, maybe not in the way Ellison intended, as I'm not familiar with the context he said it in. A bit off the main topic, but the more I revert to just using emacs for some task I previously used a .app bundle or web page for, the more I question how much we in the computer industry have just been spinning our wheels for the last 30+ years. I honestly can't really tell what value WIMP-centric GUIs have brought to the table besides fashion, let alone the endless debates, in the form of actual implementations, about the best way to build one. Possibly the best argument ever made died with Mac OS 9.
I suspect that in a year I’ll be using what is effectively an Elisp machine with an Apple logo paired with an iPhone.
Discoverability and self-documentation are generally the advantages of GUI systems. Well-built systems can be understood in a few moments without needing to consult a manual. That's almost impossible in a pure CLI environment.
Of course it's entirely possible to screw this up, modern phone-centric design standards are really bad about it for example, but in general you need to consult the manual (or Google) far less often.
I'm not convinced these are inherent properties of GUIness. We have had GUIs in use for long enough that many elements, even across slightly different GUIs, are familiar; this has allowed conventions and then conventional wisdom to develop.
Early GUI developers also put effort into making their GUIs at least somewhat intuitive, but they also bundled thick books of documentation on how to use their systems.
To a degree, they are self-documenting, but not because of their GUIness so much as their design choices; menus helped a lot with this. Menus actually are a good subject to touch upon: they are definitely one of the better conventions we developed, and they were based upon and analogous to a restaurant menu. However, their very nature as a list of commands you can issue to a GUI does also, typically but not always, limit the application. If the only options available to you are what is on the menu and there is no other interface, then an application is much more limited. This isn't even true of most restaurants, which will often allow you to order something not on the menu if they have the ingredients, equipment and expertise to make it.
But as a convention, it is not limited solely to GUIs, you can incorporate menus into any interactive interface.
I think the real innovation wasn’t GUIs, it was interactive software. The innovation beyond that is scriptable software.
What is important isn’t GUIness, but the developer’s intent. If you develop software with the intent to be self-documenting and discoverable, you will end up with an interface that is both of these things provided you did a competent job of it. A GUI might help, relying on platform conventions might help, and using common cultural conventions might help, but these aren’t the necessary ingredients for those qualities.
Emacs has the quality of self-documentation, but it is emphatically not a GUI even though it is interactive.
I would say well built systems, provided you intend to do anything productive and even slightly complex, should have a manual included, or else it isn’t a well built system.
It isn’t all bad. My laptop is 6, almost 7 years old at this point and it has received a few upgrades in that time. This change in habits does lower the minimum system requirements for its eventual replacement from “runs Mac OS X” to “runs emacs”, but I might be able to stretch its life out a bit longer now.
IMO there is still a place for schema-less document databases. It's just that Postgres's JSON columns mean you can get the best of both worlds, which makes Mongo look weak by comparison.
I would rather say, "Postgres's JSONB provides a hybrid compromise that may meet the needs of many users." JSONB feels like it's closer to creating more complex data types that are less primitive than those that SQL currently allows.
The real driver of the NoSQL movement, I believe, was that everybody wanted to be the next big social network or content aggregation site. Everybody wanted to be the next Facebook, Instagram, Twitter, etc., and that's what people were trying to build. Ginormous sites like these are one of the applications that strongly favor availability/eventual consistency over guaranteed consistency, whereas most other applications are quite the opposite.
Nobody really cares if your Instagram post shows up 10 minutes later in New York than it does in LA, and certainly not if the comments appear in similarly inconsistent order. It's one step above best-effort delivery. However, your bank, hospital, etc. often care quite a bit that their systems always represent reality as of right now and not as of half an hour ago because there's a network problem in Wichita.
The question is, "If my data store isn't sure about the answer it has, what should it do?" RDBMS says, "Error." NoSQL says, "Meh, just return what you have."
> The question is, "If my data store isn't sure about the answer it has, what should it do?" RDBMS says, "Error." NoSQL says, "Meh, just return what you have."
Even that's too simplistic. For most RDBMSes, the answer depends on how you have it configured, and usually isn't "Error". If you're using a serializable transaction isolation level, it usually means, "you might have to wait an extra few milliseconds for your answer, but we'll make sure we get you a good one." Other isolation levels allow varying levels of dirty reads and race conditions, but typically won't flat out fail the query. This is probably the situation most people are working under, since, in the name of performance, very few RDBMSes' default configurations give you full ACID guarantees.
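To make that concrete, here's a sketch of opting into serializable isolation with psycopg2 (connection details and table are made up); the trade-off is that you may need to retry on serialization failures:

    # Opt into SERIALIZABLE isolation; the database may then reject a racy
    # transaction rather than hand back a dubious answer, and we retry it.
    import psycopg2
    from psycopg2 import errors
    from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

    conn = psycopg2.connect("dbname=app")  # made-up connection details
    conn.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)

    for attempt in range(3):
        try:
            with conn, conn.cursor() as cur:
                cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = %s", (1,))
                cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = %s", (2,))
            break
        except errors.SerializationFailure:
            continue  # a conflicting concurrent transaction: try again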
To the "DB in NY knows something different from DB in LA" example, there are RDBMSes such as the nicer versions of MSSQL that allow you to have a geographically distributed database with eventual consistency among nodes. They're admittedly quite expensive, but, given some of the stories I've heard about trying to use many NoSQL offerings at that kind of scale, I wouldn't be surprised if they're still cheaper if you're looking at TCO instead of sticker price.
Many ATMs will still give you money when they're offline, and things become eventually consistent by comparing the ledger.
Shops also generally want to take orders and payments regardless of network availability, so whilst they might generally act as CP systems, they'll be AP in the event of network downtime, but will likely lose access to fraud checks, so may put limits on purchases, etc.
They're probably all CP locally and AP (or switchable from CP) from an entire system perspective.
I like the JSONB support. Not for storing data, but for querying it when stuck with shitty drivers.
Some array magic and jsonb_agg and suddenly you can get easy-to-decode JSON results instead of having to play with rows in your app. Yes, you can also do it with xml_agg, but these days people tend to consider anything XML as evil (they're wrong).
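Something like this sketch (table and column names invented) - the database hands back one JSON value already shaped for the app:

    # Let Postgres assemble the JSON; psycopg2 decodes jsonb to Python objects,
    # so the app never touches row tuples. Table/column names are invented.
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT jsonb_agg(jsonb_build_object('id', p.id, 'title', p.title))
            FROM posts p
            WHERE p.author_id = %s
        """, (42,))
        posts = cur.fetchone()[0]  # already a list of dicts, not rows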
I would say that development of JSON fields in Postgres and MySQL was accelerated by the adoption of Mongo.
Speed to market/first version using JSON stores is attractive, especially when you're still prototyping your product and won't have an idea of exact data structures until there's been some real world usage.
Standard rules about performance optimization apply. Denormalization can improve performance, but it can just as easily harm it.
For code, I think by now we all understand that you should always start with clean, well-factored code, and then optimize only as much as is necessary, which is usually not at all, and always under profiler guidance. It's the same with DBs: You start with a clean, well-normalized schema, and then de-normalize only as much as is necessary, which is usually not at all, and always under profiler guidance.
Also, keep in mind that improvements in compiler technology over time mean that the performance tricks of old can be useless or even actively harmful nowadays. This is true of SQL every bit as much as C.
Denormalization as a way of dealing with performance issues is like a guillotine as a cure for a headache. And more often than not the headache is still there even after that.
It is true that for some narrow class of analytical workloads 20-25 years ago (behold BW of 199x), denormalized performance was better compared with straight, non-optimized running of the same queries over a normalized schema. Since then, the exponential growth in available RAM and the huge increase in HDD streaming speed with stagnating IOPS (the main mistake in analytical workloads on HDD in the last 10-15 years: using nested loop joins with indexed lookups into the large fact tables :) have made denormalization obsolete and harmful. If anything, the emergence of SSDs and huge RAM moved things even further toward and beyond normalization, by making "super-normalization", i.e. columnar tables, a viable everyday thing.
The idea that joins are slow is a holdover from the bad old days when everyone used MySQL and MySQL sucked at joins. On a more robust DBMS, a normalized schema will often yield better performance than the denormalized one. Less time will be lost to disk I/O thanks to the more compact format, the working set will be smaller so the DBMS can make more efficient use of its cache, and the optimizer will have more degrees of freedom to work with when trying to work out the most efficient way to execute the query.
(edited to add: If you're having performance problems with a normalized schema, the first place to look isn't denormalization, it's the indexing strategy. And also making sure the queries aren't doing anything to prevent proper index usage. Use the Index, Luke! is a great, DB-agnostic resource for understanding how to go about this stuff: https://use-the-index-luke.com )
It's not, it's a strategy to improve the performance of a particular query or access pattern, and is usually the last resort after things like proper indexing, aggregations and materialized views.
JOINS are fast and it all comes down to how much data you're moving. If it's a large table joining to a small set of values, then the joined data is quickly retrieved and applied to the bigger table, with great performance.
If the join is between two large tables where every combination is unique then that's the unique case where the joined table is just adding another hop to get the full set of data for each row and is a perfect candidate for denormalization, although in that case it probably should've been a single table to begin with. Of course there's a spectrum between these two scenarios but it takes a lot before denormalization makes sense on any modern RDBMS.
2. You denormalize in a structured way (eg dimensional modelling), rather than any old how.
3. You test the change.
Database query planners work better when they can take logically-safe shortcuts in their work. In large part that comes down to a properly-constructed schema.
Denormalizing makes it harder for the query planner. It also means you will probably lose out on future query planner enhancements.
Generally speaking, if someone wants to denormalize, I want to know the actual business value created and that the business risk is properly understood.
What really happened is that people wanted something new but did not want to change the way they DESIGN their applications and processes. There is a place for NoSQL, but if you are going to use it as if it were SQL then you would be better served by an SQL database.
Also, I think MongoDB tried to be everything and failed to be good at anything. It offers neither stellar performance nor scalability, and I guess for most projects there is not much advantage over a regular SQL database. Certainly nothing to fight over when there are many more technology choices to make.
Just a few thoughts. A lot of little companies like picking hyped tools because they think it will differentiate them and serve them well later, but most companies never get to that stage and don't really need what is being offered. It seems much safer to take the tried and true standard solution that has worked for decades and just make your UX outstanding, rather than try to put together a "dream team" of new tech...
Another thought: most companies have absolutely no idea how to select technology. A lot of the people who are in a position to decide don't have all that much hands-on experience with the new technology, so they decide based on other factors, like whether their manager would or would not like to see new technology, what other companies are doing, etc.
Another example is "Agile". Everybody is doing it, yet I have still to see a single company that understands what the term means. My current boss is a big promoter of "Agile", which in his language is a synonym for "Scrum". Yet when asked, he has never heard of The Phoenix Project, The Goal, the theory of constraints or basically any theory at all. So what people are doing is fighting fires almost 100% of the time, with not much project work done for the effort and absolutely no improvements. Yet, because everybody complies with daily standups and Jira updates, we are 100% agile.
Nice writeup. I love this quote: "The first few times an engineer sees this kind of hype, they often think it's a structural shift. For engineers later in our career, we’ll often dismiss structural shifts as misplaced hype after getting burned too many times"
I found it a little funny that NoSQL started becoming popular during at least some of the same years that static typing started becoming popular (again).
I don't see it that way -- I feel like NoSQL's rise was more or less coincident with the adoption of Rails, Django, and Node over Java. The surge in interest in new static languages has mirrored the resurgence of Postgres, right around version 9.4 (and JSONB).
I think they're both responses to the same challenges: distributed web enabled applications.
On the server front that means ever increasing complexity with decoupled microservices and latency issues that play nicely with the classic approaches to those domains (static typing, functional programming).
On the data front, sites like HN, Reddit, or Facebook need scalability more than consistency, and have oodles of 'uninteresting' data that jibes nicely with a schemaless document store.
I've noticed that, too. There's an interesting ping-pong effect where a fair number of people have flipped from strong typing at the database layer & dynamic typing at the view layer to the reverse — it seems like someone could write an interesting group psychology paper about how that cycle has repeated over the years.
People realized a lot of the claimed "not only SQL" offerings were actually no SQL at all. It turns out it is nice to have those extra NoSQL features as well as a more traditional RDBMS, instead of just the NoSQL parts.
Also, the RDBMS world is full of some of the oldest scaling experts in the yellow pages. It's interesting how surprised people were that many of the "traditional" RDBMS were able to catch up on some of the scaling support that were the biggest advertised advantages of NoSQL.
It's interesting because I feel like the NoSQL world spent a lot of time reinventing the RDBMS from the "opposite direction". For all its faults, SQL is a fascinating language because it mostly ignores low level details of how the database scales, how it operates under the hood. It's not that SQL is intrinsically hard to scale (certainly at the relational algebra roots it shouldn't be hard, in theory), but it certainly leaves a lot of work to anyone building a database engine to figure out what/when/why/how to scale. I feel like a lot of RDBMS' query analyzers/planners resemble things like HBase a lot more than folks realize.
It's great that NoSQL realized that sometimes those "low level details" in an SQL engine are useful in their own ways, and have increased the spectrum of performance versus power/flexibility trade-off options. But it shouldn't be that big of a surprise that SQL databases remain competitive in that trade-off space, given under the hood they've had to think about a lot of that stuff over many decades.
I really don't get this as an indictment of MongoDB, or of their OpsManager product.
They used the version of OpsManager that doesn't manage the deployment - it is specifically not a deployment manager. Mongo does offer a managed version of this software, which the author mentions, with a justification for why they couldn't use that offering. However, I think this was the main mistake that The Guardian made. As the author notes: "Database management is important and hard – and we'd rather not be doing it ourselves." They underestimated the complexity of managing database infrastructure. If they had been attempting to set up and manage a large-scale redundant PostgreSQL system, they would have spent an enormous engineering effort to do so as well. Using a fully managed solution like PostgreSQL on RDS from the beginning would have saved them time. Comparing such a fully managed solution to an unmanaged one is an inappropriate comparison.
Full disclosure - I used to work at MongoDB. I have my biases and feelings w.r.t. the product & company. In this case I felt that this article didn't actually represent the problem or its source very accurately.
Fair criticisms! It's true if we'd used Mongo Atlas or something similar it would likely have been a different story - often the MongoDB support spent half the time on the phone trying to work out what version of mongo, opsmanager etc. we were running.
Re criticism of OpsManager - I think this is fair, given the sheer number of hoops we had to jump through to get a functioning OpsManager system running in AWS - no provided cloudformation, AMIs etc. £40,000 a year felt like a lot for a system that took 2 or more weeks of dev time to install/upgrade. The authentication schema thing was a bit of a pain as well, though we were going from a very nearly EOL version of Mongo (2.4 I think).
I do like the article, but it sounds like they're in over their heads; this whole (very risky) project could have been avoided if they'd just brought in someone who knew what they were doing.
> Clocks are important – don’t lock down your VPC so much that NTP stops working.
> Automatically generating database indexes on application startup is probably a bad idea.
> Database management is important and hard – and we’d rather not be doing it ourselves.
This is true of any database infrastructure with redundancy/scalability requirements.
What they did was take a technical problem and solve it by buying an off the shelf solution. Which is fine, of course, but I’m a bit surprised by the reaction here on HN.
The pricing of hosted MongoDB solutions is high, especially as volumes increase. Last I checked, disk space doesn't free up when you drop documents or collections, and that adds to the hosted cost unless you find time to fix the issue through manual intervention. This cost is moving us away from MongoDB in the future.
MongoDB will reuse free space when documents or collections are deleted (even though the storageSize stat will remain the same). You can compact the database to release free space to the OS. You can read more about how MongoDB's WiredTiger storage engine reclaims space here: https://docs.mongodb.com/manual/faq/storage/#how-do-i-reclai...
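For example, a minimal pymongo sketch (database and collection names are illustrative; compaction is typically run per collection during a maintenance window):

    # Run the compact command against one collection to return free space to the OS.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    result = client["mydb"].command("compact", "mycollection")
    print(result)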
I work for MongoDB so if you have any questions about storage, feel free to reach out.
Document DBs are like blockchain projects - overhyped and worse than existing solutions for nearly every use case. Why do ostensibly smart engineers keep falling for this stuff?
Well, we're kind of comparing MongoDB of ~2011 (when The Guardian started using them) with Postgres of today.
One major change is that in 2010 you couldn't always run your whole DB in RAM, so there were some real performance benefits with MongoDB.
Another difference is that MongoDB was early with great JSON support, something that Postgres has since gained.
I think there are pros and cons with both. If I had to choose one today, for most tasks I'd probably choose Postgres.
That said, to understand why people made those choices, you need to 'teleport back' to that time and compare them in that year (given the tradeoffs at that time).
If somebody is teleporting back to then, the fact that Mongo had a _global_, instance-level lock for writes should be more than enough to make people run away screaming.
I worked with databases that only had table-level locks (not row-level) and there were more than enough occasions when I cursed the creators.
Instance-level (and indeed DB-level) locking is insanity unless your DB is a read-only DB.
Not saying you're wrong, but a news site and CMS has different database requirements than many other products.
They have very few write actions; in the Guardian's case hundreds of authors publishing a few articles a day, versus millions of users reading data. Writes would be sporadic and batched.
News products also change more rapidly than you might expect. A modern news team may be writing and shipping code to better cover a breaking news event. I can imagine why a document store would be appealing to them.
I don't think Mongo was used much by people who would know why an instance-level lock is bad, or even what it is. And if someone who knew came along later, it was too late - Mongo's proprietary query language creates vendor lock-in.
For one thing, a good chunk of what happens in the industry isn't particularly well described by the term "engineering," which in my mind describes someone who is well-trained and experience-versed in relevant scientific models for the application domain and the dynamics of relevant "materials" and their interaction that can be composed into a solution. It might be the term "craft" better describes a lot of software development activity.
For another, it may be everyone is susceptible to the influence of trends. Even engineers. And the more complex the details of a subdomain of software engineering are, the bigger the tradeoff between becoming someone who can make a true engineering assessment in that niche and developing expertise elsewhere...
It's also largely a fact that we're paid better for following hypes, at least in the UK. Thus a smart "engineer" who also cares about their bank balance pretty much has to use everything new to stay relevant.
Further, most of the use cases are fairly shallow, so it doesn't actually matter that much, and you'll be paid more to paper over the cracks.
We need to change the market, but it doesn't look like the strong and stable engineers are getting the lead roles to pay more strong and stable engineers to build boring tech, despite that actually being better for most businesses.
Probably for the same reason people have a build process to load in 100s of KBs of Javascript in order to render an <h1> tag, and then compile it all down to HTML and declare static sites the wave of the future.
They are not overhyped, definitely not as much as the old-school RDBMS stuff. It's just that most engineers can't make good database choices no matter what database they choose. Or, more generally, they can't make good infrastructure choices, as those are outside their competence and are mostly about things like operations and distributed systems, which take a long time and a lot of experience before you can make good decisions.
You cannot be old-school and hyped; that's why it is called old school. And the people who can't make good database design choices are exactly the kind of people who should be using SQL. Postgres knows how to optimize and plan queries efficiently based on the actual distribution of values in your dataset. Should these poor choosers be doing that... by hand?
the whole promise of mongo is/was distributed (HA+LB), which was all the rage back then, when AWS AZs dropped like flies every few weeks and scaling was seen as the problem. go fast, break things was the mantra.
and it's still not trivial to do pgsql maintenance without downtime, whereas in a clustered/distributed "solution", you can enjoy certain additional freedoms. this includes the freedom to shoot yourself in the foot (data inconsistency, but you had to fight a bit for that state by promoting a non latest slave to master).
> the whole promise of mongo is/was distributed (HA+LB)
Yes. It is so easy to bring up a Mongo cluster and feel like you have HA. Don't worry, it won't be proven wrong until you have writes during a cluster degradation and get unlucky.
And, most importantly, as far as I know the default replica set config is safe (durable, consistent, atomic). It handles a lot of failover scenarios for you automatically, and you have to manually intervene to get into a bad state - which might be okay for that particular business case (better than full downtime).
This is what I'm talking about. You think that some feature that makes performance unpredictable is that important, when it's the last thing you should care about.
What people should care about is the integrity and consistency of the data they're storing. RDBMS do an extremely good job of generalising this problem with reasonable performance, which suits the majority of real world problems. The ideas behind it are understood by most of the industry.
It should be the default choice (or plain filesystem storage), unless you have a specific requirement for something different. In the latter case, this should be people who are well informed about the database choices they're making.
Unpredictable performance would mean you can't predict how long a specific query will take. That is the opposite of how Postgres works: if a query is slow, it's predictably slow as long as the data doesn't change. So what you must mean is that when the data changes, it doesn't behave the way you expect.
So again, how do these poor choosers make better decisions in the face of changing data when they have to write all the algorithms by hand, and understand how to scale the data correctly? Seems to me that pretty soon they'd just be writing a very crappy database on top of some keystore.
The only way for query performance to be truly predictable is for your data to be static. If that's not your situation then a query optimizer is incredibly helpful because it means that you don't have to rewrite your queries every time you add an index.
It also means that as your data grows or shrinks, the optimizer will notice and change the plan accordingly if it makes sense.
That means that 18 months after you implemented a feature you've long since forgotten about, you don't have to come back and figure out what the new plan should be. And that's huuuuge.
You can find accounts of the db changing the plan from a good one to a bad one, but I'll go out on not that far of a limb and say that those are < 1% of the cases. Nobody complains about the queries they didn't have to come back and change. And the better the optimizer, the better that trade off will be.
> Why do ostensibly smart engineers keep falling for this stuff?
Review-driven design is likely a large contributor. If you know you will be looking for a new job in 2 years, you'd better get some experience in $latestTrend instead of $provenTechnology.
I never got around to using Mongo as my main doc db because it was incredibly hard to find a management tool.
I now use the JSON support functions in SQL Server and do not need a different type of database. SQL Server handles my small 'document db' implementation with the infrastructure of an RDBMS. Win-win for me.
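For anyone curious what that looks like in practice, here is a minimal sketch (SQL Server 2016+), with hypothetical table and column names rather than the poster's actual setup: an ISJSON check keeps the column valid, and a computed column makes one JSON field indexable.

    -- Hedged sketch: small JSON documents in SQL Server, with one field indexed.
    -- "Documents", "Doc" and "DocType" are made-up names for illustration.
    CREATE TABLE dbo.Documents (
        Id      INT IDENTITY PRIMARY KEY,
        Doc     NVARCHAR(MAX) NOT NULL CHECK (ISJSON(Doc) = 1),          -- only valid JSON
        DocType AS CAST(JSON_VALUE(Doc, '$.type') AS NVARCHAR(100))      -- computed from the JSON
    );

    CREATE INDEX IX_Documents_DocType ON dbo.Documents (DocType);

    -- Query via the computed column so the index can be used
    SELECT Id FROM dbo.Documents WHERE DocType = 'invoice';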
To me Mongo just got popular by mistake way too early. It's like having a celebrity retweet your post because they liked what they saw at the time, exposing you to the world where everyone now thinks you have something important to say. Not surprisingly, you don't!
Why? Seriously, JSON is lame; it's an idiotic trend, a fashion from the UI that's infected the back end for no gain whatsoever for anyone not using JavaScript. XML is better, and we've just watched JSON slowly reinvent everything XML already had.
My guess is the grandparent is (mis-)remembering the kerfuffle[0] around mongo shipping a copy of PG as a "BI Connector". But yeah, the timeline is off, that was in 2015.
Actually, no. I'm remembering asking myself how Mongo worked back when it launched, and reading somewhere that it was a heavily modded Postgres under the hood (like Amazon Redshift, which makes no secret about it, since).
I might be misremembering, though. And there's a possibility that what I read then was inaccurate.
I might admittedly be remembering this wrong, but I seem to recollect that Postgres had a couple of (unofficial) contrib modules to support JSON back in 2008 or so, and that (unfortunately buried in Postgres vs Mongo articles) the initial release of Mongo was actually a fork of Postgres for all practical intents.
The relevant data structures and indexing mechanisms (hstore, tsvector, gin, gist) were all available -- if only experimentally -- before Mongo launched.
I have seen so many critical articles of Mongodb. Even if I did have a use-case particularly well-suited for it, I would never consider using Mongo.
As an aside, I did consider using SQL Server until I looked at the licensing fees. Why would someone choose SQL Server when options like Postgres or MySQL/MariaDB exist? Is there a specific feature that MSSQL or Oracle provides which is not available elsewhere and would be a core feature the company's data storage is built upon? I.e. a feature so important that the company's product architecture would be fundamentally different without the proprietary database.
SSRS is a big draw. There are other and probably better reporting tools, but the integration and support from MS is worth something to most companies. There's also simply the network effect and marketing at play, and MS has always backed that up with results. The money involved really isn't prohibitive for most small to mid-size companies that I've worked with. I've only seen the opposite: companies leaving their expensive UNIX licenses to move to an all-MS stack to save money. I have never seen large companies try to reduce their technology costs to zero with pure open source, because I think they intuitively know that you either pay with your licenses or you pay by maintaining a lot of expensive expertise, which also needs to be properly managed and effective. In the discussion of free-as-in-freedom vs free-as-in-beer, most companies' core business is not technology, and they know there's no truly free beer and don't really care about the freedom. They want everything I've already mentioned, plus the short- and long-term support the license buys.
Outside of my professional career, with my personal projects I use SQLite, but if I were to build out something that intended to be larger scale business venture, I'd probably go with Azure SQL Database for multiple reasons. A large one being MS's overall integration, including their full control of the stack and CI/CD with .Net->Azure DevOps->github->Azure Pipelines->Azure PaaS/SQL Database. I admire what they're building, but a lot of people work outside of the MS realm. Companies don't tend to have a tech stack bone to pick as we do on HN, and many are already using part of the MS stack.
In sum, there are just a lot of business realities at play. I wouldn't personally run out and buy a SQL Server license for what I do at home or with my (very) small side business, either.
Integration. It is all about the integration. The M$ platform is so tightly coupled that many companies see more value in spending on that tight integration than in spending on 25 different engineers to manage the 40 vendors that may be needed to run some of these applications and platforms.
Dealing with one vendor, M$, is better than dealing with 20. You pay for one support package and get all the benefits that come with one platform.
As much as I love open source, I do see the benefit of working with one vendor on all of your stack needs.
M$'s reporting systems - SSRS, SSIS and others - are probably their bread and butter in the DB space. Very few people want to spend 60-70 hours a week building complex command-line reports when SSRS and SSIS come with a nice GUI that helps you build them. Sure, there are others that do the same, but most require a third-party vendor to add the functionality. Take Apache, for example: with Java and Apache I've had to deal with literally 15 to 20 different vendors just to do the same thing I can do with one vendor. Jenkins, Zookeeper, Camel, Cassandra, Tomcat - all under the banner of Apache, but in reality each governed by its own set of standards. I mean, take a look for yourself: https://www.apache.org/index.html#projects-list
>Why would someone choose SQL Server when options like Postgres or MySQL/MariaDB exist?
Because they're in an organization where they're already invested in Microsoft technologies, so it's much easier to just use MS's DB instead of something entirely different that doesn't integrate as well.
It's similar to why many people use Apple software products that aren't as good as alternatives: if you already have an Apple platform/device, it's easier and better integrated and likely already installed.
My answer is still the same, and actually assumed that. No one is really "greenfield": everyone is already invested in some technology to some extent. If your organization is already invested in MS technologies, your managers only know MS, and your developers only know MS/.NET/etc., then the choice is pretty obvious: MS SQL Server. Switching to Linux/PostgreSQL would be a sea change for an organization like that (even though, IMO, it would be better in the long run).
Similarly, if your organization is already a Linux-based one, running Linux servers for everything, adopting SQL Server would be a huge PITA and would likely be laughed out of the room if suggested.
I really like SQL Server, especially the dev environment, 1st and 3rd party management tools. Not sure if they have improved this, but replication was a limiting factor and the pricing changes a few years back pushed it down my preference list.
Disappointing article. There are no clear reasons mentioned for migrating away from MongoDB. "All these problems" and some issues with Ops Manager are mentioned, where "all these problems" is a couple of outages. As if other database technologies prevent outages. As soon as they experience a couple of outages with their new stack they will migrate to something else, presumably.
Thank you! I was reading through the comments wondering how no one has seen through the bullshit veil purported by the title, followed by a detailed outline of their migration which I think is the "shiny object" to distract people from an unfounded argument.
The main advantage of the migration was getting onto a fully managed DB. What we didn't mention was that there were also huge cost implications - savings of ~£40k - through switching to RDS - compared to paying for a mongo support contract.
1) MongoDB Atlas would work well here: it’s Mongo hosted on a cloud provider of your choosing managed by the people that make it.
2) a DBA, even a noSQL one*, is worth every penny. (I don't mean this to put down noSQL DBAs, just that it's good to have someone dedicated to managing them, and that a DBA versed in the administration and upkeep of a relational system could retrain pretty quickly to help get the most out of a noSQL one)
A few things stood out at me. In no particular order:
- Going with Mongo in the first place cost them dearly. A CMS is a weird application for schemaless - not necessarily wrong, but definitely weird. I wonder if they would benefit from moving to a more structured schema, and I'm willing to bet a lot of the migration complexity comes from that choice in the first place.
- God damn that is a long migration. Holy shit. I know tech isn't their core competency but they do seem to have very competent staff. I've worked in places this glacially slow and I get how it can get this bad but it really strikes me as having no right to be. 10 months to migrate .... from the point where they were ready, until they were done. And somehow integration tests got overlooked during all that.
- Close call with dynamodb. Bit of a wtf on waiting for the feature to be implemented for nine months though. I'm sure they have an account manager with aws... I definitely think blocking an internal process on such a fragile externality as a closed process upstream publishing a feature is the wrong move and a red flag. Their migration path would have probably been harder with dynamodb too.
- I feel the pain of their troubleshooting issues on the load testing step. I can completely see how this can happen. That said it also raises a few red flags to me. Letting something as simple as the load testing step get complex enough to require weeks of engineering is ... Eh.
Have more thoughts but I hate typing on mobile. This is a fantastic write up, I love when non-tech companies publish this sort of stuff. And hurrah for postgres.
That depends. If the CMS is managing documents that have flexible structures that are tree-like, it might not be such a horrible idea to model that structure in a document instead of relationally.
Or to put that in context of a basic data structure:
C = Actual content to be published
MS = User accounts, user permissions, user authentication options, user authentication logging, meta change logs (user X removed tag Y at time Z), behaviour logs (user X viewed revision Y of article Z at time A)... and that's just scratching the surface of a very, very basic CMS.
I'm being generous and assuming articles, tags, bylines, and attached media (with full change history) is all "C".
I enjoyed the write up. I love these kinds of semi-technical, semi-story-telling, pieces.
I am saddened by many of the comments here though which equate to: "never try anything until you know everything" - sorry but that's just not realistic and it's unfair to the people who - commendably - contribute to these write ups and hold their hands up to mistakes-made, decisions that went badly with hindsight, etc.
Bigger picture: the Guardian appears to be thriving, and is succeeding based on the efforts of the tech team here. So if you read this and come away with a sense of "failure" you're probably missing something important.
It's OK to try new things and fail - just get back up and keep on trying, and use the new wisdom you build!
Plenty of people here like to dish on Mongo and the product seems to have been re-architected a few times since I used it seven years ago. By what metrics can we say the product is one worthy of passing a HN smell test? Passing Jepsen was seemingly not enough. https://www.mongodb.com/jepsen
> By what metrics can we say the product is one worthy of passing a HN smell test?
Common sense and formal education? I'm sorry, I'm aware of how incredibly snarky and arrogant that sounds, but in this case I always struggled to comprehend how MongoDB, or most of "NoSQL" in general, was considered viable to begin with.
"Schemaless" just immediately means that instead of the database keeping consistency, you now essentially have to do all your type and constraint checking in the application; I never understood how that's favorable, and how that remains maintainable in any way. On top of that, NoSQL (and maybe Mongo in particular?) also decided to throw away all guarantees that classical SQL databases have been offering for multiple decades for very good reason.
I think everyone who's had a somewhat theoretical class on databases and learned about, for example, what ACID really means, will have smelled that something doesn't quite add up here.
> "Schemaless" just immediately means that instead of the database keeping consistency, you now essentially have to do all your type and constraint checking in the application...
This is so true. Some component has to maintain integrity of the data. Should it be the application developer? Or the database software team?
I know which one I'd choose, and which one is focused on data integrity and not business logic.
And it's not just who you trust, it's also at what layer you want to enforce your integrity and consistency. Not letting the db do that means that every part of the application that touches the database needs to do that enforcement. Even if you abstract that away into a strict layer directly above the database - which essentially means you've just implemented parts of a database on top of a database - you have to essentially invent and implement the methods and languages to do so yourself. At which point you're pretty much guaranteed to be much worse off than if you had just let the DB do it. This might mean handling unexpected errors if you have a bug in your application logic, but that's usually orders of magnitude better than introducing hard-to-fight inconsistencies.
> This is so true. Some component has to maintain integrity of the data. Should it be the application developer? Or the database software team?
if you're using partitioned (inherited) tables in Postgres, then it's the application developer, as foreign keys on inherited tables don't really work.
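A minimal sketch of that limitation, with hypothetical table names: the foreign key below only checks rows stored directly in the parent table, so a row living in a child (inherited) partition is invisible to it and integrity falls back on the application.

    -- Hedged sketch of inheritance-based partitioning and its FK caveat
    CREATE TABLE events (
        id         bigint PRIMARY KEY,
        created_at date NOT NULL
    );
    CREATE TABLE events_2019 () INHERITS (events);

    CREATE TABLE audit_log (
        event_id bigint REFERENCES events (id)   -- only "sees" rows in events itself
    );

    INSERT INTO events_2019 (id, created_at) VALUES (1, '2019-01-02');
    -- Fails: the FK check misses the child row, so the app must enforce the link
    INSERT INTO audit_log (event_id) VALUES (1);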
I also don't see the benefit of schemaless. Schemas have been way too useful in my programming career.
Eventual consistency may be a viable trade-off to help you scale, but you probably need to be at massive scale before it's worth moving away from traditional SQL, which can have read replication while the app adds caching on top. Still, there is probably some point at which eventually consistent nodes make sense.
In a sense that is what a CDN is, and we like to use those.
Are you reading the detailed reports prepared by the Jepsen audit team, or the press release? Jepsen audits are public.
The summary of the latest MongoDB report [1] follows,
"In this Jepsen report, we will verify that MongoDB 3.6.4’s sharded clusters offer comparable safety to non-sharded deployments. We’ll also discuss MongoDB’s new support for causal consistency (CC) in version 3.6.4 and 4.0.0-rc1, and show that sessions prevent anomalies so long as user stick to majority reads and writes. However, with MongoDB’s default consistency levels, CC sessions fail to provide the claimed invariants."
> This interpretation hinges on interpreting successful sub-majority writes as not necessarily successful: rather, a successful response is merely a suggestion that the write has probably occurred, or might later occur, or perhaps will occur, be visible to some clients, then un-occur, or perhaps nothing will happen whatsoever.
> We note that this remains MongoDB's default level of write safety.
In my previous startup we used mongo as the main datastore (this was pre Postgres 9.4, before indexable JSONB). It worked for us, and it worked fine. I saw the mob rising up against it, but we have not had any of the problems (maybe we just read the manual and had it configured correctly?).
Today my go-to is postgres of course, but I had to wait until JSONB was out to make the switch.
> Since all our other services are running in AWS, the obvious choice was DynamoDB – Amazon’s NoSQL database offering. Unfortunately at the time Dynamo didn’t support encryption at rest.
Whoa, that was close. I really don't see why anyone would choose DynamoDB as a general purpose data store, unless they enjoy wasting countless hours finding ways around the limitations it imposes about how data should be stored and accessed. At least that was my (admittedly limited) experience. Postgres is a much better choice.
NoSQL databases like DynamoDB or Cassandra do make for good general-purpose data stores, but you have to work within the concepts of NoSQL: typically, use one table with careful choices of partition and sort keys.
It's an epic shift in thinking to go from a schema of 50-60 tables to 1. Almost every dev I've worked with is very new to this.
The #1 sign a dev has no business using NoSQL: they chat about how flexible NoSQL is. Yeah, you can add attributes on the fly, but I've found Dynamo to require much more careful planning than RDBMSes. Most devs I've met can understand when they need to use an index. But almost all of them have issues predicting whether their changes will lead to unbalanced requests against partitions.
Anyhow, since you can even connect the PostgreSQL WAL easily to a log stream like Kafka or Kinesis, I'm not sure why you'd ever start with a NoSQL DB, unless you just had master NoSQL data wonks.
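To illustrate the WAL point, here is a hedged sketch using Postgres's built-in logical decoding with the test_decoding plugin; a real pipeline would typically use an output plugin plus a consumer (for example Debezium) to push changes into Kafka or Kinesis. It assumes wal_level = logical, and the slot name is made up.

    -- Hedged sketch: expose the change stream via a logical replication slot
    SELECT * FROM pg_create_logical_replication_slot('cms_changes', 'test_decoding');

    -- Peek at pending changes without consuming them from the slot
    SELECT lsn, xid, data
    FROM pg_logical_slot_peek_changes('cms_changes', NULL, NULL);

    -- Drop the slot when done, otherwise it retains WAL indefinitely
    SELECT pg_drop_replication_slot('cms_changes');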
It's not a great article tbh - it's well written, but it shows a clear lack of knowledge about running a backend. The title should be "we didn't know what we were doing so we switched to a managed DB".
I mean, yeah, who knew that blocking NTP, and therefore letting time drift, would break everything...
For those criticizing MongoDB, Fortnite generates $3B/year and runs on MongoDB, you should tell them it's a mistake and that they should use PG instead.
I don't usually bite for these "X uses Y, so Y must be good" arguments, but I didn't know about Fortnite and MongoDB. A quick google suggests they've had downtime due to issues with Mongo and have had problems scaling it, though.
I thought the article was well written but I have to agree about the MongoDB use case by Epic. Mongo has its place and it is mature enough that it can handle itself in a production setting.
That being said they had a massive outage due to a MongoDB issue.
Yeah except not every use case is the same. Treating every use case as the same shows, let's see what was it, 'clear lack of knowledge running a backend'.
Having a successful product doesn't mean that all technical decisions that were involved in making that product were successful. It raises the bayesian estimate that they were, of course - but not to an absolute boolean value.
Can you give an example of tooling that is superior, or not as good, in PostgreSQL compared to MySQL? Serious question. I'm curious because I mostly use PostgreSQL.
1. Stop. Trying. To. Build. Your. Own. Cloud. A pizza shop doesn't build their own cars to deliver pizzas.
2. There's no such thing as hassle-free anything, unless you are paying someone else to deal with the hassle. Sales teams lie.
3. Justifying an untested idea with "but it's modern technology" is going to backfire. Follow established patterns with good track records.
4. Writing your own in-house behemoth product, of which there are already many kinds available, results in long-term expensive engineering projects necessary to get around the high costs you didn't know were coming.
5. Don't write business logic, or your primary software product, in a way that talks directly to a database. Just... No.
6. Magic new technologies that remove the problems of old technology also introduce the problems of new technology.
1. Well, they probably own the car though, which is a better analogy. You don't need to rent your car; you can simply purchase it, just as you can purchase servers. People build their own garages.
4. So you should only use already-written software? Sometimes it is just faster and better to write it yourself. You get fewer dependencies, you know how the whole thing works, etc.
I'm pretty sure he wants to say that you should abstract your database away so that your business code / domain model doesn't depend on it. It becomes a "plugin" to your application and you can easily switch it just by writing another implementation.
I couldn't disagree more, at least in the context of this article. If you have an abstraction layer so high level that app developers can't tell whether they're using MongoDB or PostgreSQL, then they're not able to use any of the advantages of either of those systems.
Sure, use an ORM to abstract away the differences between PostgreSQL and MySQL (up until you need to care about them). That's reasonable. But maintaining a magical MaybeSQL layer that's powerful enough to not totally suck is going to totally suck.
Why wouldn't you be able to take advantage of those systems? I think it's quite the opposite where you can take the advantages without even knowing about it / affecting other parts of the system.
Because of how different they are, taking advantage of them requires using them in specific ways that are also very different. If you have an abstraction layer that hides those differences, it's practically a given that it does so with the lowest common denominator approach, where you get all the flaws of both and none of their unique benefits.
(Alternatively, people find ways to use the abstraction layer such that it produces the desired usage pattern. Of course, then the code is no longer truly portable to another DB, because that same pattern is likely to be a perf issue there.)
And the implementation can implement it in a specific way to take advantage of the chosen technology. If you have some esoteric use cases, they could be handled in a special way, separate from the business code.
For that specific use case, maybe. Where it falls apart is in `searchUser`, where the methods (and performance characteristics) of digging through the respective databases are going to be radically different. In a newspaper's implementation, you're going to have to search by date, subject, keyword, body text string, reporter, etc. etc. In MongoDB, that generally involves creating an index on the combination of fields you'll be searching together. In SQL, that generally looks like adding a `where reporter = "Jane Smith"` predicate. The MongoDB version may be faster if you have an enormous amount of data spread across a cluster. The PostgreSQL version will be more flexible when your boss wants to know how many reporters wrote stories about candied yams within three days of the publication of any story containing the word "Thanksgiving".
Being tasked to come up with an abstraction layer that supports the speed of precomputed, clustered indexes with the flexibility of SQL - if I were in a content creation business and not a database engine writing business - sounds like the kind of project that would make me quit my job and go on a one year silent meditation retreat.
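For what it's worth, the Thanksgiving question above really is just a one-off query on the SQL side. A hedged sketch, assuming a hypothetical articles(reporter, body, published_at) table; the point is only that no precomputed index or schema change is needed to ask it.

    -- Hedged sketch of the ad-hoc "candied yams near Thanksgiving" question
    SELECT COUNT(DISTINCT yams.reporter)
    FROM articles AS yams
    JOIN articles AS thanks
      ON thanks.body ILIKE '%Thanksgiving%'
     AND yams.published_at BETWEEN thanks.published_at - INTERVAL '3 days'
                               AND thanks.published_at + INTERVAL '3 days'
    WHERE yams.body ILIKE '%candied yams%';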
That objection doesn't make sense. Queries are nothing more than a tree of predicates, and how the back end uses those predicates is not relevant to the API for specifying them. Things like indexes, whether in Mongo or in SQL, are implementation details that can easily be hidden and not infect the API. You can interpret the tree of predicates into a SQL where clause or into a Mongo index search.
The OP is correct, your app can speak to an internal API without the underlying database infecting your domain code. That in no way implies you can't take advantage of the best of each database.
This is all rosy in theory. In practice, the way you write the query matters quite a bit. Often even between different SQL implementations.
And it's not just queries. Transactions often have important semantic differences that will be visible on application layer - again, even between different SQL implementations (e.g. MVCC vs locks).
> In practice, the way you write the query matters quite a bit.
Which is hidden in the query interpreter for said db implementation. Each implementation can break down that abstract query into whatever implementation specific query works best in that database.
There's always some abstract way to represent it that doesn't require vendor specific knowledge nor does it remove the ability to apply vendor specific abilities.
Look, I just don't agree with you, I agree with OP. Db specific stuff should be hidden from the domain layer by an abstract query representation and an abstract transaction representation to be plugged in at a later time.
Stuff like "each implementation can break down that abstract query into whatever implementation specific query works best in that database" is wishful thinking. It's like saying that Java is faster than C++, in theory, because JIT can produce better code. And in theory, it can. In practice, we're not there yet. Same thing with high-level database abstractions - they're all either leaky in subtle ways, or they constrain you to extremely basic operations that can be automatically implemented efficiently on everything (but e.g. forget joins).
Several, actually, which is why I know what I'm talking about; I've explored this area extensively. When Fowler first released PEAA I dug in, went nuts, and spent years coding up and exploring all the possible approaches, figuring out which ones I liked and why and which ones I didn't and why.
> Same thing with high-level database abstractions - they're all either leaky in subtle ways, or they constrain you to extremely basic operations that can be automatically implemented efficiently on everything (but e.g. forget joins).
If you're doing joins in your ORM, frankly, you're doing it wrong. Most ORMs do it wrong; they try to replace what a db does best. The right way is to keep joins in the db. The role of an ORM, when used properly, is to map tables and views into objects and to allow querying over those tables and views with an abstract query syntax. Joins belong in a view, not in code. It's called the object-relational impedance mismatch for a reason: you have to draw a line in a reasonable place to get anything to work well, and putting joins into the ORM is crossing that line - which is why most ORMs utterly suck. Joins aren't queries, they're projections; put the queries in the code and the projections into the database. This works perfectly and lets each side do what it does best. Queries are easily abstracted, projections are not; projections don't belong in the ORM.
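A minimal sketch of "joins belong in a view", with hypothetical table names: the ORM maps the view like any other relation, and the projection stays on the database side.

    -- Hedged sketch: the database owns the join, the application just queries it
    CREATE VIEW article_with_author AS
    SELECT a.id,
           a.title,
           a.published_at,
           au.name AS author_name
    FROM articles AS a
    JOIN authors  AS au ON au.id = a.author_id;

    -- The application (or ORM mapping) queries the view like a table
    SELECT id, title, author_name
    FROM article_with_author
    WHERE author_name = 'Jane Smith';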
Any language with named tuples has a type system that is sufficiently expressive to handle joins without any sort of impedance mismatch. So, the only reason to avoid them is exactly the one that I cited earlier - the underlying implementation difference between databases.
> Any language with named tuples has a type system that is sufficiently expressive to handle joins without any sort of impedance mismatch
Incorrect. Named tuples will give you nothing back but a result set; the impedance mismatch refers to the mismatch between result sets and a domain model, and getting tuples back doesn't remotely address this problem. I'd suggest you don't understand what the object-relational impedance mismatch problem actually is.
It's not a type system problem, it's fundamental mismatch between the relational paradigm and the object oriented paradigm. If a domain model has customers and addresses, and you do a relational query that joins the customer table and address table to return only the customer name and address city, the resulting set (name, city) doesn't map to the domain objects and isn't enough data for the domain model to load either of those objects which may contain various business rules. This is what the impedance mismatch refers to, relational projections of new result sets simply do not map to the OO way of doing things. Joins that create new projections are a relational concept that have no place in the object oriented world view: objects don't do joins, and object queries don't return differently shaped objects.
Hacks like partial loading of domain objects are attempts to mitigate the impedance mistmatch, but they do not solve it; they cannot solve what is a fundamental difference between two different ways of seeing data. Data is primary in the relational model and its shape can change on a per query basis, this is incompatible with the object oriented view of the world in which whole objects are primary and data is encapsulated and thus hidden.
The object-relational impedance mismatch does not refer to a language problem; it refers to a difference in paradigm between OO and relational. It exists in every language regardless of the language's abilities, and it's not a problem that can be solved, only mitigated, if you want to use both paradigms. You can solve it by avoiding two paradigms altogether, either by bringing the relational model into the application and not using OO, or by using an object database.
I don't think there's a point in arguing this further. The other poster seems unwilling to understand that there are differences in databases within a category, let alone that there are multiple categories of databases with entirely different characteristics.
It's not a common denominator. If I need a Customer aggregate or a Payment aggregate it doesn't matter if I get it from SQL or Document database as long as I get that aggregate. My code doesn't care about query implementation.
FedEx needs both big, slow, high-volume trucks for moving stuff between cities, and small, nimble trucks for delivering to the doorstep. Someone decides that it's inefficient to maintain two separate standards: any driver should be able to get into any available vehicle and have it Just Work for whatever job they have at hand, right? So they decide to make a single vehicle that can fulfill all roles.
Well, a vehicle that can squeeze down an alley won't have the cargo room of a giant highway truck. The inter-city drivers will hate its poor capacity. Likewise, one that has a big 20-speed transmission for hauling heavy loads is going to drive the city drivers nuts. They're going to end up with one single interface to all possible roadways that everyone can come together and agree to hate.
If the database API is so free-form that you can store anything in it, you won’t get the advantages of PostgreSQL’s strict typing and lightning fast joins. If you make it so regimented that your data model ends up looking like a set of tables with foreign keys, then it won’t be able to make full use of MongoDB’s... whatever it does well.
They’re different animals. Choosing one highly affects the rest of your system design, from how you arrange your data to how you add new data to how you search for it. PostgreSQL and MongoDB have fundamentally different strengths and weaknesses, and if you make something that works equally well with both, it’s inherently going to suck equally on either.
It could be an application frontend interfacing with a web service or network service, where that component handles the database. That can make implementations cleaner when migrating from on-prem to cloud, or if you're ripping out the frontend stack semi-regularly, as more web-oriented devs tend to do. I can't say I disagree, and I'm moving an application to this model myself so that I can easily migrate it from a WinForms frontend to a Blazor SPA at some point.
RE 5: I'd argue that a relational database (barring extreme-scale cases) is the perfect place for your business logic. Constraints, unique indexes, etc. all ensure that your data is always consistent with itself and with the business logic you've encoded.
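A minimal sketch of what that looks like, using hypothetical tables rather than the Guardian's schema: each business rule becomes a constraint the database enforces on every write.

    -- Hedged sketch: business rules expressed as constraints
    CREATE TABLE authors (
        id    bigserial PRIMARY KEY,
        email text NOT NULL UNIQUE                           -- no duplicate accounts
    );

    CREATE TABLE articles (
        id           bigserial PRIMARY KEY,
        author_id    bigint NOT NULL REFERENCES authors (id), -- no orphaned articles
        slug         text NOT NULL UNIQUE,                    -- one URL per article
        state        text NOT NULL DEFAULT 'draft'
                     CHECK (state IN ('draft', 'published')), -- no invalid states
        published_at timestamptz,
        CHECK (state <> 'published' OR published_at IS NOT NULL) -- published implies a date
    );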
I’m not really sure they articulated why they had to switch very clearly. They didn’t like managing Mongo. They said they couldn’t use Mongo’s hosted solution BUT they switch to hosted Postgres. Why not just overcome the limitations preventing them from switching to hosted MongoDB?
They probably mean that they wanted the DB hosted within the private subnet of their VPC. With a traditional hosted offering you’d connect over the public net to a system hosted by the SaaS provider. With RDS they could spin up the DB within their own VPC.
Atlas runs the hosts inside MongoDB's AWS account. We have the same restriction: because of client privacy issues and GDPR compliance we can't let a 3rd party host the data; Amazon is OK because we control the data on the instances.
Client privacy issues should not preclude the use of MongoDB Atlas... Atlas offers encryption of data at rest as well as the ability to manage your own keys.
It looks like they wanted to run Mongo inside their own Amazon account- under their direct control. AWS could do that with Postgres while MongoDB couldn't.
Yep that's what we were doing, and the management software (OpsManager) was also running on EC2 instances. We messed up the VPC configuration so that NTP didn't work on some of the instances - which unsurprisingly broke authentication between OpsManager and the db instances
The way I read it, mongo was hosted the same way postgres was - on AWS VMs. OpsManager wanted to be more like a SaaS.
I'm a little skeptical of the idea that you can do custom software development with a custom database schema and realistically expect to outsource DB management. But sure, you can try. And in any case, you'd hope a largely read-only and document oriented dataset like a newspaper's has a relatively simple schema; without too many crazy schema quirks.
By server management, do you mean the VM? Sure, you can outsource that. What I mean is that what a home-grown application does with a database, and how it uses it, tends to mean that database management (software/schema/optimization/whatever) cannot be application-agnostic. So if you outsource this, you're either effectively hiring a consultant who still needs to learn the details of your specific application, or you should assume it's going to cost you some time to do yourself.
I'm sure you can get advice or buy know-how, but the two are too coupled to think you won't also need to spend some time yourself. (At least, assuming your workload is large enough and complicated enough that naive brute force isn't an attractive option.)
Everything you are saying is true, but when people talk about managed services, for the most part they are referring to someone else managing the VM, the operating system and the server application running on the VM - in this case the database.
Exactly! That's why I was confused - they didn't replace mongodb with colloquially "hosted" postgres; they stayed at the same level of (partial) hosting. I.e., they didn't "switch to hosted postgres", in any normal sense.
Interesting article. Our stack isn't using Mongo, but reading these posts gives me good ideas for how others manage their software lifecycle. I like their use of a pre-production environment as a step between QA/UAT and production. This allows for a clean trial run of anything new deployed to production, as well as greatly reducing deployment-day surprises.
Anecdotes like this are just that: anecdotes. There are no numbers in this article to show a trend away from Mongo; Mongo is actually continuing to gain adoption. See https://db-engines.com/en/ranking
Kudos to The Guardian for publishing this article on their top-level domain. It looked identical to one of their news articles (including ads). Only the content (excellent writeup) was different. Many tech companies silo their engineering teams writings into a “tech” part of their website. If this wasn’t an ad for The Guardian’s technology team, I think it could be. Putting the engineering team’s content in the same article CMS as the news articles is a small decision that reflects highly on The Guardian’s publishing organization.
Have you tried to find it from the home page though?
It's under `/info`, which un-suffixed redirects to `/about`, which I _can_ find from the home page (it's linked in the _footer_) but I can't find this engineering blog from `/about`, or anywhere else I can get to from `/`.
I don't think it's so much a positive choice to use the same system as it is just using (a possibly separate instance of) what they already had, and still squirreling it away in a corner.
It's the same system, just without any tags on it that mix it in with editorial content. I think if we wanted it published somewhere like the technology section we'd have had to go through a much more rigorous editing process (we probably wouldn't have got away with 3000 words!)
This requirement made some sense in a world where a rogue employee might yank your database server out of a rack and walk off with it, but I don't understand why it is still considered relevant in an AWS context. First of all, your data is never really "at rest"; a huge point of Dynamo is that the data is always available (at least 99.999% of "always", anyway). Second, "at rest" where? Literally on a physical disk? Even if there were a single physical disk that held it, I'm willing to bet that even if you were in the right AWS data center with the support of a willing AWS employee, you couldn't find that disk, and even if you could, there'd be no way to remove it. I'd guess it's literally millions of times more likely that you fail to deprecate some AWS API key sitting in clear text on a laptop you sold as surplus than that anybody gets their hands on your at-rest Dynamo data.
> “But postgres isn’t a document store!” I hear you cry. Well, no, it isn’t, but it does have a JSONB column type, with support for indexes on fields within the JSON blob.
Interesting. I didn't know you could make indexes for things /within/ the JSON.
Yep, you can index any expression on a single row in postgres. The downside is that the optimizer will only use the index if the query contains that exact expression, while it's usually smarter with column indexes.
The only limit that I've run into in comparison with NoSQL databases is that you cannot index fields inside arrays within the column (json -> arrayIndex -> field); you need to normalize the array into a separate table.
With a GIN index and a JSONB column you can index the entire json document. It's the same index type as Postgres uses for the full text search, but not super efficient. You can also project out specific subsets of the document to an index and Postgres will use them.
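Pulling the two ideas above together, a hedged sketch with hypothetical names: an expression index on one JSONB field (used only when the query repeats that exact expression) and a GIN index over the whole document for containment queries.

    -- Hedged sketch of JSONB indexing options in Postgres
    CREATE TABLE item (
        id      bigint PRIMARY KEY,
        content jsonb NOT NULL
    );

    -- Expression index: only used when the predicate matches this exact expression
    CREATE INDEX item_type_idx ON item ((content ->> 'type'));
    SELECT id FROM item WHERE content ->> 'type' = 'article';

    -- GIN index over the whole document: supports @> containment queries
    CREATE INDEX item_content_gin ON item USING gin (content);
    SELECT id FROM item WHERE content @> '{"type": "article", "section": "politics"}';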
The syntax and terminology are different, but MySQL has supported indexes on values inside of JSON (or any arbitrary expression/function) since MySQL 5.7, released about three years ago. In MySQL, you can define a "generated column" and index it in InnoDB.
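And the rough MySQL 5.7+ equivalent, again with hypothetical names: a stored generated column extracts the JSON field, and an ordinary secondary index covers it.

    -- Hedged sketch: indexing a JSON field in MySQL via a generated column
    CREATE TABLE item (
        id        BIGINT PRIMARY KEY,
        content   JSON NOT NULL,
        item_type VARCHAR(64) AS (content ->> '$.type') STORED,  -- derived from the JSON
        INDEX item_type_idx (item_type)
    );

    -- Filtering on the generated column lets InnoDB use the index
    SELECT id FROM item WHERE item_type = 'article';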
Why did they actually migrate databases? The real issue is database operations, not the database itself. The data model is simple so anything would've worked fine, and they already have a working system.
The proper solution is to onboard the proper talent and tools or outsource it all. MongoDB does have fully-managed offerings that will automatically migrate and run your database, across cloud-providers if you need, and would've cost less than all the time and effort spent on this. This is just another example of poor technical competency at media publishers.
Did they ever publish the rationale for why they used MongoDB in the first place? They mention having 2.3 million content items. If we assume that they have 100 DB entries for each content item, that's still only 230 million DB items. In that case, was it important to run a sharded cluster vs a typical primary-secondary HA setup? (which they ended up switching to)
I was reading the article waiting to get to the part where they explain how their shards scale. It ended up just reading like a migration from one db to another.
I'm guessing their original choice for MongoDB was more for the schemaless development flexibility rather than horizontal scaling capabilities. This seems to have worked well enough for them. DynamoDB seems like a natural fit coming from a document store and they did very well not to choose it.
I think AWS is due for a managed NewSQL store comparable to Google Cloud Spanner or MS Cosmos DB.
I wondered about that, yeah. That's just Not That Much Data; too, they're clearly able to sustain multiple-minute downtimes (the blog asserts as much). The original design reeks of over-engineering.
Great blogpost, though; well written, which is always a surprise.
Considering they've been using it since 2011 or so, there can't be any good reason beyond a dumb choice by the tech team, or management buying Mongo's sales pitch without proper verification.
Good read. Although I don't quite understand why every few years media companies keep rewriting their content management layer, which generally has no impact on the company's top line or bottom line whatsoever. How can these types of investments be justified when all you are doing is recreating exactly the same APIs with the same data? Am I missing something here?
It's not just media companies rewriting something significant every few years; everyone in the tech industry does it. There's a very good reason for it: it provides interesting and high-paying work for tech employees.
Who wants to go to a company, set up a great system that doesn't need any more work except maintenance, and then just settle down into maintenance mode? No one who wants a rising career. "Maintained system using $oldTech" doesn't look good on a developer's resume, whereas "migrated critical system to $newTech" does.
And it's not just the developers: managers who oversee big teams of developers working on a big new project have more prestige and are paid more than managers who just oversee some boring already-done project that's in maintenance mode.
Now, in this case, there were probably very good genuine reasons for making a big change (I'm no fan of MongoDB), but many times if you look closely with a critical eye, you'll probably find that people padding their resume and making up a reason for their job to exist is the real reason something is being done.
Busy-work is never interesting, and if it's "high-paying" someone is getting fleeced (and will do what they can so as not to get fleeced in the future). At some point, the constant rewriting will have to stop. In contrast, maintenance work on a well-engineered critical system can be quite interesting in its own right (and it's not like the "maintenance" doesn't involve writing some new stuff on occasion. Nothing is ever truly "done").
Taking part in a big new project to switch to something else can certainly be interesting, if it's a new technology, and even if it's really no better than the previous technology. If nothing else, it's really good for your resume.
Just to expand on this: if I were to take part in a project to move a critical service from Postgres to MongoDB (which I believe to be a worse solution), this would still be a better learning experience for me than just maintaining an existing installation. I'd have to learn about both DBs and write a lot of new code to work with the new DB. That wouldn't happen with a Postgres system that's already set up and working fine. And working on a maintenance project wouldn't make my resume look like someone who can come in, get to work on a big project, and see it through.
>At some point, the constant rewriting will have to stop.
At that point, the developers and managers who pushed for the big new project will have moved on to another company. Remember, moving companies every few years will generally yield a much higher salary than staying at the same company for your entire career.
>In contrast, maintenance work on a well-engineered critical system can be quite interesting in its own right
Sure, but you're not going to get a great salary doing that.
Media companies do not have the best technical leadership or talent, leading to high turnover with junior devs who build with the latest hype to build resumes.
Downvoters, would you explain? I didn't think this was a controversial opinion: Mongo is now under a license that is not listed as Open Source, as defined at https://opensource.org/licenses/alphabetical
If you care about using FOSS - as many companies do - MongoDB is no longer open for consideration.
For an enterprise scenario where the costs of remaining outweigh the costs of migrating, this seems to me the clearest reason for doing so, and also likely why they skimped on their rationale... in all likelihood they appeared to be a nice juicy whale but turned out to be some form of shark.
Because there is no company behind PG, no one is losing money if a FANG company gets your app for free and builds something around it. You should ask Redis, Grafana, Nginx etc. if they're happy about super-large companies doing that kind of thing.
I'm glad that RedHat didn't feel that way when they were sponsoring kernel development, that Netscape didn't panic and lock down Netscape instead of opening it up, and that Sun didn't give Java a terrible license. Or Google and Kubernetes. LinkedIn and Kafka. Airbnb and Airflow. I could keep going like this all day.
Mongo made a lot of money building a product that runs on top of many other projects that were released as Free Software. Now they're upset that other people are building on top of Mongo in the same way.
So Redis, HAProxy, MongoDB, etc. should feel good that nowadays multi-billion-dollar companies take their product, put more people on it than they do internally, sell it, and give them nothing?
Explain to me how a startup of 10-20 people can compete against AWS once AWS grabs what they're working on and turns it into an AWS service.
The changes Mongo and Redis made to their licences were made to protect against those practices.
1. Redis — according to its creator — has always been, is, and will continue to be FOSS. Redis Labs did not create Redis, nor can they re-license it.
2. HAproxy is still GPLv2.
3. Redis Labs's CCL and MongoDB's SSPL are not open source licenses, but their purveyors sure do like to give off the impression that they are open source. If you want to keep your code proprietary, keep it proprietary. Don't pretend to be open source. If you are not okay with others using your work, even making money off of it, as long as they adhere to the rules of the open source license you used to license your work to them, then don't license your work to them under open source licenses - and don't cry foul when they use it under the terms of the license.
It's great that there is no single company behind PostgreSQL in this sense. The model where many companies around PG do their own niche things based on it, with their own business models, cooperating and collaborating to make PG better because it is the base of their business, is rather fruitful and stable in the long term.
>“But postgres isn’t a document store!” I hear you cry. Well, no, it isn’t, but it does have a JSONB column type, with support for indexes on fields within the JSON blob.
> approximately 2.3m content items.
I had a previous project where we did a similar thing (except with HSTORE instead of JSONB) and it exploded rather dramatically (very simple queries took multiple minutes or timed out entirely) after around 30m rows. I hope the Grauniad doesn't run into a similar issue, or at least anticipates it better than we did.
2.3m content items is tiny. So is 30m. You're at least 1-2 orders of magnitude away from something that will start to bother Postgres. Anything before that is likely to be an index or IOPS issue.
On that note, one thing that the article points out is that managing the MongoDB setup was a full time job.....although managing Postgres will probably be as well. It would be easy and understandable to try to make it static and not needing attention, but that is a recipe for bad times.
Having issues accessing Xmillion rows seems like something that would be caught by someone focused on performance of the database full-time.
RDS is really good at not having to be managed tbh, at the guardian's scale. And when shit does happen they get to be able to throw money at AWS's premium support. I definitely think their move makes a ton of sense there.
Aye even so. They're definitely heavier to store, and their indexes can get quite big, though, so disk space / IOPS can be more of an issue. But still 2.3M is a drop in the ocean. (I don't know about HSTORE though…)
At my previous startup I was ingesting Hearthstone games at a rate of 1-2M / day. Before being handed off to permanent storage (s3, redshift etc) a bunch of data would get stored in a JSONB, with 14-day retention. This all ran on a 200GB t2.large instance on RDS, was our smallest instance and never really caused an issue.
1) Why use Scala to write (a relatively simple) internal CMS?
2) Why use a clustered database for 2 million records?
3) Why write your own proxy? (in Akka, none the less)
4) Why would you migrate articles from Mongo to Postgres using a script that runs overnight in screen?
The Guardian is, prima facie, a Wordpress blog. A simpler architecture would be:
1) Any CRUD web framework to build the CMS for reporters to draft their articles (Django, Rails, etc). Any basic RDBMS with read replication will do. Or, ditch the webapp entirely and just make a simple Markdown editor that commits to a git repo, a la Prose or Netlify.
2) When a reporter "publishes" an article, generate HTML for it and push to the CDN network. (I can't easily tell by looking at their HTTP headers, but I assume they're doing this already)
Okay, I'm being a little tongue-in-cheek. It's probably not that simple. But, one has to wonder, when you're serving up 100 million static HTML pages a day, if it really has to be this complicated.
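For what it's worth, the publish step in point 2 can be a handful of lines. A toy sketch (the article model, paths, and the idea of writing straight to a CDN origin directory are all mine, not the Guardian's; a real setup would more likely push to an S3 origin bucket and do proper templating):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

object PublishSketch {
  // Hypothetical article model; a real CMS carries far more metadata than this.
  case class Article(id: String, headline: String, bodyHtml: String)

  // Render a full page. A real system would use a proper template engine here.
  def render(a: Article): String =
    s"""<!doctype html>
       |<html><head><title>${a.headline}</title></head>
       |<body><h1>${a.headline}</h1>${a.bodyHtml}</body></html>""".stripMargin

  // "Publish" = write the rendered page into the directory the CDN origin serves.
  def publish(a: Article, originDir: String): Unit = {
    val target = Paths.get(originDir, s"${a.id}.html")
    Files.createDirectories(target.getParent)
    Files.write(target, render(a).getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
  }

  def main(args: Array[String]): Unit =
    publish(Article("2018/nov/30/example", "Example headline", "<p>Body</p>"),
            originDir = "/var/www/origin")
}
```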
I am not sure why you think that Rails or Django or any other bloated web framework is necessary. The most trivial way to implement a high-traffic website is a simple static content generator (like Jekyll) plus a CDN. There is an incredible amount of CPU wasted on rendering exactly the same content for every request. The content of the articles never (or very rarely) changes, so you can put it in a CDN. CRUD, DMS, RDBMS are all the wrong tools here.
A bloated web framework makes your code simpler because there are many things that you don't need to reimplement by yourself.
The problem is when developers use a framework without understanding it well and:
1) Reimplement in the code features that are already present in the framework.
2) Fight against the framework because their business needs conflict with the conventions chosen by the framework.
3) Fight against the framework because they don't agree with the way the framework solved a particular problem and want to solve it their own way.
In all of these cases, the origin of the issues is not the framework itself.
> The most trivial way to implement a website for high traffic is simple static content generator
Agreed. But in any case, those kinds of sites are not hard to cache either.
You can put a CDN in front of anything, including Wordpress or any framework. No need for a static site, and the homepages and section pages are rather dynamic now with user logins, story feeds, and personalization.
Exactly. I have moved companies from random web framework + random database to static site generators + CDN with a high rate of success too. There is no point in using Rails/Django-style stuff unless you have an extremely good case for it, which the Guardian use case certainly is not.
Genuine question - because I'm currently doing CPR on an old Rails app with Mongo - what do you see as a "good use case" for Rails/Django and similar frameworks?
We used to use them internally for building a system management application that required talking to databases, with an API that nodes could pull information from. This pre-dates system management tools like Chef and Ansible; it was just a tool like that. There are also workflow management tools (for managing Hadoop jobs, for example) that could be written with these. Generally, things where you need to deal with a lot of state (and state changes) from the outside world. I am pretty sure there are other use cases.
I spent 4 years at a large financial news company, where we benefited from many ex graunaids who decided to migrate over the river.
They helped us create the new front end to the website, to much acclaim. However, it was hard for a number of reasons (this is about the financial news company, not the Gruan):
1) The journalists hated change, especially as they couldn't see any benefit. They just want to keep their same interface exactly as it is, bugs and all. They also had an active union.
2) There is 20 years of "micro services" moving data from the CMS, through various things to allow stuff like translations, syndications (very important source of money) data extraction, meta data processing, physical page layout, and many many more. Most of which is done by a legacy ETL framework pushing to and from a solaris FTP server that is old enough to join the army.
3) there is more than one way to enter data into the CMS.
4) The type of article, and the data in said article changed depending on where it came from, and what services nobbled it.
5) looking after the journalist's interface, curating the data, sorting the articles and adding meta data, looking after paying subscribers, and finally the front end, were all different departments that refused to talk to each other.
This meant that, unlike in a rational place, there was no single source of truth for the CMS. It wasn't like you could call up article 342923 and display it; there was no guarantee that it would have all of the required metadata (like whether we were allowed to publish it). Add to that the inter-department rivalry, which meant that for some reason the membership department was allowed to spend 4 years re-writing the same bit of functionality over and over again. (User management and payment gateways are a solved problem, but alas it took the best part of 25 million quid to find that out.)
To answer your questions:
1) because it scales maaaaan, looks good on my CV, I don't want to spend time doing boring work, I want to learn a new tool
2) see 1
3) see 1
4) Because I suspect that they've never seen a working ETL system
To answer your bonus questions:
1) Journalists have unions, changing the editor requires a _boat_ load of training, and it is almost never worth it. Buy over build every time. But yes, it's just text. However, it's the metadata that makes it: who's in the article, what's the subject, is it a lifestyle piece, does it have photos, who owns the copyright for the photos, is the article syndicatable, can we syndicate this particular article, who edited it. Etc, etc, etc. The text entry is the easy bit; it's the parts that make it a real newspaper that are hard.
2) Nope, it's almost certainly never done like that. The article will be given a UUID and dumped into the CMS DB. The front-page generator system will then dynamically pull out articles based on parameters given first by the editors (front-page image, leading headline, etc.); then the related articles might be curated by hand, by keyword/metadata, or by user preference.
Then the advertising and tracking bits have to be injected, which account for 50-70% of the effort.
What is most surprising (well, not really) is the apparent immaturity of Scala tooling (Ammonite) and of Akka-based projects like Akka HTTP and Akka Streams. If you can run into very cryptic issues when using these projects, it makes you reconsider the actual value you get from using a functional, type-safe and on-paper concurrent language.
Their proxy is one example where Rust would give the same capabilities as Scala while being more robust, in the future if that's not already the case today.
If anyone is considering doing something similar today, I'd suggest to take a look at the recent FoundationDB Document layer which speaks MongoDB v3 (https://foundationdb.github.io/fdb-document-layer/).
I haven't tried it yet, and there are some limitations at the moment, but I find it extremely promising.
I agree with the boring-database point; it's most often the right choice.
Elaborating on my comment, though: it would make sense to consider a drop-in replacement for a technology you've already invested in, but been bitten by enough times to think about moving to a different model (operational burden, data loss, etc.).
I'm not sure if "baroque" is a good qualifier for FDB.
It's recent in the open source community, but it has been running in production at many companies (including Apple) for some years. It's based on sound architecture, compared to many others.
This is one of the reasons why building on top of the old API wasn’t an option. There was very little separation of concern in the original API and MongoDB specifics could be found even at the controller level. As a result the task of adding another database type in the existing API was too risky.
This seems like the problem was more related to the API layer code quality rather than where the data was stored.
I worked on a large site that used MongoDB. We had hundreds of thousands of rows in one specific collection that was used as a log. We couldn't even query it without the whole database crashing.
We moved that specific collection to MySQL, no problem and there was virtually no change in the data structure. Both were of course indexed.
Standard disclaimer in database discussions involving NoSQL:
SQL is not RDBMS is not ACID is not NoSQL.
When comparing NoSQL to something you should really compare it to RDBMS which is what most people conflate with "SQL". This conceptual understanding gap is IMO what led to the huge growth and misuse of "NoSQL" for a decade.
“This is one of the reasons why building on top of the old API wasn’t an option. There was very little separation of concern in the original API and MongoDB specifics could be found even at the controller level. As a result the task of adding another database type in the existing API was too risky.“
This bit sounds like a cautionary tale about the importance of a layered and well-structured API. There's no reason why details about your data store should leak into an API controller.
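For concreteness, the separation being described is roughly this, sketched in Scala since that's their stack. The trait, class names, and table layout here are mine (not the Guardian's code):

```scala
import java.sql.DriverManager

// The controller only ever sees this interface.
trait ContentStore {
  def find(id: String): Option[String]      // returns the raw JSON document
  def save(id: String, json: String): Unit
}

// One implementation. A MongoContentStore could sit behind the same trait,
// which is what makes swapping databases a contained change.
class PostgresContentStore(url: String, user: String, pass: String) extends ContentStore {
  private val conn = DriverManager.getConnection(url, user, pass)

  override def find(id: String): Option[String] = {
    val ps = conn.prepareStatement("SELECT data FROM content WHERE id = ?")
    ps.setString(1, id)
    val rs = ps.executeQuery()
    if (rs.next()) Some(rs.getString("data")) else None
  }

  override def save(id: String, json: String): Unit = {
    val ps = conn.prepareStatement(
      """INSERT INTO content (id, data) VALUES (?, ?::jsonb)
        |ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data""".stripMargin)
    ps.setString(1, id)
    ps.setString(2, json)
    ps.executeUpdate()
  }
}

// A controller written against the trait has no idea which database it is using.
class ContentController(store: ContentStore) {
  def get(id: String): Option[String] = store.find(id)
}
```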
It seems like the use case of a news agency would be better served by storing metadata in an RDBMS and using something like HDFS or another file system for the actual articles.
>Since migrating to AWS we’ve had two significant outages due to database problems, each preventing publication on theguardian.com for at least an hour.
This happened to us numerous times before and after we migrated to their Atlas platform. We are a major streaming company with more than 120M registered users. Disclosure: we are moving away from MongoDB as well.
I find this especially troubling, just like abandoning Edge browser development. Steering away from Mongo for the sole reason that it's not in AWS sounds like bad advice in general. It also sounds like a false dichotomy: it implies there are no other options, but the article didn't even mention MongoDB Atlas, for example.
It would be interesting to share the difference in ops cost (yearly support contract aside). For example, do you need a beefier server now? More servers? Fewer? What about size on disk and IOPS? Also, any difference in your metrics from an end-user perspective, e.g. time to publish an article?
Why would you delete the MongoDB instances so soon after the migration? What if things look great for a few hours/days, but then go horribly wrong? I suppose they still had Mongo backups too? I'm just unsure why you wouldn't leave those around for a while.
- They'd been running the two systems in parallel and comparing their outputs for months. By the time they pulled the plug, they were confident that PostgreSQL was working fine.
- They had written systems to copy data from the MongoDB-backed service to the one running on PostgreSQL. Once they started treating PostgreSQL as the official store of record, they'd be faced with either writing another system to mirror PostgreSQL back to MongoDB or committing to lots of double entry. Either of those sounds painful.
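A toy version of that comparison idea, just to show the shape of it. The two fetchers here are in-memory stand-ins of my own; real ones would call the Mongo driver and JDBC respectively:

```scala
object MigrationCheck {
  // In-memory stand-ins: in a real check these lookups would go to the Mongo
  // replica set and the Postgres instance and return the raw JSON document.
  private val mongoStore = Map(
    "article-1" -> """{"headline":"A"}""")
  private val postgresStore = Map(
    "article-1" -> """{"headline":"A"}""",
    "article-2" -> """{"headline":"B"}""")

  def fetchFromMongo(id: String): Option[String]    = mongoStore.get(id)
  def fetchFromPostgres(id: String): Option[String] = postgresStore.get(id)

  // Report ids where the two stores disagree: different content, or present on one side only.
  def mismatches(ids: Seq[String]): Seq[String] =
    ids.filter { id =>
      (fetchFromMongo(id), fetchFromPostgres(id)) match {
        case (Some(a), Some(b)) => a.trim != b.trim // naive; a real check would normalise the JSON first
        case (None, None)       => false
        case _                  => true
      }
    }

  def main(args: Array[String]): Unit =
    mismatches(Seq("article-1", "article-2")).foreach(id => println(s"MISMATCH: $id"))
}
```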
I'm not sure why a database is required at all. Surely the main advantage is to support tables and joins. If you just want a key-value store surely S3 or similar would be simplest. Do they search the data? Surely Cassandra is designed for this. Any ideas?
Cassandra isn't designed for this. Been there, done that.
postgres all the way.
The FT use(d) Cassandra to store their membership details (6 million rows of largely static, read-only, well-structured data). The cluster was massive (12+ nodes in at least two regions, from memory), slow, and impossible to upgrade reliably.
The support from Datastax is shite. Backups are not reliable, imports even less so, and you are beta testing the whole system every time you do a point release.
CMSs have highly structured data. Swallow your pride, map the data and build a proper schema. It's really not that hard.
Yes, Cassandra has a graph layer; no, it's really not worth it. Yes, it has Gremlin; no, you shouldn't need it if you've modelled your data correctly.
Cassandra doesn't have graph support, it's just a sorted, distributed, nested key/value system.
Datastax builds a graph layer on top, or you can use something like JanusGraph, but it's never as good as using a real graph database with a natively designed storage system.
What were the issues? You're the first person I've heard complain about Cassandra for this. Agreed that any database can handle that type of data at that volume.
It's a 50/50 split between terrible design and horrible support.
I've seen Cassandra shine when it comes to write-optimised loads. Pipe a bucketload of data into Gremlin and magic happens.
But that's a specific workload, which is pretty rare, and certainly not suited to a 999:1 read-to-write ratio. That's not Cassandra's fault; that's the fault of the idiot who chose it, and of the boatload of idiots who carried on and added loads of systems that make it much harder to migrate away.
Then we come to support. Datastax is the de facto support provider. They make a lifecycle manager and a backup/restore tool, and push a load of patches into the main codebase. But it is shit:
o Backups fail silently
o The only way to make alerts work (i.e. do an action, rather than create a popup when you log into the ops center) requires work to navigate the API
o It's full of CVEs, which are script-kiddy-able
o Migrating data from backups to new clusters was impossible without a boatload of manual work, and failed 50% of the time
o Restoring from automated backups was impossible until August
o It couldn't use instance profiles on AWS until August
o Upgrading to a point release silently breaks backups, _always_
Basically I spent the first half of this year QA'ing very expensive software. There are some very, very good support people, but there were some terrible ones as well.
Is this all about their enterprise tools? I can't say I've used any of these.
For backups, we found a tool on GitHub to snapshot to S3. It worked fine as far as I know. It was the guys in the office next to me who were handling this, not me, and I never heard of any major issue.
Cassandra is more of a DB than Mongo is. DynamoDB is basically the same, and they wanted a hosted version, so their approach is reasonable.
They definitely want more than a key value store. They at least want date and content indexing, editor, author, etc. I imagine they probably have a few internal requirements as well wrt analytics, tooling and more. S3 doesn't work for all this and another user's suggestion that this is all treatable like static content sounds way off to me. The best you can do is pregenerate some of the HTML.
Great article. Such case studies are way more significant contributions to software engineering as a field than the big pile of whitelabel/fanboy articles we see around highlighting advantages of some fancy stuff.
Very well written article on data migration, going into the process and the details. This will help anyone who is also considering such a move. I guess we will see a shift in direction towards Postgres in 2019.
I think MongoDB has a place, but that place isn't being the database for a big distributed content store for a web site or something like that. Where I do think it has a place is for some of the in-memory data structures of a node running an application or a service. Applications tend to have a lot of data in various structures that it works with while it's running. Sometimes you need this to be persistent and sometimes you need to be able to scale this across a node. MongoDB is fine for this, so I think of it more like a SQLite replacement.
Hi! I'm one of the authors. Good point, we should add something on that. Briefly, our MongoDB setup was running on three r4.xlarge instances, which comes to around $5000/year with storage, plus the cost of OpsManager, which I think was around £40,000 including the support contract.
In Amazon it's harder to compare directly, as the support contract is paid across all of our accounts, but we're spending around $13,000 for a highly available db.r4.xlarge Postgres instance.
Performance-wise, querying without an index is SLOW, basically not worth it, as we end up doing a scan of the entire database. Fortunately we don't need to do this, as we can usually rely on the Guardian Content API for proper searching. The average API (a Scalatra app) response time reaches 150ms at 'peak time'. This isn't a high-performance use case: around 1000 requests/minute at 'peak' time.
My company does bespoke software development for large enterprises. If they're on AWS, having all your services within a VPC you control is table stakes for a lot of large-company engineering and security teams.
At this point I'm happy enough with JSONB that, even as someone who wants to try new things, I'm finding it hard to justify trying out MongoDB.
My SAs seem cautiously willing to try it, but I won't put them through the work of learning how to manage it unless I can find a clear advantage over Postgres+JSONB.
Not sure, but there was Goodbye MySQL, Hello Postgres and Goodbye Postgres, Hello MySQL (same company did it actually) for sure. It is kind of weird how obsessed engineers are with tools, though.
> It is kind of weird how obsessed engineers are with tools, though.
Would you say the same thing about programming languages?
I was actually listening to the Full Stack Radio podcast, and the latest episode talked about the power of moving more into the database. What struck me as a strong idea was that we can't treat databases as equivalent - different database servers have different strengths (for example, some have better JSON support or time evaluation functions).
I think it's particularly telling that you seem to be thinking of the database as a "tool", but if you're like most, you fight religious battles over your language of choice. In my experience, the database IS the application (and in many cases, the business) and is a far more important choice than the implementation language, the cloud provider, etc.
Look, the way I see these things is that we have tools. Do you think that people working the fields would fight over which tractor brand is better? Probably not. Do you think that builders would fight over which hammer brand is better? I don't think so. These fights over minor differences happen mostly in the software industry. Freud wrote a pretty good thesis about it.
Moving more into databases is not something I understand. I've worked the last 10 years as a data engineer, but I much prefer moving logic out of databases to moving it into them. :)
> Do you think that people working the fields would fight over which tractor brand is better?
OMG yes. Having worked in the Midwest, I promise you that farmers absolutely do trash talk each other about John Deere vs New Holland vs International Harvester. Tell a Ford driver that you like his Chevy pickup and prepare to hear about it for the next hour.
> Do you think that builders would fight over which hammer brand is better?
You seriously haven't spent much time around blue collar workers, have you.
Marketing / Evangelism / Outreach. I'm not kidding, and I don't predominantly mean that in a bad way. I think it's something that the postgres community / companies really aren't good at.
On a serious note, there are occasionally situations where I have to take in a lot of data that I know nothing about. More often than not this data is either in JSON or readily convertible to JSON, so loading it into a MongoDB database and poking around in it is a reasonable preliminary step to whatever the more permanent solution should be.
The only database we ever lost data with is Mongo. We had support at the time, but even they could not recover it. We migrated to Postgres the next month and never looked back.
Bullshit. It's shared-everything. If you have a 3-node cluster for redundancy, the 3 nodes keep exactly the same data, with only one node accepting reads and writes. It's literally the opposite of linear scalability.
In a sharded setup, every shard must be a 3-node cluster for redundancy (a replica set), and the 3 nodes keep exactly the same data, with only one node accepting reads and writes.
You have to add capacity (shards) 3 nodes at a time, two-thirds of which sit unused. It's not scalable at all.
It's scalable in the sense that if your total data doesn't fit on a single machine, you can distribute it between shards in Mongo, but you can't in pgsql.
A replica set is for redundancy and availability, but you can use the replicas for reads and so scale your read traffic. pgsql works exactly the same way: you have one master which accepts writes, and read-only slaves/replicas.
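For concreteness, the application-side routing looks the same in both worlds: writes go to the primary endpoint, reads can go to a replica endpoint. A minimal sketch with hypothetical endpoint names (a real app would use a connection pool, and note that replicas can lag slightly behind the primary):

```scala
import java.sql.{Connection, DriverManager}

object ReadWriteSplit {
  // Hypothetical endpoints: one primary that accepts writes, one read-only replica.
  private def primary(): Connection =
    DriverManager.getConnection("jdbc:postgresql://db-primary:5432/cms", "app", "secret")
  private def replica(): Connection =
    DriverManager.getConnection("jdbc:postgresql://db-replica-1:5432/cms", "app", "secret")

  // Writes always go to the primary.
  def save(id: String, json: String): Unit = {
    val ps = primary().prepareStatement(
      "INSERT INTO content (id, data) VALUES (?, ?::jsonb) " +
        "ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data")
    ps.setString(1, id)
    ps.setString(2, json)
    ps.executeUpdate()
  }

  // Reads can fan out to replicas to scale read traffic.
  def find(id: String): Option[String] = {
    val ps = replica().prepareStatement("SELECT data FROM content WHERE id = ?")
    ps.setString(1, id)
    val rs = ps.executeQuery()
    if (rs.next()) Some(rs.getString("data")) else None
  }
}
```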
The article is very well written. Their team leverages great tooling, such as Gor. I love Gor. I used it plus the Mongo oplog when migrating from AWS to DigitalOcean, and eventually had to migrate back to AWS because DigitalOcean couldn't deliver performance as good as AWS's and they throttle CPU (lots of CPU steal).
OpsManager is great for teams that don't have dedicated devops, I think, and it has great dashboards/visualization.
However, running your own MongoDB is very easy, unlike Postgres (unless you use RDS). Even with RDS, you still have downtime when upgrading the db: there is a short window while the Multi-AZ standby is promoted to master and DNS is updated, during which your current primary is not writeable or even unavailable. MongoDB is way easier to operate: you add/remove nodes on the fly and clients auto-discover the network topology. Plus, this link gives great info on tuning it: https://docs.mongodb.com/manual/administration/production-no... (things like running on XFS, mounting with the noatime option, putting the oplog on a high-IOPS volume, etc.)
Performance isn't much of a factor in picking your database nowadays; they all look great in benchmarks. However, try them on your own workload before picking one. Postgres does have its own warts.
Pick a database based on how confident your team is with it and on how well it helps you move fast and deliver business value. Don't follow the hype or silly benchmarks.
Great read. Well done.