Specific technology choices aside, this was an incredible write-up of their migration process: thorough, organized, readable prose about a technical topic. It is helpful to read how other teams handle these kinds of processes in real production systems. Perhaps most refreshing is the description of the choices made for the various infrastructure pieces, because they are reasonable and real-world. Blog posts so often describe building a system from scratch where the latest-and-greatest software can be used at every layer. Here, though, is a more realistic mix: on the one hand, swapping out the DB for an entirely new (and better) one; on the other, finding new tools within their existing primary language to extend the API and proxy.
This was what was particularly interesting to me - that they went to the effort of writing a purely technical article on the particulars of how parts of their environment operate, and to publish it on their platform even when that's not the sort of content they're known for.
> A blog by the Guardian's internal Digital team. We build the Guardian website, mobile apps, Editorial tools, revenue products, support our infrastructure and manage all things tech around the Guardian
Hi! Thanks for your comments. I'm one of the authors of this post. It is the same platform at the moment (just not tagged with editorial tags, so it stays away from the fronts), though sometimes the team that approves non-editorial posts to the site can be concerned about us writing about outages and the like, as it might carry a 'reputational risk'. So we may end up migrating to a different platform in the future so we can publish more quickly - we'll see!
In our era of deplatforming, a publisher publishing on something like Medium seems antithetical, even for a dev team that just wants to get words out. Should you spend some cycles on the dev blog? Probably, but you should also split-test the donation copy, get the data warehouse moved forward for 2019 initiatives, and fix 1200 other issues. Thanks for sharing a great post. I shared it with my team and we all learned a lot.
It's a bit disturbing to me that they seem to be using AWS for confidential editorial work.
> Due to editorial requirements, we needed to run the database cluster and OpsManager on our own infrastructure in AWS rather than using Mongo’s managed database offering.
In a happy world the Guardian wouldn't rely on a company we spend a lot of time reporting on for unethical practices (tax avoidance, worker exploitation, etc.) - but we decided it was the only way to compete. One of the big drivers was a denial of service attack on our datacentre on Boxing Day 2014 - not an experience any of us want to have to deal with again.
>Since all our other services are running in AWS, the obvious choice was DynamoDB – Amazon’s NoSQL database offering. Unfortunately at the time Dynamo didn’t support encryption at rest. After waiting around nine months for this feature to be added, we ended up giving up and looking for something else, ultimately choosing to use Postgres on AWS RDS.
Exactly. As I read the original article, which mentions "encryption-at-rest", there was a voice in my head crying: "No, what they need is E2EE". That would enable the authors to write confidential drafts of the articles, no matter where the data is stored (and AWS would be perfectly fine of course).
Disclaimer: The voice in my head does not come out of nowhere. I am building a product which addresses this: https://github.com/wallix/datapeps-sdk-js is an API/SDK solution for E2EE. A sample app integration is available at https://github.com/wallix/notes (you can switch between the master and datapeps branches to see the changes from the E2EE integration)
In which case they could've just used a separate encryption layer on top of any database, including DynamoDB. The HSM security keys available from all the clouds make this rather simple.
Encryption at rest is still important as it closes off a few attack/loss vectors: mis-disposed hard drives, re-allocated hosts. I'm probably missing a few others.
Sadly we don't trust our security practices anywhere near enough for that! Secret investigations happen in an air-gapped room on computers with their network cards removed, then get moved across to the main CMS when they're ready to publish.
Probably not, no, until they were about to be published. I imagine that the choice between "run an entire data centre ourselves, store everything there" and "use AWS, but keep high sensitivity stories on local machines" is an easy one.
After all, the client computer that connects to the CMS is just as, or more likely to be compromised. I wouldn't be surprised if the coverage (or at least parts of it) were edited on airgapped laptops.
> the choice between "run an entire data centre ourselves, store everything there"
If those were the only two choices, you might be right. But the resources needed for the actual CMS functionality sound modest enough to run independently of the main website.
> the client computer that connects to the CMS is just as, or more likely to be compromised
They're using AWS VPC (Virtual Private Cloud), which isn't open to the world (you use a VPN to bridge the VPC into your internal network) and in which you can spin up dedicated instances that don't share underlying hardware with other AWS customers.
This is pretty much how all Guardian articles are formatted. Some of their regular pieces could be called "blog posts" - Felicity Cloake's cooking series comes to mind.
Guess it makes sense to reuse the platform that already has the templates rather than use another platform and reimplement the design.
Totally agreed. This is pretty much the definitive guide on how to perform a high stakes migration where downtime is absolutely unacceptable. It's extremely tempting, particularly for startups, to simply have a big-bang migration where an old system gets replaced by something else in one shot. I have never, ever seen that approach work out well. The Guardian approach is certainly conservative but it's hard to read that article and conclude anything other than that they did the right thing at every step along the way.
Well done and congratulations to everyone on the team.
Yeah it did take a long time! Part of this though was due to people moving on/off the project a fair bit as other more pressing business needs took priority. We sort of justified the cost due to the expected cost savings from not paying for OpsManager/Mongo support (as in the RDS world support became 'free' as we were already paying for AWS support) - which took the pressure off a bit.
Another team at the Guardian did a similar migration but went for a 'bit by bit' approach - migrating a few bits of the API at a time - which worked out faster, in part because stuff was tested in production more quickly. Our approach with the proxy, whilst imitating production traffic, didn't actually serve Postgres data to users until 'the big switch' - so it wasn't really a continuous delivery migration!
The article mentions several corner cases that weren’t well covered by testing and caused issues later. What sort of test tooling did you use, Scalacheck?
Agreed! I don't think enough engineering orgs appreciate the value of a great narrative on any technical topic.
Part of my duties at work requires me to deal with "large" issues. While a solution to them is usually necessary, and tends to be delivered quickly and at high quality, I've seen the analyses that come after them vary in quality.
Good writeups tend to stick around in people's memories and become company culture, and drive everyone to do better. Bad writeups are forgotten, and thus the lessons learned from them are forgotten as well.
This particular article stands out for me. English is not my first language, and I've spent most of my life dealing with very fundamental technical details, so most of my writeups aren't the best. I'm going to bookmark this one and come back to it to learn how to write accessible technical narratives.
I was actually a little confused by the article - it seems to go up and down in terms of technical depth. It feels like it was written by several people. The hyperlink to “a screen session” was odd as well ... ammonite hyperlink I get but... screen is a pretty ancient tool... people either know it or can find out about it. Like you link to screen but not “ELK” stack?
I like the article but it was a bit hard for me to consume with multiple voices in different parts.
But later versions were rock solid, and I've maintained Mongo installations at many startups and SMEs. Once you set up alerts for disk/memory usage, off you go. Works like a charm 99% of the time.
I'd posit it's more a matter of maturing a new paradigm. There are a lot more edge cases you have to cover as NoSQL becomes more popular for production at scale.
SQL has decades of production maturation behind it, and wider domain knowledge.
I'm sure there's some of that. But a lot of the early problems were a bit more weighted towards poor engineering in general, IIRC. For example, I seem to recall an early problem was truncating large amounts of data on crash occasionally.
I think you're asking the wrong question. The question should be: How did MongoDB become so successful?
IMO, the reason is that newer developers faced the choice of learning SQL or learning to use something with a Javascript API. MongoDB was the natural choice because they excelled at being accessible to devs who were already familiar with Javascript and JSON.
Not only that, their marketing/outreach efforts were also aimed at younger developers. When was the last time you saw a Postgres rep at a college tech event?
I think you'll enjoy the series then; I spent several months investigating and made the same point about JSON and the Javascript-like CLI (plus great Node support, plus savvy marketing). For example:
> 10gen's key contributions to databases — and to our industry — was their laser focus on four critical things: onboarding, usability, libraries and support. For startup teams, these were important factors in choosing MongoDB — and a key reason for its powerful word of mouth.
> I think you're asking the wrong question. The question should be: How did MongoDB become so successful?
Marketing, marketing, and more marketing. Mongo was written by a couple of adtech guys.
> Not only that, their marketing/outreach efforts were also aimed at younger developers. When was the last time you saw a Postgres rep at a college tech event?
I remember being underwhelmed by two things at the one MongoConf I went to earlier this decade:
1.) My immediate boss was an unfathomable creep who was there mostly to pick up women
2.) Mongo was focused on how to work around the problems (e.g. aggregate framework) rather than how to solve them.
I can't recall ever seeing a Postgres rep, but I can recall having worked out a PostGIS bug with a fantastically tight feedback loop. The Postgres documentation and community are nothing short of amazing.
Meanwhile with Mongo I watched as jawdropping bugs languished. IDGAF what the reps say, anyone with even a few years experience should've been able to see through the bullshit that Mongo/10gen was/is selling.
> IMO, the reason is that newer developers faced the choice of learning SQL or learning to use something with a Javascript API.
The thing I dislike about this type of comment – although I now notice yours doesn't explicitly say this – is the implication that devs don't like SQL because they're lazy or stupid. Well, sometimes that is probably true! But there are some tasks where you need to build the query dynamically at run time, and for those tasks MongoDB's usual query API, or especially its aggregation pipeline API, are genuinely better than stitching together fragments of SQL in the form of text strings. Injection attacks and inserting commas (but not trailing commas) come to mind as obvious difficulties. For anyone not familiar, just look at how close to being a native Python API pymongo is:
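Something along these lines (an illustrative sketch, with a made-up collection and fields):

    # Build an aggregation pipeline out of plain Python dicts and lists (pymongo);
    # the "orders" collection and its fields here are made up for illustration.
    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

    pipeline = [
        {"$match": {"status": "shipped"}},            # filter documents
        {"$unwind": "$items"},                        # one output doc per array element
        {"$match": {"items.qty": {"$gt": 10}}},       # filter the sub-records
        {"$group": {"_id": "$customer_id",
                    "total": {"$sum": "$items.qty"}}},
        {"$sort": {"total": -1}},
    ]

    # Because the pipeline is just data, stages can be appended, removed or
    # parameterised at run time - no string stitching, no injection worries.
    for doc in coll.aggregate(pipeline):
        print(doc["_id"], doc["total"])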
Of course you could write an SQL query that does this particular job and is probably clearer. But if you need to compose a bunch of operations arbitrarily at runtime then using dicts and lists like this is clearly better.
Of course pipelines like this will typically be slow as hell because arbitrary queries, by their nature, cannot take advantage of indices. But sometimes that's OK. We do this in one of our products and it works great.
With JSONB and replication enhancements, Postgres is close to wiping out all of MongoDB's advantages. I would love to see a more native-like API like Mongo's aggregation pipeline, even if it's just a wrapper for composing SQL strings. I think that would finish off the job.
Elixir's primary database wrapper, Ecto [0], lets you dynamically build queries at runtime, and also isn't an ORM. Here are two examples directly from the docs:
    # Query all rows in the "users" table, filtering for users whose age is > 18, and selecting their name
    "users"
    |> where([u], u.age > 18)
    |> select([u], u.name)

    # Build a dynamic query fragment based on some parameters
    dynamic = false

    dynamic =
      if params["is_public"] do
        dynamic([p], p.is_public or ^dynamic)
      else
        dynamic
      end

    dynamic =
      if params["allow_reviewers"] do
        dynamic([p, a], a.reviewer == true or ^dynamic)
      else
        dynamic
      end

    from "posts", where: ^dynamic
Across all the different means of interacting with a database I have experience with (from full-fledged ORMs like ActiveRecord, to sprocs in ASP.NET), I've found that it offers the best compromise between providing an ergonomic abstraction over the database, and not hiding all of the nitty-gritty details you need to worry about in order to write performant queries or use database-specific features like triggers or window functions.
My main point, though, is that you don't need to reach for NoSQL if all you need is a way to compose queries without string interpolation.
As I said to a sibling response, this is not a substitute for Mongo's aggregation pipeline unless it can do analogous things with Postgres's JSONB fields. For example, can it unwind an array field, match those subrecords where one field (like a "key") matches a value and another field (like a "value") exceeds an overall value, and then apply this condition to filter the overall rows in the table?
Also, one of the benefits of Mongo's API is that it has excellent native implementations in numerous languages (we already use C++ and Python), so a suggestion to switch language entirely is not really equivalent.
> As I said to a sibling response, this is not a substitute for Mongo's aggregation pipeline
Huh? The aggregation framework is a solution to a mongo-only problem. Most other databases are performant, but Mongo suffers wildly from coarse locking and slow performance putting things into and retrieving things from the javascript VM.
> For example, can it unwind an array field, match those subrecords where one field (like a "key") matches a value and another field (like a "value") exceeds an overall value, and then apply this condition to filter the overall rows in the table?
This sounds suspiciously like a SQL view.
Edit: But if you actually need an array in a cell, Postgres has an array type that's also a first-class citizen with plenty of tooling around it.
The "this" was referring to dynamically building queries (the GP comment by me) in Ecto (the parent comment by QuinnWilton). What you've said is a non-sequitur in the context of this little discussion. My whole original point is that raw SQL isn't right in all situations, and you appear to be arguing that I just use SQL instead.
I can't speak to every ORM or database interface in existence, but ActiveRecord will happily handle Postgres arrays and let you use the built-in array functions without having to write queries by hand. Ecto is less elegant, but you can still finagle some arrays with it.
As far as views are concerned, I don't know what to tell you. Sure, you'll probably have to craft the view itself by hand. The result is that you can then use most abstractions of your choosing on top of it though.
There's also the possibility of using automation to create, update, and manage views. That lets your app be 'dynamic' with regards to new data and new datatypes, but also preserves the performance, debugging, segregation, and maintenance benefits of the underlying DB.
> Across all the different means of interacting with a database I have experience with (from full-fledged ORMs like ActiveRecord, to sprocs in ASP.NET), I've found that it offers the best compromise between providing an ergonomic abstraction over the database, and not hiding all of the nitty-gritty details you need to worry about in order to write performant queries or use database-specific features like triggers or window functions.
Ahh Elixir. My favorite language that really just tries so hard to shoot itself in the foot. I'm currently in the protracted process of trying to upgrade a Phoenix app to the current versions. Currently I'm at the rewrite it in Rust and try out Rocket + Diesel stage.
Diesel is... interesting and makes me long for Ecto (which is often used as an ORM although the model bits got split off into a different project).
Love the downvotes instead of comments. I've walked away from Elixir as the best practice deployment methodology (Distillery) is non-op on FreeBSD[1] and has been for a few months while the Distillery author is mum. All of this despite the vast love that the Elixir community seems to heap on FreeBSD.
Erlang and Elixir have plenty of promise but there simply is no good story for production deployments. Distillery and edeliver approximate capistrano, and that sounds great when it works (although I'd just as soon skip edeliver). But when it doesn't I'd much rather dig into the mess of ruby that is Capistrano than the mess of shell scripts, erlang, and god knows what else goes into a Distillery release.
Elixir is a really interesting language, but Phoenix seems to still be pretty wet behind the ears and very much in flux. Ecto too to a much smaller extent.
1: Some of the distillery scripts can communicate with epmd, some just give up.
Well... you can also use a modern ORM. I think "stitching ... text strings" is definitively not the way to go when interfacing with a SQL database. My go-to ORM is Sequel[1]. I think their API is one of the best I've seen: you can choose to use models, but you can also work directly with "datasets" (tables or views, or queries) and compose them as you like. It's really powerful and simple.
> genuinely better than stitching together fragments of SQL in the form of text strings. Injection attacks and inserting commas (but not trailing commas) come to mind as obvious difficulties.
You're using the Pymongo library as an example. Someone can just as easily use SQLAlchemy and not have to worry about those things.
I'm confused by the implication that someone doing things like the above would be writing in SQL. SQL is a little like assembly language in a game: You may need to drop down to it for some key highly-optimized areas, but you rarely need to directly use it for most tasks. While it's true that you should understand how it works so you don't generate queries that suck performance-wise, the same goes for Mongo's intricacies too.
Every language I know of has great ORMs which do this for whatever SQL flavors people tend to use on that platform. I write things like this all the time, and it gets turned into SQL for Postgres:
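For instance, a sketch of the kind of thing I mean, here with SQLAlchemy (the model and fields are invented):

    # Illustrative SQLAlchemy sketch; the User model and its fields are invented.
    from sqlalchemy import Column, Integer, String, create_engine, select
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        name = Column(String)
        age = Column(Integer)

    engine = create_engine("postgresql+psycopg2://localhost/app")

    # Compose the query as Python objects; the ORM emits parameterised SQL.
    query = select(User.name).where(User.age > 18).order_by(User.name)

    with Session(engine) as session:
        names = session.scalars(query).all()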
When using an ORM correctly (and indeed, the less I'm using any of my own bits of SQL the more this is true) I am also protected against injection attacks.
I'm not saying NoSQL has no value, but I believe it to be the wrong tool for data that lends itself to an RDBMS. If you have a bunch of documents that have deeply nested or inconsistent structures and where it makes no sense that you'd want to query by something other than the primary key, sure, it's a no-brainer to use a NoSQL system. For a CMS, which has been implemented thousands of times in RDBMSs, it is madness though. I cringe at realizing that apparently there are developers out there who have avoided learning SQL entirely in their career out of fear, and as a result have to use Mongo for every application because that's the only thing they know how to do. I'm sure they're out there, but I wouldn't hire one.
The jsonb_array_elements function is roughly similar to Mongo’s $unwind pipeline op. It explodes a JSON array into a set of rows. From there it’s pretty simple aggregates to achieve what you’re looking for.
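Roughly this shape, run from Python with psycopg2 (table, column and field names are made up):

    # jsonb_array_elements explodes a JSONB array into one row per element,
    # much like $unwind; the "orders" table and its fields are made up.
    import psycopg2

    conn = psycopg2.connect("dbname=shop")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT o.id, count(*)
            FROM orders o,
                 jsonb_array_elements(o.items) AS item  -- one row per array element
            WHERE item->>'sku' = %s                     -- match a field on the sub-record
              AND (item->>'qty')::int > %s              -- and a numeric threshold
            GROUP BY o.id
        """, ("ABC-123", 10))
        rows = cur.fetchall()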
I was evaluating Mongo a couple months back to solve roughly the same problems. Eventually discovered Postgres already had what I was looking for.
More the point Postgres has an actual array data type (and has for a while). You don't need to shove everything into a JSON/JSONB blob unless you absolutely cannot have any sort of schema.
Not only arrays, you can, with some limitations, create proper types with field names, if your ORM supports that you should use that over JSONB if it fits.
It was supposed to be clear from the context that this meant:
> Does building queries programmatically with SQLAlchemy do that?
Maybe I'm misreading your comment, but you seem to just be talking about writing queries directly in SQL.
If not, could you give an example/link of how to programmically build a query in SQLAlchemy that dynamically makes use of jsonb_array_elements? It would be hugely useful if I could do that.
I was speaking of SQL, but if you can write it in SQL you can usually map it to SQLAlchemy. If worse comes to worst, you can use text() to drop down to raw SQL for just a portion of the query.
SQLAlchemy's Postgres JSONB type allows subscription, so you can do Model.col['arrayfield']. You can also manually invoke the operator with Model.col.op('->')('arrayfield').
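A small sketch of both forms, assuming a made-up model with a JSONB column named col:

    # Both spellings of the -> operator in SQLAlchemy; the Model class is made up.
    from sqlalchemy import Column, Integer, select
    from sqlalchemy.dialects.postgresql import JSONB
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Model(Base):
        __tablename__ = "models"
        id = Column(Integer, primary_key=True)
        col = Column(JSONB)

    # Subscription renders as: col -> 'arrayfield'
    q1 = select(Model).where(Model.col["arrayfield"].isnot(None))

    # The same operator, invoked explicitly
    q2 = select(Model).where(Model.col.op("->")("arrayfield").isnot(None))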
To add to the pile of responses: in Scala, Slick is a great library that lets you compose SQL queries and fragments of queries quite effectively. (http://slick.lightbend.com/)
At my company we built a UI on top of Slick that lets users of our web app define complex triggers based on dynamic fields and conditions which are translated to type-safe SQL queries.
From my POV the rise of 'NoSQL' some years back was tied into a number of things:
- Misunderstanding by most developers of the relational model (I heard a lot of blathering about 'tabular data', which is missing the point entirely).
- The awkwardness and mismatchiness of object-relational mappers -- and the insistence of most web frameworks on object-oriented modeling.
- The fact that Amazon & Google etc. make/made heavy use of distributed key-value stores with relatively unstructured data in order to scale -- and everyone seemed to think they needed to scale at that level. (Worth pointing out that since then Google & Amazon have been able to roll out data stores that scale but use something closer to the relational model). This despite the fact that many of the hip NoSQL solutions didn't even have a reasonable distribution story.
- Simple trending. NoSQL was cool. Mongo had a 'cool' sheen by nature of the demographic that was working there, the marketing of the company itself.
I remember going to a Mongo meet-up in NYC back in 2010 or so, because some people in the company I was at at the time (ad-tech) were interested in it. We walked away skeptical and convinced it was more cargo-cult than solution.
I'm _very_ glad the pendulum is swinging back and that Postgres (which I've pretty much always been an advocate of in my 15-20 year career) is now seeing something of a surge of use.
I remember a hyperbolic readme or other such txt file for Postgres in the far-away long-ago time when everyone was on Slashdot. The author had written one of the most enthusiastic lovenotes to software I'd ever read, and that includes Stephenson's "In The Beginning Was The Commandline." It was a Thomas Wolfe level of ejaculatory keenness. I'd love to read it again if anyone else knows where I can find the file. So, even if there aren't actual Postgres reps, there are most assuredly evangelists.
Saying "I don't know SQL so I will just use JSON" really misses the point though. SQL is easy. Data is hard. NoSQL products offer to get rid of SQL which includes an implication that SQL itself was the challenge in the first place. The problem then is that you have lost one of the best tools for working with data.
I dunno that SQL is exactly easy, though. It's one thing to say "select statements are essentially identical to Python list comprehensions", but in practice I still have to look up the Venn diagram chart every time I need to join anything, and performance optimization is still a dark art. I'd say SQL is easy in the same way that Git is easy: you can get away with using just 5% of it, but you'll still need to consult an expert to sort things out when things go sideways.
You could solve that by altogether dropping the Venn diagram metaphor when reasoning about joins. This is the number one problem I see with junior devs who have a hard time grokking SQL. If you think about a join as a cartesian product with a filter, where the type of join defines the type of filter, the reasoning is extremely easy.
The hard parts of "SQL" are the hard parts of data. Joins aren't easier in Mongo. The performance optimizations you reference are tuning of a relational database, not SQL itself.
If you want to work with databases a domain specific language like SQL really provides a lot of value in solving these hard data problems.
The idea is, in relational databases, that the vast majority of the time you shouldn't have to do it. Because you're writing your queries in a higher level (nay, functional) language, the query planner can understand a lot more about what you're trying to do and actually choose algorithms and implementations that are appropriate for the shape and size of your data. And in 6 months time when your tables are ten times the size, it is able to automatically make new decisions.
More explicit forms of expressing queries have no hope of being able to do this and any performance optimization you do is appropriate only for now and this current dataset.
> I'd say SQL is easy in the same way that Git is easy: you can get away with using just 5% of it, but you'll still need to consult an expert to sort things out when things go sideways.
Mongo and Javascript don't solve that either. In fact you get additional problems by virtue of not being able to do a variety of joins. For extra points, you're going to need to go well beyond javascript with mongo if you want performance. 10gen invented this whole "aggregation framework" to sidestep the performance penalty that javascript brings to the table.
On the other side, the postgresql documentation is second to none. SQL isn't necessarily easy but the postgres documentation gives you an excellent starting point.
> You make it sound like learning SQL is like learning Assembler
It's not that learning SQL is hard. It's that people are inherently lazy. "Learn another thing on top of the thing it already took me a couple of years to learn? No thanks."
You seem like the kind of person ready and willing to learn the right tool for the job. From my experience a few years ago on an accredited computing course that covered database admin and programming, this attitude is not representative of most software engineering students _unless_ there's a specific assignment that requires particular knowledge.
Cs get degrees. And for plenty of developers out there, knowing one language (not even particularly well) gets jobs.
> It's not that learning SQL is hard. It's that people are inherently lazy. "Learn another thing on top of the thing it already took me a couple of years to learn? No thanks."
And that's a big fat mistake. There are so many ways to shoot yourself in the foot with mongo such that simply knowing the language mongo uses for most of its queries while not actually knowing the particulars of how mongo uses that language… well that's just a road to a world of hurt.
For example, when I first inherited a mongo deployment I noticed the queries were painfully slow. Ah hah says me, let's index some shit. Guess what? Creating an index on a running system with that version of mongo = segfault.
After a bunch of hair pulling I got mongo up and running and got the data indexed. But the map reduce job was STILL running so slowly that we couldn't ingest data from a few tens of sensors in real time. So I made sure to set up queues locally on the sensors to buy myself some time.
Even in my little test environment with nothing else hitting the mongo server, mongod was still completely unable to run its map reduce nonsense in a performant manner. Mongo wisdom was: shard it! wait for our magical aggregation framework! Here's the thing: working at a dinky startup we can't afford to throw hardware at it especially that early in the game. Sharding the damn thing would also bring in mongo's inflexible and somewhat magical and unreliable sharding doohickey.
So I thought back to previous experience with time series data. BTDT with MySQL, you're just trading one awful lock (javascript vm) for another (auto increment). So I set up a test rig with postgres. Bam. I was able to ingest the data around 18x faster.
And that's the thing. Mongo appeals to people who are comfortable with javascript and resistant to learning domain specific knowledge. All that appealing javascript goodness comes with a gigantic cost. If you're blindly following the path of least resistance you're in for a bad time.
P.S. plv8 is a thing, and you can script postgres in javascript if you really wanted to.
I think what happens (and I have this attitude too) is that "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language, and the nuances of the specific dialect, and which of the integration tools will work well with your workflow and pipeline. So while "sure I'll just learn SQL" is great for a personal or school project, when you've got to get something done next week, it's better to take maximal advantage of the tools/skills/workflow that you already have.
IOW, it's not just laziness, it's a kind of professional conservatism. which is partly what gets older engineers stuck in a particular mindset, but it's also a very effective learned skill. The opposite is being a magpie developer, which results in things like MongoDB taking off :)
> I think what happens (and I have this attitude too) is that "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language, and the nuances of the specific dialect, and which of the integration tools will work well with your workflow and pipeline.
You have to do the exact same things with Mongo+JS (e.g. learning when to avoid the JS bits like the plague).
learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language,
SQL is a skill that rewards investment in it 1000x over, in terms of longevity. It has spanned people’s entire careers! What’s the shelf life of the latest JS framework, 18 months at most...
Yes, I know that, and that's why I know and use SQL instead of MongoDB. But that's a very similar reason to why I've resisted learning Rust, and Ruby, and React, and Docker, and Scala, and many more. I know I could learn the utter basics in a weekend, but I also know that those basics are utterly useless in a real-world context, and I would prefer to spend the weekend hacking on my open-source project in Python or C, which I've already invested the years into. And that's how engineers age into irrelevance..
Well, that and SQL has a somewhat undeserved reputation for being easy to learn, but also easy to screw up. Like you write a simple-looking query and it turns out to have O(n^2) complexity and your system ends up bogged down in the database forever.
In practice people who fall into complexity traps are usually asking a lot more of their database engine than any beginner. It's usually not that hard to figure out the approximate cost of a particular query.
> Like you write a simple-looking query and it turns out to have O(n^2) complexity
Or you have a simple fast query with a lovely plan until the database engine decides that because you now have 75 records in the middle table instead of 74, the indexes are suddenly made of lava and now you're table-scanning the big tables and your plan looks like an eldritch horror.
> Not only that, their marketing/outreach efforts were also aimed at younger developers.
I do remember a lot of MongoDB t-shirts, cups and pens around every office I was in around 2011-2013. When I would ask they would tell me that a MongoDB developer flew halfway across the world to give them all a workshop on it.
> The question should be: How did MongoDB become so successful?
The ability to store Algebraic Data Types and values with lists, without the hassle of creating a ton of tables and JOINs. Postgres has since added JSON support, plus there are now things like TimescaleDB, which didn't exist previously.
ORMs have existed for decades so developers can use a SQL database just fine without knowing the language. So it's definitely not this.
It's more likely because Mongo is (a) extremely fast, (b) the easiest database to manage and (c) has a flexible schema, which aligns better with the dynamic languages that are more popular amongst younger developers.
Postgres is faster at JSON than Mongo. Also, the pipeline query strategy of Mongo is terrible to deal with. A schema should not be flexible; now I have to write a bunch of code to handle things that should have been enforced by the database. Postgres is incredibly easy to manage, with actual default security. I know the Mongo tutorial says not to run the default configuration - but then why is it the default configuration? It's so easy to manage that anyone can take it over for ransom.
At "large financial news company" we had a "designed for the CV" tag that applied to stupid architectural decisions (of which there were many)
One of the biggest and most expensive was using Cassandra to store membership details. Something like 4 years of work, by a team of 40, wasted by stupid decisions.
They included:
o Using Cassandra to store 6 million rows of highly structured, mostly readonly data
o hosting it on real tin, expensive tin, in multiple continents (looking at >million quid in costs)
o writing the stack in java, getting bored, re-writing it as a micro service, before actually finishing the original system
o getting bored of writing micro services in java, switching to scala, which only 2/15 devs knew.
o writing mission critical services in elixir, of which only 1 dev knew.
o refusing to use other teams' tools
o refusing to use the company wiki, opting for their own Confluence instance, which barred access to anyone else, including the teams they supported
I think the worst place I ever worked was like that. This was going back quite a number of years now, but it was a startup fired up by one of the lesser MBAs to utilise a legal loophole to slice cash off the public via a web app and a metric ton of marketing to desperate people.
Step one was hire everyone the guy had worked with at his previous company. They were all winforms / excel / SQL / sharepoint / office developers from big finance and had no idea where to go really. None of them had even touched asp.net.
Cue "what's popular". Well that was Ruby on Rails back then on top of MySQL and Linux. 4 people with zero experience pulled this stack in and basically wrote winforms on top of it. Page hits were 5-8 seconds each. Infrastructure was owned by SSH worms and they hadn't even noticed.
I think I lasted two days there before I said "I'm done".
I once saw a CEO who wrote his own Web Framework and forced the entire company to use it.
At the time, under the influence of React, the idea was to "build web applications solely based on Functional Programming". Since after years of trying no one could figure out what that meant, the company ditched the CEO and ended up wasting a couple of years of work.
It's not that you don't write anything in a new thing. It's that you start with small, less critical projects. Get your feet wet, give people a chance to get a feel for the pros and cons, that sort of thing.
Jumping right into writing "mission critical services" in the brand new language that few people at the company know well is asking for trouble.
There were no problems of scale, speed or latency. It was migrating from one terrible system to something that should be smaller, simpler, cheaper and easier to run.
The API is/was supposed to do precisely four things:
o provide an authentication flow for customers
o provide an authentication flow for enterprises to allow SSO
o handle payment info
o count the number of articles read for billing/freemium/premium
That is all. It's a solved problem. Don't innovate, do.
Spend the time instead innovating on the bits that are useful to the business and provide an edge: CRM, pattern recognition and data analytics.
This actually resonates with me, maybe not in the way Ellison intended, as I'm not familiar with the context he said it in. A bit off the main topic, but the more I revert to just using emacs for some task I previously used a .app bundle or web page for, the more I question how much we in the computer industry have just been spinning our wheels for the last 30+ years. I honestly can't really tell what value WIMP-centric GUIs have brought to the table besides fashion, let alone the endless debates, in the form of actual implementations, about the best way to build one. Possibly the best argument ever made died with Mac OS 9.
I suspect that in a year I’ll be using what is effectively an Elisp machine with an Apple logo paired with an iPhone.
Discoverability and self-documentation are generally the advantages of GUI systems. Well-built systems can be understood in a few moments without needing to consult a manual. That's almost impossible in a pure CLI environment.
Of course it's entirely possible to screw this up, modern phone-centric design standards are really bad about it for example, but in general you need to consult the manual (or Google) far less often.
I'm not convinced these are inherent properties of GUIness. We have had GUIs in use for long enough that many elements, even across slightly different GUIs, are familiar; this has allowed conventions and then conventional wisdom to develop.
Early GUI developers also put effort into making their GUIs at least somewhat intuitive, but they also bundled thick books of documentation on how to use their systems.
To a degree, they are self-documenting, but not because of their GUIness so much as their design choices; menus helped a lot with this. Menus actually are a good subject to touch upon: they are definitely one of the better conventions we developed, and they were based upon and analogous to a restaurant menu. However, their very nature as a list of commands you can issue to a GUI does also, typically but not always, limit the application. If the only options available to you are what is on the menu and there is no other interface, then an application is much more limited. This isn't even true of most restaurants, which will often allow you to order something not on the menu if they have the ingredients, equipment and expertise to make it.
But as a convention, it is not limited solely to GUIs, you can incorporate menus into any interactive interface.
I think the real innovation wasn’t GUIs, it was interactive software. The innovation beyond that is scriptable software.
What is important isn’t GUIness, but the developer’s intent. If you develop software with the intent to be self-documenting and discoverable, you will end up with an interface that is both of these things provided you did a competent job of it. A GUI might help, relying on platform conventions might help, and using common cultural conventions might help, but these aren’t the necessary ingredients for those qualities.
Emacs has the quality of self-documentation, but it is emphatically not a GUI even though it is interactive.
I would say well built systems, provided you intend to do anything productive and even slightly complex, should have a manual included, or else it isn’t a well built system.
It isn’t all bad. My laptop is 6, almost 7 years old at this point and it has received a few upgrades in that time. This change in habits does lower the minimum system requirements for its eventual replacement from “runs Mac OS X” to “runs emacs”, but I might be able to stretch its life out a bit longer now.
IMO there is still a place for schema-less document databases. It's just that Postgres's JSON columns mean you can get the best of both worlds, which makes Mongo look weak by comparison.
I would rather say, "Postgres's JSONB provides a hybrid compromise that may meet the needs of many users." JSONB feels like it's closer to creating more complex data types that are less primitive than those that SQL currently allows.
The real driver of the NoSQL movement, I believe, was that everybody wanted to be the next big social network or content aggregation site. Everybody wanted to be the next Facebook, Instagram, Twitter, etc., and that's what people were trying to build. Ginormous sites like these are one of the applications that strongly favor availability/eventual consistency over guaranteed consistency, whereas most other applications are quite the opposite.
Nobody really cares if your Instagram post shows up 10 minutes later in New York than it does in LA, and certainly not if the comments appear in similarly inconsistent order. It's one step above best-effort delivery. However, your bank, hospital, etc. often care quite a bit that their systems always represent reality as of right now and not as of half an hour ago because there's a network problem in Wichita.
The question is, "If my data store isn't sure about the answer it has, what should it do?" RDBMS says, "Error." NoSQL says, "Meh, just return what you have."
> The question is, "If my data store isn't sure about the answer it has, what should it do?" RDBMS says, "Error." NoSQL says, "Meh, just return what you have."
Even that's too simplistic. For most RDBMSes, the answer depends on how you have it configured, and usually isn't "Error". If you're using a serializable transaction isolation level, it usually means, "you might have to wait an extra few milliseconds for your answer, but we'll make sure we get you a good one." Other isolation levels allow varying levels of dirty reads and race conditions, but typically won't flat out fail the query. This is probably the situation most people are working under, since, in the name of performance, very few RDBMSes' default configurations give you full ACID guarantees.
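To make that concrete, here's a sketch of opting into serializable isolation with psycopg2 (connection details and table are made up); the trade-off is that you may need to retry on serialization failures:

    # Opt into SERIALIZABLE isolation; the database may then reject a racy
    # transaction rather than hand back a dubious answer, and we retry it.
    import psycopg2
    from psycopg2 import errors
    from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

    conn = psycopg2.connect("dbname=app")  # made-up connection details
    conn.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)

    for attempt in range(3):
        try:
            with conn, conn.cursor() as cur:
                cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = %s", (1,))
                cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = %s", (2,))
            break
        except errors.SerializationFailure:
            continue  # a conflicting concurrent transaction: try again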
To the "DB in NY knows something different from DB in LA" example, there are RDBMSes such as the nicer versions of MSSQL that allow you to have a geographically distributed database with eventual consistency among nodes. They're admittedly quite expensive, but, given some of the stories I've heard about trying to use many NoSQL offerings at that kind of scale, I wouldn't be surprised if they're still cheaper if you're looking at TCO instead of sticker price.
Many ATMs will still give you money when they're offline, and things become eventually consistent by comparing the ledger.
Shops also generally want to take orders and payments regardless of network availability, so whilst they might generally act as CP systems, they'll be AP in the event of network downtime, but will likely lose access to fraud checks, so may put limits on purchases, etc.
They're probably all CP locally and AP (or switchable from CP) from an entire system perspective.
I like the JSONB support. Not for storing data, but for querying it when stuck with shitty drivers.
Some array magic and jsonb_agg and suddenly you can get easy-to-decode JSON results instead of having to play with rows in your app. Yes, you can also do it with xml_agg, but these days people tend to consider anything XML as evil (they're wrong).
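Something like this sketch (table and column names invented) - the database hands back one JSON value already shaped for the app:

    # Let Postgres assemble the JSON; psycopg2 decodes jsonb to Python objects,
    # so the app never touches row tuples. Table/column names are invented.
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT jsonb_agg(jsonb_build_object('id', p.id, 'title', p.title))
            FROM posts p
            WHERE p.author_id = %s
        """, (42,))
        posts = cur.fetchone()[0]  # already a list of dicts, not rows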
I would say that development of JSON fields in Postgres and MySQL was accelerated by the adoption of Mongo.
Speed to market/first version using JSON stores is attractive, especially when you're still prototyping your product and won't have an idea of exact data structures until there's been some real world usage.
Standard rules about performance optimization apply. Denormalization can improve performance, but it can just as easily harm it.
For code, I think by now we all understand that you should always start with clean, well-factored code, and then optimize only as much as is necessary, which is usually not at all, and always under profiler guidance. It's the same with DBs: You start with a clean, well-normalized schema, and then de-normalize only as much as is necessary, which is usually not at all, and always under profiler guidance.
Also, keep in mind that improvements in compiler technology over time mean that the performance tricks of old can be useless or even actively harmful nowadays. This is true of SQL every bit as much as C.
Denormalization as a way of dealing with performance issues is like a guillotine as a cure for a headache. And more often than not the headache is still there even after that.
It is true that for some narrow class of analytical workloads 20-25 years ago (behold BW of 199x), denormalized performance was better compared with straight, non-optimized running of the same queries over a normalized schema. Since then, the exponential growth in available RAM and the huge increase in HDD streaming speed with stagnating IOPS (the main mistake in analytical workloads on HDD in the last 10-15 years: using nested loop joins with indexed lookups into the large fact tables :) have made denormalization obsolete and harmful. If anything, the emergence of SSDs and huge RAM moved things even further toward and beyond normalization, by making "super-normalization", i.e. columnar tables, a viable everyday thing.
The idea that joins are slow is a holdover from the bad old days when everyone used MySQL and MySQL sucked at joins. On a more robust DBMS, a normalized schema will often yield better performance than the denormalized one. Less time will be lost to disk I/O thanks to the more compact format, the working set will be smaller so the DBMS can make more efficient use of its cache, and the optimizer will have more degrees of freedom to work with when trying to work out the most efficient way to execute the query.
(edited to add: If you're having performance problems with a normalized schema, the first place to look isn't denormalization, it's the indexing strategy. And also making sure the queries aren't doing anything to prevent proper index usage. Use the Index, Luke! is a great, DB-agnostic resource for understanding how to go about this stuff: https://use-the-index-luke.com )
It's not, it's a strategy to improve the performance of a particular query or access pattern, and is usually the last resort after things like proper indexing, aggregations and materialized views.
JOINS are fast and it all comes down to how much data you're moving. If it's a large table joining to a small set of values, then the joined data is quickly retrieved and applied to the bigger table, with great performance.
If the join is between two large tables where every combination is unique then that's the unique case where the joined table is just adding another hop to get the full set of data for each row and is a perfect candidate for denormalization, although in that case it probably should've been a single table to begin with. Of course there's a spectrum between these two scenarios but it takes a lot before denormalization makes sense on any modern RDBMS.
2. You denormalize in a structured way (eg dimensional modelling), rather than any old how.
3. You test the change.
Database query planners work better when they can take logically-safe shortcuts in their work. In large part that comes down to a properly-constructed schema.
Denormalizing makes it harder for the query planner. It also means you will probably lose out on future query planner enhancements.
Generally speaking, if someone wants to denormalize, I want to know the actual business value created and that the business risk is properly understood.
What really happened is that people wanted something new but did not want to change the way they DESIGN their applications and processes. There is a place for NoSQL, but if you are going to use it as if it were SQL then you would be better served by an SQL database.
Also, I think MongoDB tried to be everything and failed to be good at anything. It offers neither stellar performance nor scalability, and I guess for most projects there is not much advantage over a regular SQL database. Certainly nothing to fight over when there are many more technology choices to make.
Just a few thoughts. A lot of little companies like picking hyped tools because they think it will differentiate them and serve them well later, but most companies never get to that stage and don't really need what is being offered. It seems much safer to take the tried and true standard solution that has worked for decades and just make your UX outstanding, rather than try to put together a "dream team" of new tech...
Another thought: most companies have absolutely no idea how to select technology. A lot of the people who are in a position to decide don't have all that much hands-on experience with the new technology, so they decide based on other factors, like whether their manager would or would not like to see new technology, what other companies are doing, etc.
Another example is "Agile". Everybody is doing it, yet I have still to see a single company that understands what the term means. My current boss is a big promoter of "Agile", which in his language is a synonym for "Scrum". Yet when asked, he has never heard of The Phoenix Project, The Goal, the theory of constraints or basically any theory at all. So what people are doing is fighting fires almost 100% of the time, with not much project work done for the effort and absolutely no improvements. Yet, because everybody complies with daily standups and Jira updates, we are 100% agile.
Nice writeup. I love this quote: "The first few times an engineer sees this kind of hype, they often think it's a structural shift. For engineers later in our career, we’ll often dismiss structural shifts as misplaced hype after getting burned too many times"
I found it a little funny that NoSQL started becoming popular during at least some of the same years that static typing started becoming popular (again).
I don't see it that way -- I feel like NoSQL's rise was more or less coincident with the adoption of Rails, Django, and Node over Java. The surge in interest in new static languages has mirrored the resurgence of Postgres, right around version 9.4 (and JSONB).
I think they're both responses to the same challenges: distributed web enabled applications.
On the server front that means ever increasing complexity with decoupled microservices and latency issues that play nicely with the classic approaches to those domains (static typing, functional programming).
On the data front, sites like HN, Reddit, or Facebook need scalability more than consistency, and have oodles of 'uninteresting' data that jibes nicely with a schemaless document store.
I've noticed that, too. There's an interesting ping-pong effect where a fair number of people have flipped from strong typing at the database layer & dynamic typing at the view layer to the reverse — it seems like someone could write an interesting group psychology paper about how that cycle has repeated over the years.
People realized a lot of the claimed "not only SQL" offerings were actually no SQL at all. It turns out it is nice to have those extra NoSQL features as well as a more traditional RDBMS, instead of just the NoSQL parts.
Also, the RDBMS world is full of some of the oldest scaling experts in the yellow pages. It's interesting how surprised people were that many of the "traditional" RDBMS were able to catch up on some of the scaling support that were the biggest advertised advantages of NoSQL.
It's interesting because I feel like the NoSQL world spent a lot of time reinventing the RDBMS from the "opposite direction". For all its faults, SQL is a fascinating language because it mostly ignores low level details of how the database scales, how it operates under the hood. It's not that SQL is intrinsically hard to scale (certainly at the relational algebra roots it shouldn't be hard, in theory), but it certainly leaves a lot of work to anyone building a database engine to figure out what/when/why/how to scale. I feel like a lot of RDBMS' query analyzers/planners resemble things like HBase a lot more than folks realize.
It's great that NoSQL realized that sometimes those "low level details" in an SQL engine are useful in their own ways, and have increased the spectrum of performance versus power/flexibility trade-off options. But it shouldn't be that big of a surprise that SQL databases remain competitive in that trade-off space, given under the hood they've had to think about a lot of that stuff over many decades.
I really don't get this as an indictment of MongoDB, or of their OpsManager product.
They used the version of OpsManager that doesn't manage the deployment - it is specifically not a deployment manager. Mongo does offer a managed version of this software, which the author mentions, with a justification for why they couldn't use that offering. However, I think this was the main mistake that The Guardian made. As the author notes: "Database management is important and hard – and we'd rather not be doing it ourselves." They underestimated the complexity of managing database infrastructure. If they had been attempting to set up and manage a large-scale redundant PostgreSQL system, they would have spent an enormous engineering effort to do so as well. Using a fully managed solution like PostgreSQL on RDS from the beginning would have saved them time. Comparing such a fully managed solution to an unmanaged one is an inappropriate comparison.
Full disclosure - I used to work at MongoDB. I have my biases and feelings w.r.t. the product & company. In this case I felt that this article didn't actually represent the problem or its source very accurately.
Fair criticisms! It's true if we'd used Mongo Atlas or something similar it would likely have been a different story - often the MongoDB support spent half the time on the phone trying to work out what version of mongo, opsmanager etc. we were running.
Re criticism of OpsManager - I think this is fair, given the sheer number of hoops we had to jump through to get a functioning OpsManager system running in AWS - no provided cloudformation, AMIs etc. £40,000 a year felt like a lot for a system that took 2 or more weeks of dev time to install/upgrade. The authentication schema thing was a bit of a pain as well, though we were going from a very nearly EOL version of Mongo (2.4 I think).
I do like the article, but it sounds like they're in over their heads; this whole (very risky) project could have been avoided if they'd just brought in someone who knew what they were doing.
> Clocks are important – don’t lock down your VPC so much that NTP stops working.
> Automatically generating database indexes on application startup is probably a bad idea.
> Database management is important and hard – and we’d rather not be doing it ourselves.
This is true of any database infrastructure with redundancy/scalability requirements.
What they did was take a technical problem and solve it by buying an off the shelf solution. Which is fine, of course, but I’m a bit surprised by the reaction here on HN.
The pricing of hosted MongoDB solutions is high, especially as volumes increase. Last I checked, disk space doesn't free up when you drop documents or collections, and that adds to the hosted cost unless you find time to fix the issue through manual intervention. This cost is moving us away from MongoDB in the future.
MongoDB will reuse free space when documents or collections are deleted (even though the storageSize stat will remain the same). You can compact the database to release free space to the OS. You can read more about how MongoDB's WiredTiger storage engine reclaims space here: https://docs.mongodb.com/manual/faq/storage/#how-do-i-reclai...
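For example, a minimal pymongo sketch (database and collection names are illustrative; compaction is typically run per collection during a maintenance window):

    # Run the compact command against one collection to return free space to the OS.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    result = client["mydb"].command("compact", "mycollection")
    print(result)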
I work for MongoDB so if you have any questions about storage, feel free to reach out.
Document DBs are like blockchain projects - overhyped and worse than existing solutions for nearly every use case. Why do ostensibly smart engineers keep falling for this stuff?
Well, we're kind of comparing MongoDB of ~2011 (when The Guardian started using them) with Postgres of today.
One major change is that in 2010 you couldn't always run your whole DB in RAM, so there were some real performance benefits with MongoDB.
Another difference is that MongoDB was early with great JSON support, something that Postgres has since gained.
I think there are pros and cons with both. If I had to choose one today, for most tasks I'd probably choose Postgres.
That said, to understand why people made those choices, you need to 'teleport back' to that time and compare them in that year (given the tradeoffs at that time).
If somebody is teleporting back to then, the fact that Mongo had a _global_, instance-level lock for writes should be more than enough to make people run away screaming.
I worked with databases that only had table-level locks (not row-level) and there were more than enough occasions when I cursed the creators.
Instance-level (and indeed DB-level) locking is insanity unless your DB is a read-only DB.
Not saying you're wrong, but a news site and CMS has different database requirements than many other products.
They have very few write actions; in the Guardian's case hundreds of authors publishing a few articles a day, versus millions of users reading data. Writes would be sporadic and batched.
News products also change more rapidly than you might expect. A modern news team may be writing and shipping code to better cover a breaking news event. I can imagine why a document store would be appealing to them.
I don't think Mongo was used much by people who would know why an instance-level lock is bad, or even what it is. And if someone who knew came along later, it was too late - Mongo's proprietary query language creates vendor lock-in.
For one thing, a good chunk of what happens in the industry isn't particularly well described by the term "engineering," which in my mind describes someone who is well-trained and experience-versed in relevant scientific models for the application domain and the dynamics of relevant "materials" and their interaction that can be composed into a solution. It might be the term "craft" better describes a lot of software development activity.
For another, it may be everyone is susceptible to the influence of trends. Even engineers. And the more complex the details of a subdomain of software engineering are, the bigger the tradeoff between becoming someone who can make a true engineering assessment in that niche and developing expertise elsewhere...
It's also largely a fact that we're paid better for following hypes, at least in the UK. Thus a smart "engineer" who also cares about their bank balance pretty much has to use everything new to stay relevant.
Further, most of the use cases are fairly shallow, so it doesn't actually matter that much, and you'll be paid more to paper over the cracks.
We need to change the market, but it doesn't look like the strong and stable engineers are getting the lead roles to pay more strong and stable engineers to build boring tech, despite that actually being better for most businesses.
Probably for the same reason people have a build process to load in 100s of KBs of Javascript in order to render an <h1> tag, and then compile it all down to HTML and declare static sites the wave of the future.
They are not overhyped, definitely not as much as the old-school RDBMS stuff. It's just that most engineers can't make good database choices no matter what database they choose. Or, more generally, they can't make good infrastructure choices, as those are outside their competence and are mostly about things like operations and distributed systems, which take a long time and a lot of experience before you can make good decisions.
You cannot be old-school and hyped; that's why it is called old school. And the people who can't make good database design choices are exactly the kind of people who should be using SQL. Postgres knows how to optimize and plan queries efficiently based on the actual distribution of values in your dataset. Should these poor choosers be doing that... by hand?
the whole promise of mongo is/was distributed (HA+LB), which was all the rage back then, when AWS AZs dropped like flies every few weeks and scaling was seen as the problem. go fast, break things was the mantra.
and it's still not trivial to do pgsql maintenance without downtime, whereas in a clustered/distributed "solution", you can enjoy certain additional freedoms. this includes the freedom to shoot yourself in the foot (data inconsistency, but you had to fight a bit for that state by promoting a non latest slave to master).
> the whole promise of mongo is/was distributed (HA+LB)
Yes. It is so easy to bring up a Mongo cluster and feel like you have HA. Don't worry, it won't be proven wrong until you have writes during a cluster degradation and get unlucky.
And, most importantly, as far as I know the default replica set config is safe (durable, consistent, atomic). It handles a lot of failover scenarios for you automatically, and you have to manually intervene to get into a bad state - which might be okay for that particular business case (better than full downtime).
This is what I'm talking about. You think that some feature that makes performance unpredictable is that important, when it's the last thing you should care about.
What people should care about is the integrity and consistency of the data they're storing. RDBMS do an extremely good job of generalising this problem with reasonable performance, which suits the majority of real world problems. The ideas behind it are understood by most of the industry.
It should be the default choice (or plain filesystem storage), unless you have a specific requirement for something different. In the latter case, this should be people who are well informed about the database choices they're making.
Unpredictable performance would mean you can't predict how long a specific query will take. That is the opposite of how Postgres works: if a query is slow, it's predictably slow as long as the data doesn't change. So what you must mean is that when the data changes, it doesn't behave the way you expect.
So again, how do these poor choosers make better decisions in the face of changing data when they have to write all the algorithms by hand, and understand how to scale the data correctly? Seems to me that pretty soon they'd just be writing a very crappy database on top of some keystore.
The only way for query performance to be truly predictable is for your data to be static. If that's not your situation then a query optimizer is incredibly helpful because it means that you don't have to rewrite your queries every time you add an index.
It also means that as your data grows or shrinks, the optimizer will notice and change the plan accordingly if it makes sense.
That means that 18 months after you implemented a feature you've long since forgotten about, you don't have to come back and figure out what the new plan should be. And that's huuuuge.
You can find accounts of the db changing the plan from a good one to a bad one, but I'll go out on not that far of a limb and say that those are < 1% of the cases. Nobody complains about the queries they didn't have to come back and change. And the better the optimizer, the better that trade off will be.
> Why do ostensibly smart engineers keep falling for this stuff?
Review-driven design is likely a large contributor. If you know you will be looking for a new job in 2 years, you'd better get some experience in $latestTrend instead of $provenTechnology.
I never got around to using Mongo as my main doc db because it was incredibly hard to find a management tool.
I now use the JSON support functions in SQL Server and do not need a different type of database. SQL Server handles my small 'document db' implementation with the infrastructure of an RDBMS. Win-win for me.
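For anyone curious what that looks like in practice, here is a minimal sketch (SQL Server 2016+), with hypothetical table and column names rather than the poster's actual setup: an ISJSON check keeps the column valid, and a computed column makes one JSON field indexable.

    -- Hedged sketch: small JSON documents in SQL Server, with one field indexed.
    -- "Documents", "Doc" and "DocType" are made-up names for illustration.
    CREATE TABLE dbo.Documents (
        Id      INT IDENTITY PRIMARY KEY,
        Doc     NVARCHAR(MAX) NOT NULL CHECK (ISJSON(Doc) = 1),          -- only valid JSON
        DocType AS CAST(JSON_VALUE(Doc, '$.type') AS NVARCHAR(100))      -- computed from the JSON
    );

    CREATE INDEX IX_Documents_DocType ON dbo.Documents (DocType);

    -- Query via the computed column so the index can be used
    SELECT Id FROM dbo.Documents WHERE DocType = 'invoice';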
To me Mongo just got popular by mistake way too early. It's like having a celebrity retweet your post because they liked what they saw at the time, exposing you to the world where everyone now thinks you have something important to say. Not surprisingly, you don't!
Why? Seriously, JSON is lame; it's an idiotic trend, a fashion from the UI that's infected the back end for no gain whatsoever for anyone not using JavaScript. XML is better, and we've just watched JSON slowly reinvent everything XML already had.
My guess is the grandparent is (mis-)remembering the kerfuffle[0] around mongo shipping a copy of PG as a "BI Connector". But yeah, the timeline is off, that was in 2015.
Actually, no. I'm remembering asking myself how Mongo worked back when it launched, and reading somewhere that it was a heavily modded Postgres under the hood (like Amazon Redshift, which makes no secret about it, since).
I might be misremembering, though. And there's a possibility that what I read then was inaccurate.
I might admittedly be remembering this wrong, but I seem to recollect that Postgres had a couple of (unofficial) contrib modules to support JSON back in 2008 or so, and that (unfortunately buried in Postgres vs Mongo articles) the initial release of Mongo was actually a fork of Postgres for all practical intents.
The relevant data structures and indexing mechanisms (hstore, tsvector, gin, gist) were all available -- if only experimentally -- before Mongo launched.
I have seen so many critical articles of Mongodb. Even if I did have a use-case particularly well-suited for it, I would never consider using Mongo.
As an aside, I did consider using SQL Server until I looked at the licensing fees. Why would someone choose SQL Server when options like Postgres or MySQL/MariaDB exist? Is there a specific feature that MSSQL or Oracle provides which is not available elsewhere and would be a core feature the company's data storage is built upon? I.e. a feature so important that the company's product architecture would be fundamentally different without the proprietary database.
SSRS is a big draw. There are other and probably better reporting tools, but the integration and support from MS is worth something to most companies. There's also simply the network effect and marketing at play, and MS has always backed that up with results. The money involved really isn't prohibitive for most small to mid-size companies that I've worked with. I've only seen the opposite: companies leaving their expensive UNIX licenses to move to an all-MS stack to save money. I have never seen large companies try to reduce their technology costs to zero with pure open source, because I think they intuitively know that you either pay with your licenses or you pay by maintaining a lot of expensive expertise, which also needs to be properly managed and effective. In the discussion of free-as-in-freedom vs free-as-in-beer, most companies' core business is not technology, and they know there's no truly free beer and don't really care about the freedom. They want everything I've already mentioned, plus the short- and long-term support the license buys.
Outside of my professional career, with my personal projects I use SQLite, but if I were to build out something that intended to be larger scale business venture, I'd probably go with Azure SQL Database for multiple reasons. A large one being MS's overall integration, including their full control of the stack and CI/CD with .Net->Azure DevOps->github->Azure Pipelines->Azure PaaS/SQL Database. I admire what they're building, but a lot of people work outside of the MS realm. Companies don't tend to have a tech stack bone to pick as we do on HN, and many are already using part of the MS stack.
In sum, there are just a lot of business realities at play. I wouldn't personally run out and buy a SQL Server license for what I do at home or with my (very) small side business, either.
Integration. It is all about the integration. The M$ platform is so tightly coupled that many companies see more value in spending on that tight integration than in spending on 25 different engineers to manage the 40 vendors that may be needed to run some of these applications and platforms.
Dealing with one vendor, M$, is better than dealing with 20. You pay for one support package and get all the benefits that come with one platform.
As much as I love open source, I do see the benefit of working with one vendor on all of your stack needs.
M$'s reporting systems - SSRS, SSIS and others - are probably their bread and butter in the DB space. Very few people want to spend 60-70 hours a week building complex command-line reports when SSRS and SSIS come with a nice GUI that helps you build them. Sure, there are others that do the same, but most require a third-party vendor to add the functionality. Take Apache, for example: with Java and Apache I've had to deal with literally 15 to 20 different vendors just to do the same thing I can do with one vendor. Jenkins, Zookeeper, Camel, Cassandra, Tomcat - all under the banner of Apache, but in reality each governed by its own set of standards. I mean, take a look for yourself: https://www.apache.org/index.html#projects-list
>Why would someone choose SQL Server when options like Postgres or MySQL/MariaDB exist?
Because they're in an organization where they're already invested in Microsoft technologies, so it's much easier to just use MS's DB instead of something entirely different that doesn't integrate as well.
It's similar to why many people use Apple software products that aren't as good as alternatives: if you already have an Apple platform/device, it's easier and better integrated and likely already installed.
My answer is still the same, and actually assumed that. No one is really "greenfield": everyone is already invested in some technology to some extent. If your organization is already invested in MS technologies, your managers only know MS, and your developers only know MS/.NET/etc., then the choice is pretty obvious: MS SQL Server. Switching to Linux/PostgreSQL would be a sea change for an organization like that (even though, IMO, it would be better in the long run).
Similarly, if your organization is already a Linux-based one, running Linux servers for everything, adopting SQL Server would be a huge PITA and would likely be laughed out of the room if suggested.
I really like SQL Server, especially the dev environment, 1st and 3rd party management tools. Not sure if they have improved this, but replication was a limiting factor and the pricing changes a few years back pushed it down my preference list.
Disappointing article. There are no clear reasons mentioned for migrating away from MongoDB. "All these problems" and some issues with Ops Manager are mentioned, where "all these problems" is a couple of outages. As if other database technologies prevent outages. As soon as they experience a couple of outages with their new stack they will migrate to something else, presumably.
Thank you! I was reading through the comments wondering how no one has seen through the bullshit veil purported by the title, followed by a detailed outline of their migration which I think is the "shiny object" to distract people from an unfounded argument.
The main advantage of the migration was getting onto a fully managed DB. What we didn't mention was that there were also huge cost implications - savings of ~£40k - through switching to RDS - compared to paying for a mongo support contract.
1) MongoDB Atlas would work well here: it’s Mongo hosted on a cloud provider of your choosing managed by the people that make it.
2) a DBA, even a noSQL one*, is worth every penny. (I don't mean this to put down noSQL DBAs, just that it's good to have someone dedicated to managing them, and that a DBA versed in the administration and upkeep of a relational system could retrain pretty quickly to help get the most out of a noSQL one)
A few things stood out at me. In no particular order:
- Going with Mongo in the first place cost them dearly. A CMS is a weird application for schemaless - not necessarily wrong, but definitely weird. I wonder if they would benefit from moving to a more structured schema, and I'm willing to bet a lot of the migration complexity comes from that choice in the first place.
- God damn that is a long migration. Holy shit. I know tech isn't their core competency but they do seem to have very competent staff. I've worked in places this glacially slow and I get how it can get this bad but it really strikes me as having no right to be. 10 months to migrate .... from the point where they were ready, until they were done. And somehow integration tests got overlooked during all that.
- Close call with dynamodb. Bit of a wtf on waiting for the feature to be implemented for nine months though. I'm sure they have an account manager with aws... I definitely think blocking an internal process on such a fragile externality as a closed process upstream publishing a feature is the wrong move and a red flag. Their migration path would have probably been harder with dynamodb too.
- I feel the pain of their troubleshooting issues on the load testing step. I can completely see how this can happen. That said it also raises a few red flags to me. Letting something as simple as the load testing step get complex enough to require weeks of engineering is ... Eh.
Have more thoughts but I hate typing on mobile. This is a fantastic write up, I love when non-tech companies publish this sort of stuff. And hurrah for postgres.
That depends. If the CMS is managing documents that have flexible structures that are tree-like, it might not be such a horrible idea to model that structure in a document instead of relationally.
Or to put that in context of a basic data structure:
C = Actual content to be published
MS = User accounts, user permissions, user authentication options, user authentication logging, meta change logs (user X removed tag Y at time Z), behaviour logs (user X viewed revision Y of article Z at time A)... and that's just scratching the surface of a very, very basic CMS.
I'm being generous and assuming articles, tags, bylines, and attached media (with full change history) is all "C".
I enjoyed the write up. I love these kinds of semi-technical, semi-story-telling, pieces.
I am saddened by many of the comments here though which equate to: "never try anything until you know everything" - sorry but that's just not realistic and it's unfair to the people who - commendably - contribute to these write ups and hold their hands up to mistakes-made, decisions that went badly with hindsight, etc.
Bigger picture: the Guardian appears to be thriving, and is succeeding based on the efforts of the tech team here. So if you read this and come away with a sense of "failure" you're probably missing something important.
It's OK to try new things and fail - just get back up and keep on trying, and use the new wisdom you build!
Plenty of people here like to dish on Mongo and the product seems to have been re-architected a few times since I used it seven years ago. By what metrics can we say the product is one worthy of passing a HN smell test? Passing Jepsen was seemingly not enough. https://www.mongodb.com/jepsen
> By what metrics can we say the product is one worthy of passing a HN smell test?
Common sense and formal education? I'm sorry, I'm aware of how incredibly snarky and arrogant that sounds, but in this case I always struggled to comprehend how MongoDB, or most of "NoSQL" in general, was considered viable to begin with.
"Schemaless" just immediately means that instead of the database keeping consistency, you now essentially have to do all your type and constraint checking in the application; I never understood how that's favorable, and how that remains maintainable in any way. On top of that, NoSQL (and maybe Mongo in particular?) also decided to throw away all guarantees that classical SQL databases have been offering for multiple decades for very good reason.
I think everyone who's had a somewhat theoretical class on databases and learned about, for example, what ACID really means, will have smelled that something doesn't quite add up here.
> "Schemaless" just immediately means that instead of the database keeping consistency, you now essentially have to do all your type and constraint checking in the application...
This is so true. Some component has to maintain integrity of the data. Should it be the application developer? Or the database software team?
I know which one I'd choose, and which one is focused on data integrity and not business logic.
And it's not just who you trust, it's also at what layer you want to enforce your integrity and consistency. Not letting the db do that means that every part of the application that touches the database needs to do that enforcement. Even if you abstract that away into a strict layer directly above the database - which essentially means you've just implemented parts of a database on top of a database - you have to essentially invent and implement the methods and languages to do so yourself. At which point you're pretty much guaranteed to be much worse off than if you had just let the DB do it. This might mean handling unexpected errors if you have a bug in your application logic, but that's usually orders of magnitude better than introducing hard-to-fight inconsistencies.
> This is so true. Some component has to maintain integrity of the data. Should it be the application developer? Or the database software team?
if you're using partitioned (inherited) tables in Postgres, then it's the application developer, as foreign keys on inherited tables don't really work.
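A minimal sketch of that limitation, with hypothetical table names: the foreign key below only checks rows stored directly in the parent table, so a row living in a child (inherited) partition is invisible to it and integrity falls back on the application.

    -- Hedged sketch of inheritance-based partitioning and its FK caveat
    CREATE TABLE events (
        id         bigint PRIMARY KEY,
        created_at date NOT NULL
    );
    CREATE TABLE events_2019 () INHERITS (events);

    CREATE TABLE audit_log (
        event_id bigint REFERENCES events (id)   -- only "sees" rows in events itself
    );

    INSERT INTO events_2019 (id, created_at) VALUES (1, '2019-01-02');
    -- Fails: the FK check misses the child row, so the app must enforce the link
    INSERT INTO audit_log (event_id) VALUES (1);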
I also don't see the benefit of schemaless. Schemas have been way too useful in my programming career.
Eventual consistency may be a viable trade-off to help you scale, but you probably need to be at massive scale before it's worth moving away from traditional SQL, which can have read replication while the app adds caching on top. Still, there is probably some point at which eventually consistent nodes make sense.
In a sense that is what a CDN is, and we like to use those.
Are you reading the detailed reports prepared by the Jepsen audit team, or the press release? Jepsen audits are public.
The summary of the latest MongoDB report [1] follows,
"In this Jepsen report, we will verify that MongoDB 3.6.4’s sharded clusters offer comparable safety to non-sharded deployments. We’ll also discuss MongoDB’s new support for causal consistency (CC) in version 3.6.4 and 4.0.0-rc1, and show that sessions prevent anomalies so long as user stick to majority reads and writes. However, with MongoDB’s default consistency levels, CC sessions fail to provide the claimed invariants."
> This interpretation hinges on interpreting successful sub-majority writes as not necessarily successful: rather, a successful response is merely a suggestion that the write has probably occurred, or might later occur, or perhaps will occur, be visible to some clients, then un-occur, or perhaps nothing will happen whatsoever.
> We note that this remains MongoDB's default level of write safety.
In my previous startup we used mongo as the main datastore (this was pre Postgres 9.4, before indexable JSONB). It worked for us, and it worked fine. I saw the mob rising up against it, but we have not had any of the problems (maybe we just read the manual and had it configured correctly?).
Today my go-to is postgres of course, but I had to wait until JSONB was out to make the switch.
> Since all our other services are running in AWS, the obvious choice was DynamoDB – Amazon’s NoSQL database offering. Unfortunately at the time Dynamo didn’t support encryption at rest.
Whoa, that was close. I really don't see why anyone would choose DynamoDB as a general purpose data store, unless they enjoy wasting countless hours finding ways around the limitations it imposes about how data should be stored and accessed. At least that was my (admittedly limited) experience. Postgres is a much better choice.
NoSQL databases like DynamoDB or Cassandra do make for good general-purpose data stores, but you have to work within the concepts of NoSQL: typically, use one table with careful choices of partition and sort keys.
It's an epic shift in thinking to go from a schema of 50-60 tables to 1. Almost every dev I've worked with is very new to this.
The #1 sign a dev has no business using NoSQL: they chat about how flexible NoSQL is. Yeah, you can add attributes on the fly, but I've found Dynamo to require much more careful planning than RDBMSes. Most devs I've met can understand when they need to use an index. But almost all of them have issues predicting whether their changes will lead to unbalanced requests against partitions.
Anyhow, since you can even connect the PostgreSQL WAL easily to a log stream like Kafka or Kinesis, I'm not sure why you'd ever start with a NoSQL DB, unless you just had master NoSQL data wonks.
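To illustrate the WAL point, here is a hedged sketch using Postgres's built-in logical decoding with the test_decoding plugin; a real pipeline would typically use an output plugin plus a consumer (for example Debezium) to push changes into Kafka or Kinesis. It assumes wal_level = logical, and the slot name is made up.

    -- Hedged sketch: expose the change stream via a logical replication slot
    SELECT * FROM pg_create_logical_replication_slot('cms_changes', 'test_decoding');

    -- Peek at pending changes without consuming them from the slot
    SELECT lsn, xid, data
    FROM pg_logical_slot_peek_changes('cms_changes', NULL, NULL);

    -- Drop the slot when done, otherwise it retains WAL indefinitely
    SELECT pg_drop_replication_slot('cms_changes');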
It's not a great article tbh - it's well written, but it shows a clear lack of knowledge about running a backend. The title should be "we didn't know what we were doing so we switched to a managed DB".
I mean, yeah, who knew that blocking NTP, and therefore letting time drift, would break everything...
For those criticizing MongoDB, Fortnite generates $3B/year and runs on MongoDB, you should tell them it's a mistake and that they should use PG instead.
I don't usually bite for these "X uses Y, so Y must be good" arguments, but I didn't know about Fortnite and MongoDB. A quick google suggests they've had downtime due to issues with Mongo and have had problems scaling it, though.
I thought the article was well written but I have to agree about the MongoDB use case by Epic. Mongo has its place and it is mature enough that it can handle itself in a production setting.
That being said they had a massive outage due to a MongoDB issue.
Yeah except not every use case is the same. Treating every use case as the same shows, let's see what was it, 'clear lack of knowledge running a backend'.
Having a successful product doesn't mean that all technical decisions that were involved in making that product were successful. It raises the bayesian estimate that they were, of course - but not to an absolute boolean value.
Can you give an example of tooling that is superior, or not as good, in PostgreSQL compared to MySQL? Serious question. I'm curious because I mostly use PostgreSQL.
1. Stop. Trying. To. Build. Your. Own. Cloud. A pizza shop doesn't build their own cars to deliver pizzas.
2. There's no such thing as hassle-free anything, unless you are paying someone else to deal with the hassle. Sales teams lie.
3. Justifying an untested idea with "but it's modern technology" is going to backfire. Follow established patterns with good track records.
4. Writing your own in-house behemoth product, of which there are already many kinds available, results in long-term expensive engineering projects necessary to get around the high costs you didn't know were coming.
5. Don't write business logic, or your primary software product, in a way that talks directly to a database. Just... No.
6. Magic new technologies that remove the problems of old technology also introduce the problems of new technology.
1. Well, they probably own the car though, which is a better analogy. You don't need to rent your car; you can simply purchase it, just as you can purchase servers. People build their own garages.
4. So you should only use already-written software? Sometimes it is just faster and better to write it yourself. You get fewer dependencies, you know how the whole thing works, etc.
I'm pretty sure he wants to say that you should abstract your database away so that your business code / domain model doesn't depend on it. It becomes a "plugin" to your application and you can easily switch it just by writing another implementation.
I couldn't disagree more, at least in the context of this article. If you have an abstraction layer so high level that app developers can't tell whether they're using MongoDB or PostgreSQL, then they're not able to use any of the advantages of either of those systems.
Sure, use an ORM to abstract away the differences between PostgreSQL and MySQL (up until you need to care about them). That's reasonable. But maintaining a magical MaybeSQL layer that's powerful enough to not totally suck is going to totally suck.
Why wouldn't you be able to take advantage of those systems? I think it's quite the opposite where you can take the advantages without even knowing about it / affecting other parts of the system.
Because of how different they are, taking advantage of them requires using them in specific ways that are also very different. If you have an abstraction layer that hides those differences, it's practically a given that it does so with the lowest common denominator approach, where you get all the flaws of both and none of their unique benefits.
(Alternatively, people find ways to use the abstraction layer such that it produces the desired usage pattern. Of course, then the code is no longer truly portable to another DB, because that same pattern is likely to be a perf issue there.)
And the implementation can implement it in a specific way to take advantage of the chosen technology. If you have some esoteric use cases, they could be handled in a special way, separate from the business code.
For that specific use case, maybe. Where it falls apart is in `searchUser`, where the methods (and performance characteristics) of digging through the respective databases are going to be radically different. In a newspaper's implementation, you're going to have to search by date, subject, keyword, body text string, reporter, etc. etc. In MongoDB, that generally involves creating an index on the combination of fields you'll be searching together. In SQL, that generally looks like adding a `where reporter = "Jane Smith"` predicate. The MongoDB version may be faster if you have an enormous amount of data spread across a cluster. The PostgreSQL version will be more flexible when your boss wants to know how many reporters wrote stories about candied yams within three days of the publication of any story containing the word "Thanksgiving".
Being tasked to come up with an abstraction layer that supports the speed of precomputed, clustered indexes with the flexibility of SQL - if I were in a content creation business and not a database engine writing business - sounds like the kind of project that would make me quit my job and go on a one year silent meditation retreat.
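For what it's worth, the Thanksgiving question above really is just a one-off query on the SQL side. A hedged sketch, assuming a hypothetical articles(reporter, body, published_at) table; the point is only that no precomputed index or schema change is needed to ask it.

    -- Hedged sketch of the ad-hoc "candied yams near Thanksgiving" question
    SELECT COUNT(DISTINCT yams.reporter)
    FROM articles AS yams
    JOIN articles AS thanks
      ON thanks.body ILIKE '%Thanksgiving%'
     AND yams.published_at BETWEEN thanks.published_at - INTERVAL '3 days'
                               AND thanks.published_at + INTERVAL '3 days'
    WHERE yams.body ILIKE '%candied yams%';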
That objection doesn't make sense. Queries are nothing more than a tree of predicates, and how the back end uses those predicates is not relevant to the API for specifying them. Things like indexes, whether in Mongo or in SQL, are implementation details that can easily be hidden and not infect the API. You can interpret the tree of predicates into a SQL where clause or into a Mongo index search.
The OP is correct, your app can speak to an internal API without the underlying database infecting your domain code. That in no way implies you can't take advantage of the best of each database.
This is all rosy in theory. In practice, the way you write the query matters quite a bit. Often even between different SQL implementations.
And it's not just queries. Transactions often have important semantic differences that will be visible on application layer - again, even between different SQL implementations (e.g. MVCC vs locks).
> In practice, the way you write the query matters quite a bit.
Which is hidden in the query interpreter for said db implementation. Each implementation can break down that abstract query into whatever implementation specific query works best in that database.
There's always some abstract way to represent it that doesn't require vendor specific knowledge nor does it remove the ability to apply vendor specific abilities.
Look, I just don't agree with you, I agree with OP. Db specific stuff should be hidden from the domain layer by an abstract query representation and an abstract transaction representation to be plugged in at a later time.
Stuff like "each implementation can break down that abstract query into whatever implementation specific query works best in that database" is wishful thinking. It's like saying that Java is faster than C++, in theory, because JIT can produce better code. And in theory, it can. In practice, we're not there yet. Same thing with high-level database abstractions - they're all either leaky in subtle ways, or they constrain you to extremely basic operations that can be automatically implemented efficiently on everything (but e.g. forget joins).
Several, actually, which is why I know what I'm talking about; I've explored this area extensively. When Fowler first released PEAA I dug in, went nuts, and spent years coding up and exploring all the possible approaches, figuring out which ones I liked and why and which ones I didn't and why.
> Same thing with high-level database abstractions - they're all either leaky in subtle ways, or they constrain you to extremely basic operations that can be automatically implemented efficiently on everything (but e.g. forget joins).
If you're doing joins in your ORM, frankly, you're doing it wrong. Most ORMs do it wrong; they try to replace what a db does best. The right way is to keep joins in the db. The role of an ORM, when used properly, is to map tables and views into objects and to allow querying over those tables and views with an abstract query syntax. Joins belong in a view, not in code. It's called the object-relational impedance mismatch for a reason: you have to draw a line in a reasonable place to get anything to work well, and putting joins into the ORM is crossing that line - which is why most ORMs utterly suck. Joins aren't queries, they're projections; put the queries in the code and the projections into the database. This works perfectly and lets each side do what it does best. Queries are easily abstracted, projections are not; projections don't belong in the ORM.
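A minimal sketch of "joins belong in a view", with hypothetical table names: the ORM maps the view like any other relation, and the projection stays on the database side.

    -- Hedged sketch: the database owns the join, the application just queries it
    CREATE VIEW article_with_author AS
    SELECT a.id,
           a.title,
           a.published_at,
           au.name AS author_name
    FROM articles AS a
    JOIN authors  AS au ON au.id = a.author_id;

    -- The application (or ORM mapping) queries the view like a table
    SELECT id, title, author_name
    FROM article_with_author
    WHERE author_name = 'Jane Smith';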
Any language with named tuples has a type system that is sufficiently expressive to handle joins without any sort of impedance mismatch. So, the only reason to avoid them is exactly the one that I cited earlier - the underlying implementation difference between databases.
> Any language with named tuples has a type system that is sufficiently expressive to handle joins without any sort of impedance mismatch
Incorrect. Named tuples will give you nothing back but a result set; the impedance mismatch refers to the mismatch between result sets and a domain model, and getting tuples back doesn't remotely address this problem. I'd suggest you don't understand what the object-relational impedance mismatch problem actually is.
It's not a type system problem, it's fundamental mismatch between the relational paradigm and the object oriented paradigm. If a domain model has customers and addresses, and you do a relational query that joins the customer table and address table to return only the customer name and address city, the resulting set (name, city) doesn't map to the domain objects and isn't enough data for the domain model to load either of those objects which may contain various business rules. This is what the impedance mismatch refers to, relational projections of new result sets simply do not map to the OO way of doing things. Joins that create new projections are a relational concept that have no place in the object oriented world view: objects don't do joins, and object queries don't return differently shaped objects.
Hacks like partial loading of domain objects are attempts to mitigate the impedance mistmatch, but they do not solve it; they cannot solve what is a fundamental difference between two different ways of seeing data. Data is primary in the relational model and its shape can change on a per query basis, this is incompatible with the object oriented view of the world in which whole objects are primary and data is encapsulated and thus hidden.
The object-relational impedance mismatch does not refer to a language problem; it refers to a difference in paradigm between OO and relational. It exists in every language regardless of the language's abilities, and it's not a problem that can be solved, only mitigated, if you want to use both paradigms. You can solve it by avoiding two paradigms altogether, either by bringing the relational model into the application and not using OO, or by using an object database.
I don't think there's a point in arguing this further. The other poster seems unwilling to understand that there are differences in databases within a category, let alone that there are multiple categories of databases with entirely different characteristics.
It's not a common denominator. If I need a Customer aggregate or a Payment aggregate it doesn't matter if I get it from SQL or Document database as long as I get that aggregate. My code doesn't care about query implementation.
FedEx needs both big, slow, high-volume trucks for moving stuff between cities, and small, nimble trucks for delivering to the doorstep. Someone decides that it's inefficient to maintain two separate standards: any driver should be able to get into any available vehicle and have it Just Work for whatever job they have at hand, right? So they decide to make a single vehicle that can fulfill all roles.
Well, a vehicle that can squeeze down an alley won't have the cargo room of a giant highway truck. The inter-city drivers will hate its poor capacity. Likewise, one that has a big 20-speed transmission for hauling heavy loads is going to drive the city drivers nuts. They're going to end up with one single interface to all possible roadways that everyone can come together and agree to hate.
If the database API is so free-form that you can store anything in it, you won’t get the advantages of PostgreSQL’s strict typing and lightning fast joins. If you make it so regimented that your data model ends up looking like a set of tables with foreign keys, then it won’t be able to make full use of MongoDB’s... whatever it does well.
They’re different animals. Choosing one highly affects the rest of your system design, from how you arrange your data to how you add new data to how you search for it. PostgreSQL and MongoDB have fundamentally different strengths and weaknesses, and if you make something that works equally well with both, it’s inherently going to suck equally on either.
It could be an application frontend interfacing with a web service or network service, where that component handles the database. That can make implementations cleaner when migrating from on-prem to cloud, or if you're ripping out the frontend stack semi-regularly, as more web-oriented devs tend to do. I can't say I disagree, and I'm moving an application to this model myself so that I can easily migrate it from a WinForms frontend to a Blazor SPA at some point.
RE 5: I'd argue that a relational database (barring extreme-scale cases) is the perfect place for your business logic. Constraints, unique indexes, etc. all ensure that your data is always consistent with itself and with the business logic you've encoded.
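A minimal sketch of what that looks like, using hypothetical tables rather than the Guardian's schema: each business rule becomes a constraint the database enforces on every write.

    -- Hedged sketch: business rules expressed as constraints
    CREATE TABLE authors (
        id    bigserial PRIMARY KEY,
        email text NOT NULL UNIQUE                           -- no duplicate accounts
    );

    CREATE TABLE articles (
        id           bigserial PRIMARY KEY,
        author_id    bigint NOT NULL REFERENCES authors (id), -- no orphaned articles
        slug         text NOT NULL UNIQUE,                    -- one URL per article
        state        text NOT NULL DEFAULT 'draft'
                     CHECK (state IN ('draft', 'published')), -- no invalid states
        published_at timestamptz,
        CHECK (state <> 'published' OR published_at IS NOT NULL) -- published implies a date
    );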
I’m not really sure they articulated why they had to switch very clearly. They didn’t like managing Mongo. They said they couldn’t use Mongo’s hosted solution BUT they switch to hosted Postgres. Why not just overcome the limitations preventing them from switching to hosted MongoDB?
They probably mean that they wanted the DB hosted within the private subnet of their VPC. With a traditional hosted offering you’d connect over the public net to a system hosted by the SaaS provider. With RDS they could spin up the DB within their own VPC.
Atlas runs the hosts inside MongoDB's AWS account. We have the same restriction: because of client privacy issues and GDPR compliance we can't let a 3rd party host the data; Amazon is OK because we control the data on the instances.
Client privacy issues should not preclude the use of MongoDB Atlas... Atlas offers encryption of data at rest as well as the ability to manage your own keys.
It looks like they wanted to run Mongo inside their own Amazon account- under their direct control. AWS could do that with Postgres while MongoDB couldn't.
Yep that's what we were doing, and the management software (OpsManager) was also running on EC2 instances. We messed up the VPC configuration so that NTP didn't work on some of the instances - which unsurprisingly broke authentication between OpsManager and the db instances
The way I read it, mongo was hosted the same way postgres was - on AWS VMs. OpsManager wanted to be more like a SaaS.
I'm a little skeptical of the idea that you can do custom software development with a custom database schema and realistically expect to outsource DB management. But sure, you can try. And in any case, you'd hope a largely read-only and document oriented dataset like a newspaper's has a relatively simple schema; without too many crazy schema quirks.
By server management, do you mean the VM? Sure, you can outsource that. What I mean is that what a home-grown application does with a database, and how it uses it, tends to mean that database management (software/schema/optimization/whatever) cannot be application-agnostic. So if you outsource this, you're either effectively hiring a consultant who still needs to learn the details of your specific application, or you should assume it's going to cost you some time to do yourself.
I'm sure you can get advice or buy know-how, but the two are too coupled to think you won't also need to spend some time yourself. (At least, assuming your workload is large enough and complicated enough that naive brute force isn't an attractive option.)
Everything you are saying is true, but when people talk about managed services, for the most part they are referring to someone else managing the VM, the operating system and the server application running on the VM - in this case the database.
Exactly! That's why I was confused - they didn't replace mongodb with colloquially "hosted" postgres; they stayed at the same level of (partial) hosting. I.e., they didn't "switch to hosted postgres", in any normal sense.
Interesting article. Our stack isn't using Mongo, but reading these posts gives me good ideas for how others manage their software lifecycle. I like their use of a pre-production environment as a step between QA/UAT and production. This allows for a clean trial run of anything new deployed to production, as well as greatly reducing deployment-day surprises.
Anecdotes like this are just that: anecdotes. There are no numbers in this article to show a trend away from Mongo; Mongo is actually continuing to gain adoption. See https://db-engines.com/en/ranking
Kudos to The Guardian for publishing this article on their top-level domain. It looked identical to one of their news articles (including ads). Only the content (excellent writeup) was different. Many tech companies silo their engineering teams writings into a “tech” part of their website. If this wasn’t an ad for The Guardian’s technology team, I think it could be. Putting the engineering team’s content in the same article CMS as the news articles is a small decision that reflects highly on The Guardian’s publishing organization.
Have you tried to find it from the home page though?
It's under `/info`, which un-suffixed redirects to `/about`, which I _can_ find from the home page (it's linked in the _footer_) but I can't find this engineering blog from `/about`, or anywhere else I can get to from `/`.
I don't think it's so much a positive choice to use the same system as it is just using (a possibly separate instance of) what they already had, and still squirreling it away in a corner.
It's the same system, just without any tags on it that mix it in with editorial content. I think if we wanted it published somewhere like the technology section we'd have had to go through a much more rigorous editing process (we probably wouldn't have got away with 3000 words!)
This requirement made some sense in a world where a rogue employee might yank your database server out of a rack and walk off with it, but I don't understand why it is still considered relevant in an AWS context. First of all, your data is never really "at rest"; a huge point of Dynamo is that the data is always available (at least 99.999% of "always", anyway). Second, "at rest" where? Literally on a physical disk? Even if there were a single physical disk that held it, I'm willing to bet that even if you were in the right AWS data center with the support of a willing AWS employee, you couldn't find that disk, and even if you could, there'd be no way to remove it. I'd guess it's literally millions of times more likely that you fail to deprecate some AWS API key sitting in clear text on a laptop you sold as surplus than that anybody gets their hands on your at-rest Dynamo data.
> “But postgres isn’t a document store!” I hear you cry. Well, no, it isn’t, but it does have a JSONB column type, with support for indexes on fields within the JSON blob.
Interesting. I didn't know you could make indexes for things /within/ the JSON.
Yep, you can index any expression on a single row in postgres. The downside is that the optimizer will only use the index if the query contains that exact expression, while it's usually smarter with column indexes.
The only limit that I've run into in comparison with NoSQL databases is that you cannot index fields inside arrays within the column (json -> arrayIndex -> field); you need to normalize the array into a separate table.
With a GIN index and a JSONB column you can index the entire json document. It's the same index type as Postgres uses for the full text search, but not super efficient. You can also project out specific subsets of the document to an index and Postgres will use them.
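Pulling the two ideas above together, a hedged sketch with hypothetical names: an expression index on one JSONB field (used only when the query repeats that exact expression) and a GIN index over the whole document for containment queries.

    -- Hedged sketch of JSONB indexing options in Postgres
    CREATE TABLE item (
        id      bigint PRIMARY KEY,
        content jsonb NOT NULL
    );

    -- Expression index: only used when the predicate matches this exact expression
    CREATE INDEX item_type_idx ON item ((content ->> 'type'));
    SELECT id FROM item WHERE content ->> 'type' = 'article';

    -- GIN index over the whole document: supports @> containment queries
    CREATE INDEX item_content_gin ON item USING gin (content);
    SELECT id FROM item WHERE content @> '{"type": "article", "section": "politics"}';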
The syntax and terminology are different, but MySQL has supported indexes on values inside of JSON (or any arbitrary expression/function) since MySQL 5.7, released about three years ago. In MySQL, you can define a "generated column" and index it in InnoDB.
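And the rough MySQL 5.7+ equivalent, again with hypothetical names: a stored generated column extracts the JSON field, and an ordinary secondary index covers it.

    -- Hedged sketch: indexing a JSON field in MySQL via a generated column
    CREATE TABLE item (
        id        BIGINT PRIMARY KEY,
        content   JSON NOT NULL,
        item_type VARCHAR(64) AS (content ->> '$.type') STORED,  -- derived from the JSON
        INDEX item_type_idx (item_type)
    );

    -- Filtering on the generated column lets InnoDB use the index
    SELECT id FROM item WHERE item_type = 'article';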
Why did they actually migrate databases? The real issue is database operations, not the database itself. The data model is simple so anything would've worked fine, and they already have a working system.
The proper solution is to onboard the proper talent and tools or outsource it all. MongoDB does have fully-managed offerings that will automatically migrate and run your database, across cloud-providers if you need, and would've cost less than all the time and effort spent on this. This is just another example of poor technical competency at media publishers.
Did they ever publish the rationale for why they used MongoDB in the first place? They mention having 2.3 million content items. If we assume that they have 100 DB entries for each content item, that's still only 230 million DB items. In that case, was it important to run a sharded cluster vs a typical primary-secondary HA setup? (which they ended up switching to)
I was reading the article waiting to get to the part where they explain how their shards scale. It ended up just reading like a migration from one db to another.
I'm guessing their original choice for MongoDB was more for the schemaless development flexibility rather than horizontal scaling capabilities. This seems to have worked well enough for them. DynamoDB seems like a natural fit coming from a document store and they did very well not to choose it.
I think AWS is due for a managed NewSQL store comparable to Google Cloud Spanner or MS Cosmos DB.
I wondered about that, yeah. That's just Not That Much Data; too, they're clearly able to sustain multiple-minute downtimes (the blog asserts as much). The original design reeks of over-engineering.
Great blogpost, though; well written, which is always a surprise.
Considering they've been using it since 2011 or so, there can't be any good reason beyond a dumb choice by the tech team, or management buying Mongo's sales pitch without proper verification.
Good read. Although I don't quite understand why every few years media companies keep rewriting their content management layer, which generally has no impact on the company's top line or bottom line whatsoever. How can these types of investments be justified when all you are doing is recreating exactly the same APIs with the same data? Am I missing something here?
It's not just media companies rewriting something significant every few years; everyone in the tech industry does it. There's a very good reason for it: it provides interesting and high-paying work for tech employees.
Who wants to go to a company, set up a great system that doesn't need any more work except maintenance, and then just settle down into maintenance mode? No one who wants a rising career. "Maintained system using $oldTech" doesn't look good on a developer's resume, whereas "migrated critical system to $newTech" does.
And it's not just the developers: managers who oversee big teams of developers working on a big new project have more prestige and are paid more than managers who just oversee some boring already-done project that's in maintenance mode.
Now, in this case, there were probably very good genuine reasons for making a big change (I'm no fan of MongoDB), but many times if you look closely with a critical eye, you'll probably find that people padding their resume and making up a reason for their job to exist is the real reason something is being done.
Busy-work is never interesting, and if it's "high-paying" someone is getting fleeced (and will do what they can so as not to get fleeced in the future). At some point, the constant rewriting will have to stop. In contrast, maintenance work on a well-engineered critical system can be quite interesting in its own right (and it's not like the "maintenance" doesn't involve writing some new stuff on occasion. Nothing is ever truly "done").
Taking part in a big new project to switch to something else can certainly be interesting, if it's a new technology, and even if it's really no better than the previous technology. If nothing else, it's really good for your resume.
Just to expand on this: if I were to take part in a project to move a critical service from Postgres to MongoDB (which I believe to be a worse solution), this would still be a better learning experience for me than just maintaining an existing installation. I'd have to learn about both DBs and write a lot of new code to work with the new DB. That wouldn't happen with a Postgres system that's already set up and working fine. And working on a maintenance project wouldn't make my resume look like someone who can come in, get to work on a big project, and see it through.
>At some point, the constant rewriting will have to stop.
At that point, the developers and managers who pushed for the big new project will have moved on to another company. Remember, moving companies every few years will generally yield a much higher salary than staying at the same company for your entire career.
>In contrast, maintenance work on a well-engineered critical system can be quite interesting in its own right
Sure, but you're not going to get a great salary doing that.
Media companies do not have the best technical leadership or talent, leading to high turnover with junior devs who build with the latest hype to build resumes.
Downvoters, would you explain? I didn't think this was a controversial opinion: Mongo is now under a license that is not listed as Open Source, as defined at https://opensource.org/licenses/alphabetical
If you care about using FOSS - as many companies do - MongoDB is no longer open for consideration.
For an enterprise scenario where the costs of remaining outweigh the costs of migrating, this seems to me the clearest reason for doing so, and also likely why they skimped on their rationale... in all likelihood they appeared to be a nice juicy whale but turned out to be some form of shark.
Because there is no company behind PG, no one is losing money if a FANG company gets your app for free and builds something around it. You should ask Redis, Grafana, Nginx etc. if they're happy about super-large companies doing that kind of thing.
I'm glad that RedHat didn't feel that way when they were sponsoring kernel development, that Netscape didn't panic and lock down Netscape instead of opening it up, and that Sun didn't give Java a terrible license. Or Google and Kubernetes. LinkedIn and Kafka. Airbnb and Airflow. I could keep going like this all day.
Mongo made a lot of money building a product that runs on top of many other projects that were released as Free Software. Now they're upset that other people are building on top of Mongo in the same way.
So Redis, HAProxy, MongoDB, etc. should feel good that nowadays multi-billion-dollar companies take their product, put more people on it than they do internally, sell it, and give them nothing?
Explain to me how a startup of 10-20 people can compete against AWS once AWS grabs what they're working on and turns it into an AWS service.
The changes Mongo and Redis made to their licences were made to protect against those practices.
1. Redis — according to its creator — has always been, is, and will continue to be FOSS. Redis Labs did not create Redis, nor can they re-license it.
2. HAproxy is still GPLv2.
3. Redis Labs's CCL and MongoDB's SSPL are not open source licenses, but their purveyors sure do like to give off the impression that they are open source. If you want to keep your code proprietary, keep it proprietary. Don't pretend to be open source. If you are not okay with others using your work, even making money off of it, as long as they adhere to the rules of the open source license you used to license your work to them, then don't license your work to them under open source licenses - and don't cry foul when they use it under the terms of the license.
It's great that there is no single company behind PostgreSQL in this sense. The model where many companies around PG do their own niche things based on it, with their own business models, cooperating and collaborating to make PG better because it is the base of their business, is rather fruitful and stable in the long term.
>“But postgres isn’t a document store!” I hear you cry. Well, no, it isn’t, but it does have a JSONB column type, with support for indexes on fields within the JSON blob.
> approximately 2.3m content items.
I had a previous project where we did a similar thing (except with HSTORE instead of JSONB) and it exploded rather dramatically (very simple queries took multiple minutes or timed out entirely) after around 30m rows. I hope the Grauniad doesn't run into a similar issue, or at least anticipates it better than we did.
2.3m content items is tiny. So is 30m. You're at least 1-2 orders of magnitude away from something that will start to bother Postgres. Anything before that is likely to be an index or IOPS issue.
On that note, one thing that the article points out is that managing the MongoDB setup was a full time job.....although managing Postgres will probably be as well. It would be easy and understandable to try to make it static and not needing attention, but that is a recipe for bad times.
Having issues accessing Xmillion rows seems like something that would be caught by someone focused on performance of the database full-time.
RDS is really good at not having to be managed tbh, at the guardian's scale. And when shit does happen they get to be able to throw money at AWS's premium support. I definitely think their move makes a ton of sense there.
Aye even so. They're definitely heavier to store, and their indexes can get quite big, though, so disk space / IOPS can be more of an issue. But still 2.3M is a drop in the ocean. (I don't know about HSTORE though…)
At my previous startup I was ingesting Hearthstone games at a rate of 1-2M / day. Before being handed off to permanent storage (s3, redshift etc) a bunch of data would get stored in a JSONB, with 14-day retention. This all ran on a 200GB t2.large instance on RDS, was our smallest instance and never really caused an issue.
1) Why use Scala to write (a relatively simple) internal CMS?
2) Why use a clustered database for 2 million records?
3) Why write your own proxy? (in Akka, none the less)
4) Why would you migrate articles from Mongo to Postgres using a script that runs overnight in screen?
The Guardian is, prima facie, a Wordpress blog. A simpler architecture would be:
1) Any CRUD web framework to build the CMS for reporters to draft their articles (Django, Rails, etc). Any basic RDBMS with read replication will do. Or, ditch the webapp entirely and just make a simple Markdown editor that commits to a git repo, a la Prose or Netlify.
2) When a reporter "publishes" an article, generate HTML for it and push to the CDN network. (I can't easily tell by looking at their HTTP headers, but I assume they're doing this already)
Okay, I'm being a little tongue-in-cheek. It's probably not that simple. But, one has to wonder, when you're serving up 100 million static HTML pages a day, if it really has to be this complicated.
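For what it's worth, the publish step in point 2 can be a handful of lines. A toy sketch (the article model, paths, and the idea of writing straight to a CDN origin directory are all mine, not the Guardian's; a real setup would more likely push to an S3 origin bucket and do proper templating):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

object PublishSketch {
  // Hypothetical article model; a real CMS carries far more metadata than this.
  case class Article(id: String, headline: String, bodyHtml: String)

  // Render a full page. A real system would use a proper template engine here.
  def render(a: Article): String =
    s"""<!doctype html>
       |<html><head><title>${a.headline}</title></head>
       |<body><h1>${a.headline}</h1>${a.bodyHtml}</body></html>""".stripMargin

  // "Publish" = write the rendered page into the directory the CDN origin serves.
  def publish(a: Article, originDir: String): Unit = {
    val target = Paths.get(originDir, s"${a.id}.html")
    Files.createDirectories(target.getParent)
    Files.write(target, render(a).getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
  }

  def main(args: Array[String]): Unit =
    publish(Article("2018/nov/30/example", "Example headline", "<p>Body</p>"),
            originDir = "/var/www/origin")
}
```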
I am not sure why you think that Rails or Django or any other bloated web framework is necessary. The most trivial way to implement a high-traffic website is a simple static content generator (like Jekyll) plus a CDN. There is an incredible amount of CPU wasted on rendering exactly the same content for every request. The content of the articles never (or very rarely) changes, so you can put it in a CDN. CRUD, DMS, RDBMS are all the wrong tools here.
A bloated web framework makes your code simpler because there are many things that you don't need to reimplement by yourself.
The problem is when developers use a framework without understanding it well and:
1) Reimplement in the code features that are already present in the framework.
2) Fight against the framework because their business needs conflict with the conventions chosen by the framework.
3) Fight against the framework because they don't agree with the way the framework solved a particular problem and want to solve it their own way.
In all of these cases, the origin of the issues is not the framework itself.
> The most trivial way to implement a website for high traffic is simple static content generator
Agreed. But in any case, those kinds of sites are not hard to cache either.
You can put a CDN in front of anything, including Wordpress or any framework. No need for a static site, and the homepages and section pages are rather dynamic now with user logins, story feeds, and personalization.
Exactly. I have moved companies from random web framework + random database to static site generators + CDN with a high rate of success too. There is no point in using Rails/Django-style stuff unless you have an extremely good case for it, which the Guardian use case certainly is not.
Genuine question - because I'm currently doing CPR on an old Rails app with Mongo - what do you see as a "good use case" for Rails/Django and similar frameworks?
We used to use them internally for building a system management application that required talking to databases, with an API that nodes could pull information from. This pre-dates system management tools like Chef and Ansible; it was just a tool like that. There are also workflow management tools (for managing Hadoop jobs, for example) that could be written with these. Generally, things where you need to deal with a lot of state (and state changes) from the outside world. I am pretty sure there are other use cases.
I spent 4 years at a large financial news company, where we benefited from many ex graunaids who decided to migrate over the river.
They helped us create the new front end to the website, to much acclaim. However, it was hard for a number of reasons (this is about the financial news company, not the Gruan):
1) The journalists hated change, especially as they couldn't see any benefit. They just want to keep their same interface exactly as it is, bugs and all. They also had an active union.
2) There is 20 years of "micro services" moving data from the CMS, through various things to allow stuff like translations, syndications (very important source of money) data extraction, meta data processing, physical page layout, and many many more. Most of which is done by a legacy ETL framework pushing to and from a solaris FTP server that is old enough to join the army.
3) there is more than one way to enter data into the CMS.
4) The type of article, and the data in said article changed depending on where it came from, and what services nobbled it.
5) looking after the journalist's interface, curating the data, sorting the articles and adding meta data, looking after paying subscribers, and finally the front end, were all different departments that refused to talk to each other.
This meant that, unlike in a rational place, there was no single source of truth for the CMS. It wasn't like you could call up article 342923 and display it; there was no guarantee that it would have all of the required metadata (like whether we were allowed to publish it). Add to that the inter-department rivalry, which meant that for some reason the membership department was allowed to spend 4 years re-writing the same bit of functionality over and over again. (User management and payment gateways are a solved problem, but alas it took the best part of 25 million quid to find that out.)
To answer your questions:
1) because it scales maaaaan, looks good on my CV, I don't want to spend time doing boring work, I want to learn a new tool
2) see 1
3) see 1
4) Because I suspect that they've never seen a working ETL system
To answer your bonus questions:
1) Journalists have unions, changing the editor requires a _boat_ load of training, and it is almost never worth it. Buy over build every time. But yes, it's just text. However, it's the metadata that makes it: who's in the article, what's the subject, is it a lifestyle piece, does it have photos, who owns the copyright for the photos, is the article syndicatable, can we syndicate this particular article, who edited it. Etc, etc, etc. The text entry is the easy bit; it's the parts that make it a real newspaper that are hard.
2) Nope, it's almost certainly never done like that. The article will be given a UUID and dumped into the CMS DB. The front-page generator system will then dynamically pull out articles based on parameters given first by the editors (front-page image, leading headline, etc.); then the related articles might be curated by hand, by keyword/metadata, or by user preference.
Then the advertising and tracking bits have to be injected, which account for 50-70% of the effort.
What is most surprising (well, not really) is the apparent immaturity of Scala tooling (Ammonite) and of Akka-based projects like Akka HTTP and Akka Streams. If you can run into very cryptic issues when using these projects, it makes you reconsider the actual value you get from using a functional, type-safe and on-paper concurrent language.
Their proxy is one example where Rust would give the same capabilities as Scala while being more robust, in the future if that's not already the case today.
If anyone is considering doing something similar today, I'd suggest to take a look at the recent FoundationDB Document layer which speaks MongoDB v3 (https://foundationdb.github.io/fdb-document-layer/).
I haven't tried it yet, and there are some limitations at the moment, but I find it extremely promising.
I agree with the boring-database point; it's most often the right choice.
Elaborating on my comment, though: it would make sense to consider a drop-in replacement for a technology you've already invested in, but been bitten by enough times to think about moving to a different model (operational burden, data loss, etc.).
I'm not sure if "baroque" is a good qualifier for FDB.
It's recent in the open source community, but it has been running in production at many companies (including Apple) for some years. It's based on sound architecture, compared to many others.
This is one of the reasons why building on top of the old API wasn’t an option. There was very little separation of concern in the original API and MongoDB specifics could be found even at the controller level. As a result the task of adding another database type in the existing API was too risky.
This seems like the problem was more related to the API layer code quality rather than where the data was stored.
I worked on a large site that used MongoDB. We had hundreds of thousands of rows in one specific collection that was used as a log. We couldn't even query it without the whole database crashing.
We moved that specific collection to MySQL, no problem and there was virtually no change in the data structure. Both were of course indexed.
Standard disclaimer in database discussions involving NoSQL:
SQL is not RDBMS is not ACID is not NoSQL.
When comparing NoSQL to something you should really compare it to RDBMS which is what most people conflate with "SQL". This conceptual understanding gap is IMO what led to the huge growth and misuse of "NoSQL" for a decade.
“This is one of the reasons why building on top of the old API wasn’t an option. There was very little separation of concern in the original API and MongoDB specifics could be found even at the controller level. As a result the task of adding another database type in the existing API was too risky.“
This bit sounds like a cautionary tale about the importance of a layered and well-structured API. There's no reason why details about your data store should leak into an API controller.
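For concreteness, the separation being described is roughly this, sketched in Scala since that's their stack. The trait, class names, and table layout here are mine (not the Guardian's code):

```scala
import java.sql.DriverManager

// The controller only ever sees this interface.
trait ContentStore {
  def find(id: String): Option[String]      // returns the raw JSON document
  def save(id: String, json: String): Unit
}

// One implementation. A MongoContentStore could sit behind the same trait,
// which is what makes swapping databases a contained change.
class PostgresContentStore(url: String, user: String, pass: String) extends ContentStore {
  private val conn = DriverManager.getConnection(url, user, pass)

  override def find(id: String): Option[String] = {
    val ps = conn.prepareStatement("SELECT data FROM content WHERE id = ?")
    ps.setString(1, id)
    val rs = ps.executeQuery()
    if (rs.next()) Some(rs.getString("data")) else None
  }

  override def save(id: String, json: String): Unit = {
    val ps = conn.prepareStatement(
      """INSERT INTO content (id, data) VALUES (?, ?::jsonb)
        |ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data""".stripMargin)
    ps.setString(1, id)
    ps.setString(2, json)
    ps.executeUpdate()
  }
}

// A controller written against the trait has no idea which database it is using.
class ContentController(store: ContentStore) {
  def get(id: String): Option[String] = store.find(id)
}
```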
It seems like the use case of a news agency would be better served by storing metadata in an RDBMS and using something like HDFS or another file system for the actual articles.
>Since migrating to AWS we’ve had two significant outages due to database problems, each preventing publication on theguardian.com for at least an hour.
This happened to us numerous times before and after we migrated to their Atlas platform. We are a major streaming company with more than 120M registered users. Disclosure: we are moving away from MongoDB as well.
I find this especially troubling, just like abandoning Edge browser development. Steering away from Mongo for the sole reason that it's not in AWS sounds like bad advice in general. It also sounds like a false dichotomy: it implies there are no other options, but the article didn't even mention MongoDB Atlas, for example.
It would be interesting to share the difference in ops cost (yearly support contract aside). For example, do you need a beefier server now? More servers? Fewer? What about size on disk and IOPS? Also, any difference in your metrics from an end-user perspective, e.g. time to publish an article?
Why would you delete the MongoDB instances so soon after the migration? What if things look great for a few hours/days, but then go horribly wrong? I suppose they still had Mongo backups too? I'm just unsure why you wouldn't leave those around for a while.
- They'd been running the two systems in parallel and comparing their outputs for months. By the time they pulled the plug, they were confident that PostgreSQL was working fine.
- They had written systems to copy data from the MongoDB-backed service to the one running on PostgreSQL. Once they started treating PostgreSQL as the official store of record, they'd be faced with either writing another system to mirror PostgreSQL back to MongoDB or committing to lots of double entry. Either of those sounds painful.
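A toy version of that comparison idea, just to show the shape of it. The two fetchers here are in-memory stand-ins of my own; real ones would call the Mongo driver and JDBC respectively:

```scala
object MigrationCheck {
  // In-memory stand-ins: in a real check these lookups would go to the Mongo
  // replica set and the Postgres instance and return the raw JSON document.
  private val mongoStore = Map(
    "article-1" -> """{"headline":"A"}""")
  private val postgresStore = Map(
    "article-1" -> """{"headline":"A"}""",
    "article-2" -> """{"headline":"B"}""")

  def fetchFromMongo(id: String): Option[String]    = mongoStore.get(id)
  def fetchFromPostgres(id: String): Option[String] = postgresStore.get(id)

  // Report ids where the two stores disagree: different content, or present on one side only.
  def mismatches(ids: Seq[String]): Seq[String] =
    ids.filter { id =>
      (fetchFromMongo(id), fetchFromPostgres(id)) match {
        case (Some(a), Some(b)) => a.trim != b.trim // naive; a real check would normalise the JSON first
        case (None, None)       => false
        case _                  => true
      }
    }

  def main(args: Array[String]): Unit =
    mismatches(Seq("article-1", "article-2")).foreach(id => println(s"MISMATCH: $id"))
}
```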
I'm not sure why a database is required at all. Surely the main advantage is to support tables and joins. If you just want a key-value store surely S3 or similar would be simplest. Do they search the data? Surely Cassandra is designed for this. Any ideas?
Cassandra isn't designed for this. Been there, done that.
postgres all the way.
The FT use(d) Cassandra to store their membership details (6 million rows of largely static, read-only, well-structured data). The cluster was massive (12+ nodes in at least two regions, from memory), slow, and impossible to upgrade reliably.
The support from Datastax is shite. Backups are not reliable, imports even less so, and you are beta testing the whole system every time you do a point release.
CMSs have highly structured data. Swallow your pride, map the data and build a proper schema. It's really not that hard.
Yes, Cassandra has a graph layer; no, it's really not worth it. Yes, it has Gremlin; no, you shouldn't need it if you've modelled your data correctly.
Cassandra doesn't have graph support, it's just a sorted, distributed, nested key/value system.
Datastax builds a graph layer on top, or you can use something like JanusGraph, but it's never as good as using a real graph database with a natively designed storage system.
What were the issues? You're the first person I've heard complain about Cassandra for this. Agreed that any database can handle that type of data at that volume.
It's a 50/50 split between terrible design and horrible support.
I've seen Cassandra shine when it comes to write-optimised loads. Pipe a bucketload of data into Gremlin and magic happens.
But that's a specific workload, which is pretty rare, and certainly not suited to a 999:1 read-to-write ratio. That's not Cassandra's fault; that's the fault of the idiot who chose it, and of the boatload of idiots who carried on and added loads of systems that make it much harder to migrate away.
Then we come to support. Datastax is the de facto support provider. They make a lifecycle manager and a backup/restore tool, and push a load of patches into the main codebase. But it is shit:
o Backups fail silently
o The only way to make alerts work (i.e. do an action, rather than create a popup when you log into the ops center) requires work to navigate the API
o It's full of CVEs, which are script-kiddy-able
o Migrating data from backups to new clusters was impossible without a boatload of manual work, and failed 50% of the time
o Restoring from automated backups was impossible until August
o It couldn't use instance profiles on AWS until August
o Upgrading to a point release silently breaks backups, _always_
Basically I spent the first half of this year QA'ing very expensive software. There are some very, very good support people, but there were some terrible ones as well.
Is this all about their enterprise tools? I can't say I've used any of these.
For backups, we found a tool on GitHub to snapshot to S3. It worked fine as far as I know. It was the guys in the office next to me who were handling this, not me, and I never heard of any major issue.
Cassandra is more of a DB than Mongo is. DynamoDB is basically the same, and they wanted a hosted version, so their approach is reasonable.
They definitely want more than a key value store. They at least want date and content indexing, editor, author, etc. I imagine they probably have a few internal requirements as well wrt analytics, tooling and more. S3 doesn't work for all this and another user's suggestion that this is all treatable like static content sounds way off to me. The best you can do is pregenerate some of the HTML.
Great article. Such case studies are way more significant contributions to software engineering as a field than the big pile of whitelabel/fanboy articles we see around highlighting advantages of some fancy stuff.
Very well written article on data migration, going into the process and the details. This will help anyone who is also considering such a move. I guess we will see a shift in direction towards Postgres in 2019.
I think MongoDB has a place, but that place isn't being the database for a big distributed content store for a web site or something like that. Where I do think it has a place is for some of the in-memory data structures of a node running an application or a service. Applications tend to have a lot of data in various structures that it works with while it's running. Sometimes you need this to be persistent and sometimes you need to be able to scale this across a node. MongoDB is fine for this, so I think of it more like a SQLite replacement.
Hi! I'm one of the authors. Good point, we should add something on that. Briefly, our MongoDB setup was running on three r4.xlarge instances, which comes to around $5000/year with storage, plus the cost of OpsManager, which I think was around £40,000 including the support contract.
In Amazon it's harder to compare directly, as the support contract is paid across all of our accounts, but we're spending around $13,000 for a highly available db.r4.xlarge Postgres instance.
Performance-wise, querying without an index is SLOW, basically not worth it, as we end up doing a scan of the entire database. Fortunately we don't need to do this, as we can usually rely on the Guardian Content API for proper searching. The average API (a Scalatra app) response time reaches 150ms at 'peak time'. This isn't a high-performance use case: around 1000 requests/minute at 'peak' time.
My company does bespoke software development for large enterprises. If they're on AWS, having all your services within a VPC you control is table stakes for a lot of large-company engineering and security teams.
At this point I'm happy enough with JSONB that, even as someone who wants to try new things, I'm finding it hard to justify trying out MongoDB.
My SAs seem cautiously willing to try it, but I won't put them through the work of learning how to manage it unless I can find a clear advantage over Postgres+JSONB.
Not sure, but there was Goodbye MySQL, Hello Postgres and Goodbye Postgres, Hello MySQL (same company did it actually) for sure. It is kind of weird how obsessed engineers are with tools, though.
> It is kind of weird how obsessed engineers are with tools, though.
Would you say the same thing about programming languages?
I was actually listening to the Full Stack Radio podcast, and the latest episode talked about the power of moving more into the database. What struck me as a strong idea was that we can't treat databases as equivalent - different database servers have different strengths (for example, some have better JSON support or time evaluation functions).
I think it's particularly telling that you seem to be thinking of the database as a "tool", but if you're like most, you fight religious battles over your language of choice. In my experience, the database IS the application (and in many cases, the business) and is a far more important choice than the implementation language, the cloud provider, etc.
Look, the way I see these things is that we have tools. Do you think that people working the fields would fight over which tractor brand is better? Probably not. Do you think that builders would fight over which hammer brand is better? I don't think so. These fights over minor differences happen mostly in the software industry. Freud wrote a pretty good thesis about it.
Moving more into databases is not something I understand. I've worked the last 10 years as a data engineer, but I much prefer moving logic out of databases to moving it into them. :)
> Do you think that people working the fields would fight over which tractor brand is better?
OMG yes. Having worked in the Midwest, I promise you that farmers absolutely do trash talk each other about John Deere vs New Holland vs International Harvester. Tell a Ford driver that you like his Chevy pickup and prepare to hear about it for the next hour.
> Do you think that builders would fight over which hammer brand is better?
You seriously haven't spent much time around blue collar workers, have you.
Marketing / Evangelism / Outreach. I'm not kidding, and I don't predominantly mean that in a bad way. I think it's something that the postgres community / companies really aren't good at.
On a serious note, there are occasionally situations where I have to take in a lot of data that I know nothing about. More often than not this data is either in JSON or readily convertible to JSON, so loading it into a MongoDB database and poking around in it is a reasonable preliminary step to whatever the more permanent solution should be.
The only database we ever lost data with is Mongo. We had support at the time, but even they could not recover it. We migrated to Postgres the next month and never looked back.
Bullshit. It's shared-everything. If you have a 3-node cluster for redundancy, the 3 nodes keep exactly the same data, with only one node accepting reads and writes. It's literally the opposite of linear scalability.
In a sharded setup, every shard must be a 3-node cluster for redundancy (a replica set), and the 3 nodes keep exactly the same data, with only one node accepting reads and writes.
You have to add capacity (shards) 3 nodes at a time, two-thirds of which sit unused. It's not scalable at all.
It's scalable in the sense that if your total data doesn't fit on a single machine, you can distribute it between shards in Mongo, but you can't in pgsql.
A replica set is for redundancy and availability, but you can use the replicas for reads and so scale your read traffic. pgsql works exactly the same way: you have one master which accepts writes, and read-only slaves/replicas.
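For concreteness, the application-side routing looks the same in both worlds: writes go to the primary endpoint, reads can go to a replica endpoint. A minimal sketch with hypothetical endpoint names (a real app would use a connection pool, and note that replicas can lag slightly behind the primary):

```scala
import java.sql.{Connection, DriverManager}

object ReadWriteSplit {
  // Hypothetical endpoints: one primary that accepts writes, one read-only replica.
  private def primary(): Connection =
    DriverManager.getConnection("jdbc:postgresql://db-primary:5432/cms", "app", "secret")
  private def replica(): Connection =
    DriverManager.getConnection("jdbc:postgresql://db-replica-1:5432/cms", "app", "secret")

  // Writes always go to the primary.
  def save(id: String, json: String): Unit = {
    val ps = primary().prepareStatement(
      "INSERT INTO content (id, data) VALUES (?, ?::jsonb) " +
        "ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data")
    ps.setString(1, id)
    ps.setString(2, json)
    ps.executeUpdate()
  }

  // Reads can fan out to replicas to scale read traffic.
  def find(id: String): Option[String] = {
    val ps = replica().prepareStatement("SELECT data FROM content WHERE id = ?")
    ps.setString(1, id)
    val rs = ps.executeQuery()
    if (rs.next()) Some(rs.getString("data")) else None
  }
}
```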
The article is very well written. Their team leverages great tooling, such as Gor. I love Gor. I used it plus the Mongo oplog when migrating from AWS to DigitalOcean, and eventually had to migrate back to AWS because DigitalOcean couldn't deliver performance as good as AWS's and they throttle CPU (lots of CPU steal).
OpsManager is great for teams that don't have dedicated devops, I think, and it has great dashboards/visualization.
However, running your own MongoDB is very easy, unlike Postgres (unless you use RDS). Even with RDS, you still have downtime when upgrading the db: there is a short window while the Multi-AZ standby is promoted to master and DNS is updated, during which your current primary is not writeable or even unavailable. MongoDB is way easier to operate: you add/remove nodes on the fly and clients auto-discover the network topology. Plus, this link gives great info on tuning it: https://docs.mongodb.com/manual/administration/production-no... (things like running on XFS, mounting with the noatime option, putting the oplog on a high-IOPS volume, etc.)
Performance isn't much of a factor in picking your database nowadays; they all look great in benchmarks. However, try them on your own workload before picking one. Postgres does have its own warts.
Pick a database based on how confident your team is with it and on how well it helps you move fast and deliver business value. Don't follow the hype or silly benchmarks.
Great read. Well done.