
Specific technology choices aside, this was an incredible write-up of their migration process: thorough, organized, readable prose about a technical topic. It is helpful to read how other teams handle these types of processes in real production systems. Perhaps most refreshing is the description of choices made for the various infrastructure pieces, because it is reasonable and real-world. Blog posts so often describe building a system from scratch where all latest-and-greatest software can be used at each layer. However, here is a more realistic mix of, on the one hand, swapping out DBs for an entirely new (and better) one, but on the other hand finding new tools within their existing primary language to extend the API and proxy.

Great read. Well done.



This was what was particularly interesting to me - that they went to the effort of writing a purely technical article on the particulars of how parts of their environment operate, and publishing it on their platform even though that's not the sort of content they're known for.


It's not their main platform though:

> Digital Blog

> A blog by the Guardian's internal Digital team. We build the Guardian website, mobile apps, Editorial tools, revenue products, support our infrastructure and manage all things tech around the Guardian


Hi! Thanks for your comments. I'm one of the authors of this post. It is the same platform at the moment (just not tagged with editorial tags so it stays away from the fronts), though sometimes the team that approves non-editorial posts to the site can be concerned about us writing about outages and things, as it might carry a 'reputational risk'. We may end up migrating to a different platform in the future so we can publish more quickly - we'll see!


In our era of deplatforming, a publisher publishing on something like Medium seems antithetical, even for a dev team that just wants to get words out. Should you spend some cycles on the dev blog? Probably, but you should also split-test the donation copy, get the data warehouse moved forward for 2019 initiatives, and fix 1200 other issues. Thanks for sharing a great post. I shared it with my team and we all learned a lot.


Thanks to Hacker News and Reddit this piece got over 100,000 page views, which should be enough to justify the blog staying on platform!


Does SecureDrop run on your AWS infrastructure?


No, I think it's on our own infrastructure.


Thanks for the info, here and below.


It’s probably better if they didn’t answer this question...


It's a bit disturbing to me that they seem to be using AWS for confidential editorial work.

> Due to editorial requirements, we needed to run the database cluster and OpsManager on our own infrastructure in AWS rather than using Mongo’s managed database offering.


In a happy world the Guardian wouldn't rely on a company we spend a lot of time reporting on for unethical practices (tax avoidance, worker exploitation etc.) - but we decided it was the only way to compete. One of the big drivers was a denial-of-service attack on our datacentre on Boxing Day 2014 - not an experience any of us want to have to deal with again.


>Since all our other services are running in AWS, the obvious choice was DynamoDB – Amazon’s NoSQL database offering. Unfortunately at the time Dynamo didn’t support encryption at rest. After waiting around nine months for this feature to be added, we ended up giving up and looking for something else, ultimately choosing to use Postgres on AWS RDS.


Anyone who gets control of the live server can still read a database, even if it encrypts its storage.


Exactly. As I read the original article, which mentions "encryption-at-rest", there was a voice in my head crying: "No, what they need is E2EE". That would enable the authors to write confidential drafts of the articles, no matter where the data is stored (and AWS would be perfectly fine of course).

Disclaimer: The voice in my head does not come out of nowhere. I am building a product which addresses this: https://github.com/wallix/datapeps-sdk-js is an API/SDK solution for E2EE. A sample app integration is available at: https://github.com/wallix/notes (you can switch between the master and datapeps branches to see the changes for the E2EE integration)


In which case they could've just used a separate encryption layer with any database, including DynamoDB. The HSM security keys available from all the clouds make this rather simple.
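For illustration, the kind of application-layer encryption I mean (a rough sketch, not the Guardian's setup): wrap each record with a data key from KMS before it ever reaches DynamoDB, so the table only ever holds ciphertext. The key alias and table name below are made up.

    # Sketch: envelope encryption before writing to DynamoDB, so the table only
    # ever holds ciphertext. Assumes boto3 credentials are configured; the KMS
    # alias and table name are illustrative, not from the article.
    import os
    import boto3
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    kms = boto3.client("kms")
    dynamodb = boto3.client("dynamodb")

    def put_encrypted_draft(draft_id: str, body: str) -> None:
        # Fresh data key from KMS: plaintext for local use, wrapped copy to store.
        key = kms.generate_data_key(KeyId="alias/cms-drafts", KeySpec="AES_256")
        nonce = os.urandom(12)
        ciphertext = AESGCM(key["Plaintext"]).encrypt(nonce, body.encode(), None)
        dynamodb.put_item(
            TableName="drafts",
            Item={
                "id": {"S": draft_id},
                "nonce": {"B": nonce},
                "body": {"B": ciphertext},
                "wrapped_key": {"B": key["CiphertextBlob"]},
            },
        )

    def get_decrypted_draft(draft_id: str) -> str:
        item = dynamodb.get_item(TableName="drafts", Key={"id": {"S": draft_id}})["Item"]
        key = kms.decrypt(CiphertextBlob=item["wrapped_key"]["B"])["Plaintext"]
        return AESGCM(key).decrypt(item["nonce"]["B"], item["body"]["B"], None).decode()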


Yes, any db including Dynamo would have been fine.

Our software E2EE solution has advantages over an HSM though: cost, obviously, plus more features and extensibility.


Great idea.


Encryption at rest is still important as it closes off a few attack/loss vectors: mis-disposed hard drives, re-allocated hosts. I'm probably missing a few others.
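For what it's worth, storage-level encryption is mostly just a flag these days; a rough boto3 sketch (every identifier/value here is a placeholder):

    # Sketch: asking RDS for encrypted storage at creation time with boto3.
    import boto3

    rds = boto3.client("rds")
    rds.create_db_instance(
        DBInstanceIdentifier="cms-postgres",
        Engine="postgres",
        DBInstanceClass="db.m5.large",
        AllocatedStorage=100,
        MasterUsername="cms_admin",
        MasterUserPassword="change-me",  # use a secrets manager in practice
        StorageEncrypted=True,           # volumes and snapshots are encrypted at rest
    )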


Yeah, but it doesn't really address my concern.


Anyone who can control a server in any environment can potentially interact with the database powering applications running on that server.

How is running on AWS different than Guardian Cloud in their basement?


The level of control over who has physical access, of course.

Did reports of the Snowden revelations reside on the CMS?


Sadly we don't trust our security practices anywhere near enough for that! Secret investigations happen in an air-gapped room on computers with their network cards removed, then get moved across to the main CMS when they're ready to publish.


Probably not, no, until they were about to be published. I imagine that the choice between "run an entire data centre ourselves, store everything there" and "use AWS, but keep high sensitivity stories on local machines" is an easy one.

After all, the client computer that connects to the CMS is just as, or more likely to be compromised. I wouldn't be surprised if the coverage (or at least parts of it) were edited on airgapped laptops.


> the choice between "run an entire data centre ourselves, store everything there"

If those were the only two choices, you might be right. But the resources needed for the actual CMS functionality sound modest enough to run independently of the main website.

> the client computer that connects to the CMS is just as, or more likely to be compromised

That's faulty reasoning.


> That's faulty reasoning.

Why? It's an obvious potential point of compromise.


Sorry, I misunderstood. I read it as saying "We're going to get hacked via this other vector, anyway, so why bother?" I see your point, now.


They're using AWS VPC (Virtual Private Cloud), which isn't open to the world (you use a VPN to bridge the VPC into your internal network) and in which you can spin up dedicated instances that don't share underlying hardware with other AWS customers.
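Roughly, the dedicated-tenancy part looks like this with boto3 (all the IDs below are placeholders, not the Guardian's):

    # Sketch: an instance on dedicated hardware inside a private VPC subnet.
    import boto3

    ec2 = boto3.client("ec2")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-0123456789abcdef0",      # private subnet, reached over the VPN
        SecurityGroupIds=["sg-0123456789abcdef0"],
        Placement={"Tenancy": "dedicated"},       # no shared hardware with other tenants
    )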


Thanks for writing the blog post. Insights like these and of such high quality are rare.

Can I ask what was the total cost of the migration?

If there was software that could do this database migration without downtime, how much would you/Guardian be willing to pay?


This is pretty much how all Guardian articles are formatted. Some of their regular pieces could be called "blog posts" - Felicity Cloake's cooking series comes to mind.

Guess it makes sense to reuse the platform that already has the templates rather than use another platform and reimplement the design.


Given their employers, one would hope that they get good editorial support though!


Ish. From what I can tell, the “Digital Blog” seems to be set up as just another column on the platform.


Totally agreed. This is pretty much the definitive guide on how to perform a high stakes migration where downtime is absolutely unacceptable. It's extremely tempting, particularly for startups, to simply have a big-bang migration where an old system gets replaced by something else in one shot. I have never, ever seen that approach work out well. The Guardian approach is certainly conservative but it's hard to read that article and conclude anything other than that they did the right thing at every step along the way.

Well done and congratulations to everyone on the team.


I agree, but it took them a year if I am reading the article right.

In most early stage startups, that would be an unacceptable loss of time.

So I don't judge them for doing a one-shot migration even if it causes an hour of downtime.

It all depends on the business.


Yeah it did take a long time! Part of this though was due to people moving on/off the project a fair bit as other more pressing business needs took priority. We sort of justified the cost due to the expected cost savings from not paying for OpsManager/Mongo support (as in the RDS world support became 'free' as we were already paying for AWS support) - which took the pressure off a bit.

Another team at the Guardian did a similar migration but went for a 'bit by bit' approach - so migrating a few bits of the API at a time - which worked out faster, in part because stuff was tested in production more quickly. Our approach with the proxy, whilst imitating production traffic, didn't actually serve Postgres data to the users until 'the big switch' - so not really a continuous delivery migration!
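To make the difference concrete, the shadow-traffic idea is roughly this (a hypothetical sketch, not our real code): serve the Mongo response to users, replay the same request against the Postgres-backed API in the background, and log any mismatches.

    # Hypothetical sketch of the shadow-traffic proxy idea.
    # fetch_from_mongo_api / fetch_from_postgres_api stand in for the two backends.
    import logging
    from concurrent.futures import ThreadPoolExecutor

    log = logging.getLogger("migration-proxy")
    pool = ThreadPoolExecutor(max_workers=8)

    def fetch_from_mongo_api(path: str) -> dict:
        raise NotImplementedError  # call the existing Mongo-backed API

    def fetch_from_postgres_api(path: str) -> dict:
        raise NotImplementedError  # call the new Postgres-backed API

    def handle_request(path: str) -> dict:
        primary = fetch_from_mongo_api(path)  # users keep getting Mongo data

        def shadow() -> None:
            try:
                candidate = fetch_from_postgres_api(path)
                if candidate != primary:
                    log.warning("mismatch for %s", path)
            except Exception:
                log.exception("shadow request failed for %s", path)

        pool.submit(shadow)  # compare in the background, never on the user's path
        return primary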


The article mentions several corner cases that weren’t well covered by testing and caused issues later. What sort of test tooling did you use, Scalacheck?
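For illustration, the sort of round-trip property ScalaCheck can generate cases for (sketched here with Hypothesis in Python as an analogue; the converters are hypothetical placeholders):

    # Illustration only: a round-trip property over generated documents.
    from hypothesis import given, strategies as st

    documents = st.dictionaries(
        keys=st.text(min_size=1),
        values=st.one_of(st.none(), st.booleans(), st.integers(), st.text()),
    )

    def to_postgres_row(doc: dict) -> dict:
        return {"jsonb": doc}        # placeholder conversion

    def from_postgres_row(row: dict) -> dict:
        return row["jsonb"]          # placeholder conversion

    @given(documents)
    def test_document_round_trips(doc):
        assert from_postgres_row(to_postgres_row(doc)) == doc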


Agreed! I don't think enough engineering orgs appreciate the value of a great narrative on any technical topic.

Part of my duties at work requires me to deal with "large" issues. While the solutions to them usually need to be quick and high quality, I've seen the analyses that come after them vary in quality.

Good writeups tend to stick around in people's memories and become company culture, and drive everyone to do better. Bad writeups are forgotten, and thus the lessons learned from them are forgotten as well.

This particular article stands out for me. English is not my first language, and I've spent most of my life dealing with very fundamental technical details, so most of my writeups aren't the best. I'm going to bookmark this one and come back to it to learn how to write accessible technical narratives.


Writing is their core business after all :) I agree though - I read it like a fascinating breaking news story.


The BBC's Online and R&D departments have very interesting blogs, if you like this sort of thing.

http://www.bbc.co.uk/blogs/internet

https://www.bbc.co.uk/rd



Has world class platform for delivering articles. Uses Medium.

I don't even.


I imagine this is a nice benefit of working at an organization that is primarily about writing.


I was actually a little confused by the article - it seems to go up and down in terms of technical depth. It feels like it was written by several people. The hyperlink to “a screen session” was odd as well... the Ammonite hyperlink I get, but screen is a pretty ancient tool... people either know it or can find out about it. Like, you link to screen but not the “ELK” stack?

I like the article but it was a bit hard for me to consume with multiple voices in different parts.


screen is hard to google.


This is the first thing I noticed too. This was an excellent read.


Old versions of Mongo were very bad.

We accrued lots of downtime due to Mongo.

But later versions were rock solid, and I've maintained Mongo installations at many startups and SMEs. Once you set up alerts for disk/memory usage, off you go. Works like a charm 99% of the time.
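The kind of alerting I mean is nothing fancy; a rough sketch (thresholds and the notify() hook are illustrative):

    # Rough sketch: poll serverStatus and disk usage, page before things fill up.
    import shutil
    from pymongo import MongoClient

    def check_mongo(uri="mongodb://localhost:27017", data_path="/var/lib/mongodb"):
        status = MongoClient(uri).admin.command("serverStatus")
        resident_mb = status["mem"]["resident"]          # resident memory, in MB
        disk = shutil.disk_usage(data_path)
        disk_used_pct = 100 * disk.used / disk.total

        if disk_used_pct > 80:
            notify(f"Mongo data disk at {disk_used_pct:.0f}%")
        if resident_mb > 12_000:                         # e.g. alert above ~12 GB resident
            notify(f"Mongo resident memory at {resident_mb} MB")

    def notify(message):
        print("ALERT:", message)                         # stand-in for pager/Slack/etc.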


MongoDB is proof that with the right strategy, marketing and luck, you really can fake it until you make it.

Not that that's really a surprise or was unknown, it's just fairly new to see in the open source ecosystem instead of the enterprise one.


I'd posit it's more a matter of maturing a new paradigm. There were a lot more edge cases to cover as NoSQL became more popular for production at scale.

SQL has decades of production maturation, and has wider domain knowledge.


I'm sure there's some of that. But a lot of the early problems were a bit more weighted towards poor engineering in general, IIRC. For example, I seem to recall an early problem was truncating large amounts of data on crash occasionally.


That's probably true.

Unfortunately the number of my customers who would sign off on just "two nines" is approximately zero...



