
Their architectural choices are puzzling to me:

1) Why use Scala to write (a relatively simple) internal CMS?

2) Why use a clustered database for 2 million records?

3) Why write your own proxy? (in Akka, no less)

4) Why would you migrate articles from Mongo to Postgres using a script that runs overnight in screen?

The Guardian is, prima facie, a WordPress blog. A simpler architecture would be:

1) Any CRUD web framework to build the CMS for reporters to draft their articles (Django, Rails, etc). Any basic RDBMS with read replication will do. Or, ditch the webapp entirely and just make a simple Markdown editor that commits to a git repo, a la Prose or Netlify.

2) When a reporter "publishes" an article, generate HTML for it and push it to the CDN, roughly as sketched below. (I can't easily tell by looking at their HTTP headers, but I assume they're doing this already.)

Okay, I'm being a little tongue-in-cheek. It's probably not that simple. But one has to wonder, when you're serving up 100 million static HTML pages a day, whether it really has to be this complicated.
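To make 2) concrete, here is a minimal sketch of the publish hook, assuming an S3 bucket as the CDN origin (the bucket name, the template, and the skipped invalidation step are all made up):

    # Sketch of "publish = render once, push to the CDN origin".
    # Bucket and template are hypothetical.
    import boto3

    PAGE = """<!doctype html>
    <html><head><title>{title}</title></head>
    <body><article><h1>{title}</h1>{body}</article></body></html>"""

    s3 = boto3.client("s3")

    def publish(article_id: str, title: str, body_html: str) -> None:
        html = PAGE.format(title=title, body=body_html)
        s3.put_object(
            Bucket="news-site-static",  # hypothetical origin bucket
            Key=f"articles/{article_id}.html",
            Body=html.encode("utf-8"),
            ContentType="text/html; charset=utf-8",
            # Long edge TTL; re-publishing overwrites the object, and a
            # CDN invalidation (not shown) would purge the stale copy.
            CacheControl="public, s-maxage=31536000",
        )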



I am not sure why you think that Rails or Django or any bloated web framework is necessary. The most trivial way to implement a website for high traffic is a simple static content generator (like Jekyll) plus a CDN. An incredible amount of CPU is wasted rendering the exact same content for every request. The content of an article never (or very rarely) changes, so you can put it in a CDN. CRUD, DBMS, RDBMS are all the wrong tools here.
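A toy version of that idea, with an invented directory layout; the point is that rendering happens once at build time, never per request:

    # Jekyll-style idea in miniature: render every Markdown article to a
    # static HTML file once, then let the CDN serve the files.
    from pathlib import Path

    import markdown  # pip install markdown

    PAGE = "<!doctype html><html><body><main>{content}</main></body></html>"

    def build(src_dir: str = "articles", out_dir: str = "site") -> None:
        out = Path(out_dir)
        out.mkdir(exist_ok=True)
        for md_file in Path(src_dir).glob("*.md"):
            html = markdown.markdown(md_file.read_text(encoding="utf-8"))
            (out / f"{md_file.stem}.html").write_text(
                PAGE.format(content=html), encoding="utf-8"
            )

    if __name__ == "__main__":
        build()  # one render at publish time, zero renders at request time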


> bloated web frameworks

A bloated web framework makes your code simpler because there are many things that you don't need to reimplement by yourself.

The problem is when developers use a framework without understanding it well and:

1) Reimplement in their code features that are already present in the framework.

2) Fight against the framework because their business needs conflict with the conventions chosen by the framework.

3) Fight against the framework because they don't agree with the way it solved a particular problem and want to solve it their own way.

In all three cases, the origin of the issue is not the framework itself.

> The most trivial way to implement a website for high traffic is a simple static content generator

Agreed. But in any case, those kinds of sites are not hard to cache either.


You can put a CDN in front of anything, including WordPress or any framework. No need for a static site, and the homepage and section pages are rather dynamic now, with user logins, story feeds, and personalization.
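The trick is just sending cache headers the CDN can act on. A minimal sketch with a Flask origin; the renderer and TTLs are invented:

    # A dynamic origin stays CDN-friendly if it sends the right headers.
    from flask import Flask, make_response

    app = Flask(__name__)

    def render_article(slug: str) -> str:
        # Stand-in for a real template render against the CMS database.
        return f"<h1>Article: {slug}</h1>"

    @app.route("/article/<slug>")
    def article(slug):
        resp = make_response(render_article(slug))
        # Shared caches (the CDN) may keep the page for five minutes and
        # serve a stale copy while revalidating, so only a trickle of
        # requests ever reaches the origin.
        resp.headers["Cache-Control"] = (
            "public, s-maxage=300, stale-while-revalidate=60"
        )
        return resp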


Your architectural choices are puzzling to me as well - Rails/Django? I've rescued more bad Django apps than I can count.


Exactly. I have moved companies from random web framework + random database to static site generators + CDN with a high rate of success too. No point in using Rails/Django-like stuff unless you have an extremely good case for it, which is certainly not the Guardian use case.


> No point in using Rails/Django-like stuff unless you have an extremely good case for it, which is certainly not the Guardian use case.

Django itself was literally developed to suit the use cases of a newspaper.


Genuine question - because I'm currently doing CPR on an old Rails app with Mongo - what do you see as a "good use case" for Rails/Django and similar frameworks?


We used to use them internally for building a system management application that required talking to databases and exposing an API that nodes could pull information from. This pre-dates system management tools like Chef and Ansible; it was just a tool like that. There are also workflow management tools (managing Hadoop jobs, for example) that could be written in these. Generally: things where you need to deal with a lot of state (and state changes) from the outside world. I am pretty sure there are other use cases.
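Roughly the shape of that nodes-pull-their-config pattern, as a sketch; Flask and sqlite3 stand in for whatever we actually used, and the schema is invented:

    # Nodes pull their configuration from a small database-backed API.
    import sqlite3

    from flask import Flask, abort, jsonify

    app = Flask(__name__)
    DB = "nodes.db"  # hypothetical schema: nodes(name, role, config)

    @app.route("/nodes/<name>/config")
    def node_config(name):
        conn = sqlite3.connect(DB)
        try:
            row = conn.execute(
                "SELECT role, config FROM nodes WHERE name = ?", (name,)
            ).fetchone()
        finally:
            conn.close()
        if row is None:
            abort(404)
        role, config = row
        return jsonify({"name": name, "role": role, "config": config})

Each node then hits its own endpoint at boot or from cron - more or less what Chef and friends later productized.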


Yes and no.

I spent 4 years at a large financial news company, where we benefited from many ex-Grauniads who decided to migrate over the river.

They helped us create the new front end to the website, to much acclaim. However, it was hard, for a number of reasons (these are from the financial news company, not the Grauniad):

1) The journalists hated change, especially as they couldn't see any benefit. They just wanted to keep their interface exactly as it was, bugs and all. They also had an active union.

2) There are 20 years of "micro services" moving data from the CMS through various things to allow stuff like translations, syndication (a very important source of money), data extraction, metadata processing, physical page layout, and many many more. Most of this is done by a legacy ETL framework pushing to and from a Solaris FTP server that is old enough to join the army.

3) There is more than one way to enter data into the CMS.

4) The type of article, and the data in said article, changed depending on where it came from and which services nobbled it.

5) Looking after the journalists' interface, curating the data, sorting the articles and adding metadata, looking after paying subscribers, and finally the front end were all handled by different departments that refused to talk to each other.

This meant that, unlike in a rational place, there was no source of truth for the CMS. It wasn't like you could call up article 342923 and display it; there was no guarantee that it would have all of the required metadata (like whether we were allowed to publish it). Add to that the inter-department rivalry, which meant that for some reason the membership department was allowed to spend 4 years re-writing the same bit of functionality over and over again. (User management and payment gateways are a solved problem, but alas it took the best part of 25 million quid to find that out.)

To answer your questions:

1) Because it scales, maaaaan; it looks good on my CV; I don't want to spend time doing boring work; I want to learn a new tool.

2) see 1

3) see 1

4) Because I suspect they've never seen a working ETL system.

To answer your bonus questions:

1) Journalists have unions; changing the editor requires a _boat_ load of training and is almost never worth it. Buy over build, every time. But yes, it's just text. However, it's the metadata that makes it: who's in the article, what's the subject, is it a lifestyle piece, does it have photos, who owns the copyright for the photos, is the article syndicatable, can we syndicate this particular one, who edited it. Etc, etc, etc. The text entry is the easy bit; it's the parts that make it a real newspaper that are hard.

2) Nope, almost certainly never done like that. The article will be given a UUID and dumped into the CMS DB. The front-page generator system will then dynamically pull out the articles based on parameters given first by the editors (front-page image, leading headline, etc.); then the related articles might be curated by hand, by keyword/metadata, or by the user's preferences. (A rough sketch follows below.)

Then the advertising and tracking bits have to be injected, which account for 50-70% of the effort.
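Sketched loosely, that curated-slots-plus-metadata-query shape looks something like this; every table and column name here is invented:

    # Editor-curated slots first, then slots filled by metadata queries.
    import sqlite3

    def front_page(conn: sqlite3.Connection) -> list:
        # Hand-curated placements (leading headline, front-page image, ...).
        curated = conn.execute(
            "SELECT uuid, headline FROM articles "
            "WHERE slot IS NOT NULL ORDER BY slot"
        ).fetchall()
        # Fill the rest by metadata: recent, cleared for publication,
        # and not already placed by an editor.
        rest = conn.execute(
            "SELECT uuid, headline FROM articles "
            "WHERE slot IS NULL AND can_publish = 1 "
            "ORDER BY published_at DESC LIMIT 20"
        ).fetchall()
        return curated + rest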

CDNs now allow a lot of logic to be pushed to the edge (see https://labs.ft.com/2014/10/caching-user-agent-specific-resp...), which means that it's not overly taxing to host a very large website.
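The rough idea from that link: collapse the User-Agent into a handful of device classes so the cache keeps a few variants per page rather than one per UA string. In reality the classification runs at the CDN edge; this origin-side sketch (including the header name) is just illustrative:

    from flask import Flask, make_response, request

    app = Flask(__name__)

    def device_class(ua: str) -> str:
        # Millions of distinct User-Agent strings collapse to three keys.
        ua = ua.lower()
        if "mobile" in ua:
            return "mobile"
        if "tablet" in ua or "ipad" in ua:
            return "tablet"
        return "desktop"

    @app.route("/")
    def home():
        device = device_class(request.headers.get("User-Agent", ""))
        resp = make_response(f"<p>front page, {device} layout</p>")
        # Vary on the normalised class (a hypothetical header the edge
        # would set), never on the raw User-Agent, which would fragment
        # the cache into one entry per browser build.
        resp.headers["Vary"] = "X-Device-Class"
        resp.headers["Cache-Control"] = "public, s-maxage=60"
        return resp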



