I had the pleasure of working with NPR's data news chief, Brian Boyer, who taught me a lot about genuinely good software practices.
I agree with his preaching about the power of flat files. Not that flat files should be used to do things that they inherently can't...but too many projects (or hobby apps) don't consider them, and then spend just as much time figuring out how to keep their server from crashing. I find it pretty amazing that they have only one small EC2 instance for their news apps (this is separate from the NPR.org site overall), and it exists just to run cron jobs.
Flat files, of course, require good planning...not least of which is accurately gauging how often an app's data needs to be refreshed. But I like that kind of planning and thinking more than the kind it takes to maintain a stable server.
Recently I re-engineered an ecommerce site that takes many cues from the "flat-file" mantra. It's amazing how little database interaction you actually need for many sites.
I think the push toward database-driven content comes primarily from kitchen-sink packages like WordPress. More often than not, it's better to get an overall look at your structure and decide what needs to be dynamic and what can be "static".
When talking about it internally, we refer to "flat-file" as "generated". Simply meaning that it's dynamically created by a task or user interaction.
The file system is the original "database". I solved an interview question recently using a few Linux utilities, pipes, and flat files. It was a bash one liner that impressed the interviewer, but then he asked me to solve it the "real" way.
Both databases and file systems use B-trees to implement fast reads/writes; it's just that SQL databases enforce structure. OTOH, using flat files moves data checking into application space and gives up certain ACID properties.
Similar trade-offs are made between NoSQL and SQL, or dynamic and static languages, but I digress...
No, given a bunch of Apache logs, I needed to find the top 10 queries that met conditions A and B.
1. grepping for conditions and extracting the query with sed
2. appending a character to a flat file named after the query (i.e. increasing that query's count by 1)
3. sorting the files by size and taking the top 10
Parsing log files[0]:
find . -name '*.log' -type f -exec sh -c 'echo 1 >> "$(grep cond_A "$1" | grep cond_B | head -1 | sed -E "s:.*(query).*:./results/\1.txt:")"' _ {} \;
Finding top 10 results:
ls ./results/*.txt -Sl | head
[0]: Untested code, and it only grabs one line per log file instead of grepping all matching lines in the log file. I'd have to move the grep to the outside and loop the `echo 1 >>` command, but you get the point.
You probably could have done the same thing in one line:
cat *.log | grep condA | grep condB | sed 'some regex to get rid of dates, etc' | sort | uniq -c | sort -nr | head -n 10
Or something like that... Not very efficient, but it would work in a pinch. I actually do something like this all the time for large datasets. In the time it would take me to write something better, this set of commands is already done.
Maybe I shouldn't be, but I am kind of surprised that you can do an ecommerce site using mostly generated files. The two things that I wonder about are (1) displaying cart/user information on every page (is it all javascript?) and (2) CSRF protection in the forms.
Also, I would assume that javascript becomes a requirement for the functioning of the site; I always considered ecommerce to be the one part of the web where you wanted to be able to operate without javascript, so as not to turn anyone away.
I found this blog post that says that in 2010, 2% of users in the US had JS disabled:
You can also integrate with a third-party shopping cart then load a small iframe to show "2 Items in the Cart." Not my favorite way to build a site, but it is very inexpensive to build, and for small vendors the site becomes entirely static. Even if you build it in Rails, you can use page-level caching so after the first hit it's served as a static html file.
Page-level caching is one of the most underused features of Rails. Most applications show anonymous users identical content, and can tolerate that content being cached briefly. Setting this up gives you Slashdot protection as well as access to a last-chance degraded mode if there's a chain of catastrophic failures further into the app stack.
I guess I should have been more specific about my strategy for "generated" content in an ecommerce environment. The traditional ecommerce elements that should be real-time (carts, pricing, availability, etc.) remain real-time. Everything else is generated on update via our [custom] inventory management software.
The closest analogy I can give is Varnish's Edge Side Includes. It's not exactly the same, but it's very similar.
Serving flat files...isn't that what a caching layer is for? I don't understand why someone would go to extra trouble architecting an app around flat files on the back end, rather than building it the easiest way and sticking Varnish or nginx in front of it.
> I had the pleasure of working with NPR's data news chief, Brian Boyer
I found the unexplained use of the word "Boyerism" in this article to be confusing and unprofessional. I have nothing against the use of slang, or the expectation that readers will need to Google terms: this is the reality of our culture in the age of the Internet and excellent search. However, the casual and unexplained use of a private group's inside joke reveals a lack of awareness of how search interacts with culture. Is it a deliberate attempt to confuse and snub the larger audience, or is it unintentional?
I agree with you...I didn't even realize that Boyer wasn't introduced in the OP by name. I mentioned his name because he is well known in news circles and frequently evangelizes about flat files. The OP should have said who Brian Boyer was, even if the NPR apps blog is meant as inside baseball for news devs.
This is an excellent post. I used almost exactly the same system to build my app streamjoy.tv. Every day streamJoy scours Netflix, iTunes, and Amazon to find the availability of movies for streaming, rental, and digital purchase.
I built it as a portfolio piece and haven't finished it because I've been doing consulting jobs, but if you want to watch a particular movie online without going the pirate route, it's the start of a legal alternative to the old sites like sidereel.
The tech behind it is the same as this article. Flask and Jinja render a static html page for each of about 92,000 movies.
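The render step is conceptually just a loop over the catalog. Roughly speaking, it looks like this (an untested sketch; the template and field names are invented, and whether you go through Flask's render_template or straight Jinja the idea is the same):

from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates"))
movie_template = env.get_template("movie.html")  # hypothetical template name

def render_all(movies, out_dir="build"):
    # one static HTML file per movie; the whole directory gets served by nginx (or synced to S3)
    for movie in movies:
        html = movie_template.render(movie=movie)
        with open("%s/%s.html" % (out_dir, movie["slug"]), "w") as f:
            f.write(html)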
I use a Flask app for the search functionality that accesses an Elasticsearch database.
I used mongodb because it was incredibly easy to create a local cache of the JSON data I was getting from the apis I was accessing.
It all took a lot longer than I ever would have thought to even get it to this point. There were a lot of little annoying issues with the various APIs I had to access, and the annoyance of parsing XML, among other schleps I had to deal with.
I have only ever mentioned it on Hacker News one other time, and that time my Elasticsearch server crashed from the traffic. It is all running on a $5-a-month DigitalOcean VPS.
The Flask/Jinja static page creation is rock solid and would never fail if I pushed it to S3; right now my Elasticsearch server is the bottleneck. I haven't taken the time to throw hardware at it or set up clustering.
All in all it's a pretty cool service in my opinion. I built it for myself because I love movies and spend a lot of time watching them online, and made a decision to never pirate a content creator's work again. Also, the experience of Netflix, Amazon, and iTunes is orders of magnitude better than the old Megavideo/BitTorrent hassle of finding the real deal without being hit with spammy ads with voiceovers.
I really like the flask/jinja/bootstrap/javascript/mongodb/elastic search stack. I've learned a lot of tips and tricks by building streamJoy and if people want I would be happy to share them with the community.
I know this sounds like self promotion of my app but I haven't even taken the time to implement affiliate tracking for any service besides amazon. Consulting is serious and real money right now and that takes priority over this little side project I did.
streamjoy.tv looks nice. Too bad you're not planning to finish it. Another alternative that I've been using recently is canistream.it, but that gets wonky sometimes.
Just curious. Could you explain this part? "I used mongodb because it was incredibly easy to create a local cache of the JSON data I was getting from the APIs I was accessing."
If you are using the data from the various APIs to render static pages, why do you need the local cache of the JSON data? And if you've got a local cache of all the JSON data, why render static pages rather than serve dynamic pages that reference the DB of cached data?
> If you are using the data from the various APIs to render static pages, why do you need the local cache of the JSON data?
Because the data that I am displaying for each movie is not available from any single source. streamJoy is the abstraction layer tying together the disparate data sources.
Mongo was the right choice because instead of having to map out a Postgres schema, I could just store the entire JSON response as a dict.
What I have learned from all of this is that APIs are not always reliable. The information changes, and you don't always know what you are going to get back.
Mongo just made this early phase, when I didn't know what I was going to get, much more fault-tolerant. And no schema mapping.
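Concretely, "caching" an API response was basically one call per document, roughly like this (a simplified, untested sketch; the database, collection, and field names are made up):

import pymongo

client = pymongo.MongoClient()
cache = client.streamjoy.api_cache  # hypothetical db/collection names

def cache_response(source, movie_id, response_json):
    # store the raw JSON response as-is; no schema to design up front
    doc = {"source": source, "movie_id": movie_id, "response": response_json}
    cache.replace_one({"source": source, "movie_id": movie_id}, doc, upsert=True)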
> And if you've got a local cache of all the JSON data, why render static pages rather than serve dynamic pages that reference the DB of cached data?
Rendering static pages allows you to use S3 or nginx to serve your html. For a one-man operation like I am running, S3 is manna from heaven for scaling the serving of html files.
I'm using nginx right now only because I haven't taken the time to use S3, but S3 is the smarter choice here.
My goal with this was to have the smallest possible dynamic server footprint.
The other reason for caching the JSON data is that I run analysis on the DB items, even though I haven't published any of those features yet.
By having my own copy of the data, I can run a process on 92000 items in 6 minutes instead of taking a day due to API rate limits.
You can accomplish the same thing as flat files by using something like Amazon CloudFront (http://aws.amazon.com/cloudfront/) -- CloudFront caches your dynamically generated pages and serves the generated page from cache rather than hitting your server each time.
Brian Boyer did the same when he was leading the news apps team at the Chicago Tribune (I believe he left for NPR this past summer). The team's motto was "Show your work."
A very pragmatic approach for serving tons of static content. S3 transfer costs are still pretty competitive, doubly so if you compare them not just to your own iron but to the expertise you'd need on staff to maintain and scale it all (and it's unlikely you would ever match S3's uptime even if you did).
Not to mention that S3 buckets tend to remain online when EC2 instances in US-east are exploding due to <insert a problem related to EBS, network, or datacenter power failure>. :)
The last major outage for S3 was something like 4-5 years ago, IIRC. Coupled with CloudFront or another CDN, your site probably would be extremely resilient to traffic spikes or hardware/datacenter issues.
It's interesting to see a site like NPR handle a site deploy like this. I've seen blog owners start to consider switching to a static website, but news sites are definitely a bit more difficult to maintain like this.
It's worth noting that our main news site, npr.org, is not all served from flat files. This is specific to our news applications team (apps.npr.org) and our client-side projects.
Which does make sense, since it's a constantly updated site-- regenerating all of the files (to update "what's new" lists and such) each time an article is written would be major overkill.
While I'm sure you guys already do this, proper caching can have a similar effect to a completely static site in terms of performance.
Someone help me understand; the flat assets are hosted on S3 but how do http requests get resolved to the correct html file? Is it done with DNS settings?
Can you help me understand why someone would host static content on S3 versus any other host? Is it because S3 automatically scales up when traffic increases? I know almost nothing about S3.
The pricing is competitive, it's easy and reliable, and it can handle as much traffic as you like. S3 may not be the best option, but it's definitely a good, no-worries option.
"Competitive" is an interesting word to use. For example, I can get 100Mbit bandwidth from Hetzner for $9/TB. Transferring the same amount of data over S3 would cost me 10x more.
Now, if you're serving large files and you really need more than 100Mbit sustained, S3 makes sense. But it's unquestionably a premium service for a premium price.
I've just created an account just to comment on that.
I'm working for different NGOs, helping them with their IT projects. Recently one of them asked me to create an equivalent of the iTunes store (with the multimedia files available for free to the members).
One of their guys was hyper enthusiastic about S3. But no matter how we calculated it, Hetzner was (much) cheaper.
Now, the service has been running for over a year on Hetzner services and everybody is very happy. There was one incident (the motherboard on the server died - they replaced it very quickly), that's all. We have all the options we need.
Note that the current setup is fairly small: we're serving ca. 4 TB of data to ca. 4000 members.
Now, when people say that S3 scales well, I totally agree. But why they say the prices are competitive is beyond me. Take Xirra's XS-12 storage (200€/month for 36 TB) and compare it to storing 18 TB (I assume RAID 1) on S3: 1 TB costs $95 a month on S3, so that's over $1,700 a month. How on Earth can you call it 'competitive'? And that's even without bandwidth costs.
I totally agree it's a premium service for a premium price. There are plenty of cases when using S3 is just a big waste of money. (And Amazon's decision not to implement cost capping isn't helping either.)
Well, if you buy a server or hosting from Hetzner and want a similar setup as S3, you'd have to purchase multiple servers for redundancy. With S3 (and especially CloudFront or other CDNs), you can pay a minimal amount to have all of your files redundantly stored with an extremely high availability (and/or with multiple geographic locations).
Not sure how others do this, but I use Amazon Cloudfront. You just prepend the cloudfront url to your asset path (like mybucket.cloudfront.net/path/to/my/local/file), and ... that's it. That's all you have to do. It's fantastic.
I'm curious, what are the rules/requirements for initiating a new "NPR app"? An election app seems totally obvious, but what about other apps? Is it based on available data? Available funds? Pervasiveness of a certain story? An individual reporter's weight? (for example, if I was on the team and Nina Totenberg made an app request, I'd drop everything and do it for her - she's dreamy)
Also, how much lead time do you typically get with your apps? A few days, a few weeks, longer?
Instead of dynamically generating HTML and JS on each page view, you run a render process that creates html files and serve them statically until the next refresh cycle.
It's the difference between running a WordPress server and a Jekyll blog.
The security and scaling benefits are immense. With javascript you can replicate a lot of the functionality of dynamic sites.
Oh I think I get it, instead of waiting for a user to click on a page and then call the server to render that page, the flat file method renders all the pages at once w/o worrying about which pages the user will click on... Correct?
I say this in a "how far have we come" / "we've come full circle" way, and not in any way as disparaging.
As someone who remembers when Perl CGI's ability to serve dynamic content was revolutionary and awesome, it amazes me to hear from someone whose only experience is dynamic content and who needs HTTP file serving explained to them.
Well, I learnt Java's OOP first and then C structs. So to me a C struct is just like a Java object, only with no methods in it. My poor TA couldn't get over it.
I have the same feeling. The new generation takes dynamic content being generated on the fly for granted, even though in many cases it's not necessary at all.
It really is as simple as it sounds: files. It's just HTML and CSS and JavaScript and what not. There's no server running, except to (re)generate that HTML when it's out of date.
It takes a different way of thinking because you have to design your html + js to work for everyone that accesses it.
It is still possible to personalize the content a user sees because you can still authenticate them and use ajax GET/POST requests, but you have to do all that with javascript.
You still need a server, but instead of your server generating the full html for each request, it only needs to generate a JSON response for some percentage of total page views. This is just lighter weight, and it's the architectural approach we are heading toward with Backbone.js et al...
Basically you offload as much of the computational load to the client side AND the batch rendering process as possible.
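To make that concrete, the "server" part can shrink to a tiny JSON endpoint that the static pages call via AJAX. A minimal illustration (the route and fields here are made up, not from the article):

from flask import Flask, jsonify

app = Flask(__name__)

# The static HTML (served from S3/nginx) fetches this with an AJAX GET
# and fills in the frequently-changing bits client-side.
@app.route("/api/availability/<slug>")
def availability(slug):
    # look up the current data in whatever hot store you keep
    return jsonify(slug=slug, netflix=True, itunes=False)

if __name__ == "__main__":
    app.run()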
One reason this was important for my purposes with streamJoy is that even though 92,000 movies isn't Big Data, it still takes a long time when you have to run 10 kinds of algorithms on 92,000 items AND deal with the restricted API limits of a big API like Amazon's (about one request per 1.8 seconds unless you get special permission or are driving a lot of revenue).
So for streamJoy I had to do a lot of computation and network queries, and I had to respect the 24 hour period that Amazon wants for data freshness.
So the batch jobs I had to run had to be as fast as possible, and that means the fastest CPU + SSDs + ample RAM.
A dedicated i7 server with 32 GB of RAM and SSDs is $180 at the cheapest data center I could find. That's way too much for a portfolio piece!
So I offloaded all the batch computation to my own local workstation and then just push the rendered results to the cloud.
That's one of the big benefits of the flat-file approach.
Is there still an issue with GZIP encoding S3-delivered static assets? From my reading, it looks like Amazon will not automatically convert assets to GZIP encoding when the browser indicates support. Rather, you have to upload GZIPed files and manually configure the content-encoding headers. This approach would be broken for browsers that don't support GZIPed files.
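For reference, the manual route looks roughly like this with boto (an untested sketch; the bucket name is invented):

import gzip
from StringIO import StringIO

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket("my-static-site")  # hypothetical bucket

def upload_gzipped_html(key_name, html):
    # gzip the body ourselves, then tell S3 (and thus the browser) about it
    buf = StringIO()
    gz = gzip.GzipFile(fileobj=buf, mode="wb")
    gz.write(html)
    gz.close()

    key = bucket.new_key(key_name)
    key.set_metadata("Content-Encoding", "gzip")
    key.set_contents_from_string(buf.getvalue(),
                                 headers={"Content-Type": "text/html"})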
I believe some mobile browsers do not support GZIP - but I don't have the data. Presumably older models (a dying breed) - but returning GZIP'ed content without receiving the appropriate Accept-Encoding header is non-compliant with the HTTP standard.
Thanks, but that's from 2009 and isn't entirely relevant. Those stats are for incoming requests that don't specify that they accept gzip, not that they can't use it. Often this is caused by proxies stripping headers from the request.
I remember reading an article, I believe by Google, about using browser sniffing instead of encoding headers to determine if gzip should be used. I can't find it now but the conclusion was that it's almost entirely safe to use gzip 100% of the time.
We did this a ton when I worked at Ars Technica. Flat files should definitely be one of the things in your toolbox that you consider first before moving on to more complicated schemes!
And what about handling HTTP 500 errors? Although rare, they can still happen. And you cannot make the user's browser retry in such a case (as you could if you were making the request in your app). Amazon's Best Practices document http://aws.amazon.com/articles/1904 clearly states that that's what you should do:
500-series errors indicate that a request didn't succeed, but may be retried. Though infrequent, these errors are to be expected as part of normal interaction with the service and should be explicitly handled with an exponential backoff algorithm (ideally one that utilizes jitter).
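In code, that retry loop is only a few lines. An illustrative sketch (my own, not from the AWS docs):

import random
import time

class ServerError(Exception):
    pass  # stand-in for whatever 5xx exception your S3 client raises

def with_backoff(request, max_attempts=5):
    # call request(), retrying 500-series failures with exponential backoff plus jitter
    for attempt in range(max_attempts):
        try:
            return request()
        except ServerError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep((2 ** attempt) + random.random())  # backoff + jitter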
However, I still think hosting on S3 is a great option. They are pretty reliable anyway.
I ran a site (stomped.com) in the late '90s and we did this. We were serving up news items to millions of unique visitors a month and hitting the DB on each page hit. Rather than try to implement a bunch of caching methods we went with generated HTML files. Given how often content changed (a few times an hour at most) it seemed a waste to generate 10s of thousands of db/cache hits when things rarely changed. Simple. Stable.
If the configuration were data, then you would have two parsing steps:
Data => Python
Data => JavaScript
With this setup they only have one:
Python => JavaScript
Granted, storing it in YAML/JSON/whatever means that you could potentially have many different codebases / languages reading it without a Language A => Language B conversion. It just depends on what works best for your project / team.
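For example, the Python => JavaScript step can be as simple as dumping the config dict into the rendered page. A made-up illustration (the config keys are invented):

import json

from jinja2 import Template

# config lives in Python...
CONFIG = {"refresh_seconds": 60, "results_path": "/data/results.json"}

# ...and gets baked into the page at render time, so the JS never parses a separate config file
template = Template("<script>var APP_CONFIG = {{ config_json }};</script>")
print(template.render(config_json=json.dumps(CONFIG)))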
NPR.org itself has a great JSONP API that makes apps like the one in this article possible. Here's a site I put together that uses it and HTML5 audio for a very quick and minimal NPR listening experience (best viewed on iOS safari):
> There are three salient Boyerisms I’ve picked up in my month as an NP-Rapper that sum up these differences...On our team, these Boyerisms aren’t just preached — they’re practiced and implemented in code.
Summary: Go ahead and toss off an unusual term or slang, but don't expect it to actually inform readers if they can't Google it. If you aren't trying to inform readers, exactly what are you trying to do?
I found the unexplained use of the word "Boyerism" in this article to be confusing and unprofessional. I have nothing against the use of slang, or the expectation that readers will need to Google terms: this is the reality of our culture in the age of the Internet and excellent search. However, the casual and unexplained use of a private group's inside joke reveals a lack of awareness of how search interacts with culture. Is it a deliberate attempt to confuse and snub the larger audience, or is it unintentional?
Your "..." deleted three paragraphs explaining the "Boyerisms" in detail. As someone who doesn't know the term, I substituted "principles" without (I assume) any loss of understanding. But, since you're upset on my behalf, what did I miss?
I can attest that it's unintentional. We roll pretty informally. And, frankly, we're pleased that anyone outside of the small (and insular) news apps community is reading about our setup.
> We roll pretty informally. And, frankly, we're pleased that anyone outside of the small (and insular) news apps community is reading about our setup.
If only there was this network kind of thing that people all over the world could use to find and read such text...
Who determines what the decorum for such a dialog should be? Does the informality affect in any way the rationality or defensibility of their choices? Why do people care? I'd much rather read an article in this tone than some stuffy technical blog like many other engineering teams sometimes put out.
> Does the informality affect in any way the rationality or defensibility of their choices?
Clarity in writing is not a stuffy affectation, but rather is the entire mechanism by which one both expresses an opinion, and provides an understandable, rational justification for that opinion. Without this, the reader is left with nothing but unsupported conjecture, opinion, and emotive appeals.
> Does the informality affect in any way the rationality or defensibility of their choices?
This implicitly calls me a liar. As I stated above, the word choice and informality don't bother me. That they spent so little time thinking if the article would make sense to a random reader does. That goes against my expectations for NPR.
How does it not reflect well on NPR? I'm reading an employee's frank and open (both appreciated) blog post, am I supposed to feel put off because it says "shitty" in it?
NPR advertises itself as "news & analysis", and in my experience, excels at both. A key component of "analysis" is the rational study, explanation, and discussion around (often complex and nuanced) topics.
There are few topics as complex as that of software engineering, and it also warrants due consideration and analysis.
To see engineers describing their emotive appeals as "how they roll" does not lead me to believe that NPR's hiring in their engineering department is on par with their hiring in the editorial department.
As such, it reflects badly on their engineering department, and there is a strong implication that this is not somewhere that a studious engineer would choose to work.
Contrast with the posts from Netflix on their architecture. Those posts are not unduly formal, but they provide logical, reasoned arguments, sufficient background as to judge their claims and conjecture, and demonstrate not only their own technical and engineering capacity, but their respect for the technical capacity of their reader.
The prioritization of rationality over emotion (such as you use it) is a function of one's social framework, and the history of imposing the trappings of such onto others is pretty much the history of aristocracy.
That's an interesting rhetorical tack, but rationality over emotion is also pretty much the history of science, which is also closely tied with the history of aristocracy.
Fortunately for all involved, we don't need to rope aristocracy or philosophy into the argument to provide some level of understanding of the qualitative nature of engineering, be it physical or digital. Instead, we have algorithms, maths/logic, and real measurements to serve as the bedrock of our field.
Unfortunately, articles such as this one abandon that bedrock in favor of appeals to emotion, which leaves the article (and whatever conclusions it may ostensibly provide) unsupported by fact or logic.
This loosely grounded approach to discussion is perfectly suited when discussing one's television preferences, but provides a net negative value to the world of technical discourse by propagating a culture of unsubstantiated and emotionally driven opinion and pop culture ideals.
Your entire argument has been hung up on the fact that what was said was not said the way you would prefer. Here's where you call them stupid:
As such, it reflects badly on their engineering department, and there is a strong implication that this is not somewhere that a studious engineer would choose to work.
> I assume she wrote out the article and simply forgot to actually include that piece of info.
That was my guess as well. From this, I take it that too little thought was given to potential readers. (And if it was meant only for insiders, why is it on the public Internet?)
This is a really interesting perspective, you should write a blog post about what it's like to have never made a tiny, insignificant mistake in your life and share it with all of us.
This is completely irrelevant. How about you post something on your blog about how things like clear, considered prose are no longer required of media organizations like NPR? Is pointing out something like that wrong? It would seem that pointing out a mistake in public is some kind of ultimate wrong, even if it is a highly regarded, internationally known media outlet.
Most of our sites run entirely on S3. While we do have a cron server running, we could just as easily ship the updated JSON files from one of our laptops.
In fact, for the 2012 election, our fallback in case EC2 was unresponsive was to decamp to a coffee shop with a laptop and s3cmd.
There hasn't been an acknowledged S3 outage since April 2009 that we can find. Also, we use two different buckets in different geographic regions to allow us to stay up if one region goes out. Finally, our biggest projects are cached in CloudFront, which would serve a stale cached item if the backend were unavailable.
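The push itself is nothing fancy: in spirit it's just syncing the rendered files to each bucket. Purely illustrative (this is not our actual deploy script, and the bucket names are made up):

import subprocess

BUCKETS = ["apps-primary.example.org", "apps-backup.example.org"]  # hypothetical

def deploy(build_dir="build/"):
    # push the rendered flat files to both buckets; either one can serve the site
    for bucket in BUCKETS:
        subprocess.check_call(["s3cmd", "sync", "--acl-public", build_dir,
                               "s3://%s/" % bucket])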
Do you 'invalidate' all files when pushing an update or do you use a low cache expiry value for CloudFront?
In particular, I am thinking of edge cases where a news story has a typo/other important correction and you want to update just that story. What is your strategy? How is it impacted by caching done by CloudFront? Thanks.
Yes and yes. Yes, it is possible for S3 to go down and yes uptime is very high. For their purposes, S3 uptime is much higher than they can reasonably sustain with in house hosting.
We do use EC2 for cron, but could easily use a laptop instead if there were an outage.
Some of our more dynamic projects do require a server; the inauguration project that uploaded photos to Tumblr did, for example. https://github.com/nprapps/inauguration/
When we run servers, it's usually Flask and Nginx/uWSGI.
While I agree with many of the sentiments expressed in the article, I think "never goes down" and "costs you practically nothing" only seems true while nothing bad happens.
When a hard drive crashes or a truck runs into your data center (here's looking at you, Rackspace) or you need failover for any reason, that's when you wish you had virtualized in more than one machine.
Want something that's always available and never crashes? Look at Freenet and its distributed computing model. If we can fail over the DNS, you can have the same thing on the web.