I had the pleasure of working with NPR's data news chief, Brian Boyer, who taught me a lot about genuinely good software practices.
I agree with his preaching about the power of flat files. Not that flat files should be used to do things that they inherently can't...but too many projects (or hobby apps) don't consider them, and then spend just as much time figuring out how to keep their server from crashing. I find it pretty amazing that they have only one small EC2 instance for their news apps (this is separate from the NPR.org site overall), and it exists just to run cron jobs.
Flat files, of course, require good planning...not least of which is accurately gauging how often an app's data needs to be refreshed. But I like that kind of planning and thinking more than the kind it takes to maintain a stable server.
Recently I re-engineered an ecommerce site that takes many cues from the "flat-file" mantra. It's amazing how little database interaction you actually need for many sites.
I think the push toward database-driven content comes primarily from kitchen-sink packages like WordPress. More often than not, it's better to get an overall look at your structure and decide what needs to be dynamic and what can be "static".
When talking about it internally, we refer to "flat-file" as "generated". Simply meaning that it's dynamically created by a task or user interaction.
The file system is the original "database". I solved an interview question recently using a few Linux utilities, pipes, and flat files. It was a bash one liner that impressed the interviewer, but then he asked me to solve it the "real" way.
Both databases and file systems use B-trees to implement fast reads/writes; it's just that SQL databases enforce structure. OTOH, using flat files moves data checking into application space and gives up certain ACID properties.
Similar trade-offs are made between NoSQL and SQL, or dynamic and static languages, but I digress...
No, given a bunch of Apache logs, I needed to find the top 10 queries that met conditions A and B.
1. grepping for conditions and extracting the query with sed
2. appending a character to a flat file named after the query (i.e. increasing that query's count by 1)
3. sorting the files by size and taking the top 10
Parsing log files[0]:
find . -name '*.log' -type f -exec sh -c 'echo 1 >> "$(grep cond_A "$1" | grep cond_B | head -1 | sed -E "s:.*(query).*:./results/\1.txt:")"' _ {} \;
Finding top 10 results:
ls ./results/*.txt -Sl | head
[0]: Untested code, and it only grabs one line per log file instead of grepping all matching lines in the log file. I'd have to move the grep to the outside and loop the `echo 1 >>` command, but you get the point.
You probably could have done the same thing in one line:
cat *.log | grep condA | grep condB | sed 'some regex to get rid of dates, etc' | sort | uniq -c | sort -nr | head -n 10
Or something like that... Not very efficient, but it would work in a pinch. I actually do something like this all the time for large datasets. In the time it would take me to write something better, this set of commands is already done.
Maybe I shouldn't be, but I am kind of surprised that you can do an ecommerce site using mostly generated files. The two things that I wonder about are (1) displaying cart/user information on every page (is it all javascript?) and (2) CSRF protection in the forms.
Also, I would assume that javascript becomes a requirement for the functioning of the site; I always considered ecommerce to be the one part of the web where you wanted to be able to operate without javascript, so as not to turn anyone away.
I found this blog post that says that in 2010, 2% of users in the US had JS disabled:
You can also integrate with a third-party shopping cart then load a small iframe to show "2 Items in the Cart." Not my favorite way to build a site, but it is very inexpensive to build, and for small vendors the site becomes entirely static. Even if you build it in Rails, you can use page-level caching so after the first hit it's served as a static html file.
Page-level caching is one of the most underused features of Rails. Most applications show anonymous users identical content, and can tolerate that content being cached briefly. Setting this up gives you Slashdot protection as well as access to a last-chance degraded mode if there's a chain of catastrophic failures further into the app stack.
I guess I should have been more specific about my strategy for "generated" content in an ecommerce environment. The traditional ecommerce elements that should be real-time (carts, pricing, availability, etc.) remain real-time. Everything else is generated on update via our [custom] inventory management software.
The closest analogy I can give is Varnish's Edge Side Includes. It's not exactly the same, but it's very similar.
Serving flat files...isn't that what a caching layer is for? I don't understand why someone would go to extra trouble architecting an app around flat files on the back end, rather than building it the easiest way and sticking Varnish or nginx in front of it.
> I had the pleasure of working with NPR's data news chief, Brian Boyer
I found the unexplained use of the word "Boyerism" in this article to be confusing and unprofessional. I have nothing against the use of slang, or the expectation that readers will need to Google terms: this is the reality of our culture in the age of the Internet and excellent search. However, the casual and unexplained use of a private group's inside joke reveals a lack of awareness of how search interacts with culture. Is it a deliberate attempt to confuse and snub the larger audience, or is it unintentional?
I agree with you...I didn't even realize that Boyer wasn't introduced in the OP by name. I mentioned his name because he is well known in news circles and frequently evangelizes about flat files. The OP should have said who Brian Boyer was, even if the NPR apps blog is meant as inside baseball for news devs.
This is an excellent post. I used almost exactly the same system to build my app streamjoy.tv. Every day streamJoy scours Netflix, iTunes, and Amazon to find the availability of movies for streaming, rental, and digital purchase.
I built it as a portfolio piece and haven't finished it because I've been doing consulting jobs, but if you want to watch a particular movie online without going the pirate route, it's the start of a legal alternative to the old sites like sidereel.
The tech behind it is the same as this article. Flask and Jinja render a static html page for each of about 92,000 movies.
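The render step is conceptually just a loop over the catalog. Roughly speaking, it looks like this (an untested sketch; the template and field names are invented, and whether you go through Flask's render_template or straight Jinja the idea is the same):

from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates"))
movie_template = env.get_template("movie.html")  # hypothetical template name

def render_all(movies, out_dir="build"):
    # one static HTML file per movie; the whole directory gets served by nginx (or synced to S3)
    for movie in movies:
        html = movie_template.render(movie=movie)
        with open("%s/%s.html" % (out_dir, movie["slug"]), "w") as f:
            f.write(html)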
I use a Flask app for the search functionality that accesses an Elasticsearch database.
I used mongodb because it was incredibly easy to create a local cache of the JSON data I was getting from the apis I was accessing.
It all took a lot longer than I ever would have thought to even get it to this point. There were a lot of little annoying issues with the various APIs I had to access, and the annoyance of parsing XML, among other schleps I had to deal with.
I have only ever mentioned it on Hacker News one other time, and that time my Elasticsearch server crashed from the traffic. It is all running on a $5-a-month DigitalOcean VPS.
The Flask/Jinja static page creation is rock solid and would never fail if I pushed it to S3; right now my Elasticsearch server is the bottleneck. I haven't taken the time to throw hardware at it or set up clustering.
All in all it's a pretty cool service in my opinion. I built it for myself because I love movies and spend a lot of time watching them online, and made a decision to never pirate a content creator's work again. Also, the experience of Netflix, Amazon, and iTunes is orders of magnitude better than the old Megavideo/BitTorrent hassle of finding the real deal without being hit with spammy ads with voiceovers.
I really like the flask/jinja/bootstrap/javascript/mongodb/elastic search stack. I've learned a lot of tips and tricks by building streamJoy and if people want I would be happy to share them with the community.
I know this sounds like self promotion of my app but I haven't even taken the time to implement affiliate tracking for any service besides amazon. Consulting is serious and real money right now and that takes priority over this little side project I did.
streamjoy.tv looks nice. Too bad you're not planning to finish it. Another alternative that I've been using recently is canistream.it, but that gets wonky sometimes.
Just curious. Could you explain this part? "I used mongodb because it was incredibly easy to create a local cache of the JSON data I was getting from the APIs I was accessing."
If you are using the data from the various APIs to render static pages, why do you need the local cache of the JSON data? And if you've got a local cache of all the JSON data, why render static pages rather than serve dynamic pages that reference the DB of cached data?
> If you are using the data from the various APIs to render static pages, why do you need the local cache of the JSON data?
Because the data that I am displaying for each movie is not available from any single source. streamJoy is the abstraction layer tying together the disparate data sources.
Mongo was the right choice because instead of having to map out a Postgres schema, I could just store the entire JSON response as a dict.
What I have learned from all of this is that APIs are not always reliable. The information changes, and you don't always know what you are going to get back.
Mongo just made this early phase, when I didn't know what I was going to get, much more fault-tolerant. And no schema mapping.
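Concretely, "caching" an API response was basically one call per document, roughly like this (a simplified, untested sketch; the database, collection, and field names are made up):

import pymongo

client = pymongo.MongoClient()
cache = client.streamjoy.api_cache  # hypothetical db/collection names

def cache_response(source, movie_id, response_json):
    # store the raw JSON response as-is; no schema to design up front
    doc = {"source": source, "movie_id": movie_id, "response": response_json}
    cache.replace_one({"source": source, "movie_id": movie_id}, doc, upsert=True)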
> And if you've got a local cache of all the JSON data, why render static pages rather than serve dynamic pages that reference the DB of cached data?
Rendering static pages allows you to use S3 or nginx to serve your html. For a one-man operation like I am running, S3 is manna from heaven for scaling the serving of html files.
I'm using nginx right now only because I haven't taken the time to use S3, but S3 is the smarter choice here.
My goal with this was to have the smallest possible dynamic server footprint.
The other reason for caching the JSON data is that I run analysis on the DB items, even though I haven't published any of those features yet.
By having my own copy of the data, I can run a process on 92000 items in 6 minutes instead of taking a day due to API rate limits.
You can accomplish the same thing as flat files by using something like Amazon CloudFront (http://aws.amazon.com/cloudfront/) -- CloudFront caches your dynamically generated pages and serves the generated page from cache rather than hitting your server each time.
Brian Boyer did the same when he was leading the news apps team at the Chicago Tribune (I believe he left for NPR this past summer). The team's motto was "Show your work."
A very pragmatic approach for serving tons of static content. S3 transfer costs are still pretty competitive, doubly so if you compare them not just to your own iron but to the expertise you'd need on staff to maintain and scale it all (and it's unlikely you would ever match S3's uptime even if you did).
Not to mention that S3 buckets tend to remain online when EC2 instances in US-east are exploding due to <insert a problem related to EBS, network, or datacenter power failure>. :)
The last major outage for S3 was something like 4-5 years ago, IIRC. Coupled with CloudFront or another CDN, your site probably would be extremely resilient to traffic spikes or hardware/datacenter issues.
It's interesting to see a site like NPR handle a site deploy like this. I've seen blog owners start to consider switching to a static website, but news sites are definitely a bit more difficult to maintain like this.
It's worth noting that our main news site, npr.org, is not all served from flat files. This is specific to our news applications team (apps.npr.org) and our client-side projects.
Which does make sense, since it's a constantly updated site-- regenerating all of the files (to update "what's new" lists and such) each time an article is written would be major overkill.
While I'm sure you guys already do this, proper caching can have a similar effect to a completely static site in terms of performance.
Someone help me understand; the flat assets are hosted on S3 but how do http requests get resolved to the correct html file? Is it done with DNS settings?
Can you help me understand why someone would host static content on S3 versus any other host? Is it because S3 automatically scales up when traffic increases? I know almost nothing about S3.
The pricing is competitive, it's easy and reliable, and it can handle as much traffic as you like. S3 may not be the best option, but it's definitely a good, no-worries option.
"Competitive" is an interesting word to use. For example, I can get 100Mbit bandwidth from Hetzner for $9/TB. Transferring the same amount of data over S3 would cost me 10x more.
Now, if you're serving large files and you really need more than 100Mbit sustained, S3 makes sense. But it's unquestionably a premium service for a premium price.
I've just created an account just to comment on that.
I'm working for different NGOs, helping them with their IT projects. Recently one of them asked me to create an equivalent of the iTunes store (with the multimedia files available for free to the members).
One of their guys was hyper enthusiastic about S3. But no matter how we calculated it, Hetzner was (much) cheaper.
Now, the service has been running for over a year on Hetzner services and everybody is very happy. There was one incident (the motherboard on the server died - they replaced it very quickly), that's all. We have all the options we need.
Note that the current setup is fairly small: we're serving ca. 4 TB of data to ca. 4000 members.
Now, when people say that S3 scales well, I totally agree. But why they say the prices are competitive is beyond me. Take Xirra's XS-12 storage (200€/month for 36 TB) and compare it to storing 18 TB (I assume RAID 1) on S3: 1 TB costs $95 a month on S3, so that's over $1,700 a month. How on Earth can you call it 'competitive'? And that's even without bandwidth costs.
I totally agree it's a premium service for a premium price. There are plenty of cases when using S3 is just a big waste of money. (And Amazon's decision not to implement cost capping isn't helping either.)
Well, if you buy a server or hosting from Hetzner and want a similar setup as S3, you'd have to purchase multiple servers for redundancy. With S3 (and especially CloudFront or other CDNs), you can pay a minimal amount to have all of your files redundantly stored with an extremely high availability (and/or with multiple geographic locations).
Not sure how others do this, but I use Amazon Cloudfront. You just prepend the cloudfront url to your asset path (like mybucket.cloudfront.net/path/to/my/local/file), and ... that's it. That's all you have to do. It's fantastic.
I'm curious, what are the rules/requirements for initiating a new "NPR app"? An election app seems totally obvious, but what about other apps? Is it based on available data? Available funds? Pervasiveness of a certain story? An individual reporter's weight? (for example, if I was on the team and Nina Totenberg made an app request, I'd drop everything and do it for her - she's dreamy)
Also, how much lead time do you typically get with your apps? A few days, a few weeks, longer?
Instead of dynamically generating HTML and JS on each page view, you run a render process that creates html files and serve them statically until the next refresh cycle.
It's the difference between running a WordPress server and a Jekyll blog.
The security and scaling benefits are immense. With javascript you can replicate a lot of the functionality of dynamic sites.
Oh I think I get it, instead of waiting for a user to click on a page and then call the server to render that page, the flat file method renders all the pages at once w/o worrying about which pages the user will click on... Correct?
I say this in a "how far have we come" / "we've come full circle" way, and not in any way as disparaging.
As someone who remembers when Perl CGI's ability to serve dynamic content was revolutionary and awesome, it amazes me to hear from someone whose only experience is dynamic content and who needs HTTP file serving explained to them.
Well, I learnt Java's OOP first and then C structs. So to me a C struct is just like a Java object, only with no methods in it. My poor TA couldn't get over it.
I have the same feeling. The new generation takes dynamic content being generated on the fly for granted, even though in many cases it's not necessary at all.
It really is as simple as it sounds: files. It's just HTML and CSS and JavaScript and what not. There's no server running, except to (re)generate that HTML when it's out of date.
It takes a different way of thinking because you have to design your html + js to work for everyone that accesses it.
It is still possible to personalize the content a user sees because you can still authenticate them and use ajax GET/POST requests, but you have to do all that with javascript.
You still need a server, but instead of your server generating the full html for each request, it only needs to generate a JSON response for some percentage of total page views. This is just lighter weight, and it's the architectural approach we are heading toward with Backbone.js et al...
Basically you offload as much of the computational load to the client side AND the batch rendering process as possible.
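To make that concrete, the "server" part can shrink to a tiny JSON endpoint that the static pages call via AJAX. A minimal illustration (the route and fields here are made up, not from the article):

from flask import Flask, jsonify

app = Flask(__name__)

# The static HTML (served from S3/nginx) fetches this with an AJAX GET
# and fills in the frequently-changing bits client-side.
@app.route("/api/availability/<slug>")
def availability(slug):
    # look up the current data in whatever hot store you keep
    return jsonify(slug=slug, netflix=True, itunes=False)

if __name__ == "__main__":
    app.run()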
One reason this was important for my purposes with streamJoy is that even though 92,000 movies isn't Big Data, it still takes a long time when you have to run 10 kinds of algorithms on 92,000 items AND deal with the restricted API limits of a big API like Amazon's (about one request per 1.8 seconds unless you get special permission or are driving a lot of revenue).
So for streamJoy I had to do a lot of computation and network queries, and I had to respect the 24 hour period that Amazon wants for data freshness.
So the batch jobs I had to run had to be as fast as possible, and that means the fastest CPU + SSDs + ample RAM.
A dedicated i7 server with 32 GB of RAM and SSDs is $180 at the cheapest data center I could find. That's way too much for a portfolio piece!
So I offloaded all the batch computation to my own local workstation and then just push the rendered results to the cloud.
That's one of the big benefits of the flat-file approach.
Is there still an issue with GZIP encoding S3-delivered static assets? From my reading, it looks like Amazon will not automatically convert assets to GZIP encoding when the browser indicates support. Rather, you have to upload GZIPed files and manually configure the content-encoding headers. This approach would be broken for browsers that don't support GZIPed files.
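For reference, the manual route looks roughly like this with boto (an untested sketch; the bucket name is invented):

import gzip
from StringIO import StringIO

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket("my-static-site")  # hypothetical bucket

def upload_gzipped_html(key_name, html):
    # gzip the body ourselves, then tell S3 (and thus the browser) about it
    buf = StringIO()
    gz = gzip.GzipFile(fileobj=buf, mode="wb")
    gz.write(html)
    gz.close()

    key = bucket.new_key(key_name)
    key.set_metadata("Content-Encoding", "gzip")
    key.set_contents_from_string(buf.getvalue(),
                                 headers={"Content-Type": "text/html"})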
I believe some mobile browsers do not support GZIP - but I don't have the data. Presumably older models (a dying breed) - but returning GZIP'ed content without receiving the appropriate Accept-Encoding header is non-compliant with the HTTP standard.
Thanks, but that's from 2009 and isn't entirely relevant. Those stats are for incoming requests that don't specify that they accept gzip, not that they can't use it. Often this is caused by proxies stripping headers from the request.
I remember reading an article, I believe by Google, about using browser sniffing instead of encoding headers to determine if gzip should be used. I can't find it now but the conclusion was that it's almost entirely safe to use gzip 100% of the time.
We did this a ton when I worked at Ars Technica. Flat files should definitely be one of the things in your toolbox that you consider first before moving on to more complicated schemes!
And what about handling HTTP 500 errors? Although rare, they can still happen. And you cannot make the user's browser retry in such a case (as you could if you were making the request in your app). Amazon's Best Practices document http://aws.amazon.com/articles/1904 clearly states that that's what you should do:
500-series errors indicate that a request didn't succeed, but may be retried. Though infrequent, these errors are to be expected as part of normal interaction with the service and should be explicitly handled with an exponential backoff algorithm (ideally one that utilizes jitter).
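In code, that retry loop is only a few lines. An illustrative sketch (my own, not from the AWS docs):

import random
import time

class ServerError(Exception):
    pass  # stand-in for whatever 5xx exception your S3 client raises

def with_backoff(request, max_attempts=5):
    # call request(), retrying 500-series failures with exponential backoff plus jitter
    for attempt in range(max_attempts):
        try:
            return request()
        except ServerError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep((2 ** attempt) + random.random())  # backoff + jitter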
However, I still think hosting on S3 is a great option. They are pretty reliable anyway.
I ran a site (stomped.com) in the late '90s and we did this. We were serving up news items to millions of unique visitors a month and hitting the DB on each page hit. Rather than try to implement a bunch of caching methods we went with generated HTML files. Given how often content changed (a few times an hour at most) it seemed a waste to generate 10s of thousands of db/cache hits when things rarely changed. Simple. Stable.
If the configuration were data, then you would have two parsing steps:
Data => Python
Data => JavaScript
With this setup they only have one:
Python => JavaScript
Granted, storing it in YAML/JSON/whatever means that you could potentially have many different codebases / languages reading it without a Language A => Language B conversion. It just depends on what works best for your project / team.
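For example, the Python => JavaScript step can be as simple as dumping the config dict into the rendered page. A made-up illustration (the config keys are invented):

import json

from jinja2 import Template

# config lives in Python...
CONFIG = {"refresh_seconds": 60, "results_path": "/data/results.json"}

# ...and gets baked into the page at render time, so the JS never parses a separate config file
template = Template("<script>var APP_CONFIG = {{ config_json }};</script>")
print(template.render(config_json=json.dumps(CONFIG)))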
NPR.org itself has a great JSONP API that makes apps like the one in this article possible. Here's a site I put together that uses it and HTML5 audio for a very quick and minimal NPR listening experience (best viewed on iOS safari):
> There are three salient Boyerisms I’ve picked up in my month as an NP-Rapper that sum up these differences...On our team, these Boyerisms aren’t just preached — they’re practiced and implemented in code.
Summary: Go ahead and toss off an unusual term or slang, but don't expect it to actually inform readers if they can't Google it. If you aren't trying to inform readers, exactly what are you trying to do?
I found the unexplained use of the word "Boyerism" in this article to be confusing and unprofessional. I have nothing against the use of slang, or the expectation that readers will need to Google terms: this is the reality of our culture in the age of the Internet and excellent search. However, the casual and unexplained use of a private group's inside joke reveals a lack of awareness of how search interacts with culture. Is it a deliberate attempt to confuse and snub the larger audience, or is it unintentional?
Your "..." deleted three paragraphs explaining the "Boyerisms" in detail. As someone who doesn't know the term, I substituted "principles" without (I assume) any loss of understanding. But, since you're upset on my behalf, what did I miss?
I can attest that it's unintentional. We roll pretty informally. And, frankly, we're pleased that anyone outside of the small (and insular) news apps community is reading about our setup.
> We roll pretty informally. And, frankly, we're pleased that anyone outside of the small (and insular) news apps community is reading about our setup.
If only there was this network kind of thing that people all over the world could use to find and read such text...
Who determines what the decorum for such a dialog should be? Does the informality affect in any way the rationality or defensibility of their choices? Why do people care? I'd much rather read an article in this tone than some stuffy technical blog like many other engineering teams sometimes put out.
> Does the informality affect in any way the rationality or defensibility of their choices?
Clarity in writing is not a stuffy affectation, but rather is the entire mechanism by which one both expresses an opinion, and provides an understandable, rational justification for that opinion. Without this, the reader is left with nothing but unsupported conjecture, opinion, and emotive appeals.
> Does the informality affect in any way the rationality or defensibility of their choices?
This implicitly calls me a liar. As I stated above, the word choice and informality don't bother me. That they spent so little time thinking if the article would make sense to a random reader does. That goes against my expectations for NPR.
How does it not reflect well on NPR? I'm reading an employee's frank and open (both appreciated) blog post, am I supposed to feel put off because it says "shitty" in it?
NPR advertises itself as "news & analysis", and in my experience, excels at both. A key component of "analysis" is the rational study, explanation, and discussion around (often complex and nuanced) topics.
There are few topics as complex as that of software engineering, and it also warrants due consideration and analysis.
To see engineers describing their emotive appeals as "how they roll" does not lead me to believe that NPR's hiring in their engineering department is on par with their hiring in the editorial department.
As such, it reflects badly on their engineering department, and there is a strong implication that this is not somewhere that a studious engineer would choose to work.
Contrast with the posts from Netflix on their architecture. Those posts are not unduly formal, but they provide logical, reasoned arguments, sufficient background as to judge their claims and conjecture, and demonstrate not only their own technical and engineering capacity, but their respect for the technical capacity of their reader.
The prioritization of rationality over emotion (such as you use it) is a function of one's social framework, and the history of imposing the trappings of such onto others is pretty much the history of aristocracy.
That's an interesting rhetorical tack, but rationality over emotion is also pretty much the history of science, which is also closely tied with the history of aristocracy.
Fortunately for all involved, we don't need to rope aristocracy or philosophy into the argument to provide some level of understanding of the qualitative nature of engineering, be it physical or digital. Instead, we have algorithms, maths/logic, and real measurements to serve as the bedrock of our field.
Unfortunately, articles such as this one abandon that bedrock in favor of appeals to emotion, which leaves the article (and whatever conclusions it may ostensibly provide) unsupported by fact or logic.
This loosely grounded approach to discussion is perfectly suited when discussing one's television preferences, but provides a net negative value to the world of technical discourse by propagating a culture of unsubstantiated and emotionally driven opinion and pop culture ideals.
Your entire argument has been hung up on the fact that what was said was not said the way you would prefer. Here's where you call them stupid:
As such, it reflects badly on their engineering department, and there is a strong implication that this is not somewhere that a studious engineer would choose to work.
> I assume she wrote out the article and simply forgot to actually include that piece of info.
That was my guess as well. From this, I take it that too little thought was given to potential readers. (And if it was meant only for insiders, why is it on the public Internet?)
This is a really interesting perspective, you should write a blog post about what it's like to have never made a tiny, insignificant mistake in your life and share it with all of us.
This is completely irrelevant. How about you post something on your blog about how things like clear, considered prose are no longer required of media organizations like NPR? Is pointing out something like that wrong? It would seem that pointing out a mistake in public is some kind of ultimate wrong, even if it is a highly regarded, internationally known media outlet.
Most of our sites run entirely on S3. While we do have a cron server running, we could just as easily ship the updated JSON files from one of our laptops.
In fact, for the 2012 election, our fallback in case EC2 was unresponsive was to decamp to a coffee shop with a laptop and s3cmd.
There hasn't been an acknowledged S3 outage since April 2009 that we can find. Also, we use two different buckets in different geographic regions to allow us to stay up if one region goes out. Finally, our biggest projects are cached in CloudFront, which would serve a stale cached item if the backend were unavailable.
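The push itself is nothing fancy: in spirit it's just syncing the rendered files to each bucket. Purely illustrative (this is not our actual deploy script, and the bucket names are made up):

import subprocess

BUCKETS = ["apps-primary.example.org", "apps-backup.example.org"]  # hypothetical

def deploy(build_dir="build/"):
    # push the rendered flat files to both buckets; either one can serve the site
    for bucket in BUCKETS:
        subprocess.check_call(["s3cmd", "sync", "--acl-public", build_dir,
                               "s3://%s/" % bucket])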
Do you 'invalidate' all files when pushing an update or do you use a low cache expiry value for CloudFront?
In particular, I am thinking of edge cases where a news story has a typo/other important correction and you want to update just that story. What is your strategy? How is it impacted by caching done by CloudFront? Thanks.
Yes and yes. Yes, it is possible for S3 to go down and yes uptime is very high. For their purposes, S3 uptime is much higher than they can reasonably sustain with in house hosting.
We do use EC2 for cron, but could easily use a laptop instead if there were an outage.
Some of our more dynamic projects do require a server; the inauguration project that uploaded photos to Tumblr did, for example. https://github.com/nprapps/inauguration/
When we run servers, it's usually Flask and Nginx/uWSGI.
While I agree with many of the sentiments expressed in the article, I think "never goes down" and "costs you practically nothing" only seems true while nothing bad happens.
When a hard drive crashes or a truck runs into your data center (here's looking at you, Rackspace) or you need failover for any reason, that's when you wish you had virtualized in more than one machine.
Want something that's always available and never crashes? Look at Freenet and its distributed computing model. If we can fail over the DNS, you can have the same thing on the web.