You can do an awful, awful lot for your business by taking this idea one or two iterations further:
1) Identify data source
2) Extract value from data source
3) Spit out templated content pieces extracted from data source
4) Farm out templated articles to freelancers for thickening up, Demand Media style
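Steps 1–3 can be sketched in a few lines. This is an illustrative Python sketch with made-up data and a made-up template, not anyone's actual pipeline:

```python
import csv
import io

# Step 1: an illustrative CSV "data source" -- in practice this would be
# scraped or downloaded, not inlined.
DATA = """fund,ticker,expense_ratio,five_year_return
Vanguard 500 Index,VFIAX,0.04,10.8
Fidelity Contrafund,FCNTX,0.39,12.1
"""

# Step 3: a sentence template a freelancer could later "thicken up".
TEMPLATE = ("{fund} ({ticker}) charges an expense ratio of {expense_ratio}% "
            "and returned {five_year_return}% annually over the last five years.")

def generate_articles(raw_csv, template):
    """Steps 2-3: extract rows from the data source, emit one templated piece per row."""
    return [template.format(**row) for row in csv.DictReader(io.StringIO(raw_csv))]

for article in generate_articles(DATA, TEMPLATE):
    print(article)
```

Step 4 is the human part: each generated stub goes to a freelancer with instructions to expand it into something readable.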
My client from this summer who paid me to do it for the average value of particular college degrees is launching sometime in the next week or so. I'll happily play show-and-tell with the non-proprietary parts if folks want, after it launches.
I tried to do almost exactly what you're describing about a year ago -- scraping structured data about mutual funds and constructing articles about each fund, which I submitted to Associated Content.
Unfortunately, I should have spent more time "thickening them up" -- after the first half dozen or so, AC began rejecting them for being too similar to each other.
I'm curious: why did you submit them to Associated Content instead of building your own site, where you'd have total control and keep most of the value you created? A deep backbench of semi-automated articles about funds plus relatively fewer pillar content pieces for linkability strikes me as a potentially very viable business in an industry which is quite literally awash in cash to spend on marketing.
At the time, AC was paying about $3 up front, plus pay-per-view incentives. And it has a good Google PR. So, I figured I could either be lazy and submit to AC or build my own site and spend a lot of time trying to get a decent PageRank. I chose lazy.
Just in the spirit of introducing you to other options: there exist people who already have profitable affiliate sites in the space who you could pitch on the idea of "bolt this onto your site and get X more inventory which will rank on the strength of your existing brand/trust/etc." I'd be thinking more in the five figure range than the $3 up front range.
Personally, I'm not a fan. For generic content like this, I'd rather read it in a table or chart. The data is being encoded into natural language, and then when we read it we have to parse the important information back out.
This is a weird example, but look at Groupon. One of the main reasons it's so big is the custom, humorous descriptions that go along with each item.
If newspapers want to survive, they shouldn't be automating their content -- it just makes it more generic and forgettable. Nobody wants to read an article that a computer wrote.
> One of the main reasons it's so big is the custom, humorous descriptions that go along with each item.
Personally, I never read those. I skim the headline, and look at the deal details if the topic interests me and is a really good deal.
There's a difference in content though; with deals I just want the hard facts. With many news articles, I want some well-written copy to add context to otherwise meaningless/bland data.
For the Powerball results, the sentence format is a bit more engaging and I can skim it as fast as I would skim a table. If nothing else, it makes me feel like the publisher cares more about the reader.
You can get staggering productivity wins by automating enough of the right 1, 5, 15 minute tasks, especially when you consider how terrible people's schedulers are. If your computer lost an hour of productive work every time it context switched, you'd figure out ways to eliminate its list of small recurring chores, too. Happily, your computer has very, very efficient context switching relative to you.
I've been keeping a running count of time spent on various activities this month, for giggles. Total support time for BCC in the last two weeks: eight minutes. The machine has been humming so efficiently I burned some time yesterday just to check that the whole thing hadn't been hit with a meteor or something.
A second point would be that the time you spend automating tasks has other payoffs, in the form of learning and inspiration.
For instance, I suspect that part of the inspiration for Appointment Reminder came from Patrick's own realization of how valuable the automation of small tasks can be (chasing down missed appointments, in this case).
---
To go off on another tangent, this is akin to eliminating technical debt from your workflow. By taking time to "refactor" certain tasks by doing them in a more efficient way, you get a net savings going forward. You can increase your ability to take on new tasks by increasing the efficiency of existing ones.
In keeping with the refactoring theme, if you have tasks that don't scale well, you may need to spend more time on them in a crisis. For example, in the event of a site outage you might suddenly have a deluge of support emails. Having support tickets be automatically created would save you n*60 seconds of copy/pasting.
I remember interviewing for a position at a company I was already at, moving from one department to another.
I'd gotten a new manager in the old department, and I don't think I'd had enough time to get over an initial bad impression. I was the night shift guy, on a help desk, and the call volume was generally low. To help better justify the use of my time, I was tasked with some additional duties, like processing account delete requests and other tedium.
My manager at the time found fault with the fact that I had written a series of scripts to completely automate the additional tasks I'd been assigned, and considered me lazy.
In the interview, I was asked the question "How do you respond to the accusation that you sometimes cut corners?"
My answer was something like "I think cutting corners is a good thing. If there's not a requirement for square corners, and the rounded corners don't impact the quality or integrity of the work being performed, and if the corners aren't irreparably damaged in the process, I think that taking the more direct path of least resistance is not a bad thing."
I got the job, and I was later told that the manager was very impressed with the answer.
Uhhh, that was a roundabout way of saying "I agree."
Nice article. It could be improved if the author collected numerous, diverse lottery result articles and used these to create a script that outputs randomized articles.
This task screams for DSLs, especially on the generation end. Doing this in PHP directly (or most other general purpose languages) encourages too little variation because adding alternatives is relatively heavyweight. Writing a good DSL that makes it easy to offer more alternatives in a single template will make it much easier to produce something even less distinguishable from a human report.
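One shape such a DSL could take -- and this is a sketch of the idea, not what the author wrote -- is a spintax-style grammar where `{a|b|c}` marks interchangeable alternatives. Adding a new phrasing becomes a one-character edit to the template instead of a new code path:

```python
import random
import re

# Matches the innermost {a|b|c} group (no braces inside), so nested
# alternatives are expanded from the inside out.
GROUP = re.compile(r"\{([^{}]*)\}")

def spin(template, rng=random):
    """Repeatedly replace the innermost {a|b|c} group with one random option."""
    while True:
        m = GROUP.search(template)
        if not m:
            return template
        choice = rng.choice(m.group(1).split("|"))
        template = template[:m.start()] + choice + template[m.end():]

template = ("{The winning|Saturday's} Powerball numbers "
            "{were|came up} 5, 12, 23, 38 and 41{.| -- check your tickets.}")
print(spin(template))
```

With enough alternatives per slot, the number of distinct outputs grows multiplicatively, which is exactly what you want when near-duplicate articles get rejected.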
I think you are missing the point. What's remarkable is that this person is combining basic hacking skills with a completely different career, not the cleverness of the design. Not everyone has to be a poet but amazing societal changes happen when everyone can read and write.
Further thought: Actually, not being a programmer reinforces my original DSL point, not contradicts it. Part of the purpose of a DSL is to reduce as much as possible the "programming" part of the task so the domain expert can concentrate on what needs to be done.
I never said "this guy should have written a DSL instead" -- that would have been an asshole thing to say. I said that this task screams for a DSL, and that's only more true if this guy isn't a programmer.
Don't worry; some people don't understand threaded conversation, and think that every reply to something has to "continue the conversation" of it.
EDIT: Okay, I'll just repost a comment from about a month ago here, that was upvoted instead of downvoted and yet made exactly the same point:
This is a threaded comment system. We can have as many discussions about something (post or other comment) as we want: go off on wild tangents, point out the spelling, have a pun thread, mention patterns of blogging/commenting the parent fits into, reply to the author on a separate subject, share anecdotes related to the subject of the post, and actually talk about the content of a post or comment, all at the same time, without breaking anything. That's what's so neat about threaded discussion: it doesn't require the "comparative notability" that a linear conversation needs in order to function.
In this case, we can have a discussion about combining hacking skills with a completely separate career, and then have a tangential discussion about using DSLs to generate text, with neither conversation interfering with the other. No one has to be "missing the point," and in fact jerf could be contributing elsewhere in the discussion alongside his creation of this tangent.
Hey, I'm the blog author. You're right -- I'm just so used to using cURL for more complicated requests that the simpler solution skipped my mind. I'll update.
You may know this already, but a lesser-known feature of PHP is that you can pass a [stream context](http://php.net/manual/en/function.stream-context-create.php) as an optional argument to most file operations. This gives you fine-grained HTTP control (POST, headers, etc.) while still using `file_get_contents` and friends.
This is a classic SEO content-generation move, called "mad lib sites". You create a templated article, with variables for each piece of dynamic content. Usually, you will also create "spun" content so that each article created with the madlib template is even more unique.
Then, you can scrape or find large databases of consistent information and deploy very large sites.
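The madlib pattern combines both tricks described above: fill `[variables]` from a scraped record, then resolve `{a|b}` spin groups so that pages built from the same template still differ. A minimal sketch (the template, field names, and data are all invented for illustration):

```python
import random
import re

# Only brace groups containing '|' are treated as spin choices.
GROUP = re.compile(r"\{([^{}|]*\|[^{}]*)\}")

MADLIB = ("{Wondering about|Curious about|Researching} the [fund] fund? "
          "Its expense ratio {sits at|is} [ratio]%.")

def render(template, row, rng=random):
    # First, fill [variables] from the scraped record...
    for key, value in row.items():
        template = template.replace(f"[{key}]", str(value))
    # ...then resolve each {a|b} spin group so copies of the page differ.
    while (m := GROUP.search(template)):
        template = (template[:m.start()]
                    + rng.choice(m.group(1).split("|"))
                    + template[m.end():])
    return template

print(render(MADLIB, {"fund": "VFIAX", "ratio": 0.04}))
```

Run once per row of the database and you have a "very large site" in minutes; whether the pages rank is the separate problem discussed next.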
The trouble is getting Google to fully index these sites. It requires a good amount of link building, both to the madlib pages and to the home page, to get enough juice for the crawlers to spend time on the site and get things indexed.
They can be very useful sites to build for a variety of reasons, and can actually add some value, depending on the data you're publishing.
Microsoft Word can already do this kind of document generation from an Excel spreadsheet (mail merge). Of course, it's more complicated than a purpose-built app would be, but probably also more powerful.
Here's a description of how MS Word and MS Excel were actually used to create 4,600 parameterized web pages, complete with samples of the Excel and Word documents used.