Hacker News
How I automated my writing career (oreilly.com)
142 points by RobbieStats on Nov 3, 2011 | 40 comments



That was a very inaccurate headline on the story.

Our software can create eight paragraphs now, but is it possible to create eight chapters' worth of content? The answer is "yes," but not quite the same kind of technical books I used to write, at least right now.

...which is why the previous paragraph says "[b]ecause I've been so focused on running Automated Insights, I haven't had time to write any new books recently."

That is a variant on the Calvin and Hobbes bed-making robot[1]. I strongly suspect he will end up like Calvin, with something that doesn't work as planned. If he is lucky, like Calvin he will find he accomplished his goal, but discover he started with the wrong goal in mind.

[1] http://books.google.com/books?id=NV4WEqQtvTYC&pg=PA126&#...


The headline was a lede to draw you in... which it did. The reality is that he has created really interesting technology that automates content, which is really freaking cool. As for him becoming like Calvin and developing something that won't "work as planned": from the article, it seems the plan wasn't and isn't to replace himself, but rather to replace some of the lower-value pieces of content. The human/machine interplay is an important part of the whole concept.


Just a prediction here, so take it for what it's worth, but my hunch is that the profession of writing is going to fragment into more and more subsets in the coming years. At the low end of the totem pole -- what we'll call "low-value" writing -- will be the sorts of articles that software can eventually automate. Things like news updates, information dumps, how-to pieces, lists, summaries, and so forth. Much of what traditional journalism would call "news" stories, and what magazine journalism would call "informational" pieces, fall into this bucket. In these sorts of articles, substance is more important than style. These pieces are all about the facts, or summations of the facts. Or, in the case of content farms, they're about relaying and recombining information in endless mixes, using provocative headlines. You don't need a Pulitzer-caliber author to crank these out. Hell, pretty soon you won't even need a human to crank them out. It's no surprise that this type of writing doesn't pay well, because frankly, it's the fast food of journalism. It's cheap, it's disposable to the consumer, and so it pays cheaply.

On the other hand, higher-value writing will be that which isn't easily automated, and for which style is every bit as important as substance. Fiction (good fiction, at least), features, human-interest stories, editorials (especially those relying on expertise), and so forth. This will be the kind of writing that either pays crap, or pays big, depending on the writer's skill level -- and his or her ability to build a market or following for it. There will always be a need for this kind of writing, and until such time as software AI becomes genuinely creative, it'll be very hard to automate the highest-quality, most interesting, and most innovative stuff.

Low-value writing will, if anything, see its value decline even further. It is the equivalent of the man on the assembly line who can be replaced by a tireless, hyper-efficient machine. High-value writing will not, on average, find itself paying more handsomely than it used to. It will still be a high-variance profession. But it will be what remains for professional writers in the age of content farms, automated news, social networking, and so forth.

Essentially, the way to earn a decent living in the future will be: 1) be damned good, 2) build and maintain a following, 3) differentiate yourself, and 4) produce at high volume.


Fiction, at least, has been well-studied and broken into defined components. All stories follow a predictable arc: exposition, conflict, climax, conclusion, and denouement [1]. Use well-tested plot devices to move through the story arc [2]. Define a skeleton framework of what kind of plot devices go where in the story. Have a database of scene location descriptions. Have a database of character stereotypes. Make your algorithm mix and match them to fill in the story skeleton.

Seen this way, A New Hope and Raiders of the Lost Ark are kind of similar.

Plot: Rescue the princess/my father

Hero: young/grizzled adventurer

Plucky sidekick: loyal robots/Professor

Ally: Semi-sleazy smuggler

Scene: Fascist Empire spaceship/Nazi North Africa

Bad guys: Stormtroopers/Nazis

Main villain: black-clothed mystical bad Jedi/Archeologist

Heck, using this format, you could make your own OWS drama:

Plot: Rescue my bankrupt mother

Hero: idealistic hippy

Plucky sidekick: loyal golden retriever

Ally: Semi-sleazy drug dealer

Scene: Urban city streets

Bad guys: city police

Main villain: black-suited bad finance executive

I would call it "Wall Street Raiders" or "Wall Street: A New Hope".

Yes, FWIW, I've been contemplating coding just such a story-generating engine. Like Mad Libs on steroids.
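A minimal sketch of the "Mad Libs on steroids" idea in Python. The slot fillers are taken from the two lists above; the data layout and the names `SLOTS`, `generate_outline`, and `render` are assumptions made up for illustration, not anyone's actual engine.

```python
import random

# Slot fillers copied from the examples above. The structure itself is a
# hypothetical layout chosen for this sketch.
SLOTS = {
    "plot": ["Rescue the princess", "Rescue my father", "Rescue my bankrupt mother"],
    "hero": ["young adventurer", "grizzled adventurer", "idealistic hippy"],
    "sidekick": ["loyal robots", "Professor", "loyal golden retriever"],
    "ally": ["Semi-sleazy smuggler", "Semi-sleazy drug dealer"],
    "scene": ["Fascist Empire spaceship", "Nazi North Africa", "Urban city streets"],
    "bad_guys": ["Stormtroopers", "Nazis", "city police"],
    "villain": [
        "black-clothed mystical bad Jedi",
        "Archeologist",
        "black-suited bad finance executive",
    ],
}

LABELS = {
    "plot": "Plot",
    "hero": "Hero",
    "sidekick": "Plucky sidekick",
    "ally": "Ally",
    "scene": "Scene",
    "bad_guys": "Bad guys",
    "villain": "Main villain",
}


def generate_outline(rng):
    """Pick one filler per slot to fill in the story skeleton."""
    return {slot: rng.choice(options) for slot, options in SLOTS.items()}


def render(outline):
    """Render the outline in the same 'Label: value' form used above."""
    return "\n".join(f"{LABELS[slot]}: {outline[slot]}" for slot in SLOTS)


if __name__ == "__main__":
    print(render(generate_outline(random.Random())))
```

Passing in a `random.Random` instance keeps the picks reproducible when a seed is supplied. Nothing here checks that the picks make sense together; that is where the real work would be.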

[1]http://en.wikipedia.org/wiki/Dramatic_structure

[2]http://en.wikipedia.org/wiki/Plot_device


Sure, all stories rely on formulas. That's nothing shocking or new. Even the ancient Greeks wrote about how all stories break down into a handful of basic formulas. You could get a computer program to piece together a plot from time-honored tropes, archetypes, templates, and act-based story structure.

Where things get harder, and require a human touch, is in making things interesting. Setting, to a degree. But most important, characters. Compelling, human, relatable, challenging, interesting, flawed, dynamic characters. These elements are much harder to paint by numbers.

Also: style. Very hard to computer-generate an innovative and unique style.

I'm not saying these things can't be done, or never will be done, by algorithm. If modern history has taught us anything, it's that we can never underestimate what technology will make possible. For the foreseeable future, however, I'm still banking on a need for human fiction writers.


I agree that human reviewers will be needed, especially in the beginning, to both fix up the output stories and to iteratively improve the engine.

But, I think that the engine could definitely pump out and use characters that are interesting. For example, let's say we give the character 1-n flaws. Drawn from our list of flaws, we could pick "Abandonment issues", and that flaw could have modifiers like "abandoned by parents at birth", "abandoned by crew on an island", or "abandoned by creators to shovel trash into little cubes". That flaw could then be used to pick certain plot devices to be used at certain points in the story skeleton.

Characters drive stories and we've just made the basis for Luke Skywalker, Captain Jack Sparrow, and Wall-E.
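A rough Python sketch of the flaw-with-modifiers idea, assuming a single flaw per character for brevity. The flaw and its modifiers come from the paragraph above; the plot devices attached to each story stage are invented placeholders, as are the names `FLAWS`, `make_character`, and `devices_for`.

```python
import random

# The flaw and modifiers are from the comment above. The plot devices and
# stage names are hypothetical examples, not a vetted catalogue.
FLAWS = {
    "abandonment issues": {
        "modifiers": [
            "abandoned by parents at birth",
            "abandoned by crew on an island",
            "abandoned by creators to shovel trash into little cubes",
        ],
        "plot_devices": {
            "exposition": "hero is shown alone, doing routine work",
            "conflict": "hero refuses to trust a new ally",
            "climax": "hero must choose whether to leave someone behind",
        },
    },
}


def make_character(name, rng):
    """Give the character one flaw and one modifier for that flaw."""
    flaw = rng.choice(sorted(FLAWS))
    modifier = rng.choice(FLAWS[flaw]["modifiers"])
    return {"name": name, "flaw": flaw, "modifier": modifier}


def devices_for(character):
    """Return the plot device the flaw selects for each skeleton stage."""
    return FLAWS[character["flaw"]]["plot_devices"]


if __name__ == "__main__":
    hero = make_character("Hero", random.Random())
    print(hero["modifier"])
    for stage, device in devices_for(hero).items():
        print(f"{stage}: {device}")
```

Extending this to 1-n flaws would mean merging the per-stage devices from several flaws, which raises the question of what to do when two flaws want the same stage.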

Also, I agree that it will take a little while to define the different cadences and vernaculars used in, for example, a Fairy Tale vs. a Crime Noir story, but I do think it is possible to make such definitions. "Once upon a time in a land far, far away, ...." vs. "On a dark and stormy night, she burst into my office with legs a mile long...."


The problem with this approach is that it will, at best, churn out formulaic and derivative product. All fiction relies on some degree of formula, but good fiction is not formulaic. Good fiction breaks new ground, finds new twists on old tropes, invents new tropes outright, or invents new worlds and new personas in breakthrough ways. Good fiction, like all good art, is inherently creative. It creates what didn't exist before. While all art has its influences, its processes, its references, and its tropes, it is not a simple recombination of these things.

Again, I'm not saying a program will never turn out great, or even decent, fiction. But I think that day is far off.

In the interim, we already have what could be classified as -- albeit in a very loose sense of the word -- cooperative fiction by man-machine collaborations. Writers use programs like Final Draft and Scrivener to help them keep track of characters and plot points, notes, formats, outlines, etc. It's not a huge stretch to imagine future iterations of these programs that offer algorithmically-guided notes on the structure, settings, consistencies and inconsistencies, etc., as the author is writing or editing. In the same way that a word processor runs automatic spell check, these programs might someday run automatic plot check. (Imagine some dystopian, futuristic version of Clippy: "It looks like you're writing a science fiction novel; do you want some structural templates?")


Could this be done in reverse by software? Instead of a story-generating engine, automated identification of the plot line and character profiles: basically, automatically building the database you mentioned by feeding in already-written fiction. I think the larger business may be in building such a repository. A story-generating engine will commoditize fiction creation.


I think that there are far fewer plot devices, character stereotypes, and background scenes for fiction than you expect. There are enough though to have, at least, 100s of millions of possible combinations. And those combinations are where the "creative" novelty of fiction novels arises.

For example, let us suppose that at the end of Braveheart, instead of Mel Gibson getting disemboweled, Optimus Prime flies down out of the sky, kills the executioner, saves Mel Gibson, and lays waste to the English forces, thus allowing Mel Gibson to have his revenge and to ride off into the sunset with the English queen. Creative? Sure. Formulaic? Yes!

I think that a few Google searches would reveal lists of the majority of the different story components.
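The "100s of millions of combinations" estimate is easy to sanity-check with quick arithmetic. The per-component counts below are made-up placeholders, not measured figures; the point is only that fairly small lists multiply out quickly.

```python
import math

# Hypothetical sizes for each component list. These are assumptions chosen
# for illustration, not counts from any real catalogue of story components.
component_counts = {
    "plot": 10,
    "hero": 20,
    "sidekick": 20,
    "ally": 20,
    "scene": 30,
    "bad_guys": 10,
    "villain": 20,
}

# One pick per component, so the total is the product of the list sizes.
total = math.prod(component_counts.values())

if __name__ == "__main__":
    print(f"{total:,} possible combinations")
```

With these placeholder sizes the product is 480,000,000, so lists of only 10 to 30 entries each are already in the range the comment describes.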


I wrote a project proposal for this (though I went with a different project) and there's some existing work. Email me and I'll send it to you.


This sounds really interesting. If you do this, can you post your progress and results?


Sure.

I've been thinking of taking a stab at Romance novels first because they tend to be very formulaic and are a surprisingly profitable fiction niche.


I think Mills & Boon titles are probably generated by a very simple grammar: The [Nationality] [High-powered Profession]'s [Adjective for Sexy] [Synonym for Lover]. E.g., The Turkish Millionaire's Sultry Mistress, The North Korean Cadre's Sleazy Hooker, etc.
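That grammar is small enough to sketch directly in Python. The four-slot template is from the comment above; the word lists (apart from the first entry in each) and the names `GRAMMAR` and `make_title` are assumptions for illustration.

```python
import random

# One list per slot in the template. Only the first entry of each list comes
# from the example above; the rest are invented fillers.
GRAMMAR = {
    "nationality": ["Turkish", "Greek", "Italian"],
    "profession": ["Millionaire", "Tycoon", "Surgeon"],
    "adjective": ["Sultry", "Secret", "Reluctant"],
    "lover": ["Mistress", "Bride", "Sweetheart"],
}

TEMPLATE = "The {nationality} {profession}'s {adjective} {lover}"


def make_title(rng):
    """Expand the template by picking one word per slot."""
    picks = {slot: rng.choice(words) for slot, words in GRAMMAR.items()}
    return TEMPLATE.format(**picks)


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        print(make_title(rng))
```

With three words per slot this yields 81 distinct titles.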

But I think your project could be AI-hard - to write a compelling story you'd need to have a virtual world with virtual characters with virtual psychologies. You'd need to be able to model complicated stuff like "Romeo knows Juliet thinks he is already dead", and simple stuff like "If Romeo is under Juliet's balcony, he can't see her until he climbs up". Interesting project though.


Good point. I wonder if you could specify pre-/post-requirements for plot devices to make the scenarios you described easier to handle.
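One way to picture pre-/post-requirements is as sets of facts, similar in spirit to STRIPS-style planning operators (preconditions, facts added, facts removed). The sketch below covers only the simple balcony case; the class and fact names are hypothetical, and it makes no attempt at nested beliefs like "Romeo knows Juliet thinks he is already dead".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlotDevice:
    name: str
    requires: frozenset  # facts that must hold before the device can be used
    adds: frozenset      # facts that become true afterwards
    removes: frozenset = frozenset()  # facts that stop being true


def applicable(device, state):
    """A device is usable when every required fact is in the world state."""
    return device.requires <= state


def apply(device, state):
    """Return the new world state, or fail if a requirement is missing."""
    if not applicable(device, state):
        missing = sorted(device.requires - state)
        raise ValueError(f"{device.name!r} is missing: {missing}")
    return (state - device.removes) | device.adds


climb = PlotDevice(
    name="Romeo climbs the balcony",
    requires=frozenset({"romeo under balcony"}),
    adds=frozenset({"romeo on balcony", "romeo can see juliet"}),
    removes=frozenset({"romeo under balcony"}),
)
talk = PlotDevice(
    name="Romeo and Juliet talk face to face",
    requires=frozenset({"romeo can see juliet"}),
    adds=frozenset({"they have spoken"}),
)

if __name__ == "__main__":
    state = frozenset({"romeo under balcony"})
    print(applicable(talk, state))   # talk is blocked until Romeo climbs
    state = apply(climb, state)
    state = apply(talk, state)
    print(sorted(state))
```

A generator could then only place a device in the skeleton where its requirements are satisfied by the devices that came before it.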


I think Hollywood has been using a program like this to generate movies for the last 5 years at least.


I think this is going to be a case of every writer needing a brand and following. People aren't going to be hired as much for their ability to churn out automatable articles, but for the audience they can bring with them due to their style and skills. An example would be Jason Fried's or Joel Spolsky's column in Inc. I will go to the Inc. site just to read their articles because I respect their opinions, they have expertise, and they write about subjects that I care about. If their columns moved somewhere else, I'd read them there and maybe read other stories that catch my eye. They are valuable because of their audience.


To be honest, I believe this was true in the past as well. People would read one specific newspaper because of one journalist. Somewhere along the way people stopped doing this, and it has become easier to observe with the advent of the Internet, where anyone can become a writer. I really do hope we'll get back to quality-focused writing, because at the moment there's not a newspaper that has figured this out. They're all stuck in the past.


If you consider the current state of the art in natural language processing and the current trends, basically the contrary of jonnathanson's prediction will happen. Machines have trouble automating content but are good at organizing it. Therefore low-value writing will feed automated high-value writing. The future will probably be: produce in high volume (that might come from social media), with editors to refine and limit what gets to the front page.


"How I didn't do what I claimed in the title, really, but I'm going to use this title anyway because it's bound to get clicks and upvotes"

cynic


Slightly off topic, but when I was writing The Geek Atlas one of the things I did was keep metrics about my writing so that I knew where I was, and then I used those metrics to predict the book's delivery date to O'Reilly, and measure how I was doing against the required delivery date.

This was all done in a spreadsheet and it enabled me to see whether I was ahead or behind on my writing. Turned out to be very, very useful.


What metrics did you use?


The Geek Atlas consists of 128 similarly sized 'chapters' (one for each place) so I had a number of key metrics:

1. Number of chapters completed. A very gross progress bar that I could use to get a rough estimate of when I would deliver.

2. Words per chapter. I used this to determine if I was changing the length of the chapter without realizing (which did happen) and correct for that so that the book would be consistent.

3. Hours per chapter. I used this to test my writing speed and work out more accurately when I would be done and also how many hours I could allow per chapter.
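For anyone who would rather script this than use a spreadsheet, here is a minimal Python sketch of the three metrics. The 128-chapter total is from the comment above; the straight-line extrapolation, the function names, and the sample numbers are assumptions, not the author's actual spreadsheet.

```python
from datetime import date, timedelta

TOTAL_CHAPTERS = 128  # from the comment above


def projected_delivery(start, today, chapters_done, total=TOTAL_CHAPTERS):
    """Metric 1: extrapolate the finish date from chapters completed so far."""
    if chapters_done <= 0:
        raise ValueError("need at least one finished chapter to extrapolate")
    days_per_chapter = (today - start).days / chapters_done
    remaining = total - chapters_done
    return today + timedelta(days=round(days_per_chapter * remaining))


def words_drift(words_per_chapter, window=5):
    """Metric 2: recent average chapter length relative to the overall average."""
    overall = sum(words_per_chapter) / len(words_per_chapter)
    recent = words_per_chapter[-window:]
    return (sum(recent) / len(recent)) / overall


def hours_remaining(hours_per_chapter, total=TOTAL_CHAPTERS):
    """Metric 3: estimated hours left, using the average hours per chapter."""
    average = sum(hours_per_chapter) / len(hours_per_chapter)
    return average * (total - len(hours_per_chapter))


if __name__ == "__main__":
    start = date(2008, 1, 1)  # hypothetical start date
    today = start + timedelta(days=64)
    print(projected_delivery(start, today, chapters_done=32))
    print(words_drift([1000, 1000, 2000, 2000], window=2))
    print(hours_remaining([2.0, 4.0], total=10))
```

A drift value well above or below 1.0 would flag that chapter length is changing without the writer noticing, which is the problem metric 2 was meant to catch.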


On an unrelated note, if you highlight a portion of the article, you can listen to it. Very useful for listening to articles while working, instead of listening to music, or just to rest your eyes.

It is powered by http://www.readspeaker.com and AFAICT is the best-sounding text-to-speech implementation I have heard so far.

If you just listen to the text, it sounds like a human news reader, much better than Siri. Wow. And the cherry on the top is that it highlights the text which it is reading as it's being read.


I tried a little automated journalism a while back and wrote a blog post about my code:

"I wrote this article with one mouse click" http://coding.pressbin.com/60/I-wrote-this-article-with-one-...

There are a whole bunch of little things that go into play with something like this that you just don't think much about until you try it -- stuff like subject/verb agreement, when to use figures and when to spell out numbers, etc.
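To give a flavor of those little things, here is a small Python sketch covering two of them. It assumes one common convention (spell out whole numbers below 10, use figures otherwise) and a naive singular/plural switch; the function names are made up for this example and are not from the linked post.

```python
SMALL_NUMBERS = [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]


def format_number(n):
    """Spell out 0-9; use figures with thousands separators otherwise."""
    if 0 <= n < 10:
        return SMALL_NUMBERS[n]
    return f"{n:,}"


def agree(count, singular, plural):
    """Pick the singular or plural form based on the count."""
    return singular if count == 1 else plural


def describe(count, noun_singular, noun_plural):
    """Build a sentence where the noun and the verb both agree with the count."""
    number = format_number(count).capitalize()
    noun = agree(count, noun_singular, noun_plural)
    verb = agree(count, "was", "were")
    return f"{number} {noun} {verb} scored."


if __name__ == "__main__":
    for n in (1, 3, 12):
        print(describe(n, "goal", "goals"))
```

Real style guides have many more exceptions (ages, scores, sentence-initial numbers), which is exactly the kind of detail that piles up.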


What's Google's take on this? Are they for or against automated content creation? Will they be kicking these sites out of search or letting them stay?


Regardless of whether they are for or against it, they would have a tough time detecting it in the first place. In theory, their detection algorithm would have to be at least as smart as the StatSheet Algorithm.

This also differs from most forms of automated content creation in that the underlying data is recent and newsworthy as opposed to simply reworking text found elsewhere on the internet.


This is one example he gives of his automated writing: "Second-seeded North Carolina was defeated in the Elite Eight with a 76-69 loss to fourth-seeded Kentucky in the Regional Finals in Newark."

That's perfectly serviceable. But it makes me wonder...what is the point of this? Not his automated-writing tool, but why are we putting what was meant for a statistical/symbolic graphic into sentence form?

This kind of writing is only possible with the collection of discrete datapoints: the date, the score, the participants, and the location. From there, you can do any kind of variation of subject-verb etc., even adding adjectives if the point spread is high.
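A minimal sketch of that data-to-sentence step in Python, assuming the discrete datapoints are already in hand. The margin thresholds and verb choices are arbitrary assumptions, and this is not how the article's software works internally; it only illustrates varying the wording by point spread.

```python
def recap(winner, loser, winner_score, loser_score, round_name, city):
    """Turn a final score into one sentence, with the verb chosen by margin."""
    margin = winner_score - loser_score
    if margin <= 0:
        raise ValueError("winner_score must be greater than loser_score")

    # Hypothetical thresholds: a bigger point spread gets a stronger verb.
    if margin >= 20:
        verb = "routed"
    elif margin >= 10:
        verb = "comfortably beat"
    elif margin <= 3:
        verb = "edged"
    else:
        verb = "defeated"

    return (
        f"{winner} {verb} {loser} {winner_score}-{loser_score} "
        f"in the {round_name} in {city}."
    )


if __name__ == "__main__":
    print(recap("Kentucky", "North Carolina", 76, 69, "Regional Finals", "Newark"))
```

Fed the numbers from the quoted example, the seven-point margin falls into the default bucket and the sentence uses "defeated".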

So we're taking data and turning it into a less efficiently readable form. It's no fault of the auto-writer, of course; that's just how we are taught to read and write. Someday, we may move towards a society in which other forms of communication, particularly visual ones, are as commonplace. [insert your own Tufte-inspired rant here]


> why are we putting what was meant for a statistical/symbolic graphic into sentence form?

Because many people (and other important constituents, like Googlebot) cannot read graphs and so perceive a graph as having zero value but a paragraph telling 1/10th the story of the graph as having positive value.

This is hardly the only "inferior form factor dominates because of ease of consumption" thing out there. For example, rather than look at an unemployment graph or read a paragraph beginning with "Government figures released earlier today say that unemployment is the highest it has been since 2004", most of the world prefers someone whose professional competence is looking pretty to read "Unemployment is up" to them for approximately 4 seconds before cutting to commercial.


I loved the 3rd bullet point: "Software doesn't get bored and start wondering how to automate itself."

I'm not sure if this will ever be applied to non-data-driven fields, but this is still extremely cool.


"A common, and funny, question I get from journalists is: "when will you automate me out of a job?" I find the question humorous because built into the question is the assumption that if our software can write the perfect story on a particular topic, then no one else should attempt to write about it. That's just not going to happen."

It only takes one misguided and uninformed manager to fire good writers, thinking that they can be replaced with an army of computers, only to find that the product is now crap. Damage will have been done.

The example I'm thinking of? Square-Enix firing their developers and outsourcing core development to China. The result? Crappy games. (In a humorous twist, they've since been asking the very developers they fired to come back and work for them.)


It makes sense that the sports genre was chosen. With scores, winners, teams, tournaments and the like all being mentioned in pretty much every article out there, it stands to reason that it would be fairly easy to parse them all and get good data. I'd imagine tech and celebrity writing would also work well.

Political stories have such a wide range of views, this approach would produce gibberish until you sort out all the articles on a left-right scale.


> Political stories have such a wide range of views, this approach would produce gibberish until you sort out all the articles on a left-right scale.

I'd hope we need more scales than that.


If your writing style happens to be very repetitive and templatized, then yes, you've automated your writing career.

A more likely scenario for applying this tech to journalism would be providing filler paragraphs around the more substantive prose banged out by a human journalist. That way the journalist doesn't have to write as much or spend time pulling tedious raw data into the story.


A friend of mine was working on an automated storytelling system for Nethack as his Master's (Linguistics & CS/AI, IIRC) thesis.

It was never really completed, but there was some interesting work in applying goal-based planning AI in reverse to generate possible long-term motivations for individual actions.

I don't think it's available online though, sadly.


Funny how articles about computer-generated prose are never computer-generated.


Would like to see an example of this - even if it's not all that impressive - applied to some non-data-intensive area, i.e., someplace other than reporting (sports, finance, etc.).


An INSEAD professor built an automated system to do something similar, though the quality of the output leaves something to be desired: http://www.nytimes.com/2008/04/14/business/media/14link.html...


If you're interested in this guy - Philip M. Parker - check out his list on Amazon (http://www.amazon.com/gp/search/ref=sr_nr_i_0?rh=k%3APhilip+...), currently standing at 111k+ "books" published.

From a skim of his titles, it seems like there are different series of his books - from the "The Official Patient's Sourcebook on $disease" series ($28.95) to the "The 2007-2012 World Outlook for $niche-industrial-product" ($795.00) series.


I stand unimpressed. I am still waiting for "How I automated my reading hacker news"...


Just like you use a VCR to watch your favourite TV shows for you.



