The Controlled Natural Language of Randall Munroe's Thing Explainer [pdf] (arxiv.org)
176 points by tkuhn on May 10, 2016 | 91 comments



I thought Thing Explainer is a fun experiment and a delightful book, but as an attempt (and it isn't a serious attempt) to use only simple and super-commonly-used language, it doesn't hit the home base. As I speak English as my second language I'm acutely aware of this.

Thing Explainer uses the 1000 most commonly used lemmas, but words have multiple senses, and some of them are commonly used and some are not. From the viewpoint of a language learner, an unfamiliar use of a word might as well be another word. (Of course they might have a clear semantic connection, which helps guessing.)

Another thing is that phrasal verbs and set phrases are essentially vocabulary items too – you can't decode them using only extralinguistic knowledge (that is, knowledge about the world).

Randall Munroe developed a text editor that highlights any words outside his word list to help with writing the book, but I think an editor that could handle word senses and multi-word phrases would be a formidable thing. Of course it needs much more high-level NLP, word sense disambiguation and such. (Possibly impossible to pull that off cleanly with the current level of tech?) I'd love to see one.
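To make the gap concrete, here is a minimal sketch of the word-list kind of checker, not the actual Simple Writer code. The `ALLOWED` set and the `flag_words` name are made up for illustration, and the set is a tiny stand-in for the real list. It shows why a plain list lookup cannot catch phrasal verbs: every word in "put up with" passes on its own.

```python
import re

# Hypothetical stand-in for the book's word list (the real one has about 1,000 entries).
ALLOWED = {"the", "a", "is", "big", "thing", "he", "put", "set", "up", "with", "upon", "it"}

def flag_words(text, allowed=ALLOWED):
    """Return the words in `text` that are not on the allowed list, in order."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if w.lower() not in allowed]

print(flag_words("The rocket is a big thing"))  # flags only "rocket"
print(flag_words("He put up with it"))          # flags nothing, though "put up with" is its own vocabulary item
```

Handling senses and multi-word phrases would need a lookup keyed on more than the single token, which is where the harder NLP comes in.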


This is also a common complaint about Simple English books written for English learners. The books claim to have a restricted vocabulary, but they cheat by using phrasal verbs.

Their vocabulary includes verbs like put and set along with prepositions like up, with and upon. You can combine these to generate an enormous number of phrasal verbs like put up with, set upon and so on, which are normally considered to be separate vocabulary items.
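A rough sketch of how fast the combinations grow, assuming a made-up handful of verbs and particles. The lists below are illustrative only, and most of the generated strings are not real phrasal verbs; the point is just the size of the space that a word-level list silently permits.

```python
from itertools import product

# Illustrative lists only; a real learner vocabulary would have many more of each.
verbs = ["put", "set", "get", "take"]
particles = ["up", "with", "upon", "off", "out"]

# Verb + one particle, e.g. "set upon".
two_word = [f"{v} {p}" for v, p in product(verbs, particles)]

# Verb + two different particles, e.g. "put up with".
three_word = [f"{v} {p} {q}" for v, p, q in product(verbs, particles, particles) if p != q]

print(len(two_word), len(three_word))  # 20 candidates and 80 candidates from just 9 words
```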


As a person whose first language is not English, I find phrasal verbs one of the most challenging parts of the language. In fact I just searched set upon and was greatly surprised that it means attacking violently! I would have never guessed that from the words alone.


In this case the phrase is a more artistic way of describing the beginning of a fight. Think of "John set upon Mike angrily" as John set, or selected, Mike as his target.

It's also a great example of the topic at hand because the same phrase could be used like "The book was set upon the table." Describing a book that was placed on a table.

Each version uses the same common vocabulary, to describe wildly different things.


Or phonetically similar / by moving a space:

"John was set up on a date" - someone arranged a romantic encounter for John. "John was setup on a date" - on some specific day John was assembled. "John was set upon a date" - John assaulted a piece of fruit.


> "John was setup on a date" - on some specific day John was assembled.

I read this one as: A romantic encounter was the guise by which John was framed for a crime.


Nice try :-), but there's a reason why we write "upon" without a space. In normal speech the prosody makes it perfectly clear that "upon" is one word, rather than two.


My read of the last one is that it means "John was set upon [getting] a date"


Who is this romantic android enemy of fruit?


John.


The last one made me laugh but you need to remove the 'was'.


No, it still works. Someone else has caused John to assault the fruit. "The guard dog was set upon an intruder, while John was set upon a date."


Doesn't the "was" change who the action applies to? E.g. "John assaulted a date" vs "John was assaulted by a date"? I'm not entirely sure "John was set upon a date" makes sense. Then again, I've never been great with the specific rules of English, even though it's my first language (likely because it's my first language, and I learned it not as rules, but through immersion as a child).


"Set upon" can mean either "to begin to attack" or "to cause to begin to attack", although in the latter case the target of attack goes between "set" and "upon":

The dog set upon the cat. (began to attack the cat)

The person set the dog upon the cat. (caused the dog to begin to attack the cat)

The second meaning could make sense in the original sentence with John being the attacker, a date (fruit) being the target of attack, and an unspecified party being the one who caused John to attack. In this case there is a past-tense passive with "was" + the past participle of the verb (like "was liked", "was seen", "was taken"), but the past participle of "set upon" is identical to the present form, so "was set upon" means either "was attacked" or "was caused to attack". But only the second meaning is plausible when followed by "a date", as opposed to "by a date".

"John was set upon a date" → John was caused to attack a date

"John was set upon by a date" → John was attacked by a date

This reminds me that phrasal verbs really are one of the trickiest things in English. Some of my non-native speaker friends have several books just about this topic, because it's so subtle and pervasive in English.


Thank you for the clarification. I suspected the "by" portion might be key, mainly because it felt like it was missing (and thus a requirement). That it helps distinguish the meaning makes sense.

I really should spend some time to research the mechanics of my native tongue, rather than rely on what sounds right and the simplistic rules I can remember from primary school. This has been on my mental to-do list at different times over the years, but I always seem to have it de-prioritized and then I forget about it. :/


Just remember that language isn't truly normative - in general, one of the best tests for "correctness" is "appears in speech by native speakers". Which is somewhat complicated by the existence of written language. In English, Shakespeare is somewhat famously the inventor of a number of idioms that would probably have seemed strange at the time, but are now ingrained in the English language, see e.g.:

"You're quoting Shakespeare" - Rob Brydon reveals popular Shakespeare phrases in everyday use: https://www.youtube.com/watch?v=Ig6f5fT0Xho

Or, "My Shakespeare - a new poem by Kate Tempest": https://www.youtube.com/watch?v=i_auc2Z67OM


Sure. Another analogy without the phrasal verb might be "give", which has an indirect object:

John has given a date. (he stated when something would happen, or he donated a fruit to someone)

John was given a date. (he received a fruit as a gift)

John was given by a date. (most likely interpretation is that someone's romantic partner, maybe John's, nominated John for some position or role)


GP means it in the sense of "The attack dog was set upon the intruder" (e.g. by its master).


Or, alternatively, it means that John was placed on top of a fruit.


"setup" is a noun form of "set up"


Whilst this ambiguity is a pain for those learning the language and for clarity, it's also one of the things which makes English a great language for puns/wordplay.

I suspect that having a reduced vocabulary would likely increase the chances of ambiguity, so could be a great source of CNL puns.


Or, John was set upon going to the movies tonight, but his friend refused to join him.

Admittedly, that sounds a bit old-fashioned.


Honestly any use of upon sounds old-fashioned. It's generally been replaced by on, except in a few contexts by up. Even Google defines it as "a more formal term for on".


Set on is probably a bit more natural in that context although I'm not sure either is particularly common phraseology.


From what I recall, 'set' is the word in the English language which has the most meanings - over 50.


Depending on which dictionary you use, it's either put or set.


Perhaps it's time to revisit some OO/database terms, to avoid things like: Person.set(up), db.put(down), set.contains(Person) ;-)


Context helps: "His foe set upon him with such vigor that he knew he was hopelessly outmatched. He began to silently pray for assistance."


I remember when I was studying English in the U.S. with other foreign students some years ago, when the topic was phrasal verbs the grades decreased for the entire class. I am Brazilian; the class was composed mostly of South Koreans and Saudi Arabians.


Phrasal verbs are a hidden problem for estimating English vocabulary competence (as was alluded to earlier in this thread), because they play serious havoc with the idea of "knowing" a word. If you asked English learners if they knew the meanings of "work", "in", "out", and "off", most would say yes at a pretty early stage of their English education.

But that doesn't mean that they'd necessarily correctly interpret (or produce) "work in" 'incorporate (in a narrative or plan)', "work out" 'resolve (a problem); deduce (a solution or consequence); deliberately perform physical exercise', "work off" 'eliminate (a debt, obligation, or excessive food intake) through effort'.

And there are hundreds more of those combinations with meanings that need to be learned independently.


The text editor is here:

http://xkcd.com/simplewriter/

Try it!

1,000 words is an unrealistic constraint. But I'm amazed how often I use a complex word when the simple alternative is better.

Or...

Sticking to 1,000 words is hard. But doing it often makes my writing better.


I would love this editor with a heatmap instead of a hard cutoff at 1,000 words. So common words are green, less common ones turn yellow, and rare ones are orange to red. That would be really nice in highlighting parts of a text that are unnecessarily more complicated than the rest.
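A minimal sketch of that banding idea, assuming you already have a frequency rank for each word. The cutoffs in `BANDS`, the function names, and the `ranks` dictionary are all hypothetical; words with no known rank are treated as rare.

```python
import re

# Hypothetical cutoffs; any real tool would tune these.
BANDS = [(1000, "green"), (3000, "yellow"), (10000, "orange")]

def colour_for_rank(rank):
    """Map a frequency rank (1 = most common) to a colour band."""
    if rank is None:          # unknown word: treat as rare
        return "red"
    for limit, colour in BANDS:
        if rank <= limit:
            return colour
    return "red"

def heatmap(text, ranks):
    """Pair each word in `text` with its colour, using a word -> rank mapping."""
    words = re.findall(r"[A-Za-z']+", text)
    return [(w, colour_for_rank(ranks.get(w.lower()))) for w in words]

# Made-up ranks, for illustration only.
ranks = {"the": 1, "use": 90, "accentuate": 9000, "florid": 25000}
print(heatmap("Use the florid word", ranks))
```

A graded scale like this keeps the information a hard cutoff throws away: a word at rank 1,001 and one at rank 25,000 no longer look the same.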



I want this, but not (immediately) for my own writing - I want to run papers through it, especially those in the soft sciences.


The constraint made me realize that the complex word isn't that helpful when the other person doesn't know much behind the word. You need to get inventive, and at some point, you might start having fun. I used this to explain encryption, and from the feedback I've got, I seem to have done a good job. It gets repetitive at times, but I managed to slip in PKI, password storage, and a brief summary of the Crypto Wars.

https://theandrewbailey.com/article/177/Hidden-Writing


That's pretty well done, actually. Personally, I would add an addendum at the end that breaks out of the ten hundred most commonly used words and actually introduces the reader to the terms you're dancing around.


Thanks. I don't think that having a vocabulary list is necessary. Even though I recall there being such a list at the back of the book, I can't locate it on the XKCD site. I liberally linked to other places to further explain things with the big words. (Isn't that how the web is supposed to work?)


That's a great idea!

"Now you know what's going on, here are some words that make it easier to talk about."


I like how the text of your icon of an unencrypted message is completely breaking out of the Allowed Wordset.


I wasn't familiar with that tool, so I put a few of my last HipChat messages in there to see how I fared. Once I removed specific words or proper nouns, I still didn't do all that well. Even the preceding sentence had four problem words, and I'm pretty sure "preceding" would join them.

I agree, sticking to 1,000 words is hard. But I don't think I agree that it makes the writing better. I look at language as a form of expression, and some words are just more colorful than others, but it takes all of them to paint a picture. I honestly don't know if I could limit myself to the 1,000 most common words, but I wouldn't want to do that if I could. That said, I'm curious why you think it makes your writing better? Is it that you think more about what you're saying? Or that you think it will be more easily digested by others?


To me, language is wielded with dual intention: (1) to convey meaning ("A thorough lecturer") & (2) to convey meaning in a pleasing manner to the listener ("A good lecturer").

Expanding vocabulary to accentuate (2) by necessity compromises (1) for classes of listeners / readers who are not familiar with that vocabulary superset. Which decomposes the optimization problem to "Who is my audience and what is their comfort vocabulary set?" I would expect it's >1,000 words even for ESOL listeners. However, it's certainly < "the full set of florid English words".

And furthermore, I think English writers writing for English consumers (I count myself among these, sadly) often undervalue writing in the most effective style for the widest audience when applicable (and nowadays it almost always is: research papers, comments on a public forum, blog posts, how to's, etc etc).

Does anyone have any links to courses to help develop a working minimally spanning English vocabulary for international technical communication?

This is one problem I've had with academic literature in historically liberal arts fields. "Just learn the obscure English vocabulary (before you can understand, work, or research in a field)" is a ridiculous bar to set in front of contributions.


It's

1. Convey meaning

2. Establish social/tribal register - which can be done in inclusive (welcoming) or exclusive (aggressive) ways.

I think elegant, beautiful English peaked in the 1950s. I have a small collection of books from that period about various topics. They're all written in an effortlessly understated and unaffected English style that seems to have vanished now.

George Orwell's essays have some of the same quality.

At the other extreme, academic arts literature can be particularly bad, because the wilder fringes of (e.g.) critical theory seem to have developed a cargo cult vocabulary that primarily exists for social signalling, not for communication - while, ironically, spending a lot of time discussing social signalling.

When it's so easy to hack together an academic paper generator [1][2] and no one is much the wiser, it's clear that communication is no longer the point.

[1] http://bocktherobber.com/2010/05/post-modernism-generator/ [2] https://www.theguardian.com/technology/shortcuts/2014/feb/26...


Rare words can make writing more efficient and enjoyable. But it doesn't always.

To my ears:

"I use language for two reasons:"

Is better than

"To me, language is wielded with dual intention"

But I always enjoy the word "florid", even though it's not in the top 1000 words.


> "To me, language is wielded with dual intention"

I've got to be honest. I actually prefer that one. Not for conveying meaning but it definitely sounds nicer -- as though it were a part of a Shakespearean soliloquy.


I believe that's the nicest thing anyone has ever said about words I put on the internet.


Indeed. Efficient use of language can be far more compelling than florid prose. Of course, it can also be more difficult to convey the same meaning tersely, hence the old quote (whose source I forget, and I'm likely mangling) "I didn't have time to write you a short letter, so I wrote you a long one".



Which of those I'd use would be dependent on the situation. If I'm trying to convey a tone of militant anger or something similar, I'd go with the latter.


I am not sure anyone believes that writing with only 1000 words is better - but it should make the writing more understandable. There is a sharp curve in understanding from text that is easy (most magazine or newspaper articles) to dizzyingly complicated (most academic writing). For me the utility in simple writing is its accessibility to a wider audience.

Incidentally you did rather well in your second paragraph - only 'honestly, limit, common, curious, digested' were out of bounds, and they are easily swapped without much loss in expression.


Many of the "Simple English"-language articles on Wikipedia are better (IMHO) than their English counterparts. Perhaps the best would be a mix of both, e.g. the introduction to:

https://simple.wikipedia.org/wiki/Fraction_(mathematics)

"A fraction is a number that shows how many equal parts there are. When we write fractions, we show one number with a line above another number, for example, (...) 1⁄4 or 1/4. The top number tells us how many parts there are, the second number tells us the total number of parts.

The top part of the fraction is called a numerator. The bottom part of the fraction is called a denominator. For example, 1/4: The 1 is the numerator here, and the 4 is the denominator."

vs the "English" one:

"A fraction (from Latin fractus, "broken") represents a part of a whole or, more generally, any number of equal parts. When spoken in everyday English, a fraction describes how many parts of a certain size there are, for example, one-half, eight-fifths, three-quarters. A common, vulgar, or simple fraction (examples: 1/2 and 17/3) consists of an integer numerator displayed above a line (or before a slash), and a non-zero integer denominator, displayed below (or after) that line. Numerators and denominators are also used in fractions that are not common, including compound fractions, complex fractions, and mixed numerals."

> Is it that you think more about what you're saying?

I think this is one important reason - and key behind all good writing. Using a limited vocabulary is one way to force yourself to do that.

It's also a way to force yourself to examine what you write, and make sure you actually use words you understand, and not stray too far into your passive vocabulary and accidentally introduce ambiguity because you think you are using more precise words than you actually are.

I would also say that, while rich language can be fun, most writing can benefit from being simplified. Not everything needs to read like Paradise Lost.

In the words of Hemingway: "Don’t get discouraged because there’s a lot of mechanical work to writing. There is, and you can’t get out of it. I rewrote the first part of A Farewell to Arms at least fifty times. You’ve got to work it over. The first draft of anything is shit. When you first start to write you get all the kick and the reader gets none, but after you learn to work it’s your object to convey everything to the reader so that he remembers it not as a story he had read but something that happened to himself."

(Here with more context, than the traditional quip: "The first draft of anything is shit.", in order to emphasize that the point is that everything needs to be reworked).

On a similar note, I recommend that everyone who writes (i.e. everyone) read: "On Writing Well" by Zinsser (himself a proponent of revisions; the book is in its 30th edition):

http://www.amazon.com/Writing-Well-Classic-Guide-Nonfiction/...


Using a simple vocabulary is not inherently better. It's a trade-off. What it does, is reach a wider audience, at the expense of clear, concise communication. All communication has this same trade-off, whether it's between people, or between a person and a compiler or virtual machine.

For example, when speaking in a domain, such as computer science and programming, there are certain words we use that are specific to that domain which help us express ideas efficiently. The assumption is that the people receiving the communication know those words, or at least know how to quickly find out what they are referring to, in context. For example, type system. The layperson may have wildly different expectations for what that means, if they even want to hazard a guess. Replacing "type system" with ten to thirty words describing what we mean instead of using type system does make the communication more approachable (if longer) to the layperson, but at some small inconvenience to the writer and to readers who are already familiar with the concept. I don't want to read a few sentences to just come to the conclusion that the writer is describing what we already both have efficiently categorized in our minds as a type system.

There is, in this, quite a bit in common with programming languages, and how we communicate with computers and later programmers using them. Perl, for example, is often labelled a write-only language. Some of this is due to quite a bit of historical programming dating to the early web, when there was less thought given to maintainability and to not abusing the flexibility of the language, and some of this is due to the rich syntax and expressions allowed in the language. Perl comes with a larger vocabulary than many other languages, and the syntax is complex (expressive) enough that the learning curve is a bit longer. The benefit is that people well versed in the language can express themselves clearly, concisely and quickly. Python, on the other hand, has a smaller vocabulary and a more constrained syntax. This emphasizes clarity over conciseness and quickness. This constraint allows programmers who are amateurs to still write code in a way that does not appear extremely different from the code an expert writes (although I believe there is an underlying complexity in the code that this masks). The important thing to note here is that there's a trade-off in language design, and in the use of languages. I believe that a group of five expert Perl programmers will achieve more in the same time period than a group of five expert Python programmers, all other things being equal (including module ecosystems). I don't think this is a controversial statement, just as I don't think five expert QBASIC programmers will be as efficient as five Python programmers, or that five expert Perl programmers will be as efficient as five expert APL programmers. The key point here is that if they are all experts, whatever your trade-off for accessibility was is now a liability.

The other key point is that people rarely become experts, so optimizing for some lower level of skill, whether it be amateur or professional, is often a more useful strategy, because the trade of quality (efficiency) for quantity is often a good one to make at this level. Conversely, I would argue going too far the other way, to the level of QBASIC, constrains people past the amateur level far too much and hurts efficiency as well.

I guess that's really just a long-winded way of saying you should write to your audience. :)

P.S. I often find I use wordier expressions when I could have used something simpler. I like to think that's because I'm trying to be concise, and convey a specific meaning, and I value my time and the time of those that read what I write. In the end, I use words like "inherently" because it accurately captures exactly what I was trying to express. Hopefully for the audience at HN, and the many non-native English speakers we have, that's not problematic. I don't think I'm too wordy, but I have a feeling I'm unable to accurately assess myself on this subject.


In English lessons (in England) we had to write newspaper articles in the style of The Times and The Sun.

Writing for The Sun was more difficult, since it required keeping a reading age of about 8, and a very distinct style.


Try writing only in red ...


That's really hard!


P.S. Machine translation is often considered as one of the most typical examples of NLP problems, but it would be also interesting to see "machine simplification" to get more attention. Can we automate the translation from English to Simple English? Currently not very well.


I don't think that's an easier problem than machine translation between "different" languages. You still need to understand the source text well enough to figure out the meaning and reproduce it using different words.


Text simplification and summarization are active fields of research. Machine translation is indeed more active, but this is understandable given its commercial applications.


I think it is more likely that computers will first manage to understand texts written in Simple English way before they are able to do the same with English.

That is not surprising, as that's exactly how humans learn to work with language.


It seems like using a model that can estimate its certainty about the meaning of a sentence might be a good way of going about this. If the model is uncertain then it could recommend re-writing the sentence.


You could train a language model from a corpus of writing by e.g. ESL students or young children, and then disallow any words with p<p(thousandth_word) given the preceding context. :)
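A toy sketch of that thresholding idea, using a bigram count model as a stand-in for a real language model. Everything here is an assumption for illustration: the four-sentence corpus, the function names, and `k` standing in for "thousandth". A word is allowed after a given context only if it is at least as likely there as the k-th most likely continuation.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each previous word ("<s>" marks sentence start)."""
    counts = defaultdict(Counter)
    for s in sentences:
        tokens = ["<s>"] + s.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def allowed(counts, prev, word, k):
    """True if `word` is at least as frequent after `prev` as the k-th most common continuation."""
    following = counts.get(prev)
    if not following:
        return False                      # context never seen in the corpus
    ranked = following.most_common()
    cutoff = ranked[min(k, len(ranked)) - 1][1]
    # Comparing counts within one context is the same as comparing conditional probabilities.
    return following[word] >= cutoff

# Hypothetical miniature "learner corpus".
corpus = ["the dog ran", "the dog sat", "the cat sat", "the bird sang"]
model = train_bigrams(corpus)
print(allowed(model, "the", "dog", k=1), allowed(model, "the", "cat", k=1))
```

Unlike a flat word list, the cutoff here depends on the preceding word, so the same word can be fine in one position and flagged in another.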


Thing explainer is fun -- but I don't think the descriptions in it are particularly clear. It's just fun to see how he solves the problem of describing them given his self-imposed limitation.

Even more fun is Poul Anderson's "Uncleftish Beholding": http://www2.warwick.ac.uk/fac/cross_fac/complexity/people/st...


Agreed. Often you have to already know what he's talking about to get meaning from his words.

For example, his description of the Saturn V Rocket (he replaced rocket with 'up goer') has this description of hydrogen: "The kind of air that once burned a big sky bag and people died (And someone said "Oh the [humans]!")"

I guess that's amusing if you already know about the Hindenburg, or something.

Also, speaking of the propellant used for the Saturn V's F-1 engine he states: "This is full of that stuff they burned in lights before houses had power". The propellant in the F-1 engine was RP-1 (kerosene). I assume he was talking about 'town gas' for lighting.


Kerosene lamps were ubiquitous before electric lighting. Look at the pictures in the wikipedia page[0], you'll recognize them for sure.

[0] https://en.wikipedia.org/wiki/Kerosene_lamp


It seems that, in trying to make a second point, jgrahamc accidentally proved his first.


It wasn't an accident. I was showing that the ambiguity introduced by his language means that it's hard to understand what's really going on.


Well that was clever. :-)


To be fair, Up-goer Five is the original for the comic, it was hardly intended at the time that it would take off the way it did. That it generated so much interest led to the book.


Are you sure it wasn't a teaser for a work in progress?


It was done in 2012, a teaser 3 years before the book release would be unusual.


I'm a fan of Guy Steele's Growing a Language, a talk about programming language design where he limited himself to using mostly only single-syllable words: https://www.youtube.com/watch?v=_ahvzDzKdB0


Not to mention the hand-drawn crayon transparencies.


This seems like a good companion (probably as a sequel) to:

"Richard Feynman - Computer Heuristics Lecture": https://www.youtube.com/watch?v=EKWGGDXe5MA

In general, Feynman might be a poster child for using simple language to explain complex things.


For me, Thing Explainer was like a funny joke at first which got old VERY fast.

To paraphrase an old Andrew Dice Clay quote : "It was like masturbating with a cheese grater. Slightly amusing, but mostly painful."

I lost all interest in the book within five minutes of browsing it.


I'm interested to know what German, Dutch or Norse-descendant speakers think of that.

I'm learning Danish, and reading that is similar to, but easier than, deciphering letters from the electricity company.


Off topic:

I attended a school with little to no "secular" education, but I was by nature really curious about things.

At some point, I got my hands on "How Stuff Works", the book[0], and devoured it. It was a super-enlightening book. I'd even venture to say it set me on my autodidactic path to programming.

If this book serves the same purpose for someone as that book did for me, its benefits cannot be overstated IMO.

As an extra bonus, I was considered by my peers to be far more knowledgeable than I actually was, due to having a layman's understanding (or at least the appearance thereof) of _so many_ esoteric concepts. 10/10 would read again!

[0] http://amzn.com/0785824324


While this is tagged as PDF the link is actually to the website where you can read the abstract.

The PDF link is: http://arxiv.org/pdf/1605.02457v1.pdf


I have Thing Explainer right here, my main issue is I'd like a separate version with the actual terms so I can have a conversation about the covered topics without sounding like a moron.


I'm not sure why you got downvoted for that. It's a reasonable request. Restrict yourself to the thousand words for words, but not for names.


What I think would be interesting is if he tried another version without the 1000 most common words.


I don't think it's possible to form non-trivial sentences without function words. But leaving out common content words could be interesting.


Someone wrote a 260-page novel without using a single word that contains the letter 'e'. That seems impossible to me.

https://en.wikipedia.org/wiki/Gadsby_%28novel%29


There are multiple examples. It is difficult but apparently not so difficult that nobody attempts it. The accepted term for texts like that is "lipogram"; it's a popular form of the broader category of constrained writing.

Wikipedia has a page for "logology" which is apparently a term for the general activity of playing with language in that kind of way on a per-letter basis (so, including anagrams and palindromes, etc.)

https://en.m.wikipedia.org/wiki/Logology


If you sort all English words by frequency, the ones that contain 'e' would be distributed more or less randomly, so it seems more doable. Banning the first 1000 would prevent you from using the most useful words. I don't think you could construct sentences that wouldn't sound totally weird.
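The "distributed more or less randomly" part is something you could check rather than guess. A minimal sketch, assuming you have a list of words sorted by frequency; the short `ranked` list below is a hand-made placeholder, not real frequency data, and the function name is made up.

```python
def share_with_letter(ranked_words, letter, bin_size):
    """For each consecutive bin of the ranked list, return the fraction of words containing `letter`."""
    shares = []
    for start in range(0, len(ranked_words), bin_size):
        chunk = ranked_words[start:start + bin_size]
        hits = sum(1 for w in chunk if letter in w.lower())
        shares.append(hits / len(chunk))
    return shares

# Placeholder ordering for illustration; substitute a real frequency-sorted word list.
ranked = ["the", "of", "and", "to", "a", "in", "is", "you", "that", "it", "he", "was"]
print(share_with_letter(ranked, "e", bin_size=4))
```

If the shares stay roughly flat across bins on a real list, banning 'e' costs you a similar slice of every frequency band, whereas banning the top 1000 removes the first band entirely.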


https://en.wikipedia.org/wiki/A_Void (from the original _La Disparition_)

is a noir/fantasy story, its actual plot is the story's lack of 'e'.

It's amazing. It's difficult and confusing to follow the writing, but it's an amazing book.


I know, it's fascinating. But it doesn't detract from my point.


Looking at the diagram of "Bags of stuff inside you" in "Thing Explainer" I was surprised to find that when annotating blood vessels, intestines etc he uses "hallway" and not "pipe", which I thought would be in the 1000 most often used English words. How curious.


Sounds very similar to Simple English Wikipedia: https://simple.wikipedia.org/wiki/Wikipedia:Simple_English_W...


In fact, Randall makes an explicit mention of it in a comic:

https://xkcd.com/547/


Thing Explainer was a nice experiment, but, really... what's the value in trying to parse CNL explanations that are harder to understand than using the actual word?


self-plug: lets you explore which English words are the most common

The Long Tail of the English Language http://blog.wordsapi.com/2015/01/the-long-tail-of-english-la...


Is that your site? You have a spam problem in your comments.



