I think Thing Explainer is a fun experiment and a delightful book, but as an attempt (and it isn't a serious attempt) to use only simple and super-commonly-used language, it doesn't hit the mark. As I speak English as my second language, I'm acutely aware of this.
Thing Explainer uses the 1000 most commonly used lemmas, but words have multiple senses, and some of them are commonly used and some are not. From the viewpoint of a language learner, an unfamiliar use of a word might as well be another word. (Of course the senses might have a clear semantic connection, which helps with guessing.)
Another thing is that phrasal verbs and set phrases are essentially vocabulary items too – you can't decode them using only extralinguistic knowledge (that is, knowledge about the world).
Randall Munroe developed a text editor that highlights any words outside his word list to help with writing the book, but I think an editor that could handle word senses and multi-word phrases would be a formidable thing. Of course it needs much more high-level NLP, word sense disambiguation and such. (Possibly impossible to pull that off cleanly with the current level of tech?) I'd love to see one.
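To make the baseline concrete, here is a minimal sketch of the word-list check such an editor performs. It is not Munroe's actual tool: the `ALLOWED` set is a tiny hypothetical stand-in for a real 1000-word list, `flag_words` is a made-up name, and it matches surface forms only, so it does no lemmatization and knows nothing about word senses or multi-word phrases.

```python
import re

# Hypothetical stand-in for a real "most common words" list; a real tool
# would load the full list from a file.
ALLOWED = {"the", "a", "big", "up", "goer", "is", "very", "and", "it", "goes"}


def flag_words(text, allowed=ALLOWED):
    """Return the words in `text` that are not in `allowed`, in order."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in allowed]


print(flag_words("The rocket is very big and it goes up"))  # prints ['rocket']
```

Everything that makes the problem hard (senses, phrasal verbs, set phrases) is exactly what this kind of check cannot see.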
This is also a common complaint about Simple English books written for English learners. The books claim to have a restricted vocabulary, but they cheat by using phrasal verbs.
Their vocabulary includes verbs like put and set along with prepositions like up, with and upon. You can combine these to generate an enormous number of phrasal verbs like put up with, set upon and so on, which are normally considered to be separate vocabulary items.
As a person whose first language is not English, I find phrasal verbs one of the most challenging parts of the language. In fact I just searched set upon and was greatly surprised that it means attacking violently! I would have never guessed that from the words alone.
In this case the phrase is a more artistic way of describing the beginning of a fight. Think of "John set upon Mike angrily" as John set, or selected, Mike as his target.
It's also a great example of the topic at hand because the same phrase could be used as in "The book was set upon the table," describing a book that was placed on a table.
Each version uses the same common vocabulary to describe wildly different things.
"John was set up on a date" - someone arranged a romantic encounter for John.
"John was setup on a date" - on some specific day John was assembled.
"John was set upon a date" - John assaulted a piece of fruit.
Nice try :-), but there's a reason why we write "upon" without a space. In normal speech the prosody makes it perfectly clear that "upon" is one word, rather than two.
Doesn't the "was" change who the action applies to? E.g. "John assaulted a date" vs "John was assaulted by a date"? I'm not entirely sure "John was set upon a date" makes sense. Then again, I've never been great with the specific rules of English, even though it's my first language (likely because it's my first language, and I learned it not as rules, but through immersion as a child).
"Set upon" can mean either "to begin to attack" or "to cause to begin to attack", although in the latter case the target of attack goes between "set" and "upon":
The dog set upon the cat. (began to attack the cat)
The person set the dog upon the cat. (caused the dog to begin to attack the cat)
The second meaning could make sense in the original sentence with John being the attacker, a date (fruit) being the target of attack, and an unspecified party being the one who caused John to attack. In this case there is a past-tense passive with "was" + the past participle of the verb (like "was liked", "was seen", "was taken"), but the past participle of "set upon" is identical to the present form, so "was set upon" means either "was attacked" or "was caused to attack". But only the second meaning is plausible when followed by "a date", as opposed to "by a date".
"John was set upon a date" → John was caused to attack a date
"John was set upon by a date" → John was attacked by a date
This reminds me that phrasal verbs really are one of the trickiest things in English. Some of my non-native speaker friends have several books just about this topic, because it's so subtle and pervasive in English.
Thank you for the clarification. I suspected the "by" portion might be key, mainly because it felt like it was missing (and thus a requirement). That it helps distinguish the meaning makes sense.
I really should spend some time to research the mechanics of my native tongue, rather than rely on what sounds right and the simplistic rules I can remember from primary school. This has been on my mental to-do list at different times over the years, but I always seem to have it de-prioritized and then I forget about it. :/
Just remember that language isn't truly normative - in general, one of the best tests for "correctness" is "appears in speech by native speakers". Which is somewhat complicated by the existence of written language. In English, Shakespeare is somewhat famously the inventor of a number of idioms that would probably have seemed strange at the time, but are now ingrained in the English language, see eg:
Whilst this ambiguity is a pain for those learning the language and for clarity, it's also one of the things which makes English a great language for puns/wordplay.
I suspect that having a reduced vocabulary would likely increase the chances of ambiguity, so could be a great source of CNL puns.
Honestly any use of upon sounds old-fashioned. It's generally been replaced by on, except in a few contexts by up. Even Google defines it as "a more formal term for on".
I remember when I was studying English in the U.S. with other foreign students some years ago, when the topic was phrasal verbs the grades decreased for the entire class.
I am Brazilian, the class was composed mostly of South Koreans and Saudi Arabians.
Phrasal verbs are a hidden problem for estimating English vocabulary competence (as was alluded to earlier in this thread), because they play serious havoc with the idea of "knowing" a word. If you asked English learners if they knew the meanings of "work", "in", "out", and "off", most would say yes at a pretty early stage of their English education.
But that doesn't mean that they'd necessarily correctly interpret (or produce) "work in" 'incorporate (in a narrative or plan)', "work out" 'resolve (a problem); deduce (a solution or consequence); deliberately perform physical exercise', "work off" 'eliminate (a debt, obligation, or excessive food intake) through effort'.
And there are hundreds more of those combinations with meanings that need to be learned independently.
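As a rough illustration of why a per-word check misses these, here is a small sketch that scans for multi-word entries from a phrasal-verb lexicon. The `PHRASAL_VERBS` table and the `find_phrasal_verbs` name are hypothetical, and the matching is deliberately naive: it only finds adjacent, uninflected forms.

```python
import re

# Hypothetical mini-lexicon; a real one would hold hundreds of entries
# and would also need inflected forms ("worked out", "sets upon").
PHRASAL_VERBS = {
    ("put", "up", "with"),
    ("set", "upon"),
    ("work", "out"),
    ("work", "off"),
}


def find_phrasal_verbs(text, lexicon=PHRASAL_VERBS):
    """Return (word_index, phrase) for each lexicon entry found in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    longest = max(len(p) for p in lexicon)
    hits = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so "put up with" beats "put up".
        for n in range(longest, 1, -1):
            if i + n > len(words):
                continue
            candidate = tuple(words[i:i + n])
            if candidate in lexicon:
                hits.append((i, " ".join(candidate)))
                i += n
                break
        else:
            i += 1
    return hits


print(find_phrasal_verbs("I cannot put up with it, so I will work out"))
```

Note the limits: an adjacent match does not prove a phrasal reading ("the book was set upon the table" would be flagged too), and separable uses like "set the dog upon the cat" are missed entirely. Telling those apart is the word sense disambiguation problem again.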
I would love this editor with a heatmap instead of a hard cutoff at 1,000 words. So common words are green, less common ones turn yellow, and rare ones are orange to red. That would be really nice in highlighting parts of a text that are unnecessarily more complicated than the rest.
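A heatmap like that could be sketched roughly as below. The `RANKS` table, the function names, and the rank thresholds (1,000 / 5,000 / 10,000) are all assumptions picked for illustration; a real tool would load ranks from a corpus-derived frequency list and probably tune the cutoffs.

```python
import re

# Hypothetical frequency ranks (1 = most common word).
RANKS = {"the": 1, "is": 7, "water": 300, "engine": 2500, "turbine": 12000}


def color_for_rank(rank):
    """Map a frequency rank to a color bucket; unknown words count as rare."""
    if rank is None or rank > 10000:
        return "red"
    if rank > 5000:
        return "orange"
    if rank > 1000:
        return "yellow"
    return "green"


def heatmap(text, ranks=RANKS):
    """Return (word, color) pairs for every word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return [(w, color_for_rank(ranks.get(w))) for w in words]


for word, color in heatmap("The turbine engine is new"):
    print(f"{word}: {color}")
```

Treating unknown words as red is a design choice: it errs on the side of flagging, at the cost of also marking typos and proper nouns.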
The constraint made me realize that the complex word isn't that helpful when the other person doesn't know much behind the word. You need to get inventive, and at some point, you might start having fun. I used this to explain encryption, and from the feedback I've got, I seem to have done a good job. It gets repetitive at times, but I managed to slip in PKI, password storage, and a brief summary of the Crypto Wars.
That's pretty well done, actually. Personally, I would add an addendum at the end that breaks out of the ten hundred most commonly used words and actually introduces the reader to the terms you're dancing around.
Thanks. I don't think that having a vocabulary list is necessary. Even though I recall there being such a list at the back of the book, I can't locate it on the XKCD site. I liberally linked to other places to further explain things with the big words. (Isn't that how the web is supposed to work?)
I wasn't familiar with that tool, so I put a few of my last HipChat messages in there to see how I fared. Once I removed specific words or proper nouns, I still didn't do all that well. Even the preceding sentence had four problem words, and I'm pretty sure "preceding" would join them.
I agree, sticking to 1,000 words is hard. But I don't think I agree that it makes the writing better. I look at language as a form of expression, and some words are just more colorful than others, but it takes all of them to paint a picture. I honestly don't know if I could limit myself to the 1,000 most common words, but I wouldn't want to do that if I could. That said, I'm curious why you think it makes your writing better? Is it that you think more about what you're saying? Or that you think it will be more easily digested by others?
To me, language is wielded with dual intention: (1) to convey meaning ("A thorough lecturer") & (2) to convey meaning in a pleasing manner to the listener ("A good lecturer").
Expanding vocabulary to accentuate (2) by necessity compromises (1) for classes of listeners / readers who are not familiar with that vocabulary superset. Which decomposes the optimization problem to "Who is my audience and what is their comfort vocabulary set?" I would expect it's >1,000 words even for ESOL listeners. However, it's certainly < "the full set of florid English words".
And furthermore, I think English writers writing for English consumers (I count myself among these, sadly) often undervalue writing in the most effective style for the widest audience when applicable (and nowadays it almost always is: research papers, comments on a public forum, blog posts, how to's, etc etc).
Does anyone have any links to courses to help develop a working minimally spanning English vocabulary for international technical communication?
This is one problem I've had with academic literature in historically liberal arts fields. "Just learn the obscure English vocabulary (before you can understand, work, or research in a field)" is a ridiculous bar to set in front of contributions.
2. Establish social/tribal register - which can be done in inclusive (welcoming) or exclusive (aggressive) ways.
I think elegant, beautiful English peaked in the 1950s. I have a small collection of books from that period about various topics. They're all written in an effortlessly understated and unaffected English style that seems to have vanished now.
George Orwell's essays have some of the same quality.
At the other extreme, academic arts literature can be particularly bad, because the wilder fringes of (e.g.) critical theory seem to have developed a cargo cult vocabulary that primarily exists for social signalling, not for communication - while, ironically, spending a lot of time discussing social signalling.
When it's so easy to hack together an academic paper generator [1][2] and no one is much the wiser, it's clear that communication is no longer the point.
> "To me, language is wielded with dual intention"
I've got to be honest. I actually prefer that one. Not for conveying meaning but it definitely sounds nicer -- as though it were a part of a Shakespearean soliloquy.
Indeed. Efficient use of language can be far more compelling than florid prose. Of course, it can also be more difficult to convey the same meaning tersely, hence the old quote (whose source I forget, and I'm likely mangling) "I didn't have time to write you a short letter, so I wrote you a long one".
Which of those I'd use would be dependent on the situation. If I'm trying to convey a tone of militant anger or something similar, I'd go with the latter.
I am not sure anyone believes that writing with only 1000 words is better - but it should make the writing more understandable. There is a sharp curve in understanding from text that is easy (most magazine or newspaper articles) to dizzyingly complicated (most academic writing). For me the utility in simple writing is its accessibility to a wider audience.
Incidentally you did rather well in your second paragraph - only 'honestly, limit, common, curious, digested' were out of bounds, and they are easily swapped without much loss in expression.
Many of the "Simple English"-language articles on Wikipedia are better (IMHO) than their English counterparts. Perhaps the best would be a mix of both, eg the introduction to:
"A fraction is a number that shows how many equal parts there are. When we write fractions, we show one number with a line above another number, for example, (...) 1⁄4 or 1/4. The top number tells us how many parts there are, the second number tells us the total number of parts.
The top part of the fraction is called a numerator. The bottom part of the fraction is called a denominator. For example, 1/4: The 1 is the numerator here, and the 4 is the denominator."
vs the "English" one:
"A fraction (from Latin fractus, "broken") represents a part of a whole or, more generally, any number of equal parts. When spoken in everyday English, a fraction describes how many parts of a certain size there are, for example, one-half, eight-fifths, three-quarters. A common, vulgar, or simple fraction (examples: 1/2 and 17/3) consists of an integer numerator displayed above a line (or before a slash), and a non-zero integer denominator, displayed below (or after) that line. Numerators and denominators are also used in fractions that are not common, including compound fractions, complex fractions, and mixed numerals."
> Is it that you think more about what you're saying?
I think this is one important reason - and key behind all good writing. Using a limited vocabulary is one way to force yourself to do that.
It's also a way to force yourself to examine what you write, and make sure you actually use words you understand, and not stray too far into your passive vocabulary and accidentally introduce ambiguity because you think you are using more precise words than you actually are.
I would also say that, while rich language can be fun, most writing can benefit from being simplified. Not everything needs to read like Paradise Lost.
In the words of Hemingway: "Don’t get discouraged because there’s a lot of mechanical work to writing. There is, and you can’t get out of it. I rewrote the first part of A Farewell to Arms at least fifty times. You’ve got to work it over. The first draft of anything is shit. When you first start to write you get all the kick and the reader gets none, but after you learn to work it’s your object to convey everything to the reader so that he remembers it not as a story he had read but something that happened to himself."
(Here with more context than the traditional quip, "The first draft of anything is shit.", in order to emphasize that the point is that everything needs to be reworked.)
On a similar note, I recommend that everyone who writes (ie: everyone) read: "On Writing Well" by Zinsser (himself a proponent of revisions, the book is in its 30th edition):
Using a simple vocabulary is not inherently better. It's a trade-off. What it does, is reach a wider audience, at the expense of clear, concise communication. All communication has this same trade-off, whether it's between people, or between a person and a compiler or virtual machine.
For example, when speaking in a domain, such as computer science and programming, there are certain words we use that are specific to that domain which help us express ideas efficiently. The assumption is that the people receiving the communication know those words, or at least know how to quickly find out what they are referring to, in context. For example, type system. The layperson may have wildly different expectations for what that means, if they even want to hazard a guess. Replacing "type system" with ten to thirty words describing what we mean instead of using type system does make the communication more approachable (if longer) to the layperson, but at some small inconvenience to the writer and to readers who are already familiar with the concept. I don't want to read a few sentences to just come to the conclusion that the writer is describing what we already both have efficiently categorized in our minds as a type system.
There is, in this, quite a bit in common with programming languages, and how we communicate with computers and later programmers using them. Perl, for example, is often labelled a write-only language. Some of this is due to quite a bit of historical programming dating to the early web, when there was less thought to maintainability and not abusing the flexibility of the language, and some of this is due to the rich syntax and expressions allowed in the language. Perl comes with a larger vocabulary than many other languages, and the syntax is complex (expressive) enough that the learning curve is a bit longer. The benefit is that people well versed in the language can express themselves clearly, concisely and quickly. Python, on the other hand, has a smaller vocabulary and a more constrained syntax. This emphasizes clarity over conciseness and quickness. This constraint allows programmers that are amateurs to still write code in a way that does not appear extremely different than the code an expert writes (although I believe there is an underlying complexity in the code that this masks). The important thing to note here is that there's a trade-off in language design, and in the use of languages. I believe that a group of five expert Perl programmers will achieve more in the same time period than a group of five expert Python programmers, all other things being equal (including module ecosystems). I don't think this is a controversial statement, just as I don't think five expert QBASIC programmers will be as efficient as five Python programmers, or that five expert Perl programmers will be as efficient as five expert APL programmers. The key point here is that if they are all experts, whatever your trade-off for accessibility was is now a liability.
The other key point is that people rarely become experts, so optimizing for some lower level of skill, whether it be amateur or professional, is often a more useful strategy, because the trade of quality (efficiency) for quantity is often a good one to make at this level. Conversely, I would argue going too far the other way, to the level of QBASIC, constrains people past the amateur level far too much and hurts efficiency as well.
I guess that's really just a long-winded way of saying you should write to your audience. :)
P.S. I often find I use wordier expressions when I could have used something simpler. I like to think that's because I'm trying to be concise, and convey a specific meaning, and I value my time and the time of those that read what I write. In the end, I use words like "inherently" because it accurately captures exactly what I was trying to express. Hopefully for the audience at HN, and the many non-native English speakers we have, that's not problematic. I don't think I'm too wordy, but I have a feeling I'm unable to accurately assess myself on this subject.
P.S. Machine translation is often considered one of the most typical examples of NLP problems, but it would also be interesting to see "machine simplification" get more attention. Can we automate the translation from English to Simple English? Currently not very well.
I don't think that's an easier problem than machine translation between "different" languages. You still need to understand the source text well enough to figure out the meaning and reproduce it using different words.
Text simplification and summarization are active fields of research. Machine translation is indeed more active, but this is understandable given its commercial applications.
I think it is more likely that computers will manage to understand texts written in Simple English well before they are able to do the same with English.
That is not surprising, as that's exactly how humans learn to work with language.
It seems like using a model that can estimate its certainty about the meaning of a sentence might be a good way of going about this. If the model is uncertain then it could recommend re-writing the sentence.
You could train a language model from a corpus of writing by e.g. ESL students or young children, and then disallow any words with p<p(thousandth_word) given the preceding context. :)
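A toy version of that idea, just to show its shape: the sketch below uses a bigram count model where the suggestion implies something far stronger, the three-sentence corpus is made up, and a fixed `threshold` stands in for "the probability of the thousandth word". All function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts


def cond_prob(counts, prev, cur):
    """Estimate P(cur | prev) from raw counts (no smoothing)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[cur] / total if total else 0.0


def flag_unlikely(counts, sentence, threshold):
    """Return words whose probability given the previous word is below threshold."""
    words = ["<s>"] + sentence.lower().split()
    return [
        cur
        for prev, cur in zip(words, words[1:])
        if cond_prob(counts, prev, cur) < threshold
    ]


# Made-up "learner corpus"; a real one would be vastly larger.
model = train_bigrams(["the dog ran", "the dog sat", "the cat sat"])
print(flag_unlikely(model, "the dog meowed", threshold=0.1))  # prints ['meowed']
```

Because the score depends on context, the same word can pass in one sentence and be flagged in another, which is closer to catching unfamiliar senses than a flat word list is.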
Thing explainer is fun -- but I don't think the descriptions in it are particularly clear. It's just fun to see how he solves the problem of describing them given his self-imposed limitation.
Agreed. Often you have to already know what he's talking about to get meaning from his words.
For example, his description of the Saturn V Rocket (he replaced rocket with 'up goer') has this description of hydrogen: "The kind of air that once burned a big sky bag and people died (And someone said "Oh the [humans]!")"
I guess that's amusing if you already know about the Hindenburg, or something.
Also, speaking of the propellant used for the Saturn V's F-1 engine he states: "This is full of that stuff they burned in lights before houses had power". The propellant in the F-1 engine was RP-1 (kerosene). I assume he was talking about 'town gas' for lighting.
To be fair, Up-goer Five is the original for the comic, it was hardly intended at the time that it would take off the way it did. That it generated so much interest led to the book.
I'm a fan of Guy Steele's Growing a Language, a talk about programming language design where he limited himself to using mostly only single-syllable words: https://www.youtube.com/watch?v=_ahvzDzKdB0
I attended a school with little to no "secular" education, but I was by nature really curious about things.
At some point, I got my hands on "How Stuff Works", the book[0], and devoured it. It was a super-enlightening book. I'd even venture to say it set me on my autodidactic path to programming.
If this book serves the same purpose for someone as that book did for me, its benefits cannot be overstated IMO.
As an extra bonus, I was considered by my peers to be far more knowledgeable than I actually was, due to having a layman's understanding (or at least the appearance thereof) of _so many_ esoteric concepts. 10/10 would read again!
I have Thing Explainer right here, my main issue is I'd like a separate version with the actual terms so I can have a conversation about the covered topics without sounding like a moron.
There are multiple examples. It is difficult but apparently not so difficult that nobody attempts it. The accepted term for texts like that is "lipogram"; it's a popular form of the broader category of constrained writing.
Wikipedia has a page for "logology" which is apparently a term for the general activity of playing with language in that kind of way on a per-letter basis (so, including anagrams and palindromes, etc.)
If you sort all English words by frequency, the ones that contain 'e' would be distributed more or less randomly, so it seems more doable. Banning the first 1000 would prevent you from using the most useful words. I don't think you could construct sentences that wouldn't sound totally weird.
Looking at the diagram of "Bags of stuff inside you" in "Thing Explainer" I was surprised to find that when annotating blood vessels, intestines etc he uses "hallway" and not "pipe", which I thought would be in the 1000 most often used English words. How curious.
Thing Explainer was a nice experiment, but, really... what's the value in trying to parse CNL explanations that are harder to understand than using the actual word?