Do 20 pages of a book give you 90% of its words? (vocapouch.com)
113 points by kiechu on June 30, 2017 | hide | past | favorite | 57 comments



Although it sounds high, recognising 90% of words makes for a pretty horrible reading experience.

That's 1 word in 10 that you don't know (1-2 words per sentence), or, assuming a page length of 300 words as you did in the post, 30 new words per page.

I actually wrote an article recently discussing the same phenomenon in Chinese [0], where I found that to keep new characters to a reasonable level (e.g. no more than 1 per page) you'd need to know 99.8% of the text on any given page.

And the level of recognition required to be able to recognise and learn new words completely from context is about 98%. [1]

0: https://www.chinesethehardway.com/article/hsk-6-gets-you-hal...

1: https://www.youtube.com/watch?v=JbYMZZISPrU
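The arithmetic behind these thresholds is simple to check. A minimal sketch, assuming a 300-word page as the parent article does:

```python
# Expected unknown words per page at a given vocabulary coverage rate,
# assuming a fixed page length of 300 words (the article's assumption).
WORDS_PER_PAGE = 300

def unknown_per_page(coverage, words_per_page=WORDS_PER_PAGE):
    """Expected number of unknown words on one page at this coverage rate."""
    return words_per_page * (1 - coverage)

for c in (0.90, 0.98, 0.998):
    print(f"{c:.1%} coverage -> {unknown_per_page(c):.1f} unknown words per page")
```

At 90% coverage that is 30 unknown words per page, and only at 99.8% does it fall below 1 per page, matching the figures above.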


I know nothing of Chinese, but in Western languages like English, a number of words may be "unique" when counted by a simple algorithm (even if, as the author did, words are reduced to their "basic form", which deduplicates many slightly different forms). Yet you can often get the meaning of a word from context and from its similarity to other known words. So while 90% coverage does mean you don't know 1 word in every 10, it does not directly mean you cannot understand 1 word in every 10, or that your reading experience is so horrible.

Every reader attempting to learn a new language goes through that odd phase of managing to understand the overall meaning of a sentence even when there are one or two "holes" in it; in fact, that is part of the learning process.


Although you might be able to pick up the meaning, or at least get the gist, of the occasional unknown word at 1 in 10, the video I linked to in [1] above makes a convincing case that the rate at which unknown words stop being a hindrance to understanding, and can be picked up from context, is around 98%.


That 98% may well be an accurate number by the professor's metrics, and I was not commenting on that video (which I didn't watch). I was only relating what happens in my experience: it isn't so horrible, and I am still within that experience with more than one language where I scarcely reach 70% or (maybe) 80%.


I've read a lot of books in languages I didn't know well at the time, and I often got the gist of the story. The unknown words were mostly adjectives describing possibly unknown objects, there to set the mood. Of course the story is not as good without them, but it works.


Speaking of Chinese, I wrote http://pingtype.github.io to add pinyin, tone colours, literal translations, and a real translation to Chinese text.

I use it every day for reading the Bible, and I also processed music lyrics, movie subtitles, and restaurant menus with it.

I also made my own keyboard that works by decomposing characters into their radicals. You can also load Word Lists (HSK or TOCFL) to highlight the words you know, or hash-out the characters in the text that you don't know.

Please tell me how to market it. I keep posting on here when I finish new projects that use Pingtype, but I still haven't got many users.


It is a horrible reading experience, but it gets better as you learn, and you get a more interesting story than a children's book (though not as much culture).


I agree there is utility in it (and I have done reading at that level of comprehension and below): it's more interesting than children's stories, and it's a great way to improve your vocabulary. I was just pointing out that, despite sounding good, 90% comprehension falls well short of comfortable reading, and you're better off reading easier texts than struggling with more difficult ones.

In fact that was one of the reasons why I made a program called Chinese Text Analyser [0], so that you could quickly identify suitable texts based on your current vocabulary without needing to read the novel first.

0: https://www.chinesetextanalyser.com/


This doesn't really address your teacher's claim about having to look words up, though. What you want to look at is the distribution of low frequency words across the book. What do the plots look like when you remove proper nouns, functional words (e.g., "the", "and", prepositions) and, say, the top 1000 most frequent words in English?
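One way to sketch the filtering suggested here is to drop a function-word list and the top-N most frequent words, then look at where the remaining low-frequency words fall in the text. The function-word list and the top-1000 cutoff below are illustrative assumptions, not anyone's actual methodology:

```python
from collections import Counter

# Illustrative stand-in for a proper stopword list.
FUNCTION_WORDS = {"the", "and", "a", "of", "to", "in", "on", "at", "by", "for"}

def low_frequency_positions(tokens, top_n=1000):
    """Token positions occupied by words that are neither function words
    nor among the top_n most frequent words in the text."""
    counts = Counter(tokens)
    common = {w for w, _ in counts.most_common(top_n)}
    return [
        i for i, w in enumerate(tokens)
        if w not in FUNCTION_WORDS and w not in common
    ]
```

Bucketing those positions by page would give the distribution plot the comment asks about: if rare words cluster, the histogram is lumpy; if they are spread evenly, it is flat.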


Would be very interesting to see this applied to blogs in different categories to rapidly learn languages through reading based on the words that you currently know and the most frequent words in that language. So it would always present you with the article that suits your level and you would have the benefit of learning the most new words.


It would also be interesting to see it applied to newspapers, with obvious slices like particular author, section (sports vs. world news, etc.), year-to-year distribution, and which paper. TV news broadcasting could also be interesting to compare along the same dimensions, though the conversational style of some interview shows might make this less telling.


That's something worth trying.


I imagine it would look very much like the plots of unique words given in the article. As you suspect, the chances of coming across one of these are much more evenly distributed.


It probably would look more or less similar; they are excluded very quickly. There is one thing I cannot assess: how important is each word to understanding the sentence?


FWIW, Ulysses isn't particularly incomprehensible. To the extent that it's difficult to read, it's much more the shifting narrative perspective, widely ranging references, and stream-of-consciousness rather than the vocabulary.

Take this typical section from the "Lotus Eaters" chapter, wherein Mr. Bloom is contemplating the origins of the wares in a tea shop:

So warm. His right hand once more more slowly went over again: choice blend, made of the finest Ceylon brands. The far east. Lovely spot it must be: the garden of the world, big lazy leaves to float about on, cactuses, flowery meads, snaky lianas they call them. Wonder is it like that. Those Cinghalese lobbing around in the sun, in dolce far niente. Not doing a hand's turn all day. Sleep six months out of twelve. Too hot to quarrel. Influence of the climate. Lethargy. Flowers of idleness. The air feeds most. Azotes. Hothouse in Botanic gardens. Sensitive plants. Waterlilies. Petals too tired to. Sleeping sickness in the air.

Hard to be too confused by the imagery and mood in this passage.

Now, Finnegans Wake...


Here are the Finnegans Wake graphs. It is indeed even more complicated: https://github.com/vocapouch/vocapouch-research/blob/master/...

Number of Pages: 729
Number of Total Words: 218793
Number of Unique Words: 50872
You will know 90% of words after 387 pages, which is 53.09% of the book. At that page, you will know 60.64% of unique words.


Excerpts from Finnegans Wake are great fodder to break CSS layouts during development. Somanyobscenelylongwordswithoutbreaks.


I will run Finnegans Wake in a moment and get back with a response. I must first find it in text format.


Can you try a low-brow book, like Twilight?


Have you tried non-fiction books?


What do you have in mind?


The Bible.

(Just kidding)

What about having this read a tweet history, say that of a POTUS?


From what I see, POTUS is circling within the basic 1000 words.


According to the bill you posted:

Number of Pages: 217
Number of Total Words: 65396
Number of Unique Words: 3106
You will know 90% of words after 28 pages, which is 12.90% of the book. At that page, you will know 36.77% of unique words.

The graph is less regular, but it has more or less the same shape. I will not publish this part because it is not a book.
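The "know 90% of words after N pages" figure can be reproduced with a short sketch. This assumes a fixed page length and defines "knowing" a word as having seen it on an earlier page, which appears to match the article's method but is my reading of it:

```python
def pages_to_coverage(tokens, target=0.90, words_per_page=300):
    """Smallest number of leading pages whose vocabulary covers at least
    `target` of all word occurrences in the whole text."""
    total = len(tokens)
    seen = set()
    for page in range(1, total // words_per_page + 2):
        # Add this page's words to the known vocabulary.
        seen.update(tokens[(page - 1) * words_per_page : page * words_per_page])
        # Fraction of all tokens in the book now recognised.
        covered = sum(1 for w in tokens if w in seen) / total
        if covered >= target:
            return page
    return None
```

Feeding in a tokenised (and lemmatised) book gives the page count; dividing by the total page count gives the percentage figures quoted above.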


Actually,

I do have a challenge for you: US laws.

Can you evaluate this law, which was just signed 3 days ago:

http://leginfo.legislature.ca.gov/faces/billTextClient.xhtml...

(There is a link to a PDF of the law if you prefer to DL the PDF first...)


That is a bit of a different analysis, but I will. Ping me at @r_kierzkowski on Twitter so we can stay in touch.


Can you plot his most-used to least-used words, please?


I was thinking of popular business books, be it in marketing, management... or, why not, "The Mythical Man-Month".


Ulysses and Finnegans Wake are inspiration porn. I find that the more jarring the sequence of words/sentences/phrases in a few passages, the more inspired I become afterwards.

Best to have this type of stuff on hand when you get stuck. It's neurological.


My mom, an English teacher, once went through my library of science fiction and analyzed it for reading level. I had the usual collection: lots of Heinlein, Asimov, Niven, Andre Norton, etc.

Her assessment: Most of the material was about 8th grade level, based on word count.

From time to time I re-read one of those books, and run across pages where she had penciled-in notations and underlined words.


> we turned words to their basic forms (went to go, cars to car, jumps to jump etc.)

FYI, this is called stemming. https://en.wikipedia.org/wiki/Stemming


Or, when done contextually, lemmatization.


That's correct, it's lemmatization. Stemming would not reduce "went" to "go".
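The difference is easy to see with a toy example. A naive suffix-stripping stemmer is purely mechanical, so it cannot map "went" to "go", while a lemmatizer consults a dictionary of irregular forms. This is an illustrative sketch, not how NLTK or any real library implements it:

```python
# Toy stemmer: mechanically strips common suffixes.
def stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: look up irregular forms first, fall back to stemming.
IRREGULAR = {"went": "go", "was": "be", "were": "be", "better": "good"}

def lemmatize(word):
    return IRREGULAR.get(word, stem(word))

print(stem("went"))       # 'went' -- stemming can't recover 'go'
print(lemmatize("went"))  # 'go'
print(stem("cars"))       # 'car'
```

Real stemmers (e.g. Porter) use more elaborate suffix rules, but the limitation is the same: without a dictionary, irregular forms stay unmerged.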


For the record, Ulysses is at least a full order of magnitude more comprehensible than Joyce's next book, Finnegans Wake.

I'd also expect it to give a skewed response on a test of this kind because it is composed of a number of different sections, which vary considerably in their style. But maybe that's the point of including it.


Here are the Finnegans Wake graphs. It is indeed even more complicated: https://github.com/vocapouch/vocapouch-research/blob/master/...

Number of Pages: 729
Number of Total Words: 218793
Number of Unique Words: 50872
You will know 90% of words after 387 pages, which is 53.09% of the book. At that page, you will know 60.64% of unique words.


I think their teacher was referring to the Zipfian distribution [0]. I've seen this distribution hold on the Wikipedia corpus as well. Of course, it's empirical.

[0]: https://en.wikipedia.org/wiki/Zipf%27s_law


A nice, interesting idea and experiment, thanks.

Not coincidentally, the blue lines remind me of the one in the graph for the birthday problem:

https://en.wikipedia.org/wiki/Birthday_problem


The use of Eve's Diary doesn't make much sense here; of course the distribution of words in a short story is going to behave differently from that of a full-length book.

Ulysses is fair, but I would expect it and works of a similar caliber to be outliers.


It is MythBusters-style science. The goal was to see how it works with short and long books, and with one that has a reputation for being an easy read and one for being hard. It would be interesting to see it on a larger population of books, with more statistics involved.


As little as one character of almost any document will usually give you 100% of the binary symbols 0 and 1. Usually, the first character will do this, after which the rest of it is just mindless repetition.


This is good, interesting work. I wonder what the difference between stemming and lemmatization shows?

Edit: I see you are doing lemmatization now. Did you try just stemming?


This doesn't seem all that groundbreaking; it's just an instance of Zipf's law in action, is it not?


Specifically, the closely related Heaps' law: https://en.wikipedia.org/wiki/Heaps%27_law
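Heaps' law says vocabulary size grows sublinearly with text length, roughly V(n) ≈ K·n^β with β < 1. A small sketch that generates Zipf-like text and measures the growth; the vocabulary size and 1/k weighting are arbitrary choices for illustration:

```python
import random

def zipfian_text(n_tokens, vocab_size=10_000, seed=42):
    """Generate tokens with Zipf-like frequencies: word k has weight 1/k."""
    rng = random.Random(seed)
    weights = [1 / k for k in range(1, vocab_size + 1)]
    return rng.choices(range(vocab_size), weights=weights, k=n_tokens)

def vocab_growth(tokens, checkpoints):
    """Unique-word counts after the first n tokens, for each n."""
    return [len(set(tokens[:n])) for n in checkpoints]

text = zipfian_text(40_000)
for n, v in zip((10_000, 20_000, 40_000), vocab_growth(text, (10_000, 20_000, 40_000))):
    print(f"{n} tokens -> {v} unique words")
```

Doubling the text length adds well under double the vocabulary, which is exactly the effect the article's "first 20 pages" curves show.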


Yes, that's Zipf's law in action. I doubt many language learners knew about this law, so I think it is still worth pointing out that as you get through the beginning of a book, reading rapidly becomes easier.


It'd be an interesting exercise in Modernist writing to try producing a book that violates Zipf's law, say by hashing all but the most common few hundred words into chapter buckets.


Maybe, but if you know a nice proof I'd love to hear it.


I bet this is not true for the Encyclopedia Britannica, by design.


I think this is a very useful idea: it could be used to "rate" books for English learners, to see how difficult they are.


"incomprehensibility"


Fixed. Thank you!


90% of the words is not 90% of the meaning... but I get your point.


Yes, if the book is 22 pages long!


Not if it's a dictionary!


Or a phone book.


I wonder how Pale Fire by Nabokov would look after this sort of analysis. For the unfamiliar, per Wikipedia: "Starting with the table of contents, Pale Fire looks like the publication of a 999-line poem in four cantos ("Pale Fire") by the fictional John Shade with a Foreword, extensive Commentary, and Index by his self-appointed editor, Charles Kinbote. Kinbote's Commentary takes the form of notes to various numbered lines of the poem. Here and in the rest of his critical apparatus, Kinbote explicates the poem surprisingly little. Focusing instead on his own concerns, he divulges what proves to be the plot piece by piece, some of which can be connected by following the many cross-references. Espen Aarseth noted that Pale Fire "can be read either unicursally, straight through, or multicursally, jumping between the comments and the poem."[4] Thus although the narration is non-linear and multidimensional, the reader can still choose to read the novel in a linear manner without risking misinterpretation."


Huh, sounds a little like House of Leaves, which has a similarly weird structure.

I'll have to check out Pale Fire.


Ah nice, I need to do the same with House of Leaves; I'm a big fan of stories with unconventional structuring. Sometimes a Great Notion by Kesey is my favorite: it's told from multiple first-person perspectives that shift pretty rapidly, where the shifts are indicated by having a particular speaker's text italicized, in parentheses, with no formatting applied, etc. It's pretty neat.



