Although it sounds high, recognising 90% of words makes for a pretty horrible reading experience.
That's 1 word in 10 that you don't know (1-2 words per sentence) or, assuming a page length of 300 words as you did in that post, 30 new words a page.
I actually recently wrote an article discussing the same phenomenon in Chinese [0], where to get a reasonable level of new characters (e.g. no more than 1 a page) you'd need to know 99.8% of the text on any page.
And the level of recognition required to be able to recognise and learn new words completely from context is about 98%. [1]
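To make the arithmetic above concrete, here is a minimal sketch. The function name is made up for illustration, and the 300-word page is the assumption taken from the post being discussed.

```python
def unknown_words_per_page(coverage, page_length=300):
    """Expected number of unknown words on a page, given the fraction
    of running words the reader already recognises."""
    return page_length * (1.0 - coverage)


if __name__ == "__main__":
    # 90% gives roughly 30 unknown words a page, 98% roughly 6,
    # and 99.8% less than 1.
    for coverage in (0.90, 0.98, 0.998):
        print(f"{coverage:.1%}: {unknown_words_per_page(coverage):.1f} unknown words per page")
```

It only restates the percentages as counts, but it shows why 90% feels so different from 98%.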
I know nothing of Chinese, but in Western languages like English a number of words may be "unique" when counted by a simple algorithm (even if, as the author did, words are reduced to their "basic form", deduplicating a lot of slightly different forms). Often, though, you can get the meaning of a word from context and from similarities with other known words. So 90%, while it does mean that you don't know 1 word in 10, does not directly mean that you cannot understand 1 word in 10 or that your reading experience is so horrible.
Every reader attempting to learn a new language goes through that odd phase where they can manage to understand the overall meaning of a sentence even if there are one or two "holes" in it, and that is actually part of the learning process.
Although you might be able to pick up the meaning or get the gist of the occasional word at 1 in 10, the video I linked to in [1] above makes a convincing case that the level of recognition at which unknown words stop being a hindrance to understanding and can be picked up from context is around 98%.
Most probably that 98% is an extremely accurate number by the professor's metrics, and I was not commenting on that video (which I didn't watch). I was only relating what happens in my experience: it isn't so horrible, and I am still in that situation with more than one language where I scarcely reach 70% or (maybe) 80%.
I've read a lot of books in languages I didn't know well at the time. I often got the gist of the story. What I missed was mostly unknown adjectives describing possibly unknown objects to set the mood. Of course the story is not as good without them, but it works.
Speaking of Chinese, I wrote http://pingtype.github.io to add pinyin, tone colours, literal translations, and a real translation to Chinese text.
I use it every day for reading the Bible, and I also processed music lyrics, movie subtitles, and restaurant menus with it.
I also made my own keyboard that works by decomposing characters into their radicals. You can also load Word Lists (HSK or TOCFL) to highlight the words you know, or hash-out the characters in the text that you don't know.
Please tell me how to market it. I keep posting on here when I finish new projects that use Pingtype, but I still haven't got many users.
It is a horrible reading experience. But it gets better as you learn, and you get a more interesting story than a child's story (though not as much culture).
I agree there is utility in it (and have done reading at that level of comprehension and below). It's more interesting than children's stories, and it's a great way to improve your vocabulary. I was just pointing out that, despite sounding good, 90% comprehension falls well short of comfortable reading, and you're better off reading easier texts than struggling with more difficult ones.
In fact that was one of the reasons why I made a program called Chinese Text Analyser [0], so that you could quickly identify suitable texts based on your current vocabulary without needing to read the novel first.
This doesn't really address your teacher's claim about having to look words up, though. What you want to look at is the distribution of low frequency words across the book. What do the plots look like when you remove proper nouns, functional words (e.g., "the", "and", prepositions) and, say, the top 1000 most frequent words in English?
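A rough sketch of how one might get the numbers for such a plot. It assumes you supply your own set of common words (for example a top-1000 English list plus function words). The function name is hypothetical, the tokeniser is a crude regex, and proper nouns are not handled at all.

```python
import re


def rare_words_per_page(text, common_words, page_length=300):
    """Count, for each fixed-size "page", the tokens that are NOT in
    the supplied set of common words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = []
    for start in range(0, len(tokens), page_length):
        page = tokens[start:start + page_length]
        counts.append(sum(1 for word in page if word not in common_words))
    return counts


if __name__ == "__main__":
    sample = "the cat and the dog and the axolotl"
    # Tiny pages of 4 tokens, just to show the shape of the output.
    print(rare_words_per_page(sample, {"the", "and"}, page_length=4))
```

Plotting the returned list against page number would show whether the rare words cluster in some parts of the book or are spread evenly.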
Would be very interesting to see this applied to blogs in different categories to rapidly learn languages through reading based on the words that you currently know and the most frequent words in that language. So it would always present you with the article that suits your level and you would have the benefit of learning the most new words.
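Something like this could be a starting point, as a sketch only. The function names, the 98% default target (borrowed from the figure quoted earlier in the thread) and the regex tokeniser are all assumptions, and it would not work as-is for Chinese, which needs proper word segmentation.

```python
import re


def coverage(text, known_words):
    """Fraction of running words in `text` that are in `known_words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


def pick_text(texts, known_words, target=0.98):
    """Return the name of the text whose coverage is closest to `target`.

    `texts` maps a name (e.g. an article title) to its full text.
    """
    return min(texts, key=lambda name: abs(coverage(texts[name], known_words) - target))


if __name__ == "__main__":
    known = {"the", "cat", "sat"}
    articles = {
        "easy": "The cat sat.",
        "hard": "The quixotic cat perambulated.",
    }
    print(pick_text(articles, known))
```

Picking the text closest to the target, rather than the one with the highest coverage, is what keeps some new words in every article.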
Also would be interesting to see it applied to newspapers, with obvious slices like particular author, section (sports vs. world news etc.), distribution year to year, and which paper. TV news broadcasting could also be interesting to compare along the same dimensions, though the conversational style in some interview shows would possibly make this less telling.
I imagine it would look very much like the plots of unique words given in the article. As you suspect, the chances of coming across one of these is much more evenly distributed.
It would probably look more or less similar. They are excluded very quickly. There is something I cannot assess: how important is the word to understanding the sequence?
FWIW, Ulysses isn't particularly incomprehensible. To the extent that it's difficult to read, it's much more the shifting narrative perspective, widely ranging references, and stream-of-consciousness rather than the vocabulary.
Take this typical section from the "Lotus Eaters" chapter, wherein Mr. Bloom is contemplating the origins of the wares in a tea shop:
So warm. His right hand once more more slowly went over again: choice blend, made of the finest Ceylon brands. The far east. Lovely spot it must be: the garden of the world, big lazy leaves to float about on, cactuses, flowery meads, snaky lianas they call them. Wonder is it like that. Those Cinghalese lobbing around in the sun, in dolce far niente. Not doing a hand's turn all day. Sleep six months out of twelve. Too hot to quarrel. Influence of the climate. Lethargy. Flowers of idleness. The air feeds most. Azotes. Hothouse in Botanic gardens. Sensitive plants. Waterlilies. Petals too tired to. Sleeping sickness in the air.
Hard to be too confused by the imagery and mood in this passage.
Number of Pages: 729
Number of Total Words: 218793
Number of Unique Words: 50872
You will know 90% of words after 387 pages which are 53.09% of the book.
At that page, you will know 60.64% of unique words.
Number of Pages: 217
Number of Total Words: 65396
Number of Unique Words: 3106
You will know 90% of words after 28 pages which are 12.90% of the book.
At that page, you will know 36.77% of unique words.
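For anyone wanting to reproduce figures like these, here is one possible way to compute them. This is a guess at the method, not the article's actual code: it assumes 300-word pages, a crude regex tokeniser with no reduction to base forms, and that "knowing" a word means having met it on an earlier page.

```python
import re
from collections import Counter


def pages_to_reach_coverage(text, target=0.90, page_length=300):
    """Return (page_number, share_of_unique_words_seen) for the first
    page after which the words met so far account for at least
    `target` of all running words in the book, or None if the text
    has no words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    totals = Counter(tokens)   # occurrences of each word in the whole book
    seen = set()               # words met so far
    covered = 0                # running words accounted for by `seen`
    for page_number, start in enumerate(range(0, len(tokens), page_length), start=1):
        for word in tokens[start:start + page_length]:
            if word not in seen:
                seen.add(word)
                covered += totals[word]
        if covered / len(tokens) >= target:
            return page_number, len(seen) / len(totals)
    return None


if __name__ == "__main__":
    sample = "a b a c a a a a a a"
    print(pages_to_reach_coverage(sample, target=0.85, page_length=2))
```

With different page lengths or tokenisation the numbers will shift, so don't expect it to match the figures above exactly.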
The graph is less regular but it has more or less the same shape. I will not publish this part because it is not a book.
ulysses and finnegans wake is inspiration porn. i find the more jarring the sequence of words/sentences/phrases from a few passages the more inspired i become after.
best to have this type of stuff on hand when you get stuck. it's neurological
My mom, an English teacher, once went through my library of science fiction and analyzed it for reading level. I had the usual collection: lots of Heinlein, Asimov, Niven, Andre Norton, etc.
Her assessment: Most of the material was about 8th grade level, based on word count.
From time to time I re-read one of those books, and run across pages where she had penciled-in notations and underlined words.
For the record, Ulysses is at least a full order of magnitude more comprehensible than Joyce's next book, Finnegans Wake.
I'd also expect it to give a skewed response on a test of this kind because it is composed of a number of different sections, which vary considerably in their style. But maybe that's the point of including it.
Here are Finnegans Wake graphs. It is indeed even more complicated. https://github.com/vocapouch/vocapouch-research/blob/master/....
I think their teacher was referring to Zipfian Distribution[0]. I've seen this distribution hold on Wikipedia corpus, as well. Of course it's empirical.
It is MythBusters-style science. The goal was to see how it works with short and long books, and with one that has a reputation for being easy and one for being a hard read. It would be interesting to see it on a larger sample, with more statistics involved.
As little as one character of almost any document will usually give you 100% of the binary symbols 0 and 1. Usually, the first character will do this, after which the rest of it is just mindless repetition.
Yes, that's Zipf's law applied. I doubt that many language learners know about this law. I think it is still worth pointing out that as you get through the beginning of a book, reading rapidly becomes easier.
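For anyone who wants to check this on their own text, a quick sketch (the function name is made up). Zipf's law says frequency is roughly inversely proportional to rank, so rank times count should stay in the same ballpark down the list, at least on a large enough corpus.

```python
from collections import Counter


def rank_frequency(tokens, top=10):
    """Return (rank, word, count, rank * count) for the `top` most
    frequent tokens, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(top), start=1):
        rows.append((rank, word, count, rank * count))
    return rows


if __name__ == "__main__":
    sample = "the the the the of of and".split()
    for row in rank_frequency(sample):
        print(row)
```

On a toy sample like this the product is not meaningful. The pattern only shows up on book-sized texts, and even there it is an empirical regularity, not an exact rule.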
It'd be an interesting exercise in Modernist writing to try producing a book that violates Zipf's law, say by hashing all but the most common few hundred words into chapter buckets.
I wonder how Pale Fire by Nabokov would look after this sort of analysis. For the unfamiliar, per wikipedia, "Starting with the table of contents, Pale Fire looks like the publication of a 999-line poem in four cantos ("Pale Fire") by the fictional John Shade with a Foreword, extensive Commentary, and Index by his self-appointed editor, Charles Kinbote. Kinbote's Commentary takes the form of notes to various numbered lines of the poem. Here and in the rest of his critical apparatus, Kinbote explicates the poem surprisingly little. Focusing instead on his own concerns, he divulges what proves to be the plot piece by piece, some of which can be connected by following the many cross-references. Espen Aarseth noted that Pale Fire "can be read either unicursally, straight through, or multicursally, jumping between the comments and the poem."[4] Thus although the narration is non-linear and multidimensional, the reader can still choose to read the novel in a linear manner without risking misinterpretation."
Ah nice, I need to do the same with House of Leaves. I'm a big fan of stories with unconventional structuring. Sometimes a Great Notion by Kesey is my favorite; it's told from multiple first-person perspectives that shift pretty rapidly, where the shifts are indicated by having a particular speaker's text italicized, in parentheses, with no formatting applied, etc. It's pretty neat.
0: https://www.chinesethehardway.com/article/hsk-6-gets-you-hal...
1: https://www.youtube.com/watch?v=JbYMZZISPrU