The best part of this article is perhaps the following critique of ngrams and by extension their popular use in modern algorithms:
> The text of Etymonline is built entirely from print sources, and is done entirely by human beings. Ngrams are not. They are unreliable, a sloppy product of an ignorant technology, one made to sell and distract, one never taught the difference between "influence" and "inform."
> Why are they on the site at all? Because now, online, pictures win and words lose. The war is over; they won.
One never taught the difference between "influence" and "inform". What a scathing rebuke of our modern world and the social media that is part of it. Algorithms that attempt to quantify human speech and interaction, and get it wrong most of the time, in their quest to maximize their owners' profits.
This somber warning is especially poignant in an age more and more ruled by generative AI, which I'm told is essentially an ngram predictor.
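For readers who haven't seen one, a literal ngram predictor is little more than a lookup table of counts. A minimal sketch follows (the toy corpus and function names are my own invention; modern generative models are of course far more than this):

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count, for each word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower seen in training, or None."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

# Toy corpus; a real n-gram model is trained on a huge text dump.
corpus = "he said that she said that they said nothing".split()
model = train_bigram_model(corpus)
print(predict_next(model, "said"))  # -> "that"
```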
> The text of Etymonline is built entirely from print sources, and is done entirely by human beings. Ngrams are not.
I'm confused about this part actually. I assume by "entirely from print sources" it means it does not include digital sources? That doesn't sound very relevant to the issues mentioned in the article though: unless it uses the "complete" set of all print sources, it totally could have the same skewed-dataset issues too; and humans can make the same mistakes as OCR does.
Etymonline compiles the information on etymology and historical usage from printed books (e.g. the Oxford English Dictionary). That is what is being referred to here. They are not having humans tally up different words from books. That data is entirely from ngrams.
Influence and inform are two sides of the same moral coin, where we claim others' ideas aren't their own, whereas we are the virtuous informed ones who draw our own conclusions.
The low-pass filter of the mind only allows in what fits somewhere inside the existing framework. If you don't reject something, then being informed by it and being influenced by it are the same thing. In that framework, people who claim to be informed come off as high and mighty and a little lacking in self-consciousness.
Disagree, influencing someone and informing someone are orthogonal.
Influencing someone just means changing their behavior and/or beliefs. This can be done with either the truth or lies, or even just opinion (green is better than blue - neither true nor false).
Informing someone specifically means giving them true information, which may or may not influence them.
If we think more along the lines that truth is in itself always a moral judgement, then in that light, influencing and informing again become the same thing.
For instance, if I were to say something and you were to disagree, you don't get to say that you're the one that's right and that you're the one that's informing people, and that I'm the evil influencer, without it being a moral judgement.
And if you think that you are actually some sort of oracle of truth, then calling your judgements truths is still a moral judgement, predicated on the belief in your infallibility.
This is the fundamental problem of data analysis: your analysis is only as good as your data.
This is not an easy problem.
It's hard in general to evaluate data quality: How do we know when our data is good? Are we sure? How do we measure that and report on it?
If we do have some qualitative or quantitative assessment of data quality, how do we present it in a way that is integrated with the results of our analysis?
And if we want to quantitatively adjust our results for data quality, how do we do that?
There are answers to the above, but they lie beyond the realm of a simple line chart, and they tend to require a fair amount of custom effort for each project.
For example in the Google Ngrams case, one could present the data quality information on a chart showing the composition of data sources over time, broken out into broad categories like "academic" and "news". But then you have to assign categories to all those documents, which might be easy or hard depending on how they were obtained. And then you also have to post a link to that chart somewhere very prominently, so that people actually look at it, and maybe include some explanatory disclaimer text. That would help, but it's not going to prevent the intuitive reaction when a human looks at a time series of word usage declining.
Maybe a better option is to try to quantify the uncertainty in the word usage time series and overlay that on the chart. There are well-established visualization techniques for doing this. But how do we quantify uncertainty in word usage? In this case, our count of usages is exact: the only uncertainty is related to sampling. In order to quantify uncertainty, we must estimate how much our sample of documents deviates from all documents written at that time. It might be doable, but it doesn't sound easy. And once we have that done, will people actually interpret that uncertainty overlay correctly? Or will they just look at the line going down and ignore the rest?
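To make the sampling part of that concrete, here is a rough sketch with invented per-book frequencies (nothing here is real Ngram data). Bootstrapping the books gives a band, but as noted above it only captures noise within the scanned sample, not how far that sample is from everything published that year:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frequencies of "said" (per 10,000 words) in one year's scanned books.
book_freqs = rng.normal(loc=45, scale=15, size=200).clip(min=0)

# Bootstrap the mean: resample books with replacement and recompute.
boot_means = [rng.choice(book_freqs, size=len(book_freqs), replace=True).mean()
              for _ in range(2000)]
low, high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean {book_freqs.mean():.1f}, 95% CI ({low:.1f}, {high:.1f})")
# This band reflects sampling noise among the scanned books only; it says nothing
# about whether those books resemble everything written at that time.
```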
Your analysis is only as good as your data. This has been a fundamental problem for as long as we have been trying to analyze data, and it's never going to go away. We would do well to remember this as we move into the "AI age".
It says something about us as well: throughout our lives, we learn from data. We observe and consider and form opinions. How good is the data that we have observed? Are our conclusions valid?
The authors assert that the ngram statistics for "said" are wrong, and imply that they have evidence to the contrary, but they don't provide the evidence. Looking at their own website, all they provide is Google ngram statistics: https://www.etymonline.com/word/said#etymonline_v_25922.
This coupled with the huge failing of not displaying zero on the y-axis of their graph, and even interpreting the bad graph wrong, makes me not believe them at all. A very low quality article.
A decline to half the usage of "said" within 6 decades, followed by a recovery to the previous level within two decades? Show me evidence that the English language changed so fast in that way. It's extraordinary and you'd have to bring something convincing. Otherwise I believe their hypothesis and their conclusion that ngrams are bunk.
Yeah, they interpreted the "toast" graph wrong. They should be more careful when reading shitty graphs that cut off at the low point.
It depends entirely on what the data set is, and to conclude that it's "wrong" you'd have to consider the underlying data too. Google ngrams makes no claim to be a consistent benchmark type data set. Over time the content it's based on shifts, which can cause effects like this.
To make any sort of claim like "this word's usage changes over time" in an academic sense you'd need to include a discussion of the data sources you used and why those are representative of word usage over time. The fact that they'd even try to use google ngrams in this way shows how little they actually researched the topic.
Google ngrams is a cute data set that can sometimes show rough trends, but it's not some "authoritative source on usage over time" and it doesn't claim to be.
The authors, on the other hand, are claiming to be authoritative and thus the burden of evidence on their claims is far far far higher. I didn't even get into their completely unobjective and vague accusations of "AI" somehow doing something bad. Ngrams don't involve AI, it's simple word counting.
> The authors, on the other hand, are claiming to be authoritative and thus the burden of evidence on their claims is far far far higher.
From what I read the authors are only claiming that some Google n-grams fail the common sense test and that the data shouldn't be considered rigorous.
"said" is in the top 300 most frequent English words, according to Wiktionary. For its usage to halve in 80 years then double again in 20 would represent a profound shift in English that would certainly be known to linguists.
Or, as with "toast", one could simply doubt the veracity of the data.
According to this page (https://books.google.com/ngrams/info), if you want to write a paper based on their results (why would you do this against a cute dataset?) make sure to quote their very authoritative sounding paper "Quantitative Analysis of Culture Using Millions of Digitized Books"
It's possible (but I think unlikely) that it could be somewhat due to different usage of words, rather than the English language changing completely (which clearly didn't happen).
i.e. maybe instead of lots of books having direct text like "David said" or "Dora said", over time there was a trend to use a different more varied/descriptive way of describing that, i.e. "David replied" or "Dora retorted"?
Yeah, there may be a shift in usage hidden in those numbers. As this article laments, we can't use ngrams to measure the development of usage between said, replied, and retorted.
It’s hard to present evidence because there’s only one source. So the article basically calls out flaws in the methodology of Google Books/Ngram.
I think this is reasonable. Otherwise we end up accepting things solely because they exist, even when they are flawed. Just because something exists and is easy to use doesn't mean it's right.
Just like the answer to “the most tweeted thing is X therefore it is most popular and important” does not require a separate study to find the truth. It’s acceptable just to say “this is a stupid methodology, don’t accept it just because that’s what twitter says.”
I think what you want is for someone (yourself, me, the author) to review newspapers or some similar source and determine how the frequency percent changes over time for the word "said".
This is a reasonable request, but I also think it's fine for the author to state it _as an expert_ that newspapers continued using said at a similar frequency. The story they tell is plausible, and I don't really think the burden of proof is on them.
A low effort comment. That "said" hasn't declined and risen the way shown isn't what needs evidence.
It's the extraordinary claim that it has that does.
That claim is Google's, and before accusing the author of the blog, maybe ask how representative their unseen dataset is. Should we take statistics with no knowledge of their input set at face value because "trust Google"?
Google isn't making any such claim. It's merely providing fun statistics based on its data set. With that context, when I read a headline claiming that the statistics are "wrong," it would imply that the counts are somehow off. Maybe due to a bug in the algorithm or the like.
Instead, we get a strawman put up where they misrepresent what the data set is, make up things that it's "claiming," fail to investigate the underlying data sources and look into "why" they see the trend they see, and also fail to provide any alternative data.
It's cheap and snobby grandstanding, ironically complete with faulty interpretations of the little data they DO present.
It should be marked "Fun statistics" with a big red label "Not representative of anything, any graph you see could be and probably is totally bogus" then.
> Instead, we get a strawman put up where they misrepresent what the data set is, make up things that it's "claiming," fail to investigate the underlying data sources and look into "why" they see the trend they see, and also fail to provide any alternative data.
Ah, blaming the victim and moving the goalposts. Old favorites.
Why the fuck would the author need to "provide alternative data"? Google is showing statistics, that people, including journalists and scholars, take at face value.
Now they're suddenly just "fun statistics", so if they take them seriously, it's on them?
But Google is claiming such a thing by calling it "trends", which the dictionary defines as "a general direction in which something is developing or changing." If they didn't want to create such misunderstandings they would just call it "word frequency in Google Books", so the biases of the data would be a lot clearer.
EtymOnline isn't in the business of tracking shifts in the popularity of words over time, they set out to track shifts in meaning. So it's understandable that they don't have any specific contrary evidence in their listing for "said".
As for why they don't include the evidence in TFA, as others have noted, it's the extraordinary claim that "said" dropped to nearly 1/3 of its peak usage that needs extraordinary evidence backing it up. It's plenty sufficient for them to say "this doesn't make any sense at all on its face, and is most likely due to a major shift in the genre makeup of Google's dataset".
> Ngram says toast almost vanishes from the English language by 1980, and then it pops back up.
The Ngram plot does not say that. It shows usage dropping ~40% (since 1800). It's indeed a problem that the graph Y axis doesn't go to zero, as others have pointed out. But did the etymonline authors really not notice this before declaring incorrectly what it says? I would find that hard to believe (especially considering the subsequent "see, no dip" example that has a zero Y axis and a small but visible plateau around 1980), and it's ironic considering the hyperbolic and accusatory title and opening sentence.
The graph axis isn't the only problem. The word "toast" did not drop in usage by 40%, Google's dataset shifted dramatically towards a different genre than it was composed of previously. I've been in conversations with people trying to explain those drops in the 70s, and no one (myself included) realized that it was such a dramatic flaw in the data.
That’s fair, the article has a very valid point, which would be made even stronger without the misreading of the plots they’re critiquing, whether it was accidental or intentional. I always thought Ngrams were weird too, I remember in the past thinking some of the dramatic shifts it shows were unlikely.
Sort of, but it's pretty blunt. You can select between a few different English corpuses, but it's basically fiction versus everything, not more fine than that.
When it comes to results like this it is more “lusting for clickbait” or the scientific equivalent thereof. (e.g. papers in Science and Nature aren’t really particularly likely to be right, but they are particularly likely to be outrageous, particularly in fields like physics that aren’t their center)
On the other hand, "Real Clear Politics" always had a toxic sounding name to me since there is nothing "Real" or "Clear" about politics: I think the best book about politics is Hunter S. Thompson's Fear and Loathing on the Campaign Trail '72, which is a druggie's personal experience following the candidates around and picking up hitchhikers on the road at 3am and getting strung out on the train and having moments of jarring sobriety, like the time when he understood the parliamentary maneuvering that won McGovern the nomination while more conventional journalists were at a loss.
What I do know is 20 years from now an impeccably researched book will come out that makes a strong case that what we believed about political events today was all wrong and really it was something different. In the meantime different people are going to have radically different perspectives and… that's the way it is. Adjectives like "real" and "clear" are an attempt to shut down most of those perspectives and pretend one of those viewpoints is privileged. Makes me think of Baudrillard's thorough shitting on the word "real" in Simulacra and Simulation, which ought to completely convince you that people peddling the fake will be heralded by the word "real".
(Or for that matter, that Scientology calls itself the “science of certainty.”)
> 20 years from now an impeccably researched book will come out that makes a strong case that what we believed about political events today was all wrong and really it was something different
The one good thing about politics is that the motives are crystal clear, politicians want to stay in power first, and only secondarily want to improve things.
Once you know this, everything makes sense, even if we never find out what "really" happened.
> politicians want to stay in power first, and only secondarily want to improve things.
The politicians who want to be in power first, and only secondarily want to improve things, tend to be the politicians in power.
Politicians who want to improve things first do exist, but they tend not to achieve power, because power is not their goal, and they are out-maneuvered by the first type.
Notably, politicians who want to improve things are easily side-tracked by suggesting that their proposed policy is not the best way to improve things, and that some other way would be better. This explains to some degree a lot of infighting on the left, because many do want to genuinely help, but it's never 100% clear what the best way to help is. It also explains why the right can put aside major differences of opinion (2A is important to fight the government who can't be trusted, but support the troops and arm the police!) to achieve power, because acquiring and maintaining power is more important than exactly what you plan to do with it.
>2A is important to fight the government who can't be trusted, but support the troops and arm the police!
I fail to see the contradiction here. 2A proponents would say that 2A is there for when the government goes wrong, or "when in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another." At all other times, however, it would be up to the government to enforce the law and protect the people. Destroying the state is a different ideology.
(To be clear, the last few wars may not have been about protecting the people. But that the US has not been attacked since Pearl Harbor may be a result of the investment made in "defence" since then, as well as favourable borders etc.)
In any case 'both sides' have people who actually care about society. And there are people on the left who may simply want power, and complex people who seem to be a bit of both (for example perhaps Lyndon Johnson, depending on how you see him).
> politicians want to stay in power first, and only secondarily want to improve things.
In all honesty, many don't even want to improve things. Most people with power love power. It's contrary to their nature to change a system that confers power on them. That's not just in your own nation: in any nation, the people in power will be resistant to change.
That’s as close as you will get to a master narrative but it isn’t all of it.
Politicians aren't always sure what will win for them, often face a menu of unappetizing choices, and have other motivations too. (Quite a few of the better Republicans have quit in disgust in the last decade: I watched the pope speak in front of Congress flanked by Joe Biden, then VP, and John Boehner, then House Speaker, when the pope obliquely said they should start behaving like adults, and then Boehner quit a few days later and got into the cannabis business.)
I was an elected member of the state committee of the Green Party of New York and found myself arguing against a course of action that I emotionally agreed with, thought was a tactical mistake, and that my constituents were (it turns out fatally) divided about. It was a strategic disaster in the end.
You're right, I should have added that politics is also extremely difficult and filled with unpalatable choices. Each of the politicians I have met is an intelligent, caring person with a clear grasp of the issues.
And then you see what they do, and you wonder, what the...
You can never construct a representative image of the past. You are operating with a limited amount of sources which have survived in one form or another. They are not evenly distributed across time and space. There is an inherent “data loss” problem when a person dies - gone are all the impressions, unwritten experiences, familiar smells. Even a living person’s memory may not be reliable at one point.
Wikipedia is not meant to be an archive of all information. It's meant to be an encyclopedia of things that are notable [1], which is probably where the confusion comes from.
As you can imagine, the topic of what notability is has been discussed at length since Wikipedia's inception [2].
Sure, what you pay attention to will impact what you remember, but this experience goes further and shows how your attention can be manipulated to make you blind to plotted events.
It seems to me that Google Ngram isn't wrong. It's reporting statistics on the words it correctly identified in the corpus. The problem is the context of the statistics. You may somewhat confidently say the word "said" dips in usage at such and such time in the Google Books corpus. You can more confidently say it dips at such and such time for the subset of the corpus for which OCR correctly identified every instance of the word.
But you can't make claims in a broader context like "this word dipped in usage at such and such time" without having sufficient data.
And this is why sampling methodology is vastly more important than sample size when drawing inferential population statistics.
Sample 1 million books from an academic corpus, and you'll turn up a very different linguistic corpus than selecting the ten best-selling books for each decade of the 20th century.
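To see how far the mix alone can move the line, here is a toy calculation (every number is invented): per-genre usage of a word stays perfectly flat, yet the aggregate frequency dips and recovers purely because the corpus composition shifts toward academic writing and back:

```python
# Invented per-genre frequency of a word (occurrences per 10,000 words); it never changes.
freq_per_genre = {"fiction": 80, "news": 60, "academic": 15}

# Invented share of the corpus contributed by each genre, by decade.
mix_by_decade = {
    1940: {"fiction": 0.6, "news": 0.3, "academic": 0.1},
    1970: {"fiction": 0.3, "news": 0.2, "academic": 0.5},
    2000: {"fiction": 0.5, "news": 0.3, "academic": 0.2},
}

for decade, mix in mix_by_decade.items():
    aggregate = sum(freq_per_genre[g] * share for g, share in mix.items())
    print(decade, round(aggregate, 1))
# 1940 67.5, 1970 43.5, 2000 61.0 -- an apparent dip with zero change in actual usage.
```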
Classic mistake of not including zero on the vertical axis of a graph. If you're thinking "but then there won't be so much variation" you're right. Leaving zero off allows small variations to look large.
On the other hand there are the cases where you do want to emphasize small variations. In a control chart showing the fill weight of cereal boxes you certainly don’t want zero on the chart. Neither do you want to plot daily temperatures in a city on a chart that includes 0 Kelvin.
Exactly. A lot of investment market charts are zoomed in like that because small deviations can matter a lot, and you don't want the base price (or whatever measure you're looking at) to swamp the signal.
Including zero would have helped the "said" graph but not solved it—it just would still look like "said" dropped to almost 1/3 of its prior popularity, when what actually happened is the makeup of the sample changed dramatically.
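As a quick illustration of both points (the numbers below are toy values loosely shaped like the "said" curve, not real data): with an auto-scaled axis the dip looks catastrophic, and even with a zero baseline the line still appears to fall to roughly a third of its peak, which is the sample-composition problem rather than a scaling one:

```python
import matplotlib.pyplot as plt

years = list(range(1900, 2001, 10))
freq = [0.29, 0.28, 0.27, 0.25, 0.22, 0.18, 0.15, 0.12, 0.11, 0.20, 0.28]  # toy values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(years, freq)
ax1.set_title("Auto-scaled y-axis")   # small changes look dramatic
ax2.plot(years, freq)
ax2.set_ylim(bottom=0)
ax2.set_title("Zero baseline")        # same data, calmer picture, dip still visible
plt.tight_layout()
plt.show()
```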
Is this that the n-grams are wrong, or that they are limited in what you can do/say with them? I find the data fun, but I'm not entirely sure what to make of it. You will be querying past books with today's lexicon, which just feels wrong.
As an easy example that I know, if you search for "þe", you will not find a lot of hits. Which is mostly fair, as historically we know that "þ" dropped off around the 1400s. That said, add in "ye" and you see a ton of its use.
Is that an intentional feature of n-grams? Feels more like an encoding mistake passed down through the ages. Would be like getting upset at the great vowel shift and not realizing that our phonetic symbols are not static universal truths.
While the point made by the authors is certainly a valid one, it's a bit sneaky and not very fitting to their overall message that the Y-axes on their ngram graphs don't start at zero. This makes the Google results seem more extreme than they in fact are and is a bit of misdirection in itself.
Compare e.g. to the actual ngram viewer, which seems to start at zero by default:
Kind of. The author could fix a lot of their problems with the very prominent dropdown above the graph letting them select the collection— English fiction for example. The long s character can be tricky for OCR, but is not likely relevant to most people's casual use of the tool. I worked on a team that overcame it in a high volume scanning project so they should be able to correct that with software and their existing page images. The plurals criticism is just wrong— you can even do case sensitive searches.
It's not perfect, but it's not useless, and it's not a "lie"— it's just a blunt instrument. Even if the criticism was factually correct, 'proving' that you can't do fine work with a blunt instrument is of dubious value.
I think a lot of folks around here are super thirsty to see big tech companies get zinged and when it happens, their fact checking skills suffer.
The n-grams aren't wrong, but it is a real problem that the underlying corpus distribution changes massively over time (in this case, proportion of academic vs. non-academic works).
This is a really devilish problem with no easy answer.
Because on the one hand, it's certainly easy enough to normalize by genre -- e.g. fix academic works at 20%, popular magazines at 20%, fiction books at 40%, and so forth.
But the problem is that the popularity of genres changes over time separately in terms of supply and demand, as well as consumption of printed material overall. Fiction written might increase while fiction consumed might decrease. Or the consumption of books might decrease as television consumption increases.
So there isn't any objectively "right" answer at all.
But it would be nice if Google allowed you to plot popularity by genre -- I think that would help a lot in terms of determining where and how words become more or less common.
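The fixed-weight normalization mentioned above is trivial to compute once you have per-genre counts (which Google doesn't expose this way; all names and numbers below are hypothetical). The code isn't the hard part; choosing defensible weights is:

```python
# Hypothetical fixed genre weights, echoing the 20/20/40 example above.
weights = {"academic": 0.2, "magazines": 0.2, "fiction": 0.4, "other": 0.2}

def normalized_frequency(counts_by_genre, totals_by_genre):
    """Weighted average of per-genre frequencies, so a flood of newly scanned
    academic text cannot move the aggregate on its own."""
    return sum(weights[g] * counts_by_genre[g] / totals_by_genre[g]
               for g in weights)

# Invented numbers for one year: word occurrences and total words per genre.
counts = {"academic": 1_500, "magazines": 6_000, "fiction": 8_000, "other": 3_000}
totals = {g: 1_000_000 for g in weights}
print(normalized_frequency(counts, totals))  # 0.0053, i.e. ~53 uses per 10,000 words
```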
HN in general doesn't like "editorialized" titles. HN titles are meant to be a factual representation of what you are going to read, without the attention-grabbing (albeit clever) title.
Both your and GP's comments are inaccurate and/or unclear.
HN prefers but does not require the original title.
HN does not permit submitter editorialising.
Where the original title is clickbait, which may include editorialising, HN requests that submitters change the title, if at all possible to some phrase within the article.
Another de facto rule concerns "title fever", which is when a title is so distracting that it overwhelms the content of the article in discussion.
From the guidelines:
If the title includes the name of the site, please take it out, because the site name will be displayed after the link.
If the title contains a gratuitous number or number + adjective, we'd appreciate it if you'd crop it. E.g. translate "10 Ways To Do X" to "How To Do X," and "14 Amazing Ys" to "Ys." Exception: when the number is meaningful, e.g. "The 5 Platonic Solids."
Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize.
What is it specifically about the 1970/80s that causes this dip? Was there an explosion of this academic writing around that era or something else to have this effect?
The title is true for a lot more areas of life than linguistics. There are no shortcuts to truth, DVD anyone who tries to offer you one is probably trying to sell you something.
The redditification of HN is sad. With reddit de facto purging third-party apps with increased API prices, we now see reddit-tier conversations spamming message boards like HN.
I don't have any clue what it is supposed to mean. I very rarely landed on a reddit page through my searches in my entire life, and as far as I'm concerned, it could have never existed and it would not have changed anything in my direct experience of the web, just like Twitter, to give another example of popular stuff that I just don't care about.
So, your "redditer detector" produced a false positive, it seems. :)
It's probably supposed to be "and" instead of "DVD". Both words have a similar shape on the keyboard, especially if you're doing swipe-style smartphone keyboard input.
I think this is one of the one-liners that sound good, but is bogus at closer inspection.
That article talks about history. In that context it might make sense, as it is hard to say something with certainty.
But in every speech I can say things with certainty without lying.
If we furthermore drag the word certainty out of a philosopher's grip and apply a layman's meaning to it, then many things are certain, as the word can also mean commitment.
In every speech you can say some things with certainty without lying.
But I think the point of the saying is in the other direction. If you are listening to a speech, the things that the speaker can say with certainty may not be the ones where you want certainty. And if you demand certainty on those things, you will find those who will give it to you. But the certainty itself is a lie - that's why the speaker can't (honestly) say those things with certainty.
What is the optimum political program for the United States? There are plenty of people willing to tell you with (apparent) certainty what the answer is. The truth is that nobody knows with certainty, and so the answers that sound certain are lies. The actual program may be correct - may be - but the certainty itself is a lie.
This is often true in linguistics, and history, and politics, and economics. Don't demand certainty where there is none.
I've seen people who strongly crave (a feeling of) certainty prefer simplified categorizations and false absolutes to complexity that doesn't offer absolute certainty and discrete clarity.
Similarly, some things aren't readily quantifiable, and in some cases any quantification might be a great oversimplification at best. In those cases wanting a quantified and measurable answer instead of a more complex answer with less (of a feeling of) certainty can amount to wanting a lie. Or at least to wanting an answer that feels a lot more certain and true than it actually is.
I think that's what the post is about.
Of course the title isn't absolutely true either. Of course you can say and find things that are true and (to a good approximation) certain. But that's not really what the post or its title are trying to say.
This hits close to home with all the appeals to authority over the last few years. With absolute confidence they were holders of the truth, "trust the science!".
Kinda, but most of the anti-scientific bullshit out there is a symptom of precisely this phenomenon. Actual science cannot offer absolute certainty, so people reach for whatever alternate theory offers the feeling of certainty. Blind faith in "the science" kind of works, and even gets pretty decent practical results, but you know what's structurally really hard to disprove and thus amenable to feeling certain? Conspiracy theories!
I hear what you're saying. In the end, we have to believe something -- on less than perfect information.
But understanding human nature isn't a conspiracy theory. And accepting obviously overreaching statements of "fact", that literally nobody had the data to state unequivocally, is not following the science.
It wasn't so long ago that most people understood big pharma was a profit seeking machine that wasn't primarily motivated by what is best for humanity. Overstating the risks of Covid, and pretending that we faced an existential threat, made everyone forget that truth, and unquestioningly believe that only the purest of intentions motivated the industrial/media response.
> we have to keep our actual belief in line with the evidence.
That's what everyone does.
Just with varying degrees of success and with differing levels of intellect and experience. But we are all faced with the same conundrum of evidence being less than perfect. Everything comes down to a best-guess in the end. Even for the most rigorous scientist, all conclusions are provisional, and susceptible to the emergence of new evidence.
> That's what everyone does. Just with varying degrees of success ...
If by "varying degrees of success" you mean "mostly abject failure", I guess we can agree. But no, not everyone does that. Most people broke in the early pandemic, either toward trusting "the science" or toward weird bullshit.
I'm happy to have LLMs now but in the future I'm going to be more concerned about the source of their training. I can see regulation coming that requires that LLM companies reveal those sources.
*Edit: I initially thought that saying ‘diachronic change’ was like saying ‘three-sided triangle’. But thinking about it, I suppose things do change in space but not time, e.g. ‘the pattern changes abruptly’.
I'm going to use that title on the next conversations I have about estimates, in particular in the context of 'we need to know that this piece of work will be started in 4 months and finished in 8'. Those conversations definitely ſuck for me.
Only one goal can be first. If you want to set absolute dates, all other requirements must be subordinate to that. In which case, sure, we can absolutely meet it.
I'm pretty sure that you are correct. Or at the very least it is a reference to that specific aphorism. The title is far too idiomatically Latin (if you overlook the awkwardness with the syntactic subject) to be a coincidence.
I personally feel like more people will click with this new title. The old one was far too vague and ambiguous for a news aggregation site. I thought the old title would be about scientific papers and trying too hard to get definitive answers out of them.
I think we're getting into matters of definition. Do I count on HN to stay aware of current events? No; it would be a very incomplete picture. Much of HN has nothing to do with current events.
> This only has to do with mismatch of the scope of HN and the scope that you are interested in.
It's not scope mismatch, it's that HN doesn't present a (nearly, roughly) complete picture of current events in any scope (other than the scope of itself, of course).
I wouldn't mind a reach-around. I mean, if you're offering.
Otherwise, OP's right. This isn't news agg. It's news talky-talk. There is a high degree of back and forth, without all of the mess of that other place. The back and forth requires intellectual curiosity. It's a prerequisite.
Horses for courses, but to me the original title was the forest and the stuff about Ngrams was the trees. As such I found TFA interesting, even though I have no interest in Ngrams or whether they're correct (which is why I definitely would not have clicked on the current title).
I, uhhhh.....I would like to know what TFA is meant to stand for, because I assume it is not "the ſucking article", but that was my first thought. Maybe "featured"? Google is only giving me "Teach For America" or "Trade Facilitation Agreement".
I feel that the presence of this term here means that HN is the successor to the venerable Slashdot. Kind of comforting that there’s a straight line from the site that I spent so much time on 20 years ago, to this one.
The article title is certainly provocative, yes, and that’s the problem. Do you want clickbait titles? The article’s title is a combination of a platitude, an inaccurate and/or irrelevant statement, and an implied inflammatory accusation. Swapping the title for the more accurate more informational less provocative first line is much better for me, but maybe true that not flinging around the word “lies” could result in fewer clicks.
the word "clickbait" is flung around way too readily these days. a good title is supposed to make you want to read the article, and at its best it is an artistic flourish that enhances the overall piece. and personally, i love that. i enjoy seeing how writers (or editors) come up with good titles, and the fun and interesting ways they relate to the text of the piece. i enjoy when the title is clearly an allusion or reference to something, and chasing it down leads me to learn something new. and i even enjoy when the title is just a pun or play on words, because writers live for moments like that :)
in this case i definitely felt "wow, that's an interesting quote, and i can see what they are getting at. let's read the article to see how it's substantiated or used as a springboard".
clickbait is more "we have some amazing!!!!! information to tell you but to find out what you will have to read the article", e.g. the classic listicle format "10 things we imagined a beowulf cluster of - number 4 will shock you!", the spammy "one weird trick doctors don't want you to know" or the tabloid "john brown's shocking affair!". and yes, that sort of thing is a plague on the internet and i would not like to see more of it, but also that is not what is going on here.
I agree with everything you said in general, and I also enjoy good titles. Do you feel like the article substantiated the quote? I don’t think it even came close. Where does it link a “lust” for certainty with lying?
This is admittedly a subtle point, but I’d be perfectly fine with starting the article with the same quote, attributed to someone, as a decorative introduction. That’s a pretty common writing device. (And importantly in that case, the quote doesn’t need to be substantiated.) It’s just using it for the title in this case that rubs me the wrong way.
Using the word “lies” is almost never good, especially when you are explicitly criticizing someone or something. IMO using “lies” is more or less equivalent to your example “number 4 will shock you”, use of that word is designed to invoke the same response. They stopped a hair short of literally stating Google is lying, but the implication combined with the first line of the article is very strong. One real problem with such an implication is that it may itself be wrong. It’s presuming active dishonesty when the problem could easily be a mistake. When putting these things together with the article’s misuse of the Y axis to again make emotional but not necessarily accurate points, I still think “clickbait” is warranted here - this writing is being a tiny bit manipulative.
To me the title reads very differently - it's saying that if you demand certainty you'll wind up treating something uncertain as certain, and hence believing something untrue.
For reference incidentally, the title is a callback to the last line of a previous post[1] by the same author, on an unrelated topic. So it's presumably meant to be less a statement about Ngrams, and more a recurring theme in the author's views on language.
yes, i thought the article substantiated the quote pretty well, though i get the feeling that i did not interpret the "lies" in the title the same way you did. i read it more as "something incorrect" than as a necessarily deliberate falsehood, and what the title was saying as "if you want 'certainty' then you need to be simplistic and reductive enough that you are discarding any hope of an actually correct answer".
yes, it absolutely isn't. it's a regular well-crafted title that conveys the flavour of the article and doesn't hint at must-see secret knowledge within.
This depends on your interpretation, so it is not absolute in any sense. I can see your interpretation of a cute quote as long as you take it out of context, but I came to a different conclusion because I think context matters and that this title, whether intentional or not, and whether deserved or not, is easily interpreted as a direct criticism of Google Ngrams, in this context.
I don't think "Ngrams are wrong" is what TFA is about. The author isn't an expert on Ngrams and he's not sharing any new information about them; what he's really talking about is how data about language is unreliable, and why Ngram images are on his site even though he knows they're flawed. Personally, I found the original title truer to the article than the current one.
Calculating story points based on the size of the t-shirt you're wearing that day might actually yield more realistic results. In my case every story would become an XL! :)