Honestly it's worse than this. A good lab biologist/chemist will try to use it, understand that it's useless, and stop using it. A bad lab biologist/chemist will try to use it, think that it's useful, and then it will make them useless by giving them wrong information. So it's not just that people over-index when it is useful, they also over-index when it's actively harmful but they think it's useful.
You think good biologists never need to summarize work into digestible language, or fill out multiple huge, redundant grant applications with the same info, or reformat data, or check that a writeup accurately reflects the data?
I’m not a biologist (good or bad) but the scientists I know (who I think are good) often complain that most of the work is drudgery unrelated to the science they love.
Sure, lots of drudgery, but none of your examples are things that you could trust an LLM to do correctly when correctness counts. And correctness always counts in science.
Edit to add: and regardless, I'm less interested in the "LLMs aren't ever useful to science" part of the point. The more important point is that actual LLM usage in science will mostly be in cases where they seem useful but actually introduce subtle problems. I have observed this happening with trainees.
Much of the coverage of this story has credulously repeated the company's marketing press release without any sort of critical appraisal of the actual significance of the announcement (read: not much) or of the stated goal, which is misguided in the extreme. I found this (older) article about the "de-extinction" project [1] to be much more informative. The same journalist covered the new announcement last week [2] (submitted to hn here: [3]).
Aren't false positives acceptable in this situation? I'm assuming a human (paper author, journal editor, peer reviewer, etc.) is reviewing the errors these tools identify. If there is a 10% false positive rate, then the only cost is the wasted time of whoever needs to determine that it's a false positive.
I guess this is a bad idea if these tools replace peer reviewers altogether, and papers get published if they can get past the error checker. But I haven't seen that proposed.
You'd win that bet. Most journal reviewers don't do more than check that data exists as part of the peer review process—the equivalent of typing `ls` and looking at the directory metadata. They pretty much never run their own analyses to double check the paper. When I say "pretty much never", I mean that when I interviewed reviewers and asked them if they had ever done it, none of them said yes, and when I interviewed journal editors—from significant journals—only one of them said their policy was to even ask reviewers to do it, and that it was still optional. He said he couldn't remember if anyone had ever claimed to do it during his tenure. So yeah, if you get good odds on it, take that bet!
Note that the section with that heading also discusses several other negative features.
The only false positive rate mentioned in the article is more like 30%, and the true positives in that sample were mostly trivial mistakes (as in, having no effect on the validity of the message). That sample was also preprints that had not been peer reviewed, so one would expect the false positive rate to be much worse after peer review (the true positives would decrease while the false positives remain).
And every indication both from the rhetoric of the people developing this and from recent history is that it would almost never be applied in good faith, and instead would empower ideologically motivated bad actors to claim that facts they disapprove of are inadequately supported, or that people they disapprove of should be punished. That kind of user does not care if the "errors" are false positives or trivial.
Other comments have made good points about some of the other downsides.
People keep offering this hypothetical 10% acceptable false positive rate, but the article says it’s more like 35%. Imagine if your workplace implemented AI and it created 35% more unfruitful work for you. It might not seem like an “unqualified good” as it’s been referred to elsewhere.
It depends on whether you do stuff that matters or not. If your job is meaningless, then detecting errors with a 35% false positive rate would just be extra work. On the other hand, if the quality of your output matters, a 35% false positive rate seems like an incredibly small price to pay if the tool also detects real issues.
Lots to unpack here but I'll just say that I think it would probably matter to a lot of people if they were forced to use something that increased their pointless work by 35%, regardless of whether their work mattered to you or not.
> is reviewing the errors these tools are identifying.
Unfortunately, no one has the incentives or the resources to do doubly, triply thorough fine-tooth combing: no reviewer or editor is getting paid; tenure-track researchers who need the service-to-the-discipline check mark in their tenure portfolios also need to churn out research…
I can see its usefulness as a screening tool, though I can also see downsides similar to what maintainers face with AI vulnerability reporting. It's an imperfect tool attempting to tackle a difficult and important problem. I suppose its value will be determined by how well it's used and how well it evolves.
Being able to have a machine double check your work for problems that you fix or dismiss as false seems great? If the bad part is "AI knows best" - I agree with that! Properly deployed, this would be another tool in line with peer review that helps the scientific community judge the value of new work.
I don't see this as a worse idea than an AI code reviewer. If it spits out irrelevant advice and only gets 1 out of 10 points right, I still consider it a win, since the cost is so low and many humans can't catch subtle issues in code.
As someone who has had to deal with the output of absolutely stupid "AI code reviewers", I can safely say that the cost of being flooded with useless advice is real, and I will simply ignore them unless I want a reminder of how my job will not be automated away by anyone who wants real quality. I don't care if it's right 1 in 10 times; the other 9 times are more than enough to be of negative value.
Ditto for those flooding GitHub with LLM-generated "fix" PRs.
> and many humans can't catch subtle issues in code.
That itself is a problem, but pushing the responsibility onto an unaccountable AI is not a solution. The humans are going to get even worse that way.
You’re missing the bit where humans can be held responsible and improve over time with specific feedback.
AI models only improve through training and good luck convincing any given LLM provider to improve their models for your specific use case unless you have deep pockets…
That's not a scare quote. It's just a proposed subtext of the quote. Sarcastic, sure, but not a scare quote, which is a specific kind of thing. (From your linked Wikipedia article: "... around a word or phrase to signal that they are using it in an ironic, referential, or otherwise non-standard sense.")
Right. I don't agree with the quote, but it's more like a subtext thing and it seemed to me to be pretty clear from context.
Though, as someone who had a comment flagged a couple years ago for a supposed "misquote" I made in a similar form and style, I think hn's comprehension of this form of communication is not super strong. Also, the style more often than not tends towards low-quality smarm and should probably be used sparingly.
This is my knee-jerk reaction when people complain about wayland as well, and is often a correct reaction, but in this case the author really is talking about nontrivial wayland changes.
I have been mostly-wayland for a while now and rarely have issues. I think the major pain point remaining is the fragmentation of the compositors, so that bugs (and features) depend strongly on your choice of desktop environment.

I found a bug a couple years ago in sway on a thinkpad where holding down the physical mouse button (associated with the trackpoint device) and dragging on the clickpad would send button-up events every time I lifted my finger from the clickpad (even while the mouse button was held down). This turned out to be the responsibility of wlroots, in a block of basically boilerplate code for translating libinput events. Meanwhile mutter was doing this correctly, so the wlroots developers pulled the fix from there (almost identical code). Some time later I switched to KDE and found that KWin also had the bug (fixed now after another bug report).

The end result is that it's difficult to track down the source of unexpected behavior without intimate knowledge of your DE (whether the behavior was intended by the developers or not), and getting a bug fixed in one place is unlikely to fix it for everyone without a bunch of extra work. Like I said, I don't encounter a lot of these bugs, but there are a couple that I have been putting off tracking down because it feels like it will be a lot of work.
That's all well and good except that we look out and see all of the actively bad uses being hyped as the way of the future, at untold expense in both dollars and energy. The LLM is just a model that is what it is, bashing it doesn't make sense. People are bashing how it is used, both currently and in prospect.