Am I blind or is there no mention at all of the GPT model he used?
The author states his conclusions but doesn't give the reader the information required to examine the problem.
- Whether the article to be summarized fits into the tested GPT model's context size
- The prompt
- The number of attempts
- He doesn't always state which information in the summary, specifically, is missing or wrong
For example: "I first tried to let ChatGPT one of my key posts (...). ChatGPT made a total mess of it. What it said had little to do with the original post, and where it did, it said the opposite of what the post said." He doesn't say which statements of the original article were reproduced falsely by ChatGPT.
My experience is that ChatGPT 4 is good when summarizing articles, and extremely helpful when I need to shorten my own writing. Recently I had to write a grant application with a strict size limit of 10 pages, and ChatGPT 4 helped me a lot by skillfully condensing my chapters into shorter texts. The model's understanding of the (rather niche) topic was very good. I never fed it more than about two pages of text at once. It also adopted my style of writing to a sufficient degree. A hypothetical human who'd have to help on short notice probably would have needed a whole stressful day to do comparable work.
You write as if you’ve found a hole in the article’s argument. The lack of evidence is a hole in the reporting, for sure. The tone of your comment suggests you feel that by not publishing all their evidence, the author’s point is wrong (rather than under-justified). However, the example you use to back up your point also backs up the article’s point. The article’s point is that ChatGPT doesn’t summarise, it only shortens. Your example indicates shortening, but not summarising.
There are just so many articles of people whining about how ChatGPT can't do things, when they clearly haven't prompted it very thoughtfully.
So I think that’s why you see so many reactions like this.
I’ve found ChatGPT incredibly good at all sorts of things people say it is bad at, but you need patience, and you really need to figure out the boundaries of the task and keep adding guidance to the prompt to keep it on track.
The article makes it clear that there is a semantic difference between shortening and summarizing and that, importantly, summarizing requires understanding, which ChatGPT most certainly does not have.
One example in the article is that if you have 35 sentences leading up to a 36th sentence conclusion, ChatGPT is very likely to shorten it to things in the earlier sentences and never actually summarize the important point.
You seem to be on the "statistical next token predictor" side. I'm more on the side of those who invented it (they should know), who think these machines can understand things.
In 1964, Joseph Weizenbaum created a chatbot called "Eliza" based on pattern matching and repeating back to users what they said. "He was surprised and shocked that some people, including Weizenbaum's secretary, attributed human-like feelings to the computer program." People are notorious for anthropomorphizing and attributing to things qualities (including human-like qualities) that they do not possess. [1,2] LLMs are a "statistical next token predictor" by design. The discovery that coherent and interesting communications are relatively easily statistically modeled and reconstructed, given enough computing power and a large enough corpus of training data, does not therefore imply that these programs have latent thinking and understanding capabilities.
Just the opposite: it calls into question whether _we_ have thinking and understanding capabilities or whether we are complicated stochastic parrots. [3] The best probing of these questions is done at the limits of comprehension and with unique and previously unseen information. I.e., how do you comprehend and respond to previously unseen/unfelt/not-understood qualia? Not by looking at how you deal with the mundanity of interactions between people (which are somewhat trivial to describe and model). [4]
At what point does it become easier to just do the task yourself? I’ve pondered this question often and came to the conclusion that, at the current level of output, it’s not worth it for me to tinker with it until I get sensible responses.
It depends on the task. Sometimes I have just given up when it really can’t get something.
But other times I’ve persevered and once it’s ‘got’ it, it can then repeat it as many times as I need. That’s the knack really. Get it to the point of understanding and then reuse that infinitely and save yourself a lot of time.
In the example I mentioned, ChatGPT 4 did keep all essential statements of my texts when reproducing shorter versions of them. For example, it often wrote one high-level sentence which skillfully summarized a paragraph of the original text. As far as I understand, this is what the author meant by 'summarizing' vs. 'shortening (while missing essential statements)'.
I was impressed at those high-level summaries. If I had assigned this task to several humans, I'm not sure how many would have been able to achieve similar results.
For example, looking at the ChatGPT link the author shares, the model loaded 5 pages besides the one the author wanted. That is clearly going to cause some issues, but the author didn't modify the prompt to prevent it. It was also a misspelled five (?) word prompt.
I don't see how you can draw conclusions from a model not reading your mind when you give it basically no instructions.
You need to treat models like a new hire you're delegating to, not an omniscient being that reads your intent on its own.
Why, if the author asks it to summarise a single webpage and gives the link, should ChatGPT go out and load 5 more? (One is the same page again; the others are short overview pages, so they won't have influenced the result much.)
And why all this talk about trying to engineer a prompt so that in the end the result is good? Should an actual usable system not just handle "Please summarise [url/PDF]"? That is, I suspect, what people expect to be able to do.
"Summarize" clearly means something different to the author than to the people who think the model results are good. Everyone expects different things. Most people are used to others knowing their preferences and adjusting over time. Models do not, unless you tell them.
To be fair, most of the commentary on both sides of the LLM conversation is pretty anecdotal, which is increasingly looking like a structural problem given that any solid evidence goes into the training set in about an hour.
Definitely. Otherwise it would have required a lot more than a single blog post. It is an observation, not anything rigorous with a large number of examples and decent statistics.
In the comments, the author clarified that he used GPT-4 for the article.
> What the colleague used, I can ask, but I suspect standard ChatGPT based on GPT-4. But my test was with GPT-4 (current standard), so that would mean about 8000 tokens (or roughly 4000 words, I think?). That may have influenced the result.
There's a fundamental problem with all these "summary" tasks, and it's obvious from the disclaimer that's on all these AI products: "AI can be wrong, you must verify this".
A summary for which you must always read the un-summarized text is useless as a summary, this should be obvious to literally everyone, yet AI developers stick their heads in the sand about it because RAG lets them pretend AI is more useful than it actually is.
RAG is useless, just fucking let it go and have AI stay in its lane.
No? They are asking whether, in the limit as the amount of money spent in a particular way goes to infinity, AGI would thereby be achieved. They aren’t just asking “If I had infinite capital, would that be sufficient to achieve AGI by some method?”.
That's why you use a local model instead; that way you're already out of money after buying the GPUs and don't have to bother with the implementation *taps temple*
> There's a fundamental problem with all these "summary" tasks, and it's obvious from the disclaimer that's on all these AI products: "AI can be wrong, you must verify this".
> A summary for which you must always read the un-summarized text is useless as a summary, this should be obvious to literally everyone
Nah, it's still useful if the summary is usually right or mostly right. At the limit, it's not even clear that something can be summarized perfectly.
Consider that the alternative to reading the summary often isn't reading the entire text yourself. It's reading nothing.
Also, in my experience, these tools often fail when it comes to questions with a definitive answer. E.g. if you pass them a lot of text and ask a detailed question with a very clear answer, they often get it wrong. But when your question is vague like "summarize the text," they're very useful.
You can just assume it's right or mostly right and move on with your life. We're not talking about designing spaceships here. Being wrong is allowed.
> Reading an inaccurate summary is actually less useful than reading nothing. It's not like the utility of the summary is to exercise a reading muscle.
No it's not. You're acting like these tools generate wildly inaccurate text. Even when they're wrong, they're mostly accurate. Almost everything people read is a mostly accurate summary, whether it's from some guy's article or Wikipedia. We rarely go to the primary source.
edit - as a concrete example, I did my taxes this year with help from ChatGPT. It was a big improvement over using Google or reading through the instructions myself. And if it was wrong, well, maybe I'll get a bill or a check in the mail, but that was always a possibility and making a mistake on your taxes isn't illegal.
>> And if it was wrong, well, maybe I'll get a bill or a check in the mail, but that was always a possibility and making a mistake on your taxes isn't illegal.
If this is your approach to your taxes then you are probably wasting your time by using either Google or ChatGPT. Just punch some numbers into TurboTax based on a quick skim of the buttons, say “good enough for government work,” and wait for a check or a bill in the mail, baby.
But you want to pay someone for software that doesn't do your taxes but looks like it does? Or are you using someone else's ChatGPT subscription to do your taxes?
So it's not that you don't want to pay someone, it's that you get so little value out of accurate tax calculations that you only want to pay a tiny amount.
Just because you can amortize the cost of ChatGPT over more operations doesn't mean you aren't paying for it. You said you didn't want to pay someone for software to do your taxes, not that you wanted to pay only a little.
This is correct. There is a social context to this matter. The advocates of 'almost right in most cases is OK, no harm done' are ignoring the (likely) operational and utility context of these tools.
> Consider that the alternative to reading the summary often isn't reading the entire text yourself. It's reading nothing.
But that could actually be a good alternative. Sometimes it's simply not worth it. You can lose more time with a tool that gives you an uncertain summary.
Do you find yourself refusing to read articles or papers unless you’re first 100% sure that everything stated in them is absolutely true? Having to approach something with a certain amount of skepticism is not something that was newly introduced with LLMs.
This is just a trust issue, which applies to pretty much any task where there is delegation.
If you ask an intern to summarize some text, you trust them to do a half decent job. You're not going to re-read the original text. The hiring process is meant to filter out bad interns.
If an intern gets it wrong, then you can sit down with them and teach them the correct process. Hopefully over time they get better, and once the trust is built you then stop being so involved. If they don't get any better, you find someone else to summarize for you.
You can't go through this process with an AI, every time it's just a shot in the dark.
With a specific model (just like with an intern), you only need to evaluate their work a certain number of times to decide whether they're doing their job well enough to leave them alone and continue without further supervision.
> If you ask an intern to summarize some text, you trust them to do a half decent job
Yes.
And if I were to propose that we filter all information we consume through completely unqualified interns summarizing it, I'd be laughed out of the room.
Yet that is the future all these AI firms seek to build.
What would be an acceptable error rate for your use case?
There are situations where AI is good enough, other cases where you need more accuracy, and others still where you should be reading the reference directly.
AI is improving quickly though, and context windows will allow for summaries to be tailored to each end user.
> RAG is useless, just fucking let it go and have AI stay in its lane.
IMHO a side effect of promoting RAG is that vector search by itself (on chunks of documents) might be good enough for most people. If we create a system without the LLM-summarization part, it might be the best of both worlds. Alas, people don't actually care about that stuff.
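As a minimal sketch of what that LLM-free path might look like (TF-IDF standing in for a real embedding model; purely illustrative, not any particular product):

```
# Minimal retrieval-only "answer engine": return the best-matching document
# chunks for a query, with no LLM summarization step. TF-IDF stands in for a
# real embedding model here; purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_chunks(query, chunks, k=3):
    vectorizer = TfidfVectorizer().fit(chunks + [query])
    scores = cosine_similarity(vectorizer.transform([query]),
                               vectorizer.transform(chunks))[0]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]
```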
I am more curious why this disclaimer is missing on human-backed products: "Humans can be wrong, you must verify this". How do they overcome the fact that the information they provided can be wrong?
What would you say to the standard counterargument that most existing processes that AI might aim to augment or replace _already_ have a non-zero error rate? For example if I had a secretary, his summaries _could_ be wrong. Doesn't mean he's not a useful employee!
The standard processes don't fail in the manner that AI does - they don't randomly start inventing things. Your paralegal might not give you great case law, but they won't invent case law out of thin air.
I regularly work with a wide variety of project managers, product owners, secretaries, etc…
I swear that most of them willfully misunderstand everything they’re told or sent in writing, invariably refusing to simply forward emails and instead insisting on rephrasing everything in terms they understand, also known as gibberish that only vaguely resembles English.
Yeah it's asinine if you think about it for more than a few seconds. The implication is that there is no nuance. Humans are imperfect and AI is imperfect so therefore they are equivalent.
I read through your entire article and the three main points I took away from it were also contained in the gpt4o summary I then generated to compare afterwards. So here's some empirical counter evidence.
I would suggest a less strong but more plausible claim: that GPT-4o has trouble summarizing longer-form content outside the bounds of its context window, or that something like a lossier attention mechanism is being used as a compromise for resource usage.
> I would suggest a less strong but more plausible claim: that GPT-4o has trouble summarizing longer-form content outside the bounds of its context window, or that something like a lossier attention mechanism is being used as a compromise for resource usage.
This extends to your human readers.
One of the more useful AI tools I’ve built for myself is a little thingy that looks at a piece of writing and answers ”What point is this article making?”. If the AI gets it wrong, I know my readers will also misunderstand what I’m saying. Back to the drawing board.
The problem with making a subtle point that hinges on 1 detail in a vast sea of text is that 80% of humans will also miss that 1 detail.
> I read through your entire article and the three main points I took away from it were also contained in the gpt4o summary I then generated to compare afterwards. So here's some empirical counter evidence.
But evidence of what, precisely? How do we measure failure and what failure rate is sufficient?
As is, would you be comfortable with doctors applying it to all of your medical records?
Medical records are not my bar though, nor would I be comfortable giving my medical records to 99% of the human race. The bar these tools need to meet for me is far more innocuous (code and apparently summarizing fluff articles) and two things can happen at the same time:
1. Models, and not to be understated, the engineering around them can improve until they pass the point we are comfortable using them in high stakes situations like medical records.
2. People's expectations come down until we settle on the agreed upon tasks where LLMs provide real value.
My perspective is 1 will happen in the long term and 2 is what we should be focusing on right now to provide real immediate value with an eye on 1 so we reap the continued improvements.
For 2, calibrating people's expectations is going to be messy, and people rarely estimate the utility of something correctly in the early days, either over or under, which is the reason for the mess.
LLMs have been in the works for almost a decade (building on research that's been around since at least the 70s), but their utility has only been apparent for about two years.
We're still super early days for that settling period.
Interesting. I read through the ChatGPT summary first, it seemed very plausible, then I read the original and I kinda see the author's point. The ChatGPT summary basically did gloss over every important detail in the original - but then, the details weren't important for generating a summary.
I think that one of the author's key premises is false:
> To summarise, you need to understand what the paper is saying.
A summary is not about the author or about the summarizer, it's about the reader. It's about picking up on the portions of the original work that will matter to the reader's estimation of whether it's important to read the full work. And that actually depends much more on context and how the work relates to other works than it does about the specific details contained in the work. For example, Betteridge's Law of Headlines [1] basically provides a summary of any article whose title ends in a question, that summary is just "No", and it does so by making an observation about the authors rather than the content of the articles (about which it's completely agnostic).
It reminds me of the problems that plague AI sentiment analysis. Machine learning is actually very good at that task, but you top out at about 70% precision because humans top out at 70% precision on sentiment analysis. At best, people only agree with each other ~70% of the time when judging the sentiment of a piece.
- You're assuming that the goal of a summary is to decide whether one needs to read the full work. That's certainly one niche usage of a summary, but is definitely not the only one and is likely a minority of usage. Most summaries I've needed, from school to my career, were about not having to read the full work.
- I would say Betteridge's Law of Headlines does not provide "No" as a summary to all of these articles. The accurate summary would be "[Title]? No.", since it seems fairly obvious that the word "No" is very light in information conveyed.
On a personal note, I have to say I can't see the point in this current craze about summarizing everything. I never saw the point of those subscription programs which promise you'll "read" a book a week/a month because they'll send you a 5-min audio about the book.
I think you're better off choosing one really awesome book a year and actually reading that.
So I can't see how a (flawed or not) ChatGPT summary will provide any epiphany on the level you'd get from consuming fewer works, thoroughly.
> the three main points I took away from it were also contained in the gpt4o summary
I think this is the key to why some people like LLM summaries while others don't.
Two people can read an article and take away different points from it. If the LLM summary contains the points you would have taken, you like it. If it doesn't, you don't.
(E.g. if I look at your Kindle highlights for a book and compare them to mine, they'll be very different - this is why I find it hard to use a service like Blinkist - but I think a good LLM like GPT-4o or Claude 3.5 Sonnet does as good a job as a Wikipedia article would... Not sure what else people expect.)
We need a lot more of this kind of in depth analysis. Right now the cheering on of AI is overwhelming. Criticism is often suppressed, both on the vendor side who just want to sell, and on the client side who have very strong FOMO.
I work for the client side and this bothers me a lot. It's very hard to get a true honest value analysis done with all the sales influence and office politics going on.
If you complain that the current generation of LLMs isn't as impressive as some people try to prove it is, the shills complain that you used the wrong prompts and maybe you should fine-tune the model. I don't think people should spend much time trying to write a prompt to find a working solution, because at that point they could just do the task themselves. In the author's case, if he had used the summary provided by ChatGPT, I doubt anyone would have realized it was wrong, but that doesn't mean it's good either.
The problem with LLMs is the lack of reliability and coherence. If it gives you a wrong answer and you ask it if it's sure, in most cases it will show another wrong answer and you need to go through multiple of these hops to get something fine.
This "sorry, I was wrong, here's another wrong answer" is basically my only experience with llm. But it also onpy works if I know the answer or the answer I get is really stupid. Luckily the latter is common.
> If you complain that the current generation of LLMs isn't as impressive as some people try to prove it is, the shills complain that you used the wrong prompts
I think the go to is everyone said the same thing about the internet. Look where we are now.
Only after the bursting of a very big hype bubble, remember?
AI is also different in the sense that it has already gone through several hype cycles, each followed by an "AI winter" of broken dreams. Clearly large DL models are a breakthrough, but the amount of hype and hot air is entirely out of proportion to the actual results.
I have actually made (what I think to be) a working summarizer at $dayjob, and it took a lot more hand-holding to get results than I initially expected. A straight summary wasn't very good; the "summary of summaries" approach as implemented by LangChain was garbage and didn't produce a summary at all, even a wrong one. The algorithm that actually worked:
1. Take the documents, chunk them up on paragraph then sentence then word boundaries using spaCy.
2. Generate embeddings for each chunk and cluster them using the silhouette score to estimate the number of clusters.
3. Take the top 3 documents closest to the centroid of each cluster, expand the context before and after so it's 9 chunks in 3 groups.
4. For each cluster ask the LLM to extract the key points as direct quotes from the document.
5. Take those quotes and match them up to the real document to make sure it didn't just make stuff up.
6. Then put all the quotes together and ask the LLM not to summarize, but to write the information presented in paragraph form.
7. Then, because LLMs just can't seem to shut up about their answer, make them return JSON {"summary": "", "commentary": ""} and discard the commentary.
The LLM performs much better (to human reviewers) at keyphrase extraction than TextRank, so I think there's genuinely some value there, and obviously nothing else can really compose English like these models, but I think we perhaps expect too much out of the "raw" model.
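A rough sketch of steps 2 and 3 above (embed the chunks, choose a cluster count by silhouette score, keep the chunks nearest each centroid), assuming sentence-transformers and scikit-learn; the model name and k range are my assumptions, not the commenter's exact setup:

```
# Sketch of steps 2-3: embed the chunks, pick a cluster count via silhouette
# score, then keep the 3 chunks nearest each centroid. Embedding model and
# k range are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_and_pick(chunks, max_k=10, per_cluster=3):
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(chunks)

    best = None  # (score, labels, centers)
    for k in range(2, min(max_k, len(chunks) - 1) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
        score = silhouette_score(embeddings, km.labels_)
        if best is None or score > best[0]:
            best = (score, km.labels_, km.cluster_centers_)

    _, labels, centers = best
    picks = {}
    for c in range(centers.shape[0]):
        idx = np.where(labels == c)[0]
        dists = np.linalg.norm(embeddings[idx] - centers[c], axis=1)
        picks[c] = [chunks[i] for i in idx[np.argsort(dists)[:per_cluster]]]
    return picks
```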
It's cool to hear about something substantial that isn't just an API call to the "God" machine. This sounds like a pretty sophisticated and well-considered approach, but it does call into question whether a system that needs to be used as a subsystem in a larger one to be reliable is worth the amount of money being invested in making them right now.
Part of the problem IMO might be that OP is relying on OAI and Google's systems to dump data from websites and pdfs into the context and hoping it's correct and formatted properly. It probably is, but the odd random skips could also be explained by copy paste fails. Would be a better comparison if it was just provided in plain text.
It's also all long-context models getting pages of data, which even for these flagship ones is certainly just RoPE or similar, which is a cheap hack and isn't super accurate [0]. 4o is the best and still shows haystack benchmark accuracies below 80%, and Gemini is just completely blind. That certainly needs fixing up to 100% before we can say for sure that nothing will ever get skipped.
I'm not even sure most people on HN feel that way, although there's plenty of criticism here. Most of the time I use ChatGPT I become genuinely upset within a few minutes because it's so brazenly awful and regularly ignores what I ask it. I see people on here promoting its ability to help them with writing code. I wonder if these are people that just don't know what good code looks like.
I recognize the hype for what it is, but I find the people calling for "sanity" are overrepresented everywhere; and it seems to me they discount the actual utility of AI way too much. In other words, I think the AI-skepticism is overhyped, just as much as AI itself.
For my uses, I find AI to be a much better search/answer engine than Google ever was. It can produce answers with hyperlinks for further reading much better and more efficiently than any other option. I no longer have to read through a bunch of seemingly random Google search results, hoping that my specific question is addressed.
I think it's a matter of personal preference. I much prefer going through a series of search results than reading anything a chatbot has to say about any topic.
As it happens, I've just been using it this weekend to write code for me. Two things:
(1) it's not business critical code, it's a side project that I want to get done but otherwise wouldn't have energy for. Especially not in this heatwave in a century old building that has no air-con.
(2) my experiments are a weird mix of ChatGPT wildly messing things up and it managing to get basically everything done to an acceptable result (not quality of code, quality of output). Sometimes I have the same experience as you, that it's just aggravating in its non-comprehension, sometimes it's magical.
I don't know if it can be magical more often if I was "better at prompting". But I do know that I also get aggravated (less often) by other humans not understanding me, and those are much harder to roll back to a previous point in the conversation, edit the prompt, and have them try again :P
As for code quality… well, sure. Stuff I'm asking ChatGPT for is python and JavaScript, and I'm an iOS dev. I can't tell when it's doing something non-idiomatic, or using an obsolete library or archaic pattern in those languages.
It helps me to code (or has helped me with mathematical concepts that I find difficult to implement or that would take me a lot of time), but then again I use it mainly for generating graphics-related code in Blender or as part of a Grasshopper script, so the code doesn't matter as much as the correct output does.
That being said, I often end up in a situation where it's not the code that is bad but the solution; where I need to tell the LLM that it's not logical to do it a certain way, leveraging my own knowledge.
Nonetheless, try looking more left and right. There are really good open-source solutions which can surf the web for you, like stormai, or you can use anythingllm and give it all your local files.
You can write your tips and tricks, start commands, upgrade procedures etc. in Markdown and reference it through your local LLM.
I like it for coding, specifically for languages or things I write seldom (I'm not coding every day, but I did for 15 years).
Nonetheless, Google's internal code review tool is already suggesting changes that are getting accepted more than 50% of the time. That's a lot, and it will only get better. GitHub with Copilot will also just get better every day. They (like the whole industry) probably struggled with actually getting used to having ML stuff in our ecosystem. It's still relatively new.
AI is basically ML at this point. And it has already done A LOT.
Whisper, a great jump in quality for speech-to-text; Segment Anything; AlphaFold 2; all the research papers Nvidia publishes regarding character movement; AI ray tracing; NeRFs; all the medical research regarding radiological imaging; advances in fusion reactors...
We have never been so close to a basic AI/AGI / modern robots. We have InstructGPT, which allows for understanding 'steps' more easily and more stably than anything we developed before, in multiple languages.
ChatGPT and LLM advances are great and helpful.
Image generation is already popping up in normal life.
AI is not wildly overhyped at this point. We are in the middle of implementation after the first LLM breakthrough, and a LOT more money is funneled into AI/ML research now than 10 years ago.
The future is, at least for now, really interesting and there has not been any sign of a wall we are hitting.
Even the missing GPT-5 might feel like a slight wall, but we just got GPT-4o mini, which makes all of the LLM greatness a LOT cheaper and a lot easier to use.
We switched from text parsing and, on average, bad results to just using Llama 3 (with a little bit of safeguarding), and it's a lot better.
But I have already seen an AI image out on the street; unfortunately I was on public transport and not able to take a picture fast enough when I saw it, and I currently work from home most of the time.
“Most people” (and still talking tech crowd) are still figuring out what these very early version of AI can do, what they’re good at, what they’re not good at, etc.
Most non-tech people think AI is no different to “algorithms” and is just another IT buzzword that means “computer people click a few buttons and it does all the work for them”.
Tech is one of the best places to be a rampant liar because the money is self-reinforcing and everyone who might call you out... is also getting paid by the same lie or liars!
It's both, the level of criticism is also overwhelming and delusional.
I hear things like "AI will never be as smart as me!", "It will never take over my job", "Look, it got this and this wrong", "No chance it can ever be as smart as me!" It's always a comparison to their own abilities, so I think a lot of it is just an attempt to stay relevant in a world where technology is about to replace all of us.
I think by now everyone knows the limitations of current LLMs. The result of this "in depth" analysis is not only expected but very obvious. Nobody is surprised and none of this is suppressed because it's so obvious.
What's delusional is the amount of criticism around the performance. The denial that AI is more and more matching human intelligence. Of course it's not there yet, but LLMs made a giant leap and bridged a huge gap.
AI is no match for humanity now, that much is obvious. I think the delusion lies in the fact that many people are trying to deny the trendline... the obvious future and trajectory of what progress has been pointing to. Milestones are getting surpassed at a frightening pace. AI is now running circles around the Turing test, and I can now ride in a car with no driver and it's normal in SF.
We are here bitching about the fact that AI shortens a text rather than summarizing it, without remarking on the fact that you can actually bitch to the AI directly about this and demand that it stop shortening the text and start summarizing it.
The problem with this take is that you're conflating different software systems under the same vague AI umbrella. Waymo is a completely different technology than generative AI. Talking about how over a decade and billions of dollars in R&D in self-driving is finally bearing fruit has nothing to do with the performance of LLMs.
It is true that LLMs appear to be an exciting technology, but it's also delusional to assume they're following a positive trendline. Performance between GPT-3.5 and GPT-4 was like a 25% improvement that took 2500% more resources to train. It's clear that there's diminishing returns to bigger and bigger models, and the trend we've been seeing in the industry is actually smaller models trained for longer periods in an effort to bring down inference costs while maintaining current performance.
More intelligent models may require new techniques and technologies that we don't have yet. I'm sure it will get better, but the path to improvement isn't as obvious as you're making it sound. Making comparisons to Moore's law is also disingenuous because we're actually running into physical limits on how dense we can make chips due to the size of atoms themselves, so past trends for technological development may not continue to bear out.
Obviously the overall trend I'm referring to is AI from a more general perspective. Deep learning.
You are doing exactly as I said. Focusing on obvious criticisms on the current state of the art. We know the obvious pitfalls of LLMs. It's completely obvious nowadays.
I also never pointed to an obvious path forward. I pointed to an obvious trend that indicates that, whether you like it or not, we will move forward.
>It is true that LLMs appear to be an exciting technology, but it's also delusional to assume they're following a positive trendline. Performance between GPT-3.5 and GPT-4 was like a 25% improvement that took 2500% more resources to train. It's clear that there's diminishing returns to bigger and bigger models, and the trend we've been seeing in the industry is actually smaller models trained for longer periods in an effort to bring down inference costs while maintaining current performance.
AI is following a trendline. I never said specifically that LLMs are exactly on this trendline. LLMs are only a part of this trendline, along with other technologies, and part of the overall progress deep learning is making. It is extremely likely that there will be an AI that solves all of the current issues with LLMs in the coming decades. Whether that AI is some version of an LLM remains to be seen.
>Making comparisons to Moore's law is also disingenuous because we're actually running into physical limits on how dense we can make chips due to the size of atoms themselves, so past trends for technological development may not continue to bear out.
I never made a comparison to Moore's law... are you replying to me or someone else? AI with the performance of a human brain is highly realizable despite physical limits, because intelligence at the level of the human brain already EXISTS. The existence of humans themselves is testament to the possibility that it can be done and is not a fundamental limit.
NFTs were probably much tamer. There were no trillion-dollar companies renaming themselves after the metaverse, no Gartner predictions of 30 trillion by 2030, no people spending an hour every day in the metaverse, all with no actual definition of or care for what the metaverse actually was. NFTs at least had a clear definition. WTF is the metaverse?
How the fuck did Zuckerberg and Nadella get away with the metaverse farce? I feel so cynical about tech and the world in general.
Perhaps everywhere else. On HN, mostly what I see are these fairly shallow dismissals, TBH. It is a natural reaction when your own livelihood is affected, as old as tech itself to be sure [0]. Still, it's getting tiresome. A new technological revolution is unfolding, and the people best positioned to lead it and keep it in check are largely balking.
The article doesn't specify which GPT model was used. Re-running the experiment on the EU regulation paper using gpt-4-1106 (the current-best "intelligent" one):
"4. IORP Directive: The IORP (Institutions for Occupational Retirement Provision) Directive is analyzed, highlighting its scope and its impact on pension funds across the EU. The paper suggests that the directive's complex regulations create inconsistencies and may need clarification or adjustment to better align with national policies."
"5. Regulatory Framework and Proposals: A significant portion of the paper is devoted to discussing potential reforms to the regulatory framework governing pensions in the EU. It proposes a dual approach: a "soft law" code for non-economic pension services and a "hard law" legislative framework for economic activities. This proposal aims to clarify and streamline EU and national regulations on pensions."
^^ these correspond to the author's self-selected two main points.
Somewhat disagree with the point being made here. The fundamental assumption for humans is that when summarizing we will pay more attention to important bits and only give a passing mention to others if needed. For any model, the context is the only universe where it can assign importance, based on previous learning (instruction-tuning examples) and the prompt. For many, shortening the text is equivalent to summarizing (when the text is not as long as a fifty-page paper). The output depends on the instruction-tuning dataset, and it seems that unless a model is trained on longer documents, it would not produce those kinds of expected outputs. In a chain-of-thought reasoning scenario, it probably will. With Gemini, they definitely tested long context and tuned the outputs for it to work well, as it was their value prop - shown at I/O no less.
I have been working on summarizing new papers using Gemini for the same purpose. I don't ask for a summary though; I ask for the story the paper is trying to tell (with different sections) and get great output. Not sharing the links here, because it would be self-promotion.
I've noticed that when using a language model to rephrase text, it also sometimes seems to miss important details, because it clearly has no real understanding of the text.
It's not a problem when you are aware of it, and with some follow-up input you can get it mitigated, but I often see that people tend to take the first output of these systems at face value. People should be a bit more critical in that regard.
I really don't wanna be nitpicky, but what do you mean by 'no real understanding of the text'?
How do you benchmark something or someone understanding text?
I'm asking because the magic of LLMs is the meta level, which basically creates a mathematical representation of meaning, and most of the time, when I write with an LLM, it feels very understanding to me.
Missing details is shitty and annoying, but I have talked to humans and plenty of them do the same thing, only worse.
If you ask it basic mathematical questions, it becomes quickly clear that the ‘understanding’ it seems to possess is a mirage. The illusion is shattered after a few prompts. To use your comparison with humans: if any human said such naive and utterly wrong things we’d assume they were either heavily intoxicated or just had no understanding of what they’re talking about and are simply bluffing.
I guess at best you can say these models have an ‘understanding’ of language, but their ability to waffle endlessly and eruditely about any well-known topic you can throw at it is just further evidence of this — not that it understands the content.
So do we officially accept that certain voting people, irrational people, are constantly heavily intoxicated or have no understanding?
Interestingly enough, thanks to a talk from a brain researcher, I understood that there are two major modes our brains run in:
either the "I just observe and do what I've observed" mode (where it doesn't matter that you are gay or black but you still vote for Trump), or the logical mind, where you see a conflict in stuff like this.
Sure. Something I tried the other day was asking questions about modular arithmetic — specifically, phrased in terms of quotient groups of the integers. Things like ‘how many homomorphisms are there between Z/12Z and Z/6Z?’. I was able to trip it up very easily with these sorts of questions, especially when it tries to ‘explain’ its answers and it says ridiculous (but superficially and momentarily plausible-looking) things like ‘the only solutions to the equation 12x = 0 (mod 12) in Z/12Z are {[0], [3], [6], [9]}, therefore…’.
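For reference, the number of group homomorphisms from Z/nZ to Z/mZ is gcd(n, m), so the correct answer here is gcd(12, 6) = 6. A quick brute-force check (my own sketch, not anything ChatGPT produced):

```
# Sanity check: the number of group homomorphisms Z/nZ -> Z/mZ equals gcd(n, m).
# A homomorphism is determined by where it sends 1; sending 1 to a is well
# defined exactly when n*a ≡ 0 (mod m).
from math import gcd

def hom_count(n, m):
    return sum(1 for a in range(m) if (n * a) % m == 0)

assert hom_count(12, 6) == gcd(12, 6) == 6
print(hom_count(12, 6))  # 6
```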
You can also just quiz it on certain basic definitions. Ask it for examples of objects that don’t exist (graphs or categories with certain properties, etc.). Sometimes it’ll be adamant that its stated example works, but usually it will quickly apologise and admit to being wrong only to give almost exactly the same (broken) argument again.
Another thing you can try is concocting some question that isn’t even syntactically well-formed (i.e. fails even a type check) like ‘is it true that cyclic integer lattices are uniformly bounded below in the Riemann topology?’. I imagine that one is too far out to work, but when I’ve played around I’ve found many such absurd questions ChatGPT was only too happy to answer — with utter nonsense, of course. It’s interesting (and, I think, quite telling) that such systems are seemingly almost completely unable to decline to answer a question. And the reason is that there’s no difference between hallucination and non-hallucination. Internally, it’s exactly the same process. It either knows or doesn’t know — but it doesn’t know that it knows (or doesn’t).
LLMs basically only work on questions that are very similar to, or identical to, questions that have already been widely asked and answered online or in books… hence their lack of utility in mathematical research, or even in calculating one’s taxes, or whatever.
I could provide some more literal examples, but I’d have to go and try some and pick the ones that work, and even then they might not work on your end because of the pseudorandomness and the fact that the model keeps getting updated and patched. It’s better to just play around on your own based on the ideas I’ve given.
The moral is to use LLMs as a powerful way of finding information, but don’t trust anything it says. Use it to find better sources more quickly than you’d be able to via a search engine.
Well sure but ... I think the foundational problem here is just the "being unable to refuse to answer a request." The rest of the behavior you describe just follows from it.
If you, for instance, threatened to shoot a human if they refused a request or admitted they didn't know something, they might answer in a very similar fashion.
Something I find frustrating about summarization is that while it's one of the most common uses of LLMs I've actually found very little useful material investigating ways of implementing it.
Is a system prompt "provide a summary of this text" the best possible prompt? Do different models respond differently to prompts like that? At what point should you attempt more advanced tricks, like having one prompt extract key ideas and a second prompt summarize those?
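For what it's worth, a minimal sketch of that two-pass idea (extract key ideas, then summarize them) using the OpenAI Python client; the model name and prompt wording are placeholders, not a tested recipe:

```
# Two-pass sketch: extract key points first, then summarize only those points.
# Model name and prompt wording are placeholders, not a tested recipe.
from openai import OpenAI

client = OpenAI()

def two_pass_summary(text, model="gpt-4o"):
    points = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "List the key claims and conclusions of the text, one per line."},
            {"role": "user", "content": text},
        ],
    ).choices[0].message.content

    return client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Write a short summary that covers every listed point without adding new information."},
            {"role": "user", "content": points},
        ],
    ).choices[0].message.content
```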
Most LLMs behave very similarly IMO; the biggest difference is the formatting. Although in my language-learning app I found Gemini Flash a bit better at explaining things. "Provide a summary of this text" will give you very generic responses, usually in bullet points. It might be good enough, but I recently started summarizing comment sections, and you need a bit more prompting to get the individual arguments raised and niche points.
The author could have gotten their point across better if they had said that LLMs aren't good at focusing on the things they deem important. LLMs absolutely understand things; otherwise it'd be impossible for them to work at all. But like people, they try to make a summary fit their preconceived biases (e.g. regulation good). You know how when you try to talk to someone about a subject they're unfamiliar with, it goes in one ear and out the other? That's how LLMs are when you ask them to pick out the divergent ideas from documents.
Products like ChatGPT are rewarded in their fine tuning for doing happy sounding cheerleading of whatever bland unsophisticated corpo docs you throw at it. Consumer products like this simply aren't designed for novelty, although there's plenty of AIs that are. For example, AlphaFold is something that's designed to search through an information space and discover novel stuff that's good.
ChatGPT is something that's designed to ingratiate itself with emotional individuals using a flawed language that precludes rational thinking. That's the problem with the English language. It's the lowest common denominator. Any specialized field like programming, the natural sciences, etc. that wants to make genuine progress has always done so historically by inventing a new language, e.g. jargon, programming languages.
The only time normal language is able to communicate divergent ideas is when the speaker has high social status. When someone who doesn't have high social status communicates something novel, we call it crazy. LLMs, being robots, have very low social status. That's why they're trained to act the way they do.
I am subscribed to Glancias, which is an AI summarised daily news email service of sorts. Since news is supposed to be a high-risk area where you don’t want hallucinations, I am sure they’ve fine-tuned their setup to some degree.
However, it still managed to pick up several clickbait headlines about NASA’s asteroid wargame and write a scare news summary:
> The participants — nearly 100 people from various U.S. federal agencies and international institutions — considered the following hypothetical scenario: Scientists just discovered a relatively large asteroid that appears to be on an Earth-impacting trajectory. There's a 72% chance it will hit our planet on July 12, 2038, along a lengthy corridor that includes major cities such as Dallas, Memphis, Madrid and Algiers.
Glancias Summary:
> NASA has identified a potential asteroid threat to Earth in 2038, revealing gaps in global preparedness despite technological advancements in asteroid trajectory redirection and the upcoming launch of the NEO Surveyor space telescope.
One thing I've used GPT and Gemini for is to summarize HN threads. They do OK finding top-level points, but within the conversation (thread) there are (generally) some one-off points I find key, and neither AI can identify these "key topics at the leaf" (I don't know what else to call them).
What prompt am I missing? I can't get it to highlight the edge-case details and other similar "what's only mentioned once" material.
It takes a bit of trial and error to get comment sections summarized well. This is what I use: Give a report on the key insights from the following comment section. Do not include a title and do not discuss the name of the post or the name of the site, we already know that. Your report should be structured using paragraphs and it should discuss the points covered. In addition to the main points be sure to include insightful niche points that may not be as popular or ignored by other commenters. For each point discuss any notable arguments, counter arguments, or anecdotes.\n\nComment section:\n\n$commentSection
It's still a bit experimental. Hacker News comment sections are very large (100k+ characters), so it probably won't find everything. It also adds a summary section at the end, which is annoying but not too bad. And it will literally say 'these are niche points'. I find it a bit funny, but there's no reason to fix it. I'm using Gemini Flash.
I think this is perhaps unanswerable without additional context, such as upvote/downvote/flag history, as there's an implicit assumption in your request that you be presented with a summary of points relevant to you and without that added context, the AI has no way of knowing your expectations regarding relevance.
I wouldn't call that an evaluation. They are expressing subjective opinions and feelings, text summarization is an active area of research, there are many benchmark datasets and evaluation measures that make progress quantifiable. Which makes rants like this seem rather pointless and uninformed.
My RSS reader automatically generates AI summaries for Hacker News posts and it works pretty well. Sometimes when I comment on a post I have to double check the summary is correct and it always does a really good job. I even had it generate comment summaries. It needs a bit of prompting to highlight individual arguments but it also does well here.
I am very skeptical of the author's claims. Perhaps the parts of the articles being summarized are not actually important so the LLMs did not include them. Or perhaps the article does an exceptionally bad job of explaining why the argument is important. Also there's a difference between the API and free web interface. I think the web version has a bunch more system prompting to be helpful which may make a summary harder to do.
A big reason for 'content drift' is that LLMs are like a sliding context window over the input text plus the prompt, and for each new token generated, the next-token prediction uses the previously LLM-generated tokens as well.
Giving a LLM too much context causes the same effect, as the sliding window moves on from the earliest tokens.
It's also why summarization is bad.
It's not exactly linear from the start to the end of the text, though; bits of context get lost from throughout the input text at random and will be different each time the same input is run.
A good way to mitigate this is to break up the text and execute it in smaller chunks; even with models boasting large context, results drop off significantly with large inputs, so using several smaller prompts is better.
> did it add the content of the web site to the prompt, or was the web site part of the training?
Likely it added the content to the prompt, but the content didn't stay in the prompt for the next prompt. The next prompt likely only had general web results as context.
I had a similar experience with ChatGPT and larger documents. Even basic RAG tasks don't work well (RAG = retrieval-augmented generation). The most basic LangChain RAG examples perform much better. The usual approach is to split up the document into pages and then smaller text fragments (a few hundred characters). Only those smaller fragments are then processed by the LLM.
In this case I would take a similar approach: split the document into multiple smaller (and overlapping) fragments, let an LLM summarize each one of those into key findings, and in a next step merge those key findings into a summary.
I don't have a lot of experience with whether this would actually provide better results, though.
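A rough sketch of that split-summarize-merge idea, with `llm(prompt)` standing in for whatever completion call you use (chunk sizes are arbitrary):

```
# Split-summarize-merge sketch: overlapping fragments, per-fragment key
# findings, then a final merge. `llm(prompt)` stands in for whatever
# completion call you use; chunk sizes are arbitrary.
def chunked_summary(text, llm, chunk_size=2000, overlap=200):
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]

    findings = [
        llm("List the key findings in this fragment:\n\n" + chunk)
        for chunk in chunks
    ]
    return llm(
        "Merge these key findings into one coherent summary, "
        "keeping every distinct point:\n\n" + "\n\n".join(findings)
    )
```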
> The article discusses the author's experience and observations regarding the use of language model-based tools like ChatGPT for summarizing texts, specifically highlighting their limitations. The author initially believed summarizing was a viable application for such models but found through practical application that these tools often fail to produce accurate summaries. Instead of summarizing, they tend to merely shorten texts, sometimes omitting crucial details or inaccurately representing the original content. This is attributed to the way language models prioritize information, often influenced more by the vast amount of data they've been trained on rather than the specific content they are summarizing. The author concludes that these tools are less reliable for producing meaningful and accurate summaries, especially when the text is complex or detailed. The experimentation with summaries on different subjects further demonstrated that these models often produce generalized content that lacks specific, actionable insights from the original texts.
I've had the same negative experiences with summarizing news articles with ChatGPT 4 (4o and even the previous 4 model). These LLM makers need to focus more on keeping the context length lower [1], at maybe around 4-12K tokens, and instead get their systems to have more general intelligence.
[1] It's annoying to see Google initially market their Gemini models about their 100K to 1M tokens context size, and even OpenAI has been doing a lot of their model making and marketing around it too recently.
I have some questions for any willing to consider, though know in advance I am quite ignorant on the general subject.
I've been having a surprisingly good time in my 'discussions' with the free online chatgpt, which has a cutoff date of 2022. What really impresses me is the results of persistence on my part when the replies are too canned, which can be astonishing.
In one discussion, I asked it to generate random sequences of 4 digits, 0000-9999, until a specific given number occurred. It would, as if pleased with its work, give the number within 10 tries. I suppose this is due to computational limitations that I don't understand. However, when, with great effort, I criticized its method and results enough, I got it to double its efforts before it lazily 'found' an occurrence of the given number. It claimed it was doing what I asked. It surely wasn't. But it seemed oblivious. I'm interested to understand this.
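For context on why "within 10 tries" is implausible: hitting a specific 4-digit value with a truly uniform generator has probability 1/10,000 per draw, so the expected number of draws is 10,000 and the chance of success within 10 draws is about 0.1%. A quick simulation sketch (my own, with an arbitrary target):

```
# Why "found within 10 tries" is implausible for a truly uniform generator:
# a specific 4-digit target has probability 1/10000 per draw, so the expected
# number of draws is 10000 and P(hit within 10 draws) is about 0.1%.
import random

def average_draws(target=4271, trials=200):  # target value is arbitrary
    total = 0
    for _ in range(trials):
        draws = 0
        while True:
            draws += 1
            if random.randrange(10000) == target:
                break
        total += draws
    return total / trials

print(average_draws())  # typically on the order of 10,000, not 10
```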
I'm sure I'll get some contempt for my ignorance here, but I asked it to analyze pi to some unremembered place until it found a Fibonacci sequence. It couldn't. Maybe one doesn't exist. As obvious as this might be to smarter primates here, I don't understand. I was mostly entertaining myself with various off-the-hat things.
What I did realize is what, by my standards, is fierce potential. This has me wanting to, if even possible, acquire my own version with, perhaps, the possibility of real-time/internet interaction.
Is this possible without advanced coding ability? Is it possible at all? What would be a starting point and some helpful pointers along the way.
Anyway, it reminded me of my youth, when I had access to a special person or two and would make them dizzy with my torrential questions. Kind of a magic pocket Randall Munroe, with spontaneous schizophrenia. Fun.
Edit note: those were but a couple of examples of a lot more that I cannot remember. I'm hooked now, though, and need to come out of my cave for this and learn more. I have some obsolete Python experience, if that might be relevant.
Are there any objective benchmarks for rating and comparing the summarization performance of LLMs? I've had mixed results when using the latest versions of ChatGPT, Claude, and Gemini to summarize texts that I know well (books I have read, papers I wrote, transcripts of online meetings I attended). Often their summaries are as good as or maybe even better than what I could prepare myself, but sometimes they omit key points or entire sections. Other than failures clearly due to context length limitations, it's hard to judge which LLM is a better summarizer overall or whether a new LLM version is better than the previous one.
It's even worse than the author writes. As the 'parameter' side of the equation gets trained on more and more scraped AI-spam garbage, summarising will actually get even worse than it already is.
That's a shame, because it'd be one of the more useful things that LLMs might have been used for, and I had - basically on faith - assumed that it was providing genuine analytical summaries ...
Of course, it's obvious in hindsight that creating a useful summary requires reasoning over the content, and given that reasoning is one of the major weaknesses of LLMs, it should have been obvious that their efforts to summarize would be surface-level "shortening" rather than something deeper that grokked the full text and summarized the key points.
People use 3.5 for testing the capabilities of LLMs, and then conclude that LLMs are inherently bad at that task, when in reality there are better models.
Appreciate this deep dive and the important conclusion - "summarize" actually means "shorten" in ChatGPT. We've been summarizing call transcriptions and have seen all of these same challenges.
I have worked with these kinds of documents, and one of the main problems is that they are so vague, generic, and repetitive that not even humans can comprehend and summarize them correctly, because they are produced as bureaucracy artefacts to present to management to get funds to make more papers.
> When you ask ChatGPT to summarise this text, it instead shortens the text.
Wasn't that kind of the previous classical AI method of doing summaries? Something something, rank sentences by the number of nouns that appear in other sentences to get the ones with the most information density and output the top N?
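Roughly, yes: the classical extractive approach (TextRank and friends) ranks sentences by how central they are in a sentence-similarity graph and keeps the top N. A toy sketch, assuming scikit-learn and networkx, with naive sentence splitting:

```
# Toy TextRank-style extractive summary: rank sentences by centrality in a
# similarity graph and keep the top N. Sentence splitting here is naive.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extractive_summary(text, n=3):
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(tfidf)
    scores = nx.pagerank(nx.from_numpy_array(similarity))
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:n])
    return ". ".join(sentences[i] for i in top)
```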
The real answer is a bit later in the article I think. It produces summary only about what was in the training data. Works for someone's vacation blog, but anything novel can get lost easily.
The author’s assertion that models or systems can ignore important novel points when producing summaries/reductions makes complete sense, as might indeed be expected of averagers of patterns. In any case it seems testable.
My experience was that they're entirely useless for cases where you can't find examples of that kind of code on the web. They just hallucinate APIs that don't exist to solve your problem. I would suspect that GLSL is rare enough, and different enough from other programming languages to pose problems for the LLMs here.
I’m not sure that’s useful - most of HN would fail that too :-).
Or rather, as I honestly don’t know what a GLSL shader is, and barely know what a canvas is, it would be a bit like the scene from Blackadder: I would love for Baldrick to read this book, but that will mean teaching him to read, which will take about ten years.
I am not a fanboy of LLMs or genAI, but how is this a great litmus test for their usefulness? How many humans on earth could do that today? Ten thousand at the most?
I don’t need an “AI” to help me with something widely discussed; I can simply read the docs. On the other hand, I’d love a tool that reliably opens up niche topics to me.
Is that really true? Does it really benefit you much to read essentially unreliable abstracts of 10 papers instead of trying to go through one in detail? Why not just go with the original abstract at that point, or just skim the paper yourself?
Papers are a great example actually, because they're often written in intentionally obtuse prose with a lot of jargon that slows your ability to skim if you're not working directly in that field. This is one contributing factor to why research is so siloed. There's utility in summarizing away that obfuscation and getting to the point.
Aye, fair enough. Assuming summarization/shortening or whatever people call this stays the same level of quality, is there an expectation of more interdisciplinary research? I'm unsure about that.
It takes years for new technologies to propagate and impact industries in practice. Most of the enabling tech here is months old, and there's still tons of unaddressed markets and use cases.
I have no idea if there's an expectation of interdisciplinary research but LLMs are well suited to breaking down barriers between different levels of understanding, so it follows. I'm part of a patient research group and we use it all the time to digest academic papers, test hypotheses, and bring forward more informed questions to researchers.
I gave the same article to Claude 3.5 Sonnet and the result seems reasonably similar to the author's handwritten summary.
```
This article examines the governance of Dutch pension funds in light of the Future of Pensions Act (Wtp). The new legislation shifts towards more complete pension contracts and emphasizes operational execution, necessitating changes in pension fund governance. The authors propose strengthening pension funds' internal organization, improving accountability to participants, and enhancing the powers of participant representation bodies.
Key recommendations include establishing a recognizable governance structure with clear responsibilities, creating a College of Stakeholders (CvB) to replace existing accountability bodies, and granting the CvB more authority, including appointment and dismissal powers. The proposals aim to balance the interests of social partners, pension funds, and participants while ensuring transparency and effective oversight.
The article emphasizes principles such as transparency, trust, loyalty, and prudence in shaping governance reforms. It also discusses the impact of digitalization (DORA), the need for pension funds to demonstrate value, and the potential for further consolidation in the sector. International perspectives, including insights from the World Bank, inform the proposed governance improvements.
These changes are designed to help pension funds adapt to the new system, manage risks effectively, and maintain their "license to operate" in a changing landscape.
```
Similarly, the second article's summary also captures the key points that the author points out (emphasis mine).
```
The article "Regulating pensions: Why the European Union matters" explores the growing influence of EU law on pension regulation. While Member States retain primary responsibility for pension provision, the authors argue that EU law significantly impacts national pension systems through both direct and indirect means.
The paper begins by examining the EU's institutional framework regarding pensions, focusing on the principles of subsidiarity and the division of powers between the EU and Member States. It emphasizes that the EU can regulate pension matters when the Internal Market's functioning is at stake, despite lacking specific regulatory competencies for pensions. The authors note that the subsidiarity principle has not proven to be an obstacle for EU action in this area.
The article then delves into EU substantive law and its impact on pensions, concentrating on the concept of Services of General Economic Interest (SGEI) and its role in classifying pension fund activities as economic or non-economic. The authors discuss the case law of the Court of Justice of the European Union (CJEU), highlighting its importance in determining when pension schemes fall within the scope of EU competition law. They emphasize that the CJEU's approach is based on the degree of solidarity in the scheme and the extent of state control.
**
The paper examines the IORP Directive, outlining its current scope and limitations. The authors argue that the directive is unclear and leads to distortions in the internal market, particularly regarding the treatment of pay-as-you-go schemes and book reserves. They propose a new regulatory framework that distinguishes between economic and non-economic pension activities.
For non-economic activities, the authors suggest a soft law approach using a non-binding code or communication from the European Commission. This would outline the basic features of pension schemes based on solidarity and the conditions for exemption from EU competition rules. For economic activities, they propose a hard law approach following the Lamfalussy technique, which would provide detailed regulations similar to the Solvency II regime but tailored to the specifics of IORPs (Institutions for Occupational Retirement Provision).
**
The authors conclude that it's impossible to categorically state whether pensions are a national or EU competence, as decisions must be made on a case-by-case basis. They emphasize the importance of considering EU law when drafting national pension legislation and highlight the need for clarity in the division of powers between the EU and Member States regarding pensions.
Overall, the paper underscores the complex interplay between EU law and national pension systems, calling for a more nuanced understanding of the EU's role in pension regulation and a clearer regulatory framework that respects both EU and national competencies.
```
I'd bet that the author used GPT 3.5-turbo (aka the free version of ChatGPT) and did not give any particular prompting help. To create these, I asked Claude to create a prompt for summarization with chain of thought revision, used that prompt, and returned the result. Better models with a little bit more inference time compute go a long way.
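If anyone wants to try the same thing, the workflow is roughly the two-step call sketched below using the Anthropic Python SDK. The model string, prompt wording, and file path are placeholders rather than exactly what I ran, so treat it as a shape to adapt, not a recipe:
```
# Step 1: have the model write a summarization prompt that forces a
# draft -> critique -> revise loop. Step 2: run that prompt on the article.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-20240620"  # placeholder model string

meta = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Write a prompt for summarizing an academic article. "
                   "The prompt should instruct the model to draft a summary, "
                   "critique the draft against the source for omissions and errors, "
                   "and then output a revised final summary.",
    }],
)
summarization_prompt = meta.content[0].text

article_text = open("article.txt").read()  # placeholder path
result = client.messages.create(
    model=MODEL,
    max_tokens=2048,
    messages=[{"role": "user",
               "content": f"{summarization_prompt}\n\n{article_text}"}],
)
print(result.content[0].text)
```
The critique-and-revise step is what seems to matter: it gives the model a chance to catch the "top N sentences" failure mode before committing to a final summary.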
"The author critiques ChatGPT's ability to summarize accurately, arguing that it merely shortens text without true understanding, resulting in incomplete and sometimes misleading summaries, as demonstrated by a comparison between their own summary of a complex pension fund governance paper and the flawed version produced by ChatGPT." (GPT-4o)