
Yeah, Claude consistently impresses me.

A commenter on another thread already mentioned this, but it’s very similar to how search felt in the early 2000s. I ask it a question and get my answer.

Sometimes it’s a little (or a lot) wrong or outdated, but at least I get something to tinker with.




I recently tried to ask these tools for help with using a popular library, and both GPT-4o and Claude 3.5 Sonnet gave highly misleading and unusable suggestions. They consistently hallucinated APIs that didn't exist, and would repeat the same wrong answers, ignoring my previous instructions. I spent upwards of 30 minutes repeating "now I get this error" to try to coax them in the right direction, but always ended up in a loop that got me nowhere. Some of the errors were really basic too, like referencing a variable that was never declared, etc. Finally, Claude made a tangential suggestion that made me look into a different approach, but it was still faster to read the official documentation than to keep asking it questions. GPT-4o was noticeably worse, and I quickly abandoned it.

If this is the state of the art of coding LLMs, I really don't see why I should waste my time evaluating their confident-sounding, but wrong, answers. It doesn't seem like much has improved in the past year or so, and at this point this seems like an inherent limitation of the architecture.


FWIW I almost never ask it to write code for me. I did once to write a matplotlib script and it gave me a similar headache.

I ask it questions mostly about libraries I’m using (usually that have poor documentation) and how to integrate it with other libraries.

I found out about Yjs by asking about different operational transform patterns.

Got some context on the prosemirror plugin by pasting the entire provider class into Claude and asking questions.

It wasn’t always exactly correct, but it was correct enough that it made the process of learning prosemirror, yjs, and how they interact pretty nice.

The “complete” examples it kept spitting out were totally wrong, but the information it gave me was not.
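
If it helps anyone else, the wiring it pointed me toward is basically the standard y-prosemirror setup. A minimal sketch from memory, so treat the details as approximate; the websocket URL and room name are placeholders:

    import * as Y from "yjs";
    import { WebsocketProvider } from "y-websocket";
    import { ySyncPlugin, yCursorPlugin, yUndoPlugin } from "y-prosemirror";
    import { EditorState } from "prosemirror-state";
    import { EditorView } from "prosemirror-view";
    import { schema } from "prosemirror-schema-basic";

    // One shared Y.Doc per document; the provider syncs it over a websocket.
    const ydoc = new Y.Doc();
    const provider = new WebsocketProvider("wss://example.invalid", "demo-room", ydoc);
    const fragment = ydoc.getXmlFragment("prosemirror");

    // The sync plugin maps the Y.XmlFragment onto the ProseMirror doc, the
    // cursor plugin renders remote selections, and the undo plugin keeps
    // undo/redo scoped to this client's own edits.
    const state = EditorState.create({
      schema,
      plugins: [ySyncPlugin(fragment), yCursorPlugin(provider.awareness), yUndoPlugin()],
    });

    new EditorView(document.querySelector("#editor"), { state });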


To be clear, I didn't ask it to write something complex. The prompt was "how do I do X with library Y?", with a bit more detail. The library is fairly popular and in a mainstream language.

I had a suspicion that what I was trying to do was simply not possible with that library, but since LLMs are incapable of saying "that's not possible" or "I don't know", they will rephrase your prompt and hallucinate whatever might plausibly make sense. They have no way to gauge whether what they're outputting is actually correct.

So I can imagine that you sometimes might get something useful from this, but if you want a specific answer about something, you will always have to double-check their work. In the specific case of programming, this could be improved with a simple engineering task: integrate the output with a real programming environment, and evaluate the result of actually running the code. I think there are coding assistant services that do this already, but frankly, I was expecting more from simple chat services.
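
Roughly the loop I have in mind, as a sketch (the model call is abstracted into a callback you'd supply, and you'd want to run the candidate in a sandbox rather than directly on your machine):

    import { execFileSync } from "node:child_process";
    import { writeFileSync } from "node:fs";

    // callModel stands in for whichever chat API you use; it takes a prompt
    // and returns generated Python source.
    async function generateWorkingScript(
      callModel: (prompt: string) => Promise<string>,
      task: string,
      maxAttempts = 3,
    ): Promise<string> {
      let prompt = task;
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const code = await callModel(prompt);
        writeFileSync("candidate.py", code);
        try {
          // Execute the candidate in a separate process with a timeout;
          // this is the "arbitrary code" part that needs sandboxing.
          execFileSync("python3", ["candidate.py"], { timeout: 30_000 });
          return code; // ran without an error
        } catch (err) {
          // Feed the real error back instead of a vague "that didn't work".
          prompt = `${task}\n\nYour previous attempt failed with:\n${String(err)}\nPlease fix it.`;
        }
      }
      throw new Error(`no working script after ${maxAttempts} attempts`);
    }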


> if you want a specific answer about something

Specific is the specific thing that statistical models are not good at :(

> how do I do X with library Y?

Recent research and anecdotal experience have shown that LLMs perform quite poorly with short prompts. Attention just has more data to work with when there are more tokens. Try extending that question to something like “I am using this programming language and am trying to do this task with this library. How do I do this thing with this other library.”

I realize prompt engineering like this is fuzzy and “magic,” but short prompts have consistently lower performance.

> In the specific case of programming, this could be improved with a simple engineering task: integrate the output with a real programming environment, and evaluate the result of actually running the code.

Not as simple as you’d think. You’re letting something run arbitrary code.

Tho you should give aider.chat a try if you want to test out that workflow. I found it very very slow.


> Recent research and anecdotal experience has shown that LLMs perform quite poorly with short prompts.

I'm aware of that. The actual prompt was more elaborate. I was just mentioning the gist of it here.

Besides, you would think that after 30 minutes of prompting and corrections it would arrive at the correct answer. I'm aware that subsequent output is based on the session history, but I would also expect this to be less of an issue if the human response was negative. It just seems like sloppy engineering otherwise.

> Specific is the specific thing that statistical models are not good at

Some models are good at needle-in-a-haystack problems. If the information exists, they're able to find it. What I don't need is for it to hallucinate wrong answers if the information doesn't exist.

This is a core problem of this tech, but I also expected it to improve over time.

> Tho you should give aider.chat a try

Thanks, I'll do that eventually. If it's slow, it can get faster. I'd rather the tool be slow but give correct answers than have it slow me down by wasting my time correcting its errors.

Thankfully, these approaches can work for programming tasks. There is not much that can be done to verify the output for most other subjects.


Well, it's a volume business. The <1% of highly skilled developers will find an AI helper useless, but for the 99% of IT CRUD peddlers these tools are quite sufficient. All in all, if employers can cut net development costs by 15-20% by reducing headcount, it will be very worthwhile for companies.


I suspect it will go a different direction.

Codebases are exploding in size. Feature development has slowed down.

What might have been a carefully designed 100kloc codebase in 2018 is now a 500kloc ball of mud in 2024.

Companies need many more developers to complete a decent sized feature than they needed in 2018.


It's worse than that. Now the balls of mud are distributed. We get incredibly complex interactions between services which need a lot of infrastructure to enable them, that requires more observability, which requires more infrastructure...


Yeah. You can fit a lot of business logic into a 100kloc monolith written by skilled developers.

Once you start shifting it to micro services the business logic gets spread out and duplicated.

At the same time, each microservice now has its own code to handle REST, GraphQL, and gRPC endpoints.

And each downstream call needs error handling and retry logic (sketch below).

And of course now you need distributed tracing.

And of course now your auth becomes much more complex.

And of course now each service might be called multiple times for the one request - better make them idempotent.

And each service will drift in terms of underlying libraries.

And so on.
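
The retry point alone already means something like this around every downstream call (a rough sketch; the idempotency-key header name is just illustrative):

    // Retry a downstream POST with exponential backoff, sending an
    // idempotency key so the service can deduplicate repeated attempts.
    async function callWithRetry(
      url: string,
      body: unknown,
      idempotencyKey: string,
      maxAttempts = 3,
    ): Promise<Response> {
      let lastError: unknown;
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          const res = await fetch(url, {
            method: "POST",
            headers: {
              "content-type": "application/json",
              "idempotency-key": idempotencyKey, // header name varies by service
            },
            body: JSON.stringify(body),
          });
          if (res.status < 500) return res; // success or a client error: don't retry
          lastError = new Error(`upstream returned ${res.status}`);
        } catch (err) {
          lastError = err; // network failure, timeout, etc.
        }
        // Back off 200ms, 400ms, 800ms, ... before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, 200 * 2 ** attempt));
      }
      throw lastError;
    }

Multiply that by every service-to-service edge and the extra KLOC add up fast.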

Now we have been adding in LLM solutions so there is no consistency in any of the above services.

Each dev, rather than looking at the existing approaches, asks Claude, and it provides a slightly different way each time, often pulling in additional libraries we have to support.

These days I see so much bad code, like a single microservice with three different approaches to making an HTTP request.


Agree. But we are already in that loop. A properly written 50KLOC "monolith, hence outdated" app is now 30 microservices: 20KLOC of surface area plus 100KLOC submerged in convenience libraries, Kubernetes, Grafana, Datadog, a service mesh, and so on. From what I am seeing, companies are increasingly using off-the-shelf components, so KLOC will keep rising but developer count will not.


Sure, but my specific question was fairly trivial, using a mainstream language and a popular library. Most of my work qualifies as CRUD peddling. And yet these tools are still wasting my time.

Maybe I'll have better luck next time, or maybe I need to improve my prompting skills, or use a different model, etc. I was just expecting more from state of the art LLMs in 2024.


Yeah there is a big disconnect between the devs caught up in the hype and the devs who aren't.

A lot of the devs in my office using Claude/GPT are convinced they are so much more productive, but they aren't actually producing features or bug fixes any faster.

I think they are just excited about a novel new way to write code.


Conversely, I feel that the experience of searching has degraded a lot since 2016/17. My thesis is that, around that time, online spam increased by an order of magnitude.


Old style Google search is dead, folks just haven’t closed the casket yet. My index queries are down ~90%. In the future, we’ll look back at LLMs as a major turning point in how people retrieve and consume information.


I still prefer it over using an LLM. And I doubt that LLM search has major benefits over Google search, imo.


Depends what you want it for.

Right now, I find each tool better at different things.

If I can only describe what I want but don't know the keywords, LLMs are the only solution.

If I need citations, LLMs suck.


Abstractive vs. extractive search.


I think it was the switch from desktop search traffic being dominant to mobile traffic being dominant; that switch happened around the end of 2016.

Google used to prioritise big comprehensive articles on subjects for desktop users, but mobile users just wanted quick answers, so that's what Google prioritised as mobile users became the biggest group.

But also, per your point, I think those smaller, simpler, less comprehensive posts are easier to fake/spam than the larger, more comprehensive posts that came before.


Ironically, I almost never see quick answers in the top results; mostly it's dragged-out pages of paragraph after paragraph with ads in between.


Guess who sells the ads…


Winning the war against spam is an arms race. Spammers haven't spent years targeting AI search yet.


It's getting ridiculous. Half of the time now when I ask AI to search some information for me, it finds and summarizes some very long article obviously written by AI, and lacking any useful information.


Queries were rewritten with BERT starting even before then so it's still the same generative model problem.


I don't think this is necessarily converse to what they said.



