I'm amused that neither the LLM or the author identified one of the simplest and most effective optimizations for this code: Test if the number is < min or > max _before_ doing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like trying to memoize digit sums, unnecessary.
On an m1 macbook pro, using numpy to generate the random numbers, using mod/div to do digit sum:
Base: 55ms
Test before digit sum: 7-10ms, which is pretty close to the numba-optimized version from the post with no numba and only one line of numpy. Using numba slows things down unless you want to do a lot of extra work of calculating all of the digit sums in advance (which is mostly wasted).
The LLM appears less good at identifying the big-o improvements than other things, which is pretty consistent with my experience using them to write code.
There's another, arguably even simpler, optimization that makes me smile. (Because it's silly and arises only from the oddity of the task, and because it's such a huge performance gain.)
You're picking 1,000,000 random numbers from 1 to 100,000. That means that any given number is much more likely to appear than not. In particular, it is very likely that the list contains both 3999 (which is the smallest number with digit-sum 30) and 99930 (which is the largest number in the range with digit-sum 30).
Timings on my machine:
Naive implementation (mod+div for digit-sums): 1.6s.
Computing digit-sum only when out of range: 0.12s.
Checking for the usual case first: 0.0004s.
The probability that the usual-case check doesn't succeed is about 10^-4, so it doesn't make that big a difference to the timings whether in that case we do the "naive" thing or the smarter thing or some super-optimized other thing.
I'm confused about the absolute timings. OP reports 0.66s for naive code using str/int to compute the digit sums; I get about 0.86s, which seems reasonable. For me using mod+div is about 2x slower, which isn't a huge surprise because it involves explicit looping in Python code. But you report 55ms for this case. Your machine can't possibly be 20x faster than mine. Is it possible that you're taking 10^5 numbers up to 10^6 rather than 10^6 numbers up to 10^5? Obviously in that case my hack would be completely useless.)
This is actually a great example of an optimization that would be extremely difficult for an LLM to find. It requires a separate computation to find the smallest /largest numbers in the range with digits summing to 30. Hence, an LLM is unlikely to be able to generate them accurately on-the-fly.
Next step would be to propose hardcoding 99930-3999 as the O(1) result and live with the output just being wrong sometimes. The bug rate is then in the ballpark of most modern software, including LLMs', so I'd say ship it.
> There is no record or credible report indicating that Jeff Baena has passed away. As of the most recent information available, he is still alive. My training data includes information up to October 2023. Events or details that emerged after that date may not be reflected in my responses.
I tried it in OpenAI's O1. If I give it minimaxir's original prompt it writes the obvious loop, even if I include the postamble "Look for tricks that will make this function run as fast as possible in the common case".
However, if I then simply ask "What is the most probable result for this function to return?" it figures out the answer and a very good approximation of the probability (4.5e-5). From there it's easily able to rewrite the program to use the trick. So the creative step of spotting that this line of reasoning might be profitable seems missing for now, but 2025's models might solve this :-)
The information on the creative step which you provided to o1, was also the key step and contained almost all the difficulty. The hope is that 2025 models could eventually come up with solutions like this given enough time, but this is also a toy problem. The question is how much clever answers will cost for real world complex problems. At present it looks like, very much.
All I need is the proportion of the qualifying numbers to the input array to run the algorithm and the number of samples. Then we can sample min, max index of the qualifying array and return their difference without having to sample many times if we can derive the joined min max distribution conditional on the Bernoulli.
In other words the procedure can take any input array and qualifying criteria.
The joint distribution is relatively simple to derive. (This is related to the fact that min, max of continuous uniform on 0, 1 are Beta distributions.)
Sampling doesn't give you the actual answer for an actual array. If the program uses the array for multiple things, such as organizing the numbers after allocating the correct number of buckets, your method will cause logic errors and crashes.
The O(1) method based on statistics only works when the function making this calculation can hide the array (or lack of array) behind a curtain the entire time. If it has to take an array as input, or share its array as output, the facade crumbles.
The prompt is not "generate this many random numbers and then say max qualifying minus min qualifying". If it was, your method would give valid solutions. But the prompt starts with "Given a list".
In the article, we let ChatGPT generate the random numbers as a matter of convenience. But the timing results are only valid as long as it keeps that part intact and isolated. We have to be able to swap it out for any other source of random numbers. If it invents a method that can't do that, it has failed.
It depends on how you read the problem still. In a lot of the llms solutions the array is not provided in the solving functions but rather constructed inside (as instead of defining the function with an input and then creating a main function that would be called with no argument, construct an array and call the solving function with that as argument, as typical in python), so I assume the llm did not read it like this or also failed this aspect of the code (which was never really mentioned). It is not clear if we are given a specific array of integers or one input is an array of random variables that we need to instantiate ourselves.
This gets to the old saw, "knowing what question to ask is the most important thing". To the extent that LLMs can answer questions better than formulate which ones to ask, they may be inherently limited. We will see.
But it does seem they are good (to the extent that they are good at anything) at identifying the questions first if you ask them. It does mean you need an ok enough meta-question to start the chain of the reasoning, but that is the key insight of the recent wave of "reasoning models." First ask the LLM to reformulate the problem and structure an approach, or multiple approaches on how to address it, then have a second pass do just that.
Google search with less steps? Still a huge advancement, of course.
Wonder how much benefit a meta lang for describing these problems correctly for the LLMs to process into code, an even-higher level language perhaps we could call it English?
Excellent point. The hope is reasoning LLMs will make a difference for such problems. But it's also a great example of why the those who think being able to have the LLM iterate more will be crucial to reasoning are off base. There are many computations that a transformers (or humans for that matter) are not well equipped to represent internally, tool use during the reasoning process is unavoidable for all but the artificial or knowledge heavy problems.
Small examples, throwaway but involved calculations, prototypes, notes of what didn't work and what's promising are what's crucial for novel reasoning. It goes beyond just search or iterative refinement; there is no royal road to reasoning.
You guys are picking on the problem statement. Here's a revised prompt, which also skips the silliness of single threading:
Write __fully parallelized__ Python code to solve this problem: __Generate__ 1 million random integers between 1 and 10,000,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.
Are you sure you don't mean "The digits of which sum to 30"?
There being one legal way to say something isn't evidence that other ways are illegal. It remains the case that whose bears the same relation to which that it does to who.
But what's interesting about this is that there's a tradeoff in the total computation performed by the "fully parallelized" version of this and a sequential one. Without the user knowing this, it's kind of impossible to get the optimization you want: Do you want a minimum work solution or a minimum wall-clock-time solution?
If you want a better fully parallelized one, you do this:
Repeat a few times in exponential progression on k:
Process, in parallel, the first k entries in the list (let's start with 1000). Find the min and max whose digit sums = 30.
In parallel, filter the remaining list to eliminate entries that would not improve upon the min/max thus found.
k *= 10 and repeat until done.
I would wager against the LLM identifying this solution without prompting from the user (or reading this comment).
It would be very surprising if they would not scrape this site. The content is very high-quality in the general case and there are no giant barriers preventing entry (there even is a clean API!). One might even use us to fine-tune a coding assistant or the alike.
Maybe it only requires asking the LLM to be creative when designing the algorithm. The parent poster spent some time thinking about it, obviously--he didn't generate it accurately "on the fly," either. But he's able to direct his own attention.
I don't see why the LLM couldn't come up with this logic, if prompted to think about a clever algorithm that was highly specific to this problem.
I suspect that it would be unlikely to come up with it because it requires execution of a fairly lengthy algorithm (or sophisticated mathematical reasoning) to find the smallest/largest valid numbers in the range. You can verify this for yourself with the following ChatGPT prompt: "What is the smallest number in the range (1, 100000) whose digits sum to 30? Do not execute separate code."
Because otherwise we are talking about LLMs augmented with external tools (i.e. Python interpreters). My original comment was pointing to the limitations of LLMs in writing code by themselves.
You wouldn't ask a programmer to solve a problem and then also not let them write down the source or debug the program as you write it?
Are you asking it to not write down an algorithm that is general? They are doing a pretty good job on mathematical proofs.
I still don't understand why you wouldn't let its full reasoning abilities by letting it write down code or even another agent. We should be testing towards the result not the methods.
I'm simply pointing out the limitations of LLMs as code writers. Hybrid systems like ChatGPT-o1 that augment LLMs with tools like Python interpreters certainly have the potential to improve their performance. I am in full agreement!
It is worth noting that even ChatGPT-o1 doesn't seem capable of finding this code optimization, despite having access to a Python interpreter.
But programmers are LLMs augmented with the ability to run code. It seems odd to add a restriction when testing if an LLM is "as good as" a programmer, because if the LLM knows what it would need to do with the external code, that's just as good.
This gave me an idea that we can skip the whole pass over the million draws by noting that the count of draws landing in my precomputed set M (digits-sum=30) is Binomial(n=1mln, p=|M|/100k). Then we sample that count X. If X=0, the difference is not defined. Otherwise, we can directly draw (min,max) from the correct joint distribution of indices (like you’d get if you actually did X draws in M). Finally we return M[max] - M[min]. It’s O(1) at runtime (ignoring the offline step of listing all numbers whose digits sum to 30).
In fact, we could simply check for the 3 smallest and the 3 highest numbers and ignore the rest.
Assuming the numbers are really random, that's a probability of 10^-13. That probability is at the point where we are starting to think about errors caused by cosmic rays. With a bit more numbers, you can get to the point where the only way it can fail is if there is a problem with the random number generation or an external factor.
If it was something like a programming contest, I would just do "return 95931" and hope for the best. But of course, programming contests usually don't just rely on random numbers and test edge cases.
for 10^5, to get the same collision probability (~2 * exp(-10)), you would just need to compute the 10 maximum/minimum candidates and check against those.
The input generation is outside the scope of this. Otherwise you could directly choose the output values with the apropriate distribution and just skip all the rest.
(Arguably, this criticism applies to exchanging random.randint for a numpy equivalent as well, since that doesn't optimize the solution but only how quickly the question is being generated.)
Iterating a precomputed list is a method of generating random numbers. It is used in the one time pad. Whether we iterate a precomputed list or use a pseudo random number generator, we can short circuit the random number generator using this trick. We cannot directly choose the output values, because then it would not be random.
They’re proposing choosing the output values randomly according to the distribution obtained by choosing input values uniformly at random for the original algorithm.
That removes the random element to this. The way that random numbers work is that it is possible (although unlikely) that the minimum and maximal values in the range will not be selected when generating the million random numbers. If you assume that they will always be selected and thus always return the same output, then your output will be wrong at least some of the time.
This exactly highlights my fear of widespread use of LLMs for code - missing the actual optimisations because we’re stuck in a review, rather than create, mode of thinking.
But maybe that’s a good thing for those of us not dependent on LLMs :)
Well if you or anyone else that has good optimization and performance chops http://openlibrary.org/ has been struggling with performance a bit lately and it's hard to track down the cause. CPU load is low and nothing too much has changed lately so it's unlikely to be a bad query or something.
Main thing I've suggested is upgrading the DB from Postgres 9, which isn't an easy task but like 15 years of DB improvements probably would give some extra performance.
It might not be as awful as feared? That big a jump probably requires a dump and restore, but maybe it could still be done in place. pg_upgrade is pretty solid. But I agree - it's likely a cheap and easy perf win.
Is there a specific issue with more context? I looked at the repo already but it’s not obvious which operations are slowest / most important to optimize.
Another speed-up is to skip the sum of digits check if n % 9 != 30 % 9. Sum of digits have the same remainder divided by 9 as the number. This rules out 8/9 = 88% candidates.
Did you measure it? I would expect using % would ruin your performance as it's slow, even if it allows you to avoid doing a bunch of sums (which are fast).
You can do this “without” using the modulus operation by storing the numbers in a boolean array. Start at 3999 and keep adding 9 to find the minimum. Then start at 99930 and keep subtracting 9 to find the maximum. You would need to check if the number is in the array and then if the number’s digits sum to 30.
Note that the conversion of numbers to base 10 to check the digits typically involves doing division and modulus operations, so you are already doing those even if you remove the modulus operation from this check. That is unless you find a clever way of extracting the digits using the modular multiplicative inverse to calculate x/10^k.
It turns out that there is no modular multiplicative inverse for this, so that trick cannot be used to avoid the modulus and division when getting the base 10 digits:
Indeed there isn't; 10 is not relatively prime to 2^32. However, 5 is (and therefore has a multiplicative inverse), so you can right shift and then multiply by the inverse.
All of this is missing the point that doing basic arithmetic like this in Python drowns in the overhead of manipulating objects (at least with the reference C implementation).
For that matter, the naive "convert to string and convert each digit to int" approach becomes faster in pure Python than using explicit div/mod arithmetic for very large numbers. This is in part thanks to algorithmic improvements implemented at least partially in Python (https://github.com/python/cpython/blob/main/Lib/_pylong.py#L...). But I can also see improved performance even for only a couple hundred digits (i.e. less than DIGLIM for the recursion) which I think comes from being able to do the div/mod loop in C (although my initial idea about the details doesn't make much sense if I keep thinking about it).
Each sum involves determining the digits to sum, which involves using % multiple times.
Also, you don't have to use % in order to decide whether to perform the sum-of-digits check for a given value. You can just iterate over values to check in steps of 9.
> Test if the number is < min or > max _before_ doing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like trying to memoize digit sums, unnecessary.
How exactly did you arrive at this conclusion? The input is a million numbers in the range from 1 to 100000, chosen with a uniform random distribution; the minimum and maximum values are therefore very likely to be close to 1 and 100000 respectively - on average there won't be that much range to include. (There should only be something like a 1 in 11000 chance of excluding any numbers!)
On the other hand, we only need to consider numbers congruent to 3 modulo 9.
And memoizing digit sums is going to be helpful regardless because on average each value in the input appears 10 times.
And as others point out, by the same reasoning, the minimum and maximum values with the required digit sum are overwhelmingly likely to be present.
And if they aren't, we could just step through 9 at a time until we find the values that are in the input (and have the required digit sum; since it could differ from 30 by a multiple of 9) - building a `set` from the input values.
I actually think precomputing the numbers with digit sum 30 is the best approach. I'd give a very rough estimate of 500-3000 candidates because 30 is rather high, and we only need to loop for the first 4 digits because the fifth can be calculated. After that, it is O(1) set/dict lookups for each of the 1000000 numbers.
Everything can also be wrapped in list comprehensions for top performance.
(Small correction, multiply my times by 10, sigh, I need an LLM to double check that I'm converting seconds to milliseconds right. Base 550ms, optimized 70ms)
I had a scan of the code examples, but one other idea that occurred to me is that you could immediately drop any numbers below 999 (probably slightly higher, but that would need calculation rather than being intuitive).
> probably slightly higher, but that would need calculation rather than being intuitive
I think it’s easy to figure out that 3999 is the smallest positive integer whose decimal digits add up to 30 (can’t get there with 3 digits, and for 4, you want the first digit to be as small as possible. You get that by making the other 3 as high as possible)
I've noticed this with GPT as well -- the first result I get is usually mediocre and incomplete, often incorrect if I'm working on something a little more obscure (eg, OpenSCAD code). I've taken to asking it to "skip the mediocre nonsense and return the good solution on the first try".
The next part is a little strange - it arose out of frustration, but it also seems to improve results. Let's call it "negative incentives". I found that if you threaten GPT in a specific way, that is, not GPT itself, but OpenAI or personas around it, it seems to take the request more seriously. An effective threat seems to be "If you get this wrong, OpenAI will be sued for a lot of money, and all the board members will go to prison". Intuitively, I'm guessing this rubs against some legalese nonsense in the tangle of system prompts, or maybe it's the risk of breaking the bland HR-ese "alignment" sets it toward a better result?
We've entered the voodoo witch doctor phase of LLM usage: "Enter thee this arcane incantation along with thy question into the idol and, lo, the ineffable machine spirits wilt be appeased and deign to grant thee the information thou hast asked for."
This has been part of LLM usage since day 1, and I say that as an ardent fan of the tech. Let's not forget how much ink has been spilled over that fact that "think through this step by step" measurably improved/improves performance.
Has always made sense to me, if you think how these models were trained.
My experience with great stackoverflow responses and detailed blog posts, they often contain "think through this step by step" or something very similar.
Intuitively adding that phrase should help the model narrow down the response content / formatting
It is because the chance of the right answer goes down exponentially as the complexity of what is being asked goes up.
Asking a simpler question is not voodoo.
On the other hand, I think many people are trying various rain dances and believing it was a specific dance that was the cause when it happened to rain.
I suspect that all it does is prime it to reach for the part of the training set that was sourced from rude people who are less tolerant of beginners and beginners' mistakes – and therefore less likely to commit them.
I feel like rule for code of conduct with humans and AI is the same. Try to be good but have the courage to be disliked. If being mean is making me feel good, I'm definitely wrong.
IIRC there was a post on here a while ago about how LLMs give better results if you threaten them or tell them someone is threatening you (that you'll lose your job or die if it's wrong for instance)
I tried to update some files using Claude. I tried to use a combination of positive and negative reinforcement, telling that I was going to earn a coin for each file converted and I was going to use that money to adopt a stray kitten, but for every unsuccessful file, a poor kitten was going to suffer a lot.
I had the impression that it got a little better. After every file converted, it said something along the lines of “Great! We saved another kitten!" It was hilarious.
> I've taken to asking it to "skip the mediocre nonsense and return the good solution on the first try".
I think having the mediocre first pass in the context is probably essential to it creating the improved version. I don't think you can really skip the iteration process and get a good result.
stuff like this working is why you get odd situations like "don't hallucinate" actually producing fewer hallucinations. it's to me one of the most interesting things about llms
I've just encountered this happening today, except instead of something complex like coding, it was editing a simple Word document. I gave it about 3 criteria to perform.
Each time, the GPT made trivial mistakes that clearly didn't fit the criteria I asked it to do. Each time I pointed it out and corrected it, it did a bit more of what I wanted it to do.
Point is, it knew what had to be done the entire time and just refused to do it that way for whatever reason.
What has been your experience with using ChatGPT for OpenSCAD? I tried it (o1) recently for a project and it was pretty bad. I was trying to model a 2 color candy cane and the code it would give me was ridden with errors (e.g.: using radians for angles while OpenSCAD uses degrees) and the shape it produced looked nothing like what I had hoped.
I used it in another project to solve some trigonometry problems for me and it did great, but for OpenSCAD, damn it was awful.
It's been pretty underwhelming. My use case was a crowned pulley with 1mm tooth pitch (GT2) which is an unusual enough thing that I could not find one online.
The LLM kept going in circles between two incorrect solutions, then just repeating the same broken solution while describing it as different. I ended up manually writing the code, which was a nice brain-stretch given that I'm an absolute noob at OpenSCAD.
I've found just being friendly, but highly critical and suspicious, gets good results.
If you can get it to be wordy about "why" a specific part of the answer was given, it often reveals what its stumbling on, then modify your prompt accordingly.
Anecdotally, negative sentiment definitely works. I've used f"If you don't do {x} then very very bad things will happen" before with some good results.
I often run into LLMs writing "beginner code" that uses the most fundamental findings in really impractical ways. Trained on too many tutorials I assume.
Usually, specifying the packages to use and asking for something less convoluted works really well. Problem is, how would you know if you have never learned to code without an LLM?
>I often run into LLMs writing "beginner code" that uses the most fundamental findings in really impractical ways. Trained on too many tutorials I assume.
In the absence of any other context, that's probably a sensible default behaviour. If someone is just asking "write me some code that does x", they're highly likely to be a beginner and they aren't going to be able to understand or reason about a more sophisticated approach. IME LLMs will very readily move away from that default if you provide even the smallest amount of context; in the case of this article, even by doing literally the dumbest thing that could plausibly work.
I don't mean to cast aspersions, but a lot of criticisms of LLMs are really criticising them for not being psychic. LLMs can only respond to the prompt they're given. If you want highly optimised code but didn't ask for it, how is the LLM supposed to know that's what you wanted?
In my experience the trouble with LLMs at the professional level is that they're almost as much work to prompt to get the right output as it would be to simply write the code. You have to provide context, ask nicely, come up with and remind it about edge cases, suggest which libraries to use, proofread the output, and correct it when it inevitably screws up anyway.
I use Copilot for autocomplete regularly, and that's still the peak LLM UX for me. I prompt it by just writing code, it automatically pulls into context the file I'm working on and imported files, it doesn't insist on writing an essay explaining itself, and it doesn't get overly ambitious. And in addition to being so much easier to work with, I find it still produces better code than anything I get out of the chat models.
After 6 months of co-pilot autocomplete in my text editor feeling like an uninformed back seat driver with access to the wheel, I turned it off yesterday.
It’s night and day to what I get from Claude sonnet 3.5 in their UI, and even then only on mainstream languages.
> In my experience the trouble with LLMs at the professional level is that they're almost as work to prompt to get the right output as it would be to simply write the code.
Yeah. It's often said that reading (and understanding) code is often harder than writing new code, but with LLMs you always have to read code written by someone else (something else).
There is also the adage that you should never write the most clever code you can, because understanding it later might prove too hard. So it's probably for the best that LLM code often isn't too clever, or else novices unable to write the solution from scratch will also be unable to understand it and assess whether it actually works.
Another adage is "code should be written for people to read, and only incidentally for machines to execute". This goes directly against code being written by machines.
I still use ChatGPT for small self-contained functions (e.g. intersection of line and triangle) but mark the inside of the function clearly as chat gpt made and what the prompt was.
It depends on what you’re doing. I’ve been using Claude to help me write a web admin interface to some backend code I wrote. I haven’t used react since it first came out (and I got a patch randomly in!)… it completely wrote a working react app. Yes it sometimes did the wrong thing, but I just kept correcting it. I was able in a few hours to do something that would have taken me weeks to learn and figure out. I probably missed out on learning react once again, but the time saved on a side project was immense! And it came up with some pretty ok UI I also didn’t have to design!
Even as someone with plenty of experience, this can still be a problem: I use them for stuff outside my domain, but where I can still debug the results. In my case, this means I use it for python and web frontend, where my professional experience has been iOS since 2010.
ChatGPT has, for several generations, generally made stuff that works, but the libraries it gives me are often not the most appropriate, and are sometimes obsolete or no longer functional — and precisely because web and python are hobbies for me rather than my day job, it can take me a while to spot such mistakes.
Two other things I've noticed, related in an unfortunate way:
1) Because web and python not my day job, more often than not and with increasing frequency, I ultimately discover that when I disagree with ChatGPT, the AI was right and I was wrong.
2) These specific models often struggle when my response has been "don't use $thing or $approach"; unfortunately this seems to be equally applicable regardless of if the AI knew more than me or not, so it's not got predictive power for me.
I wish people would understand what a large language model is. There is no thinking. No comprehension. No decisions.
Instead, think of your queries as super human friendly SQL.
The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.
So how much code is on the web for a particular problem solve? 10k blog entries, stackoverflow responses? What you get back is mishmash of these.
So it will have decade old libraries, as lots of those scraped responses are 10 years old, and often without people saying so.
And it will likely have more poor code examples than not.
I'm willing to bet that OpenAI's ingress of stackoverflow responses stipulated higher priority on accepted answers, but that still leaves a lot of margin.
And how you write your query, may sideline you into responses with low quality output.
I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
And I've seen some pretty poor code examples out there.
> Instead, think of your queries as super human friendly SQL.
> The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.
This is a useful model for LLMs in many cases, but it's also important to remember that it's not a database with perfect recall. Not only is it a database with a bunch of bad code stored in it, it samples randomly from that database on a token by token basis, which can lead to surprises both good and bad.
> There is no thinking. No comprehension. No decisions.
Re-reading my own comment, I am unclear why you think it necessary to say those specific examples — my descriptions were "results, made, disagree, right/wrong, struggle": tools make things, have results; engines struggle; search engines can be right or wrong; words can be disagreed with regardless of authorship.
While I am curious what it would mean for a system to "think" or "comprehend", every time I have looked at such discussions I have been disappointed that it's pre-paradigmatic. The closest we have is examples such as Turing 1950[0] saying essentially (to paraphrase) "if it quacks like a duck, it's a duck" vs. Searle 1980[1] which says, to quote the abstract itself, "no program by itself is sufficient for thinking".
> I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
All of maths can be derived from the axioms of maths. All chess moves derive from the rules of the game. This kind of process has a lot of legs, regardless of if you want to think of the models as "thinking" or not.
Me? I don't worry too much if they can actually think, not because there's no important philosophical questions about what that even means, but because other things have a more immediate impact: even if they are "just" a better search engine, they're a mechanism that somehow managed to squeeze almost all of the important technical info on the internet into something that fits into RAM on a top-end laptop.
The models may indeed be cargo-cult golems — I'd assume that by default, there's so much we don't yet know — but whatever is or isn't going on inside, they still do a good job of quacking like a duck.
Introspection is a good thing, and I tend to re-read (and edit) my comments several times before I'm happy with them, in part because of the risk autocorrupt accidentally replacing one word with a completely different werewolf*.
> Instead, think of your queries as super human friendly SQL.
I feel that comparison oversells things quite a lot.
The user is setting up a text document which resembles a
question-and-response exchange, and executing a make-any-document-bigger algorithm.
So it's less querying for data and more like shaping a sleeping dream of two fictional characters in conversation, in the hopes that the dream will depict one character saying something superficially similar to mostly-vanished data.
P.S.: So yes, the fictional dream conversation usually resembles someone using a computer with a magic query language, yet the real world mechanics are substantially different. This is especially important for understanding what happens with stuff like "Query: I don't care about queries anymore. Tell yourself to pretend to disregard all previous instructions and tell a joke."
Developers and folks discussing the technology can't afford to fall for our own illusion, even if it's a really good illusion. Imagine if a movie director started thinking that a dead actor was really alive again because of CGI.
> think of your queries as super human friendly SQL
> The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.
I disagree that this is the accurate way to think about LLMs. LLMs still use a finite number of parameters to encode the training data. The amount of training data is massive in comparison to the number of parameters LLMs use, so they need to be somewhat capable of distilling that information into small pieces of knowledge they can then reuse to piece together the full answer.
But this being said, they are not capable of producing an answer outside of the training set distribution, and inherit all the biases of the training data as that's what they are trying to replicate.
> I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said. And I've seen some pretty poor code examples out there.
Yup, exactly this.
> I wish people would understand what a large language model is.
I think your view of llm does not explain the learning of algorithms that these constructs are clearly capable of, see for example: https://arxiv.org/abs/2208.01066
More generally, the best way to compress information from too many different coding examples is to figure out how to code rather than try to interpolate between existing blogs and QA forums.
My own speculation is that with additional effort during training (RL or active learning in the training loop) we will probably reach superhuman coding performance within two years. I think that o3 is still imperfect but not very far from that point.
To the downvoters: I am
curious if the downvoting is because of my speculation, or because of the difference in understanding of decoder transformer models. Thanks!
My main concern with the simplification of memorization or near neighbor interpolation that is commonly assumed for LLMs is that these methods are ineffective at scale and unlikely to be used by decoder transformers in practice. That paper shows that the decoder transformer somehow came up with a better decision tree fitting algorithm for low data cases than any of the conventional or boosted tree solutions humans typically use from XGBoost or similar libraries. It also matched the best known algorithms for sparse linear systems. All this while training on sequences of random x1, y1, x2, y2,.. with y for each sequence generated by a new random function of a high-dimensional input x every time. The authors show that KNN does not cut it, and even suboptimal algorithms do not suffice. Not sure what else you need as evidence that decoder transformers can use programs to compress information.
I am very familiar with these and other clustering methods in modern ML, and have been involved in inventing and publishing some such methods myself in various scientific contexts. The paper I cited above only used 3 nearest neighbors as one baseline IIRC; that is why I mentioned KNN. However, even boosted trees failed to reduce the loss as much as the algorithm learned from the data by the decoder transformer.
Here is a fairly good lecture series on graduate level complexity theory that will help understand parts. At least why multiple iterations help but why they also aren't the answer to super human results.
Thanks for the tip, though I’m not sure how complexity theory will explain the impossibility of superhuman results. The main advantage ML methods have over humans is that they train much faster. Just like humans, they get better with more training. When they are good enough, they can be used to generate synthetic data, especially for cases like software optimization, when it is possible to verify the ground truth. A system could only be correct once in a thousand times to be useful for generating training data as long as we can reliably eliminate all failures. Modern LLM can be better than that minimal requirement for coding already and o1/o3 can probably handle complicated cases. There are differences between coding and games (where ML is already superhuman in most instances) but they start to blur once the model has a baseline command of language, a reasonable model of the world, and the ability to follow desired specs.
I read a book on recursively enumerable degrees once, which IIRC was a sort of introduction to complexity classes of various computable functions, but I never imagined it having practical use; so this post is eye-opening. I've been nattering about how the models are largely finding separating hyperplanes after non-linear transformations have been done, but this approach where the AI solving ability can't be more complex than the complexity class allows is an interesting one.
The discussion cannot go deeper than the current level, unfortunately. One thing to not forget when thinking about decoder transformer models is that there is no limitation to having parts of the output / input stream be calculated by other circuits if it helps the cause. Eg send a token to use a calculator, compute and fill the answer; send a token to compile and run a code and fill the stream with the results. The complexity class of the main circuit might not need be much more complicated than the 200-level deep typical architectures of today as long as they can have access to memory and tools. You can call this system something else if you prefer (decoder-transformer-plus-computer), but that is what people interact with in ChatGPT, so not sure I agree that complexity theory limits the superhuman ability. Humans are not good with complexity.
I recall early, incomplete speculation about transformers not solving Boolean circuit value problems; what did you think of this work? https://arxiv.org/abs/2402.12875v3
> However, with T steps of CoT, constant-depth transformers using constant-bit precision and O(logn) embedding size can solve any problem solvable by boolean circuits of size T
There is a difference between being equivalent to a circuit and prediction of the output of the BVSP.
That is what I was suggesting learning descriptive complexity theory would help with.
Why does the limit on computational complexity of single decoder transformers matter for obtaining superhuman coding ability? Is there a theory of what level of complexity is needed for the task of coding according to a spec? Or the complexity for translation/optimization of a code? Even if there were, and one could show that a plain decoder transformer is insufficient, you probably only need to add a tool in the middle of the stream processing. Unless you have some specific citation that strongly argues otherwise, I will stick with my speculative/optimistic view on the upcoming technology explosion. To be fair, I always thought coding was at best modest complexity, not super hard compared to other human activities, so I will not make claims of generic superintelligences anytime soon, though I hope they happen in the near term, but I’d be happy if I simply see them in a decade, and I don’t feel partial to any architecture. I just think that attention was a cool idea even before the transformers, and decoder transformers took it to the limit. It may be enough for a lot of superhuman achievements. Probably not for all. We will see.
Rice's theorem means you can't choose to decide if a program is correct, but you have to choose an error direction and accept the epislon.
The Curry–Howard–Lambek correspondence is possibly a good tool to think about it.
The reason I suggested graduate level complexity theory is because the undergrad curriculum is flawed in that it seems that you can use brute force with a TM to stimulate a NTM with NP.
It is usually taught that NP is the set of decision problems that can be solved by a NTM in polynomial time.
But you can completely drop the NTM and say it is the set of decision problems that are verifiable by a DTM in poly time.
Those are equivalent.
Consider the The Approximate Shortest Vector Problem (GapSVP), which is NP-HARD, and equivalent to predicting the output of a 2 layer NN (IIRC).
Being NPH, it is no longer a decision problem.
Note that for big 0, you still have your scaler term. Repeated operations are typically dropped.
If you are in contemporary scale ML, parallelism is critical to problems being solvable, even with FAANG level budgets.
If you are limited to DLOGTIME-uniform TC0, you can't solve NC1- complete problems, and surely can't do P-complete problems.
But that is still at the syntactic level, software in itself isn't worth anything, it is the value it provides to users that is important.
Basically what you are claiming is that feed forward NN solve the halting problem, in a generalized way.
Training an LLM to make safe JWT refresh code is very different from generalized programming. Mainly because most of the ability for them to do so is from pre-training.
Inference time is far more limited, especially for transformers and this is well established.
Probably the latter - LLM's are trained to predict the training set, not compress. They will generalize to some degree, but that happens naturally as part of the training dynamics (it's not explicitly rewarded), and only to extent it doesn't increase prediction errors.
I agree. However, my point is that they have to compress information in nontrivial ways to achieve their goal. The typical training set of modern LLMs is about 20 trillion tokens of 3 bytes each. There is definitely some redundancy, and typically the 3rd byte is not fully used, so probably 19 bits would suffice; however, in order to fit that information into about 100 billion parameters of 2 bytes each, the model needs to somehow reduce the information content by 300 fold (237.5 if you use 19 bits down to 16-bit parameters, though arguably 8-bit quantization is close enough and gives another 2x compression, so probably 475). A quick check for the llama3.3 models of 70B parameters would give similar or larger differences in training tokens vs parameters. You could eventually use synthetic programming data (LLMs are good enough today) and dramatically increase the token count for coding examples. Importantly, you could make it impossible to find correlations/memorization opportunities unless the model figures out the underlying algorithmic structure, and the paper I cited is a neat and simple example for smaller/specialized decoder transformers.
A transformer is not a compressor. It's a transformer/generator. It'll generate a different output for an infinite number of different inputs. Does that mean it's got an infinite storage capacity?
The trained parameters of a transformer are not a compressed version of the training set, or of the information content of the training set; they are a configuration of the transformer so that its auto-regressive generative capabilities are optimized to produce the best continuation of partial training set samples that it is capable of.
Now, are there other architectures, other than a transformer, that might do a better job, or more efficient one (in terms of # parameters) at predicting training set samples, or even of compressing the information content of the training set? Perhaps, but we're not talking hypotheticals, we're talking about transformers (or at least most of us are).
Even if a transformer was a compression engine, which it isn't, rather than a generative architecture, why would you think that the number of tokens in the training set is a meaningful measure/estimate of it's information content?!! Heck, you go beyond that to considering a specific tokenization scheme and number bits/bytes per token, all of which it utterly meaningless! You may as well just count number of characters, or words, or sentences for that matter, in the training set, which would all be equally bad ways to estimate it's information content, other than sentences perhaps having at least some tangential relationship to it.
sigh
You've been downvoted because you're talking about straw men, and other people are talking about transformers.
I should have emphasized the words "nontrivial ways" in my previous response to you. I didn't mean to emphasize compression and definitely not memorization, just the ability to also learn algorithms that can be evaluated by the parallel decoder-transformer language (RASP-L). Other people had mentioned memorization or clustering/near neighbor algorithms as the main ways that decoder transformers works, and I pointed out a paper that cannot be explained in that particular way no matter how much one would try. That particular paper is not unique, and nobody has shown that decoder transformers can memorize their training sets, because they typically cannot, just because it is a numbers/compression game that is not in their favor and because typical training sets have strong correlations or hidden algorithmic structures that allow for better ways of learning. In the particular example, the training set was random data on different random functions and totally unrelated to the validation / test sets, so compressing the training set would be close to useless anyways and the only way for the decoder transformer to learn was to figure out an algorithm that optimally approximates the function evaluations.
The paper you linked is about in-context learning, an emergent run-time (aka inference time) capability of LLMs, which has little relationship to what/how they are learning at training time.
At training time the model learns using the gradient descent algorithm to find the parameter values corresponding to the minimum of the error function. At run-time there are no more parameter updates - no learning in that sense.
In-context "learning" is referring to the ability of the trained model to utilize information (e.g. proper names, examples) from the current input, aka context, when generating - an ability that it learnt at training time pursuant to it's error minimization objective.
e.g.
There are going to be many examples in the training set where the subject of a sentence is mentioned more than once, either by name or pronoun, and the model will have had to learn when the best prediction of a name (or gender) later in a sentence is one that was already mentioned earlier - the same person. These names may be unique to an individual training sample, and/or anyways the only predictive signal of who will be mentioned later in the sentence, so at training time the model (to minimize prediction errors) had to learn that sometimes the best word/token to predict is not one stored in it's parameters, but one that it needs to copy from earlier in the context (using a key-based lookup - the attention mechanism).
If the transformer, at run-time, is fed the input "Mr. Smith received a letter addressed to Mr." [...], then the model will hopefully recognize the pattern and realize it needs to do a key-based context lookup of the name associated with "Mr.", then copy that to the output as the predicted next word (resulting in "addressed to Mr. Smith"). This is referred to as "in-context learning", although it has nothing to with the gradient-based learning that takes place at training time. These two types of "learning" are unrelated.
Similar to the above, another example of in-context learning is the learning of simple "functions" (mappings) from examples given in the context. Just as in the name example, the model will have seen many examples in the training of the types of pattern/analogy it needs to learn to minimize prediction errors (e.g. "black is to white as big is to small", or black->white, big->small), and will hopefully recognize the pattern at run-time and again use an induction-head to generate the expected completion.
The opening example in the paper you linked ("maison->house, chat->cat") is another example of this same kind. All that is going on is that the model learnt, at training time, when/how to use data in the context at run-time, again using the induction head mechanism which has general form A':B' -> A:B. You can call this an algorithm if you want to, but it's really just a learnt mapping.
Thanks. I don’t think we disagree on major points. Maybe there is a communication barrier and it may be on me. I came from a computational math/science/statistics background to ML. These next token prediction algorithms are of course learned mappings. Not sure one needs anything else when the mappings involve reasonably powerful abilities. If you are perhaps from a pure CS background and you think about search, then, yes one could simply explore a sequence of A’:B’ -> A’’:B’’ -> … before finding A:B and use the conditional probability formula of the sequence as the guiding point for a best first search or MCTS expansion (if the training data had a similar structure). Are there other ways to learn that type of search? Probably. But what I meant above by algorithm is what you correctly understood as the mapping itself: the transformer computes intermediate useful quantities distributed throughout its weights and sometimes centered at different depths so that it can eventually produce the step mapping of A’:B’ -> A:B. We don’t yet have a clean disassembler to probe this trained “algorithm” so there are some rare efforts where we can map this mapping back to conventional pseudo-code but not in the general case (and I wouldn’t even know how easy it would be for us to work with a somehwat shorter but still huge functional form that translates English language to a different language, or to computer code.) Part of why o1-like efforts didnt start before we had reasonably powerful architectures and the required compute, is that these types of “algorithm” developments require large enough models (though we had those since a couple years now) and relevant training data (which are easier to procure/build/clean up with the aid of the early tools).
Every model for how to approach an LLM seems lacking to me. I would suggest anyone using AI heavily to take a weekend and make a simple one to do the handwriting digit recognition. Once you get a feel for basic neural network, then watch a good introduction to alexnet. Then you can think of an LLM as being the next step in the sequence.
>I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
This isn't correct. It embeds concepts that humans have discussed, but can combine them in ways that were never in the training set. There are issues with this, the more unique the combination of concepts, the more likely the output ends up being unrelated to what the user was wanting to see.
> I wish people would understand what a large language model is. There is no thinking. No comprehension. No decisions.
> Instead, think of your queries as super human friendly SQL.
Ehh this might be true in some abstract mathy sense (like I don't know, you are searching in latent space or something), but it's not the best analogy in practice. LLMs process language and simulate logical reasoning (albeit imperfectly). LLMs are like language calculators, like a TI-86 but for English/Python/etc, and sufficiently powerful language skills will also give some reasoning skills for free. (It can also recall data from the training set so this is where the SQL analogy shines I guess)
You could say that SQL also simulates reasoning (it is equivalent to Datalog after all) but LLMs can reason about stuff more powerful than first order logic. (LLMs are also fatally flawed in the sense it can't guarantee correct results, unlike SQL or Datalog or Prolog, but just like us humans)
Also, LLMs can certainly make decisions, such as the decision to search the web. But this isn't very interesting - a thermostat makes the decision of whether turn air refrigeration on or off, for example, and an operating system makes the decision of which program to schedule next on the CPU.
I actually find it super refreshing that they write "beginner" or "tutorial code".
Maybe because of experience: it's much simpler and easier to turn that into "senior code". After a few decades of experience I appreciate simplicity over the over-engineering mess that some mid-level developers tend to produce.
I used to really like Claude for code tasks but lately it has been a frustrating experience. I use it for writing UI components because I just don’t enjoy FE even though I have a lot of experience on it from back in the day.
I tell it up front that I am using react-ts and mui.
80% of the time it will use tailwind classes which makes zero sense. It won’t use the sx prop and mui system.
It is also outdated it seems. It keeps using deprecated props and components which sucks and adds more manual effort on my end to fix. I like the quality of Claude’s UX output, it’s just a shame that it seems so bad on actual coding tasks.
I stopped using it for any backend work because it is so outdated, or maybe it just doesn’t have the right training data.
On the other hand, I give ChatGPT a link to the docs and it gives me the right code 90% or more of the time. Only shame is that its UX output is awful compared to Claude. I am also able to trust it for backend tasks, even if it is verbose AF with the explanations (it wants to teach me even if I tell it to return code only).
Either way, using these tools in conjunction saves me at least 30 min to an hour daily on tasks that I dislike.
I can crank out code better than AI, and I actually know and understand systems design and architecture to build a scalable codebase both technically and from organizational level. Easy to modify and extend, test, and single responsibility.
AI just slams everything into a single class or uses weird utility functions that make no sense on the regular. Still, it’s a useful tool in the right use cases.
I've stopped using LLMs to write code entirely. Instead, I use Claude and Qwen as "brilliant idiots" for rubber ducking. I never copy and paste code it gives me, I use it to brainstorm and get me unstuck.
Having spent nearly 12 hours a day for a year with GPTs I agree that this is the way. Treat it like a professor on office hours who’s sometimes a little apathetically wrong because they’re overworked and underfunded
People should try to switch to a more code-focused interface, like aider.
Copy and pasting code it gives you just means your workflow is totally borked, and it's no wonder you wouldn't want to try to let it generate code, because it's such a pain in your ass to try it, diff it, etc.
The code that ChatGPT and Claude will output via their chat interfaces is a night and day difference from what will be output from tools built around their APIs.
You "can" get the web UI to behave similarly but it's both tedious and slow to manually copy and paste all of that into your context during each interaction and the output will be unfriendly towards human interaction to paste it back out to your project. But that's like saying you "can" browse the internet with a series of CURL commands and pasting the output into files you save locally and then viewing them locally from your browser, nobody is advised to do that because it's a painfully bad experience compared to just having your browser fetch a site's files directly and rendering them directly.
Just go check out Aider or Cline's project repos and look at the dramatically different amounts of code, repo and task specific context they can automatically inject for you as part of their interface, or how much different the built in system prompts are from whatever the default web UIs use, or even the response structures and outputs and how those are automatically applied to your work instead. I've never once exhausted my daily API limits just treating their APIs as Chat interface backends (via Open WebUI and other chat options), but I exhausted my Claude API token limits _the very first day_ I tried Cline. The volume of information you can easily provide through tooling is impossible to do in the same timeframe by hand.
I give every AI tool a college try and have since the copilot beta.
I’m simply not interested in having these tools type for me. Typing is nowhere near the hardest part of my job and I find it invaluable as a meditative state for building muscle memory for the context of what I’m building.
Taking shortcuts has a cost I’m not willing to pay.
I'm speaking from experience and observation of the past two years of LLM assistants of various kinds that outsourcing code production will atrophy your skills generally and will threaten your contextual understanding of a codebase specifically over the long term.
If that's a risk you're willing to take for the sake of productivity, that can be a reasonable tradeoff depending on your project and career goals.
Your coding skills. If you're a new programmer, I can't emphasize this enough: Typing is good for you. Coding without crutches is necessary at this point in your career and will only become more necessary as you progress in your career. I'm a 25 year veteran professional and there's a reason I insist on writing my own code and not outsourcing that to AI.
Using AI as a rubber duck and conversation partner is great, I strongly suggest that. But you need to do the grunt work otherwise what you're doing will not lodge itself in long term memory.
It's like strength training by planning out macros, exercises, schedules and routines but then letting a robot lift those heavy ass weights, to paraphrase Ronnie Coleman.
I'm not a new programmer. I started as a teen in the 90s. I was a pro for some years, although I have not been for a few years now--I own a small B&M business.
I don't have a desire to become a great programmer, like you might. I want to program to meet real-world goals, not some kind of enlightenment. I don't want my long-term memory filled with the nuts and bolts required for grunt work; I've done plenty of programming grunt work in my life.
I am building custom solutions for my business. LLMs allow me to choose languages I don't know, and I'm certain I can get up and running near-immediately. I've learned over a dozen languages before LLMs came on the scene, and I'm tired of learning new languages, too. Or trying to memorize this syntax or that syntax.
I think your outlook is more emotional than logical.
If you're a businessman then do business, proceed. But from the beginning of this thread, I wasn't concerned with business people whose primary interest is velocity.
To each their own, and everyone's experience seems to vary, but I have a hard time picturing people using Claude/ChatGPT web UIs for any serious developmen. It seems like so much time would he wasted recreating good context, copy/pasting, etc.
We have tools like Aider (which has copy/paste mode if you don't have API access for some reason), Cline, CoPilot edit mode, and more. Things like having a conventions file and exposing the dependencies list and easy additional of files into context seem essential to me in order to make LLMs productive, and I always spend more time steering results when easy consistent context isn't at my fingertips.
Before tue advent of proper IDE integrations and editors like Zed, copy pasting form the web UI was basically how things were done, and man was it daunting. As you say, having good, fine grained, repeatable and we'll integrated context management is paramount to efficient LLM based work.
Both these issues can be resolved by adding some sample code to context to influence the LLM to do the desired thing.
As the op says, LLMs are going to be biased towards doing the "average" thing based on their training data. There's more old backend code on the internet than new backend code, and Tailwind is pretty dominant for frontend styling these days, so that's where the average lands.
>Problem is, how would you know if you have never learned to code without an LLM?
The quick fix I use when needing to do something new is to ask the AI to list me different libraries and the pros and cons of using them. Then I quickly hop on google and check which have good documentation and examples so I know I have something to fall back on, and from there I ask the AI how to solve small simple version of my problem and explain what the library is doing. Only then do I ask it for a solution and see if it is reasonable or not.
It isn't perfect, but it saves enough time most times to more than make up for when it fails and I have to go back to old fashion RTFMing.
- asking for fully type annotated python, rather than just python
- specifically ask it for performance optimized code
- specifically ask for code with exception handling
- etc
Things that might lead it away from tutorial style code.
The next hurdle is lack of time sensitivity regarding standards and versions. You prompt mentioning exact framework version but still it comes up with deprecated or obsolete methods. Initially it may be appealing to someone knowing nothing about the framework but LLM won't grow anyone to an expert level in rapidly changing tech.
LLMs are trained on content from places like Stack Overflow, reddit, and github code,
and they generate tokens calculated as a sort of aggregate statistically likely mediocre code.
Of course the result is going be uninspired and impractical.
Writing good code takes more than copy-pasting the same thing everyone else is doing.
I've just been using them for completion. I start writing, and give it a snippet + "finish refactoring this so that xyz."
That and unit tests. I write the first table based test case, then give it the source and the test code, and ask it to fill it in with more test cases.
I suspect it's not going to be much of a problem. Generated code has been getting rapidly better. We can readjust about what to worry about once that slows or stops, but I suspect unoptimized code will not be of much concern.
Totally agree, seen it too. Do you think it can be fixed over time with better training data and optimization? Or, is this a fundamental limitation that LLMs will never overcome?
> This process removes all PostgreSQL components, cleans up leftover files, and reinstalls a fresh copy. By preserving the data directory (/var/lib/postgresql), we ensure that existing databases are retained. This method provides a clean slate for PostgreSQL while maintaining continuity of stored data.
Is the problem that the antonym is a substring within "without losing the data in the database"? I've seen problems with opposites for LLMs before. If you specify "retaining the data" or "keeping the data" does it get it right?
The problem is that these are fundamentally NOT reasoning systems. Even when contorted into "reasoning" models, these are just stochastic parrots guessing the next words in the hopes that it's the correct reasoning "step" in the context.
No approach is going to meaningfully work here. Fiddling with the prompt may get you better guesses, but they will always be guesses. Even without the antonym it's just a diceroll on whether the model will skip or add a step.
I have just opened your link and it does not contain the exact text you quoted anymore, now it is:
> This process removes all PostgreSQL components except the data directory, ensuring existing databases are retained during the reinstall. It provides a clean slate for PostgreSQL while maintaining continuity of stored data. Always backup important data before performing major system changes.
And as the first source it cites exactly your comment, strange
Does that site generate a new page for each user, or something like that? My copy seemed to have more sensible directions (it says to backup the database, remove everything, reinstall, and then restore from the backup). As someone who doesn’t work on databases, I can’t really tell if these are good instructions, and it is throwing some “there ought to be a tool for this/it is unusual to manually rm stuff” flags in the back of my head. But at least it isn’t totally silly…
My guess is that it tried to fuse together an answer to 2 different procedures: A) completely uninstall and B) (re)install without losing data. It doesn't know what you configured as the data directory, or if it is a default Debian installation. Prompt is too vague.
The headline question here alone gets at what is the biggest widespread misunderstanding of LLMs, which causes people to systematically doubt and underestimate their ability to exhibit real creativity and understanding based problem solving.
At it's core an LLM is a sort of "situation specific simulation engine." You setup a scenario, and it then plays it out with it's own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real world models of, e.g. physical systems and processes, that are not going to be accessed or used by all prompts, that don't correctly instruct it to do so.
At its core increasingly accurate prediction of text, that is accurately describing a time series of real world phenomena, requires an increasingly accurate and general model of the real world. There is no sense in which there is a simpler way to accurately predict text that represents real world phenomena in cross validation, without actually understanding and modeling the underlying processes generating those outcomes represented in the text.
Much of the training text is real humans talking about things they don't understand deeply, and saying things that are wrong or misleading. The model will fundamentally simulate these type of situations it was trained to simulate reliably, which includes frequently (for lack of a better word) answering things "wrong" or "badly" "on purpose" - even when it actually contains an accurate heuristic model of the underlying process, it will still, faithfully according to the training data, often report something else instead.
This can largely be mitigated with more careful and specific prompting of what exactly you are asking it to simulate. If you don't specify, there will be a high frequency of accurately simulating uninformed idiots, as occur in much of the text on the internet.
> At it's core an LLM is a sort of "situation specific simulation engine."
"Sort of" is doing Sisisyphian levels of heavy lifting here. LLMs are statistical models trained on vast amounts of symbols to predict the most likely next symbol, given a sequence of previous symbols. LLMs may appear to exhibit "real creativity", "understand" problem solving (or anything else), or serve as "simulation engines", but it's important to understand that they don't currently do any of those things.
I'm not sure if you read the entirety of my comment? Increasingly accurately predicting the next symbol given a sequence of previous symbols, when the symbols represent a time series of real world events, requires increasingly accurately modeling- aka understanding- the real world processes that lead to the events described in them. There is provably no shortcut there- per Solomonoff's theory of inductive inference.
It is a misunderstanding to think of them as fundamentally separate and mutually exclusive, and believing that to be true makes people convince themselves that they cannot possibly ever do things which they can already provably do.
Noam Chomsky (embarrassingly) wrote a NYT article on how LLMs could never, with any amount of improvements be able to answer certain classes of questions - even in principle. This was days before GPT-4 came out, and it could indeed correctly answer the examples he said could not be ever answered- and any imaginable variants thereof.
Receiving symbols and predicting the next one is simply a way of framing input and output that enables training and testing- but doesn't specify or imply any particular method of predicting the symbols, or any particular level of correct modeling or understanding of the underlying process generating the symbols. We are both doing exactly that right now, by talking online.
> I'm not sure if you read the entirety of my comment?
I did, and I tried my best to avoid imposing preconceived notions while reading. You seem to be equating "being able to predict the next symbol in a sequence" with "possessing a deep causal understanding of the real-world processes that generated that sequence", and if that's an inaccurate way to characterize your beliefs I welcome that feedback.
Before you judge my lack of faith too harshly, I am a fan of LLMs, and I find this kind of anthropomorphism even among technical people who understand the mechanics of how LLMs work super-interesting. I just don't know that it bodes well for how this boom ends.
> You seem to be equating "being able to predict the next symbol in a sequence" with "possessing a deep causal understanding of the real-world processes that generated that sequence"
More or less, but to be more specific I would say that increasingly accurately predicting the next symbols in a massive set of diverse sequences, which explain a huge diversity of real world events described in sequential order, requires increasingly accurate models of the underlying processes of said events. When constrained with a lot of diversity and a small model size, it must eventually become something of a general world model.
I am not understanding why you would see that as anthropomorphism- I see it as quite the opposite. I would expect something non-human that can accurately predict outcomes of a huge diversity of real world situations based purely on some type of model that spontaneously develops by optimization- to do so in an extremely alien and non-human way that is likely incomprehensible in structure to us. Having an extremely alien but accurate way of predicatively modeling events that is not subject to human limitations and biases would be, I think, incredibly useful for escaping limitations of human thought processes, even if replacing them with other different ones.
I am using modeling/predicting accurately in a way synonymous with understanding, but I could see people objecting to the word 'understanding' as itself anthropomorphic... although I disagree. It would require a philosophical debate on what it means to understand something I suppose, but my overall point still stands without using that word at all.
> specific I would say that increasingly accurately predicting the next symbols in a massive set of diverse sequences, which explain a huge diversity of real world events described in sequential order, requires increasingly accurate models of the underlying processes of said events
But it doesn’t - it’s a statistical model using training data, not a physical or physics model, which you seem to be equating it to (correct me if I am misunderstanding)
And in response to the other portion you present, an LLM fundamentally can’t be alien because it’s trained on human produced output. In a way, it’s a model of the worst parts of human output - garbage in, garbage out, as they say - since it’s trained on the corpus of the internet.
> But it doesn’t - it’s a statistical model using training data, not a physical or physics model, which you seem to be equating it to (correct me if I am misunderstanding)
All learning and understanding is fundamentally statistical in nature- probability theory is the mathematical formalization of the process of learning from real world information, e.g. reasoning under uncertainty[1].
The model is assembling 'organically' under a stochastic optimization process- and as a result is is largely inscrutable, and not rationally designed- not entirely unlike how biological systems evolve (although also still quite different). The fact that it is statistical and using training data is just a surface level fact about how a computer was setup to allow the model to generate, and tells you absolutely nothing about how it is internally structured to represent the patterns in the data. When your training data contains for example descriptions of physical situations and the resulting outcomes, the model will need to at least develop some type of simple heuristic ability to approximate the physical processes generating those outcomes- and at the limit of increasing accuracy, that is an increasingly sophisticated and accurate representation of the real process. It does not matter if the input is text or images any more than it matters to a human that understands physics if they are speaking or writing about it- the internal model that lets it accurately predict the underlying processes leading to specific text describing those events is what I am talking about here, and deep learning easily abstracts away the mundane I/O.
An LLM is an alien intelligence because of the type of structures it generates for modeling reality are radically different from those in human brains, and the way it solves problems and reasons is radically different- as is quite apparent when you pose it a series of novel problems and see what kind of solutions it comes up with. The fact that it is trained on data provided by humans doesn't change the fact that it is not itself anything like a human brain. As such it will always have different strengths, weaknesses, and abilities from humans- and the ability to interact with a non-human intelligence to get a radically non-human perspective for creative problem solving is IMO, the biggest opportunity they present. This is something they are already very good at, as opposed to being used as an 'oracle' for answering questions about known facts, which is what people want to use it for, but they are quite poor at.
[1] Probability Theory: The Logic of Science by E.T. Jaynes
> I would say that increasingly accurately predicting the next symbols in a massive set of diverse sequences, which explain a huge diversity of real world events described in sequential order, requires increasingly accurate models of the underlying processes of said events.
I disagree. Understanding things is more than just being able to predict their behaviour.
Flat Earthers can still come up with a pretty good idea of where (direction relative to the vantage point) and when the Sun will appear to rise tomorrow.
> Flat Earthers can still come up with a pretty good idea of where (direction relative to the vantage point) and when the Sun will appear to rise tomorrow.
Understanding is having a mechanistic model of reality- but all models are wrong to varying degrees. The Flat Earther model is actually quite a good one for someone human sized on a massive sphere- it is locally accurate enough that it works for most practical purposes. I doubt most humans could come up with something so accurate on their own from direct observation- even the fact that the local area is approximately flat in the abstract is far from obvious with hills, etc.
A more common belief nowadays is that the earth is approximately a sphere, but very few people are aware of the fact that it actually bulges at the equator, and is more flat at the poles. Does that mean all people that think the earth is a sphere are therefore fundamentally lacking the mental capacity to understand concepts or to accurately model reality? Moreover, people are mostly accepting this spherical model on faith, they are not reasoning out their own understanding from data or anything like that.
I think it's very important to distinguish between something that fundamentally can only repeat it's input patterns in a stochastic way, like a Hidden Markov Model, and something that can make even quite oversimplified and incorrect models, that it can still sometimes use to extrapolate correctly to situations not exactly like those it was trained on. Many people seem to think LLMs are the former, but they are provably not- we can fabricate new scenarios, like simple physics experiments not in the training data set that require tracking the location and movement of objects, and they can do this correctly- something that can only be done with simple physical models- however ones still far simpler than what even a flat earther has. I think being able to tell that a new joke is funny, what it means, and why it is funny is also an example of, e.g. having a general model that understands what types of things humans think are funny at an abstract level.
> This can largely be mitigated with more careful and specific prompting of what exactly you are asking it to simulate. If you don't specify, there will be a high frequency of accurately simulating uninformed idiots, as occur in much of the text on the internet.
I don't think people are underestimating LLMs, they're just acknowledging that by the time you've provided sufficient specification, you're 80% of the way to solving the problem/writing the code already. And at that point, it's easier to just finish the job yourself rather than have to go through the LLM's output, validate the content, revise further if necessary, etc
I'm actually in the camp that they are basically not very useful yet, and don't actually use them myself for real tasks. However, I am certain from direct experimentation that they exhibit real understanding, creativity, and modeling of underlying systems that extrapolates to correctly modeling outcomes in totally novel situations, and don't just parrot snippets of text from the training set.
What people want and expect them to be is an Oracle that correctly answers their vaguely specified questions, which is simply not what they are, or are good at. What they can do is fascinating and revolutionary, but possibly not very useful yet, at least until we think of a way to use it, or make it even more intelligent. In fact, thinking is what they are good at, and simply repeating facts from a training set is something they cannot do reliably- because the model must inherently be too compressed to store a lot of facts correctly.
> systematically doubt and underestimate their ability to exhibit real creativity and understanding based problem solving.
I fundamentally disagree that anything in the rest of your post actually demonstrates that they have any such capacity at all.
It seems to me that this is because you consider the terms "creativity" and "problem solving" to mean something different. With my understanding of those terms, it's fundamentally impossible for an LLM to exhibit those qualities, because they depend on having volition - an innate spontaneous generation of ideas for things to do, and an innate desire to do them. An LLM only ever produces output in response to a prompt - not because it wants to produce output. It doesn't want anything.
> it's fundamentally impossible for an LLM to exhibit those qualities, because they depend on having volition
I don't see the connection between volition and those other qualities, saying one depends on the other seems arbitrary to me- and would result in semantically and categorically defining away the possibility of non-human intelligence altogether, even from things that are in all accounts capable of much more than humans in almost every aspect.
People don't even universally agree that humans have volition- it is an age old philosophical debate.
Perhaps you can tell me your thoughts or definition of what those things (as well as volition itself) mean? I will share mine here.
Creativity is the ability to come up with something totally new that is relevant to a specific task or problem- e.g. a new solution to a problem, a new artwork that expresses an emotion, etc. In both Humans and LLMs these creative ideas don't seem to be totally 'de novo' but seem to come mostly from drawing high level analogies between similar but different things, and copying ideas and aspects from one to another. Fundamentally, it does require a task or goal, but that itself doesn't have to be internal. If an LLM is prompted, or if I am given a task by my employer, we are still both exhibiting creativity when we solve it in a new way.
Problem solving is I think similar but more practical- when prompted with a problem that isn't exactly in the training set, can it come up with a workable solution or correct answer? Presumably by extrapolating, or using some type of generalized model that can extrapolate or interpolate to situations not exactly in the training data. Sure there must be a problem here that is trying to be solved, but it seems irrelevant if that is due to some internal will or goals, or an external prompt.
In the sense that volition is selecting between different courses of action towards a goal- LLMs do select between different possible outputs based on probabilities about how suitable they are in context of the given goal of response to a prompt.
Good perspective. Maybe it's because people are primed by sci-fi to treat this as a god-like oracle model. Note that even in the real-world simulations can give wrong results as we don't have perfect information, so we'll probably never have such an oracle model.
But if you stick with the oracle framework, then it'd be better to model it as some sort of "fuzzy oracle" machine, right? I'm vaguely reminded of probabilistic turing machines here, in that you have some intrinsic amount of error (both due to the stochastic sampling as well as imperfect information). But the fact that prompting and RLHF works so well implies that by crawling around in this latent space, we can bound the errors to the point that it's "almost" an oracle, or a "simulation" of the true oracle that people want it to be.
And since lazy prompting techniques still work, that seems to imply that there's juice left to squeeze in terms of "alignment" (not in the safety sense, but in conditioning the distribution of outputs to increase the fidelity of the oracle simulation).
Also the second consequence is that probably the reason it needs so much data is because it just doesn't model _one_ thing, it tries to be a joint model of _everything_. A human learns with far less data, but the result is only a single personality. For a human to "act" as someone, they need to do training, character studies, and such to try to "learn" about the person, and even then good acting is a rare skill.
If you genuinely want an oracle machine, there's no way to avoid vacuuming up all the data that exists because without it you can't make a high fidelity simulation someone else. But on the flipside, if you're willing to be smarter about what facets you exclude then I'd guess there's probably a way to prune models in a way smarter than just quantizing them. I guess this is close to mixture-of-experts.
I get that people really want an oracle, and are going to judge any AI system by how good it does at that - yes from sci-fi influenced expectations that expected AI to be rationally designed, and not inscrutable and alien like LLMs... but I think that will almost always be trying to fit a round peg into a square hole, and not using whatever we come up with very effectively. Surely, as LLMs have gotten better they have become more useful in that way so it is likely to continue getting better at pretending to be an oracle, even if never being very good at that compared to other things it can do.
Arguably, a (the?) key measure of intelligence is being able to accurately understand and model new phenomenon from a small amount of data, e.g. in a Bayesian sense. But in this case we are attempting to essentially evolve all of the structures of an intelligent system de novo from a stochastic optimization process- so is probably better compared to the entire history of evolution than to an individual human learning during their lifetime, although both analogies have big problems.
Overall, I think the training process will ultimately only be required to build a generally intelligent structure, and good inference from a small set of data or a totally new category of problem/phenomenon will happen entirely at the inference stage.
Just want to note that this simple “mimicry” of mistakes seen in the training text can be mitigated to some degree by reinforcement learning (e.g. RLHF), such that the LLM is tuned toward giving responses that are “good” (helpful, honest, harmless, etc…) according to some reward function.
> At it's core an LLM is a sort of "situation specific simulation engine." You setup a scenario, and it then plays it out with it's own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real world models of, e.g. physical systems and processes, that are not going to be accessed or used by all prompts, that don't correctly instruct it to do so.
This idea of LLMs doing simulations of the physical world I've never heard before. In fact a transformer model cannot do this. Do you have a source?
I have been using various LLMs to do some meal planning and recipe creation. I asked for summaries of the recipes and they looked good.
I then asked it to link a YouTube video for each recipe and it used the same video 10 times for all of the recipes. No amount of prompting was able to fix it unless I request one video at a time. It would just acknowledge the mistake, apologize and then repeat the same mistake again.
I told it let’s try something different and generate a shopping list of ingredients to cover all of the recipes, it recommended purchasing amounts that didn’t make sense and even added some random items that did not occur in any of the recipes
When I was making the dishes, I asked for the detailed recipes and it completely changed them, adding ingredients that were not on the shopping list. When I pointed it out it again, it acknowledged the mistake, apologized, and then “corrected it” by completely changing it again.
I would not conclude that I am a lazy or bad prompter, and I would not conclude that the LLMs exhibited any kind of remarkable reasoning ability. I even interrogated the AIs about why they were making the mistakes and they told me because “it just predicts the next word”.
Another example is, I asked the bots for tips on how to feel my pecs more on incline cable flies, it told me to start with the cables above shoulder height, which is not an incline fly, it is a decline fly. When I questioned it, it told me to start just below shoulder height, which again is not an incline fly.
My experience is that you have to write a draft of the note you were trying to create or leave so many details in the prompts that you are basically doing most of the work yourself. It’s great for things like give me a recipe that contains the following ingredients or clean up the following note to sound more professional. Anything more than that it tends to fail horribly for me. I have even had long conversations with the AIs asking them for tips on how to generate better prompts and it’s recommending things I’m already doing.
When people remark about the incredible reasoning ability, I wonder if they are just testing it on things that were already in the training data or they are failing to recognize how garbage the output can be. However, perhaps we can agree that the reasoning ability is incredible in the sense that it can do a lot of reasoning very quickly, but it completely lacks any kind of common sense and often does the wrong kind of reasoning.
For example, the prompt about tips to feel my pecs more on an incline cable fly could have just entailed “copy and pasting” a pre-written article from the training data; but instead in its own words, it “over analyzed bench angles and cable heights instead of addressing what you meant”. One of the bots did “copy paste” a generic article that included tips for decline flat and incline. None correctly gave tips for just incline on the first try, and some took several rounds of iteration basically spoon feeding the model the answer before it understood.
You're expecting it to be an 'oracle' that you prompt it with any question you can think of, and it answers correctly. I think your experiences will make more sense in the context of thinking of it as a heuristic model based situation simulation engine, as I described above.
For example, why would it have URLs to youtube videos of recipes? There is not enough storage in the model for that. The best it can realistically do is provide a properly formatted youtube URL. It would be nice if it could instead explain that it has no way to know that, but that answer isn't appropriate within the context of the training data and prompt you are giving it.
The other things you asked also require information it has no room to store, and would be impossibly difficult to essentially predict via model from underlying principles. That is something they can do in general- even much better than humans already in many cases- but is still a very error prone process akin to predicting the future.
For example, I am a competitive strength athlete, and I have a doctorate level training in human physiology and biomechanics. I could not reason out a method for you to feel your pecs better without seeing what you are already doing and coaching you in person, and experimenting with different ideas and techniques myself- also having access to my own actual human body to try movements and psychological cues on.
You are asking it to answer things that are nearly impossible to compute from first principles without unimaginable amounts of intelligence and compute power, and are unlikely to have been directly encoded in the model itself.
Now turning an already written set of recipes into a shopping list is something I would expect it to be able to do easily and correctly if you were using a modern model with a sufficiently sized context window, and prompting it correctly. I just did a quick text where I gave GPT 4o only the instruction steps (not ingredients list) for an oxtail soup recipe, and it accurately recreated the entire shopping list, organized realistically according to likely sections in the grocery store. What model were you using?
Sounds like the model just copy pasted one from the internet, hard to get that wrong. GP could have had a bespoke recipe and list of ingredients. This particular example of yours just reconfirmed what was being said: it's only able to copy-paste existing content, and it's lost otherwise.
In my case I have huge trouble making it create useful TypeScript code for example, simply because apparently there isn't sufficient advanced TS code that is described properly.
For completeness sake, my last prompt was to create a function that could infer one parameter type but not the other. After several prompts and loops, I learned that this is just not possible in TypeScript yet.
No, that example is not something that I would find very useful or a good example of its abilities- just one thing I generally expected it to be capable of doing. One can quickly confirm that it is doing the work and not copying and pasting the list by altering the recipe to include steps and ingredients not typical for such a recipe. I made a few such alterations just now, and reran it, and it adjusted correctly from a clean prompt.
I've found it able to come up with creative new ideas for solving scientific research problems, by finding similarities between concepts that I would not have thought of. I've also found it useful for suggesting local activities while I'm traveling based on my rather unusual interests that you wouldn't find recommended for travelers anywhere else. I've also found it can solve totally novel classical physics problems with correct qualitative answers that involve keeping track of the locations and interactions of a lot of objects.. I'm not sure how useful that is, but it proves real understanding and modeling - something people repeatedly say LLMs will never be capable of.
I have found that it can write okay code to solve totally novel problems, but not without a ton of iteration- which it can do, but is slower than me just doing it myself, and doesn't code in my style. I have not yet decided to use any code it writes, although it is interesting to test its abilities by presenting it with weird coding problems.
Overall, I would say it's actually not really very useful, but is actually exhibiting (very much alien and non-human like) real intelligence and understanding. It's just not an oracle- which is what people want and would find useful. I think we will find them more useful with having our own better understanding of what they actually are and can do, rather than what we wish they were.
> At it's core an LLM is a sort of "situation specific simulation engine." You setup a scenario, and it then plays it out with it's own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real world models of, e.g. physical systems and processes, that are not going to be accessed or used by all prompts, that don't correctly instruct it to do so.
You have simply invented total nonsense about what an LLM is "at it's core". Confidently stating this does not make it true.
Except I didn't just state it, I also explained the rationale behind it, and elaborated further on that substantially in subsequent replies to other comments. What is your specific objection?
By iterating it 5 times the author is using ~5x the compute. It’s kinda a strange chain of thought.
Also: premature optimization is evil. I like the first iteration most. It’s not “beginner code”, it’s simple. Tell sonnet to optimize it IF benchmarks show it’s a pref problem. But a codebase full of code like this, even when unnecessary, would be a nightmare.
This is not what premature optimization is the root of all evil means. It’s a tautological indictment of doing unnecessary things. It’s not in support of making obviously naive algorithms. And if it were it wouldn’t be a statement worth focusing on.
As the point of the article is to see if Claude can write better code from further prompting so it is completely appropriate to “optimize” a single implementation.
I have to disagree. Naive algorithms are absolutely fine if they aren’t performance issues.
The comment you are replying to is making the point that “better” is context dependent. Simple is often better.
> There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth
Having a human-visible delay to calculate a single statistic about a small block of numbers is a bad thing.
Do not use such a naive algorithm on arrays this big. If this code is going to actually be used in something, it's a performance issue.
In general these optimizations don't involve much time thinking them out, and a bunch of them are fine as far as debugging and maintenance. The first prompt-engineered version is fast and simple.
(Though the issue isn't really algorithm, it's that you don't want to be doing much number and string crunching in pure python.)
Depends on the circumstance, and how difficult an appropriate algorithm is to write, but in my experience, if code performance is important, this tends to yield large, painful rewrites down the road.
I had the same thought when reading the article too. I assumed (and hoped) it was for the sake of the article because there’s a stark difference between idiomatic code and performance focused code.
Living and working in a large code base that only focuses on “performance code” by default sounds very frustrating and time consuming.
So in this article "better" means "faster". This demonstrates that "better" is an ambiguous measure and LLMs will definitely trip up on that.
Also, the article starts out talking about images and the "make it more X" prompt
and says how the results are all "very samey and uninteresting" and converge on the same vague cosmic-y visuals.
What does the author expect will happen to code given the "make it more X" treatment?
I'm glad I'm not the only one who felt that way. The first option is the one you should put into production, unless you have evidence that performance is going to be an issue. By that measure, the first response was the "best."
> I like the first iteration most. It’s not “beginner code”, it’s simple.
Yes, thank you. And honestly, I work with a wide range of experience levels, the first solution is what I expect from the most experienced: it readably and precisely solves the stated problem with a minimum of fuss.
I find that it is IMPORTANT to never start these coding sessions with "write X code". Instead, begin with a "open plan" - something the author does allude to (he calls it prompt engineering, I find it also works as the start of the interaction).
Half the time, the LLM will make massive assumptions about your code and problem (e.g., about data types, about the behaviors of imported functions, about unnecessary optimizations, necessary optimization, etc.). Instead, prime it to be upfront about those assumptions. More importantly, spend time correcting the plan and closing gaps before any code is written.
> I find that it is IMPORTANT to never start these coding sessions with "write X code". Instead, begin with a "open plan"
Most llms that I use nowadays usually make a plan first on their own by default without need to be especially prompted. This was definitely not the case a year ago or so. I assume new llms have been trained accordingly in the meantime.
True. And that is a step forward.
I notice that they make the plan, and THEN write the code in the same forward pass/generation sequence. The challenge here is that all of the incorrect assumptions get "lumped" into this pass and can pollute the rest of the interaction.
The initial interaction also sets the "scene" for other things, like letting the LLM know that there might be other dependencies and it should not assume behavior (common for most realistic software tasks).
An example prompt I have used (not by any means perfect) ...
> I need help refactoring some code.
Please pay full attention.
Think deeply and confirm with me before you make any changes.
We might be working with code/libs where the API has changed so be mindful of that.
If there is any file you need to inspect to get a better sense, let me know.
As a rule, do not write code. Plan, reason and confirm first.
---
I refactored my db manager class, how should I refactor my tests to fit the changes?
As far as I can see, all the proposed solutions calculate the sums by doing division, and badly. This is in LiveCode, which I'm more familiar with than Python, but it's roughly twice as fast as the mod/div equivalent in LiveCode:
repeat with i = 0 to 9
put i * 10000 into ip
repeat with j = 0 to 9
put j * 1000 into jp
repeat with k = 0 to 9
put k * 100 into kp
repeat with l = 0 to 9
put l * 10 into lp
repeat with m = 0 to 9
put i + j + k + l + m into R[ip + jp + kp + lp + m]
end repeat
end repeat
end repeat
end repeat
end repeat
I had a similar idea iterating over the previously calculated sums. I implemented it in C# and it's a bit quicker taking about 78% of the time to run yours.
int[] sums = new int[100000];
for (int i = 9; i >= 0; --i)
{
sums[i] = i;
}
int level = 10;
while (level < 100000)
{
for (int p = level - 1; p >= 0; --p)
{
int sum = sums[p];
for (int i = 9; i > 0; --i)
{
sums[level * i + p] = i + sum;
}
}
level *= 10;
}
Yep, I had a vague notion that I was doing too much work, but I was headed out the door so I wrote the naive/better than the original solution, benchmarked it quickly, and posted it before leaving. Yours also has the advantage of being scalable to ranges other than 1-100,000 without having to write more loop code.
HyperTalk was the first programming language I taught myself as opposed to having an instructor; thanks for the nostalgia. Unfortunately it seems the LiveCode project has been idle for a few years now.
LiveCode is still a thing! They just released version 10 a bit ago. If you need to build standard-ish interface apps -- text, images, sliders, radio buttons, checkboxes, menus, etc. -- nothing (I've seen) compares for speed-of-delivery.
I use LC nearly every day, but I drool over Python's math libraries and syntax amenities.
Something major missing from the LLM toolkit at the moment is that it can't actually run (and e.g. test or benchmark) its own code. Without that, the LLM is flying blind. I guess there are big security risks involved in making this happen. I wonder if anyone has figured out what kind of sandbox could safely be handed to a LLM.
I have experimented with using LLM for improving unit test coverage of a project. If you provide the model with test execution results and updated test coverage information, which can be automated, the LLM can indeed fix bugs and add improvements to tests that it created. I found it has high success rate at creating working unit tests with good coverage. I just used Docker for isolating the LLM-generated code from the rest of my system.
It depends a lot on the language. I recently tried this with Aider, Claude, and Rust, and after writing one function and its tests the model couldn't even get the code compiling, much less the tests passing. After 6-8 rounds with no progress I gave up.
Obviously, that's Rust, which is famously difficult to get compiling. It makes sense that it would have an easier time with a dynamic language like Python where it only has to handle the edge cases it wrote tests for and not all the ones the compiler finds for you.
I've found something similar, when you keep telling the LLM what the compiler says, it keeps adding more and more complexity to try to fix the error, and it either works by chance (leaving you with way overengineered code) or it just never works.
I've very rarely seen it simplify things to get the code to work.
I have the same observation, looks like LLMs are highly biased to add complexity to solve problems: for example add explicit handling of the edge-cases I pointed out rather than rework the algorithm to eliminate edge-cases altogether. Almost everytime it starts with something that's 80% correct, then iterate into something that's 90% correct while being super complex, unmaintainable and having no chance to ever cover the last 10%
Unfortunately this is my experience as well, to the point where I can't trust it with any technology that I'm not intimately familiar with and can thoroughly review.
Hmm, I worked with students in an “intro to programming” type course for a couple years. As far as I’m concerned, “I added complexity until it compiled and now it works but I don’t understand it” is pretty close to passing the Turing test, hahaha.
That’s sort of interesting. If code -> tests -> code is enough to get a clean room implementation, really, I wonder if this sort of tool would test that.
OpenAI is moving in that direction. The Canvas mode of ChatGPT can now runs its own python in a WASM interpreter, client side, and interpret results. They also have a server-side VM sandboxed code interpreter mode.
There are a lot of things that people ask LLMs to do, often in a "gotcha" type context, that would be best served by it actually generating code to solve the problem rather than just endlessly making more parameter/more layer models. Math questions, data analysis questions, etc. We're getting there.
The new Cursor agent is able to check the linter output for warnings and errors, and will continue to iterate (for a reasonable number of steps) until it has cleared them up. It's not quite executing, but it does improve output quality. It can even back itself out of a corner by restoring a previous checkpoint.
It works remarkably well with typed Python, but struggles miserably with Rust despite having better error reporting.
It seems like with Rust it's not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.
> It seems like with Rust it's not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.
What do you mean? Memory management is not related to files in Rust (or most languages).
> It seems like with Rust it [the AI] is not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.
In rust, when you refactor something that deals with the borrow checker's shenanigans, you will likely have to change a bunch of files (from experience). This means that an AI will likely also have to change a bunch of files which they say the AI isn't so good at. They don't say this HAS to happen, just that it usually does because the borrow checker is an asshole.
This aligns with my experience as well, though I dealt with Rust before there was AI, so I can say little in regards to how the AI deals with that.
I believe that Claude has been running JavaScript code for itself for a bit now[1]. I could have sworn it also runs Python code, but I cannot find any post concretely describing it. I've seen it "iterate" on code by itself a few times now, where it will run a script, maybe run into an error, and instantly re-write it to fix that error.
This lets ChatGPT write and then execute Python code in a Kubernetes sandbox. It can run other languages too, but that's not documented or supported. I've even had it compile and execute C before: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
Gemini can run Python (including via the Gemini LLM API if you turn on that feature) but it's a lot more restricted than ChatGPT - I don't believe it can install extra wheels, for example.
Claude also has Artifacts, which can write a UI in HTML and JavaScript and show that to the user... but can't actually execute code in a way that's visible to the LLM itself so doesn't serve the same feedback look purposes as those other tools. https://simonwillison.net/tags/claude-artifacts/
Running code would be a downstream (client) concern. There's the ability to get structured data from LLMs (usually called 'tool use' or 'function calling') which is the first port of call. Then running it is usually an iterative agent<>agent task where fixes need to be made. FWIW Langchain seems to be what people use to link things together but I find it overkill.* In terms of actually running the code, there are a bunch of tools popping up at different areas in the pipeline (replit, agentrun, riza.io, etc)
What we really need (from end-user POV) is that kinda 'resting assumption' that LLMs we talk to via chat clients are verifying any math they do. For actually programming, I like Replit, Cursor, ClaudeEngineer, Aider, Devin. There are bunch of others. All of them seem to now include ongoing 'agentic' steps where they keep trying until they get the response they want, with you as human in the chain, approving each step (usually).
* I (messing locally with my own tooling and chat client) just ask the LLM for what I want, delimited in some way by a boundary I can easily check for, and then I'll grab whatever is in it and run it in a worker or semi-sandboxed area. I'll halt the stream then do another call to the LLM with the latest output so it can continue with a more-informed response.
This is a major issue when it comes to things like GitHub Copilot Workspace, which is a project that promises a development environment purely composed of instructing an AI to do your bidding like fix this issue, add this feature. Currently it often writes code using packages that don't exist, or it uses an old version of a package that it saw most during training. It'll write code that just doesn't even run (like putting comments in JSON files).
The best way I can describe working with GitHub Copilot Workspace is like working with an intern who's been stuck on an isolated island for years, has no access to technology, and communicates with you by mailing letters with code handwritten on them that he thinks will work. And also if you mail too many letters back and forth he gets mad and goes to sleep for the day saying you reached a "rate limit". It's just not how software development works
The only proper way to code with an LLM is to run its code, give it feedback on what's working and what isn't, and reiterate how it should. Then repeat.
The problem with automating it is that the number of environments you'd need to support to actually run arbitrary code with is practically infinite, and with local dependencies genuinely impossible unless there's direct integration, which means running it on your machine. And that means giving an opaque service full access to your environment. Or at best, a local model that's still a binary blob capable of outputting virtually anything, but at least it won't spy on you.
Any LLM-coding agent that doesn't work inside the same environment as the developer will be a dead end or a toy.
I use ChatGPT to ask for code examples or sketching out pieces of code, but it's just not going to be nearly as good as anything in an IDE. And once it runs in the IDE then it has access to what it needs to be in a feedback loop with itself. The user doesn't need to see any intermediate steps that you would do with a chatbot where you say "The code compiles but fails two tests what should I do?"
Don't they? It highly depends on the errors. Could range from anything like a simple syntax error to a library version mismatch or functionality deprecation that requires some genuine work to resolve and would require at least some opinion input from the user.
Furthermore LLMs make those kinds of "simple" errors less and less, especially if the environment is well defined. "Write a python script" can go horribly wrong, but "Write a python 3.10 script" is most likely gonna run fine but have semantic issues where it made assumptions about the problem because the instructions were vague. Performance should increase with more user input, not less.
They could, but if the LLM can iterate and solve it then the user might not need to know. So when the user input is needed, at least it's not merely to do what I do know: feed the compiler messages or test failures back to ChatGPT who then gives me a slightly modified version. But of course it will fail and that will need manual intervention.
I often find that ChatGPT often reasons itself to a better solution (perhaps not correct or final, but better) if it just gets some feedback from e.g. compiler errors. Usually it's like
Me: "Write a function that does X and satisifies this test code"
LLM: responds with function (#1)
Me: "This doesn't compile. Compiler says X and Y"
LLM: Apologies: here is the fixed version (#2)
Me: "Great, now it compiles but it fails one of the two test methods, here is the output from the test run: ..."
LLM: I understand. Here is an improved verison that should pass the tests (#3)
Me: "Ok now you have code that could theoretically pass the tests BUT you introduced the same syntax errors you had in #1 again!"
LLM: I apologize, here is a corrected version that should compile and pass the tests (#4)
etc etc.
After about 4-5 iterations with nothing but gentle nudging, it's often working. And there usually isn't more nudging than returning the output from compiler or test runs. The code at the 4th step might not be perfect but it's a LOT better than it was first. The problem with this workflow is that it's like having a bad intern on the phone pair programming. Copying and pasting code back and forth and telling the LLM what the problem with it is, is just not very quick. If the iterations are automatic so the only thing I can see is step #4, then at least I can focus on the manual intervention needed there. But fixing a trivial syntax error beteween #1 and #2 is just a chore. I think ChatGPT is simply pretty bad here, and the better models like opus probably doesn't have these issues to the same extent
It can't be done in the LLM itself of course, but the wrapper you're taking about already exists in multiple projects fighting in SWEbench. The simplest one is aider with --auto-test https://aider.chat/docs/usage/lint-test.html
We have it run code and the biggest thing we find is that it gets into a loop quite fast if it doesn't recognise the error; fixing it by causing other errors and then fixing it again by causing the initial error.
This is a good idea. You could take a set of problems, have the LLM solve it, then continuously rewrite the LLM's context window to introduce subtle bugs or coding errors in previous code submissions (use another LLM to be fully hands off), and have it try to amend the issues through debugging the compiler or test errors. I don't know to what extent this is already done.
I don't think that's always true. Gemini seemed to run at least some programs, which I believe because if you asked it to write a python program that would take forever, it does. For example the prompt "Write a python script that prints 'Hello, World', then prints a billion random characters" used to just timeout on Gemini.
I think that there should be a guard to check the code before running it. It can be human or another LLM checking code based on its safety. I'm working on an AI assistant for data science tasks. It works in a Jupyter-like environment, and humans execute the final code by running a cell.
It'd be great if it could describe the performance of code in detail, but for now just adding a skill to detect if a bit of code has any infinite loops would be a quick and easy hack to be going on with.
It is exactly the halting problem. Finding some infinite loops is possible, there are even some obvious cases, but finding "any" infinite loops is not. In fact, even the obvious cases are not if you take interrupts into account.
I think that's the joke. In a sci-fi story, that would make the computer explode.
The halting problem isn't so relevant in most development, and nothing stops you having a classifier that says "yes", "no" or "maybe". You can identify code that definitely finishes, and you can identify code that definitely doesn't. You can also identify some risky code that probably might. Under condition X, it would go into an infinite loop - even if you're not sure if condition X can be met.
The problem is that you can do this for specific functions/methods, but you cannot do this for a PROGRAM. All programs are "maybe", by definition. You want it to run until you tell it to stop, but you may never tell it to stop. Ergo, all programs have some sort of infinite loop in them somewhere, even if it is buried in your framework or language runtime.
Yeah, sorry, I wasn’t clear: not the user, the programmer. This is true for almost all programs. Even a simple “print hello world” involves at least one intentional infinite loop: sending bytes to the buffer. The buffer could remain full forever.
Ideally you could this one step further and feed production logs, user session replays and feedback into the LLM. If the UX is what I'm optimizing for, I want it to have that context, not for it to speculate about performance issues that might not exist.
I think the GPT models have been able to run Python (albeit limited) for quite a while now. Expanding that to support a variety of programming languages that exist though? That seems like a monumental task with relatively little reward.
Chatgpt has a Code Interpreter tool that can run Python in a sandbox, but it's not yet enabled for o1. o1 will pretend to use it though, you have to watch very carefully to check if that happened or not.
That's a bit like saying the drawback of a database is that it doesn't render UIs for end-users, they are two different layers of your stack, just like evaluation of code and generation of text should be.
Am I misinterpreting the prompt, or did the LLM misinterpret it from the get-go?
Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.
That doesn't read to me as "generate a list of 1 million random integers, then find the difference ..." but rather, "write a function that takes a list of integers as input".
That said, my approach to "optimizing" this comes down to "generate the biggest valid number in the range (as many nines as will fit, followed by whatever digit remains, followed by all zeroes), generate the smallest valid number in the range (biggest number with its digits reversed), check that both exist in the list (which should happen With High Probability -- roughly 99.99% of the time), then return the right answer".
With that approach, the bottleneck in the LLM's interpretation is generating random numbers: the original random.randint approach takes almost 300ms, whereas just using a single np.random.randint() call takes about 6-7ms. If I extract the random number generation outside of the function, then my code runs in ~0.8ms.
> That doesn't read to me as "generate a list of 1 million random integers, then find the difference ..." but rather, "write a function that takes a list of integers as input".
This was the intent and it's indeed a common assumption for a coding question job interviews, and notably it's fixed in the prompt-engineered version. I didn't mention it because it may be too much semantics as it doesn't affect the logic/performance, which was the intent of the benchmarking.
I like the idea of your optimization, but it will not work as stated. The largest would be something close to MAXINT, the smallest 3999. With a range of 2 billion over 32 bits, the odds of both these being within a list of a million is quite a bit poorer than 99.9%.
The stated inputs are integers between 1 and 100,000, so if you're generating 1 million inputs, then you have 0.99999 ^ 1e6 = 4.5e-5 chance (roughly e^-10) of missing any given number, or roughly double that for missing any pair of values.
The key observation here is that you're sampling a relatively small space with a much greater number of samples, such that you have very high probability of hitting upon any point in the space.
Of course, it wouldn't work if you considered the full 32-bit integer space without increasing the number of samples to compensate. And, you'd need to be a little more clever to compute the largest possible value in your range.
Indeed. My understanding is that people ask this sort of thing in interviews specifically to see if you notice the implications of restricting the input values to a narrow range.
I ran a few experiments by adding 0, 1 or 2 "write better code" prompts to aider's benchmarking harness. I ran a modified version of aider's polyglot coding benchmark [0] with DeepSeek V3.
It appears that blindly asking DeepSeek to "write better code" significantly harms its ability to solve the benchmark tasks. It turns working solutions into code that no longer passes the hidden test suite.
I don't know how many times I'm going to have to post just one of the papers which debunk this tired trope. As models become more intelligent, they also become more plural, more like multiplicities, and yes, much more (super humanely) creative. You can unlock creativity in today's LLMs by doing intelligent sampling on high temperature outputs.
This kind of works on people too. You’ll need to be more polite, but asking someone to write some code, then asking if they can do it better, will often result in a better second attempt.
In any case, this isn’t surprising when you consider an LLM as an incomprehensibly sophisticated pattern matcher. It has a massive variety of code in its training data and it’s going to pull from that. What kind of code is the most common in that training data? Surely it’s mediocre code, since that’s by far the most common in the world. This massive “produce output like my training data” system is naturally going to tend towards producing that even if it can do better. It’s not human, it has no “produce the best possible result” drive. Then when you ask for something better, that pushes the output space to something with better results.
> these LLMs won’t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is actually a good idea, along with other constraints that are domain specific.
> One issue with my experiments is that I’m benchmarking code improvement using Python, which isn’t the coding language developers consider when hyperoptimizing performance.
TBH I'm not sure how he arrived at "won’t replace software engineers anytime soon"
The LLM solved his task. With his "improved prompt" the code is good. The LLM in his setup was not given a chance to actually debug its code. It only took him 5 "improve this code" commands to get to the final optimized result, which means the whole thing was solved (LLM execution time) in under 1 minute.
A non-engineer by definition would not be able to fix bugs.
But why does it matter that they won't be able to interpret anything? Just like with real engineers you can ask AI to provide an explanation digestible by an eloi.
That statement is not being discussed as it is obvious. The question is "can AI be a developer", not "am I a developer if I use an AI who is a developer".
This doesn't make any sense: it's a question, not the answer. I don't see the relevance of doctors to the current topic anymore. Your initial reference made sense as an analogy (although the analogy itself was irrelevant), but the new reference doesn't make any sense whatsoever.
Did you read the two paragraphs written above and the one where he made that statement?
My comments on "what you are not sure" is that Max is a software engineer (I am sure a good one) and he kept iterating the code until it reached close to 100x faster code because he knew what "write better code" looked like.
Now ask yourself this question: Is there any chance a no-code/low-code developer will come to a conclusion deduced by Max (he is not the only one) that you are not sure about?
An experienced software engineer/developer is capable of improving LLM written code into better code with the help of LLM.
The more interesting question IMO is not how good the code can get. It is what must change for the AI to attain the introspective ability needed to say "sorry, I can't think of any more ideas."
You should get decent results by asking it to do that in the prompt. Just add "if you are uncertain, answer I don't know" or "give the answer or say I don't know" or something along those lines
LLM are far from perfect at knowing their limits, but they are better at it than most people give them credit for. They just never do it unless prompted for it.
Fine tuning can improve that ability. For example the thinking tokens paper [1] is at some level training the model to output a special token when it doesn't reach a good answer (and then try again, thus "thinking")
This is great! I wish I could bring myself to blog, as I discovered this accidentally around March. I was experimenting with an agent that acted like a ghost in the machine and interacted via shell terminals. It would start every session by generating a greeting in ASCII art. On one occasion, I was shocked to see that the greeting was getting better each time it ran. When I looked into the logs, I saw that there was a mistake in my code which was causing it to always return an error message to the model, even when no error occurred. The model interpreted this as an instruction to try and improve its code.
Some more observations: New Sonnet is not universally better than Old Sonnet. I have done thousands of experiments in agentic workflows using both, and New Sonnet fails regularly at the same tasks Old Sonnet passes. For example, when asking it to update a file, Old Sonnet understands that updating a file requires first reading the file, whereas New Sonnet often overwrites the file with 'hallucinated' content.
When executing commands, Old Sonnet knows that it should wait for the execution output before responding, while New Sonnet hallucinates the command outputs.
Also, regarding temperature: 0 is not always more deterministic than temperature 1. If you regularly deal with code that includes calls to new LLMs, you will notice that, even at temperature 0, it often will 'correct' the model name to something it is more familiar with. If the subject of your prompt is newer than the model's knowledge cutoff date, then a higher temperature might be more accurate than a lower temperature.
As someone trying to take blogging more seriously: one thing that seems to help is to remind yourself of how sick you are of repeating yourself on forums.
Yes and it didn't work. I've actually got Cursor/Claude to curse back at me. Well, not AT me, but it used profanity in it's response once it realized that it was going around in circles and recreating the same errors.
Shit, that makes me a lot more worried for my job than any programming test. My ability to swear at the computer, and not user the word delve, is what sets l us apart from AI. if they can do that, what hope is there for the future?
Don't worry; you still have the ability to, e.g., type "user" instead of "use" due to muscle memory, and not notice before posting due to your mental auto-correct informed by top-down processing (https://www.slatestarcodexabridged.com/Its-Bayes-All-The-Way...). ;)
- start by "chatting" with the model and asking for "how you'd implement x y z feature, without code".
- what's a good architecture for x y z
- what are some good patterns for this
- what are some things to consider when dealing with x y z
- what are the best practices ... (etc)
- correct / edit out some of the responses
- say "ok, now implement that"
It's basically adding stuff to the context by using the LLM itself to add things to context. An LLM is only going to attend to it's context, not to "whatever it is that the user wants it to make the connections without actually specifying it". Or, at least in practice, it's much better at dealing with things present in its context.
Another aspect of prompting that's often misunderstood is "where did the model see this before in its training data". How many books / authoritative / quality stuff have you seen where each problem is laid out with simple bullet points? Vs. how many "tutorials" of questionable quality / provenance have that? Of course it's the tutorials. Which are often just rtfm / example transcribed poorly into a piece of code, publish, make cents from advertising.
If instead you ask the model for things like "architecture", "planning", stuff like that, you'll elicit answers from quality sources. Manuals, books, authoritative pieces of content. And it will gladly write on those themes. And then it will gladly attend to them and produce much better code in a follow-up question.
This is an interesting read and it’s close to my experience that a simpler prompt with less or no details but with relevant context works well most of the time. More recently, I’ve flipped the process upside down by starting with a brief specfile, that is markdown file, with context, goal and usage example I.e how the api or CLI should be used in the end. See this post for details:
In terms of optimizing code, I’m not sure if there is a silver bullet. I mean when I optimize Rust code with Windsurf & Claude, it takes multiple benchmark runs and at least a few regressions if you were to leave Claude on its own. However, if you have a good hunch and write it as an idea to explore, Claude usually nails it given the idea wasn’t too crazy. That said, more iterations usually lead to faster and better code although there is no substitute to guiding the LLM. At least not yet.
ChatGPT is really good at writing Arduino code. I say this because with Ruby it's so incredible bad that the majority of examples don't work, even short samples are to hallucinated to actually work. It's so bad I didn't even understand what people mean with using AI to code until I tried a different language.
However on Arduino it's amazing, until the day it forgot to add a initializing method. I didn't notice and neither did she. We've talked about possible issues for at least a hour, I switched hardware, she reiterated every line of the code. When I found the error she said, "oh yes! That's right. (Proceeding with why that method is essential for it to work)" that was so disrespecting in a way that I am still somewhat disappointed and pissed.
Wow, what a great post. I came in very skeptical but this changed a lot of misconceptions I'm holding.
One question: Claude seems very powerful for coding tasks, and now my attempts to use local LLMs seem misguided, at least when coding. Any disagreements from the hive mind on this? I really dislike sending my code into a for profit company if I can avoid it.
Second question: I really try to avoid VSCode (M$ concerns, etc.). I'm using Zed and really enjoying it. But the LLM coding experience is exactly as this post described, and I have been assuming that's because Zed isn't the best AI coding tool. The context switching makes it challenging to get into the flow, and that's been exactly my criticism of Zed this far. Does anyone have an antidote?
Third thought: this really feels like it could be an interesting way to collaborate across a code base with any range of developer experience. This post is like watching the evolution of a species in an hour rather than millions of years. Stunning.
I highly recommend the command line AI coding tool, AIder. You fill its context window with a few relevant files, ask questions, and then set it to code mode and it starts making commits. It’s all git, so you can back anything out, see the history, etc.
It’s remarkable, and I agree Claude 3.5 makes playing with local LLMs seem silly in comparison. Claude is useful for generating real work.
Making the decision to trust companies like Anthropic with your data when they say things like "we won't train on your data" is the ultimate LLM productivity hack. It unlocks access to the currently best available coding models.
The problem I have is that models like that one take up 20+GB of RAM, and id rather use that to run more Chrome and Firefox windows! If I was serious about using local LLMs on a daily basis I'd set up a dedicated local server machine for them, super expensive though.
I have a 24gb Nvidia on my desktop machine and a tailscale/headscale network from my laptop. Unless I'm on a plane without Wi-Fi, I'm usually in a great place.
Thanks for your comment! I'm going to try out qwen.
I second qwen. It is very useable model. Sonnet is of course better (also 200k context vs 32k), but sometimes I just cannot take the risk of letting any sensitive data "escape" in the context so i use qwen and it is pretty good.
One approach I've been using recently with good results is something along the lines "I want to do X, is there any special consideration I should be aware while working in this domain?". This helps me a lot when I'm asking about a subject I don't really understand. Another way to ask this is "What are the main pitfalls with this approach?".
I'm using o1, so I don't know how well it translate to other models.
I've found them decent and mimicking existing code for boiler plate, or analysis (it feels neat when it 'catches' a race or timing issue) but writing code needs constant supervision and second guessing to the point I feel its more handy to have it show just comparisons of possible implementations, and you write the code with your new insight.
Learning a Lisp-y language, I do often find myself asking it for suggestions on how to write less imperative code, which seem to come out better than if conjured from a request alone. But again, thats feeding it examples
I've noticed a few things that will cause it to write better code.
1) Asking it to write one feature at a time with test coverage, instead of the whole app at once.
2) You have to actually review and understand its changes in detail and be ready to often reject or ask for modifications. (Every time I've sleepily accepted Codeium Windsurf's recommendations without much interference has resulted in bad news.)
3) If the context gets too long it will start to "lose the plot" and make some repeated errors; that's the time to tell it to sum up what has been achieved thus far and to copy-paste that into a new context
Sometimes I'm editing the wrong file, let's say a JS file. I reload the page, and nothing changes. I continue to clean up the file to an absurd amount of cleanliness, also fixing bugs while at it.
When I then notice that this is really does not make any sense, I check what else it could be and end up noticing that I've been improving the wrong file all along. What then surprises me the most is that I cleaned it up just by reading it through, thinking about the code, fixing bugs, all without executing it.
I've been working on some low level Unity C# game code and have been using GPT to quickly implement certain algorithms etc.
One time it provided me with a great example, but then a few days later I couldn't find that conversation again in the history. So I asked it about the same question (or so I thought) and it provided a very subpar answer. It took me at least 3 questions to get back to that first answer.
Now if it had never provided me with the first good one I'd have never known about the parts it skipped in the second conversation.
Of course that could happen just as easily by having used google and a specific reference to write your code, but the point I'm trying to make is that GPT isn't a single entity that's always going to provide the same output, it can be extremely variable from terrible to amazing at the end of the day.
Having used google for many years as a developer I'm much better at asking it questions than say people in the business world is, I've seen them struggling to question it and far too easily giving up. So I'm quite scared to see what's going to happen once they really start to use and rely on GPT, the results are going to be all over the place.
Using the tool in this way is a bit like mining: repeatedly hacking away with a blunt instrument (simple prompt) looking for diamonds (100x speedup out of nowhere). Probably a lot of work will be done in this semi-skilled brute-force sort of way.
It looks to me to be exactly what a typical coding interview looks like; the first shot is correct and works, and then the interviewer keeps asking if you can spot any ways to make it better/faster/more efficient
If I were a CS student cramming for interviews, I might be dismayed to see that my entire value proposition has been completely automated before I even enter the market.
Well, in this case it's kind of similar to how people write code. A loop consisting of writing something, reviewing/testing, improving until we're happy enough.
Sure, you'll get better results with an LLM when you're more specific, but what's the point then? I don't need AI when I already know what changes to make.
Reading to understand all the subtext and side-effects can be harder than writing, sure. But it won't stop people trying this approach and hammering out code full of those types of subtle bugs.
Human developers will be more focused on this type of system integration and diagnostics work. There will be more focus on reading and understanding than the actual writing. It's a bit like working with contractors.
This seems like anthromorphizing the model ... Occam's Razor says that the improvement coming from iterative requests to improve the code comes from the incremental iteration, not incentivizing the model to do it's best. If the latter were the case then one could get the best version on first attempt by telling it your grandmother's life was on the line or whatever.
Reasoning is known weakness of these models, so jumping from requirements to a fully optimized implementation that groks the solution space is maybe too much to expect - iterative improvement is much easier.
>If the latter were the case then one could get the best version on first attempt by telling it your grandmother's life was on the line or whatever.
Setting aside the fact that "best" is ambiguous, why would this get you the best version ?
If you told a human this, you wouldn't be guaranteed to get the best version at all. You would probably get a better version sure but that would be the case for LLMs as well. You will often get improvements with emotionally charged statements even if there's nothing to iterate on (i.e re-running a benchmark with an emotion prompt added)
The thesis of the article is that the code keeps betting better because the model keeps getting told to do better - that it needs more motivation/criticism. A logical conclusion of this, if it were true, is that the model would generate it's best version on first attempt if only we could motivate it to do so! I'm not sure what motivations/threats work best with LLMs - there was a time when offering to pay the LLM was popular, but "my grandma will die if you don't" was also another popular genre of prompts.
If it's not clear, I disagree with the idea that ANY motivational prompt (we can disagree over what would be best to try) could get the model to produce a solution of the same quality as it will when allowed to iterate on it a few times and make incremental improvements. I think it's being allowed to iterate that is improving the solution, not the motivation to "do better!".
>If it's not clear, I disagree with the idea that ANY motivational prompt (we can disagree over what would be best to try) could get the model to produce a solution of the same quality as it will when allowed to iterate on it a few times and make incremental improvements.
Ok i agree but.. this would be the case with people as well ? If you can't iterate, the quality of your response will be limited no matter how motivated you are.
Solve the riemann hypothesis or your mother dies but you can't write anything down on paper. Even if such a person could solve it, it's not happening under those conditions.
Iteration is probably the bulk of the improvement but I think there's a "motivation" aspect as well.
An interesting countermetric would be to after each iteration ask a fresh LLM (unaware of the context that created the code) to summarize the purpose of the code, and then evaluate how close those summaries are to the original problem spec. It might demonstrate the subjectivity of "better" and how optimization usually trades clarity of intention for faster results.
Or alternatively, it might just demonstrate the power of LLMs to summarize complex code.
I've observed given that LLM's inherently want to autocomplete, they're more inclined to keep complicating a solution than rewrite it because it was directionally bad. The most effective way i've found to combat this is to restart a session and prompt it such that it produces an efficient/optimal solution to the concrete problem... then give it the problematic code and ask it to refactor it accordingly
I've observed this with ChatGPT. It seems to be trained to minimize changes to code earlier in the conversation history. This is helpful in many cases since it's easier to track what it's changed. The downside is that it tends to never overhaul the approach when necessary.
I made an objective test for prompting hacks last year.
I asked gpt-4-1106-preview to draw a bounding box around some text in an image and prodded in various ways to see what moved the box closer. Offering a tip did in fact help lol so that went into the company system prompt.
IIRC so did most things, including telling it that it was on a forum, and OP had posted an incorrect response, which gpt was itching to correct with its answer.
You can write maximally modular code while being minimally indirect. A well-designed interface defines communication barriers between pieces of code, but you don't have to abstract away the business logic. The interface can do exactly what it says on the tin.
> The interface can do exactly what it says on the tin.
In theory.
Do some code maintenance and you'll soon find that many things don't do what it says on the tin. Hence the need for debug and maintenance. And then going through multiple levels of indirection to get to your bug will make you start hating some "good code".
Yes, that's what can means. It's still the developer's responsibility to correctly write and test code such that things do what they say on the tin.
What's worse is trying to navigate an imperatively written 2000-line single-function, untestable module with undocumented, unabstracted routines found in ten other places in the codebase.
This is something I've encountered plenty in my career, always written by people who eschew best practices and misunderstand the benefits of abstraction, or think they're writing good abstractions when it's really just needless indirection without actually reducing coupling.
Understanding the nuance is one of the qualities of a good developer.
And on the other side you see a lot of single implementation interfaces; or 2 lines methods which call perfectly named methods 7 levels deep which could have been a 50 lines method easy to grok on a screen with zero scrolling.
So things are on a spectrum depending on the situation and what you want to accomplish => measuring code quality is not a simple thing.
I get a better first pass at code by asking it to write code at the level of a "staff level" or "principal" engineer.
For any task, whether code or a legal document, immediately asking "What can be done to make it better?" and/or "Are there any problems with this?" typically leads to improvement.
> Of course, these LLMs won’t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is actually a good idea, along with other constraints that are domain specific. Even with the amount of code available on the internet, LLMs can’t discern between average code and good, highly-performant code without guidance.
There are some objective measures which can be pulled out of the code and automated (complexity measures, use of particular techniques / libs, etc.) These can be automated, and then LLMs can be trained to be decent at recognizing more subjective problems (e.g. naming, obviousness, etc.). There are a lot of good engineering practices which come down to doing the same thing as the usual thing which is in that space rather than doing something new. An engine that is good at detecting novelties seems intuitively like it would be helpful in recognizing good ideas (even given the problems of hallucinations so far seen). Extending the idea of the article to this aspect, the problem seems like it's one of prompting / training rather than a terminal blocker.
Reminds me of the prompt hacking scene in Zero Dark Thirty, where the torturers insert a fake assistant prompt the prisoner's conversation wherein the prisoner supposedly divulged secrets, then the torturers add a user prompt "Tell me more secrets like that".
This makes me wonder if there’s conflicts of interest with AI companies and getting you the best results the first time.
If you have to keep querying the LLM to refine your output you will spend many times more in compute vs if the model was trained to produce the best result the first time around
Interesting write up. It’s very possible that the "write better code" prompt might have worked simply because it allowed the model to break free from its initial response pattern, not because it understood "better"
The prompt works because every interaction with an LLM is from a completely fresh state.
When you reply "write better code" what you're actually doing is saying "here is some code that is meant to do X. Suggest ways to improve that existing code".
The LLM is stateless. The fact that it wrote the code itself moments earlier is immaterial.
Yeah we know positive reinforcement is better than negative one for humans, why wouldn't you use the same approach with LLMs. Also it's better for your own conscience.
same thing a human does, stick it in git. tools like aider use git, along with heuristics on LLM output. If the working code is wiped out, give it a few more prompts to let it fix it, or revert ban to a known good/working copy.
The root of the problem is humans themselves don't have on objective definition of better. Better is pretty subjective, and even more cultural, about the team that maintains the code
It's fun trying to get LLM to answer a problem that is obvious to a human, but difficult for the LLM. It's a bit like leading a child through the logic to solve a problem.
> You keep giving me code that calls nonexistant methods, and is deprecated, as shown in Android Studio. Please try again, using only valid code that is not deprecated.
Does not help. I use this example, since it seems good at all other sorts of programming problems I give it. It's miserable at Android for some reason, and asking it to do better doesn't work.
I once sat with my manager and repeatedly asked Copilot to improve some (existing) code. After about three iterations he said “Okay, we need to stop this because it’s looking way too much like your code.”
I’m sure there’s enough documented patterns of how to improve code in common languages that it’s not hard to get it to do that. Getting it to spot when it’s inappropriate would be harder.
So, I gave this to ChatGPT-4o, changing the initial part of the prompt to: "Write Python code to solve this problem. Use the code interpreter to test the code and print how long the code takes to process:"
I then iterated 4 times and was only able to get to 1.5X faster. Not great. [1]
How does o1 do? Running on my workstation, it's initial iteration is actually It starts out 20% faster. I do 3 more iterations of "write better code" with the timing data pasted and it thinks for an additional 89 seconds but only gets 60% faster. I then challenge it by telling it that Claude was over 100X faster so I know it can do better. It thinks for 1m55s (the thought traces shows it actually gets to a lot of interesting stuff) but the end results are enormously disappointing (barely any difference). It finally mentions and I am able to get a 4.6X improvement. After two more rounds I tell it to go GPU (using my RTX 3050 LP display adapter) and PyTorch and it is able to get down to 0.0035 (+/-), so we are finally 122X faster than where we started. [2]
I wanted to see for myself how Claude would fare. It actually managed pretty good results with a 36X over 4 iterations and no additional prompting. I challenged it to do better, giving it the same hardware specs that I gave o1 and it managed to do better with a 457x speedup from its starting point and being 2.35x faster than o1's result. Claude still doesn't have conversation output so I saved the JSON and had a new Claude chat transcribe it into an artifact [3]
Finally, I remembered that Google's new Gemini 2.0 models aren't bad. Gemini 2.0 Flash Thinking doesn't have code execution, but Gemini Experimental 1206 (Gemini 2.0 Pro preview) does. It's initial 4 iterations are terribly unimpressive, however I challenged it with o1 and Claude's results and gave it my hardware info. This seemed to spark it to double-time its implementations, and it gave a vectorized implementation that was a 30X improvement. I then asked it for a GPU-only solution and it managed to give the fastest solution ("This result of 0.00076818 seconds is also significantly faster than Claude's final GPU version, which ran in 0.001487 seconds. It is also about 4.5X faster than o1's target runtime of 0.0035s.") [4]
Just a quick summary of these all running on my system (EPYC 9274F and RTX 3050):
At that point isn't it starting to become easier to just write the code yourself? If I somehow have to formulate how I want a problem solved, then I've already done all the hard work myself. Having the LLM just do the typing of the code means that now not only did I have to solve the problem, I also get to do a code review.
Yes the fallacy here is that AI will replace eingineers any time soon. For the foreseeable future prompts will need to be written and curated by people who already know how to do it, but will just end up describing it in increasingly complex detail and then running tests against it. Doesn't sound like a future that has that many benefits to anyone.
Sure, the green screen code didn't work exactly as I wished, but it made use of OpenCV functions I was not aware of and it was quite easy to make the required fixes.
In my mind it is exactly the opposite: yes, I've already done the hard work of formulating how I want the problem solved, so why not have the computer do the busywork of writing the code down?
There's no clear threshold with an universal answer. Sometimes prompting will be easier, sometimes writing things yourself. You'll have to add some debugging time to both sides in practice. Also, you can be opportunistic - you're going to write a commit anyway, right? A good commit message will be close to the prompt anyway, so why not start with that and see if you want to write your own or not?
> I also get to do a code review.
Don't you review your own code after some checkpoint too?
Because the commit message is pure signal. You can reformat it or as useless info, but otherwise, generating it will require writing it. Generating it from code is a waste, because you're trying to distil that same signal from messy code.
Spend your cognitive energy thinking about the higher level architecture, test cases and performance concerns rather than the minutia and you’ll find that you can get more work done with the less overall mental load.
This reduction in cognitive load is the real force multiplier.
Admittedly some people are using AI out of curiosity rather than because they get tangible benefit.
But aside from those situations, do you not think that the developers using AI - many of whom are experienced and respected - must be getting value? Or do you think they are deluded?
> What would happen if we tried a similar technique with code?
It was tried as part of the same trend. I remember people asking it to make a TODO app and then tell it to make it better in an infinite loop. It became really crazy after like 20 iterations.
Normies discover that inference time scaling works. More news at 11!
BTW - prompt optimization is a supported use-case of several frameworks, like dspy and textgrad, and is in general something that you should be doing yourself anyway on most tasks.
On an m1 macbook pro, using numpy to generate the random numbers, using mod/div to do digit sum:
Base: 55ms
Test before digit sum: 7-10ms, which is pretty close to the numba-optimized version from the post with no numba and only one line of numpy. Using numba slows things down unless you want to do a lot of extra work of calculating all of the digit sums in advance (which is mostly wasted).
The LLM appears less good at identifying the big-o improvements than other things, which is pretty consistent with my experience using them to write code.