Maybe I'm a dinosaur but I'm utterly mystified by folks who are interested in making the leaderboard or even solving AoC puzzles with AI assistance. To me it feels like going to a 5k charity run, taking a motorcycle to the finish line, and parading your gold medal around. Like, who are you impressing? And are you even proud of yourself?
I understand the appeal of using the best and fastest means when there are real-world stakes — taking a motorcycle is better than running for many transport scenarios. But the whole point of this is to have fun! Is it fun to hit <tab> over and over?
> To me it feels like going to a 5k charity run, taking a motorcycle to the finish line, and parading your gold medal around. Like, who are you impressing? And are you even proud of yourself?
Weirdly enough, this basically happens all the time at races [0]. Not a motorcycle, but a runner in my local competitive scene used a bicycle to fake her Strava data for a half marathon in 2017 [1].
Geez, the number of support requests I’ve sent to Strava over the years asking them to clean up their data. I don’t know about Android, but iOS can fairly reliably tell the difference between walking, running, bicycling, driving… They should be able to couple this with some better data science to filter out things like deliberate cheating, forgetting to set the right activity type or to stop tracking before you drive away, or glitches in e.g. the GPS radio that result in huge changes in location or altitude, pushing you to the top of leaderboards with impossible results.
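Even a crude filter over the raw trackpoints would catch most of it. Here's a sketch of the kind of check I mean, assuming points come in as (unix_time, lat, lon) tuples; the speed thresholds are made up:

    # Sketch only: flag activities whose GPS trace implies impossible speeds.
    # Assumes trackpoints as (unix_time, lat, lon); thresholds are invented.
    import math

    MAX_SPEED_MPS = {"walk": 4.0, "run": 12.0, "ride": 30.0}  # generous upper bounds

    def haversine_m(lat1, lon1, lat2, lon2):
        """Great-circle distance between two points, in metres."""
        r = 6371000.0
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlmb = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    def suspicious(points, sport="run"):
        """True if any segment is faster than is plausible for the claimed sport."""
        limit = MAX_SPEED_MPS[sport]
        for (t1, la1, lo1), (t2, la2, lo2) in zip(points, points[1:]):
            dt = t2 - t1
            if dt <= 0:
                return True  # clock glitch
            if haversine_m(la1, lo1, la2, lo2) / dt > limit:
                return True  # bike/car on a "run", or a GPS jump
        return False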
> I'm utterly mystified by folks who are interested in making the leaderboard or even solving AoC puzzles with AI assistance
This reminds me of the Project Euler position on people posting their solutions publicly. The assumption is others would seek them out, get credit they don’t deserve and not learn as much.
I understand trying to prevent skewing the results, but personally it doesn’t affect me since I’m not doing it to compete on the website; I’m doing it for self-edification.
I believe this actually helps me compete in the things that really matter: getting a job, bringing high-quality products to market, and generally solving real problems people have.
Are LLMs starting to solve some of that? Sure. So then I want to compete for the jobs that still can’t be solved by them and involve more complicated and interesting solutions.
If people want to handicap their problem solving abilities or compete for jobs supervising boilerplate generation, all the better for me. I’ll take a hopefully more rewarding career over making the leaderboard on AoC and PE. Though I do feel sorry for their maintainers.
> Is it fun to hit <tab> over and over?
It’s probably a mixture of entertainment (I’ve seen countless apps where it looks like people just tap the screen as fast as possible without even really watching what’s going on) and imagining they’re learning something by watching answers appear (although IMO the lack of engagement leads to lack of retention; sort of like Socrates’ position on reading vs discussion).
Last year, AI was only able to solve the first 3 or 4 problems, so it's really not like using AI takes all human ingenuity out of the equation. What it does is enable you to autocomplete sub-problems that are already solved problems so you can focus on the good stuff. I'd personally prefer if there were 2 leaderboards - one with AI, and one without. Motorcycle races are fun too :)
> mystified by folks who are interested in making the leaderboard or even solving AoC puzzles with AI assistance
I think it defeats the purpose of the AoC to use AI, but I can imagine for some people it's a different game to see how they can use a screw (gaming term) to circumvent the system.
On the other hand, there may be some value to learning how to apply AI tools as a solver. If test cases are available for a given problem space, and AI can be applied to build code to meet those "requirements", then it is probably a skill we should be learning.
While I don't believe people are going to play fair here, it would be nice if AoC submissions had an [AI-assist] checkbox. Then there could be two leaderboards, and the AI folks could still have some good competition with each other.
Ultimately, the AoC leaderboards are unfair as they favor people in timezones that fit the release of each "day". So being on the leaderboard isn't really something that many people can do, even if they are brilliant problem solvers.
People can tell if you use a motorbike though. More like performance-enhancing drugs... or designing an anthropomorphic running robot that completes the race for you.
Designing such a robot would be an achievement in its own right, but not really in the spirit of the competition.
If you designed and built the robot, it's one thing (still not right, but interesting).
If you just rented some time on someone else's robot, it's just plain cheating.
I would love to see teams compete, in a separate venue/ranking, based on their own model's ability to write code with minimal prompting.
It would be kinda cool to see what can be done in different 'classes' of model. How far can you get with a local model running on an 8 GB card, or just using an M2 Mac? How far can you get with a custom-tuned ChatGPT vs a programming-specialized model? Etc.
I wonder if, in future, there will be specific days and seasons to commemorate AI achievements. Like, if Wall-E was a real story would we have a holiday to honor him? Maybe even just observed by the machines / operators?
Because AoC has a leaderboard? Whenever you have a leaderboard of any kind, there will be people trying to get up there.
It's also likely that a lot of AI users will be skilled developers too. They already know they can solve the problems, they also think they deserve the leaderboard rankings, so using AI is ok.
I guess it's not really an apt analogy. It's more like if the challenge were to get from city A to city B in the fastest time by any means necessary. People would run it until the horse was domesticated, then ride horses until the motorcycle or car was invented, and so on.
That's not to say that you can't still do a foot race between A and B. It's just now you have to constrain it, it has become something different.
I tried solving last year's AoC with ChatGPT a few months ago, and while it did just straight up produce solutions for the first 2 or 3 questions with minimal work on my part, after that it required a lot of work to get it to a solution. It would have been faster to just write the code myself.
I use Copilot a lot and use ChatGPT a lot to help with coding problems and to do things like write short bash scripts for me, so I feel like I have a good grasp of what it's capable of doing, and the amount of work you'd have to do to break the problem into small enough chunks that it can understand and solve is similar to, if not more than, the amount of work you'd have to do to just solve it yourself. At least in its current state.
I play chess, and some players end up thinking that cheating allows them to reach their real level and that it's just bad luck or a conspiracy that prevents them from getting to that level without resorting to cheating. Talk about cognitive dissonance.
Or is it like a drag race where the first vehicle across the line is celebrated?
There are two views: (1) programming as a craft and an art, a human creation, and (2) programming as a hurdle to overcome to accomplish more important things. People with both views might wish to practice.
I agree with the sentiment, but I believe your motorcycle analogy is a bit flawed.
Bringing a motorcycle to a 5K is obviously cheating. The motorcycle is a factor of 10 faster than its closest competitor. That is not where LLM/AI is currently at. The code generated by AI is not 10 times better than human code. It is just human code cleverly regurgitated.
Comparing AI/LLM to the motorcycle is giving it too much credit. AI is more like a 500cc motorcycle vs a 300cc motorcycle.
Sure, they are pretty close, and the human might win once in a while, but the AI has just enough of an advantage that it is guaranteed to burn the human out and win 95+% of the time.
It's optionally anonymous. People don't have to publicize their names on AoC. With GitHub (and I assume the other identity providers) you get a choice of: anonymous #some number, your name (from GitHub), or your username (from GitHub). IIRC, the anonymous option is the default.
> To me it feels like going to a 5k charity run, taking a motorcycle to the finish line, and parading your gold medal around.
Almost. But the crucial difference is this: imagine an alternate universe where humanity had no horses or any other animals or devices, not even rickshaws or wheels, to help move around the world, and the first primitive cars were suddenly invented just a couple of years ago.
Wouldn't you feel enthusiastic to use those new devices and test them everywhere?
I think this depends entirely on the competition. Athletes "cheat" by taking PEDs, but it's still interesting to see how far they're able to take themselves. Similarly, if an AI-assisted programmer is able to solve AoC's relatively complex problems at lightning speed, that's interesting too. But if the common understanding is that people should solve it without AI, obviously I agree it feels shitty to go against that.
Are you saying we should not have motorcycle races? I agree that motorcyclists and runners should probably not compete in the same race, but there can be a competition where participants try to code the solution and another competition where participants try to build a machine that can code the solution at the same time.
"(If you want to use AI to help you solve puzzles, I can't really stop you, but I feel like it's harder to get better at programming if you ask an AI to do the programming for you.)"
I'd say that whatever "better at programming" means is quickly evolving. AI is a tool like high-level languages, IDEs, autocomplete, or all the different types of code analyzers. How is ignoring a powerful tool like AI going to get you better at programming?
Programmers who think they are better programmers without AI may quickly be replaced with "lesser" programmers who get things done much quicker.
For the same reason that giving a 5-year-old a calculator to "learn" addition doesn't teach them math. You still need to know why shit works in order to use it in a fully educated way. Otherwise you are just "putting magic words into the text editor" and your code will be held together with baling wire and chewing gum.
For day-to-day use, as long as you know what the code is doing when you run it, it's fine. But to get to the point where you understand what the code is doing, you have to have written code manually.
> giving a 5-year-old a calculator to "learn" addition doesn't teach them math
Yes, using machines to give an answer without providing any explanation of the process is not learning. One cannot generally give a young kid a calculator and expect them to learn how the adder in the calculator's ALU works.
However, using machines to get an answer that you know how to get to is beneficial. No learning is involved in either process, and it saves time. So if you give a kid, who already knows well how to perform addition, understands the principles, and knows the properties, a calculator to quickly get through their homework, it merely saves their time. You know, just like most people don't exactly like to write boilerplate code that they can type in their sleep.
And using machines that can explain how they got to the answer, and then reading that explanation, is learning. And learning by example can be powerful in certain situations, e.g. when one already knows some stuff but needs some hints on how to apply it to a different problem.
Surely, just asking the machine to solve a puzzle is unlikely to get one anywhere. Asking the machine to guide towards a solution might be very beneficial (if it manages to get things right, of course).
> For day-to-day use, as long as you know what the code is doing when you run it, its fine.
I think that'd exclude most Python, Ruby, Java, JS, Haskell,... developers. Heck, it'd even exclude a bunch of C developers who don't appreciate the fact they're not working on a PDP-11.
Incorrect. While you might not understand the lower level systems, as long as you understand the logic of it you are good to go. Just plopping in code from ChatGPT or even Stack Overflow without vetting it is a dangerous game. You have to know what the code does before using it, or else the likelihood of introducing security vulnerabilities and bugs goes WAY up.
> Otherwise you are just "putting magic words into the text editor" and your code will be held together with baling wire and chewing gum.
That's just abstraction? We're constantly dealing with legacy abstractions built 30-40+ years ago, when we had fundamentally different constraints from today. Leaky abstractions are a fact of life and there's only so much you can reliably know about the stack.
My problem with learning with LLMs is how bad they are at the moment - people are betting big that they will eventually get better - but so far I'm only seeing scaling walls and negative cost/benefit footed by big tech $$$.
I'm not a 5-year-old; I'm an old guy who has a job to finish. That said, it depends on what you want to get out of AoC. For me, the prize is to learn something by solving the puzzles, so I probably wouldn't plug the problem statement into the LLM directly.
I have been coding for almost 30 years now, and when I compete in Advent of Code I'm going to explicitly try to use AI tools, because those are the new skills I need to train now.
I think the right context for thinking about the quoted advice is for a person's education. If true beginners didn't ever handwrite a recursion algorithm, a list sorting algorithm, a CRUD application, an assembly interpreter, a neural net forward pass, etc., because they copied the code from an LLM, would they benefit from that? And if it's true for beginners, is there ever a point in one's learning journey where it stops being true and we don't have to learn or practice anything anymore which can be solved by an LLM, or never have to understand new concepts and tools even though the LLM can work with those concepts and tools for us?
I think there's a both-and answer for the contest. Maybe have one competition where the unenforceable spirit of it is, don't use LLMs for help, and another one where the challenges are just made... quite harder... so that even people who use the LLMs still need to marshal great ingenuity to use them better than others (e.g. the #1 spot is someone who used RAG and chain-of-thought better than the #2 spot, and also had better intuition of what to trust vs challenge from the LLM outputs).
I've always liked the premise that you should be competent one level below where you do your daily work. The tools you use should be more efficient for you, not necessarily the only way you can get your work done. If you don't know how to use a drill, you shouldn't use a drill press.
By that token, using AI to solve AoC really seems like one would be doing themselves a disservice. You might tell yourself you fully understand what the AI spits out, and maybe you do, but you won't build the same sort of comprehension that you would from solving it yourself. Nor will you retain it for as long.
"I feel like it's harder to get better at programming if you ask an AI to do the programming for you"
I've been thinking about this for a full year now, and I'm ready to present my opinion: I think AI assistance makes it easier, not harder to get better at programming.
Sure, if you're lazy you won't learn anything because you'll get the AI to do all the work for you. But if you're that lazy I imagine you wouldn't have even tried to learn programming without access to AI assistance, so you're still doing better here!
If you're curious and dedicated to self-education, LLM assistance is a gift that never stops giving. You have to learn HOW to learn with it first, but once you've figured that out you can really fly with it.
I am curious, why is this an illusion? I have experienced the same (or at least similar) feeling of getting better after getting my hands on a search engine like google. I never stopped using a search engine since then to check if I am suffering from this “illusion” you’re talking about. Should we?
I seriously suggest getting a book and typing out a listing totally from scratch. It's interesting to actually type out another person's style of solving a problem because not only is it alien feeling to "solve it that way", it's also interesting to physically solve it that way. I did this recently and was surprised to see myself solving the problem "that way" at work. It was much more elegant and more readable than what the AI was suggesting.
I'm sure you think you're getting better at programming by using a compiler, but make sure you test your skills by writing assembly without the compiler.
There's a distinction between the short-term economic incentives of using AI (work gets done quicker, more code gets written, output is created faster) versus the long-term personal incentives of not using AI early on. I believe mathematics and various natural sciences benefit from having a complete understanding of the tools, their inner workings, and their potential use-cases within your head. I think programming is similar too. If you are already a skilled programmer who is able to construct complex-and-reliable (engineering incentives) systems with great accuracy and reliability, then AI will add speed to your conquests. However, if you are a beginner who is not clear on what fits where, and how a canoe is built before a cargo ship, then AI will hinder the quality of your work, and eventually put you out of a job because you're just a mere puppet to the tool.
This is an insightful comment. I have been thinking about why I prefer CLI-based tools over GUI tools; one important difference is that with CLI the affordances are more in my head than in the tool (though the distinction is fuzzy). I just got a typewriter for my birthday, and have been reflecting on how different writing feels. When I write on my laptop, I process language less in my head and more on the screen--I type scattered fragments and then clean them up into sentences and paragraphs. Writing with the typewriter is slower and much harder to edit, so I need to do more composition in my head. I lose some of the affordances of the external medium (including whatever AI might contribute), but necessarily devote more attention to interrogating and composing the ideas.
Maybe not so much a better programmer, but I'd say the spirit of AoC is to crack a puzzle and translate the solution into code. Solving it using an LLM will not help you get better at cracking puzzles, translating them into code, or grasping the main concepts of a programming language. In that sense, ChatGPT is not a pedagogical asset in this particular instance.
> Software engineers who think they are better without AI may quickly be replaced with "lesser" software engineers who get things done much quicker.
Using AI to program is just ctrl-c ctrl-v software engineering, very clearly not the point of programming competitions. High level tools as you mention don't make you a better programmer, they make you a better software engineer.
I think the discussion you're raising is a good one; it's just not relevant to a programming competition.
If I'm trying to practice problem solving, why would I feed my problem into something that's going to spit out, in the best case, suggestions depending on the completeness of my input, and in the worst case a proposed solution?
If I'm trying to practice handwriting, why would I use a typewriter or a word processor?
I am not sure how using a machine to do the work you're trying to practice makes you better at that practice, since it's doing the mechanics for you.
As another commenter pointed out, I feel there is a distinction between programmer and software engineer, just as there is a distinction between a writer and, say, a novelist.
I understand the notion of practicing, and of getting better at an evolved workflow for producing results, but we're practicing foundational skills here, not trying to reduce everything to achieving results by any means necessary.
I feel that producing results without understanding is a short-sighted investment, and not exactly the point of AoC, IMO.
This same sentiment we now have against AIs has existed in the past, mostly against what were then called "high-level" languages, that is, C, Pascal, and maybe C++. Similar opinions were once held against memory-wasting languages (those with garbage collection) and languages that are too easy (the ones with dynamic typing and no pointers).
Programmers that write "slow C" instead of "fast assembly" were once regarded as lazy and incompetent, same for programmers that write "slow Java instead of fast C" or "slow Python instead of fast Java."
Right, the policy seemed fine to me until it got to this part. AI can be a way to get better at programming too. When I was learning Elixir, I watched José Valim work through the AoC puzzles[0] and followed along, typing in what José had and then working to understand it. José was doing the programming but it helped me get better at programming I think.
In that case, you're not actually competing in the competition, and that kind of thing _is_ useful. However, it'd be very much against the spirit of the competition to take those answers and use them as your own, and that's really what the policy is getting at.
Agreed, that was the sentiment I was aiming for, separating the competition aspect from the getting better at programming aspect. The policy with respect to competition seems fine. The follow-on in parens to the policy was the part I was referring to:
> (If you want to use AI to help you solve puzzles, I can't really stop you, but I feel like it's harder to get better at programming if you ask an AI to do the programming for you.)
Using AI is basically asking your colleague for the solution. It helps if it's used on the right occasions, but will make you very dependent if abused.
It's the same for tools, not gonna lie, but usually you need to understand how to use a tool before using it (and it's better if you know what the tool is doing).
It's the difference between telling someone to do something and asking how to do it.
I remember older programmers making fun of me for using an IDE and autocomplete when I was learning to code as a kid, claiming I'd never learn if I relied on tools like that to "do the job for me".
What if I'm looking to practice my ability to augment myself using AI?
(This isn't theoretical - I've done Advent of Code a few times, but I don't get to code too much these days. I am curious about its abilities and having a fun way to approach it might be a reason to do Advent of Code.)
That said, I'm not trying to push to the top of the leaderboard.
I know we have seen some recent advancements with the Large Language Models, but I'm pretty sure they still can't, say, generate a working python Flask server for my specific REST API through only some prompts, can they? Because that would be incredible.
I don't think it's necessarily that it makes you worse; it's that it defeats the purpose of the exercise.
Like, I'm sure there are some professional CNC operators who are also very, very good at doing things by hand. But they didn't get good at doing things by hand by using a CNC machine to do everything that can be done on a CNC machine. And, based on what I observed growing up in a down-and-out manufacturing town, the people who decided that CNC is the way of the future and went all in on just being very, very good CNC operators are the ones who got hit hardest by the growth of industrial automation in the 1990s and 2000s.
I have similar concerns about Copilot. In the short term, yes, it lets you get relatively straightforward tasks done more quickly. But I suspect that people who make heavy use of Copilot risk plateauing at a lower skill level and ultimately being the people whose jobs are most vulnerable to the growth of technologies like Copilot.
Seems like they need to adapt the level of difficulty to the new world of developer tooling. An AI may be able to write a starter template and, used wisely, can move the project forward, but it still requires knowledge and expertise to build a system of systems, even with the AI.
This is an opportunity to change our expectations of developer potential. Up the stakes and innovation will soon follow.
Edit: This is also an opportunity for someone to swoop in and make the AI version of this competition, with exciting and complex challenges, and make Advent of Code irrelevant by next year.
>Seems like they need to adapt the level of difficulty
"They" is one guy - Eric Wastl. He has a very specific style and hey may not want to change the entire style of the challenge just because some dorks want to cheat. I hope that he doesn't change it -- AoC is very approachable, especially inside the first ~15 or so days. It's a charming set of puzzles that people look forward to. I don't want it to become an elite competitive programming competition just to stay ahead of LLMs. I want to be able to partake with 1-2 hours a night after my kids go to bed.
With that said, I don't think LLMs will be helpful with some of the tougher problems. It's a hard thing to test, because I assume CoPilot would do well with previous year challenges due to the massive amount of public code available on the same problems. I don't think it'd do well on designing the IntCode VM or the modular arithmetic problem on first seeing them, without examples on GitHub. But perhaps Eric has been feeding some LLMs the unseen problems for this year and the LLMs are cracking them? I'm skeptical, but perhaps they've crossed that Rubicon.
> With that said, I don't think LLMs will be helpful with some of the tougher problems. It's a hard thing to test, because I assume CoPilot would do well with previous year challenges due to the massive amount of public code available on the same problems.
I'm not 100% sure on this, but I think copilot isn't trained on anything past January 2022 (the gpt-3.5 and 4 cutoff dates), so I think it still shouldn't do well with AoC questions from 2022
This entirely misses the point of AoC. We already have system-of-systems challenges - they're called work and they're not as fun as the self-contained whimsical puzzles that are AoC.
Seems like there would be better odds of people complying if you gave them an "AI/LLM" mode to tick at the beginning to group their results in an AI/LLM leaderboard.
As the person who came top of a round using GPT-3 last year:
1. The policy makes sense to me.
2. It would be fun to have an AI leaderboard too. It was a decent amount of engineering work to win last year: I had 25 LLMs going in parallel, then ran their solutions against the sample problem in an attempt to validate them. So there is still a lot of space for innovation.
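Roughly, the shape of that kind of pipeline is the sketch below; call_llm() is a hypothetical placeholder for whatever completion API you use, not the code I actually ran:

    # Sketch of "fan out N attempts, keep the first one that reproduces the
    # sample answer"; call_llm() is a hypothetical stand-in, not a real API.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def call_llm(prompt: str) -> str:
        """Placeholder: return Python source code from your model of choice."""
        raise NotImplementedError

    def attempt(puzzle_text, sample_input, sample_answer):
        code = call_llm(
            "Write a Python program that solves the puzzle below. "
            "Read the input from stdin and print only the answer.\n\n" + puzzle_text
        )
        try:
            run = subprocess.run(["python3", "-c", code], input=sample_input,
                                 capture_output=True, text=True, timeout=30)
        except subprocess.TimeoutExpired:
            return None
        return code if run.stdout.strip() == sample_answer.strip() else None

    def solve(puzzle_text, sample_input, sample_answer, attempts=25):
        with ThreadPoolExecutor(max_workers=attempts) as pool:
            futures = [pool.submit(attempt, puzzle_text, sample_input, sample_answer)
                       for _ in range(attempts)]
            for future in futures:
                code = future.result()
                if code is not None:
                    return code  # first candidate that passes the sample
        return None

The interesting engineering is in everything this glosses over: extracting code from the model's reply, pulling the sample input and expected answer out of the puzzle text, and retrying with error feedback when a candidate fails.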
I used ChatGPT (3.5 - 4 wasn't out yet) to help with Advent of Code last year and found it extremely useful - but I was careful not to attempt to make the leaderboard because it felt unfair to me: https://simonwillison.net/2022/Dec/5/rust-chatgpt-copilot/
Extremely useful for getting over fighting the borrow checker while learning Rust, or extremely useful for a proficient programmer in the language being used? Did ChatGPT actually solve the fundamental problems beyond ~day 15? Or just help you with the BC and parsing the data and other little boilerplate like that?
It was useful in that it helped me learn Rust syntax faster - using ChatGPT-assistance was a lot more productive and less frustrating than constantly running searches and referring to documentation.
I gave up after day 15 because it got too time consuming, with or without ChatGPT help.
> Can I use AI to get on the global leaderboard? Please don't use AI / LLMs (like GPT) to automatically solve a day's puzzles until that day's global leaderboards are full. By "automatically", I mean using AI to do most or all of the puzzle solving, like handing the puzzle text directly to an LLM. The leaderboards are for human competitors; if you want to compare the speed of your AI solver with others, please do so elsewhere. (If you want to use AI to help you solve puzzles, I can't really stop you, but I feel like it's harder to get better at programming if you ask an AI to do the programming for you.)
I love Advent of Code. Genuinely the most fun I've had coding since school. I like this policy, although it's impossible to enforce, which is great since Advent of Code is, like, purely a hobby? But surely this year's leaderboards will still be crushed by AI-tool-assisted speedrunners. That won't take away how fun it is for me to do it, though.
I get the sentiment from the pro LLM folks, but this is a fun thing we do once a year that is held by a few dedicated volunteers that want to keep it to be a programming competition. Using an LLM is not in the _same_ spirit as past years and if AoC wants to maintain that original feeling then I'm all for it.
Useful reminder to sort out my AoC++ donation for this year. Done.
Being in the UK I've never bothered attempting to target the global leaderboard, as 5am is just too early.
According to my research I probably could scrape in here or there (looking at the time between the timestamp on my input file, which I download by running a script manually, and my solution times), but then I assume I'd be one of thousands who could also do the same but are hampered by timezones.
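(The script is nothing fancy; a minimal sketch, assuming the usual adventofcode.com input endpoint and your session cookie exported as AOC_SESSION, is all it takes:)

    # Minimal input fetcher; assumes your adventofcode.com session cookie is
    # exported as AOC_SESSION. Usage: python fetch.py 2023 1
    import os
    import sys
    import urllib.request

    year, day = sys.argv[1], sys.argv[2]
    url = f"https://adventofcode.com/{year}/day/{day}/input"
    req = urllib.request.Request(url, headers={"Cookie": f"session={os.environ['AOC_SESSION']}"})
    with urllib.request.urlopen(req) as resp, open(f"day{day}.txt", "wb") as out:
        out.write(resp.read())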
I think the "Please don't" wording of the policy is pretty much all that Eric can do.
People will always try and game the system, but I doubt they can consistently game it for the number of days required to appear on the final leaderboard. I think the problems will just be too big/varied for LLMs to consistently solve.
If LLMs are mostly working by predicting tokens from previously seen problems (am I wrong? An AI noob here!), could an LLM even tackle such weirdly written novel problems? It's hard to know because the solutions for previous AoC problems posted on Github/blogs could be the reason it's capable of solving them nowadays, but could it solve 2023 on day 1? Really keen to find out!
There are certainly emergent capabilities that come out of the training process of being able to predict next tokens, and it's been shown that LLMs can often solve logic and math problems, riddles, etc. which are not in their training data, because deep in the internal layers of the model there is something resembling logic and reasoning abilities. This can be seen in the "benchmark" results of papers like the GPT-3 and InstructGPT papers, among many others, where they tried pretty hard to test the models against questions which were not part of the training data.
This is also shown in the Codex paper, where they trained an LLM to write code and then watched it solve a number of code problems they handwrote originally to make sure the problems could not have been in the training data.
Try it out yourself, make up some little math word problems and ask chatGPT or something.
Of course, advent of code will be much more challenging problems, but to get help with some subcomponents of the problem a motivated participant would likely try to use the most recent, powerful, and advanced models which outperform the results from papers written a few years ago, and outperform the free chatGPT.
> Try it out yourself, make up some little math word problems and ask chatGPT or something.
I have, and chatGPT simply sucks at logical reasoning in the way that is required for mathematics. It's extremely easy to make it trip up by just formulating questions in a slightly nonstandard way (so that it doesn't exactly match what it has been trained on), e.g.
> User: prove that there are only finitely many primes
> ChatGPT: Actually, it is not possible to prove that there are only finitely many prime numbers because this statement is false. There are infinitely many prime numbers, and this has been proven by the ancient Greek mathematician Euclid over 2,000 years ago. Euclid's proof is based on the principle that if you assume there are only finitely many prime numbers, you can always find a new prime by multiplying all known primes together and adding 1. This new number will either be prime itself or have a prime factor not present in the original list of primes, which contradicts the assumption of finiteness.
> User: prove that there exists a finite list of integers that contains all integers which are only divisible by itself and 1
> ChatGPT: You're referring to prime numbers. The statement is essentially the definition of prime numbers. A prime number is a positive integer greater than 1 that is only divisible by 1 and itself. There is no need to prove this statement because it defines what prime numbers are. While there are infinitely many prime numbers, they can be listed, but this list will be finite in the sense that there are infinitely many prime numbers, but they can all be enumerated within an infinite sequence. Some of the first prime numbers are 2, 3, 5, 7, 11, 13, and so on.
ChatGPT gets the first question right because something like that is undoubtedly in its training corpus, but it gets utterly confused by the second question (which is the same statement), to the point that its answer belongs to the "not even wrong" category.
I don't know where this myth comes from that LLMs are magically good at maths. They're not.
You're right about the basic mistakes they can make - they can also excel at the same tasks if prompted differently. But I was making a slightly different point. The point is they can reason about things in a better-than-chance way (I mean, a much-better-than-chance way) when given problems not in the training set. Have you read the Codex paper? Seriously, go look, an LLM even from years ago (which is like decades ago in ML-time) is often able to write code to solve novel programming problems that were handwritten to not be in the training set! Also, the benchmarks against which GPT3, InstructGPT, and GPT4 were tested are illuminating about its problem solving ability on novel problems, these are described in the three respective papers.
Secondly, quality is highly variant and there are traps the context window can fall into which causes especially bad results. Jeremy Howard has a great video (https://www.youtube.com/watch?v=jkrNMKz9pWU, starting at 18:05 the 'limitations and capabilities' section is only 13 minutes long) talking about how quality depends on: how you frame your prompts, model power (4 does a lot of stuff that 3.5 can't), and whether you're in a kind of "context trap" of repeated mistakes.
Of course, some people like to point out that if it's so "finicky" and variant, it is "dumb." Sure, if you like. I'm not interested in whatever definitions you're using for those things; the objective and observable point is that, given well-known prompting practices, LLMs can do something functionally equivalent to reasoning about novel problems, and more powerful ones can reason about more difficult things.
I re-phrased your prompt (instead of "prove a false thing" I made it like "decide whether this thing can exist, and prove your answer"). And added a little well-known boilerplate prompt sugar. It seems to have done a better job.
If your statement is that they can perform significantly better than chance, given certain favourable conditions (e.g. good prompt engineering), then I'll agree, although I'm not sure how helpful it is given that you can never know whether what they reply is garbage or not.
But even given "careful prompt engineering" it's still extremely easy to make ChatGPT trip up (I don't have access to GPT-4, so let's talk about ChatGPT for now):
> User: I am a student of machine learning and I have a graduate-level of understanding of certain subfields of machine learning and mathematics. I would like to ask you a tricky question about math. Please think carefully and step-by-step. Because, as a language model, each token you generate is a way to perform more computation, please write out your understanding of the problem and your steps of thinking about it. The challenge is as follows: decide whether or not there exists an algorithm that can decide for any statement in propositional logic whether it's true or not
> ChatGPT: The problem you're describing is known as the decision problem for propositional logic, or the "satisfiability problem" (SAT). It asks whether there exists an algorithm that can determine, for any given statement in propositional logic, whether it's true or not. This problem is fundamental in computer science and mathematical logic.
> I will walk you through the key concepts and steps in thinking about this problem:
> Propositional Logic: In propositional logic, statements are composed of variables and logical connectives (AND, OR, NOT, etc.), and the goal is to determine whether a given combination of truth values for these variables makes the statement true (satisfiable) or false (unsatisfiable).
> Formal Statement of SAT: The SAT problem can be formally stated as follows: Given a propositional formula (a statement in propositional logic), can we design an algorithm that correctly decides whether there exists an assignment of truth values to the variables that makes the formula true?
> Assume There Exists Such an Algorithm: Let's assume, for the sake of argument, that there exists an algorithm SAT that can decide the satisfiability problem for any propositional formula.
> Constructing a Contradiction: Now, we'll use a technique similar to the one used in the proof of the Halting Problem to create a contradiction.
> We define a new formula, Q, that represents a situation where SAT contradicts itself. We construct Q as follows:
> Q(P):
> if SAT(P) == True:
> NOT(P) # Negation of the input formula P
> else:
> P
> In other words, if SAT says that P is satisfiable, Q takes the negation of P, making it unsatisfiable, and if SAT says that P is unsatisfiable, Q takes P itself.
> Consider Q(Q): Now, let's consider what happens when we apply Q to itself, i.e., Q(Q).
> If SAT(Q) == True, then Q(Q) should be unsatisfiable because it takes the negation of itself.
> If SAT(Q) == False, then Q(Q) should be satisfiable because it takes itself.
> Contradiction: We've reached a contradiction. If SAT(Q) == True, then Q(Q) should be unsatisfiable, but our definition of Q says it should be satisfiable. If SAT(Q) == False, then Q(Q) should be satisfiable, but our definition of Q says it should be unsatisfiable. This contradiction arises because SAT cannot consistently determine the satisfiability of Q(Q).
> Conclusion: As in the Halting Problem, we've demonstrated that there are cases where the algorithm SAT fails to give a correct answer. Therefore, we can conclude that SAT (or any similar algorithm) cannot exist to solve the satisfiability problem for all cases.
> This proof shows that there is no general algorithm that can decide the satisfiability of any propositional formula. The SAT problem is known to be NP-complete, which means it's one of the most challenging problems in computational complexity theory. While efficient algorithms exist for many instances, there's no guarantee of a general solution for all possible cases. This has significant implications for problems in complexity theory and computer science, including issues related to algorithmic efficiency and the limits of computation.
This is absolute garbage. Even worse than garbage: it adds the additional insult of being plausible-sounding garbage to people who've had maybe a cursory introduction to theoretical CS. And notice that it's not even that hard to come up with such an example: a question like this could plausibly be on a problem set for an intro theoretical CS course (if it's not too easy even for that).
My main point is that this betrays an absolute incapability of ChatGPT to perform basic deductive reasoning reliably. It clearly knows that the referenced statement is about SAT, and, if you prompt it directly, it also knows that SAT is decidable, but it somehow isn't able to connect these two facts.
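(To spell out why the quoted "proof" can't possibly be right: satisfiability of a propositional formula over finitely many variables is decidable by brute force. A truth-table sketch like the following always terminates; it's merely exponential:)

    # Brute-force decision procedure for propositional logic: enumerate all 2^n
    # truth assignments. Exponential, but it always terminates, so (unlike the
    # halting problem ChatGPT is pattern-matching on) the problem is decidable.
    from itertools import product

    def satisfiable(formula, n_vars):
        """formula maps a tuple of n_vars booleans to a boolean."""
        return any(formula(v) for v in product([False, True], repeat=n_vars))

    def valid(formula, n_vars):
        """True iff the formula holds under every assignment."""
        return all(formula(v) for v in product([False, True], repeat=n_vars))

    # Example: (p or q) and not p is satisfiable (p=False, q=True) but not valid.
    f = lambda v: (v[0] or v[1]) and not v[0]
    assert satisfiable(f, 2) and not valid(f, 2)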
BTW my objection was related to mathematics, not coding. It's possible that, in the average case, LLMs perform much better at coding, since the level of rigour required in many cases is less than in mathematics. But when it comes to pure deductive reasoning, I wouldn't trust LLMs. Or at least not ChatGPT.
Yea honestly I think there would be value in them adding a checkbox that says "I used AI to mostly solve this problem" and maybe even list the AI. Could be an interesting way to see how the different AIs perform against such problems.
This policy does make sense, but feels like it won't age well.
Using LLMs to "win" at Advent of Code is - for me - a bit like getting an app to solve a Sudoku for you. You've got to end state pretty quick, and building the app to solve Sudokus is fun (trust me), but when you are trying to have fun and stretch your brain by solving a puzzle, how has that helped?
I think Advent of Code for me is about having fun trying to solve problems in my head and with my favourite IDE and programming language while having my first cup of tea of the day. I wouldn't want to use an LLM. I don't care about the global leaderboards.
I definitely see though how using one could be fun. And building code that can drive an LLM to solve the problem sounds like a ton of fun. So it would be nice to allow those who want to do that to flag it as such and to have their own leaderboards, or for it to show in leaderboards and allow them to be filtered in/out.
And like I say, at some point it will feel as weird as saying "no compilers".
I wonder how long until they drop the leaderboard entirely.
Most people are not realistically shooting for a spot on the leaderboard.
It adds novelty value to know that some geniuses are out there who can solve a problem in 20 minutes which takes me over 4 hours, but the main value is my private leaderboard, and the enjoyment of solving the puzzles.
Conversely, there is this whole other subsection of people who play AoC to place on the leaderboard (I do not – too late at night). It's fascinating to see their strategies. Personally, I focus on reproducible code, whereas they just want the answer, so often they run it in the console of the AoC website. Also, they have all developed their own libraries for common problems. It's an art, and it's fascinating to watch the videos they release.
Yep, private leaderboards are where it's at for me. Global pseudonymization dilutes accountability. I much prefer the book club feel of playing with people I know and trust, who all discuss the solutions as they work through them and are happy to provide hints to those who want them.
Leaderboards will be dominated by AI-assisted speed runners this year. I think I'll just solve them in the morning instead of trying to get on the leaderboard (I can usually get in the top 100 on the trickier ones). Will be better for my sleep anyway.
It's pretty unlikely for AOC to penalize people who use LLMs in any way (they have no way of detecting such use). This means there's going to be two groups of competitors, the honest ones and the ruthless ones. The ruthless group will probably be much smaller, but, considering how popular AoC is, probably big enough to completely overwhelm the honest group.
To be clear, the "ruthless" group won't be clueless people using Chat GPT for absolutely everything, but people who could very well solve the tasks on their own and figure out a way to leverage LLMs as efficiently as possible.
A rule permitting LLMs would give everyone an equal footing.
I'd like to see an AI leaderboard consisting of 100% LLM-generated submissions. These submissions could then be evaluated for quality and scored, using a new set of measures for code quality.
We can also learn a thing or two about prompt engineering.
I might try a wholly separate challenge this year: seeing if I can talk an AI into solving each of the puzzles. I wonder how far I could get using only ChatGPT's output without any hand-coding.
Couldn't you tell it exactly what to write without technically hand-coding it? "Make an if statement checking if x is greater than 5, and if so, print y."
I'd use LLMs to help unblock me in the same way I use Google, since lately I've been finding it more difficult to find easy answers to "how do I do X with Y library" with Google. I've been finding the first page of Google showing SEO pages for programming questions rather than Stack Overflow or official documentation.
I wouldn't just copy-paste the problem into ChatGPT and then use prompts to solve it. I want to have fun solving the puzzle, rather than asking ChatGPT what the answer to the puzzle would be.
I can't help but feel part of the meaning drops out of it when there are carve-outs for tools. If an AI can do it instantly, it starts to feel like pointless busy work; isn't that why we become programmers in the first place?
If your intent is to make it a pure human-keyboard-code challenge, then I would have thought this would be a great opportunity to make the underlying questions harder, or to find the edge cases (in terms of input or question format) where LLMs are no help.
Computers completely dominate humans in, say, chess. A game between humans is still interesting, though, and isn't nearly as interesting if one person cheats with an engine.
In the same way, a human competition in Advent of Code is an interesting challenge, and wouldn't be if people cheated with AI solvers.
The point of AoC was always to practice something new, learn a couple of things, challenge yourself a bit, share solutions and ideas with people etc. I don't see how that's changed by the existence of LLMs or by this policy.
> If an AI can do it instantly, it starts to feel like pointless busy work,
Isn't every coding competition pointless busywork? The organisers already know the solution, and even if they don't I can probably find a dozen solutions on github/reddit/wherever.
The point is you can enjoy it even though it's pointless busywork - much like you can enjoy doing a crossword, or running a marathon, or lifting a heavy weight in the gym.
Last year, when GPT-3 was already around, people managed to get it to solve some of the earlier problems itself, but not the later problems. This makes sense, because the earlier problems are made deliberately easier, while more difficult problems are saved for later. To eliminate this "busy work" aspect would require knocking out those easier problems, which would be a loss for those who can't solve the later problems at all.
Even then, it seems like something being beatable by a computer doesn't make it pointless to do as an exercise. We still teach children to add and multiply, even though computers can do these things billions of times faster than a human can and have been able to do so for decades. We believe that being able to do these things in one's own head is valuable even if it isn't the fastest way of doing them. Besides, there are still many people who practice and enjoy chess even after computers beat humans at it, and in fact many people integrate computer chess programs into their practice and analysis of chess.
The advent of code problems are already "busy work", in the sense of not making much sense, and having numerous complexities thrown into them for the sake of it. People have still enjoyed them for the competitive aspect, or for the fun of solving a puzzle. Those aspects don't change when you introduce an AI that can solve them faster.
> If an AI can do it instantly, it starts to feel like pointless busy work
If you can't solve the problem in a few seconds/minutes you will be able to just look up or copy/paste the answer for every AoC day from the associated solutions thread.
So in other words if you view it like that there was already a better "tool" to very easily "solve" any AoC problem: internet search.
If all you care about is getting a number to put into the website to get a checkmark that is.
I'm perfectly happy to let new programmers use AI to learn, that way I'm very cozy in any job I'm in.
Nothing to worry about if new programmers are learning from AI that writes, on average, code much worse than anyone I know.
There was always the option of not doing your work yourself, and instead paying a few bucks on some site to have someone else do it. Same crowd, it just got cheaper.
IMHO, a separate AI leaderboard should be created which encourages users to use AI to its fullest extent. That would give people who want to do so a way to do it legally and might make them less likely to infiltrate the standard leaderboard. Not sure how technically challenging this would be.
The optimal policy is to separate the AI and non-AI leaderboards. Having an AI leaderboard will give examples of what AI solutions look like which can then be used when judging potential cheaters.