LLMs CURRENTLY favor the more curious and open individuals (high on openness in the Big Five scale). Half of the population is not open and does not want to try new things unless it leads to a very direct and proven benefit.
However, over time the overwhelming benefits of using LLMs will be well understood, and these ladder climbers will absolutely master LLMs, no matter their intelligence. People can become experts at taking exams despite how boring and soul-sucking that can be, let alone at using something way more fun and useful like LLMs.
My friend had never written anything more than an Excel formula a few months ago and now he's using GPT-4 to write very nontrivial Python applications and automate large parts of his job.
I (having 30 years experience as a professional Software Developer^TM) am begging him to teach me his techniques.
Now that you mention it, I met him and we became friends in large part due to his communications abilities.
5 months ago a friend wrote me a Python script and sent it to me... I couldn't get it to work. I used phind.com to explain what to do... it worked out that my Windows environment variables needed to be changed, and told me how to structure a folder schema to place the src script... mindblowing stuff. And I have been using it since - in turn, the same friend told me to write a similar script in Python myself... it has been amazing to forgo Stack Overflow/Google to get answers. When the GPT-4 model kicks in, my questions are answered with beautiful clarity... whether or not this technology "hallucinates", without a doubt it's empowered me to learn to program. If it's helped me out of 50% of my ruts, I count that as overwhelmingly positive!
I am not going to visit the website you just posted here unless you explain how the mechanism works, otherwise I will assume it's just crappy advertisement in this post.
Phind is just prompt engineering, as is anything sitting on top of what OpenAI is building right now - if you are doing Q&A then maaaaybe it does some retrieval-augmented generation as well (say, from Stack Overflow or other technical resources)
Phind is often recommended to me by programmers who find it produces "better" results than a naive GPT-4 session, but I don't know that anyone has done any real world testing.
As explained before me, Phind is a website built on top of GPT-4, plus a free tier (which is what I use) which uses what they call the Phind Model... though it changes often.
I understand your skepticism (because the internet is fully broken right now/and forum posts are more or less paid for content at this point)
but... it's a bit funny that you'd ask someone who just explained they're a beginning programmer - someone who got into programming because an LLM helped remove insane amounts of friction from the learning process - to "explain a mechanism" to you.
Google has actively become catastrophically bad, to the point that it will ignore the one special key term I was deliberately searching for in favor of more general, dumb results... it is literally unusable - I used to be able to find that very specific Stack Overflow answer I had read years ago, or some barely-read blog post, and now I'm happy if it actually finds a website when I search for its domain name.
GPT would be good as a search engine, but who would want a search engine that stopped indexing a few years back? Also, it is not good at ultra niche topics, which would be the whole point of a search engine.
Huh, I can't seem to get in the groove of using it, maybe I'm old or something, but all the subtle ways it's wrong annoy me, and I feel I have a much better grasp if I think through it myself, supported by Google.
I get a lot of mileage out of ChatGPT just treating it like an intern who turns around work instantly. You don't expect interns to write perfect code, but they can save you a ton of time if you set them loose on the right problems.
For any relatively simple task I can say "Write a Python script to do X" and it will almost always spit out working code, even if it has subtle mistakes. Fixing mistakes is fine and part of the process. I don't have to read StackOverflow posts saying "Do you really want to do X?", or sift through documentation that follows the author's approach of how they want to introduce the material but doesn't directly address my question.
Not OP, but I always want to read the code generated by chatgpt before I run it. And I dislike reading other people's code much more than writing it myself.
Especially now that the training cutoff is being moved up to April 2023. Questions requiring more recent results were the main ones I’ve been going back to google for.
Part of this is that ad-infested, AI-generated blog spam is flooding Google! But it’s also my go-to. I also really like using GPT to bring me up to speed on libraries I’ve never used.
Agree in part but also I think Terry is such an outlier (though also generous and humble) that it’s hard to extrapolate from this example to a more general case
The only benefit Terence Tao has received is in correcting syntax. He has mentioned that he has gotten (in my opinion better) benefits from talking to existing Lean4 specialists.
I don't think you realize what you are saying here. I agree that it is a large boost, but 100x is just too ridiculous to take seriously. Do you really believe an engineer can now finish in 20 hours what would have taken them a year before?
As someone who uses ChatGPT4 pretty much nonstop all day (I have a monitor dedicated to it) - it's almost certainly a 10-20x in fields I have no knowledge of (writing in languages I've never used, dealing with OS/Kernel/Network constructs that are unfamiliar to me) - I can knock off in an hour what might have taken me a day or two previously - but I don't think I've ever had a task where I could say that I completed in 1 hour what would have taken 100 hours - though I would love to hear of counterclaims from people who have been able to do so - definitely not saying impossible, just that I haven't had that experience yet.
The idea of the 10x programmer can mean 1) someone who produces a ton of code quickly, 2) someone who can solve seemingly-intractable problems, 3) someone whose presence on a team improves everyone’s productivity quite a bit, or 4) someone who makes technical decisions that save a ton of time down the line.
Disagree, those are all valid interpretations, and depending on your experience you will have a vastly different understanding.
To me, the first one is the most basic and frankly, most silly definition of a 10x programmer - produces more code? really? Code sucks. Nobody needs more code, people need solutions. Solving a problem with no more code, or understanding it so a _tiny_ change solves all problems? Way better than adding code, actually much harder to synthesize and produce new results. Way more efficient.
Now can you remove things and solve the problem? Realize you can build an even more simple and generic system that solves your problem? Even more amazing.
I would argue that it only helps in very beginner questions as it is beyond useless on anything more complex.
Like if the answer is not a leetcode/CRUD code sample you can easily find on the internet either way, it can’t do anything useful. Like, my problems mostly constitute things like reasoning about the correctness of a given change in the context of the whole codebase, its threading model, etc. It will at best regurgitate some bullshit like “race conditions can be this and that” here, not helping at all.
I'd say TLA+ is designed more for software people trying to design systems, write down specs, and reason about how the thing behaves dynamically.
Lean is used mostly for writing down math proofs, and a lot less for software (although by the Curry–Howard correspondence, math proofs and programs have an equivalence, so the line is a little blurry). Lean has "mathlib", which is like a standard library of formally verified math that people can contribute to and use in new proofs.
Leo de Moura wants Lean 4 to be used for software verification too.
A cool thing about Lean 4 is that it's also a programming language, using the same syntax as for proofs, making it easy to consider proving correctness properties of programs you write. Most of Lean 4 and its tactics are written in Lean 4 (though at this point almost none of this code has any associated proofs).
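For a rough flavor, here is a minimal sketch (not from the post itself; names are just illustrative):

    -- An ordinary, executable Lean 4 function:
    def double (n : Nat) : Nat := n + n

    #eval double 21  -- 42: the same code runs as a program

    -- A theorem about that function, stated and proved in the same syntax.
    -- Both sides reduce to the same term, so `rfl` closes the goal.
    theorem double_unfold (n : Nat) : double n = n + n := rfl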
Yeah, I believe they said they intend for it to be used as a general purpose programming language. I used it to complete Advent of Code last year.
There are some really interesting features for general purpose programming in there. For example: you can code updates to arrays in a functional style (change a value, get a new array back), but if the refcount is 1, it updates in place. This works for inductive types and structures, too. So I was able to efficiently use C-style arrays (O(1) update/lookup) while writing functional code. (paper: https://arxiv.org/abs/1908.05647 )
Another interesting feature is that the "do" blocks include mutable variables and for loops (with continue / break / return), that gets compiled down to monad operations. (paper: https://dl.acm.org/doi/10.1145/3547640 )
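A small sketch combining both features (the function is made up for illustration): the array is updated in a functional style, but with a reference count of 1 the runtime mutates it in place, and the mutable variable and `for` loop in the `do` block get compiled down to monad operations.

    def squares (n : Nat) : Array Nat := Id.run do
      -- `let mut` and `for` in a `do` block compile to monadic code.
      let mut xs := Array.mkArray n 0
      for i in [0:n] do
        -- Functional update: returns a "new" array, but since the refcount
        -- is 1 here, the runtime updates the buffer in place (no copying).
        xs := xs.set! i (i * i)
      return xs

    #eval squares 5  -- #[0, 1, 4, 9, 16]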
Yeah, here is where TLA+ fails to gain adoption: because it is only a model, how people actually implement the architecture is completely unrelated to how the model was validated.
> Most of Lean 4 and its tactics are written in Lean 4 (though at this point almost none of this code has any associated proofs).
Do you have to actually prove a tactic works for it to be sound? A tactic will rewrite the program into a simplified program that will be verified by the kernel; if the tactic wrongly expands something, the kernel will later reject the program. Only the kernel itself needs to be verified for the whole thing to be sound.
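For illustration (a tiny, hand-rolled example): the tactic only constructs a proof term, the kernel independently re-checks that term, and `#print axioms` reports what the accepted proof ultimately relies on.

    -- The `decide` tactic builds a proof term by evaluation...
    theorem two_plus_two : 2 + 2 = 4 := by decide

    -- ...and the kernel re-checks that term; this reports that the theorem
    -- depends on no extra axioms.
    #print axioms two_plus_two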
You're not missing anything. I was going to mention this but decided not to get into it. (One detail: you can't verify the kernel exactly because of Gödel incompleteness issues.)
I am absolutely not an expert at either of them, but I believe TLA+ is mostly concerned with all the possible orderings of the "code" that models your program, verifying the absence of concurrency bugs.
Do correct me if I’m wrong, but Lean is more about total functions/FP code; I’m not sure how well it handles threading, if at all. It might be more related to the correctness of the actual implementation of a serial algorithm.
A few years back, I was trying to find out how to reduce mistakes in the programs I write.
I got introduced to Lamport's TLA+ for creating formal specifications, thinking of program behaviors as state machines. TLA+ taught me about abstraction in a clear manner.
Then I also discovered the book series "software foundations", which uses the Coq proof assistant to build formally correct software. The exercises in this book are little games and I found them quite enjoyable to work through.
Code correctness is a lost art. The requirement to think in abstractions is what scares a lot of devs away from it. The higher-abstraction approaches (formal specs) focus on a dedicated language to describe code, whereas lower abstractions (code contracts) basically replace validation logic with a better model.
C# once had Code Contracts[1]; a simple yet powerful way to make formal specifications. The contracts were checked at compile time using the Z3 SMT solver[2]. It was unfortunately deprecated after a few years[3] and once removed from the .NET Runtime it was declared dead.
The closest thing C# now has is probably Dafny[4], while the C# dev guys still try to figure out how to implement it directly in the language[5].
The issue is that modern software fails because it's part of a complex system with many moving parts, rather than because it is inherently complex at the source level.
Certain sorts of algorithmically complex development (games, cars, medical hardware, etc.) would benefit from a 'closed-world verification' -- but that's not most software, and they have alternatives.
'Code correctness', including unit testing, ends up being a big misdirection here. What you need is comprehensive end-to-end tests, and instrumentation to identify where failures occur in that end-to-end.
The effort to source-level-check source-level-code is largely a huge waste of time and creates an illusion of reliability which rarely exists.
> The issue is that modern software fails because it's part of a complex system with many moving parts, rather than because it is inherently complex at the source level.
A reminder of Gall's law:
> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.[8]
> The issue is that modern software fails because it's part of a complex system with many moving parts, rather than because it is inherently complex at the source level.
The choice to run a system as many different moving parts is a decision taken by the team in order to avoid failure.
> Certain sorts of algorithmically complex development -- but that's not most software
It's all software.
> 'Code correctness', including unit testing, ends up being a big misdirection here. What you need is comprehensive end-to-end tests, and instrumentation to identify where failures occur in that end-to-end.
No and no. I have comprehensive end-to-end tests. They take forever, don't fit into RAM (for some services I need to run them on my home PC because my work laptop only has 16GB), and most importantly: they show that the code is not correct. Now I have to change incorrect code to correct code (while not breaking any public interfaces). I wish my predecessors had not put incorrect code into the live system.
I'm not really sure what you're saying here -- the parent commenter is talking about closed-world source-level code verification.
My claim is that these techniques help in relatively few areas of software development. In the main, software is built with open-world interactions across a range of devices (network, disk, etc.) where those interactions dominate the net failure modes of the system.
In the latter case, source-level verification is not only of no help, but it's a huge resource misallocation. You end up writing tests that just create the illusion of success; and then the whole thing constantly falls over.
Spend all that time instead in designing tests that do fit into ram (eg., sample your datasets); and instrumentation that does reveal errors (eg., sampling of memory, requests, etc.).
One of my first software jobs was writing a reverse proxy -- one of my co-devs wrote unit tests that simply established the in/out of various functions was "as expected". Pretty useless -- the issue is whether the proxy actually worked.
Likewise most source-level correctness efforts are 'testing theatre'.
Devs are looking for cargo-cult solutions they can copy/paste. No, testing software is an actual area of development -- and you need to develop tests, not cargo-cult them.
Tests only show that it is correct for the sets of values and code paths you exercise. It's quite possible for other aspects to be incorrect, and this is what theorem provers like Coq help with.
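As a small illustration of the difference, in Lean (same idea as Coq; this assumes the standard `List.reverse_reverse` lemma from Std/Mathlib is available):

    -- A test: checks the property for one concrete input.
    example : ([1, 2, 3].reverse).reverse = [1, 2, 3] := rfl

    -- A theorem: checks the property for every possible input at once.
    example (xs : List Nat) : xs.reverse.reverse = xs := by
      simp [List.reverse_reverse]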
This is true but mostly meaningless in practice. I have never found the people who say this every time formal verification comes up to be the same people who actually work in formal verification. The reality is that it's far far harder to mess up a formal spec that you actually have to prove a concrete instance of, than it is to mess up the code. With a buggy spec, in the course of trying to prove it you'll arrive at cases that seem nonsensical, or be unable to prove basic lemmas you'd expect to hold true. Formal spec bugs happen, but they're quite rare. Usually even when there are edge cases that aren't covered, they will be known by the authors of the spec due to the dynamics I mentioned earlier (cases encountered during the proof that are clearly being solved in ways that don't follow the "spirit" of the spec, only the letter). Of course, "informal" specs, or paper proofs, will have subtle bugs all the time, but that's not what we're talking about here.
A lot of spec bugs happen all the time. If you think people can account for all edge cases in massively complex projects you are wrong. There are many behaviors that you can't predict will be nonsensical ahead of time until you actually hit them.
Formal verification is not a silver bullet. Despite all of the extra effort bugs will still happen. It's better to invest in making things safe by default, or by aborting when getting into a bad state.
Can you point me to some of the concrete bugs in formal specs with associated proofs that you are thinking of? Specifically in the part that was formally verified, not in a larger project that incorporates both verified and unverified components. Because my experience is that when asked to actually provide concrete examples of how formally verified code has bugs in practice, people find it quite difficult to do so. Formal verification is about as close to a silver bullet as it gets for eliminating bugs, even if it's not 100% (nothing is). It's certainly far better at eliminating bugs than vague philosophical positions like "making things safe by default" (what does that mean?) or "abort when you're in a bad state" (how do you know you're in a "bad" state? Isn't it in general at least as difficult as figuring out what the correct formal spec is? How easy is it to detect? Is the performance cost acceptable?).
In fact, what usually happens is the opposite of a formal proof being undermined by a bad spec--when an informal spec is formally verified, inconsistencies are often found in the original spec during the course of the proof process, and bugs are subsequently found in older implementations of that spec. Fixes are then incorporated into both the original spec and the existing implementations. Formal verification is a spec-hardening process.
>Specifically in the part that was formally verified
No, I am saying that it is easy to not verify something because you don't know the requirements up front.
>"making things safe by default" (what does that mean?)
It means that complexity is abstracted away such that it is hard to do the wrong thing.
>"abort when you're in a bad state" (how do you know you're in a "bad" state?
There are invariants which you can assume that are always true. If they aren't true for whatever reason you can abort and later track down what caused you to enter this state. It could be as simple as some logic bug or obscure as hardware malfunctioning and causing a bit flip (at scale you need to deal with hardware misbehaving).
>inconsistencies are often found in the original spec during the course of the proof process
Bugs are found in the process of testing too.
> No, I am saying that it is easy to not verify something because you don't know the requirements up front.
Granted, but "this spec doesn't attempt to cover [X]" is very different from "this spec claims to cover [X] but actually doesn't, in ways the authors are unaware of." The former is quite common in formal specs with machine-verified correctness proofs and I've never heard anyone call that a "spec bug". The latter is not common in them at all. Perhaps you didn't intend this yourself, but the way people generally talk about "there could be a bug in the spec" is usually used to imply that the latter case happens often enough to make formal verification comparably effective to other defense mechanisms, when empirically formal verification is far, far more effective than any other bug prevention mechanism--to the point that people struggle to find even individual examples of these kinds of spec bugs.
> There are invariants which you can assume that are always true. If they aren't true for whatever reason you can abort and later track down what caused you to enter this state. It could be as simple as some logic bug or obscure as hardware malfunctioning and causing a bit flip (at scale you need to deal with hardware misbehaving).
Yes, and stating these invariants precisely is generally exactly what you need to do as part of writing the formal spec; getting them right is equally hard in both cases, so it seems weird to me to say that it's easier to check the invariants at runtime than to get the spec right. In fact, many formal specs have invariants that are exactly of the form "assert that if I ran this line of code, it would not throw an exception." Moreover, you can use much more "expensive" invariants when you're creating a spec, since you don't actually have to check them at runtime, which in practice lets you catch far more and more subtle bad states than you can with software checks. For example, your proof state can include a full history of every action taken by the application in any thread, and prove that invariants on them are maintained during every instruction, even when those instructions proceed concurrently and might witness partial or stale views of the latest values in memory; storing and checking all of this information on every step on an actual machine would be wholly impractical, even for high level languages and non-performance-critical software.
Obviously, things are defined according to some base model of the world, so if you are concerned about hardware bit flips (which in rare contexts you are) your spec should take that into account, but the vast majority of bugs in modern applications are not bugs of this form, and most defensive runtime checks are equally susceptible to hardware issues or other unexpected environment changes.
> Bugs are found in the process of testing too.
Yes, and I never said testing was bad. Formal verification and testing are complementary techniques, and tests are important in a formal verification context because you should be able to prove the tests will pass if you've gotten your spec right. However, them being complementary techniques doesn't mean that they are equally effective at reducing bugs.
No, it really isn't. It's doing better now than it ever has... and I mean all the negative implications of that. It was not an art that was ever "found" in the sense you mean.
Formal verification is often not the best tool for ensuring code correctness from an ROI perspective. Things like unit tests (including property based tests) and ensuring 100% code coverage often achieve adequate results with less effort.
The key point is that most software don't need correctness to the level formal verification provides. There's a subset of software however, for which there's no substitute for a formal verification process.
Have you looked into Idris2 at all? While looking into these theorem provers, it always felt like they had an impedance mismatch with normal programming.
Idris2 aims to be a general-purpose language that also has a more advanced type system for theorem proving.
I had the same positive experience with Software Foundations.
There is another book somewhat derived from it (if I understand correctly) using Agda instead of Coq: https://plfa.github.io/
I haven't had the chance to go through it yet, but it's on my list - I think Agda (and as mentioned by another commenter, Idris) is likely to feel more like a programming language than Coq.
I’m really excited about dependent types. I’m expecting we won’t get them for a while though. Dependent Haskell is progressing but apparently it’s hard to retrofit. Idris’ own creator has said he expects it to be a model for other languages; I don’t think it will ever have mainstream adoption. Coq, Agda, and F* aren’t really designed to be general purpose.
Although the implementation for the compiler is complex, and the syntax can get complex and verbose, to me my requirement is simple: I want to encode everything about input and output that I know.
Right now in mainstream languages I often know more about my arguments or output than the type system will allow me to specify.
I totally share your excitement about dependent types, but it seems that, unlike the type systems we're used to, theorems about the dependent types are much harder to prove, which makes them not very comfortable to use for the whole program.
If only there was some kind of a gradual, perhaps typescript-like approach to adding arbitrary type-level value-limiting information in random places without having to have everything proven everywhere...
Every non-dependent typing relation is also a dependently typed relation so I think things are already the way you want, unless you have a certain example in mind.
Sure, in the same sense that every "untyped" program is already typed with some kind of a universal type, but what's the point?
What I want is to be able to specify and have (when possible) automatically proven arbitrary assertions anywhere in the code without necessarily making sure that every possible presupposition is proven from the ground up. Just like in Typescript, I can add a type at any point where there was only "any", and have a small portion of the program typed without typing the whole thing
Not sure if it's what you're asking for, but in lean, you can put `sorry` to skip a proof. It's `believe_me` in Idris2, and I think the Agda version is `trustMe`.
You can also use stuff like ! when you don't want to prove your array indices (it might crash or substitute a dummy value if you're wrong).
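For example, in Lean 4 (illustrative names):

    -- `sorry` stands in for a missing proof; Lean accepts the file but warns
    -- that the declaration depends on `sorry`.
    theorem todoLater (n : Nat) : n + 0 = n := by
      sorry

    -- `xs[0]!` skips the in-bounds proof obligation: it panics at runtime if
    -- the index is out of range, instead of requiring a proof that 0 < xs.size.
    def headOrPanic (xs : Array Nat) : Nat :=
      xs[0]!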
i don't even need it to go beyond the insides of a function. something like a proof of an invariant that's only relevant inside the function's body would be fine. e.g. in Rust, almost every place where you see something like ".unwrap()" with "// unwrap safety:" comment, this comment could be an assertion easily proven from the few lines above
Dependent types are a concept in computer science and programming languages where the type of a variable can depend on the value of another variable. In simpler terms, imagine you have a list of numbers and you also have information about how long that list is. With dependent types, you could create a type for the list that explicitly includes its length, ensuring at compile time that operations respect this length. This makes it possible to catch certain kinds of errors before the code even runs.
For example, in a language with dependent types, you could specify a function to take a list of length 3 and no other sizes. If you tried to pass a list of length 4, the program wouldn't compile, preventing this kind of mistake early on.
It's a bit like having an extra layer of safety checks, where the types are more expressive and can encode more intricate relationships between variables.
I never care much for chatgpt answers. I don't know why people post them on here.
In the first sentence, "another" is wrong because you don't need two variables, you just need one. Final paragraph's wrong for the same reason.
The example given is poor given that I can write [i8; 3] or int[3] in Rust or C and those very much do not have "dependent types" in the popular sense (Rust's const generics excepted). To be fair, those examples are technically dependent types, but it would be better to give an example that's impossible in those languages, such as "array with length at most 3" or "even integer".
Finally, to nitpick, "a bit like" is unneeded hedging.
I like ChatGPT’s answer better than yours. Rust’s const generics and C++ aren’t dependent types in any sense. And saying “the type of one variable can depend on the value of another” is vague but a good intuition pump.
Since I guess we’re doing dependent types explanations I’ll give it a go. Dependent types extend a type system with two new type thingies: 2-tuples where the type of the second element depends on the value of the first, and functions where the type of the function’s return value depends on the value that was passed to it.
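In Lean-style notation, those two things look roughly like this (a hand-rolled `Vec` purely for illustration):

    -- A length-indexed vector: the type `Vec α n` mentions the value `n`.
    inductive Vec (α : Type) : Nat → Type where
      | nil  : Vec α 0
      | cons : {n : Nat} → α → Vec α n → Vec α (n + 1)

    -- Dependent function (Pi type): the return type depends on the argument `n`.
    def zeros : (n : Nat) → Vec Nat n
      | 0     => .nil
      | n + 1 => .cons 0 (zeros n)

    -- Dependent pair (Sigma type): the type of the second component depends on
    -- the value of the first.
    def someVec : (n : Nat) × Vec Nat n := ⟨3, zeros 3⟩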
Ha! Looks like I was wrong and confusing refinement types with dependent types: https://cs.stackexchange.com/a/21733 (unless of course this person is also confidently wrong, heh)
> Dependent types are types which depend on values in any way. A classic example is "the type of vectors of length n", where n is a value. Refinement types, as you say in the question, consist of all values of a given type which satisfy a given predicate. E.g. the type of positive numbers. These concepts aren't particularly related (that I know of). Of course, you can also reasonably have dependent refinement types, like "type of all numbers greater than n".
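And the refinement-type side of that, in Lean-style subtype notation (names are illustrative):

    -- "The type of positive numbers": a value bundled with a proof of a predicate.
    abbrev Pos := { n : Nat // 0 < n }

    def three : Pos := ⟨3, by decide⟩

    -- A dependent refinement type, as in "all numbers greater than n".
    abbrev GreaterThan (n : Nat) := { m : Nat // n < m }

    def four : GreaterThan 3 := ⟨4, by decide⟩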
Because I had the question and I figured this was quicker. I didn’t know what dependent types were.
So now you know why I do it. Also, I believe this is my first time doing it. I might be wrong.
Is it better to ask and wait for an answer instead?
There is nothing in the guidelines on HN about it. I don’t know what’s reasonable and I haven’t seen strong cultural norms from HN yet. I at least labeled that the text was from chatgpt as to not confuse it was my own text.
It's of course fine to ask chatgpt. I also appreciate that you wrote that it was from chatgpt.
However, I wouldn't recommend posting the result here if you don't know if it's correct. Moreover, anyone can ask chatgpt themselves. It's better to wait for someone here to post an answer.
Yes, there's nothing in the guidelines, but they're (deliberately) not all-encompassing. Besides, I would hope it's part of basic Internet etiquette; just because we now have access to tools that generate plausible-sounding but not necessarily correct answers to questions doesn't mean we need to post what they create in the place of genuine answers.
> Besides, I would hope it's part of basic Internet etiquette
Yea that got lost on me. I think I view these things a bit differently than most on HN. I've noticed in general that the more I'm not at uni, the more my thinking has become heuristic and quick-ish. It used to be more thorough and in-depth. The trade-off of it being that such type of thinking is more time consuming but the answer is more comprehensive and/or accurate.
> Is it better to ask and wait for an answer instead?
No, around here it's better to say "So dependent types are pretty much $SOMETHING_COMPLETLY_WRONG ?" and wait for all the "corrections" (aka Cunningham's law someone linked to nearby).
I could use google but chatgpt wins on speed and asking on HN wins on vibe. When someone asks me a question I am mostly quite enthusiastic in explaining it. You can’t get that as easily from a search engine.
I don't see anything wrong with GPT's answer other than pedantic quibbling. I just don't like it because I don't want to share this space with non-humans.
Cunningham's Law: Post something wrong online so you can get a correct answer.
Half the ChatGPT answers on here seem to be wrong in obvious ways or, worse, subtle but critical ways. When people post them they get downvoted, and other people chime in with "Why are you trusting a fabulist like ChatGPT instead of going to actual resources with definitions and explanations that aren't garbage? Here's what it actually is..."
That is really interesting. Wish there was some better name for that, as I feel like it isn't that descriptive. However, their benefit seems really obvious: saying that something is an array, or even an array of integers, is barely half the picture if in reality it is an array of length 5 of even integers, to borrow from your example. I guess you would try to implement it in other languages by creating objects?
Another classic use case is that you can have expressions in the types of your function, for example this won't compile:
    stringOrInt : (x : Bool) -> Int -> (if x then String else Int)
    stringOrInt True  n = show n
    stringOrInt False n = n + 1

    1 + stringOrInt True 37  -- this will error, because it knows you've not returned an Int (the result here is a String)
The other example that you can do in dependently typed languages, but is too involved to write out here, is to make a type-safe printf, where the format string produces the types for the other arguments.
The simplest explanation is that dependent types are types whose exact type uses some kind of parameterized value in the definition. So a dependently typed array would use a parameterized length value as part of its definition (this is usually the easiest example to understand). So an array of length 4 (of whatever type) could be identified as different from an array of length 5, etc.
That one of the brightest minds of our generation is able to increase his bandwidth with the combination of LLMs and automated proofs makes me super bullish on this tech combo in the future!
It starts with bug-fixing, then supports verification, until it starts propelling new discoveries and push the envelope.
We need a term for when a dynamic like Moore's Law "infects" a field that had no such compounding properties before.
Could Terence have done it without Copilot? Sure, but like many of us he might not have initiated it due to the friction of adopting a new tool. I think LLM tech has great potential for this "bicycle for the mind" kind of scenarios.
"Terry Tao finds ChatGPT very helpful to formally verify his new theorems" seems to be a true statement. See some of his other recent mathstodon toots.
The point is that LLMs and similar tools tend to be very good at automating the trivial but not very useful at what would be considered really "interesting" work.
So while your point is somewhat true [0], as he mentions that these tools could become good enough to do the formal verification part, it's precisely not the interesting part. See [1] and [2]; in particular some things that are very easy to do in real maths can be very challenging in an automated theorem prover, quoting from [2]:
>In the analyst's dialect of Mathematical English, this is a one-line proof, namely "by the standard limiting argument". Unpacking this into a formal proof required me to go through a fair bit of the Mathlib documentation [...]
It's impressive to be able to do such mathematics in Lean/Coq/... at all, but it is very tedious mechanical work [3].
>It was more tedious than I expected, with each line of proof taking about an hour to formalize
So I think that rather proves the point of what LLMs are currently good for, and what tools can help for really difficult tasks, rather than invalidate it.
I wonder whether we could combine formal proof checkers (like the Lean proof checker) with language models that generate synthetic conjecture-proof pairs in a formal language like Lean.
The Lean proof checker could be used to automatically verify whether the synthetic proofs written by the language model are correct. This information could be used to provide an RL reward signal applied to the original language model, which would result in it writing better proofs. (Or we train a new model using the correct synthetic proofs of the previous round as training data.) And then the process repeats. So the model would self-train using its synthetic training data, without further human intervention.
We could even make this process more adversarial. First we split the generator language model into two: one which generates conjectures, and one which tries to prove/disprove them in Lean. Then add a predictor model which tries to predict whether a synthetic proof is verified by the Lean proof checker. The lower the predicted probability that the proof will be correct, the more reward the proof-generator model gets if it did indeed provide a correct proof.
Finally, we add another model which tries to predict the reward the proof-generator model will get for a given synthetic conjecture. Then the conjecture-generator model is rewarded for conjectures that are predicted to yield a high reward in the proof-generator model - so conjectures that are neither too hard nor too easy for the proof-generator model.
So we would expect that the whole system would progressively create harder and harder synthetic proofs, which in turn allows for better and better self-training of the proof-generator.
It seems this could in principle scale to superhuman ability in generating proofs. The process would be somewhat similar to GANs or to self-play in AlphaGo Zero.
I think the hard part is the initial bootstrapping part, to get the whole process off the ground. Because the initial training of the generator models has to be done with human provided training data (Lean proofs). But once the synthetic proofs are good enough, the system would self-train itself automatically.
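A very rough Python sketch of that core loop (the model objects, their APIs, and the reward values are entirely hypothetical; the only concrete piece is shelling out to the `lean` executable as the ground-truth verifier):

    import subprocess
    import tempfile
    from pathlib import Path

    def lean_accepts(proof_source: str) -> bool:
        """Return True if the Lean checker accepts the candidate proof file."""
        with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
            f.write(proof_source)
            path = Path(f.name)
        result = subprocess.run(["lean", str(path)], capture_output=True)
        return result.returncode == 0

    def training_step(conjecture_model, prover_model, optimizer):
        # 1. Propose a synthetic conjecture (hypothetical model API).
        conjecture = conjecture_model.sample()
        # 2. Attempt a Lean proof of it (hypothetical model API).
        candidate = prover_model.sample(conjecture)
        # 3. The proof checker provides the reward signal.
        reward = 1.0 if lean_accepts(candidate) else -0.1
        # 4. Reinforce the prover on the verified outcome (hypothetical
        #    optimizer API; RL details elided).
        optimizer.update(prover_model, conjecture, candidate, reward)
        return reward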
It would also be interesting to train these models on the entire math literature. One would want to use this to verify the math literature, translating the proofs there to checkable forms. Eventually, we would get a model that knows every theorem and every correct proof in the math literature.
It is superficially about a different topic, namely generating and solving "programming puzzles", basically a type of unit test, that are then verified by the python interpreter. This system seems quite analogous to one where the programming puzzles are replaced by conjectures in a language like Lean, the proposed puzzle solutions are instead proposed proofs, and the python interpreter is replaced by the proof assistant.
If I understand this correctly, they want to use an AI model for "autoformalization", which sounds like "using a language model to translate natural language proofs into Lean proofs". Which is cool but much less ambitious than the self-train system I described above. I guess AI technology isn't yet far enough to make such a proposal work.
Lean4 is brilliant. Worth digging into as a programmer. Coq, Lean4, Agda etc. made my brain explode in a good way. Making me a better software developer.
I'm kind of interested in how useful Lean4 is as a programming language, and if it's easy to prove things about a program written in Lean. I should probably look into that when I have a minute.
Regarding usefulness: Lean is very nice to program in, if you care about pure functional languages; its FFI allows you to incorporate fast C routines very easily if pure Lean is not performant enough or lacks features. However, in some domains, Lean is within a decimal order of magnitude of (not hand-optimized) C; some benchmarks I hand-made recently impressed me.
Regarding proving things about programs, no, it is not easy, and the developers do not seem to consider it a core goal of Lean.
> the developers do not seem to consider it a core goal of Lean
I guess it depends on who you ask. The original devs of Lean wanted to do "everything" (because that's how you start projects, I guess). Since then it has attracted a lot of mathematicians (especially those working on Mathlib, a library that aspires to formalize "all known math") who are happy to have "algorithm objects" and prove things about them without being able to actually run an algorithm on any input.
This goes together with mostly embracing classical logic (which breaks the original and most powerful version of Curry-Howard, which allowed you to extract programs from proofs). However, in practical situations, algorithms extracted in this way tend to be too slow to be useful, so maybe that's not actually a downside for programming purposes.
Finally, Lean4 "compiles" to C-code, so at least it is (or can reasonably easily be made) portable. People have been trying to use it for real applications, like the AWS stuff others have linked in this thread.
The core Lean 4 developers do want proving properties about programs to be easy. In the short term maybe priorities have been elsewhere due to limited resources, but that doesn't mean they do not consider this to be a core goal. My understanding is that there are still some research-level problems here that need to be worked out. (Proving things about do notation is currently a real pain for example.)
Does anyone understand what the proof is about? An improved bound on the mean of a geometric series? And why it's only a problem for n=3, k=2, and not for all k=n-1?
Exactly lol. Not sure why everyone is taking the contribution of lean in finding the error so seriously.
I would be more impressed when some mathematician finds a more severe error in their proofs with the help of theorem provers (meaning a mistake in their own intuition).
Is it possible that small bugs or assumptions in a root paper could cascade through referencing papers leading to wildly inaccurate outcomes 5 or 6 papers down the line?
Yes! I work in quantum information theory and this recently happened in a subfield called resource theories.
The generalised quantum Stein's lemma [1][2] is (or was) a very powerful result that was used for over 10 years to prove things in this subfield. However last year it was noticed that there was an error in the proof of this lemma, and a whole bunch of results based on it weren't valid [3][4]. One of the authors of the paper where they wrote about the error gave a talk at the conference QIP 2023 this year, and there is a video of that talk available here [5]. Bartosz is a good speaker and I recommend watching the talk if you're interested, if you go to about 10 minutes, 30 seconds in the talk he discusses the consequences of this result now being not known to be true.
This is arguably what caused the push for formalism in mathematics:
Multiple papers on calculus claimed results about continuity and derivatives, but were using subtly different definitions.
The conflict between those results, and the counter-examples to demonstrate the difference, led to mathematicians building the modern machinery around proofs.
> The Weierstrass function has historically served the role of a pathological function, being the first published example (1872) specifically concocted to challenge the notion that every continuous function is differentiable except on a set of isolated points. Weierstrass's demonstration that continuity did not imply almost-everywhere differentiability upended mathematics, overturning several proofs that relied on geometric intuition and vague definitions of smoothness.
In theory yes. In practice mathematicians tend to have a good instinct for which things are true (although not always - some false theorems stood for decades if not centuries) and will avoid looking into ideas that don't pass the sniff test. Plus if you keep building on the consequences of a false result then you'll likely eventually reach a contradiction, which might inspire you to spot the bug in the original result.
In an informal sense disproving things is easier than proving things. For example, theorems often say things like "all objects of type Y have property X"; it's very difficult to work out even where to start proving such a statement unless you're an expert in Y and X, but to disprove it all you have to do is find some example Y which doesn't have property X. If you've worked with Y things for a while you probably have a bunch of "pet" Y objects you can think of, and if someone proves something you kinda automatically check to see what their proof says about the ones you're familiar with.
> In an informal sense disproving things is easier than proving things.
Note that this is not true in general, and depends on the type of theorem.
The idea is that while it’s easy to show why a particular proof is incorrect, it’s much more difficult to show that every proof is incorrect.
Formally, this idea is captured by CoNP, which is believed to be different from NP and hence a strict superset of P.
> which is believed to be different from NP and hence a strict superset of P
“a strict superset of P” doesn’t really follow from that. P is also believed to be different from NP, and P is certainly not a strict superset of P.
Of course, I assume you just misspoke slightly, and that whatever it is that you actually meant to say, is correct.
E.g. maybe you meant to say “coNP is believed to both be strict superset of P (as is also believed about NP), and distinct from NP.”
I think it is believed that the intersection of NP and coNP is also a strict superset of P, but that this is believed with less confidence than that NP and coNP are distinct?
I imagine I’m not telling you anything you don’t already know, but for some reason I wrote the following parenthetical, and I’d rather leave it in this comment than delete it.
(If P=NP, then, as P=coP, then NP=P=coP=coNP, but this is considered unlikely.
It is also possible (in the sense of “no-one has yet found a way to prove otherwise”) that coNP = NP without them being equal to P.
As a separate alternative, it is also possible (in the same sense) that they are distinct, but that their intersection is equal to P.
So, the possibilities:
P = NP = coNP,
P ≠ NP∩coNP = NP = coNP,
P = NP∩coNP ≠ NP, coNP,
All 4 are different,
(NP∩coNP denotes the intersection of NP and coNP)
Where the last of these is I think considered most likely?
)
I work in medical software and a few years ago fixed a bug that was an incorrect parameter value in a model used for computing diagnostic criteria that had proliferated throughout the literature after seemingly being incorrectly transcribed in one influential paper. The difference was relatively small, but it did make results somewhat worse.
The theorem was mostly correct. As stated it was false, but it was true for n >= 8. If you change some not very interesting constants it becomes true for all n. All you need to change is the constants for n < 8.
Also check out Morph Labs, which is working with Lean to create an AI proof assistant. Cool startup by ex-OpenAI folks.
Essentially, Lean's strong type system can help with constrained generation: every generated token would always lead to some valid (if not necessarily correct) proof in Lean, if I understand correctly. Maybe people at Morph can comment.
Is there a way to do lightweight incremental proof checking in a typical say Python or Javascript codebase?
Maybe specifying some conditions/assertions in comments and have it verified using some static analysis tool? Though I recognize it could be quite a challenge in dynamically typed languages.
>Is there a way to do lightweight incremental proof checking in a typical say Python or Javascript codebase
Running a JavaScript codebase through the Typescript compiler is a lightweight way to do incremental proof checking, albeit it can only check proofs about the soundness of the code.
Python's Deal library provides contracts and a small portable formal proofs area of the codebase.
Additionally, Deal integrates with CrossHair which does concolic execution of the tests and functions annotated with contracts. It's integrated with Z3, and most of the Python primitives are covered.
It just works surprisingly well for incrementally building up provable code properties.
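For a flavor of what that looks like (the function itself is just an illustrative example):

    import deal

    @deal.pre(lambda items: len(items) > 0)   # precondition: non-empty input
    @deal.post(lambda result: result >= 0)    # postcondition: non-negative result
    def mean_abs(items: list) -> float:
        """Average of absolute values; the contracts state the properties to check."""
        return sum(abs(x) for x in items) / len(items)

CrossHair can then search for counterexamples to those contracts symbolically (e.g. `crosshair check yourmodule`), on top of whatever tests you already have.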
Tangent useless comment: while checking this Mastodon instance, I noticed a specific user is spamming the global feed by replying to his own posts every other day. I only saw his posts, mostly, and wondered if this was a personal instance of sorts.
Me neither, but it makes sense to me in the case of an error that is relatively easily corrected. If a proof is fatally flawed, however, I would not use that term.
AWS has started using it internally recently, but the biggest Lean software project remains Lean 4 itself, which is almost entirely written in Lean 4 (though not verified except for some index operations and data structures). Galois has used Lean 4 internally, too, and I know of another smaller verification project by a German company for verifying some internals that were very difficult to understand.
One reason for lack of adoption is that the verified standard library for programming is still rather small. Fortunately, it is expected to grow much more quickly now that there are developers who are getting paid to work on it and I expect that we will likely see a lot more on this front in the coming years.
For people that know neither (like me 5 minutes ago):
>Lean4
> Lean is a functional programming language that makes it easy to write correct and maintainable code. You can also use Lean as an interactive theorem prover.
>Terence Tao
> [...] is an Australian mathematician. He is a professor of mathematics at the University of California, Los Angeles (UCLA), where he holds the James and Carol Collins chair.
If you're going for a PhD or researching some obscure, furthest reach of some mathematical principle and you get stuck, one of the ways to move forward is to hope that Terence Tao finds your topic interesting and can give you a few minutes of his time - because he'll probably have some ideas for moving your research forward.
Conversely, there’s a running joke that you’re in big trouble if Terry Tao takes too much of an interest in your topic of study, since he will inevitably solve all but the hardest problems in the field.
You can't expect the entire Internet to always provide enough context for you personally, with your exact level of knowledge, to understand a social media post with no additional effort. How do you expect that to work at scale? Meanwhile, Google is just a tab away.
This may be hard to believe, but a lot of us here in "the HN crowd" have zero interest in celebrity and do not keep a catalog of them all in our heads. Even intellectual ones. There's nothing wrong with that, but it's also just its own interest one may or may not have.
People approach things in a lot of different ways and it would be nice if we can just respect that instead of digging on each other or making unfounded assumptions.
I don't have much interest in celebrity either, but I feel that you're short-changing Terence Tao. He's not really famous due to being celebrated, but he's celebrated due to his level of skill and breadth of mathematical knowledge (sometimes known as the Mozart of Mathematics).
I think your sentiment is misplaced as I would expect HN commenters to have heard of Tao as he often features in posts about mathematics. I'm sure I recall seeing some comments from him too.
I started to agree with you, but let's keep in mind that there are new generations of tech geeks out here who are just learning about the things we live and breathe.
I am sure 100% of thornewolf and 99% of people reading the same news as you know who Terence Tao is, but extrapolating from your own perspective is treacherous.
Beyond the incredible quality and quantity of his work starting from early in his life, what makes Terence Tao memorable to me, is his approachability and willingness to write advice for mathematicians and math students in a blog: https://terrytao.wordpress.com/career-advice/
It's rare to see a professional at the top of an academic field remain so encouraging to other people in the field, and work to make their work accessible to colleagues across different levels of mathematical maturity.
Tao got his PhD at the age of 21 and a tenured professorship at the age of 24. To compare, at the age of 21 ordinary people only get their bachelors, and typically it's at 23-24 that they get their masters degree. A PhD takes several more years, and of course very few become tenured professors even more years later.
Depends on the country: before Bologna, Portuguese degrees were 5 years, so 23-24 would be the age of finishing the degree, and given that a PhD without an MSc wasn't possible back then, having everything done meant at least 30.
In the other direction, in the UK it's quite possible to have your bachelors by 21, masters by 22 and PhD by 25. I had my mathematics PhD by 26 and am not a remarkable mathematician.
Dumb question: how do people skip years like this in college? Like, suppose you do know the material well already - does that just get recognized by the faculty at some point (you are noticed pre-college and get a faculty member interested in you? In college?), or would he have needed to specifically apply for that in some way?
I ask in jealousy, I felt like my college put as many administrative barriers as possible to ensure they got at least 4 years of tuition out of me. I’ve always wondered how people somehow managed to speedrun to the top like this, rather than just being bored by their classes (esp gen-ed classes ugh) until they reach top rank the normal way.
He had a lot of acclaim as a young mathematician; according to Wikipedia he was taking college-level math at 9 years of age. At this level of talent I believe you begin to be courted by top universities, similar to D1 athletes. Not only that, but you are likely genius-level smart, so sticking you in a classroom of people your age would be doing you an intellectual disservice, similar to sticking a college freshman in a 2nd grade classroom. At this level, you are probably seen as the next Newton or Gauss, and professors will want to attach themselves to you and work alongside you. At this point you bring a ton of value to the school just for attending, and they won't mind if you are testing out of course material since they are just happy for you to be there for any length of time; it just becomes more of a formality anyway.
It wasn't "like this" but I just (with parental support) enrolled in the local community college after junior high, took a light load of college work, and got a homeschool program to sign off until I could test out at 16. Did 2 years worth of college work over the 4 years that would have been highschool, then transferred to a "proper" university as a junior at 18.
For non-geniuses like myself, you can just ask to test out of some of the lower tier introduction courses. I got a typically 4-year degree in 3 years this way. It's called credit-by-examination.
I'm sure for the prodigy-level students there is an even higher streamlined process for keeping them engaged.
In the 90s I sort of attended two universities at the same time in Hungary (given the time and the place it's not as big a deal as it sounds) and I said to the second one, hey I know the basic stuff, let me attend the more advanced one and they said, sure, no skin off my back, do whatever you want, if you fail, that's a you problem but if you pass the more advanced tests we will credit you the basic stuff.
Youtube has forgotten what "audio" is for me so noting the first talk is also at https://www.microsoft.com/en-us/research/video/the-future-of... - which is using youtube - but manages to make noises. Which I suppose is a pointwise instance arguing that software does not work as well as it might do.
> I have decided to finally get acquainted with the #Lean4 interactive proof system (using AI assistance as necessary to help me use it), as I now have a sample result (in the theory of inequalities of finitely many real variables) which I recently completed (and which will be on the arXiv shortly), which should hopefully be fairly straightforward to formalize. I plan to journal here my learning process, starting as someone who has not written a single line of Lean code before.
Automated reasoning is usually included in AI (Lean4 has tactics so it qualifies as such) but it's also quite different from the ML stuff that's the more usual kind of AI these days.
> where he holds the James and Carol Collins chair
What is the academic impact of holding an endowed chair like the 'James and Carol Collins chair'? Is it related to any specific advantages or responsibilities? Those titles seem like they serve as a form of recognition for donors; is there a deeper significance behind them?
The advantage is that the chair usually comes with a bunch of money which you can use to fund grad students and pay for research. The latter is of course much more important in experimental sciences and engineering than it is in math. Still, mathematicians often do hire grad students to do grunt work for them.
Often, no, but some are particularly prestigious due to the lineage of individuals who had held the chair. Others have some special requirements like any endowment. Mostly, it is not really meaningful though.
I'm sure some people do indeed say that, as that has also been said about most famous smart people.
However, from my experience as someone who scored "off the charts" on a UK school Cognitive Abilities Test and 148 on an IQ test: any score over 130 is suspect, as basically nobody is collecting enough data to make the tests statistically valid once you're more than 2σ from the mean.
What do you mean by statistically valid? Are you just saying that the map between the raw scores and a z-score is questionable (and thus comparing scores between different tests is invalid), or are you making a deeper claim?
Both, I think. However I would also add that my understanding is second-hand, reading and watching what others have to say rather than any deep academic research of my own.
Every source I've read or watched agrees the sample sizes used when creating the tests in the first place just aren't large enough to be all that confident beyond 2σ.
There's also the problem that training on tests is effective, which limits the ability of a test to be validated on those taking it fairly soon after it gets published.
A further issue is that most IQ tests pre-suppose the existence of a G-factor and treat differences between types of intelligence as noise rather than a signal for the existence of any multi-factor variation. I don't know which hypothesis is correct (I lean towards thinking it's mostly a G-factor, but there are clearly some things which can only be explained by multi-factor models, e.g. where a skill is entirely absent, such as in the case of dyscalculia).
He's really really really good at math, but is he really really really good at a bunch of other unrelated fields too? It seems silly to imply that math skills are all there is to intelligence.
I think you don't actually know about the g-factor. Basically, what these so-called 'intelligence tests' measure is the rate at which one can absorb information and apply it abstractly: how you can come up with solutions to problems by doing high-level abstraction, finding patterns and connecting dots, finding things that no one else thought of. What this means is that even though he might not know how to do other things at first, once he starts delving into them he can quickly learn them and really excel at them. What these tests measure is the g-factor, and the claim is that if you are good at one thing you'll also be good at other things. I'm not making this up; there is loads of evidence for this. The most rigorously tested idea in the field of psychology/psychometrics is intelligence - if you had to choose one single concept out of the whole field, it would be intelligence. You can read about the g-factor and how they measure it, all the theory and stats, and then tell me what you think.
But why is it that the people who always score the highest on IQ tests contribute the most to math, and only a little bit to all the other subjects under the sun?
Why aren't they also creating the greatest art, growing the greatest vegetables, and, I don't know, designing the greatest structural engineering designs in the world?
If he does work in the field of animal/plant genetics I am hopeful he will discover great things and move the field forward
>structural engineering
Again, this is not what he does, and as for why he does math... I don't know his reasons or motivations for doing it. But I do believe there are high-IQ structural engineers out there - of course there are, how can you say there are none? The greatest structural engineering designs made yet were actually made by geniuses or highly intelligent people; you wouldn't expect a kid who fails high school to make those things, would you?
https://mathstodon.xyz/@tao/111208692505811257
Many of his mastodon posts this month have been about his learning progress.
Certainly an interesting case study of how LLMs can accelerate the work of even the most successful people