The future of interactive theorem proving? (xenaproject.wordpress.com)
111 points by ColinWright on Aug 16, 2022 | 16 comments



I just confirmed with Azerbayev that the typo in the example from Munkres ("identify" not "identity") was indeed what was fed to the algorithm (and the algorithm got it right anyway). We found some other funny examples when asking Lean to write strange definitions of differentiation: it would sometimes just write the correct definition (in correct Lean) despite being asked to write something else, perhaps because it has seen the correct definition far too often? It's pretty weird to play with!


This sounds like a cool demo, but I find it hard to imagine it being useful. For one, in order to check the output you need to be proficient enough that you might as well write the statement that you want yourself. In addition, if you're working on a large library then there are small but important design decisions that matter for many statements, just like when designing ordinary software. I doubt that Codex will be able to make coherent design decisions any time soon.

I would love for computers to be able to understand informal but rigorous mathematical reasoning, but, having worked with current proof assistants, it sure feels like software design will have to be solved in the same sense first.

That said, kudos for an awesome demo!


I think one thing this could help with is that, even if you are proficient in Lean, finding how you are supposed to translate an informal statement into Lean/mathlib can be time-consuming, if not an outright challenge -- it is a very large library, and it is mathematically heavy (being a library for theoretical mathematics). At the very least, this gives you hints about which parts of mathlib you should be looking at.
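
To give a toy example of my own (the import path is from memory, so treat it as a sketch): the mathematics below is trivial, but knowing that mathlib spells it `continuous.comp` is the part that takes time.

    import topology.basic  -- mathlib3; assumed home of `continuous.comp`

    -- "The composition of continuous functions is continuous" is already in
    -- mathlib as `continuous.comp`; discovering that name is the hard part.
    example {α β γ : Type*} [topological_space α] [topological_space β]
      [topological_space γ] {f : α → β} {g : β → γ}
      (hg : continuous g) (hf : continuous f) : continuous (g ∘ f) :=
    hg.comp hf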


Maybe! You're certainly correct that there's a difference between knowing Lean (or a given proof assistant) vs. knowing mathlib (or some specific library). It doesn't seem to me that Codex has much of an idea about the mathlib codebase, since it e.g. invented and used "tangent_space_at_identity" out of the blue. Library-specific training would improve on this, of course.

I would reach for the usual code search tools for getting familiar with a library. For example, in Coq you have the "Search" and "Print Hint" commands which let you search for terms and instances, respectively. I imagine Lean has something similar.
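
If memory serves, Lean 3 with mathlib has comparable tooling (names from memory, so this is a sketch rather than gospel): `#check` prints the statement of a declaration, and tactics like `library_search` and `suggest` hunt for lemmas that close the current goal.

    import tactic  -- mathlib's tactic bundle, providing `library_search` and `suggest`

    -- Print the full statement of a declaration, roughly analogous to Coq's Search.
    #check nat.add_comm
    -- Ask mathlib to find a lemma closing the goal; it should report
    -- something like `exact add_comm a b`.
    example (a b : ℕ) : a + b = b + a :=
    by library_search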


> It doesn't seem to me that Codex has much of an idea about the mathlib codebase, since it e.g. invented and used "tangent_space_at_identity" out of the blue

But then perhaps it's a hint to the user that this API may be a useful abstraction, and perhaps one can use the same tool to define this function.


It's pretty wild that the author gets a somewhat-usable result merely by few-shot prompting OpenAI Codex.
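
For context, few-shot prompting here just means stuffing a handful of (informal statement, Lean statement) pairs into the prompt and letting the model continue the pattern. A rough sketch of what such a prompt might look like (the format is my guess, not the actual Lean Chat prompt):

    -- Natural language: If a and b are odd integers, then a + b is even.
    -- Lean: theorem odd_add_odd (a b : ℤ) (ha : odd a) (hb : odd b) : even (a + b)
    --
    -- Natural language: <the statement the user actually wants formalized>
    -- Lean: <the model is asked to complete from here>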


Probably Codex was trained on Lean code too.


While this is a very interesting exercise, I see the same kind of wild optimism here as when compilers were introduced, or "5th generation computer languages", or any other advance in that area. Each gets touted as "you can just tell the computer what you want in a simple, natural way, and it won't need programmers any longer". While each of those was a significant improvement, it is very far from the dream.


>it is very far from the dream.

SQL is the great success story of that generation.

Writing database query and management code in an imperative language is a nightmare.


I don't completely agree. Yes, SQL is a great success, but no, the declarativeness was not enough: we ended up generating SQL through ORMs, query builders, templating, etc.


I agree that there should have been more innovation on those aspects of query languages!

It's a little bit tricky because there isn't much low-hanging fruit for academics to write good papers about, so this innovation is up to programmers who want to improve their quality of life when dealing with databases.


I want to preface this by saying I'm not trying to hate on the author. Good on them for applying a new technology to try to solve an annoying part of their work. I hope they continue to be creative and fearless in their approach to problem solving. The comments that follow are born more out of my increasing frustration with the ML and ML-adjacent communities that seem to have failed to learn anything about the dangers of hype and over-promising from the 1980s (and really just human history in general).

While it's an undeniable achievement that these kinds of models can generate syntactically correct code, or even the correct solution some of the time, I find it hard to get excited about them for anything but boilerplate (and even there I think they pose the greatest risk, lulling one into a false sense of security until you get bitten).

If we take formal methods as the way to ensure the code we've written provably does what we say it does, we had better damn well make sure that what we've said it does is what we want it to do. In other words, I feel like this is offloading the most important part of the problem, the one that can't be checked automatically, onto a very iffy situation.
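
A toy Lean illustration of that point (mine, not from the post): the kernel only guarantees that the proof matches the statement, not that the statement matches our intent.

    -- `my_max` is buggy, and the "specification" below is too weak to notice:
    -- Lean happily accepts the proof, yet the theorem says nothing about
    -- `my_max` actually computing a maximum.
    def my_max (a b : ℕ) : ℕ := a   -- bug: ignores `b`
    theorem my_max_spec (a b : ℕ) : 0 ≤ my_max a b :=
    nat.zero_le _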

We should really stop to consider what these kinds of models are. Pattern recognizers, not thinking machines. See the author's quote about this theorem:

    Let G be a Lie group. Then the tangent space at the identity is a Lie algebra.
which is translated into the theorem:

    theorem tangent_space_at_identity_is_lie_algebra (G : Type*) [group G]
      [topological_space G] [topological_group G] [charted_space G] [smooth_manifold G] :
        lie_algebra (tangent_space_at_identity G) :=
Their quote:

    For some completely unclear reason, the app decides to consider a Lie algebra over \mathbb{R} instead of my explicit instruction to use a field k, and that’s even before addressing the fact that tangent_space_at_identity is not a mathlib definition. I tried a few different responses to the app’s initial attempt, and couldn’t get it to do anything helpful.
They try to fix this by "prompt engineering":

    You must specify a field for your Lie algebra as in lie_algebra k M
which leads to the change:

    lie_algebra R (tangent_space_at_identity G) :=
This is exactly the kind of boilerplate we'd want to automate with this. The specification of a Lie group seems onerous here, but it's also exactly the kind of mistake that makes relying on these systems so scary. Here it's a syntax pattern, and we have to end up proving the correctness, so maybe we wouldn't notice the mistake here. It's almost certainly going to be missed often in (choose your favorite dynamically typed interpreted language that lets you find all this fun stuff out at runtime). There is no understanding of what we do and do not know. What we've proved or have yet to prove be damned.

I guess my point hinges completely on what you consider "knowing" to be. It's kind of funny. Nobody would ever say that a compiler knows how to transform one language to another. The personification of ML really irks me. I try very hard to avoid it, but it's easy to slip.

I don't want to say we should throw the baby out with the bathwater. I do think that there are uses for these things. The recognition of the pattern for the syntax to specify the characteristics of a Lie group would make a great first pass at a snippet template. (This is ignoring the question of why Lie_group isn't a definition that just pulls in the necessary requirements on G rather than needing to specify them each time. If one of those is not really a requirement for Lie groups, forgive me; my exposure to Lie theory is very limited.)

I would be much more interested in a tool that allows me to identify patterns in the code with a summary printout or something like the likelihood of a given pattern in the neighborhood of my cursor. The generation part is just not good enough to trust IMO.

Also, I really hope we see some sort of backlash against MS just deciding to ignore everyone's licenses.


(I'm the Johan Commelin mentioned in the blogpost.) In fact, `lie_group` exists in mathlib, and is defined as follows:

  /-- A Lie group is a group and a smooth manifold at the same time in which
  the multiplication and inverse operations are smooth. -/
  -- See note [Design choices about smooth algebraic structures]
  @[ancestor has_smooth_mul, to_additive]
  class lie_group {𝕜 : Type*} [nontrivially_normed_field 𝕜]
    {H : Type*} [topological_space H]
    {E : Type*} [normed_add_comm_group E] [normed_space 𝕜 E] (I : model_with_corners 𝕜 E H)
    (G : Type*) [group G] [topological_space G] [charted_space H G]
    extends has_smooth_mul I G : Prop :=
  (smooth_inv : smooth I I (λ a:G, a⁻¹))
In particular, the output of Lean Chat didn't talk at all about needing a smooth inverse function.

Anyway, I very much agree with you that there are risks here. At the same time, I think this can be an extremely useful tool for new users trying to get started: mathlib is approaching 1M lines of code, and a tool like this might help a lot in discovering parts of the library. As you say, it can already be used as some sort of snippet-engine-on-steroids.


I figured there would be. It just provides another point that it's not really understanding, and it's not really "extracting the pattern".

I really wish half the effort spent on generation would go toward leveraging these models for guided exploration. It wouldn't matter if you were given slightly incorrect suggestions for "likely relevant functions," but even likely-correct source code doesn't cut it.

Or even snippet prototyping. Like don't use the generated text directly, provide an interface that I can use to transform the generated code into a snippet that I can save.


It's a fair concern that an ML model may generate misleading definitions and statements that hide bugs and are hard to detect, but it also depends on how the user uses this kind of tool.

For any serious use case, before attempting to prove any statements, a responsible user should always carefully define any *top level statements* manually or hire a domain expert to write or review the definition.

Once we have high confidence in the definition of the top level statements, the user might start to break the proving approach down into a couple of auxiliary lemmas.

From this point on, they can freely play with machine-generated code without too much safety concern, because:

* the ultimate goal is to prove the original top level statement, which is already human inspected;

* if the machine generates a misleading lemma, it won't help prove the top level statement, but it won't lead to a false claim either, since the proof checker will reject any incomplete or incorrect proof.

The worst case is to be fooled by misleading lemmas and waste a lot of time exploring an unhelpful proof approach, which could be sad and costly but won't be a critical safety issue.
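
A minimal Lean sketch of that workflow (a toy example, with names of my own choosing): the top level statement is written and reviewed by a human, while machine-suggested lemmas still have to get past the kernel, so they can be useless but not unsound.

    -- Top level statement: written and reviewed by a human.
    theorem top_level (a b : ℕ) : a + b = b + a :=
    nat.add_comm a b
    -- Machine-suggested auxiliary lemma: possibly irrelevant to `top_level`,
    -- but the kernel rejects incorrect or incomplete proofs, so it cannot
    -- introduce a false claim into the development.
    lemma suggested_aux (n : ℕ) : n + 0 = n :=
    nat.add_zero n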

If ML-assisted interactive theorem provers inspired by Lean Chat become productivity tools, users should be educated to always do their best to ensure the top level statement is exactly what they want to define; this should be common industry practice. For critical usage, a theorem prover UI could even enforce this practice by disabling auto code generation for top level statements.


Does it work with Lean 4?



