Emitting Safer Rust with C2Rust (immunant.com)
159 points by dtolnay on March 15, 2023 | 49 comments



DARPA is funding this. Good.

They haven't reached inter-procedural static analysis yet, which means they can't solve the big problem: how big is an array? Most of the troubles in C come from that. Whoever creates the array knows how big it is. Everybody else is guessing.

A bit of machine learning might help here. If you see

    void dosomethingwitharray(int arr[], size_t n) {}
a good conjecture is that n is the length of arr. So, the question is, if this is translated to

    fn dosomethingwitharray(arr: &[i32]) {}
does it break anything? Both caller and callee have to be analyzed. The C caller has the constraint

    assert_eq!(arr.len(), n);
That's a proof goal. If a simple SMT-type prover can prove that true, then the call can be simplified to just use an ordinary Rust slice. If not, the conversion to Rust has to drop to those ugly C pointer forms, preferably with a comment inserted. So you need something that makes good guesses, which is a large language model kind of thing, and something which checks them, which is a formalism kind of thing.
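To make the contrast concrete, here is a minimal sketch of the two forms (hypothetical function names, not actual c2rust output): the raw-pointer fallback used when the length relation can't be proven, and the slice version that becomes possible once it can.

    // Fallback form: the caller's `n` stays, and every access goes
    // through a raw pointer (hypothetical, c2rust-style).
    unsafe fn do_something_raw(arr: *mut i32, n: usize) {
        for i in 0..n {
            unsafe { *arr.add(i) += 1 };
        }
    }

    // Form that becomes possible once a prover (or a human) establishes
    // that every caller passes n == arr.len().
    fn do_something(arr: &mut [i32]) {
        for x in arr.iter_mut() {
            *x += 1;
        }
    }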

The process can be assisted by putting asserts in the original C, as checks on the C and hints to the conversion process. That's probably the cleanest way to provide human assistance.

I've wanted this for conversion of OpenJPEG code to Rust. That's a tangle of code doing wavelet transforms, with long blocks of touchy subscripting and arithmetic, plus encoders and decoders for an overly complex binary format containing offsets and lengths. Someone recently ran it through c2rust. The unsafe Rust code works. It's compatible with the original C - it segfaults for the same test cases which cause the C code to segfault. This is why a naive transpiler isn't too helpful.

(The date at the bottom of the article is 2022-06-13. Has there been further progress?)


> a good conjecture is that n is the length of arr

As an old osdev currently enjoying using rust instead, I would say I wish.

N might be the length of arr. It might also correspond to the number of elements of (implicit) type t that would fit in the unsigned char array arr. It might be the length of the array minus space for a trailing char (either minus one or minus sizeof(char) bytes). Or it could be the size plus one, because why not.


What you just described is the absolute bane of my existence at work in embedded firmware development. Though typically size_t does mean what it says on the tin — the horrific drivers usually use some other unsigned int type if they’re doing dumb stuff with “n” there.


Which is why it's only a conjecture. You need static analysis, SAT solving, and run-time checks to validate that conjecture.

Using something like GPT-4 on this problem is promising. It's probably going to be right most of the time, and its errors can be caught by the next phase of the analysis. That's about what you'd get if you put junior programmers on language conversion.


> The date at the bottom of the article is 2022-06-13

The date was wrong; sorry, my mistake. The article reflects progress as of early January 2023. We're actively working on the lifting feature and will post a follow-up once the tooling is sufficiently mature to be tested by the community.


> The date at the bottom of the article is 2022-06-13. Has there been further progress?

The article links to their github repo:

https://github.com/immunant/c2rust

There are commits from within the last hour, so at least some sign of life.


One could rewrite the C code to pass a pointer to an array with a length in the function prototype. Then you could also get bounds checking in C with UBSan.

https://godbolt.org/z/dPWPo1rrv


n could be the size of the subset still requiring sorting, or of any other subset of the array; it could be a start index; it could be anything. assert(len(arr) == n) is quite the wild conjecture.


> does it break anything?

Probably the ABI, if nothing else?


I used c2rust to start rewriting OpenJpeg into Rust code [0].

It was easy to get the Rust code compiled and working as a drop-in replacement for the C library. This has been a big help with refactoring the unsafe Rust code into safe Rust (manual work). OpenJpeg has a great test suite that has allowed testing that each refactor step doesn't add new bugs (which has happened at least 3 times).

The original run of c2rust generated 96,842 lines of Rust code (about 1 year ago); now it is down to 46,873 lines of code. A lot of the extra 50k lines of code were from C macros that got expanded and from constant lookup tables (the C code had 10-30 values per line, the Rust code 1 value per line).

For anyone looking to use c2rust to port C code to Rust, I recommend the following:

  1. Set up some automated testing if it doesn't exist already.
  2. Do refactoring in small amounts, run the tests and commit the changes before doing more refactoring.
  3. Use "search/replace" tools (`sed`) to help with rewriting common patterns.  Make sure to follow #2 when doing this.
  4. Don't re-organize the code until after most of the unsafe code has been rewritten.  This will allow easier side-by-side comparison with the original C code.
  5. c2rust expands macros and constants from `#define`. Being able to do side-by-side comparison with the C code will help with adding constants back in and replacing expanded code with Rust macros or just normal Rust functions (see the sketch at the end of this comment).
[0] https://github.com/Neopallium/openjpeg/tree/master/openjp2-r...
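To make item 5 concrete, here is a tiny sketch of that cleanup; the constant name and value are invented for illustration, not taken from OpenJPEG.

    // c2rust expands `#define`d constants at every use site, so the
    // transpiled code is littered with bare numbers. Reintroducing a
    // named constant keeps the Rust diffable against the original C.
    const MAX_RESOLUTION_LEVELS: usize = 33; // hypothetical name and value

    fn allocate_resolutions() -> Vec<u32> {
        // was: vec![0; 33] in the raw transpiler output
        vec![0; MAX_RESOLUTION_LEVELS]
    }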


I took the insertion_sort impl from the bottom of the post and asked GPT-4 to rewrite it into idiomatic Rust:

    pub fn insertion_sort(n: i32, p: &mut [i32]) {
        for i in 1..n as usize {
            let tmp = p[i];
            let mut j = i;
            while j > 0 && p[j - 1] > tmp {
                p[j] = p[j - 1];
                j -= 1;
            }
            p[j] = tmp;
        }
    }

    fn main() {
        let mut arr1: [i32; 3] = [1, 3, 2];
        insertion_sort(3, &mut arr1);
        // …
    }
I guess if this actually works, we can translate massive amounts of internal C libraries into human readable Rust... good stuff.

(funnily enough, passing in the "original" code without the `unsafe extern "C"` part makes it produce the exact same output as the above)


I still don't believe GPT is built for this. There's too big a risk that it will fill in another implementation from its training instead of adapting the input.

Here, who says the idiomatic translation is not .sort()? It should use the stdlib.


I say the idiomatic translation of a function named insertion_sort() should not use the stdlib sort. The algorithm used by Rust's stdlib sort_unstable, for instance, is pdqsort, not insertion sort.

In general translations should stay as close to the original as possible, while eliminating any possibility of segfaults.


That's a literal translation, not idiomatic Rust - in my opinion. It all depends on what the semantics of the program should be.


I wonder if there could be a synthesis of traditional testing / verification / compiler technology that would help in filtering for correctness. Like property/fuzz testing that automatically checks for deviations between the translated and original code by sampling the input space? Or symbolic execution that does the same. And also ask GPT to find a difference in semantics, and verify its answer to check for hallucination.
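As a sketch of the property-testing idea, assuming the proptest crate and hypothetical names (the extern binding stands in for the original C, the `_rs` function for the translation): generate random inputs, run both versions, and assert that they agree.

    use proptest::prelude::*;

    extern "C" {
        // Hypothetical binding to the original C implementation.
        fn insertion_sort(n: i32, p: *mut i32);
    }

    // Translated Rust version under test (stand-in body for the sketch).
    fn insertion_sort_rs(p: &mut [i32]) {
        p.sort();
    }

    proptest! {
        #[test]
        fn translation_matches_original(v in prop::collection::vec(any::<i32>(), 0..64usize)) {
            let mut c_out = v.clone();
            let mut rs_out = v.clone();
            unsafe { insertion_sort(c_out.len() as i32, c_out.as_mut_ptr()) };
            insertion_sort_rs(&mut rs_out);
            prop_assert_eq!(c_out, rs_out);
        }
    }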


Has anyone put this to serious use? I played around with it at some point when it was fairly new and at that time I was able to transpile the C into Rust just fine, but that didn't help me much. The idea was to be able to use the Rust toolchain to better understand the code, but the resulting Rust code was even less understandable, and also much harder to refactor. In this case I wasn't attempting a rewrite per se, just trying to understand a C codebase plagued with memory safety issues. Quickly gave up on this avenue at that point and just started carefully refactoring the C to make the bugs easier to shake out.

Would love to see a technical write up of someone outside Immunant using this on a real world codebase for whatever purpose.


> In this case I wasn't attempting a rewrite per se, just trying to understand a C codebase plagued with memory safety issues

I think this is your problem; to my understanding it's not really the point of the project. The resulting code is meant to be something you can gradually refactor, not something that's immediately better or more understandable. Even if a given piece of code is harder to refactor, it's still important on a large pre-existing project to be able to immediately switch over to the new toolchain all at once, without having to manually refactor/rewrite all of the code first.


Well, the hope was that maybe the work to get the transpiled code to the state where I could do some borrow-checking might somehow be less than just having a go at the C itself directly, but yeah, no cigar back then at least.

And I wanted an excuse to play around with rust some more :D


Conditional on your definition of "serious", I did: https://github.com/64kramsystem/catacomb_ii-64k. I essentially don't do technical writing anymore (and I had the impression that this topic isn't generally considered interesting); however, my considerations are:

1. there are three levels of refactoring: removing the extensive (unbearable, to be honest) boilerplate that C2Rust introduces; converting the design from "C with Rust syntax" to safe Rust; converting the design from unidiomatic Rust to idiomatic Rust

2. as another poster pointed out, for non-trivial projects, writing refactoring tooling is a must (to remove the C2Rust boilerplate), in order to perform step 1

3. the difficulty of design refactoring (step 3) depends on the source code design; the code I worked with was relatively hard to refactor, as it was old(-school), in particular lots of globals (see the sketch after this list); the difficulty was caused by the typical freedoms that C gives and Rust doesn't (in other words, the very obvious design differences between C and Rust); somebody did a C to Rust port of (I think) Zstd, which is a modern codebase, and I think much easier to work with (also because of fewer, or possibly no, external dependencies)

4. regarding the code understanding, if one performs the translation in the three-steps mentioned in point 1, at the end of step 2, one has effectively a safe Rust codebase, "just" unidiomatic

5. in terms of quantity of changes (but not time spent), it's possible to perform the bulk of step 3 with rather local thinking (understanding), but of course, most of the time spent is on major design changes

6. aside from a few exceptions, I was able to perform the conversion in self-contained steps, which is very good news for this type of work. Even better, it's possible (though it's a niche case) to port an SDL project by using the C library and the Rust one at the same time!

7. however, I can imagine projects like Wolfenstein 3D being very hard to port, since things like custom memory allocators are hard to port

99. most important of all: just converting to Rust will quickly (even immediately) find bugs in the source; I've found approximately four bugs in the source code, including one by Carmack!
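To illustrate point 3, here is a hedged sketch of the globals problem; the names are invented, and the "transpiled" form only approximates c2rust's style rather than being verbatim output.

    // Transpiled style: the C global survives as a `static mut`, and
    // every access has to go through `unsafe`.
    static mut SCORE: i32 = 0;

    fn add_points_transpiled(points: i32) {
        unsafe { SCORE += points };
    }

    // Refactored style: the global moves into a state struct that is
    // passed explicitly, so the borrow checker can reason about it.
    struct GameState {
        score: i32,
    }

    fn add_points(state: &mut GameState, points: i32) {
        state.score += points;
    }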

All in all, I find this tool great, but somebody needs to work on refactoring tools, and C2Rust's output must be improved in order to be found usable by the public.


By serious I just meant any real-world codebase at all. A full game, even if an old, smaller one, is way more than I expected anyone to have done!

Definitely will thumb through the git history to get an idea of the refactoring efforts.

Thanks a bunch!


Transpiled code… way back in the day we had Fortran machine-converted to Ada. While it worked, it was unreadable and not maintainable. Adatran, we called it. Hopefully they do better now... but from your experience it is the same.


Though to be fair, the last time I tried it, it was brand new and just barely successful at transpiling at all. Little to no work of the type detailed in TFA had been done yet.


C2rust is really cool, but if you're familiar with writing rust and run even a trivial C function through it, it produces something absolutely terrifying. I really enjoy rust and pray I don't find myself working in a code base someone just ran c2rust against.


Isn’t the point to generate semantically equivalent Rust code from C, so that you can just get it re-compiling under Rust, and then from there you have a working base from which to start rewriting into safer Rust?


Yes, it’s literally spelled out in TFA:

> this provides a starting point for manual refactoring into idiomatic and safe Rust


This is true, but it still generates a (very) large amount of boilerplate and stylistically suboptimal code. Examples:

- the base unit is the individual C file, which causes structs and symbols to be duplicated across Rust modules

- for loops are translated to while loops with wrapping additions, which is ugly and unnecessary in pretty much every case (this makes sense semantically, but it could be used only when necessary, not as a general strategy); see the sketch at the end of this comment

- variables are declared at the top of the functions (AFAIR)

C2Rust generates code that requires significant refactorings _before_ semantic (C->Rust) translations - as a matter of fact, they had a refactoring tool, but it's been temporarily deprecated.

It's a fantastic tool, but as of now, it requires developers to write their own refactoring tools.
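A rough sketch of the loop-style contrast described in the list above (illustrative of the transpiled style, not verbatim c2rust output; names are made up):

    // Transpiled style: C's index-and-increment loop survives, with
    // wrapping arithmetic and raw pointer offsets.
    unsafe fn sum_transpiled_style(p: *const i32, n: usize) -> i64 {
        let mut sum: i64 = 0;
        let mut i: usize = 0;
        while i < n {
            let v = unsafe { *p.add(i) };
            sum += v as i64;
            i = i.wrapping_add(1);
        }
        sum
    }

    // Idiomatic equivalent after refactoring: a slice and an iterator,
    // with the bounds carried by the type instead of a separate length.
    fn sum_idiomatic(data: &[i32]) -> i64 {
        data.iter().map(|&x| x as i64).sum()
    }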


Since this is DARPA, this shows they are interested in rust, which means we will probably have strong toolchain certifications coming eventually, pushing rust even further into the category of "the language you want to use for serious stuff".


This seems like an interesting project to bridge the gap, as opposed to the "boil the ocean" approach of rewriting in Rust wholesale.

(For anyone else who found it slightly difficult to read, you can remove the added 0.06em `letter-spacing` using your browser's developer tools.)


I'm very excited at the possibilities for C2Rust! Dynamic analysis to fill in the gaps of static analysis makes a lot of sense. I've wanted something similar for inferring TypeScript types via runtime analysis (would not be surprised if it exists already).

I could see a really compelling use case in cross-compilation where you compile your C code to Rust, then use a Rust toolchain to cross compile. Or avoiding interop as well.


What problem does c2rust solve, exactly? Isn't it just gonna produce "garbage" Rust?

Calling c directly is already possible in rust.


From c2rust.com:

The C2Rust project is being developed by Galois and Immunant. This tool is able to translate most C modules into semantically equivalent Rust code. These modules are intended to be compiled in isolation in order to produce compatible object files. We are developing several tools that help transform the initial Rust sources into idiomatic Rust.

The translator focuses on supporting the C99 standard. C source code is parsed and typechecked using clang before being translated by our tool.


This isn't about calling external C code from Rust; it helps people "rewrite" their C code in Rust.

You can debate the merits of doing so, of course, but some people do want to do that, and a tool to generate safe, somewhat idiomatic Rust from C code would seem to be useful.


It moves the project directly into rust land and tooling, which hopefully makes it easier to convert without needing to set up multi-language tooling and a moving barrier / interface between the two languages.


The post does address this and shows their attempt to produce higher quality Rust. I've also seen it used to move off of a C toolchain and onto a pure Rust toolchain by porting C code to Rust.


The article shows what improvements they are thinking of so that it doesn't produce garbage rust. (If by garbage rust you mean unsafe rust.)


From reading the article, I get that the latest version can transform some C into safe Rust.

This gains us machine-proved memory safety. This is huge.


It helps by lowering the barrier to entry when working on rewriting a codebase in rust.


It makes it easier to get your project on the front page of HN as you can claim it is written in Rust.


I don't know this particular tool, but some automated language-to-language transpilers I've seen produce code one would not be able to comprehend, never mind edit, if the need comes.


The goal of C2rust is not to provide a usable code base per se; it's to provide a convenient base for conversion: once the project is in unsafe rust it can be managed entirely via rust tooling, and is hopefully a lot easier to finish up than if you keep having to redefine bindings as you move code from C to Rust.

C2rust is a springboard; if you move c2rust-ed code to production you're doing it very wrong.


On the other hand, if I have some working C dependency which I never intend to modify (owing to its complexity or stability), plopping in the autogenerated Rust code simplifies the build step.

Not that it’s a good idea, but I could see a scenario where it would be worthwhile.


>"never intend to modify"

This is the ideal state of affairs, but sometimes reality can interfere with our intentions.


There are working projects out there whose source code is (at least partially) the output of tools like f2c [which converts Fortran to C].


Did I say move it to production? My point was that the generated code would be way more difficult to understand / modify than the original C.


> Did I say move it to production?

It was implied in your comment.

The point of C2rust is that the artifacts it generates are extremely transient: they don't get modified, they get replaced, and if the rustified version is awkward or hard to grok you just get the C version from Git to validate your understanding, because it's the exact same thing.


I wish they'd revive their refactoring tool - it was abandoned during the toolchain upgrade. Without it, converting the code becomes much more tedious.


Can confirm that the output of c2rust currently suffers because the refactoring tool needs to be revived. We will get to it eventually, but we don't have a firm date :/


I am very curious to see how these transpiler problems will be handled by GPT-4 in the upcoming months.


It'll handle the simple cases amazingly, and will handle edge-cases by producing wrong code: hopefully obviously-wrong, but subtly-wrong in at least some cases. A prompt will be written and honed and evolved, and tooling will be built to post-process GPT-4's output, and so the accuracy will rise – but still with no correctness guarantees.

When it goes wrong, the advice will include "write better comments, so the transpiler knows what you're doing". Proponents will liken this to type-linting comments. Critics will liken this to INTERCAL / p-hacking / tax fraud, and will claim that the transpiler can be misled by confusing comments. Proponents will show you that GPT-4 can identify misleading comments in the critics' examples. Critics will say "real code won't contain comments like that, so this ability is useless". Proponents will say "oh, yeah, that too I guess". Critics will promptly vanish in a puff of logic.

The manually-written tool will get better: more slowly at first, but more steadily, and with only a few (predictable, fixable) correctness bugs. Eventually, it will be able to correctly process more programs than the leading GPT-4 approach can. It will be months before anyone notices this, since the two camps (manual approach, GPT-4 approach) will not really be talking to each other enough.

Eventually, somebody will write a blog post about a semi-obscure but representative benchmark (perhaps the Linux kernel), pointing out that the manual tool works better now. There will be a brief wave of hype about the "new tool" and the "death of AI". Then some people will fine-tune the model on tricksy cases using the manually-written tool's output, some other people will call that utterly cheating, and the hype will give way to bickering.

Realistic? Well… GPT-4 is proprietary, and we've got more efficient LLM architectures now – but I think the sorts of people to make a tool like this will probably stick with OpenAI's APIs. (It's In The Cloud.™)



