SuperC: Parsing All of C by Taming the Preprocessor [pdf] (2012) (paulgazzillo.com)
95 points by g0xA52A2A 10 months ago | hide | past | favorite | 14 comments



This is way over my head, but I was reminded of "The C language is purely functional" by Conal Elliott: http://conal.net/blog/posts/the-c-language-is-purely-functio...



Has anyone already integrated it as a VS Code extension?


Figure 1 spoke to me. It's an expanded syntax tree that branches depending on the value of a preprocessor definition "CONFIG...X". I've often found myself doing the kind of code archeology that this paper seems to be trying to automate: exploring all the configuration possibilities implied by the codebase / build system. A C program that makes heavy use of the preprocessor is generally harder to grok for both humans and static analysis because 1. the C preprocessor syntax is different from C, 2. the inputs are not necessarily bounded by what appears in the source files alone ("-DCONFIG...X=foo" passed in from the build system), and 3. the resulting program and its control flow may be quite different depending on preprocessor options. As a simple example, embedded systems often define an "ASSERT(X)" macro as either a no-op, an infinite loop, a print statement, or the like.

This is definitely a niche space, but I see a clear use for large, portable, and configurable C codebases (e.g. the Linux kernel, FreeRTOS): providing better visibility into the configuration system.


You may be interested in unifdef, which selectively evaluates and removes ifdefs.

https://dotat.at/prog/unifdef/

I used it once at work for a niche use case. Its main use seems to be simplifying platform-specific code in legacy codebases when you drop support for old platforms.


It seems that the use of macros/#ifdef (in any language, not just C) bifurcates into two distinct use cases:

1) Platform/Processor/OS configuration/build use-cases.

(and)

2) All other use-cases that are not directly related to #1.

In other words, if you're a future language designer and you design a macro system for your language, you might wish to distinguish between configuration/platform/build related macros -- and other macros not directly related to build and configuration...

Doing that would allow one set and/or the other set to be selectively and easily evaluated back into the non-macro source of the base language -- depending on what is desired by the language user...

Anyway, an excellent link!


Fwiw, ~20 years ago my experience was that preprocessor use in open-source C code was very idiomatic, and IIRC a simple backtracking parser with idioms was sufficient to parse everything I tried it against, including the Linux kernel.


By the way, GNU Bison implements generalized LR (GLR) parsing via what might be called "fork-merge LR". The documentation states that Bison's GLR algorithm resolves ambiguities by forking parallel parses, which then merge. It's not the same as forking on a preprocessor conditional, but worth mentioning.
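A minimal grammar fragment showing where the forking happens; this is illustrative, not a complete .y file (no lexer, no actions):

```
/* With %glr-parser, Bison accepts this ambiguous grammar: on input
   like "n + n + n" it forks the parse stack at the shift/reduce
   conflict (left vs. right grouping) and merges the stacks once
   they reconverge. */
%glr-parser
%token N
%%
expr: expr '+' expr
    | N
    ;
```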


Based on the title alone, I can't tell what specific problem "parsing all of C" solves when the preprocessor is apparently left intact by design:

    static int mousedev_open(struct inode *inode, struct file *file)
    {
        int i;

    #ifdef CONFIG_INPUT_MOUSEDEV_PSAUX
        if (imajor(inode) == 10)
            i = 31;
        else
    #endif
        i = iminor(inode) - 32;

        return 0;
    }
    (b) The preprocessed source preserving all configurations
and my experience with C is that there are an untold number of "unbound" tokens that are designed to be injected via -D or auto-generated config.h files, so presumably this works closer to the "ready for compilation" phase than something one could use to make tree-sitter better (as an example)


This looks really useful, but it seems like an uphill battle even to reproduce it, given the lack of updates in almost a decade.


Do you mean getting it to run on modern JVMs or that the C used in the kernel has drifted such that the technique would no longer apply?


I mean that it's often difficult to take older projects like this and properly replicate the environment with all dependencies.

The point about the C used in the kernel drifting could be relevant, but with this sort of project I'm more concerned initially with reproducing results and then moving forward.


> In exploring configuration-preserving parsing, we focus on performance.

Why, because this goose is so thoroughly cooked that all that is left is optimizing for speed?

There is a lot of misplaced focus on performance in CS academia, and also in software.

Suppose we have some accurate tool that does something useful with a C program, but it takes 5 minutes to run instead of 5 seconds. So what? Someone still wants to use it. Suppose the program is used by millions of people, and that 5 minute run only has to be repeated half a dozen times during development.

Getting it right and getting it into people's hands should be the priorities, and not necessarily in that order.


This is (2012). I don't see that it has been discussed before here though. I guess it didn't make much of a splash.



