C4: C in Four Functions (2014) (github.com/rswier)
436 points by azhenley on Feb 18, 2020 | 131 comments



I unironically love this code. We used it as a starting point (or rather, the same design; we wrote in Go) for reversing challenges at our last startup, symbolically evaluating the stack machine bytecode to generate AVR.

The trick to reading it:

* next() is the lexer

* expr() is a precedence-climbing expression parser

* stmt() parses statements and generates code

* main() has the virtual machine loop.
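
For readers who have not met the term, precedence climbing fits in a few lines. The following is a minimal standalone sketch, not C4's code: the names `eval_expr`, `climb`, `parse_primary` and `prec` and the five-operator set are assumptions made for illustration, and it evaluates directly where C4 emits VM instructions.

```c
#include <ctype.h>
#include <stdlib.h>

static const char *src;          /* cursor into the input text */

static void skip_ws(void) { while (isspace((unsigned char)*src)) src++; }

/* Binding strength of a binary operator; 0 means "not an operator". */
static int prec(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': case '%': return 2;
    default: return 0;
    }
}

static long climb(int min_prec);

/* A primary is a decimal number or a parenthesized expression. */
static long parse_primary(void) {
    skip_ws();
    if (*src == '(') {
        src++;
        long v = climb(1);
        skip_ws();
        if (*src == ')') src++;
        return v;
    }
    char *end;
    long v = strtol(src, &end, 10);
    src = end;
    return v;
}

/* Consume operators whose precedence is at least min_prec. */
static long climb(int min_prec) {
    long lhs = parse_primary();
    for (;;) {
        skip_ws();
        char op = *src;
        int p = prec(op);
        if (p == 0 || p < min_prec) return lhs;
        src++;
        long rhs = climb(p + 1);   /* p + 1 keeps operators left-associative */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs = rhs ? lhs / rhs : 0; break;   /* sketch: x / 0 yields 0 */
        case '%': lhs = rhs ? lhs % rhs : 0; break;   /* sketch: x % 0 yields 0 */
        }
    }
}

long eval_expr(const char *text) { src = text; return climb(1); }
```

Calling `eval_expr("1 + 2 * 3")` recurses into `climb(2)` for the right-hand side of `+`, so the multiplication binds first.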


C4 really has some good ideas of its own.

* It never bothers to produce machine code. The use of a stack-based VM simplifies the single-pass compilation, and it never makes use of unknown library functions, so no machine-specific knowledge is required (like dlsym in Bellard's OTCC [1]).

* It chose its primitives wisely. It never implements structs and returning with values, for example, but the code is structured so carefully that the lack of them doesn't make it harder to read.

* And yet it has tons of little tricks. Switching from r-value to l-value (triggered when infix `=` or postfix `++`/`--` are read) is a single opcode fix. Reserved words are initialized from an imaginary source code. The type is represented by a single number 2n+k where n is the number of indirections.

[1] https://bellard.org/otcc/
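
The `2n+k` type encoding can be sketched as follows. The enum values (`CHAR = 0`, `INT = 1`, `PTR = 2`) are an assumption chosen to match the formula, and the helper names are hypothetical, not taken from C4.

```c
/* k is the base type, PTR is the step added per level of indirection. */
enum { CHAR, INT, PTR };          /* assumed values: CHAR = 0, INT = 1, PTR = 2 */

int ptr_to(int ty)       { return ty + PTR; }   /* T  -> T*                  */
int pointee(int ty)      { return ty - PTR; }   /* T* -> T, valid if ty >= PTR */
int indirections(int ty) { return ty / PTR; }   /* n in 2n+k                 */
int base_type(int ty)    { return ty % PTR; }   /* k: CHAR or INT            */
int is_pointer(int ty)   { return ty >= PTR; }
```

With this layout `int **` is `2*2 + 1 = 5`, and taking the address of a value or dereferencing it is a single addition or subtraction on the type number.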


> Reserved words are initialized from an imaginary source code.

What does it mean? Could you point to a place in the source?


I mean this part:

    p = "char else enum if int return sizeof while "
        "open read close printf malloc free memset memcmp exit void main";
    i = Char; while (i <= While) { next(); id[Tk] = i++; } // add keywords to symbol table
    i = OPEN; while (i <= EXIT) { next(); id[Class] = Sys; id[Type] = INT; id[Val] = i++; } // add library to symbol table
    next(); id[Tk] = Char; // handle void type
    next(); idmain = id; // keep track of main
Throughout the entire source code p is a source code pointer, but at the very beginning of the program it is a string containing all reserved words and library functions, and they are read with the same lexing function `next` to the symbol table before the memory for the actual source code is allocated.


I think the beauty of it is the "that's it!?" feeling you get once you've looked through all ~500 lines --- based on the amount of functionality one would expect it to be much longer; but it is unexpectedly short, yet complete.

For a just-as-interesting "sequel", look at C4x86: https://github.com/EarlGray/c4


In practice, it's not "~500 lines". You have whole control-flow statements with several semicolon-separated statements in their body, all crammed on a single line. The short variable names allow for the physical line not becoming too long and there is even some symmetry to it, but it's not like each line is a single statement with many nested subexpressions.

The real size of this code, after putting each statement on its own line, would be on the order of 300% of that.

Imagine this:

    if (tk == Mul) { next(); *++e = PSH; expr(Inc); *++e = MUL; ty = INT; }
Turning into this:

    if (tk == Mul) {
        next();
        *++e = PSH;
        expr(Inc);
        *++e = MUL;
        ty = INT;
    }


The line count isn't not "real" just because it isn't how a mindless autoformatter would do it. The formatting conveys actual information. A line expresses one "thought". Laying it out horizontally allows the vertical direction to be used to visually convey the repeating pattern

    else if (tk == Mul) { next(); *++e = PSH; expr(Inc); *++e = MUL; ty = INT; }
    else if (tk == Div) { next(); *++e = PSH; expr(Inc); *++e = DIV; ty = INT; }
    else if (tk == Mod) { next(); *++e = PSH; expr(Inc); *++e = MOD; ty = INT; }
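
To see what the emitted `PSH ... MUL` sequence does at run time, here is a minimal sketch of an accumulator-plus-stack machine. The opcode numbering, the `run` and `mul_demo` helpers, and the fixed 64-slot stack are assumptions for illustration; the sketch does no bounds checking.

```c
enum { IMM, PSH, ADD, MUL, EXIT };   /* tiny subset; numbering is an assumption */

/* Run code until EXIT (or an unknown opcode) and return the accumulator. */
long run(const long *pc) {
    long stack[64];
    long *sp = stack + 64;           /* stack grows downward */
    long a = 0;                      /* accumulator */
    for (;;) {
        long op = *pc++;
        if      (op == IMM) a = *pc++;       /* load immediate operand */
        else if (op == PSH) *--sp = a;       /* save accumulator on the stack */
        else if (op == ADD) a = *sp++ + a;   /* pop left operand, add */
        else if (op == MUL) a = *sp++ * a;   /* pop left operand, multiply */
        else return a;
    }
}

/* Emit what the Mul branch would produce for lhs * rhs, then run it. */
long mul_demo(long lhs, long rhs) {
    long code[8];
    long *e = code;                  /* *++e writes start at code[1] */
    *++e = IMM; *++e = lhs;          /* left operand lands in the accumulator */
    *++e = PSH;                      /* ...and is pushed */
    *++e = IMM; *++e = rhs;          /* stands in for expr(Inc) parsing the right side */
    *++e = MUL;
    *++e = EXIT;
    return run(code + 1);
}
```

The left operand is pushed before the right operand is evaluated, so every binary operator ends up with the same "pop, combine with accumulator" shape, which is why the quoted lines differ only in the final opcode.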


You know what else would communicate that? A function or macro.

    else if (tk == Mul) { applyOperator(MUL); }
    else if (tk == Div) { applyOperator(DIV); }
    else if (tk == Mod) { applyOperator(MOD); }
But then you're not conforming to their arbitrary idea of "minimalism = fewer functions".

I definitely have some admiration for their picking a goal and following through on it, and there are a few tricks in there that are downright brilliant, but let's not pretend this is about effective communication.


I can agree that there's a repeating pattern in handling each of the binary operators (they are around a dozen), but I'm failing to see a pattern in fragments like these:

      if (tk == ']') next(); else { printf("%d: close bracket expected\n", line); exit(-1); }
      if (t > PTR) { *++e = PSH; *++e = IMM; *++e = sizeof(int); *++e = MUL;  }
      else if (t < PTR) { printf("%d: pointer type expected\n", line); exit(-1); }


This code is implementing pointer indexing: p[i]. The first line is reading the closing `]`. The second line is computing the pointer offset, which is equal to `i * sizeof(int)` when the element is wider than a char (`t > PTR`); for a `char *` the index is left unscaled. The third line is producing an error if `p` does not have a pointer type.

I think I agree with you that this part could be refactored a bit. I would be tempted to put the "PSH" corresponding to the "i" next to when we parse the "i". I would also write the check that "p" has a pointer type before the code that indexes it.

    else if (tk == Brak) {
      next(); *++e = PSH; expr(Assign); *++e = PSH;
      if (tk == ']') next(); else { printf("%d: close bracket expected\n", line); exit(-1); }
      if (t < PTR) { printf("%d: pointer type expected\n", line); exit(-1); }
      *++e = IMM; *++e = (t > PTR) ? sizeof(int) : sizeof(char); *++e = MUL; *++e = ADD; // char * is scaled by 1
      *++e = ((ty = t - PTR) == CHAR) ? LC : LI;
    }
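
The address arithmetic behind `p[i]` can be written out in plain C. This is a sketch with hypothetical helper names; it assumes, as the quoted snippet does, that every element other than a char occupies `sizeof(int)`.

```c
#include <stddef.h>
#include <string.h>

/* Byte offset of element i: char elements are not scaled, everything else
   is scaled by sizeof(int), mirroring the t > PTR test in the quoted code. */
size_t index_offset(size_t i, int elem_is_char) {
    return elem_is_char ? i : i * sizeof(int);
}

/* Load p[i] from an int array through the explicit address computation. */
int load_int(const int *p, size_t i) {
    int v;
    memcpy(&v, (const char *)p + index_offset(i, 0), sizeof v);
    return v;
}

/* Load p[i] from a char array; the offset is just i. */
char load_char(const char *p, size_t i) {
    return *(p + index_offset(i, 1));
}
```

`load_int` corresponds to the scaled add followed by a load-int instruction, and `load_char` to the unscaled add followed by a load-char instruction.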


This is why I mostly hate (but partly love for the sake of reducing bikeshedding) it when teams add an autoformatter as a mandatory part of a code pipeline -- it destroys relevant spatial information.

If we are going to force autoformatters, we might as well just use annotated ASTs instead of text so we all see our own chosen view of the code.


The number of lines isn't really the point; it's C in 4 Functions, not C In 500 Lines.

What's more impressive is that it's self-hosted and implements just the subset of C required to compile itself, which makes it harder to keep the code short, but it manages anyways.


Splitting on non-line-end semicolons (other than in literals) gives you a functioning program of 942 lines, 62 of which are either blank or comments.
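
One way such a count could be made is sketched below. This is a guess at the method, not the commenter's actual script: `split_line_count` is a hypothetical name, and the sketch does not special-case comments or `for (;;)` headers, so an apostrophe inside a comment would confuse its literal tracking.

```c
#include <stddef.h>

/* Count lines after breaking at every ';' that is outside a string or
   character literal and is followed by more text on the same line. */
size_t split_line_count(const char *s) {
    size_t lines = 0;
    char quote = 0;                  /* active literal delimiter, or 0 */
    int line_has_text = 0;           /* a final line without '\n' still counts */
    for (; *s; s++) {
        if (quote) {
            if (*s == '\\' && s[1]) s++;         /* skip the escaped character */
            else if (*s == quote) quote = 0;
            line_has_text = 1;
        } else if (*s == '"' || *s == '\'') {
            quote = *s;
            line_has_text = 1;
        } else if (*s == '\n') {
            lines++;
            line_has_text = 0;
        } else if (*s == ';') {
            const char *t = s + 1;
            while (*t == ' ' || *t == '\t') t++;
            if (*t && *t != '\n') {              /* more text follows: break here */
                lines++;
                line_has_text = 0;
                s = t - 1;
            } else {
                line_has_text = 1;
            }
        } else {
            line_has_text = 1;
        }
    }
    return lines + (line_has_text ? 1 : 0);
}
```

Applied to the single `if (tk == Mul) { ... }` line quoted earlier in the thread, this counts six lines, since the opening brace stays on the `if` line with `next();`.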


This is fun to look at and brings back the fond memories of compiler construction and grad school. But the project could use a better overview page. It is not clear from the description if this is a plastic explosives building guide, a C compiler, a decomposition of food supplements or something else...


I mean I guess, it's only a small subset of C though (no floating point or structs even):

  $ cc float.c && ./a.out && ./c4 -s float.c && ./c4 float.c && echo 'struct { int a; } s;' >struct.c && cc -c struct.c && ./c4 -s struct.c
  hello, world 1.250000
  1: #include <stdio.h>
  2:
  3: int main()
  4: {
  5:   printf("hello, world %f\n", 1.25);
      ENT  0
      IMM  83656704
      PSH
      IMM  1
      PSH
      IMM  25
      PSH
      PRTF
      ADJ  3
  6:   return 0;
      IMM  0
      LEV
  7: }
      LEV
  hello, world 0.000000
  exit(0) cycle = 13
  1: bad global declaration
  $


So not really C in 4 functions then. Just some mini language that’s a subset of C, in four functions. That’s a much simpler target to reach, I guess.


Yeah, it's quite nice. The only thing that bothers me is that it doesn't do bounds checking.


Ooh explain more about your challenges?


Starfighter was shut down but you can Google it.


I really like this, but I wonder why somehow "four functions" has to imply "needlessly cryptic variable names". Is that just part of the exercise in minimalism, i.e. art?


I found the code surprisingly readable.

Most variable names are what I expected them to mean despite their shortness: pc, sp, bp are registers, a is the accumulator, fd is a file descriptor (of the input file, what else?), tk is for the token, t is temporary, etc... For the less obvious ones, it is usually not that hard to infer their meaning from either the code or comments.

Because yes, there are comments, not many, but they are helpful. For example, the VM has unusual instructions (for me) like LEV and ADJ, and they are commented. The "obvious" ones like MUL and SHR are not.

The variable names are not "needlessly cryptic". I've seen (and written, not proud of it) a lot of needlessly cryptic variable names, and believe me, these are crystal clear by comparison. Here, there is a clear influence from assembly mnemonics that really helps understanding.

Now on the why. This is minimalism, and minimizing the number of comments and variable name length is part of it. It is actually a very interesting exercise. The golden rule in making understandable code is making it as short as possible. There is a limited amount of space on your screen and in your mind, and the shorter your code is, the more you can see/understand at once. Of course, too much is too much, you don't want to do things IOCCC style, and striking a balance is difficult. So once in a while, reading or writing very compact code can help you understand where shaving off characters is fine and where it really hurts understanding.


> The golden rule in making understandable code is making it as short as possible.

Absolutely not true, in any way.

And I fail to see a single reasonable argument for why writing "tk" should be better than writing "token". Just name things using real words. Cryptic abbreviation does not help anyone.


You will mentally read "token" every time you see "tk" anyway.

The same way you "grep", "ls" or "cd" --- it easily becomes as natural as any other language, unless you consciously try to stop yourself from learning.


No, you won't. You'll read "tk", and then do a little mental substitution to "token". It's tiny, but it adds to your mental load.

You could just not do that, and be happier. Why add the extra indirection?


I do agree that `tk` is an unusual abbreviation, but `tok` for `token` is pretty common in language implementations. `ty` for `type` is also usual.

The names are not only read but also manipulated in your mind. You will see tons of `token`s and `type`s throughout the code and while you can read them fine you will have a hard time dealing with them in your mind. For uncommon names that are usually read out of context, longer names are preferred. For common names or local enough names, short mnemonics that evoke the original name really help. When that is not possible, people usually develop a specialized terminology for them.


I don't have any problem holding properly named variables in my mind, so I don't see what the problem is here?

The only issue is that if you are writing gigantic statements with lots of variables, it gets too long to see properly. And that is a sign that you are writing too large expressions, and need to start cutting them into smaller parts anyway.


You don't always replace X and Y (as in coordinates) with something longer. I know, they can be (and probably should be) replaced with left/right and top/bottom when we are talking about boxes, but not all X and Y can be replaced.


You replace them with abscissa and ordinate actually. And Z will be applicate, it seems, although I have already heard azimuth used for this one. I'm not aware of dedicated terms for higher dimensions, so your beautiful (w,x,y,z) quaternion tuple is a big deal, isn't it.

Anyway, you can always use an array `coordinates` with whatever dimensions you want.


Sure, but those are commonly used names and well established, so they act as actual, proper words in this kind of discussion. There are some widely accepted short names that can still be descriptive.

But arbitrary abbreviations like "tk" don't fall into this category.


You need some basic understanding of 2D graphics to understand what X and Y are. In fact, they are not good names otherwise because it is pretty easy to mix them up---one more reason to prefer left/right/top/bottom whenever possible. I've seen some code using Column and Row instead of X and Y for that very reason. (Not to say that I like it.)

If you are okay with X and Y, you should probably accept that the definition of "commonly used", "well established" and "actual, proper" words is subjective and different areas and projects have different notions of them. I'm okay with `tk` if it is used consistently and doesn't interfere otherwise.


Yes they do. tk, pc, sp, a, etc all fall into the well established category in this context.


The first time you'll read "HN", and then do a little mental substitution to "Hacker News".

Why do humans abbreviate?


Unfortunately, it seems that most of the time it's because they are lazy sheep who don't have a clue what 90% of the abbreviations they use mean, but don't want to look like the only ignorant person in the room by asking, and won't take the time to look them up later, even the 10% they encounter most often.


Bad example. ls and friends are meant to be typed constantly and therefore need to be short.


I read "grep", "ls" and "cd" as "grep", "el ess" and "cee dee". I don't even know what grep and ls stand for.


ls is a shortening of "list", i.e. list directory contents, and grep is apparently Global Regular Expression Print.


grep comes from an amalgamation of commands for a global search applying a regular expression and printing the results, i.e., “g/re/p”.

ls just simply stands for list


No, we won't; "token" is a function name, not a local variable name. If it was written "token", we'd have to substitute it back to "tok" every time we read it.


Once you know “tk” means token, you don’t need it to be “token”.


Absolutely true. C4 doesn't go far enough.

The source of my favorite C compiler, on the other hand:

http://www.kparc.com/b/


This source would look beautiful as pattern on a t-shirt or socks. :-)


Why not just write out ProgramCounter, StackPointer, Accumulator, InputFileDescriptor etc? It doesn't take significantly longer to type. It's faster to read because you can recognize the shape of those words and don't have to mentally substitute the actual words. Code is for reading, not for writing.


Is it faster to read? Maybe in terms of bytes/s, but not in lexemes/s nor in terms of getting an overall idea of how everything works.

Compare with this, which is probably more in the style you're thinking of:

https://github.com/dotnet/roslyn/blob/master/src/Compilers/C...

There's so much "noise" that it's hard to see the "big picture", and the repetition of VeryLongIdentifiers causes https://en.wikipedia.org/wiki/Semantic_satiation to occur quickly.


If you're talking about a small, quickly-written, one-off piece of code, then I think truncated variable names are OK.

If the code is anything that anyone else (including future you) will have to read, or a part of a larger system, then descriptive variable names are best.

I can't count the number of times I've dropped into some source code with variable names that didn't mean anything and with no comments describing what they mean.


If one understands anything at all about a cpu, then pc, sp, a mean something instantly.


Agree on those, but t and tk can mean anything. We can symbolize everything and, as someone else argued, once you know tk is token you can just read it, but replacing variable names with the shortest possible names is still called obfuscation for a reason.

I personally also hate the Java (and to a lesser extent, C#) custom to write MemoryLocationRepresentation when you can say pointer, but there is certainly a middle ground. Token is 5 characters, not 30.


Consider non-native English speakers. For them, the abbreviation makes it much harder to read.


I am not a native English speaker. Reasonably skilled, but nowhere near native.

Abbreviations don't make it harder because of it. If anything, it is less of a problem. Because using the proper English word doesn't help more than using an abbreviation if you don't know the meaning of the English word in the first place.

On a side note, I have more trouble understanding code written in French (my native language) than in English. Simply because when we learn programming, we learn it with the English terms. For example, we know what a "token" is in the context of a "parser", that's what we call it. The French translation would be "symbole" and "analyseur syntaxique" respectively, but you will be better understood if you use the English words.


>Because using the proper English word doesn't help more than using an abbreviation if you don't know the meaning of the English word in the first place.

If you don't know the meaning of an English word, you can use a dictionary. If you don't know the meaning of some ad hoc abbreviation, unless you can waste even more human time by asking people who are already in on the secret, you are left on your own.

> On a side note, I have more trouble understanding code written in French (my native language) than in English.

USA soft power is strong, that's it. It's people's duty to take better care of mastering their own language if they don't want to see it become ineffective for their daily linguistic needs.

People know what a token is in the context of a parser only after they have learned it. When this is not the learner's native language, they will most likely learn it without having a clue of how it makes sense in the semantic network of English. If a French speaker is first introduced to this notion using the term "lexie" (which also exists in English by the way, as a borrowing from French, in linguistics this time), chances are far greater that it will evoke something meaningful to this person, as it's lexically close to the term lexic. Using French morphemes, one could also easily produce terms like métataxeur[1], or even distaxeur and transtaxeur.

>but you will be better understood if you use the English words.

Chances are greater that they will see what you are referring to, as they have already come across the term more often. It doesn't necessarily imply that they will better understand what it means. When a notion is well assimilated, it's recognized in any language mastered, even when it's expressed under a brand new metaphor.

[1] see https://fr.wiktionary.org/wiki/m%C3%A9tataxe and https://fr.wiktionary.org/wiki/-eur


Was there a period in the 1960s or 1970s where French speakers used native terms instead of English for computing terminology?

I'm wondering about this because a Brazilian friend is doing a computer history project and he noticed that 1970s documentation used literal Portuguese translations of English technical terms, and the translations are no longer transparently comprehensible to present-day Brazilians because of the subsequent switch to using the English terminology. For example, the documentation refers to a "montador", and he had to translate that into English for his Brazilian audience ("assembler").


If they don't speak English, it matters even less...

(I've read code written by Chinese --- variables named dzhq, xljn, etc. are not uncommon. If anything, they like to abbreviate even more.)


If they're not fluent in the same abbreviations but have decent English-as-a-second-language skills, they can read Roslyn-style code but not 2-letter abbreviations.

Heck, I can't even read my own 2-letter abbreviations a year later sometimes.

When I write the code, I'm likely coming off reading a paper or datasheet that used certain abbreviations. I might have seen the word "token" so many times in that week so in that moment, I can't imagine what else 'tk' might mean. But it's when I come back a year later off a heat stake project that used K-type thermocouples where seeing 'token' is much clearer.

If those Chinese variables were named DaanZenghQian (sorry, I know my Mandarin sucks) instead of dzhq you might have a chance to translate that into "result of the upper thousands" for whatever that means in your context.

Pretend you're someone who doesn't have exactly the state of mind and background knowledge you have right now. That might be a Chinese person with limited English, it might be your coworker who was working in Delphi instead of assembler in the 90s, it might be yourself with a bit of time elapsed. That's the person who you need to be writing for, not for you in the moment of writing it.


I have read a lot of Korean code, and while there are lots of Latin transliterations, abbreviations were rare.


I, as a non-native English speaker, disagree.


Because e.g. pc and sp are exactly the abbreviations used in assembler for some decades?


We should probably mention that "e.g." is an abbreviation for the Latin "exempli gratia", and means "for example." ;-)


You don't have to 'mentally substitute' the actual words. PC, SP, A, etc. are the words themselves. StackPointer is a pointless formalism.


Could you also provide the meaning for the other (pointless) abbreviations? :)


Because it makes the lines longer and long lines are bad. If it results in a horizontal scrollbar, it is terrible, but even without it, there is a reason papers are often printed in column format and most coding rules specify a maximum line length (often 80, though 120 is becoming popular these days, with big wide screens and all that).

So long lines need to be split. Which is difficult to do properly and results in more lines, and more lines mean less of the code is visible at once and that makes it harder to see the big picture.

But to each his own I guess. Anyway, you can try it out yourself. Just take the code, do the replacements and see for yourself.


Started here: https://github.com/psychoslave/c4

But help would be welcome to retrieve the intended meaning for many of the variable names that were turned to nonsense, be it a comment here, an issue on the repository, a pull request or anything else.


There are often ways of reformatting a line to break it if it's too long that also do not require renaming things. For example, a long list of conditions in an if statement can be broken into one condition per line. Results of comparisons can be put into their own variables. Logic flow can be adjusted and produce the same result. And so on.


Are you reading this on an Apple Watch? I still generally use 80 characters out of habit, but given how monitors have grown, 120 or even 140 should be the new norm.


Adding an extra column for code|docs|other context is so much more useful than allowing longer lines for obese identifiers that rarely serve to make a point more clear.

I'll take my four or five columns of 80 chars over two columns of 120-140 chars any day.


It takes longer to type and read. All the little seconds fiddling with the mouse, popup menus, hand eye coordination wastes your time and prevents muscle memory. It's hard to reach max throughput with long variable names.


Typing speed is absolutely not the limiting factor for programming productivity. If you are actually limited by typing speed, you are doing something very, very wrong.


Besides, this issue has been solved for many years now. Auto-completion in modern IDEs has gotten really good.


Humans don't read words letter by letter; you recognize the whole word pattern. Abbreviations actually slow you down on this point, at least the first few times you encounter each new one. A longer but more familiar term takes less time to read and process.

Autocompletion will rarely ask you to type more than four keystrokes for selecting any arbitrary long term.

Meaningful terms in context often happen to be far easier to grep.

Except for sounding far more impenetrable to the lay man, there is not much left to these H4x0r turns. Of course jargon curse is not a prerogative of CS, this is a common spontaneous social behaviour.


Most of these abbreviations are well established. 'pc', 'sp', and 'a' are the names of those registers in many assembly languages.


To clarify, you don't usually see just "a" for an accumulator, as there is usually more than one accumulator-style register in a CPU, and in many cases they are split along byte (possibly word) boundaries.

So you end up with accumulators called "A" and "B", but are composed of registers "AX" and "AY", and "BX" and "BY", with each being one byte (or word) wide; X and Y being high and low bytes/words of the register (and dependent on endianness too).

Sometimes you even get where multiple registers can be referenced by a singular name - "D" is a popular choice, and may be made up of "A" and "B" (being low/high "registers" of the larger word). IIRC, the 6809 was like this (?) - A and B were 8-bit registers, but could be referenced as a 16-bit word "D" (or maybe I am thinking of the 68k or some other architecture - it's been a long while).

The only other time I have ever seen singular letters used for registers in assembly was for very old pre-microcomputer systems (beasts like the Univac and System/360 - though I think the PDP-8 had similar style). Also some of the very early "microcontrollers" (which were more like glorified sequencers with some extra memory and rudimentary branching, if any) had similar "registers" (Radio Shack once sold, as a part of their "Science Fair" electronic kits, a "Microcomputer Trainer" that was something like a very small 4-bit microcontroller with 128 bytes of memory or something like that - to teach assembler and a bit of hardware interfacing - it had "small" registers like that referred to in single letters).


The 6502 is still in production, and has single-character register names (A, X, Y, P, S).


The 8080 had A, B, C, D, E, H and L. These mostly carried over to the 8085. Newer chips have ax/al/ah, eax, rax type names that grew out of the original names. The Zilog Z80 and Sharp LR35902 were mostly 8080 compatible.

The MOS 6502 has, as gmfawcett said, single-letter names. These in turn carried over to Western Design Center (WDC)'s 65C816. There are actually separate instructions for loading and storing in A, X, Y and Z at least on the '816. LDX, STX, and so on. This means the Ricoh 2A03, Ricoh 5A22, Hitachi 6309, MOS 8501, MOS 8502, and the later MOS 65xx series and the CSG chips. A fun fact is that the 6502 had especially fast access to its zero page memory and special instructions for some functions on that page, the first 256 bytes of RAM. Language implementers sometimes made up for the dearth of registers by treating certain addresses in the zero page as additional registers.

The Motorola 6800 had two accumulators, A and B. The stack pointer was merely S. X is the index register. It also treats the zero page specially. The 68000 series broke with this, having eight address registers a0-a7 and eight data registers d0-d7.

All of the above used A as an accumulator at least by convention in the materials.

SP is the literal name of the stack pointer on x86 in 16-bit mode. It's also used as an alias for R13 in at least some Arm (AArch32 on v7 and earlier for example). SP and PC are the stack pointer and program counter on the PDP-11. It's aliased to r1 on the Intel 80960 (i960) since that is the stack pointer on that platform.

The PDP-8 used similar zero-page tricks to the MOS 6502, only given that it had one (1 !!!) register, that was necessary.

All of these processors were CPUs for commercially successful systems. They might "only" be microcontrollers today.

The MOS 6502 / 6510 and its variant the WDC 65C816 was in the Commodore 64, Commodore PET, the Vic-20, the Apple II, the Atari 2600, the Atari 400/800/600XL/800XL/1200XL/800XE/65XE/130XE, Nintendo Famicom, SuperFamicom, the NES, the SuperNES, BBC Micro, Ohio Scientific Challenger 4, Atari Lynx, Apple III, Apple IIgs, Acorn Atom, Acorn Electron, Franklin Ace, and loads of clones.

The Z80 was in most Amstrad models, in the original TRS-80, the MSX standard, VTech Laser, Intercompex Hobbit, Mattel Aquarius, the Microbee, the NEC PC-6000 & PC-8800 series, Sinclair ZX line & Timex Sinclair, Coleco Adam, and again a bunch of clones.

The Motorola 6809 was in the Tandy Color Computer, while the smaller CoCo MC-10 used the 6803. A few other companies built around this chip family, too.

The Commodore 128 featured both a 6500 series processor and a Z80.

Several of these processors still have versions produced in 2020, although they're not for your main desktop or your phone. Several of them are targets for emulation or new hobbyist software due to the popularity of their platforms. And yes, some of them are used as microcontrollers. Microcontrollers need code written for them, too.


> The golden rule in making understandable code is making it as short as possible.

If that were the case people wouldn't even bother with assembler mnemonics, comments and white spacing. People would write their Javascript / CSS minified from the outset and code golfing would be a best practice rather than a niche activity that some developers do for fun.

I do actually get the point you're trying to make in your post and you do raise some valid points but that sentence is massively overreaching and thus works against you.


Since it all seems so clear to you, would you mind giving a translation of each non-plain-English word used, or a glossary explaining each?

Including: p -> position, pointer?; lp -> location pointer?; bss -> ???; e -> expression? emitted code?; le -> location of emitted code?

Num -> number? (why 128? I guess it's related to ASCII ending at 127); Fun -> function?; Sys -> ???; Glo -> ???; Loc -> location?; Id -> identifier?

[reserved keywords are mostly complete words] Char (charset, sign), Else, Enum (enumeration, roll), If, Int, Return, Sizeof (size of, heft), While

Assign (assignation, peg); Cond (condition, ply); Lor (logical? or, ere); Lan (logical? and, also); Or; Xor (exclusive or, otherwise); And; Eq (equals, dows); Ne (not equals, jars); Lt (lower than); Gt (greater than); Le (lower equal); Ge (greater equal); Shl (shift left, haw); Shr (shift right, gee); Add; Sub (subtract, take); Mul (multiply, time); Div (divide, rive); Mod (modulo, lap); Inc (increment, amp/eke/pip); Dec (decrement, ebb/dip); Brak (break, blow)

Between parentheses I provided a guess, and a real English word that could carry the same meaning, generally in less than four letters.

I didn't go further in the code so far.


> Most variables names are what I expected them to mean despite their shortness: pc, sp, bp are registers, a is the accumulator, fd is a file descriptor (of the input file, what else?), tk is for the token, t is temporary, etc... For the less obvious ones, it is usually not that hard to infer their meaning from either the code or comments.

Maybe you have the right background to channel the author's particular form of abbreviation, but I have no idea what bp is supposed to stand for, despite having read and understood how it's being used in the program. If that name was supposed to communicate something, I am really not sure what it was. Yes, I understand it's probably a reference to some register in some architecture, but that's only effective communication to an audience with a background in that architecture. Even if you know pc = program counter, what exactly does that communicate to someone who doesn't know assembly language?

Of course, using "pc" allows a person with a background in assembly to quickly grok what that variable is, so there are upsides. A best-of-both-worlds approach might be to name it pc and teach people who aren't familiar with assembly language what that means with a comment, like:

    int *pc; // program counter, points to the current instruction
Ask yourself, even if the person knows a = accumulator, what is the accumulator accumulating? When you realize that that isn't even a really sensible question in the context because "accumulating" isn't even really what that variable does, then I have to wonder why you think that's a good name for that variable.

> Because yes, there are comments, not many, but they are helpful. For example, the VM has unusual instructions (for me) like LEV and ADJ, and they are commented. The "obvious" ones like MUL and SHR are not.

> The variable names are not "needlessly cryptic". I've seen (and written, not proud of it) a lot of needlessly cryptic variable names, and believe me, these are crystal clear by comparison. Here, there is a clear influence from assembly mnemonics that really helps understanding.

That's helpful if you know the specific assembly language the author is referencing. But if you don't know assembly language, or if you know a different assembly language, it's not helpful. That SHR instruction you said is obvious doesn't exist in MIPS[1], which is what a lot of assembly beginners will get introduced to. Oh and by the way, which assembly is being imitated isn't documented, so you can't even look that up easily.

The "unusual instructions (for me)" aside is telling: not everyone is you. If your variable names only communicate to you, they don't communicate (an activity that famously involves just one person[2]).

> The golden rule in making understandable code is making it as short as possible.

That's total nonsense. Code becomes understandable when you see it as communication, which starts with understanding who your audience is, and catering your communication to their vocabulary. If your audience has a strong background in x86 assembly, they probably have the vocabulary they need to understand this program. But that is actually a quite narrow audience.

Just to be clear, I don't think this is a bad program. There's a lot to be said for choosing a goal and following through with it, and some of this code is downright brilliant. But effective communication, it is not.

[1] http://inst.eecs.berkeley.edu/~cs61c/resources/MIPS_Green_Sh...

[2] https://www.xkcd.com/1984/


As far as I'm concerned, it could be a single function if they really wanted.


And no comments :'(


I do not think this code needs any. It is very readable and understandable as it is


Clang on my macOS doesn't like the fact that main() takes and returns long long. But it appears GCC on Debian 8 (random linux I have lying around) doesn't mind.

I get it to compile on macOS if I remove the int define but then it segfaults when you run it. I wonder if there is some magic flag to make "main does not return int" a non-fatal error on Clang?

It's fun to read this code but running it is even more fun, you can see the VM code it generates, with source line annotations and all.


The "#define int long long" is annoying indeed. A quick hack to make it work is:

#include <stdint.h>

...and then replace the two "int" in main() with int32_t.


Doesn't that make it fail to maintain the self hosting property that is most likely behind the introduction of this define?


For mac users: gcc -Wno-all -arch i386 -o c4 c4.c

https://news.ycombinator.com/item?id=8560127

You can compile it on 64 bit OS X with clang's -m32 option and it should work.

https://news.ycombinator.com/item?id=8559044


Neither of those work with a recent toolchain :(


I just undef'ed and re-define-d int before and after the main declaration with clang, and it seems to work fine.


Can someone please explain to me what this code does? I know some C, but I couldn't understand this.


A one-pass compiler for a subset of C, relying on a recursive-descent parser, doing the lexing, parsing and code generation in lockstep. The generated code, consisting of abstract machine instructions, is then executed by an instruction fetch and execute loop.

BTW, the code looks relatively short because many semicolon-separated statements are crammed on a single line, and short variable names make that somewhat manageable and even visually symmetric. If you were to unfold it with each statement on its own line, I guess it would be at least 3 times the size.


> A one-pass compiler for a subset of C, relying on a recursive-descent parser, doing the lexing, parsing and code generation in lockstep. The generated code, consisting of abstract machine instructions, is then executed by an instruction fetch and execute loop.

This should be added to the README.


Thanks a lot for the explanation!


AFAIK after taking a quick glance, it takes C code as input, then translates it to an internal format by doing tokenization and parsing, and then executes the code. So, it's an interpreter. You could call it by other names as well. You can also have the program itself as an input, and have many layers of execution.


Considering the code is translated into a kind of simple, non-native opcode, I think it is also correct to call it a VM.


Am I the only one annoyed by the fact that the main loop has a series of if..else instead of a giant switch statement?

But yes, many people rely on parsers and VMs without really knowing how it works and assume some black magic, whereas it can be really simple and elegant.


It is self-hosted so having switch in the code means yet another code implementing switch.


Wouldn't the compiler optimize it either way?


Hmmm, gcc may be smart enough indeed.


This is awesome, less than 500 lines for a self-compiling C compiler!


...and virtual-machine.


Here's a simpler self-compiling C compiler:

  int main(int argc, char **argv) {
  }
Sadly, it doesn't support that much…


Seeing this made my day. There's something calming about seeing C used so...hmm....elegantly?

Well maybe that's not the right word. But the minimalism of it is soothing.


It looks like a great base for a refactoring exercise.


I considered doing that once. Trouble is, how do you test it? It comes with one tiny example in the C subset, plus one substantial one (itself). Its behavior is generally undefined when the input departs from the subset. It seemed like more work to address these issues than the fun you could have in messing with the code.

So I sketched my refactorings without bothering to check or publish them.



Why all the printf copy/pasta around the opcode enum?

Can't the author use pre-processor string concatenation to write once and then leverage that to map enum to string and back?

At least I'm doing that in a little scripting language parser and it looks portable.

Edit: I'm actually using pre-processor string concatenation for something unrelated. You don't even need that to do the mapping.


c4 is meant to be self-hosted and it does not implement a preprocessor. It was easier to re-use printf creatively.


Anything like that for C++? /s


(2014)?



I guess switch/case support is in order.


An earlier version had this, and basic structs too. It was only a little longer, at least in line count, but harder to figure out. It took me four evenings with a printout and a red pen before I was satisfied with my understanding.

(I have mixed feelings about this code: it has both good ideas and pointless obscurity. I guess the newer version would've taken me 'only' three evenings.)


Faster than CPython? ;)


Surprisingly succinct.


hello dot c my old friend, i've come to compile you again...


Thanks for commenting the code. WTF sarcasm


The repo says "An exercise in minimalism." – Looking at the code, there is nothing minimalistic about it. Perhaps in terms of C it can be called that, but in terms of programming languages in general, this cannot be called "minimal" at all.

The code quality is terrible. Few explaining comments (actually most of those are only 1 or 2 words, so not explaining anything at all) and almost all variable names consist of 1 or 2 characters, which do not say what the thing actually is. Then to achieve this arbitrary goal of doing it "in 4 functions" (actually procedures) loads of stuff was apparently stuffed into those 4 functions, so much so that they are longer than a whole page of code. Most of it looks like gigantic switch statements. It's horribly written code. It looks like what I think of as a C nightmare.

I will admit, I could not write such a thing myself. I lack the knowledge for writing such low-level stuff, do not use C, and if I had the knowledge to do it, my inner drive to do a cleaner job than anything remotely looking like that code would prevent me from ever sharing such a thing with anyone in public. At least group cases into procedures as it makes sense. Even people writing this low-level type of code should be aware of how unreadable that code is, right?


> I lack the knowledge

And yet you have a strong opinion about how terrible it is.

It is very minimal and it is very readable. It is actually a joy to read. You have no idea what a C-nightmare really is.


Yep, I can have a "strong" (what is strong?) opinion about it, because I know what readable code in other programming languages looks like and it certainly does not look like that. I did use C a few times in assignments and still my code had more comments and better readable names for basically everything than the code present in the repository.

Wait, you are telling me, that people write even less readable code in C? Perhaps you are right. Perhaps people can really be that much without care.

It still does not make this code "very readable". If it was very readable, I would have a vague idea about what each of the "functions" does from reading its name or its docstring. Oh wait, there is no docstring at the beginning of each of the "functions" and the name consists of one word, sometimes abbreviated word. And the variable names don't give me hints either. I think our definitions of readable code simply differ quite a lot. When I write code myself, I am unwilling to accept code on that readability level, but we probably have different standards.

Perhaps for entertainment purposes only, you could show me a real C nightmare. I do honestly believe you, that there is worse ;)


I don't understand russian, therefore russian is unreadable.

I don't understand mathematics, therefore math is unreadable.

I don't understand music, therefore sheet music is unreadable.

That is your problem in a nutshell.

Meanwhile, people here take a glance at the code and immediately get it, because it is written in a language they understand. These people tend to find it quite readable -- not necessarily an example of most pretty code, but nevertheless readable.


Still you do not address the simplest of points, which are: meaningful variable names, meaningful procedure names, explaining comments. All of which are minimum standards for software development these days.

I may be a C noob, as I already honestly stated in my very first post, but my points still stand. Those are not some subjective things. It is very clear that those things add to readability, yet the code does not have them.


I addressed those points implicitly. Those names are meaningful, in context, to the people who understand the language; just as symbols in math are, in context, understandable to those who understand mathematics. And thus there is no need for comments. Arguably, there are too many comments, because I saw many that said something obvious without adding anything useful. Short identifiers aid reading for people who understand the language, because it allows them to focus on what the code does (or, rather, how exactly it does it) rather than what the things in it are (which is obvious to people who understand the language).


> Wait, you are telling me, that people write even less readable code in C? Perhaps you are right. Perhaps people can really be that much without care.

I'm a person who has written a lot of c, readable and unreadable. It's not that folks don't care; I find that notion slightly offensive. It's that a lot of c programmers care about different things. And, especially those of an older school have a different mindset.

When you read c source, especially that labelled "exercise in minimalism", you should approach it as hallowed. Enter the file with a sense of reverence; lose the ego; ask not why the file wasn't written in service to you. As you have here, tsk tsk. Expect a challenge, a puzzle. Ignore the variable names; they may be misleading -- learn for yourself what role each variable takes on each line -- some of us will reuse a variable for multiple purposes throughout its lifetime. Humble yourself with the understanding that the code was not written for you. It was written for a machine infinitely more patient and methodical than you. Use that machine to execute and debug, watch and analyze, to discover the true meaning.

https://www.ioccc.org


I agree with you, the code isn’t as readable as it could be. But that’s precisely the point here: it’s an exercise in artistic license, so its conciseness is a feature, not a bug. Think of it like poetry.


I can recognize why it generates a feeling of elegance, as some say of short, concise mathematical formulas. However, this is more like an esoteric practice: it's meaningful for the select few who went through the initiation, and has only the charming/frustrating taste of mystery for others.

But, at least to me, it's nothing like poetry. Anyone, without any specific training, can read poetry and understand something of it. In poetry, artistic license is the exception, not the general rule. The deep meanings, the interplay between form and content, the attention to metre, the use of allegories, and so on, will not be consciously grasped by every reader. But still, anyone who has the required base knowledge of the language can read it and find some sense in it.


Gigantic switch statements are pretty much the best way to do a parsing style thing in a straightforward basic manner, and it is after all, “an _exercise_ in minimalism”. There are short variable names but again there are not many variables used in the whole program. This was not code written to be approachable by any other random team member without further context, it was written for a different purpose entirely.

Perhaps people would be less reluctant to show their code in public if it were not so often criticised for falling short of some criteria that the author was never intending to meet?


> This was not code written to be approachable by any other random team member without further context, it was written for a different purpose entirely.

Ah, let me quote: "Programs must be written for people to read, and only incidentally for machines to execute." – Hal Abelson

And I believe he is right. If you don't write your programs for other people to read, then they are not worth being written in that way. Always consider that next person, who needs to read and understand your code. If you don't, then you are not much of a team player at all. It's great you can do some brain acrobatics, but it's not much use if the only one on the team able to understand your code is yourself. I don't know how many 1 person software developer jobs there are still out there, but I guess the number is vanishingly small, compared to team jobs. I would not want someone in my team who does not pay attention to keeping things very readable.

> Perhaps people would be less reluctant to show their code in public if it were not so often criticised for falling short of some criteria that the author was never intending to meet?

True, you got a point here. But do we want this display of code? However, personally, I'd probably not miss reading such code and maybe it would even be good for the entire profession of software development, if such code was not displayed as something to be achieved but rather something to be shunned.


It’s not a piece of software worked on by a team though. It’s a personal project, written by one person to be understood by that person, and incidentally I think some effort has been made to make the code easier to read for random passers-by on the internet. Not every piece of code has to be written to meet the same goals, there is a large spectrum.

I write both terse notes to myself, and longer more well-thought out sets of notes to distribute to undergraduates in classes I teach. I wouldn’t give the undergraduates my personal notes - they are simply too unpolished and terse - but my colleagues find them very helpful on occasion. Is it wrong to have two different ways of writing?

I would say this C implementation leans somewhere between “note to self” and “arty exploration”. Is it worth being written? I don’t think you or I could be the judge of that, but the author obviously thought so. Live and let live.


Your amount of unwillingness to learn and understand is disturbing. Lowest-common-denominator dumbing-down is not how we advance our craft. There are many others who can understand it, perhaps you should ask yourself why you can't.

If you think this is unreadable, then you are not qualified to read it --- yet. I implore you to try; maybe it will actually be enlightening.


Nice, now we are getting personal! Interesting, how much you know about me from just a comment. And still you miss the simplest of points, which are: meaningful variable names, meaningful procedure names, explaining comments.

Perhaps it is not so much that "I can't" because of incapability or stupidity, but simply that I won't, because of the lack of care given to creating something meeting minimal standards for modern software development and best practices.


> modern software development and best practices

Yes, that's all too familiar to me. The dogmatic cargo-culting buzzword-bingo "religion" whose only claim to fame is in effectively producing gargantuan Enterprise Quality™ software --- I mean solutions --- which everyone inevitably hates because they are ridiculously bloated and overengineered to the point that the simplest things take an absurd amount of time and energy, and also have abysmal UX too...

"best practices are best not practiced."


No reasonable amount of comments will make up for a lack of domain knowledge.


Exactly! You hit the nail right on the head.


The code is obviously not intended for production, but it is straightforward nevertheless for anyone with C experience.


I've only skimmed the code but it seems to me that the repo is an art piece just as much or more than it is a software project. The medium is code but that doesn't mean that it needs to adhere to professional standards any more than someone's woodworking project needs to adhere to building codes.


It's a C compiler in 500 lines. I guess there are C compilers in 500,000 lines.



