This is just an anecdote, but I had to write a SQL DDL and DML parser during my last semester. I wrote the DDL parser by hand, and it wasn't as bad as I expected, but it was time-consuming. I managed to convince the professor to give us the option of using a parser generator for the next phase (DML), since the point of the class wasn't parsing context-free grammars but executing the SQL.
I used Flex and Bison since the project was in C. Getting up and running and understanding how the tools have to be set up with different options was a bit tricky, but after that, my parser was up and running in about two hours, compared to probably four times that for the hand-written DDL. Our DML subset was also much larger and more complex than our DDL, so I was very happy with the development speed increase.
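For anyone who hasn't used the tools, the setup is basically one Flex file for the tokens and one Bison file for the grammar. A toy pair, nothing like our actual DML subset, just to show the moving parts:

  /* toy.l -- minimal Flex lexer */
  %{
  #include <string.h>
  #include "toy.tab.h"
  %}
  %option noyywrap
  %%
  "SELECT"                { return SELECT; }
  "FROM"                  { return FROM; }
  [A-Za-z_][A-Za-z0-9_]*  { yylval.str = strdup(yytext); return IDENT; }
  [ \t\n]+                ; /* skip whitespace */
  .                       { return yytext[0]; }
  %%

  /* toy.y -- minimal Bison grammar */
  %{
  #include <stdio.h>
  int yylex(void);
  void yyerror(const char *msg) { fprintf(stderr, "parse error: %s\n", msg); }
  %}
  %union { char *str; }
  %token <str> IDENT
  %token SELECT FROM
  %%
  query: SELECT IDENT FROM IDENT ';'  { printf("select %s from %s\n", $2, $4); }
       ;
  %%
  int main(void) { return yyparse(); }

Build with `bison -d toy.y`, then `flex toy.l`, then compile toy.tab.c and lex.yy.c together.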
I had this idea that using a parser generator was slow and wasteful, since many modern tutorials online write parsers by hand and speak against parser generators (possibly because there isn't a catch-all for all languages). Turns out dev speed is way more important to me up front; if I ever notice parsing speed actually being an issue, I should be happy that my MVP has gotten enough use.
It's also nice because a lexer and parser can be pretty easily black-boxed and swapped out for a hand-written one later; just keep the AST and API the same and you should be good.
All that said, that's personal preference, and writing the parser by hand is definitely good experience and more extensible, especially for error handling. Nice work!
To add an anecdote, I've gotten into a hobby of writing parsers and parser-related things, and I've developed a pattern and a handful of utility functions (small enough to be rewritten from memory) that together make hand-writing a pretty breezy task. Just the other week at work I hand-wrote a GraphQL schema parser from scratch in about 2 hours using this technique (GraphQL schemas have a relatively simple syntax, but I still think it's a decent data point).
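The utilities I mean are nothing exotic: a cursor over a token array plus accept/expect helpers. A sketch in C with illustrative names, not my actual code:

  #include <stdio.h>
  #include <stdlib.h>

  typedef enum { TOK_IDENT, TOK_LPAREN, TOK_RPAREN, TOK_EOF } TokenKind;
  typedef struct { TokenKind kind; const char *text; } Token;
  typedef struct { Token *toks; size_t pos; } Parser;

  Token *peek(Parser *p)    { return &p->toks[p->pos]; }
  Token *advance(Parser *p) { return &p->toks[p->pos++]; }

  /* accept: consume the token only if it matches */
  int accept(Parser *p, TokenKind k) {
      if (peek(p)->kind == k) { advance(p); return 1; }
      return 0;
  }

  /* expect: like accept, but a mismatch is a hard parse error */
  Token *expect(Parser *p, TokenKind k) {
      if (peek(p)->kind != k) {
          fprintf(stderr, "unexpected token '%s'\n", peek(p)->text);
          exit(1);
      }
      return advance(p);
  }

Each grammar rule then becomes one small function built out of these.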
It's not. I've thought about open-sourcing it or doing a write-up on my pattern, but I've just assumed there are tons of libraries out there for this that are probably better than my homegrown version. Could still make an interesting blog post though, I suppose.
Having written many parsers, I can attest that the time spent writing and debugging the parser is about 0.0000001% of the time you'll invest in the compiler.
Heck, I spent more time trying to figure out how to integrate the tag name space from ImportC into the host D compiler's symbol table mechanism, than I did writing the C parser.
In a university compilers class, I suppose that implementing the parser (or the tokenizer part of it) might be used as a weedout assignment. Or to tell students without much programming experience that they are going to have to put a lot of energy into the class (so don't, say, take compilers, operating systems, organic chemistry, and epic novel-writing all in the same term).
(When I took compilers, the initial weedout/attention-getting assignment was artificial: write a string/symbol table manager module that could handle strings of arbitrary size, up to megabytes or larger for each entry. Though I don't know how well this first assignment worked for setting expectations, or if it could've been improved. After that school term, I heard from someone else in the class that he and I had been the only two people in the class to actually "finish" our compilers. Not that we were smarter or harder-working, but we both had a lot of prior programming experience, so a big head start. Setting expectations for a project-heavy systems CS class seems hard to do, without discouraging people who could do it if they invest the time, and I suppose just noticing a couple nerds who had a big head start could be discouraging to everyone else. One of the secret keys to learning/accomplishing something hard is to feel you're doing well enough to stick with it long enough to do well.)
P.S. Having a parallel tag symbol table means doubling the work in creating and tearing down symbol tables. Adding another member to the Symbol struct means lots of extra memory allocated.
Tag symbols are relatively rare, so I was willing to trade lookup times for reduced memory consumption. Wound up adding a pointer to a hash table in each symbol table, where the tag symbols would be stored. Only when tag symbols are actually used is it ever accessed or even initialized. It works a treat.
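In sketch form, the shape is something like this (names are mine, not the actual compiler's; hashtab_* stands in for whatever hash table implementation is already lying around):

  typedef struct Symbol  Symbol;
  typedef struct HashTab HashTab;

  HashTab *hashtab_new(void);
  void     hashtab_put(HashTab *h, const char *name, Symbol *s);
  Symbol  *hashtab_get(HashTab *h, const char *name);

  typedef struct SymbolTable {
      /* ... ordinary symbol storage ... */
      HashTab *tags;  /* stays NULL until the first tag symbol */
  } SymbolTable;

  void tag_insert(SymbolTable *st, const char *name, Symbol *s) {
      if (!st->tags)                /* pay for the table only if used */
          st->tags = hashtab_new();
      hashtab_put(st->tags, name, s);
  }

  Symbol *tag_lookup(SymbolTable *st, const char *name) {
      return st->tags ? hashtab_get(st->tags, name) : NULL;
  }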
The main goal at the moment is to create a C compiler, the function keyword there is just a placeholder for the time being. At the moment anything returned is just mov'd into the EAX register and then stored in a local variable if called that way.
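As a sketch of what that pattern emits (illustrative only, not the project's actual codegen; the function names and offsets are made up):

  #include <stdio.h>

  /* return 42;  ->  the value travels in EAX */
  void gen_return_const(FILE *out, int value) {
      fprintf(out, "    mov eax, %d\n", value);
      fprintf(out, "    ret\n");
  }

  /* int x = f();  ->  call, then spill EAX into x's stack slot */
  void gen_assign_from_call(FILE *out, const char *fn, int local_offset) {
      fprintf(out, "    call %s\n", fn);
      fprintf(out, "    mov [ebp%+d], eax\n", local_offset);
  }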
If you're looking for C-like languages to study, look up the Plan 9 Alef language. It was a concurrent C-like language designed to replace C, until they realized it was better off as a C library, which is what Plan 9's thread(2) is. It's mostly extinct but survives:
Worth pointing out that Plan 9's C was itself sort of its own language. I think it was a strict subset, but I can't remember all the details. The preprocessor was much simpler, that much I remember.
Not exactly a subset: there were at least the anonymous struct/union members, which partially made it into C11 much later.
But then if we look beyond the syntax almost any C implementation is “its own language” in that it defines behaviour the standard doesn’t specify, and frequently also changes behaviour the standard does, if in minor ways. Like how the Windows ABI forces a noncompliant wchar_t.
(Even if we look at the preprocessor, the classic algorithm implemented in most places [that I can’t be bothered to find a link for] in fact expands more than the standard strictly guarantees, e.g. you can sometimes get recursive expansion out of it by creative use of token pasting. I remember reading that the ANSI committee thought the case was too “perverse” and didn’t specify anything [no link here either, sorry... poke me again if you actually care about this]. The standalone preprocessor mcpp has warnings about this specification hole, but other implementations don’t, as far as I know. And you’d think the preprocessor, an almost purely syntactic thing, wouldn’t have implementation differences worth keeping around.)
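If memory serves, the case in question (it even appears as a note on macro rescanning in the standard) is along these lines:

  #define f(a) a*g
  #define g(a) f(a)

  f(2)(9)  /* may expand to 2*9*g or to 2*f(9); it is unspecified
              whether the inner f counts as nested inside its own
              expansion and is therefore blocked from re-expanding */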
Yes, Plan 9 C has its own C library which does not follow ANSI/ISO, though it is very similar to ISO C. Its syntax is pretty much C89 plus the useful bits of C99. To build ANSI/POSIX software you use the APE (ANSI/POSIX Environment) compilers and libraries, which are built in. APE's only reason for existing is to help with porting Unix baggage.
The compiler is kencc, which is why it is different. Plan 9 also still uses a.out for binary images.
Looks really interesting, thanks for sharing this.
I'm not sure whether I'll incorporate everything from C, or just make it compatible with C and do things differently behind the scenes, e.g. instead of carrying const char * string pointers everywhere, always carry the length of the data along with the pointer. Or, for instance, push an additional pointer on the stack for memory zones, so that an allocator is always present and knows its context, sort of like __thiscall passing this-> for class methods.
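A sketch of that zone idea with made-up names, to show what "the allocator travels with the call" would look like:

  #include <stddef.h>
  #include <string.h>

  typedef struct {
      char  *buf;
      size_t used, cap;
  } Zone;

  void *zone_alloc(Zone *z, size_t n) {
      if (z->used + n > z->cap) return NULL;  /* toy bump allocator */
      void *p = z->buf + z->used;
      z->used += n;
      return p;
  }

  /* the callee never reaches for a global malloc; the zone is an
     implicit extra argument, much like this-> in __thiscall */
  char *dup_in_zone(Zone *z, const char *s, size_t len) {
      char *p = zone_alloc(z, len + 1);
      if (!p) return NULL;
      memcpy(p, s, len);
      p[len] = '\0';
      return p;
  }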
Alef appeared in the first and second editions of Plan 9, but was abandoned during development of the third edition.[1][2] Rob Pike later explained Alef's demise by pointing to its lack of automatic memory management, despite Pike's and other people's urging Winterbottom to add garbage collection to the language;[3] also, in a February 2000 slideshow, Pike noted: "…although Alef was a fruitful language, it proved too difficult to maintain a variant language across multiple architectures, so we took what we learned from it and built the thread library for C."[4]
If you are exploring alternatives to the stdlib, you might commit yourself to no zero-terminated strings and see where it takes you. A fundamental data type with a pointer, a length and a capacity does a lot to stop buffer overflows.
To replace C-style strings you only need the length! You also get cheap substrings which is such a common operation that it alone IMO makes it worth taking this route.
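The whole type plus the substring operation fits in a few lines (str and str_sub are made-up names, just to make the point concrete):

  #include <stddef.h>
  #include <string.h>

  /* a length-carrying string view: no terminator needed */
  typedef struct { const char *data; size_t len; } str;

  str str_from_c(const char *s) { return (str){ s, strlen(s) }; }

  /* O(1) substring: pointer arithmetic, no copy, shares the buffer */
  str str_sub(str s, size_t start, size_t len) {
      if (start > s.len) start = s.len;             /* clamp, no overflow */
      if (len > s.len - start) len = s.len - start;
      return (str){ s.data + start, len };
  }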
Can't edit my original reply anymore, but fair point; I've changed the keyword 'function' in the AST to a function return type instead. There's no error handling yet to check whether the variable receiving the function return data matches the function return type, but it's on my todo list.
1. Some code seems to be duplicated (although with slightly different formatting) across multiple files, like dd/dw/db in elf.c and x86.c for example. I'd suggest consolidating those.
2. It helps to define types when you have variables that can take a limited set of possible values; typedef works, but enums are better. I'm thinking of variables like the `type` parameter to `primitive_data_type_size`, for example. The compiler can help you detect a missing case statement if you use an enum, but not an int (see the sketch after this list).
3. It's a matter of preference, but typedef'ing your structs can make your code more concise and keep it from being littered with struct keywords.
4. You seem to have a mix of tabs and spaces in some files (e.g. in `elf.c`). I would recommend configuring your editor to use just one of the two or it'll start to be an issue as your code base grows.
5. On a related point, I'd suggest picking a formatting style and sticking to it. You have functions like `accept` in `ast.c` declared without a space after the opening `(`, and others like `read_file` in the same file that have spaces.
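To make point 2 concrete, here's a hypothetical version of that function using an enum; the enumerator names are invented, the -Wswitch behavior is the point:

  #include <stddef.h>

  typedef enum { TYPE_BYTE, TYPE_WORD, TYPE_DWORD } PrimType;

  size_t primitive_data_type_size(PrimType type) {
      switch (type) {
      case TYPE_BYTE:  return 1;
      case TYPE_WORD:  return 2;
      case TYPE_DWORD: return 4;
      /* add TYPE_QWORD to the enum and forget it here: gcc/clang
         warn under -Wswitch; with an int parameter they stay silent */
      }
      return 0;
  }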
Shameless plug: if you are having a lot of fun with the "front-end" part of language design, but don't like the idea of writing your own backend or integrating with LLVM, maybe Cwerg is right for you: https://github.com/robertmuth/Cwerg
Great job! Writing a compiler from scratch is a massively valuable learning experience. I wrote one in BASIC for the 6502, that became self-hosting on April 24th, 1982. I'm going to watch your project with interest and relive old memories.
Every time I see a compiler, the first thing I zero in on is the parser.
I don't mean this in a bad way, but that tokenizer looks a bit ... simple. ;-)
I say that because ~26 years ago I had a go at writing a C preprocessor with the goal of creating a ToC and Index for my C projects. I sat down with a BNF of C and lex/yacc and got ground up in the details, then found some software that did exactly what I wanted and moved on. I just remember a few parts of the C syntax that don't fit nicely into BNF and hadn't studied parsers enough to dig myself out of the hole.
Good luck doing it manually, that's gonna be tricky as heck.
Really? That's probably why I failed! Ironically (given my post) I've never looked at Clang's parser. I have looked at GCC's and, well, sheepishly: I don't get it, it is a beast of a project and I wasn't able to crack the architecture. I should try again tho.
C is almost context-free with the exception of typedef. A custom parser helps with error diagnostics, e.g. missed semicolon or a "." swapped with "->".
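The usual illustration of the typedef wrinkle:

  typedef int T;

  void h(void) {
      T * x;   /* declaration: x is a pointer to int */
  }

  /* had T been a variable (say `int T, x;`), the same tokens
     `T * x;` would parse as a multiplication expression whose
     result is discarded -- the parser has to consult the symbol
     table (the "lexer hack") to decide, which is what breaks
     pure context-freedom */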
Incredible timing, as I'm working on my own language as well (just started a week ago) and I'm literally starting from 0. So happy that in just a week I can understand at least half of this thread. Your project looks a bit too wild for me, but I'm sure there's plenty I can learn there, so thank you so much for sharing!
I didn't start writing C yesterday; most of this is just stuff I've learned over the years from seeing how other languages work, then using what I know to try to program in a logical manner.
Also, I've written an implementation of a scripting language (compiler and VM) before in a similar fashion.