
part 1 isn't parser / lexer???? nice job. "On the next episode: A lexer" oh, it is.


Well, it's the first thing in the pipeline, so why not?


I think there are several good reasons not to start with the lexer and parser:

1. If you write a code generator whose input is something like Forth ("operators are separated by spaces") you have written a compiler. If you write a lexer and parser, you have not written a compiler. So, if we want to develop the compiler incrementally, we should start with a back end, then add a front end.

2. Parsing and lexing are much more theoretically tractable than code generation, which is only now beginning to succumb to systematic formalization after nearly 50 years of effort. Consequently there are enormous edifices of theory and tooling built up around parsing and lexing, which are fascinating and delightful. The problem is that people tend to teach all this stuff, which is not actually necessary to write a useful parser. Rather than studying Chomsky's hierarchy of languages and automata theory, you're probably better off just learning how context-free languages work, a little about refactoring them to eliminate left recursion, and about chronological backtracking search and applying it to the parsing problem, and then just writing a recursive-descent parser (sketched in code below this list). If you want to study the LALR algorithm and the refinements that have made Earley's algorithm practical, that's great, but you shouldn't really have to puzzle over shift-reduce conflicts in order to get up and running.

3. It's very tempting for novices to see syntax as the important aspect of programming, because it's what you can see most easily; but of course it is purely superficial. By teaching lexers and parsers as the first step, and thus implicitly the most important step, in writing a compiler, you're hurting novice programmers by reinforcing that misconception instead of helping them by ripping it away.
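To put some code behind point 2, here is a minimal sketch of such a recursive-descent parser in C, for a made-up grammar of integer arithmetic. To stay short it evaluates as it parses instead of building a tree; nothing in it comes from the article under discussion:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Made-up grammar:
         expr   := term   (('+' | '-') term)*
         term   := factor (('*' | '/') factor)*
         factor := NUMBER | '(' expr ')'       */

    static const char *p;                  /* cursor into the source text */

    static int expr(void);

    static void skip(void) { while (isspace((unsigned char)*p)) p++; }

    static int factor(void) {
        skip();
        if (*p == '(') {                   /* parenthesized subexpression */
            p++;
            int v = expr();
            skip();
            if (*p != ')') { fprintf(stderr, "expected ')'\n"); exit(1); }
            p++;
            return v;
        }
        if (isdigit((unsigned char)*p)) {  /* integer literal */
            int v = 0;
            while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
            return v;
        }
        fprintf(stderr, "unexpected character '%c'\n", *p);
        exit(1);
    }

    static int term(void) {
        int v = factor();
        for (skip(); *p == '*' || *p == '/'; skip())
            v = (*p++ == '*') ? v * factor() : v / factor();
        return v;
    }

    static int expr(void) {
        int v = term();
        for (skip(); *p == '+' || *p == '-'; skip())
            v = (*p++ == '+') ? v + term() : v - term();
        return v;
    }

    int main(void) {
        p = "2 + 3 * (4 - 1)";
        printf("%d\n", expr());            /* prints 11 */
        return 0;
    }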


>3. It's very tempting for novices to see syntax as the important aspect of programming, because it's what you can see most easily; but of course it is purely superficial. By teaching lexers and parsers as the first step, and thus implicitly the most important step, in writing a compiler, you're hurting novice programmers by reinforcing that misconception instead of helping them by ripping it away.

I'm a newbie in these topics and I don't see it this way.

For me, the lexer is a way of building an abstraction above the syntax.

Syntax is important from the perspective of the language design team; meanwhile, the compiler operates on the abstraction.


I can't understand you at all. Maybe my English parser is buggy.


I meant that the lexer just creates an abstraction from (very often) text, but later you operate on abstractions

like enum Syntax.String, Syntax.IfKeyword

don't ya?


Now go ahead and describe all the next steps you need to do to turn "like enum Syntax.String, Syntax.IfKeyword" into running code


I've been doing it kinda naively (it may be problematic in some cases), albeit with some very handy methods

but it's been like that: for

    if (expression) {body}

the handling is roughly

    if next elements are Syntax.IfKeyword, Syntax.OpenParenthesis then
    {
        var expressionElements = TakeUntil(Syntax.ClosedParenthesis)
        var expressionResult = ExpressionBuilder(expressionElements)
        if (!expressionResult.Success) error;
        ...
    }
    else if ...
    else if ...
    else error;

This way I'm building the AST, and then I have another module that turns the AST into LLVM IR.
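A minimal sketch of what that second module can look like, with a made-up two-node AST and plain printf standing in for the real LLVM APIs (everything here is invented for illustration):

    #include <stdio.h>

    /* A made-up AST: integer literals and binary '+' / '*'. */
    typedef enum { LIT, ADD, MUL } Kind;
    typedef struct Node {
        Kind kind;
        int value;                   /* used when kind == LIT */
        struct Node *lhs, *rhs;      /* used for ADD / MUL    */
    } Node;

    static int tmp;                  /* counter for SSA temporaries */

    /* Emit IR for a node; return the index of the %tN holding its value. */
    static int emit(const Node *n) {
        if (n->kind == LIT) {
            int t = tmp++;
            /* "add i32 x, 0" is a cheap way to materialize a constant */
            printf("  %%t%d = add i32 %d, 0\n", t, n->value);
            return t;
        }
        int l = emit(n->lhs), r = emit(n->rhs);
        int t = tmp++;
        printf("  %%t%d = %s i32 %%t%d, %%t%d\n", t,
               n->kind == ADD ? "add" : "mul", l, r);
        return t;
    }

    int main(void) {
        /* (2 + 3) * 4 */
        Node two = {LIT, 2, 0, 0}, three = {LIT, 3, 0, 0}, four = {LIT, 4, 0, 0};
        Node sum  = {ADD, 0, &two, &three};
        Node prod = {MUL, 0, &sum, &four};

        printf("define i32 @main() {\n");
        int result = emit(&prod);
        printf("  ret i32 %%t%d\n}\n", result);
        return 0;
    }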


So, how do you solve things like:

- type resolution

- scopes

- variable capture

- closures

- constants

- module boundaries

- symbol lookup

- ...

?

All of the above, and about an order of magnitude more, is required for anything beyond a templating language.

And none of them are resolved by a parser.


On my "first stage" AST I have node e.g "variable declaration"

which has declared type, name, expression

then I have pass which transforms "variable declaration" node to

"bounded/proven/type_checked (whatever you call it) variable declaration"

with the difference that it performs check whether result type of expression equals to declared type

and overrides/swaps that particular node

Before I start messing with e.g classes then I'll have to add some kind of type discovery before being able to prove types, but it'll be just walk thru all nodes and save as much info as I can about e.g new classes
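A minimal sketch of what that pass might look like, with made-up node and type names, and the "swap to a proven node" step flattened to a flag to keep it short:

    #include <stdio.h>

    typedef enum { TY_INT, TY_BOOL } Type;

    typedef struct {
        Type declared;          /* type written in the source        */
        const char *name;
        Type expr_type;         /* result type of the initializer,
                                   filled in by an expression pass   */
        int proven;             /* set once the node is type-checked */
    } VarDecl;

    /* The pass: prove the expression's type matches the declaration. */
    static int check_var_decl(VarDecl *d) {
        if (d->declared != d->expr_type) {
            fprintf(stderr, "type error in declaration of '%s'\n", d->name);
            return 0;
        }
        d->proven = 1;          /* a real pass would swap in a new node */
        return 1;
    }

    int main(void) {
        VarDecl ok  = {TY_INT,  "x", TY_INT, 0};
        VarDecl bad = {TY_BOOL, "y", TY_INT, 0};
        printf("x: %s\n", check_var_decl(&ok)  ? "ok" : "error");
        printf("y: %s\n", check_var_decl(&bad) ? "ok" : "error");
        return 0;
    }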


So you're not even anywhere close to having a language. And look how it took you almost a paragraph just to describe a simple type check.

> but it'll just be a walk through all the nodes, saving as much info

Yup. Literally none of that info comes from the parsing stage. It won't be "just" an AST walk.

   SomeType n = f(x, y, z);
This simple line requires you:

- to see if variables x, y, z are in scope

- that they conform to the types required by the function

- to do that you need to look up the function. Depending on your language it can be any of: available globally, imported from a module, aliased from a module, defined in the file, a variable holding a reference to the function

- side note: even x, y, z can be any of the above, by the way

- then you have to check the return type of the function

- then you have to resolve SomeType. Which can be defined in the same file, in a different module, or globally. Depending on your language, it can be a simple type, a subtype, a union type...

- so now, depending on your language you have to decide whether the type of the result of the function (which can be simple, imported global, subtype, union...) can be matched against the type of the variable

And the same has to be done for every single assignment. And literally none of this has anything to do with parsing. (A compressed sketch of these checks follows below.)

Edit: all of the above depends on how many features you want a language to have, and on how robust you want it to be, of course. But all these things are super important, significantly more important than the parsing stage, at least at the beginning. And they are definitely more important than the prominence accorded to parsing in nearly all materials on compiling would suggest.
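Here is the promised compressed sketch of those checks, with a single flat symbol table standing in for the whole global/module/local story and one hard-coded function signature; every name in it is invented:

    #include <stdio.h>
    #include <string.h>

    typedef enum { TY_INT, TY_STR } Type;

    /* One flat table standing in for globals, imports, and locals. */
    typedef struct { const char *name; Type type; } Sym;
    static Sym scope[] = { {"x", TY_INT}, {"y", TY_INT}, {"z", TY_INT} };

    /* The one known function: f(int, int, int) -> int. */
    static Type f_params[] = { TY_INT, TY_INT, TY_INT };
    static Type f_return   = TY_INT;

    static int lookup(const char *name, Type *out) {
        for (size_t i = 0; i < sizeof scope / sizeof scope[0]; i++)
            if (strcmp(scope[i].name, name) == 0) { *out = scope[i].type; return 1; }
        return 0;
    }

    /* Check `SomeType n = f(x, y, z);` where `declared` is SomeType. */
    static int check_call_assign(Type declared, const char *args[], int nargs) {
        for (int i = 0; i < nargs; i++) {
            Type t;
            if (!lookup(args[i], &t)) {              /* variable in scope?     */
                fprintf(stderr, "'%s' not in scope\n", args[i]); return 0;
            }
            if (t != f_params[i]) {                  /* matches the parameter? */
                fprintf(stderr, "'%s' has the wrong type\n", args[i]); return 0;
            }
        }
        if (f_return != declared) {                  /* return type matches?   */
            fprintf(stderr, "return type mismatch\n"); return 0;
        }
        return 1;
    }

    int main(void) {
        const char *args[] = { "x", "y", "z" };
        printf("%s\n", check_call_assign(TY_INT, args, 3) ? "ok" : "error");
        return 0;
    }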


>- that they conform to the types required by the function

>- then you have to check the return type of the function

Those two things seem to be trivial; they're low-hanging fruit, at least for "simple types", aren't they?

>- to see if variables x, y, z are in scope

Every node that can have a "body" has its own scope, and that scope also has a parent scope, which it receives from its parent, and so on recursively.

E.g. a function has a body which has a scope; then there's an if statement in that body, so the "if scope" has the body's scope as its parent, and thus sees all the variables declared in the function body.
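As a sketch, that scheme can be as simple as a parent pointer per scope, with lookup walking the chain outward (names invented):

    #include <stdio.h>
    #include <string.h>

    /* A scope: its own bindings plus a pointer to the enclosing scope. */
    typedef struct Scope {
        const char *names[8];
        int count;
        struct Scope *parent;
    } Scope;

    /* Look in the current scope first, then walk the parent chain. */
    static int in_scope(const Scope *s, const char *name) {
        for (; s; s = s->parent)
            for (int i = 0; i < s->count; i++)
                if (strcmp(s->names[i], name) == 0)
                    return 1;
        return 0;
    }

    int main(void) {
        Scope fn_body = { {"a", "b"}, 2, NULL };   /* function body scope */
        Scope if_body = { {"tmp"}, 1, &fn_body };  /* `if` body inside it */

        printf("%d\n", in_scope(&if_body, "a"));   /* 1: seen via parent   */
        printf("%d\n", in_scope(&if_body, "tmp")); /* 1: declared locally  */
        printf("%d\n", in_scope(&fn_body, "tmp")); /* 0: doesn't leak out  */
        return 0;
    }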

>- to do that you need to look up the function. Depending on your language it can be any of: available globally, imported from a module, aliased from a module, defined in the file, a variable holding a reference to the function

Nothing seems strange; I assume that the whole source code is available at compilation time.

>- so now, depending on your language you have to decide whether the type of the result of the function (which can be simple, imported global, subtype, union...) can be matched against the type of the variable

Definitely things will get more complicated once I stop doing stuff with just ints and bools and try to create a complex type system, but for now stuff seems to be easy


> but for now stuff seems to be easy

I'm not saying it's hard. You've forgotten what the conversation started with.


>>If you write a code generator whose input is something like Forth ("operators are separated by spaces") you have written a compiler.

Without a parser, how do you expect to ingest those operators separated by spaces? Let alone verify that the input is valid?

Either you are aware you are writing a parser, or you aren't and end up doing it very poorly.

More importantly, the whole point of a compiler is not to output a binary. The whole point of a compiler is to allow developers to write code in a higher-level language that allows them to express programming constructs easily, and to verify and validate input to help pinpoint errors. Once you get to an AST you can even say your work is already done.


Yes, technically string.split() is a parser, but even in C it's two lines of code you really don't need a lot of parser theory to write:

    for (start = end; *start && isspace(*start); start++);   /* skip spaces      */
    for (end = start; *end && !isspace(*end); end++);        /* scan to word end */
I mean you can golf it a lot further than that, but it's very little code even written perfectly straightforwardly like that without any tricks.

One of the strong points of having your syntax be "operators are separated by spaces" is that it's impossible to have invalid input.

You say, "The whole point of a compiler is to allow developers to write code in a higher-level language that allows them to express programming constructs easily, and verify and validate input to help pinpoint errors." Well, Forth is a higher-level language that allows you to express programming constructs easily, but it is not very good at verifying and validating the program to help pinpoint errors. However, I think most of the reason it's bad at validating is not because of its syntax (or lack thereof), but its semantic model: no types, everything is Turing-complete, referential opacity, etc. Forth's semantics opt for maximum flexibility at the cost of analytical tractability.

I don't know about you, but I enjoy actually running my programs more than verifying and validating them, so I would not agree that your work is already done once you get to an AST. And I would say that a parser that produces an AST is not a compiler.
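To tie this back to point 1 upthread, here is a toy "compiler" whose entire front end is that whitespace split: it translates a Forth-ish "2 3 + ." into C statements over an explicit stack. Everything in it is invented for illustration:

    #include <ctype.h>
    #include <stdio.h>

    /* Input: words separated by spaces. Output: C statements. */
    static void compile(const char *src) {
        printf("int s[64], n = 0;\n");
        const char *p = src;
        while (*p) {
            while (*p && isspace((unsigned char)*p)) p++;   /* skip spaces   */
            const char *start = p;
            while (*p && !isspace((unsigned char)*p)) p++;  /* scan the word */
            int len = (int)(p - start);
            if (len == 0) break;
            if (len == 1 && *start == '+')                  /* add top two   */
                printf("n--; s[n-1] += s[n];\n");
            else if (len == 1 && *start == '.')             /* pop and print */
                printf("printf(\"%%d\\n\", s[--n]);\n");
            else                                            /* assume number */
                printf("s[n++] = %.*s;\n", len, start);
        }
    }

    int main(void) {
        compile("2 3 + .");   /* emits C that prints 5 */
        return 0;
    }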


There's too much emphasis on parsing in all the teaching materials on compiling.

And there's rarely any good stuff on the actual hard problems


> And there's rarely any good stuff on the actual hard problems

So what are those hard problems?


The actual compilation. Actual programming language design.

Types, type inference, function implementations, error handling and recovery, incremental compilation, interfacing with the external world, code generation, fixing problems in code generation tools, CPU architecture concerns...

The toy languages that most learning materials build can be parsed with regexps, and that is definitely not something to spend a lot of time on at the beginning.

The actual parsing problems are never addressed in these materials anyway, and they can come later (an example of such concerns: https://matklad.github.io//2018/06/06/modern-parser-generato...).


> The actual compilation. Actual programming language design.

Isn't that intimately tied to parsing?

I mean, how do you expect to implement a parser if you have no idea if a specific programming construct can actually be parsed, or what the implications for compilation time are of doing things one way or another?


> Isn't that intimately tied to parsing?

Not entirely

> how do you expect to implement a parser if you have no idea if a specific programming construct can actually be parsed

Every construct can be parsed.

However, there are significantly more aspects to language design than "parsing programming constructs". For example, the C++ header/implementation split, with no real module system, affects compilation orders of magnitude more than the parsing rules for C++ do.


Codegen, optimizations, AST transformations.


I'm not your parent commenter, but IMO it's because there is an enormous amount of existing info out there about these two steps. For these kinds of projects, they are necessary whether you're translating to another language (i.e. generating code) or not. And everyone starts with lexing and parsing. And just repeats the same old shit.



