I wrote a tokenizer for a language I’m creating, and all I needed was read chara...

barrkel · on Dec 15, 2018

Guess what? You wrote a state machine (a DFA, most likely) in code, where the program counter represents the state. In all probability (if your language is sane) there is a 1:1 correspondence between the code you wrote and the regular grammar of your tokens. You implemented a regular language matcher, i.e. a regex matcher, but in code rather than via a regular expression language interpreter or compiler.

Typical pattern:

    while (p < eof && isspace(*p)) // [ ]* (bounds check before the dereference)
        ++p;
    if (p == eof) return EOF;
    start = p; // token text begins after the skipped whitespace
    if (is_ident_start(*p)) { // [a-z]
        ++p;
        while (p < eof && is_ident(*p)) // [a-z0-9]*
            ++p;
        set_token(start, p - start); // assumes set_token(text, length)
        return IDENT;
    } else if (is_number(*p)) { // [0-9]
        ++p;
        while (p < eof && is_number(*p)) // [0-9]*
            ++p;
        set_token(start, p - start);
        return NUMBER;
    } // etc.
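
To make the "program counter is the state" point concrete, here is a rough sketch of the same matcher with the state held in an explicit variable instead. The names (`next_token`, `S_START`, `T_IDENT` and so on) are made up for illustration, and it assumes the "C" locale so that `islower` means `[a-z]`:

```c
#include <ctype.h>
#include <stddef.h>

enum state { S_START, S_IDENT, S_NUMBER };
enum token { T_EOF, T_IDENT, T_NUMBER, T_ERROR };

/* Scan one token from [*pp, eof). On return, *start and *len describe
   the token text and *pp points just past it. */
enum token next_token(const char **pp, const char *eof,
                      const char **start, size_t *len)
{
    const char *p = *pp;
    enum state st = S_START;
    int done = 0;

    *start = p;
    while (p < eof && !done) {
        unsigned char c = (unsigned char)*p;
        switch (st) {
        case S_START:
            if (c == ' ') { ++p; *start = p; }             /* [ ]*      */
            else if (islower(c)) { st = S_IDENT; ++p; }    /* [a-z]     */
            else if (isdigit(c)) { st = S_NUMBER; ++p; }   /* [0-9]     */
            else done = 1;                                 /* no transition */
            break;
        case S_IDENT:
            if (islower(c) || isdigit(c)) ++p;             /* [a-z0-9]* */
            else done = 1;
            break;
        case S_NUMBER:
            if (isdigit(c)) ++p;                           /* [0-9]*    */
            else done = 1;
            break;
        }
    }
    *len = (size_t)(p - *start);
    *pp = p;
    if (st == S_IDENT) return T_IDENT;    /* stopped in an accepting state */
    if (st == S_NUMBER) return T_NUMBER;
    return p == eof ? T_EOF : T_ERROR;    /* never left the start state */
}
```

Each `case` is one DFA state and each `if` is one transition, which is the same information the nested loops encode implicitly.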

Corresponds to:

    IDENT ::= [a-z][a-z0-9]* ;
    NUMBER ::= [0-9][0-9]* ;
    SPACE ::= [ ]* ;

    TOKEN ::= SPACE (IDENT | NUMBER) ;

Inline those nonterminals, and guess what - regular expression!
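
As a sketch of what the inlined form looks like when handed to an actual regex engine, the following uses the POSIX `regcomp`/`regexec` API. The function name `match_token` and the `tok_kind` enum are hypothetical, and the extra capture groups are only there to tell which alternative matched:

```c
#include <regex.h>
#include <stddef.h>

enum tok_kind { TOK_NONE, TOK_IDENT, TOK_NUMBER };

/* Match one token at the start of s. On success, *off and *len give the
   token's offset and length within s. Compiles the pattern on every call,
   which is wasteful but keeps the sketch short. */
enum tok_kind match_token(const char *s, size_t *off, size_t *len)
{
    /* TOKEN with SPACE, IDENT and NUMBER inlined.
       Group 2 is the IDENT alternative, group 3 the NUMBER alternative. */
    static const char pattern[] = "^[ ]*(([a-z][a-z0-9]*)|([0-9][0-9]*))";
    regex_t re;
    regmatch_t m[4];
    enum tok_kind kind = TOK_NONE;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return TOK_NONE;
    if (regexec(&re, s, 4, m, 0) == 0) {
        /* A group that did not participate has rm_so == -1. */
        int g = (m[2].rm_so != -1) ? 2 : 3;
        kind = (g == 2) ? TOK_IDENT : TOK_NUMBER;
        *off = (size_t)m[g].rm_so;
        *len = (size_t)(m[g].rm_eo - m[g].rm_so);
    }
    regfree(&re);
    return kind;
}
```

For the input `"  abc1 2"` this reports an identifier of length 4 at offset 2, the same token the hand-written loops would produce.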

jstimpfle · on Dec 15, 2018

That works until you need your own implementation of ++p;. At that point, wiring your implementation up to a generic library is already more work than just doing it all yourself, and that's before counting the cost of integrating the library into the sources and the build.

And with REs you can really only match, not transform. Transformations are needed even in simple cases like string and number literals. Now your code will look more like

    %{
    #include "y.tab.h"
    int num_lines = 1;
    int comment_mode=0;
    int stack =0;
    %}
    digit ([0-9])
    integer ({digit}+)
    float_num ({digit}+\.{digit}+)
    %%
    {integer} {  //deal with integer 
                    printf("#%d: NUM:",num_lines); ECHO;printf("\n");
                    yylval.Integer = atoi(yytext);
                    return INT;
                   }
    {float_num} {// deal with float
                     printf("#%d: NUM:",num_lines);ECHO;printf("\n");
                     yylval.Float = atof(yytext);
                     return FLOAT;
                     }
    \n         { ++num_lines; }
    .          if(strcmp(yytext," "))ECHO;
    %%
    int yywrap() {
    return 1;
    }

(copied from Stack Overflow). And that's before preprocessing. Yeah. Thanks, but no thanks.
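
For comparison, here is a rough sketch of the hand-written alternative, where matching and transformation happen in the same loop. It scans a double-quoted string literal and decodes escapes as it goes. The function name `scan_string`, its signature, and the set of supported escapes are assumptions for illustration, not taken from any particular lexer:

```c
#include <stddef.h>

/* Scan a double-quoted string literal starting at *pp, decoding escapes
   into out (capacity cap, including the terminating NUL).
   Returns the decoded length, or -1 on error. On success, *pp is advanced
   past the closing quote. */
long scan_string(const char **pp, const char *eof, char *out, size_t cap)
{
    const char *p = *pp;
    size_t n = 0;

    if (cap == 0 || p >= eof || *p != '"')
        return -1;
    ++p;                                  /* opening quote */
    while (p < eof && *p != '"') {
        char c = *p++;
        if (c == '\\') {                  /* transform, not just match */
            if (p >= eof) return -1;
            switch (*p++) {
            case 'n':  c = '\n'; break;
            case 't':  c = '\t'; break;
            case '\\': c = '\\'; break;
            case '"':  c = '"';  break;
            default:   return -1;         /* unknown escape */
            }
        }
        if (n + 1 >= cap) return -1;      /* keep room for the NUL */
        out[n++] = c;
    }
    if (p >= eof) return -1;              /* unterminated literal */
    ++p;                                  /* closing quote */
    out[n] = '\0';
    *pp = p;
    return (long)n;
}
```

The decoded value falls out of the scan itself, with no separate pass over the matched text.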

jcranberry · on Dec 15, 2018

You could say the same about strtok: it would be trivial to write a regular grammar expressing whatever your code is lexing. However, looking at it from this perspective removes the distinction not just between a hand-rolled lexer and a regex matching library, but also between those and using strtok. There would be no point in having this conversation.
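
To illustrate the strtok case, a minimal sketch (the helper name `split_words` is made up): splitting on spaces with the standard `strtok` yields tokens that each match `[^ ]+`, with the skipped delimiters matching `[ ]*`.

```c
#include <stddef.h>
#include <string.h>

/* Split s in place on spaces, storing up to max token pointers in out.
   strtok overwrites delimiters in s with NULs, so s must be writable. */
size_t split_words(char *s, char **out, size_t max)
{
    size_t n = 0;
    char *tok;

    for (tok = strtok(s, " "); tok != NULL && n < max; tok = strtok(NULL, " "))
        out[n++] = tok;
    return n;
}
```

Called on a writable copy of `"  foo 42  bar"`, this yields the three tokens `foo`, `42` and `bar`, without classifying them in any way.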