
Well, yes, using strtok works if the data happens to be structured in a certain simple way. Very often you want to do something more advanced though, and using regex for matching tokens is then necessary.



I do not believe that using regex is necessary. I have parsed a lot of code in my life and regex was not a necessity.


Agreed. Regex can make parsing code much more succinct and easier to grok (although usually at a small performance cost). So not "necessary", but it can be really useful.


You don't really use regex in C. You just write a few simple loops. Look up the lexer of the programming language of your choice.


Most lexers are state machines, either explicit with tables (like you get from lex) or implicit with program counter (with loops and switches). Those state machines implement matchers for regular languages; they're effectively hand-coded implementations of regular expression matching.

Regular expressions don't show up outside the spec, sure; but if you're writing the code (for implicit state machine), you need to know exactly where you are in the regular language that defines the tokens to write good code. Writing a regex matcher in code like this is like writing code in assembly - mentally, you're mapping to a different set of concepts all the time.
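The table-driven flavor can be made concrete with a small sketch (all names here are hypothetical, not from any particular lexer): an explicit transition function for the regular language [a-z][a-z0-9]*, which is exactly the kind of table a lex-generated scanner encodes.

```c
#include <stddef.h>

/* Hypothetical sketch of a table-driven DFA for the token [a-z][a-z0-9]*.
   States: 0 = start, 1 = in identifier (accepting), -1 = reject. */
static int next_state(int s, char c) {
    switch (s) {
    case 0: return (c >= 'a' && c <= 'z') ? 1 : -1;
    case 1: return ((c >= 'a' && c <= 'z') ||
                    (c >= '0' && c <= '9')) ? 1 : -1;
    default: return -1;
    }
}

/* Length of the longest identifier prefix of s, or 0 if none. */
static size_t match_ident(const char *s) {
    int state = 0;
    size_t i = 0, last_accept = 0;
    while (s[i] && (state = next_state(state, s[i])) != -1) {
        ++i;
        if (state == 1) last_accept = i;  /* remember last accepting state */
    }
    return last_accept;
}
```

A generator like lex emits the same structure, only with the switch flattened into a lookup table.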


Yes. I don't think anyone is disagreeing here.

If you're implying that we should then use a regex implementation instead: Coding up a lexer (for a mainstream programming language) using simple counter increments and such is not a lot of work. It has the advantage that it results in faster code (unless you're going for real heavy machinery) and that you can easily code up additional transformations. For example, how would you parse string literals (including escape sequences) with a regex?


String literals are easy without escaped quotes. With escaped quotes it's annoying, and the non-regex version is much cleaner.


"(.|\.)*"


"Evaluation:\tWrong!\x07\x07\x07\r\n"

You want to convert the string literal to an internal buffer, interpreting the escape sequences. In the same way, you want to parse integers. You cannot really do that with a regular expression. RE is for matching, not for transforming.
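That transformation step can be sketched in plain C (function and parameter names here are hypothetical): a single loop that both recognizes the rest of a string literal and decodes its escape sequences into a buffer, which a pure matcher cannot do.

```c
#include <stddef.h>

/* Hypothetical sketch: scan a string literal at src (just past the opening
   quote), decoding escapes into out. Returns the decoded length, or -1 on
   a malformed or unterminated literal. Only a few escapes are handled. */
static int lex_string(const char *src, char *out, size_t cap) {
    size_t n = 0;
    while (*src && *src != '"') {
        char c = *src++;
        if (c == '\\') {                 /* escape sequence */
            switch (*src++) {
            case 'n':  c = '\n'; break;
            case 't':  c = '\t'; break;
            case 'r':  c = '\r'; break;
            case '\\': c = '\\'; break;
            case '"':  c = '"';  break;
            default: return -1;          /* unknown escape */
            }
        }
        if (n >= cap) return -1;         /* buffer full */
        out[n++] = c;                    /* transformed, not just matched */
    }
    return *src == '"' ? (int)n : -1;    /* require the closing quote */
}
```

Matching and transforming happen in one pass; with a regex you would match first and then walk the lexeme a second time anyway.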


> You don’t really use regex in C.

Speak for yourself. There’s a POSIX standard for regex that is more than 30 years old & a GNU implementation that comes with gcc. C++ has regex in the standard library.


"You" used like "one", or "in common practice it isn't really used". It's just not in C's spirit to use canned libraries. Such use cases have long been transferred to Python and other languages. Sure, there is a regex API in the C standard library. I'd bet not even grep uses it.

The one use-case I'm envisioning is quickly exposing POSIX conformant regexen to the command-line.


> I’d bet not even grep uses it.

I’ll take you up on that bet. I see #include <regex.h> in every source repo of grep I can find right now.

http://git.savannah.gnu.org/cgit/grep.git/tree/src/search.h

https://opensource.apple.com/source/text_cmds/text_cmds-99/g...

https://android.googlesource.com/platform/system/core.git/+/...

https://github.com/c9/node-gnu-tools/blob/master/grep-src/sr...

You owe me a beer. :)

BTW, I do super agree with your comment to just not use strtok, and also the idea that most people are better off parsing text in perl or python...


Not sure why that header is included there, though. I can't find any uses of the regex library. There are multiple custom matchers implemented. So maybe it's that the GNU regex library uses the grep sources. In any case, there don't seem to be any uses of regexec() or regcomp(), for example. Which would have surprised me anyway since that API is rather limiting (you cannot search incrementally).

Let's get that beer sometime, anyway.


GNU grep uses the regex library in Gnulib. It can also use the Perl-compatible pcre library.

I am finding it hard to believe that you think it's "just not in C's spirit to use canned libraries."


I was saying "canned", not "boxed" or "containerized". It's a lot easier to write the little things yourself. (And there are good benefits to be had from writing specialized code yourself, instead of relying on big fat generalized tankers.)


I don't know what distinction you're trying to make about "canned" vs. other things.

There are advantages and disadvantages to using libraries in any language. There are some aspects of C (e.g. the lack of garbage collection) that make it harder to reuse code compared to some languages. But it is definitely not against "C's spirit".


The distinction is one of size and number. An npm for C would make no sense.


Lexer generators are pretty popular in C land. I mean, there's Yacc, obviously. And then there's more low-level stuff like re2c.


Yacc is a parser generator, not a scanner generator. You meant lex/flex.


I wrote a tokenizer for a language I’m creating, and all I needed was read character and peek character from an iterator


Guess what? You wrote a state machine (DFA most likely) in code, where the program counter represents the state. There is in all probability (if your language is sane), a 1:1 correspondence between the code you wrote and the regular grammar of your tokens. You implemented a regular language matcher, i.e. a regex matcher, but in code rather than via a regular expression language interpreter or compiler.

Typical pattern:

    while (p < eof && isspace(*p)) // [ ]* (bounds check before dereference)
        ++p;
    if (p == eof) return EOF;
    start = p; // token text begins after the skipped space
    if (is_ident_start(*p)) { // [a-z]
        ++p;
        while (is_ident(*p)) // [a-z0-9]*
            ++p;
        set_token(start, p - start);
        return IDENT;
    } else if (is_number(*p)) { // [0-9]
        ++p;
        while (is_number(*p)) // [0-9]*
            ++p;
        set_token(start, p - start);
        return NUMBER;
    } // etc.
Corresponds to:

    IDENT ::= [a-z][a-z0-9]* ;
    NUMBER ::= [0-9][0-9]* ;
    SPACE ::= [ ]* ;

    TOKEN ::= SPACE (IDENT | NUMBER) ;
Inline those nonterminals, and guess what - regular expression!


That works until you need your own implementation of ++p;. Now wiring up that implementation to a generic library is already more work than just doing it all yourself. Not even considering the integration costs of the library into the sources and the build.

And you cannot really do transformation instead of only matching with RE. Those are needed already in simple cases like string and number literals. Now your code will look more like

    %{
    #include "y.tab.h"
    int num_lines = 1;
    int comment_mode=0;
    int stack =0;
    %}
    digit ([0-9])
    integer ({digit}+)
    float_num ({digit}+\.{digit}+)
    %%
    {integer} {  //deal with integer 
                    printf("#%d: NUM:",num_lines); ECHO;printf("\n");
                    yylval.Integer = atoi(yytext);
                    return INT;
                   }
    {float_num} {// deal with float
                     printf("#%d: NUM:",num_lines);ECHO;printf("\n");
                     yylval.Float = atof(yytext);
                     return FLOAT;
                     }
    \n         { ++num_lines; }
    .          if(strcmp(yytext," "))ECHO;
    %%
    int yywrap() {
    return 1;
    }
(copied from stackoverflow). And that's before preprocessing. Yeah. Thanks, but no thanks.


You could say the same about strtok--it would be trivial to write a regular grammar expressing whatever your code is lexing with it. But from that perspective, the distinction disappears not just between a hand-rolled lexer and a regex-matching library, but also between those and strtok, and there would be no point in having this conversation.
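The equivalence is easy to see: strtok with a delimiter set D is just a repeated matcher for the regular expression [^D]+. A quick sketch (the counting helper is hypothetical; note that strtok mutates its input and keeps hidden state, so the buffer must be writable):

```c
#include <string.h>

/* Sketch: count the tokens strtok produces for a writable buffer.
   strtok(buf, delims) repeatedly matches the regex [^delims]+ ,
   the same regular language a hand-rolled lexer would encode. */
static int count_tokens(char *buf, const char *delims) {
    int n = 0;
    for (char *tok = strtok(buf, delims); tok; tok = strtok(NULL, delims))
        ++n;
    return n;
}
```

The hidden static state is also why strtok is unusable for nested or concurrent scans; strtok_r exists for that reason.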



