Well, yes, using strtok works if the data happens to be structured in a certain simple way.
Very often you want to do something more advanced though, and using regex for matching tokens is then necessary.
Agreed. Regex can make parsing code much more succinct and easier to grok (although usually at a small performance cost). So not "necessary", but it can be really useful.
Most lexers are state machines, either explicit with tables (like you get from lex) or implicit with program counter (with loops and switches). Those state machines implement matchers for regular languages; they're effectively hand-coded implementations of regular expression matching.
Regular expressions don't show up outside the spec, sure; but if you're writing the code (for an implicit state machine), you need to know exactly where you are in the regular language that defines the tokens in order to write good code. Writing a regex matcher in code like this is like writing assembly - mentally, you're mapping to a different set of concepts all the time.
If you're implying that we should then use a regex implementation instead: Coding up a lexer (for a mainstream programming language) using simple counter increments and such is not a lot of work. It has the advantage that it results in faster code (unless you're going for real heavy machinery) and that you can easily code up additional transformations. For example, how would you parse string literals (including escape sequences) with a regex?
You want to convert the string literal to an internal buffer, interpreting the escape sequences. In the same way, you want to parse integers. You cannot really do that with a regular expression. RE is for matching, not for transforming.
Speak for yourself. There’s a POSIX standard for regex that is more than 30 years old & a GNU implementation that comes with gcc. C++ has regex in the standard library.
"You" used like "one", or "in common practice it isn't really used". It's just not in C's spirit to use canned libraries. Such use cases have long been transferred to Python and other languages. Sure, there is a regex API in the C standard library. I'd bet not even grep uses it.
The one use case I'm envisioning is quickly exposing POSIX-conformant regexen on the command line.
Not sure why that header is included there, though. I can't find any uses of the regex library. There are multiple custom matchers implemented. So maybe it's that the GNU regex library uses the grep sources. In any case, there don't seem to be any uses of regexec() or regcomp(), for example. Which would have surprised me anyway since that API is rather limiting (you cannot search incrementally).
I was saying "canned", not "boxed" or "containerized". It's a lot easier to write the little things yourself. (And there are good benefits to be had from writing specialized code yourself, instead of relying on big fat generalized tankers.)
I don't know what distinction you're trying to make about "canned" vs. other things.
There are advantages and disadvantages of using libraries in any language. There are some aspects of C —e.g. the lack of garbage collection— that make it harder to reuse code compared to some languages. But it is definitely not against "C's spirit".
Guess what? You wrote a state machine (a DFA, most likely) in code, where the program counter represents the state. There is, in all probability (if your language is sane), a 1:1 correspondence between the code you wrote and the regular grammar of your tokens. You implemented a regular language matcher, i.e. a regex matcher, but in code rather than via a regular expression language interpreter or compiler.
Typical pattern:
while (p < eof && isspace(*p))   // [ ]*
    ++p;
if (p == eof)
    return EOF;
start = p;
if (is_ident_start(*p)) {        // [a-z]
    ++p;
    while (is_ident(*p))         // [a-z0-9]*
        ++p;
    set_token(start, p - start);
    return IDENT;
} else if (is_number(*p)) {      // [0-9]
    ++p;
    while (is_number(*p))        // [0-9]*
        ++p;
    set_token(start, p - start);
    return NUMBER;
} // etc.
Corresponds to:
IDENT ::= [a-z][a-z0-9]* ;
NUMBER ::= [0-9][0-9]* ;
SPACE ::= [ ]* ;
TOKEN ::= SPACE (IDENT | NUMBER) ;
Inline those nonterminals and, guess what: [ ]*([a-z][a-z0-9]*|[0-9][0-9]*) - a regular expression!
That works until you need your own implementation of ++p;. Now wiring up that implementation to a generic library is already more work than just doing it all yourself. Not even considering the integration costs of the library into the sources and the build.
And with an RE you can only match, not transform. Transformations are needed already in simple cases like string and number literals. Now your code will look more like a converter interleaved with the matcher.
You could say the same about strtok - it would be trivial to write a regular grammar expressing whatever your code is lexing. But that perspective erases the distinction not just between a hand-rolled lexer and a regex-matching library, but also between those and strtok. There would be no point in having this conversation.