String tokenization in C (onebyezero.blogspot.com)
155 points by throwaway2419 on Dec 15, 2018 | 114 comments



The actions of strtok can easily be coded using strspn and strcspn.

https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2001]

https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2011 repost]

strspn(s, bag) calculates the length of the prefix of string s which consists only of the characters in string bag. strcspn(s, bag) calculates the length of the prefix of s consisting of characters not in bag.

The bag is like a one-character regex class; so that is to say strspn(s, "abcd") is like calculating the length of the token at the front of input s matching the regex [abcd]* , and in the case of strcspn, that becomes [^abcd]* .


And it’s nicer, since you can pass in a const char * and use it in concurrent code.


strtok is one of the silliest parts of the standard library. (And there are many bad ones). It's broken. It's not thread safe (yes there is strtok_r). It's needlessly hard to use. And it writes zeros to the input array. The latter means it's unfit for most use cases, including non-trivial tokenization where you want e.g. to split "a+1" into three tokens.

If you program in C please just write those four obvious lines yourself.


If you program in C please just write those four obvious lines yourself.

Those are not necessarily obvious lines; there are several pitfalls to avoid, and for that reason strtok() is much longer than four lines. Among the standard library functions, strtok() has well-defined behaviour that is easy to reason about and comes surprisingly close to the string-splitting convenience of scripting languages.

In contrast, a truly sickening part of the stdlib is converting strings to numbers. The atoi()/atol() family doesn't check for errors at all, so you want to use strtol(). But the way error checking works in strtol() is so complex that the man page has a specific example of how to do it correctly. All sane programmers quickly write a clean wrapper around strtol() to encode the complexity once. Now, strtok() is nothing like that.

In its simplicity, strtok() is quite versatile. A few strtok() calls can easily parse lines like:

    keyword=value1, value2, value3
that you might find in configuration files. And I mean truly in just a few lines, the kind you might expect in Python, but from C string handling? Hardly.


Here is the musl implementation.

> https://github.com/esmil/musl/blob/master/src/string/strtok....

It's a bit longer than 4 lines because strtok does things you should not want. If you insist on parsing that configuration line with strtok, go ahead and write that brittle code. It breaks as soon as you want empty strings (try "keyword=value1, , value3" with strtok) or escape sequences or other transformations, or as soon as you want to do something as basic as parsing from a stream instead of a string that is completely in memory.

So to clarify, of course you are never done with parsing in 4 lines. But even if it wasn't as braindead to overwrite the input string, the functionality strtok provides would not be worth more than 4 lines.


So, here's that implementation:

    static char *p;
    if (!s && !(s = p)) return NULL;
    s += strspn(s, sep);
    if (!*s) return p = 0;
    p = s + strcspn(s, sep);
    if (*p) *p++ = 0;
    else p = 0;
    return s;
Instead of carrying that code, or something similar, with my source code or my own utility library I'd much rather have the already debugged version from the standard library.

Overwriting the input in C is more efficient than maintaining more internal state and returning a pointer and the length of each token which you would need to strncpy() to get the token into a C string. strtok() does not want to do the initial strdup() for you because only you will know whether your input can already be mutated or whether you need to use a copy.

As I pointed out in the other reply, strtok() does not break on strings like "keyword=value1,, , value3" unless you skipped RTFM and expect it to do something completely different. And more often than not that's exactly what you want when parsing human-written input which you can expect to take a specific form.

If you want to handle escape sequences, parse from a stream (without having the option to fgets() the next line into memory), or parse CSV tables without collapsing columns, then you will want to use something more specific to that. Luckily, strtok() was not advertised as a Swiss army knife, so it's off the hook for specific parsing purposes like those.


As someone else pointed out what you really want is the implementation of strspn/strcspn which is where the loop is. You don't "carry" that code along. You just write

    while (i < len && !is_token_begin(buf[i]))
        i++;
    if (i == len)
        error("End of input\n");
    start_token(tok);
    while (i < len && is_token_char(buf[i])) {
        add_to_token(tok, buf[i]);
        i++;
    }
    end_token(tok);
or something along those lines. Whatever you need. It's not rocket science. Putting highly fluctuating and project-specific code like this in a library would only have disadvantages. Not everything should be in a library. In fact, most things should not be.


Unless you're using a pre-specified configuration file format (e.g. TOML), then parsing configuration files requires a general parsing library. This is a non-trivial task requiring a real parser operating over a well-specified grammar. A tokenization pipeline just won't cut it.

I worked on a project a few years ago that read its custom-format config file in line by line, chopped everything off each line following the first '#' character (to support comments), and then trimmed the whitespace. This sounds like a reasonable and elegant approach until you consider that now none of your user-controlled fields (via a GUI in our case) can contain the '#' character. This affected customers, but nobody ever fixed it.

With the tools and languages out there now, there's just no excuse for this crap.


Unless you're using a pre-specified configuration file format (e.g. TOML), then parsing configuration files requires a general parsing library.

If you have needs that require a general parsing library then why are you criticizing strtok()? It doesn't parse XML either, not C source code, nor any unspecified configuration file formats.


My point is that strtok has almost no reasonable uses in real programs.


> A few strtok() calls can easily parse lines like:

     keyword=value1, value2, value3
The challenge with parsing isn’t parsing correct inputs; it’s generating useful error messages and recovering on incorrect inputs such as

     keyword=,,value1, value2, value3,,,,
or even

     =keyword=,,value1, value2, value3,,,,
strtok isn’t the best tool for doing that.

(Yes, those could be valid inputs, but if they are, chances are they should be parsed differently)


That's what I meant: strtok() is well-defined in what it does. The man page is really short. I don't understand people who complain it doesn't do something it's not supposed to do. Yet it's very useful in what it does.

Your mileage may vary, but it's also a common issue having to filter empty items out of something like ",foo,bar ,,,,baz,,xyzzy," where you only really want those four words. You will especially encounter this when parsing user input, which might have extra whitespace or badly formed lists of items.

Use strtok() to split C strings into tokens delimited by the given set of delimiters. If you want to catch each comma but skip over any whitespace and split at the first '=' only, then use something else.

Let's say I'm going to need a configuration file for my program: I'd most certainly start with something I can parse with strtok(). I would need very specific requirements to warrant a more complex format requiring a more sophisticated parser, in which case using one would be a no-brainer.


I use strtok_r from time to time, it does the job if you have a mutable input. Of course having to write zeroes is a bit cumbersome but it's one of the drawbacks of C-style strings.

The plain truth is that string handling in C is a huge pain in the ass no matter how you look at it. Splitting, concatenating, regex-ing... All of that is a huge pain in C. If you need to write a high-performance parser then it might be worth it, but if you're just parsing a fancy command-line format and performance doesn't matter, it's just incredibly frustrating and error-prone.

Rust fares better here because its str type is not NUL-terminated but actually keeps track of the length separately which makes it significantly more flexible and versatile. Of course you could do that in C but you'll be incompatible with any code dealing with native C-strings.

And of course you make one mistake and you have a buffer overflow vulnerability...

So yeah, if you program in C, please use strtok_r if applicable; otherwise consider offloading the parsing to another part of your application written in a language better suited for it, and hand over binary data to the C library. If everything else fails, consider handwriting your parser, and may god be with you. Oh, and if your grammar is complex enough to warrant it, there's always lex/yacc.


> The plain truth is that string handling in C is a huge pain in the ass no matter how you look at it.

It is. And it's not even only C's fault; 80% of it is bad API design. Strings could be accepted as a struct consisting of a pointer and a length, aka a string_view, with some manipulation functions around it. That would make those APIs a lot more flexible (one no longer needs to care whether things are null-terminated, and there would be fewer pointless copies).

For these reasons, my estimate these days is that the average C program using stdlib functions is less efficient than an implementation in another language, even though the authors would claim otherwise (it's C, it must be fast).


What happens if you truncate your string? You lose information about the buffer size. So now you need to store two sizes for such strings to be useful, string size and buffer size, which is 16 bytes of size_t on 64-bit systems. On top of that, strings are no longer arrays. So either the language would have to incorporate first-class support for these strings, or you'll have people extracting the string pointer from the struct to perform indexing operations on it themselves.


The approach obviously only works for read access and not for mutation; however, those are the most often required operations. For owned strings and mutations, different APIs are required. C++'s string_view vs string, and Rust's str slices vs owned Strings, work like this.


One of the big performance wins of the D programming language over C is that arrays carry their length instead of being 0-terminated, so you can "slice" strings to get substrings, rather than allocate/copy/zero (and then get the free in the right place!).


> but you'll be incompatible with any code dealing with native C-strings

Not entirely, see https://github.com/antirez/sds

Basically, you have a header storing length, etc, but still null terminate, so library functions like strlen are none the wiser.


You lose the ability to create zero-cost slices, though. C++11 actually implements strings like this, but iterators allow you to pass in parts of the string as necessary.


Nim (the programming language) implements strings this way, AFAIK, with the same reason: to be easily compatible with the C ecosystem.


Much of libc is terrible from an API design perspective, even given the limitations of C as a language.

libc has somehow managed to hit the sweet spot and have APIs that are both inconvenient to use properly, and perform poorly.


Has any attempt been made to build a nicer foundational C/OS library since?


There are of course tons of code on the net, but there is no need to standardize another grab-bag of bad API calls. If you need a batteries-included library you use python or similar. If you write in C, you care so much that you largely avoid the standard library and design your code from the ground up.


You should. However, there are lots of organizations around that use C because that's what's used in their domain (e.g. embedded) AND use the standard library, because they don't know how to do better.

The results are often worse than if they would have used a higher level language right from the start.


There are numerous reasons to write in C apart from caring so much. Using strtok() is not that bad, because it's a simple function after all.


BSD has some minor improvements to libc. There's also glib.

Generally though, C is just a terrible language to do anything other than write extremely low level routines in. You should probably just never use strtok() and go straight to something like Ragel or re2c for building high performance tokenizers... then call those from a higher level language.


Hey! I take exception to that. I've been writing shellcode all week. C is uncomfortably high level for writing low level routines!


libc (ISO, POSIX, GNU) wasn't designed; it more or less evolved from what was in use.


strncat is another good example. It's a buffer overflow waiting to happen, and worse it's the 'safe' version of strcat. The trick is the parameter you pass is not the length of the destination buffer, but the remaining length of the destination buffer. Most people really want the behaviour of strlcat.


Nope, the safe versions of strcat are strcat_s and strncat_s, with bounds check and ensured NULL termination. strlcat does no bounds check.


glibc's continued resistance to adding strlcat and strlcpy is a travesty. :-(


I'd say just use memcpy.


How does that help? You have the same bookkeeping problem with memcpy as you do with strncat.


str[ln]cat is like memcpy, but it does unnecessary work (searching for the end of the destination string, which the caller should long since have figured out). It's like memcpy but less general and more confusing to use, due to the implicit string length of the target operand. It's a completely unnecessary API, likely a remnant of the times when people used C for scripting.

It's also complete idiocy to possibly not copy everything. I've never wanted that. It's begging for bugs. If there's not enough room to copy all the source bytes, you either know that in advance or you want to crash hard.

Even memcpy itself is hardly necessary as a library function. It's a simple loop. But at least it's what you would write anyway, and in some cases compilers are able to optimize it (with questionable benefits).


If you use memcpy and your destination buffer is too small you don’t crash hard with any certainty, your program might even appear to work.

Moreover, most of the time I’m either certain the output buffer is large enough (maybe because I just allocated it) or I really do want the output buffer to have a truncated string. strncpy’s two dumb behaviors that I almost never want are to pad the destination buffer with zeros if the source string is smaller, and to leave the destination unterminated if the source is longer.


Sure, however the problem I'm highlighting is most programmers are not aware this std library function is not like the other str[n] functions. There are plenty of solutions, but it was an objectively poor decision to define strncat the way it was.


There are valuable use cases where this matters. For example, parsing FIX messages in finance, this allows you to parse the tag/value pairs with no memory allocations, which matters in low latency HFT applications.


If you want speed, just build a lexer with ragel [1]. It's hard to go faster than a DFA.

[1] http://www.colm.net/open-source/ragel/


> And it writes zeros to the input array.

If you pass strtok_r a const string it can and will bus fault on some systems. This happens when it tries to write a '\0' to the input string. Being an old crusty firmware guy I'm not sailing on the good cargo-cult ship HMS Immutability, but generating side effects in your input data stream is terrible.

There is no way to back up/undo when using strtok_r. When your parsing involves a decision tree that kinda sucks.


> it writes zeros to the input array [which] means it's unfit for most use cases[...]

Other issues with strtok() aside, this seems like a silly reason to discount a standard library function. If you don't want your input munged you can strdup() it. It's rare to find a C program that's so specialized that the performance hit of a strdup() would be unacceptable in a case where strtok() could otherwise have been used.


It would be wasteful to duplicate the input string which includes so much garbage. I would rather just go through the input string and append token by token to the output string, terminated with a single NUL.


Agreed, I basically avoid using strtok because of that. Why would you write zeros in my input...


strtok is there for things like /etc/hosts and /etc/fstab.


Or have those file formats been designed around having to parse them in a C program?


Could be. I don't know which one came first.


Use strcspn() instead


    Token tok;
    start_token(&tok);
    for (;;) {
        int c = look_next_char();
        if (('A' <= c && c <= 'Z') ||
            ('a' <= c && c <= 'z')) {  /* or whatever test */
            consume_char();
            add_to_token(&tok, c);
        } else {
            break;
        }
    }
    end_token(&tok);
Done. There's no point in going through a weird API.


You've not only hard-coded your tokenization rules inside your logic, but you've managed to make it break on anything non-ASCII. I suppose you consider isalpha() to have a weird API?


I look at it the other way: I've hard coded the reading and writing routines inside the tokenization logic.

Being able to do that is exactly the point why it's so much simpler to avoid a silly API such as strcspn (or, god forbid, strtok).

> non-ASCII

yeah i know... Do you prefer strspn(s, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")? Do you think it's faster?

If you're pedantic, you could lex (0x41 <= c && c <= 0x5A). That way at least you consistently read ASCII, even on non-ASCII implementations. But I don't care and it's less readable.

> I suppose you consider isalpha() to have a weird API?

Yes. I do not even understand what it does.

>> isalpha() checks for an alphabetic character; in the standard "C" locale, it is equivalent to (isupper(c) || islower(c)). In some locales, there may be additional characters for which isalpha() is true: letters which are neither upper case nor lower case.

Well in any case I'm sure that's not what I wanted... By the way locale is super hard to use as well. Locale is a process global property. I'm not aware of any way to pass explicit locales to library functions.


> If you're pedantic, you could lex (0x41 <= c && c <= 0x5A)

'A' vs. 0x41 makes no difference for portability. The thing that's unportable about that is that it assumes the characters A..Z are contiguous in your character encoding, which isn't portable C.

Although admittedly having to deal with EBCDIC these days is rare in anything except highly portable programs like C compilers or popular script interpreters.

This is why ctype.h functions exist. Just use them.


> 'A' vs. 0x41 makes no difference for portability. The thing that's unportable about that is that it assumes the characters A..Z are contiguous in your character encoding, which isn't portable C.

Wait, what? If C does not require A..Z to be contiguous, the distinction between 'A' and 0x41 is extremely significant to portable programs intending to parse ASCII when the native compiler encoding is whatever franken-coding doesn't have contiguous latin characters.


Yes. If the problem was trying to parse ASCII consistently that would be the right solution.

My response was to the OP's moving the goalposts to "portably parsing ASCII" in response to his suggested replacement for a C library function not being portable on non-ASCII systems, which makes no sense.


I explicitly wrote "or whatever test" as a comment to the code snippet, and obviously the test was not the point of my comment.

Anyway I think most programming languages nowadays have their source encoding specified as UTF-8 or at least something ASCII-like, so ('A' <= c && c <= 'Z') is in fact what I would likely write, and using isalpha() would technically be a bug just as well.


EBCDIC famously does not have A..Z as contiguous characters, and I wouldn't describe it as a 'franken-coding' just yet - it still finds plenty of use in some places.


Unless you're dealing with mainframes, it's not like you see it everyday.


EBCDIC is a classic example of a franken-coding.

If your compiler's source character set is EBCDIC and you want to parse ASCII files, you must use 0x41, etc, instead of 'A'.


strcspn is ANSI C90.


I recommend ksplit/ksplit_core from Heng Li’s excellent klib kstring.{h,c}[0]. It modifies the string in-place, adding null terminators, and provides a list of offsets into the string. This gives you the flexibility of accessing tokens by index without paying costs of copying or memory allocation.

[0] https://github.com/attractivechaos/klib


I have an obsession with unsafe example code:

  strcpy(str,"abc,def,ghi");
  token = strtok(str,",");
  printf("%s \n",token);
Even if the author knows how many tokens are returned, I would prefer a check for NULL here, since a good fraction of readers might not read further than this bad example.


> I have an obsession with unsafe example code:

It is perfectly OK for example code to be unsafe. You do not wear a parachute when you learn to fly using a simulator. You realize that things will become more serious and complicated in the future, but you have to start with something simple and unsafe, no big deal. Otherwise you will never see the consequences of unsafe code in simple cases.


I think you underestimate how many people blindly copy examples without understanding them. Safe example code results in more correct programs.


> I think you underestimate how many people blindly copy examples without understanding them. Safe example code results in more correct programs.

Even if this is true, the reasoning here is disturbingly short-sighted. Copying code that you do not understand is unacceptable behavior, and I'd say the sooner it blows up in your face, the better. The goal of code examples is to illustrate how things work in a simplified way, and code without error checks is often easier to understand at first. Imagine a hello world with all the possible error checks. That would be incomprehensible.


Well, yes, using strtok works if the data happens to be structured in a certain simple way. Very often you want to do something more advanced though, and using regex for matching tokens is then necessary.


I do not believe that using regex is necessary. I have parsed a lot of code in my life and regex was not a necessity.


Agreed. Regex can make parsing code much more succinct and easier to grok (although usually at a small performance cost). So not "necessary", but it can be really useful.


You don't really use regex in C. You just write a few simple loops. Look up the lexer of the programming language of your choice.


Most lexers are state machines, either explicit with tables (like you get from lex) or implicit with program counter (with loops and switches). Those state machines implement matchers for regular languages; they're effectively hand-coded implementations of regular expression matching.

Regular expressions don't show up outside the spec, sure; but if you're writing the code (for implicit state machine), you need to know exactly where you are in the regular language that defines the tokens to write good code. Writing a regex matcher in code like this is like writing code in assembly - mentally, you're mapping to a different set of concepts all the time.


Yes. I don't think anyone is disagreeing here.

If you're implying that we should then use a regex implementation instead: Coding up a lexer (for a mainstream programming language) using simple counter increments and such is not a lot of work. It has the advantage that it results in faster code (unless you're going for real heavy machinery) and that you can easily code up additional transformations. For example, how would you parse string literals (including escape sequences) with a regex?


String literals are easy without escaped quotes. With escaped quotes it's annoying, and non-regex is much cleaner.


"([^"\\]|\\.)*"


"Evaluation:\tWrong!\x07\x07\x07\r\n"

You want to convert the string literal to an internal buffer, interpreting the escape sequences. In the same way, you want to parse integers. You cannot really do that with a regular expression. RE is for matching, not for transforming.


> You don’t really use regex in C.

Speak for yourself. There’s a POSIX standard for regex that is more than 30 years old & a GNU implementation that comes with gcc. C++ has regex in the standard library.


"You" used like "one", or "in common practice it isn't really used". It's just not in C's spirit to use canned libraries. Such use cases have long been transferred to Python and other languages. Sure, there is a regex API in the C standard library. I'd bet not even grep uses it.

The one use-case I'm envisioning is quickly exposing POSIX conformant regexen to the command-line.


> I’d bet not even grep uses it.

I’ll take you up on that bet. I see #include <regex.h> in every source repo of grep I can find right now.

http://git.savannah.gnu.org/cgit/grep.git/tree/src/search.h

https://opensource.apple.com/source/text_cmds/text_cmds-99/g...

https://android.googlesource.com/platform/system/core.git/+/...

https://github.com/c9/node-gnu-tools/blob/master/grep-src/sr...

You owe me a beer. :)

BTW, I do super agree with your comment to just not use strtok, and also the idea that most people are better off parsing text in perl or python...


Not sure why that header is included there, though. I can't find any uses of the regex library. There are multiple custom matchers implemented. So maybe it's that the GNU regex library uses the grep sources. In any case, there don't seem to be any uses of regexec() or regcomp(), for example. Which would have surprised me anyway since that API is rather limiting (you cannot search incrementally).

Let's get that beer sometime, anyway.


GNU grep uses the regex library in Gnulib. It can also use the Perl-compatible pcre library.

I am finding it hard to believe that you think it's "just not in C's spirit to use canned libraries."


I was saying "canned", not "boxed" or "containerized". It's a lot easier to write the little things yourself. (And there are good benefits to be had from writing specialized code yourself, instead of relying on big fat generalized tankers.)


I don't know what distinction you're trying to make about "canned" vs. other things.

There are advantages and disadvantages of using libraries in any language. There are some aspects of C —e.g. the lack of garbage collection— that make it harder to reuse code compared to some languages. But it is definitely not against "C's spirit".


The distinction is one of size and number. An npm for C would make no sense.


Lexer generators are pretty popular in C land. I mean, there's Yacc, obviously. And then there's more low-level stuff like re2c.


Yacc is a parser generator, not a scanner generator. You meant lex/flex.


I wrote a tokenizer for a language I’m creating, and all I needed was read character and peek character from an iterator


Guess what? You wrote a state machine (DFA most likely) in code, where the program counter represents the state. There is in all probability (if your language is sane), a 1:1 correspondence between the code you wrote and the regular grammar of your tokens. You implemented a regular language matcher, i.e. a regex matcher, but in code rather than via a regular expression language interpreter or compiler.

Typical pattern:

    while (p < eof && isspace(*p)) // [ ]*
        ++p;
    if (p == eof) return EOF;
    start = p;
    if (is_ident_start(*p)) { // [a-z]
       ++p;
       while (is_ident(*p)) // [a-z0-9]*
           ++p;
       set_token(start, p - start);
       return IDENT;
    } else if (is_number(*p)) { // [0-9]
       ++p;
       while (is_number(*p)) // [0-9]*
           ++p;
       set_token(start, p - start);
       return NUMBER;
    } // etc.
Corresponds to:

    IDENT ::= [a-z][a-z0-9]* ;
    NUMBER ::= [0-9][0-9]* ;
    SPACE ::= [ ]* ;

    TOKEN ::= SPACE (IDENT | NUMBER) ;
Inline those nonterminals, and guess what - regular expression!


That works until you need your own implementation of ++p;. Now wiring up that implementation to a generic library is already more work than just doing it all yourself. Not even considering the integration costs of the library into the sources and the build.

And you cannot really do transformation instead of only matching with RE. Those are needed already in simple cases like string and number literals. Now your code will look more like

    %{
    #include "y.tab.h"
    int num_lines = 1;
    int comment_mode=0;
    int stack =0;
    %}
    digit ([0-9])
    integer ({digit}+)
    float_num ({digit}+\.{digit}+)
    %%
    {integer} {  //deal with integer 
                    printf("#%d: NUM:",num_lines); ECHO;printf("\n");
                    yylval.Integer = atoi(yytext);
                    return INT;
                   }
    {float_num} {// deal with float
                     printf("#%d: NUM:",num_lines);ECHO;printf("\n");
                     yylval.Float = atof(yytext);
                     return FLOAT;
                     }
    \n         { ++num_lines; }
    .          if(strcmp(yytext," "))ECHO;
    %%
    int yywrap() {
    return 1;
    }
(copied from stackoverflow). And that's before preprocessing. Yeah. Thanks, but no thanks.


You could say the same about strtok--it would be trivial to create a regular grammar expressing whatever your code is lexing. However, by looking at it from this perspective it removes the distinction not just between a hand rolled lexer and a regex matching library, but also between those and using strtok. There would be no point in having this conversation.


A lot of experience shows that the string tokenization in Open Object Rexx is darned useful. E.g., for many years, IBM's internal computing ran on about 3600 mainframe computers around the world running VM/CMS, with a lot of service machines written in Rexx. Rexx is no toy but a powerful, polished scripting language, and really good at handling strings.

A little example of some Rexx code with some string parsing is in

https://news.ycombinator.com/item?id=18648999


It used to be that gcc would warn about strtok and recommend strsep instead. I don't know what the status is today.


Strtok is not thread safe and can’t be made thread safe without changing the API. You should not use it.


> Strtok is not thread safe and can’t be made thread safe without changing the API. You should not use it.

Well, there is already a thread-safe variant [0]:

> The strtok() function uses a static buffer while parsing, so it's not thread safe. Use strtok_r() if this matters to you.

[0] https://linux.die.net/man/3/strtok_r


...with different API :P


of course! the original API cannot be made re-entrant


Well, strtok could use thread local variables to store intermediate state, to make it threadsafe while maintaining the same API. Not saying this is a good idea, but technically it would work, no?


You could, but that would change the behaviour of existing programs. It might well be that there are well-defined programs out there that use strtok across separate, but properly synchronized, threads.

This is why it's crucial to get APIs right first time.


Yes, as long as you can guarantee that there’s only one tokenization going on per thread at a time.


That’s not a good reason not to use it.

A function can be not thread-safe and still safe to use in single-threaded programs.

The point is that strtok is not a good choice even for single-threaded code.


> The point is that strtok is not a good choice even for single-threaded code.

Why isn't it a good choice exactly? Could you sum it up?


Maybe he means that the function is not re-entrant. You cannot run a loop that tokenizes a string, and somewhere inside, you call a function that uses strtok itself. This can happen inadvertently.


Note though that strsep() is not as portable, because it is an extension to standard C.


It's a tiny function, written in ANSI C, so if you're really concerned about this, just include it with your program. It's an extension to the standard C library, not to C itself.


Except then you have the issue of compilers complaining about double declarations of the function, meaning you'll either get a lot of warning spam on every #include or now hard-require some kind of HAVE_STRSEP header define. Once you go that way, there's no going back, and it only grows from there.


People have been including strsep in packages since the 1990s (people used to include their own snprintfs, ffs). If you're really this freaked out about it, call your local copy "mystrsep" or something like that.

You know what else isn't in POSIX? All the rest of your C code.


In fact, it's not even in POSIX.


To quote the GNU C library manual: “This function was introduced in 4.3BSD and therefore is widely available.”¹

1. https://www.gnu.org/software/libc/manual/html_node/Finding-T...


Widely, except on Windows and various embedded platforms, etc.


> Next, strtok is not thread-safe. That's because it uses a static buffer internally. So, you should take care that only one thread in your program calls strtok at a time.

I wonder why strtok() does not use an output parameter similar to scanf() — and return the number of tokens. Something like:

  int strtok(char *str, char *delim, char **tokens);
Granted, it would involve dynamic memory allocation and the implementation that immediately comes to mind would be less efficient than the current implementation, but surely it’s worth eliminating the kind of bugs the current strtok() can introduce?

Does anyone here have the historical perspective?


Another approach besides library calls and flex is re2c. It preprocesses the source code and inlines regular-expression parsing where you need it. It's very powerful in combination with goto.


  str = (char *) malloc(sizeof(char) * (strlen(TESTSTRING)+1));

  strcpy(str,TESTSTRING);
str = strdup(TESTSTRING)?


AFAIK strtok has restrict on both args since C99. And the safe variants strtok_s and esp. wcstok_s are missing. Strings are unicode nowadays, not ASCII.

https://en.cppreference.com/w/c/string/byte/strtok


...And then the application is required to implement variable length characters, a la Unicode, and you start your strings logic all over...


As long as you're fine with ascii delimiters, strtok et al. work fine for utf-8 strings.


Would you happen to be aware of good Unicode normalization function/lib in C/C++?


Problem is that your token string is going to be quite large. Is there a built-in solution for when tokens are just single chars?


I just use flex. You don’t have to ship flex as a dependency either.


How about just using a language properly suited for string manipulation?



