Wc2: Investigates optimizing 'wc', the Unix word count program (github.com/robertdavidgraham)



> No, you aren't suppose to be able to see how the word-count works by looking at this code. The complexity happens elsewhere, setting up the state-machine.

This is most of the reason I opened this article. I wish they'd spent more time talking about it. In fact, most of the README covers details that are important but irrelevant to their algorithmic improvements.

Maybe they expect you to open up the code, but even there the comments are largely describing what things do and not why, so I feel you need to have some understanding of the algorithm in question. For example, the parts where the state table gets constructed have rather unhelpful comments for the uninformed:

    /**
     * Build an ASCII row. This configures low-order 7-bits, which should
     * be roughly the same for all states
     */
    static void
    build_basic(unsigned char *row, unsigned char default_state, unsigned char ubase)
    
    ...
    
    /**
     * This function compiles a DFA-style state-machine for parsing UTF-8
     * variable-length byte sequences.
     */
    static void
    compile_utf8_statemachine(int is_multibyte)
Even `wc2o.c` doesn't delve into the magic numbers it has chosen. I was hoping this repo would be more educational and explain how the state table works and why it's constructed the way it is.

Does anyone have a good resource for learning more about asynchronous state-machine parsers that could also help explain why this is better than just iterating over characters? I'm guessing maybe it's the lack of branching?


You can figure it out by just looking at the table and output. It prints the number of lines, then words, then characters. Since the lines count is counts[1], you can conclude that when a newline is encountered, the machine will transition to state 1, hence why the newline is the only 2 entry in the character table and all 2 entries in the state machine point to state 1. From the character table, we can see that characters are split between classes 0, 1, 2 which are word, space, and newline characters.

To get the definitions of all the other states, work back from the fact that state 2 is the word count: it must be entered when a word character (class 0) is hit after some space/newline characters. Hence state 0 represents the machine being in the middle of a sequence of whitespace, state 1 is when the machine is in a sequence of newlines, state 2 is the first letter of a word, and state 3 is any subsequent letter of a word. You can then verify that the machine transitions to the correct state from those by checking what it does in case of a word, space, or newline character. But I'm not sure why the table is 4×4. The fourth transition from every state is unreachable since the symbols are all in [0, 2].

As for why the approach is better, I think it's because it avoids branching. If you think about the length of a word, it's quite short compared to a CPU pipeline, so in a branching version of this you're going to be stuck spending most of your execution time on branch misprediction.
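
To make that concrete, here's a sketch of the two inner loops being compared (hypothetical helper tables and function names; the repo's real loop lives in wc2o.c and differs in details):

    #include <ctype.h>
    #include <stddef.h>

    /* Branchy version: whether we increment depends on a data-dependent test,
       which mispredicts at roughly every word boundary. */
    size_t count_words_branchy(const unsigned char *buf, size_t n) {
        size_t words = 0;
        int was_space = 1;
        for (size_t i = 0; i < n; i++) {
            int is_sp = isspace(buf[i]) != 0;
            if (!is_sp && was_space)
                words++;
            was_space = is_sp;
        }
        return words;
    }

    /* Table-driven version: the same decision is folded into the tables, so the
       loop body is just two loads, an add, and the loop branch itself. */
    size_t count_words_table(const unsigned char *buf, size_t n,
                             const unsigned char table[4][4],
                             const unsigned char charclass[256]) {
        size_t counts[4] = {0};
        unsigned char state = 0;
        for (size_t i = 0; i < n; i++) {
            state = table[state][charclass[buf[i]]];
            counts[state]++;
        }
        return counts[2];  /* per the parent comment, state 2 = first byte of a word */
    }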


There are fewer branches but there is now a data dependency between loop iterations which makes each iteration slightly slower (maybe 1-2 cycles additional latency per iteration).

Because newlines are relatively common and unpredictable, a state machine is likely better. But on a long file with no newlines and no spaces the branching one should be slightly faster. (All reasoning from first principles; I have not done any benchmarking!)


The README covers the exact case you describe - the `word.txt` benchmark is just a file with 93MB of the char `x`. The state machine is still faster in this case.


wc2 is faster than wc, but unless I am missing something, I can't find an instance where the author benchmarked wc2 with the core state machine loop replaced with branches.

wc2 and wc might be compiled with different flags / use a different strategy for IO, which makes it hard to compare speeds directly. The theoretical speedup of branches is tiny compared to the potential speedup of changing your compiler flags!


> But I'm not sure why the table is 4×4. The fourth transition from every state is unreachable since the symbols are all in [0, 2].

I think the fourth transition represents illegal Unicode. From there it stays in the illegal state until it hits legal UTF-8, then goes back to counting.


For handling ASCII, the table needs 4 states × 3 classes of chars. Why is it defined as 4×4?

To handle illegal chars, it would need a 4th class but also a 5th state, so that's not the reason.

Can it be to replace a 'modulo' operation with an 'and' in the access to the table?


That isn't relevant in the ASCII example, and the UTF-8 one uses a 256×256 table.


Half of all bytes are illegal ASCII though.


And actually there isn't a state for illegal ASCII so it makes no sense to have a transition to/from it. And it still can never receive a 3 as input and use that transition.


I've read recently (here on HN) that branch predictors are right 99% of the time. Is that inaccurate?


If you're like me, you need to stop thinking of it as "99% of call sites" and start thinking of it as "99% of iteration loops". I know it's super obvious on the face of it, but the implication is that most call sites can repeatedly fail yet be totally overshadowed by insanely high-frequency hot loops that the predictor almost always gets right (it's skewed, just like avg vs median).


This is the 1% of cases. As I understand it, most branches take the same path most of the time. Think of a loop, for instance. With purely statistical prediction (go whichever direction the branch takes more often), loops will be predicted with extreme accuracy, and the result is a huge number of correct branches. Then think about error-handling code, which isn't triggered most of the time because the input is valid. Between these two easy cases, I think you cover the majority of branches in programs, which is why so many of them are predictable.


That's an over-generalization.

Only branches that capture exceptional events, e.g. error checks, show 99% hit rate.

Branches that are part of an algorithm are driven by the data you feed to them. For example in a classical binary search, the prediction is right only 50% of the time.


But there's an overall loop which branches backward to the same spot, regardless of whether the iteration went left or right. At least that's predictable.

If we have an instruction that will conditionally load one of two pointers from memory, the left or right traversal can be made branch-free.

The only reason we need to switch between two code paths in the binary tree descent is that we have two different pieces of code for loading the left or right pointer to the different offsets used in the load instruction.

If we use a two element array for left and right, the indices [0] and [1] select between them. We make the comparison condition calculate a 0 or 1 result and use it as an index.
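
A minimal sketch of that in C (hypothetical node type; the point is that the comparison result becomes an array index instead of a branch):

    #include <stddef.h>

    struct node {
        int key;
        struct node *child[2];   /* child[0] = left, child[1] = right */
    };

    /* The left/right choice no longer needs its own branch: the comparison
       yields 0 or 1, which selects which child pointer to load. Only the
       loop-exit test remains as a branch. */
    struct node *find(struct node *root, int key) {
        struct node *n = root;
        while (n != NULL && n->key != key)
            n = n->child[n->key < key];
        return n;
    }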


And any loop that iterates more than 100 times.


It may be true in some cases, but not in general. I think of it as an observation that for large swathes of code the branch is predictable. Eg, we usually don't go down the error checking codepath. Faced with flow control that depends on random data, such as "is this random number even", the branch predictor will have a hard time doing better than 50%.


Run a profiler on your code sometime and see how many branches it reports taking. You can easily get millions of branches per second, so 1% events happen hundreds of thousands of times.

See for example this quick investigation and look at the number of branches for what's not a very big program https://bnikolic.co.uk/blog/hpc-perf-branchprediction


Here's how the table is made.

For starters you need to know what states the values represent (you should remember this from K&R):

0 we are out of a word, but not via a newline (let's call this character type "a one").

1 we are out of a word, via a newline ("a two" or "a newline").

2 we entered a word ("a zero").

3 we are in a word ("a zero").

So we know we need 4 rows in the table:

{} {} {} {}

Let's fill the 0th row first. We aren't in a word, so the only states we can enter are 0, 1 and 2. That's helpful because we can fill out the 1st element:

{,0,} {} {} {} (a one keeps us in state 0)

{[12],0,[12]} {} {} {} (we know 3 isn't a possible destination as we are out of a word)

{1,0,2} {} {} {} (actually we have a choice of {1,0,2} or {2,0,1}; I'll show {1,0,2}, but this swaps the first two args of the print at the end)

{1,0,2} {} {1,0,2} {} (since a newline also leaves us out of a word, we have the same options)

{1,0,2} {3,0,2} {1,0,2} {} (now 3 is a possible destination if we get a zero; a one will take us to state 0 and a newline to state 2 as before)

{1,0,2} {3,0,2} {1,0,2} {3,0,2} (now 0, 2, and 3 are our only possible states)

As you can see, there is nothing magic about it.
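
Putting that derivation into code, a compilable sketch (not the repo's exact wc2o.c; with this choice of labels, counts[1] ends up as the word count and counts[2] as the line count, which is the print-argument swap mentioned above):

    #include <stdio.h>

    int main(void) {
        /* Rows = current state, columns = character class
           (0 = word char, 1 = space/tab/etc., 2 = newline; column 3 is unused padding). */
        static const unsigned char table[4][4] = {
            {1, 0, 2, 0},   /* state 0: in whitespace, not via newline */
            {3, 0, 2, 0},   /* state 1: just entered a word            */
            {1, 0, 2, 0},   /* state 2: just saw a newline             */
            {3, 0, 2, 0},   /* state 3: inside a word                  */
        };

        /* Character classes: everything is a word char unless it's whitespace. */
        unsigned char cls[256] = {0};
        cls[' '] = cls['\t'] = cls['\r'] = cls['\v'] = cls['\f'] = 1;
        cls['\n'] = 2;

        unsigned long counts[4] = {0}, bytes = 0;
        unsigned char state = 0;
        int c;
        while ((c = getchar()) != EOF) {
            state = table[state][cls[(unsigned char)c]];
            counts[state]++;
            bytes++;
        }
        printf("%lu %lu %lu\n", counts[2], counts[1], bytes);  /* lines, words, bytes */
        return 0;
    }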


"As you can see, there is nothing magic about it."

The parent seems to be using the term "magic numbers" incorrectly.

https://en.wikipedia.org/wiki/Magic_number_(programming)


I’m with you. It appears to be a fairly bog-standard DFA, but divining that from 4x4 `int`s and a 256 `char` array feels like I’d just be doing all of the heavy lifting, intellectually. However… it does look like it’s trying to detect spaces (that’s the array of 256 `char`, as `char` is self-indexing). I can’t be bothered to remember how `wc` works, but my guess is that the state-machine has transition states for `sp->sp`, `sp->wd`, `wd->wd`, and `wd->sp`. It’d also need error modes. I suppose it doesn’t contain states for escapes?


> why this is better than just iterating over characters?

The explanation it provides regarding Apache vs Nginx seemed to me to imply that Apache requires memory allocation to read headers into buffers, while Nginx uses a state machine to avoid memory allocation. Memory allocation is slower than most people realise.


Why is this better for wc, though? wc is not doing memory allocation in the hot loop.


Because classic wc is not iterating over every byte once, but multiple times.

It's especially obvious in the Unicode case, where it first takes 1-4 bytes to get a Unicode character and then checks this character with another function to see if it's whitespace.

But even with the naive ASCII approach, if you don't hand-roll a state machine you are checking multiple conditions on each byte (is it a space, am I leaving a word, etc.).

Using a DFA gives fixed compute per byte.


Those 1-4 bytes are sitting in a register the entire time and thus basically free to read as often as you want, though.

An actual sampled profile showing the two approaches would be interesting. Naively it seems like it's just because it has faster UTF8 handling and nothing to do with being a state machine exactly


According to the authors it's also faster on files full of 'x' or ' ', so there must be more than just better unicode support.


Even something as simple as wc calling a libc function for parsing utf-8, that doesn't get inlined, would destroy its performance relative to anything optimized.

Personally, I'd expect SIMD to win over all of these. wc sounds like the kind of challenge that's very easy to partition and process in chunks, though UTF-8 might ruin that.


> very easy to partition and process in chunks

Which counters to increment at each byte depends on the previous bytes, though. You could probably succeed using overlapping chunks, but I wouldn't call it very easy.


That sort of "find the correct chunk boundary" logic was very common with all the mapreduce processing that was done when people still used the phrase big data.


I think it keeps wc simple; if an algorithm is introduced that's not directly related to counting words, it becomes more complicated to debug. I see this as an example of the Unix philosophy of doing one thing well.


Rob Graham wrote an article on this wc program in the humorously titled PoC||GTFO journal - see [1]. He also tweeted about it a few years ago [2].

[1] https://github.com/angea/pocorgtfo/blob/master/contents/arti... [2] https://twitter.com/ErrataRob/status/1494009849427992576


Hmm,

A project with all code working with special magic reminds me of another project that got some attention a while back (XZ lib - supposed improvements with packages containing clever system back doors).

Edit: The code is likely harmless, but the more opaqueness there is in a system, the easier it is for malevolent opaqueness to hide.


That is not what they mean by “magic”. The “magic” in xz is the complexity in the build system. Here, the “magic” are seemingly arbitrary numbers that improve performance. The reason they’re magic is because we don’t know why they were chosen. The code itself is obvious and not magic.


I think the documentation might be stale; `build_basic` clearly has code for non-ASCII characters.


Writing text and writing code are two distinct abilities. It is a rare professional who possesses both.


That used to not be the case. Has something changed?


Yeah, Sean Barrett has a good writeup about them and optimizing them for branch prediction, etc.

It's framed in the context of programming-language tokenization, but the principles are the same.

https://nothings.org/computer/lexing.html


State machines are an underrated approach.

Just remember, if you're ever debugging something and some values in a Class/Object/Group of them are set and others are unset the state machine avoids that issue. When you go to transition between states, check that the future state will be valid, and if not, go to an error state with that information. The amount of time I've seen spent debugging where spurious `null`s came from completely dwarfs the time it would've taken just to write that class/algorithm as a state machine.
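
A sketch of what I mean, with made-up states (the one detail that matters is that every transition goes through a single choke point that can reject it and record why):

    #include <stdio.h>

    enum state { IDLE, CONNECTING, CONNECTED, CLOSED, ERROR, NSTATES };

    struct machine {
        enum state state;
        enum state rejected_from, rejected_to;   /* filled in when we hit ERROR */
    };

    /* The only place that encodes which transitions are legal. */
    static const unsigned char allowed[NSTATES][NSTATES] = {
        [IDLE]       = { [CONNECTING] = 1 },
        [CONNECTING] = { [CONNECTED] = 1, [CLOSED] = 1 },
        [CONNECTED]  = { [CLOSED] = 1 },
    };

    void transition(struct machine *m, enum state next) {
        if (m->state == ERROR)
            return;                        /* stay put once something went wrong */
        if (!allowed[m->state][next]) {
            m->rejected_from = m->state;   /* keep the evidence for debugging */
            m->rejected_to = next;
            m->state = ERROR;
            return;
        }
        m->state = next;
    }

    int main(void) {
        struct machine m = { IDLE, IDLE, IDLE };
        transition(&m, CONNECTING);
        transition(&m, CLOSED);
        transition(&m, CONNECTED);         /* illegal from CLOSED: lands in ERROR */
        if (m.state == ERROR)
            printf("bad transition %d -> %d\n", m.rejected_from, m.rejected_to);
        return 0;
    }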

Kinda surprised wc wasn't a state machine to begin with. Isn't it effectively a special-character counter where, if you see a character followed by a space, you bump up the word count? I'm judging by the repo's comment of "The real programs spend most of their time in functions like mbrtowc() to parse multi-byte characters and iswspace() to test if they are spaces -- which re-implementations of wc skip." that the big improvement is removing the unnecessary work of those methods. mbrtowc [1] appears to re-create the provided substring, which isn't necessary for counting.

[1]: https://en.cppreference.com/w/cpp/string/multibyte/mbrtowc


> Just remember, if you're ever debugging something and some values in a Class/Object/Group of them are set and others are unset the state machine avoids that issue.

Better yet, switch to a language that supports discriminated unions to avoid invalid state without having to write a low-level state machine. Each case in the union represents one of the valid states of the type. This is one of the many benefits of functional programming.


This comes down to "Make Invalid States Unrepresentable"

Notably Go's choice here is instead "Make Representable States Valid". So for example in Go it's not possible for a type not to have a default value - every type must have a value you get by default, you could name this unwanted default value FUCK_OFF or DO_NOT_USE if you like, but in a larger system (where Go is supposed to thrive) you can be certain you'll find FUCK_OFF and DO_NOT_USE values in the wild.

This is one of the few things the C++ type system almost gets right. You really can say for your C++ type no, it doesn't have a default. Unfortunately exception safety means we can easily end up in a place where too bad it's in some state anyway, but if we turn off exceptions or discount them we can live without nonsensical defaults.


> This is one of the few things the C++ type system almost gets right. You really can say for your C++ type no, it doesn't have a default.

Rust does even better here, I think, for fairly subtle reasons:

- Exceptions are replaced by Result, with explicitly-marked return points. They're far less magic. ("panic!" is still magic, but it's allowed to be a call to "abort", so you only use it when the world is burning.)

- "Move semantics" by default makes a lot of edge cases cleaner. If you move out of an object by default, it gets consumed, not set to a default value. Similarly, moving values into a newly-constructed object can't fail, because it's just memmove.

- Object construction is atomic from the programmer's perspective.

- Default initialization only works if you implement "Default". Which is actually just a normal trait with no magic.

- If you discover that you have a state machine, you can switch to discriminated unions ("enum") to make your states mutually exclusive and clearly represented.

This whole area is one of the better parts of Rust's design.


> Default initialization only works if you implement "Default".

And -- and I think this is important -- even then, you have to explicitly assign the default value[0]. You won't just somehow magically get a default value.

[0] E.g.:

    let foo: SomeType = Default::default();


Maybe something like CppFront can allow metaprogramming such that the default constructor is deleted by default on all types and becomes opt-in.


> Just remember, if you're ever debugging something and some values in a Class/Object/Group of them are set and others are unset the state machine avoids that issue.

You don't want a state machine for that, you want sum types. Using state machines gives you the problem of not being able to tell how you got into a given state, and of not being able to take meaningful stack traces, which is generally an even worse problem.


State machines are great for complex situations, but when it comes to performance, it's not at all clear to me that they're the most scalable approach with modern systems.

The data dependency between a loop iteration for each character might be pipelined really well when executed, and we can assume large enough L1/L2 cache for our lookup tables. But we're still using at least one lookup per character.

Projects like https://github.com/simdjson/simdjson?tab=readme-ov-file#abou... are truly fascinating, because they're based on SIMD instructions that can process 64 or more bytes with a single instruction. Very much worth checking out the papers at that link.


In the context of State Machines and Automatas - Intel HyperScan might be a better reference point. But the idea is the same. With a trivial PoC using Python wrappers over SIMD libraries one can get a 3x boost over the native `wc` CLI on a modern CPU, memory-mapping a very average SSD: https://github.com/ashvardanian/StringZilla/tree/main/cli


Sorry, but your wc implementation does nothing to detect words, it just counts the spaces. Of course you don't need a state machine for that!


My understanding of the article's use of scalable was "fixed overhead more or less regardless of the complexity of the state machine and input" not "fastest implementation available"


> The algorithm is known as an "asynchronous state-machine parser". It's a technique for parsing that you don't learn in college.

The mapping from regular and context free languages to state machines and their table-based implementations was covered in depth in the Compilers course I took. A table based state machine is literally the textbook algorithm for parsing these classes of languages, and implementing them has been almost entirely automated by tools like Flex and Bison since the 80s.


Compilers was a graduate course at my school.

CS as a field has grown quite a lot over the years, and it seems that colleges are responding by pushing the more theoretical classes into grad school to make room for the more applied classes.


For example, see Chapter 3 of the Dragon Book, Compilers: Principles, Techniques, and Tools by Aho, Ullman, et al. Or probably any other compilers text book.


> This is analogous to NFA and DFA regular-expressions. If you use the NFA approach, you need to buffer the entire chunk of data, so that the regex can backtrack. Using the DFA approach, input can be provided as a stream

Sorry, no, this is completely wrong. There are other reasons NFAs or DFAs might be preferable, and there are other performance considerations as far as backtracking is concerned, but NFA implementations do not backtrack. I mean, they could, but they don’t, and in common usage a “NFA regex implementation” means a non-backtracking one (that goes through all input once with a bag of states in hand and advances each one of them on each step), whereas a “backtracking regex implementation” means one that doesn’t use automata at all (and is usually worst-case exponential in the input).

What this is actually talking about is a streaming parser. Just call it like it is, it’s an OK term that sees plenty of usage.

[What you could mention here is PEG parsers, because even linear PEG implementations may need to keep O(input) state and thus are essentially incapable of streaming, unlike LL or LR ones; and GLR et al. can be even worse, for what it’s worth. But this is a much more obscure topic.]


How do you implement a non-backtracking NFA other than converting it to a DFA?


Assuming each state transition consumes a character (“epsilon-free NFA”), just follow the definition. Before each step, you have a list of NFA states you might be in, initially containing only the start state. During each step, go through each state on that list and enter any acceptable successors into a new one. Deduplicate and repeat.
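
Concretely, a sketch of that simulation in C, keeping the "list" of states as a bitmask (the NFA here is a made-up example that accepts strings ending in "ab"):

    #include <stdint.h>
    #include <stdio.h>

    /* Epsilon-free NFA: state 0 = start (loops on every byte), state 1 = just
       saw 'a', state 2 = accept. delta() returns a bitmask of successor states. */
    enum { NSTATES = 3, START = 1u << 0, ACCEPT = 1u << 2 };

    uint8_t delta(int state, char c) {
        switch (state) {
        case 0:  return c == 'a' ? (1u << 0) | (1u << 1) : (1u << 0);
        case 1:  return c == 'b' ? (1u << 2) : 0;
        default: return 0;                 /* the accept state has no outgoing edges */
        }
    }

    int matches(const char *s) {
        uint8_t current = START;           /* the bag of states we might be in */
        for (; *s; s++) {
            uint8_t next = 0;
            for (int q = 0; q < NSTATES; q++)
                if (current & (1u << q))
                    next |= delta(q, *s);  /* advance every live state at once */
            current = next;                /* the bitmask deduplicates for free */
        }
        return (current & ACCEPT) != 0;
    }

    int main(void) {
        const char *tests[] = { "ab", "aab", "abb", "ba" };
        for (int i = 0; i < 4; i++)
            printf("%-4s %s\n", tests[i], matches(tests[i]) ? "match" : "no match");
        return 0;
    }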

This is (potentially super)linear in the number of states, but that doesn’t preclude it from being linear in the input length. In particular, even though it’s essentially a bad lazy implementation of standard NFA-to-DFA conversion, you’re avoiding the exponential state blowup you get in the eager version. (I say it’s a bad lazy implementation because a proper one would memoize; that, however, would bring the worst-case blowup back.) I think that, in addition to the fact that people do actually do this kind of thing in production implementations, qualifies it as a separate approach.

If your NFA does have epsilon-transitions—like, say, the NFAs obtained from the textbook Thompson regexp-to-NFA construction—you can either use a different (Glushkov) regexp-to-NFA construction[1], eliminate those transitions as a separate pass[2], or adjust the algorithm above to eliminate duplicate NFA states on the fly[3].

[1]: https://en.wikipedia.org/wiki/Glushkov%27s_construction_algo...

[2]: https://news.ycombinator.com/item?id=19272990

[3]: https://swtch.com/~rsc/regexp/regexp2.html


Last time I implemented this myself I just computed the epsilon closure at each step.


At search time, the principal difference between an NFA and a DFA is that, in an NFA, you can be in more than one state at the same time. There's no problem with implementing your transition function to advance all states you are in simultaneously for each character of input. You'll get `O(m * n)` time, where `m` is proportional to the number of states in the NFA and `n` is proportional to the length of the input.

The GP is remarking on a confusable term, "NFA algorithm," that has come to mean something different from its underlying theory. The PCRE2 documentation itself, for example[1], remarks about how it uses the "NFA algorithm" as its standard approach to matching. But of course, PCRE2 supports way more than the theoretical model of non-deterministic finite automata (NFA) actually supports. PCRE2 is much more expressive. Yet, the implementation approach is still called "NFA algorithm." Presumably the reason for this is that an NFA can be simulated by doing backtracking, but if you don't keep track of where you've visited, the worst case runtime of this is superlinear, possibly catastrophically so[2]. The problem is that the backtracking regex engines have evolved well beyond what an NFA actually supports. But the name stuck.

To make things even more confusing, as I mentioned above, an NFA can be used to execute a search in O(m * n) time. (We typically call that "linear in the size of the input" since the NFA itself can usually, but not always, be regarded as a constant factor. i.e., A literal regex string in your program source code.) So when you say something like, "a finite automata regex engine uses NFAs to guarantee linear time," you might be really confused if your background is in backtracking engines which... also use an "NFA," but of course cannot guarantee linear search time.

So what we have is ambiguous terminology, not unlike "regex" itself, which could perhaps mean a real theoretical "regular expression" (in the case of Hyperscan, RE2, Go's regexp package and Rust's regex crate) or it could mean "a very expressive DSL for pattern matching that supports more than what regular languages can describe" (in the case of PCRE2, ECMAScript's regex engine, Ruby's Oniguruma, Python's `re` module, and more). I've made my peace with "regex" being ambiguous, but the fact that "NFA algorithm" is itself ambiguous despite "non-deterministic finite automata" being a somewhat baroque and specific technical term, is very unfortunate.

To make things even worse---and I shit you not, I am not joking---the backtracking people seem to have taken to calling a linear time NFA search the "DFA algorithm."[3] (At least the PCRE2 docs mention it's not a "traditional" finite state machine...) But... it's not a DFA. The backtracking folks have effectively elevated the terms "NFA" and "DFA" to mean something very different from their theoretical underpinnings.

So when you have people that are only aware of one meaning of "NFA algorithm" talking to other people only aware of the other meaning of "NFA algorithm," chaos ensues.

[1]: https://www.pcre.org/current/doc/html/pcre2matching.html

[2]: https://github.com/BurntSushi/rebar?tab=readme-ov-file#cloud...

[3]: https://www.pcre.org/current/doc/html/pcre2matching.html#TOC...


Yeah, it’s vexing. This backtracking / NFA / DFA confusion originally came from Jeffrey Friedl’s book “mastering regular expressions” (O’Reilly, 1997) — at least, that’s the first place I saw the mistake in print. Philip Hazel relied a lot on Friedl’s book when writing PCRE, which is why the PCRE docs get it wrong. (I spoke to Philip about this many years ago when we worked together.)


Yeah I didn't mention Friedl because I've dumped on him in the past, and didn't want to belabor it. But Friedl definitely didn't start this. According to him, he was just using what was common vernacular even back then: http://regex.info/blog/2006-09-15/248 (Note that blog is from 2006, but he's actually quoting himself from ~10 years prior to that in 1997.)

But... he is the one who ultimately installed this ambiguity into a classic book that is still read to this day. So I think he can at least be blamed for popularizing the confusion. And especially since the entire book is framed around "NFA" and "DFA" regex engines. It's not like it was just some one-off mention. The mix-up is baked into the conceptual fabric of the book. The landscape also looked different back then. It predated RE2 for example, and RE2 is, I think, principally responsible for bringing "backtracking semantics" to finite automata oriented engines. So Friedl's book is forever stuck in a false dichotomy that only happened to exist back when he wrote it.


Thanks for the detailed explanation to grok the whole landscape.


As GP said, you can keep a bag of current states.


The "asynchronous state machine" name here is a bit strange, when searching for this term used elsewhere I couldn't find any formal definition what it is. Reading further in the README it looks like the author implies that it really just means a DFA? Not entirely sure.

I'd also like to add the Plan 9 implementation[0], which also uses the properties of utf8 as part of its state machine and anecdotally has been quite performant when I've used it.

[0] http://git.9front.org/plan9front/plan9front/107a7ba9717429ae...


"Asynchronous" isn't part of the name of some really cool state machine :-) Its just an adjective and means the same as when you put it in front of any other noun.

A synchronous state machine is one where the incoming stream of events is always "in sync" with the state transitions, in the following sense:

1. When an event happens, the state machine can transition to the next state and perform any necessary actions before the next event happens

2. After the state machine has transitioned and performed any necessary actions, the program must wait for the next event to happen. It can't do anything else until it does.

An asynchronous state machine doesn't make the main program wait until the next event happens. It can go on and do other things in the meantime. It doesn't have to wait for the next event to arrive.

When the next event does arrive, the program pauses whatever else it is doing, and hands control back to the state machine, which processes the event, and then hands control back over to the main program.


I was not treating "asynchronous state machine" as a noun; even taken as a generic adjective, it doesn't make sense in this context. What "other things" is this wc2.c doing while the state machine is churning? There is no multithreading or multiprocessing going on here. So I find it hard to believe that this use of "asynchronous" fits how I would generally see it used. As such, I thought perhaps it referred to a specific methodology for designing the code, something akin to how the "lock free" adjective implies a certain design sensibility.


// What "other things" is this wc2.c doing...

AFAICT, wc2.c isn't written to be an asynchronous state machine. It doesn't ever seem to transfer the control to any other place.

// So I find it hard to believe that this use of "asynchronous" fits how I would generally see it used

Yeah, you are legitimately confused. The post talks about asynchronous state machines, but wc2.c isn't an example of that. I'm sure this gave you a severe case of WTF?!??

// thought perhaps it referred to a specific methodology for designing the code

It does---that's exactly what it is, a programming methodology, or perhaps better put, a design pattern. But wc2.c isn't an example of code written using that methodology. Again, you are legitimately confused here, because the post talks about something and wc2.c isn't that.

Do you know python? If you google for "asynchronous programming in python" you'll get all kinds of blog posts and youtube videos which explain the technique.


Why would the author of this repository make "wc2 - asynchronous state machine parsing" the header of his README if indeed wc2 was not, by his own definition, an "asynchronous state machine"? I ask you to consider what is more likely: that your blanket definition of asynchronous is incorrect as applied here or the author is just fucking with us by adding random words as the description of his project.


Indeed this is very confusing! The program implements a pretty standard state machine (ok), but there is nothing apparently async here. The author alludes to combining the state machine with async IO in this paper (https://github.com/angea/pocorgtfo/blob/master/contents/arti...), but this implementation is just using fread to (synchronously) read a chunk of bytes.

Furthermore, given disk caching and memory mapping, I'm not convinced async IO would really be that astonishingly different, as individual reads are going to be amortized over pretty much the same bulk reads that the sample program is doing.

As the author says themselves, it seems the main win is hand implementing the incremental utf8 parsing instead of calling a library/os function.


> I ask you to consider what is more likely: that your blanket definition of asynchronous is incorrect as applied here or the author is just [elided] with us by adding random words

LMAO!!! Well, when you put it that way, I can't blame you for not believing me. Your skeptical mindset will no doubt serve you well in this era of deep fakes and AI hallucinations.

Alas, it is also an example of how this skepticism, however necessary, is going to slow down the sharing of information :-( It's the price we're going to pay for so much lying and such a breach of the social contract.

I assure you, however, wc2.c is not asynchronous. It would be nice if the author could step in here and clarify, because it is hella confusing.

I don't believe the author is Effing with us either--documentation and comments are not automatically synced with the code they describe, so it's easy for them to drift apart. Perhaps the author is intending to implement asynchronous features in the future, or perhaps he changed his goals between when he wrote the README and when he wrote the code.


You'll have to ask him.


As I see it, state machines are particularly good for expressing logic in asynchronous systems. For instance, in the late 1980s I wrote assembly language XMODEM implementations for the 6809 and the 80286, and since that kind of code is interrupt driven, it is efficient to make a state machine that processes one character at a time. Today when you use async/await, the compiler converts your code, loops and all, into a state machine.


// 6809

The good old days :-)


"Parallel" seems to be a more popular term in the literature.


I wrote a clean C++ version of `wc` for fun, and successfully beat the system version by 10-20% iirc. My goal was to make the code much more readable than the GNU/C version, which is a big spaghetti loop, while maintaining the same semantics (ie. return the same counts).

Blog posts:

https://bytepawn.com/tag/wc.html

Code:

https://github.com/mtrencseni/wcpp/


A "word" counter in C++ (not really, but kind of)

std::cout << std::distance(std::istream_iterator<std::string>(std::cin), std::istream_iterator<std::string>()) << std::endl;

And I guess you could get "character" count using istreambuf_iterator.

Don't do this :-) (Though I've used it in a pinch)


Also, sorry for the typos... the character version is istreambuf_iterator<char> also... whoops!


I'm surprised there is no mention of simd. Like I'm sure this is "fast enough" but if you want to make a really fast wc for fun wouldn't that be a natural direction?


A hand-written SIMD wc is likely to be even faster than a state machine, but at the cost of orders of magnitude more work. The huge advantage of state machines are that they are a relatively generic approach that provides massive speedups nearly every time. A SIMD wc algorithm is a significant effort that can't be generalized, and is only likely to provide a 4x speedup or so: rarely worth it except in cases where the task is a performance hotspot for a lot of applications.


An ASCII SIMD wc would likely be much more than 4× faster. The 320 MB/s speed TFA quotes is good for a byte-at-a-time solution, but pitiful as far as the capabilities of modern machines are concerned. A decent SSD is an order of magnitude faster than that. Around 1 GB/s (i.e. ~100 cycles/fetch) is table stakes for a task of this complexity.

(TFA almost nerd-sniped me into writing a SIMD implementation already, and you finished the job. End result: on an aging 2.5 GHz Broadwell, I’m seeing 300 MB/s for TFA’s C implementation and 3200 MB/s for my first attempt at an x86-64-v3 one. So more like a 10× speedup, and it’s highly likely one can do better. Admittedly I did spend around an hour on these 50 lines, and would need to spend more in reality to check for CPU features at runtime and so on.)

However, TFA calls going ASCII-only “cheating”, and I can see its point. I don’t know how difficult a Unicode SIMD wc would be or if it’s even possible to do well. (Bonus points: do you want to define a word boundary as a space/nonspace boundary the way[1] POSIX tells you to, or the more complex way[2] Unicode tells you to?)

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/w...

[2] https://www.unicode.org/reports/tr29/#Word_Boundaries


The (cheating, ASCII-only) SIMD implementation for reference:

  #if 0 /* shell polyglot */
  exec ${CC:-cc} $CPPFLAGS -g -O2 -march=x86-64-v3 $CFLAGS -o "${0%.c}" "$0" || exit $?
  #endif
  /* SPDX-License-Identifier: CC0-1.0 */
  
  #include <immintrin.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>
  
  int main(int argc, char **argv) {
      const char *const progname = argc ? argv[0] : "<unknown>";
      unsigned long long bytes = 0, words = 0, lines = 0;
      uint32_t prev = -1;
      for (;;) {
          static __m256i buf[1024 * 1024];
          ssize_t k, n = 0;
          do {
              k = read(STDIN_FILENO, (char *)buf + n, sizeof buf - n);
          } while (k > 0 && (n += k) < sizeof buf);
          if (k < 0) { perror(progname); return 1; }
          if (!n) break;
          bytes += n;
  
          while (n % sizeof *buf) ((char *)buf)[n++] = ' ';
          n /= sizeof *buf;
  
          for (size_t i = 0; i < n; i++) {
              __m256i x = _mm256_load_si256(&buf[i]);
  
              __m256i nl = _mm256_cmpeq_epi8(x, _mm256_set1_epi8('\n'));
              lines += _mm_popcnt_u32(_mm256_movemask_epi8(nl));
  
              const __m256i table = _mm256_broadcastsi128_si256(_mm_setr_epi8(
                  /* 0 */ ' ', 0,    0,    0,    0,    0,    0, 0,
                  /* 8 */ 0,   '\t', '\n', '\v', '\f', '\r', 0, 0));
              __m256i ws = _mm256_cmpeq_epi8(x, _mm256_shuffle_epi8(table,
                  _mm256_and_si256(x, _mm256_set1_epi8(0x0F))));
              uint64_t mask = _mm256_movemask_epi8(ws);
              words += _mm_popcnt_u32(mask ^ (mask << 32 | prev) >> 31);
              prev = mask;
          }
      }
      words += !(prev >> 31);
      printf("%llu %llu %llu\n", lines, words / 2, bytes);
      return 0;
  }


I remember exploring the SIMD fizz-buzz solution[0] and had forgotten just how fast computers can be.

[0] https://codegolf.stackexchange.com/questions/215216/high-thr...


There are degrees to this kind of thing, and this is far far away from that one (exercise: find one instruction in the code that can be deleted without changing anything else). On my (Ryzen 7x40) laptop it does run around two times faster than its SSD can supply data (11 GB/s vs 6 GB/s), and—to my great surprise—gets four times(!) slower if you feed it via (GNU) cat(1) instead of shell redirection (slowdown from piping from pv is below 1.5× though).

Yet it’s nowhere near being bottlenecked on memory bandwidth (theoretical at 80 GB/s, sequential-read actual at 55 GB/s, memset at 45 GB/s, or—probably most realistically given we’re not using zero-copy I/O—memcpy at 25 GB/s). As Daniel Lemire put it, “Data engineering at the speed of your disk”[1].

Unfortunately, to get that speed out of your computer, you end up needing to program any transformations you may need in strange, target-dependent, and most importantly nearly noncomposable ways. Compiler engineers have been working on the problem for more than two decades, but I don’t think we’re putting away our intrinsic references and latency tables any time soon. One good way[2] to explain the problem is that a single core of the CPU mentioned above (for example) has about as much memory it can access essentially instantly as the original IBM PC without addons: about 64 KB (then it was called “RAM”, now we call it “physical registers” and “L1 cache”).

[1] https://www.youtube.com/watch?v=p6X8BGSrR9w

[2] https://retrocomputing.stackexchange.com/a/26457


SIMD-based reimplementations of wc already exist, e.g. this [0] one, and it's rather straightforward.

[0] https://github.com/expr-fi/fastlwc


I think this is but a detail in the context of the article.

Yes, you can implement your state machine on SIMD sized states, but there are also millions of other optimizations you can do like that, aligning your reads, prefetching the file, parallelizing, etc.

Doesn't change the core concept


The primary reason their code is faster is that they've made incorrect code that assumes you always have a UTF-8 locale. The normal isspace does not assume this:

> The real programs spend most of their time in functions like mbrtowc() to parse multi-byte characters and iswspace() to test if they are spaces

This will produce bad results in the real world. People have previously posted about bugs related to this on Hacker News: https://news.ycombinator.com/item?id=36216389


Are we looking at the same thread? Folks there seem to be complaining the old interfaces are anything from outdated to confusing to simply wrong, and I agree.

I think it's totally reasonable for a program designed in 2024 to say it only supports ASCII and UTF-8 encodings. Whether/how it should support the full spectrum of Unicode definitions of characters/graphemes/... is more interesting. For a lot of backend code (processing text-based network protocols, for example), focusing on ASCII is arguably best (e.g. functions like isspace and isdigit only returning true for ASCII characters). For more user-focused stuff, most of the world would say they'd like their native language supported well. Programs written with both use cases in mind should probably have a switch. (Or parallel APIs: e.g. Rust has {u8,char}::is_ascii_digit vs. char::is_numeric, which makes much more sense there than having one that switches behaviors based on an environment variable, as the correct behavior really depends on the call site.)

Of course "say it only supports ASCII and UTF-8 encodings" is mutually exclusive with claiming to be a drop-in replacement for a utility that is specified as having POSIX locale support. This project does not make that claim, or even that it's intended to be actually used rather than illustrative of a performance technique.


> I think it's totally reasonable for a program designed in 2024 to say it only supports ASCII and UTF-8 encodings.

I think that it should depend on the program; that might be reasonable for some programs, but in a lot of cases I think that it won't be reasonable. (Your explanation includes some of the examples, although not all of them.)

Sometimes, it is most helpful to support only ASCII (although non-ASCII bytes might still be supported, even without needing special processing to handle them; in some cases this may effectively allow other ASCII-compatible encodings as well such as EUC-JP).

Sometimes, a program should not need to deal with character encoding at all.

Sometimes, it makes sense to deal with whatever character encodings are used in the file formats the program is designed to handle.

Sometimes, it makes sense to support multiple character encodings, with or without conversion (depending on what is being done with them).

Even if a program does only support ASCII and UTF-8 encodings, then depending on what it does with them, mentioning ASCII might be unnecessary since UTF-8 is a superset of ASCII anyways.

But unfortunately many programs use UTF-8 (or other encodings, but mostly UTF-8) where it is inappropriate to do so, which can result in many problems including inefficiency.


It would be just as fast if it detected the locale correctly, and used the UTF8 state machine for the UTF8 locale only. Other locales could have their own state machines or just fall back to a generic implementation.


I'm wondering if you could easily build the state machine quickly at startup time. Unless it is a multi-byte encoding, it should be trivial.
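
For single-byte locales, a sketch of what that startup step could look like (using the locale-aware isspace() as ground truth; just an illustration, not what the repo does):

    #include <ctype.h>
    #include <locale.h>

    /* Same classes as the ASCII example: 0 = word char, 1 = space, 2 = newline. */
    static unsigned char charclass[256];

    void build_charclass(void) {
        setlocale(LC_CTYPE, "");           /* pick up the user's locale */
        for (int c = 0; c < 256; c++)
            charclass[c] = (c == '\n') ? 2 : (isspace(c) ? 1 : 0);
    }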


I love state machines, and every time I use one my coworkers think I invented it because they've never seen them before.

The data for the state machine in this article might be best prepared by generating it from a program. That generator program doesn't need to care (too much) about performance since it is run during the build process. I like the idea of doing a lot of work now to save time in the future.


I agree, and I raised an eyebrow at the "The algorithm is known as an "asynchronous state-machine parser". It's a technique for parsing that you don't learn in college."

I certainly learned about state machines in college. Not sure we ever studied this particular algorithm but the concept was definitely covered.

And I also agree that when you are dealing with a known, finite set of values, pre-computing lookup or index tables for techniques like this can be a big win.


I'm a little surprised to hear this. Writing a state machine to simulate an alarm clock used to be an introductory project for first-year computer science undergraduates.


Do you have any accessible references on state machines? I’ve just read about them a bit but never coded one in a project. Seems like a useful concept for testable, reliable code.


The video Finite State Machines explained by Abelardo Pardo[1] seems like a good introduction (I'm not familiar with the author; I just searched Finite State machine on youtube and found the first result which wasn't a jumbled mess of abstract or misused jargon).

It may seem simple, but that's truly all there is to finite state machines: a set of finite states and a set of events which cause transitions between states. Their specific applications to various domains are a much larger topic, but with a basic understanding, any application should be clear whenever it comes up. There's no one way to implement a state machine, and you've probably used them all the time, even if you haven't thought about them formally. For example, a simple but common 2-state machine takes the form:

   running = true
   while running:
       …
       if <some condition>:
           running = false
[1]: https://www.youtube.com/watch?v=hJIST1cEf6A


It could be written as:

    while True:
        ...
        if <exit condition>:
            break


A finite state machine has a set of input symbols, a set of states, and a transition function, which maps the current state and current input to the next state (and, when programming it, it doesn't have to be a literal function; for example, it could be a 2D array). It also has a start state, and a set of accept states.

It's also common for state machines to have some sort of output; otherwise, all an FSM can do is recognize whether an input string is "good." So, there's also an output function, so that the FSM actually does something while processing the input string. There are two types of FSMs: a Mealy Machine, and a Moore machine. In a Mealy machine, the output function maps the current state and the current input to an output; so, one state could be associated with multiple outputs. In a Moore machine, the output is purely a function of the current state, no matter the input. Usually, a Moore machine has more states than the equivalent Mealy machine. In practice, Moore machines are a lot rarer than Mealy machines.

You can model FSMs in a variety of ways. You could use the state pattern [0]. You could have a table-based state machine; examples in C here [1] [2]. It's also common to model them using a loop and a switch statement with integers to represent the current state - or just use "goto" to go to the next state, if it's available in your language.

Also, this is tangential, but something nice about coroutines is that they can save you from writing an explicit state machine, as the state is implicit in your code. One example of this is how it's easier to write iterators as coroutines that yield results as they run, rather than as structures that update themselves based on their current state, transition to the next state, and yield a result.

So far, what I've covered is basically a deterministic finite automaton (DFA). There are also non-deterministic finite automata (NDFAs), which are used to implement regular expressions (at least when regular expressions aren't extended to require more powerful recognizers) [3], as well as pushdown automata (PDAs), which are basically finite automata with a stack, and can parse e.g. programming languages. Finally, there are Turing machines.

For theory on that, I'd recommend reading Sipser's Introduction to the Theory of Computation. If you look up the book on Google, it's not difficult to find a PDF of it.

[0]: https://refactoring.guru/design-patterns/state

[1]: https://nullprogram.com/blog/2020/12/31

[2]: https://ideone.com/94U6aJ (this one removes C-style comments from source code - I didn't make it)

[3]: https://swtch.com/~rsc/regexp/regexp1.html


What about this is "asynchronous"?


I don't get that either.

It feels like they compare the asynchronous-ness of their algorithm to Nginx being asynchronous ("asynchronous web-servers like Nginx use a state-machine parser. They parse the bytes as they arrive, and discard them."), but I don't see how that relates. The way web-servers handle requests (multiple threads vs multiple processes vs asynchronous event-driven in one thread) is completely orthogonal to how they parse headers.

The impression I have is that their algorithm works in a streaming way, without having to allocate memory buffers, and that they call that asynchronous (wrongly, as far as I can see).

I also don't really see what they mean by their algorithm being more scalable.


My understanding was that scalability came from the removal of memory buffers.


I guess since state is fully encoded inside a single state variable and of course the output counts, it would be trivially simple to switch between multiple input streams and incrementally count words in each of them as new data arrives.


Recognising that wc is a lexer problem and implementing it using re2c would be my first choice here. That'll be building a similar state machine to what is done by hand here, especially since the hand rolled one is still byte at a time (the main drawback to re2c's generated output).


Reading the code to help me understand how things work:

I got the wc_lines from coreutils: https://github.com/coreutils/coreutils/blob/master/src/wc.c#...

And I thought "damn, I understand nothing about the state-machine stuff, how did they made this faster ?"

Truth is: they did not. Of course, this is just the "count line" part. Other parts are indeed faster.

Coreutils 9.4:

  0.67 [jack:/tmp] /usr/bin/time -v wc -l debian-12.2.0-amd64-netinst.iso 
  2524855 debian-12.2.0-amd64-netinst.iso
   Command being timed: "wc -l debian-12.2.0-amd64-netinst.iso"
  User time (seconds): 0.00
  System time (seconds): 0.07
  Percent of CPU this job got: 98%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
  Average shared text size (kbytes): 0
  Average unshared data size (kbytes): 0
  Average stack size (kbytes): 0
  Average total size (kbytes): 0
  Maximum resident set size (kbytes): 2176
  Average resident set size (kbytes): 0
  Major (requiring I/O) page faults: 0
  Minor (reclaiming a frame) page faults: 99
  Voluntary context switches: 1
  Involuntary context switches: 0
  Swaps: 0
  File system inputs: 0
  File system outputs: 0
  Socket messages sent: 0
  Socket messages received: 0
  Signals delivered: 0
  Page size (bytes): 4096
  Exit status: 0
And wc2 (from gcc -O3 -o wc2 wc2.c):

  0.47 [jack:/tmp] /usr/bin/time -v ./wc2 -l debian-12.2.0-amd64-netinst.iso 
  2524855 debian-12.2.0-amd64-netinst.iso
 Command being timed: "./wc2 -l debian-12.2.0-amd64- netinst.iso"
 User time (seconds): 0.97
 System time (seconds): 0.05
 Percent of CPU this job got: 99%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.03
 Average shared text size (kbytes): 0
 Average unshared data size (kbytes): 0
 Average stack size (kbytes): 0
 Average total size (kbytes): 0
 Maximum resident set size (kbytes): 2176
 Average resident set size (kbytes): 0
 Major (requiring I/O) page faults: 0
 Minor (reclaiming a frame) page faults: 146
 Voluntary context switches: 1
 Involuntary context switches: 1
 Swaps: 0
 File system inputs: 0
 File system outputs: 0
 Socket messages sent: 0
 Socket messages received: 0
 Signals delivered: 0
 Page size (bytes): 4096
 Exit status: 0
Gotta keep reading


From looking at the code, the clever bit is how this is doing word count (and word & line count at the same time). So yeah, it makes sense that -l isn't better.


Any notable difference if you pipe the file in, rather than read from disk?

IO caches are a thing as well, don’t use the first measurement.


I've accounted for IO cache, the test is running from an in-memory dataset

After checking a bit more, wc2 is indeed faster than wc in a specific use case: counting characters. That is, transforming a sequence of one or multiple bytes into meaningful character (wc --chars somefile).

In wc, the related code is here: https://github.com/coreutils/coreutils/blob/master/src/wc.c#...

So in the end, wc uses the function while wc2 does not: https://en.cppreference.com/w/c/string/multibyte/mbrtoc32

Now, what I cannot tell: is wc2's code really comparable to mbrtoc32? From a couple of tests, it works against ASCII and UTF-8 datasets, but is that all? Or are there edge cases cleverly handled by mbrtoc32 (with a performance cost)?


From reading the article, how is this more efficient? Doesn't any word counting algorithm have to iterate through all the characters and count spaces?

What makes this better than the standard algorithm of

    wc = 0
    prev = ' '   # treat the start of input as whitespace
    for ch in charStream:
        if not isSpace(ch) and isSpace(prev):
            wc += 1
        prev = ch


Basically, it's just a lookup table for isSpace, rather than whatever logic is in the original function (which probably has conditional branches).

There's a little bit of a complication in the fact that a state machine can implicitly share parts of the computation for utf-8 encoded codepoints with shared prefixes, so instead of a 2^32-element lookup table, you only need 4 2^8-element lookup tables (and instead of 1 state, parsing codepoint, you have 4, parsing first byte, parsing second byte, etc.).


Ah. Is 8 bits really optimal? I don't know how many UTF space characters there are, I thought there were only a few. Why not a 16-bit lookup?


UTF-8 encoded codepoints can have an odd number of bytes, so processing a file 2 bytes at a time would be a little more complicated. Processing the file one codepoint at a time works because you can decode the UTF-8 stream into a stream of 32-bit codepoints before passing the codepoints to the lookup table. I suppose you could also transform from UTF-8 to UTF-16 before passing values to the lookup table.

Processing byte by byte isn't necessarily faster than processing codepoint by codepoint, or any other size. You'd need to measure performance empirically, and it probably depends on caches sizes and other factors. In theory, you could also process bit by bit — then you'd only need 32 2-element lookup tables — but that's unlikely to be efficient, since you'd need to do a lot of bit manipulation.

Edit: Upon inspection, the method I described doesn't appear to be the method used by the featured program. It still basically uses a lookup table for detecting space characters, byte by byte, but the states are not how I described. Instead of the states representing which byte of a UTF-8 encoded codepoint is being processed, and the word count being incremented upon certain transitions — a Mealy machine — the states represent the class of the codepoint last fully processed, and the count is always increased based on the current state (often by zero) — a Moore machine.


It's funny that you call that the standard algorithm, because, although it intuitively makes sense, this is the first time I've seen that word-counting algorithm.

The first implementation of wc I ever saw was the one from The C Programming Language, which has this written at the beginning of the book (except instead of "(isspace(c))" they wrote "(c == ' ' || c == '\n' || c == '\t')"):

  state = OUT;
  nl = nw = nc = 0;
  while ((c = getchar()) != EOF) {
      ++nc;
      if (c == '\n')
          ++nl;
      if (isspace(c))
          state = OUT;
      else if (state == OUT) {
          state = IN;
          ++nw;
      }
  }


You have answered your question yourself: your algorithm looks at each byte twice, not once.

It's even more obvious in the UTF-8 case, where the classic implementation first looks at 1-4 bytes to parse a character and only then checks if it's a space.


Remember "The Zen of Code Optimization" by Michael Abrash in the 90s?

This word count challenge was the first one he described. The table-driven approach was discussed, but was not necessarily the fastest, due to the large table needed overflowing the cache (on 90s processors).

I used the same example in a chapter I wrote for a C++ book.


I was a bit surprised by the macos/linux speed differential here:

> The difference in macOS and Linux speed is actually the difference in clang and gcc speed. The LLVM clang compiler is doing better optimizations for x86 processors here.

I know GCC has been getting much, much more aggressive as of late, but there have been some complaints that it is now much, much less safe (e.g., checks intended for security getting optimized away).

I wonder, if you were to go back 5 years, whether you'd see that the Linux code was generally safer than the speedy macOS LLVM code...


I'd assume the difference between Linux's GNU and macOS's BSD implementations would be much more significant.


Exactly. macOS (really BSD) grep sucks: it's 5-10x slower than GNU grep for simple searches on the same hardware, but the macOS/BSD wc is fast.


"The wc program included with macOS and Linux are completely different. Therefore, the following table shows them benchmarked against each other on the same hardware."

Appears BSD wc ("the macOS program") is much faster than the GNU coreutils wc ("the Linux program").

It certainly illustrates a point about people rewriting wc in newer languages to demonstrate their alleged speed advantage over C.

Which wc are they using for comparison?

Perhaps they should use wc2 or wc2o.


That's a really good point. I'm using macOS and I tested the -w option against a CSV file that is about 430MB in size and has about 18,000,000 words in it. The time was 0.989 seconds. When I ran wc2 on the same file, it clocked in at 0.943 seconds.

I wondered what I was doing wrong and came back to the comments section and found your comment. Ah.

Edit: interestingly, the newline count time was 0.458 seconds with the macOS version and 0.943 seconds with wc2.


The target audience, Unix graybeards programming shell scripts using commands like wc, is extremely hard to convince to use new tools. If you don't integrate these optimizations into, say, the GNU Core Utils wc program, the impact will be nil.


I didn't know you were allowed to use a language other than Rust for something like this.


> This state machine approach always results in the same speed, regardless of input.

This is strange to me; shouldn't it be O(N) in the size of the input? Or maybe in this benchmark the inputs are small enough that the difference isn't measurable?


They meant that different characters don't change the performance, unlike the Linux and macOS versions, where different characters go through different code paths.

But yes, a bigger file does take longer to process.


He's talking in the context of same-sized files (counted in bytes).


Speed as in bytes per second, regardless of which bytes you give it.


In a bring-a-tank-to-a-knife-fight kind of way, could this be optimized to run on a GPU? Load the contents, do an "and" across the whole contents in parallel, and then sum the whitespaces?


These benchmarks are on 92 million byte files so we're into the range where bringing a tank is fair (and worth the startup cost).


I doubt you can make it faster on the GPU than on a CPU utilizing SIMD, the reason being that the work done per byte is close to trivial. You'd be transferring the data from CPU memory to GPU memory in order to do almost nothing with it.


I've got it working on a T4 via Google Colab. The PDF takes 178 milliseconds versus the 206 listed in the readme for the C version, so about 15% faster?

https://github.com/fragmede/wc-gpu/blob/main/wc_gpu.ipynb


It's only at a limit like that if you don't parallelize. And sure you could use more cores, but you can go a lot faster on 20% of a GPU than on 20% of your CPU cores.


I got nerd sniped into doing it. https://github.com/fragmede/wc-gpu


It would have been nice for the author to label the states and explain how and why they chose the states that they did. Otherwise, it's hard to understand how to apply this sort of thing to larger problems.


How would it have helped? You either design a DFA by hand or use one compiled from a regular language.


"Many programmers think pointer-arithmetic is faster." Don't modern compilers make this statement false (i.e., both approaches are implemented via the same machine instructions)?


They definitely do in simple cases (e.g., a simple for loop). In more complex cases, especially when the index/pointer escapes the function, the code will be different.
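
For example (illustrative functions, not from the repo), with optimizations enabled a modern compiler will typically emit essentially the same machine code for both of these:

    #include <stddef.h>

    size_t count_newlines_index(const char *buf, size_t len)
    {
        size_t n = 0;
        for (size_t i = 0; i < len; i++)          /* array indexing */
            if (buf[i] == '\n')
                n++;
        return n;
    }

    size_t count_newlines_pointer(const char *buf, size_t len)
    {
        size_t n = 0;
        const char *end = buf + len;
        for (const char *p = buf; p != end; p++)  /* pointer arithmetic */
            if (*p == '\n')
                n++;
        return n;
    }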


I checked using godbolt, and when compiling with -O2 (like his makefile) only the parse_chunk_pp version is in the emitted assembly and is called regardless of the -P option. I think compilers have done this where applicable for a long time.

Compiling without any arguments leads to two lines differing in the assembly for the functions.


Something funny is that GCC may compile your code that uses pointer arithmetic into one that uses an index register, and compile your code that uses an index into an array as pointer arithmetic.

https://godbolt.org/z/9KPnnM7Po


Was this ever submitted as an update to the `wc` shipped on systems?


I doubt it - from the project's readme page:

"...Now the real wc works with a lot more character-sets, and we don't do that. But by implementing UTF-8, we've shown that it's possible, and that the speed for any character-set is the same."

Looking at the command line options for wc2.c and comparing to the FSF wc man page, it looks like there are some missing/unsupported options in wc2.c as well.


> ...the real wc works with a lot more character-sets...

This should read: the real wc works with a lot more character _encodings_



Good to see effort to speed up wc; it's been taking way too many cycles away from my 1.5GiB Electron chat program.


Isn't Fedora working on improving some GNU tools?


From a previous project (https://nunosempere.com/blog/2023/09/15/wc/) that looked not at speed but at simplicity, here is a comparison of different historical versions of wc:

---

The [version of wc.c](https://git.nunosempere.com/personal/wc/src/branch/master/sr...) in this repository sits at 44 lines. It decides to read from stdin if the number of arguments fed to it is zero, and uses the standard C function getc to read character by character. It doesn't have flags; instead, there are further utilities in the src/extra/ folder for counting characters and lines, sitting at 32 and 35 lines of code, respectively. This version also has little error checking.

[Here](https://github.com/dspinellis/unix-history-repo/blob/Researc...) is a version of wc from UNIX V7, at 86 lines. It allows for counting characters, words and lines. I couldn't find a version in UNIX V6, so I'm guessing this is one of the earliest versions of this program. It decides to read from stdin if the number of arguments fed to it is zero, and reads character by character using the standard C getc function.

The busybox version ([git.busybox.net](https://git.busybox.net/busybox/tree/coreutils/wc.c)) of wc sits at 257 lines (162 with comments stripped), while striving to be [POSIX-compliant](https://pubs.opengroup.org/onlinepubs/9699919799/), meaning it has a fair number of flags and a bit of complexity. It reads character by character by using the standard getc function, and decides to read from stdin or not using its own fopen_or_warn_stdin function. It uses two GOTOs to get around, and has some incomplete Unicode support.

The [plan9](https://9p.io/sources/plan9/sys/src/cmd/wc.c) version implements some sort of table method in 331 lines. It uses plan9 rather than Unix libraries and methods, and seems to read from stdin if the number of args is 0.

The plan9port version of wc ([github](https://github.com/9fans/plan9port/blob/master/src/cmd/wc.c)) also implements some sort of table method, in 352 lines. It reads from stdin if the number of args is 0, and uses the Linux read function to read character by character.

The [OpenBSD](https://github.com/openbsd/src/blob/master/usr.bin/wc/wc.c) version is just *nice*. It reads from stdin by default, and uses a bit of buffering using read to speed things up. It defaults to using fstat when counting characters. It is generally pleasantly understandable, nice to read. I'm actually surprised at how pleasant it is to read.

The [FreeBSD version](https://cgit.freebsd.org/src/tree/usr.bin/wc/wc.c) sits at 367 lines. It has enough new things that I can't parse all that it's doing (in lines 137-143, what is capabilities mode? what is Casper?), but otherwise it decides whether to read from stdin by the number of arguments, in line 157. It uses a combination of fstat and read, depending on the type of file.

Finally, the GNU utils version ([github](https://github.com/coreutils/coreutils/tree/master/src/wc.c), [savannah](http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f...)) is a bit over 1K lines of C. It does many things and checks many possible failure modes. I think it detects whether it should be reading from stdin using some very wrapped fstat, and it reads character by character using its own custom function.

So this utility started out reasonably small, then started getting more and more complex. [The POSIX committee](https://pubs.opengroup.org/onlinepubs/9699919799/) ended up codifying that complexity, and now we are stuck with it because even implementations like busybox which strive to be quite small try to keep to POSIX.



while you are in there,

  wc --unicode
  wc --bom


Now we need another round of the language battle, because this approach needs to be implemented in Rust, Go, Zig, Nim ... you name it!


I wanted to take a peek at the state diagram (ASCII version). But I was feeling lazy, so apologies for having ChatGPT do the work for me. I wasn't going for correctness and didn't check it. If you're like me, here you go:

https://dreampuf.github.io/GraphvizOnline/#digraph%20StateMa...

    digraph StateMachine {
        rankdir=LR;
        size="8,5";
        node [shape = circle];

        LookingForWord -> InsideWord [label="Non-space"];
        LookingForWord -> LookingForWord [label="Space"];
        LookingForWord -> Newline [label="Newline"];
        LookingForWord -> LookingForWord [label="Other"];

        Newline -> InsideWord [label="Non-space"];
        Newline -> LookingForWord [label="Space"];
        Newline -> Newline [label="Newline"];
        Newline -> LookingForWord [label="Other"];

        InsideWord -> ContinueWord [label="Non-space"];
        InsideWord -> LookingForWord [label="Space"];
        InsideWord -> Newline [label="Newline"];
        InsideWord -> LookingForWord [label="Other"];

        ContinueWord -> ContinueWord [label="Non-space"];
        ContinueWord -> LookingForWord [label="Space"];
        ContinueWord -> Newline [label="Newline"];
        ContinueWord -> LookingForWord [label="Other"];
    }

EDIT: Under peer pressure, I checked it and it correctly reflects the code apart from being designed for one specific line ending sequence (as it should, being an example optimized for brevity).

As for the replies, as opposed to my approach, I'm sure when you're browsing research papers, you're doing a full reproducibility study for each one. I'm sorry I commented, I should have waited for you to post your results.


Surely correctness is the most important thing? Anyone can produce an incorrect state diagram.


This is a useless contribution to the discussion, maybe even negative... Why didn't you at least check it before sharing?


Why? Because it's wrong? If it were correct, wouldn't it be useful?


This is a bit like presenting someone with a raw egg on a plate and saying "ah, but if I had cooked it, might it not be a tasty omelet?" Correctness is the hard part!


More like a covered plate you're expected to eat without looking. "This may be an amazing omelet and you will have a great meal, or you may eat a raw egg, get salmonella, and be sick for a week. It's probably an omelet, but I'm not sure, so you'll probably be fine. Enjoy!"


Right, but are its labels of the states actually incorrect? I don't know the code well enough to check.



