Somewhat related tip: prepend LANG=C to many console commands such as grep to sp...

seanhunter · on June 7, 2023

If you care about speed you would probably be using ripgrep rather than grep anyway, but doesn’t `LANG=en_US.UTF-8` give similar speed on modern systems, without compromising on consistent sort ordering or support for extended characters?

burntsushi · on June 7, 2023

For GNU grep in particular, no, using a UTF-8 locale can significantly slow it down:

    $ time LC_ALL=C grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    3
    
    real    0.808
    user    0.744
    sys     0.063
    maxmem  10 MB
    faults  0
    
    $ time LC_ALL=en_US.UTF-8 grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    4
    
    real    20.064
    user    19.982
    sys     0.077
    maxmem  10 MB
    faults  0

Whereas ripgrep is Unicode-aware by default, and still about as fast as the ASCII-only variant of GNU grep above:

    $ time rg '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    4
    
    real    1.163
    user    1.132
    sys     0.030
    maxmem  916 MB
    faults  0

kps · on June 7, 2023

For grep, how much of the difference is due to '\w' having a different meaning between the two cases?

burntsushi · on June 8, 2023

That's exactly the point. ripgrep uses the Unicode definition of `\w` by default, so it corresponds to what GNU grep does in the en_US.UTF-8 locale.
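
A minimal sketch of how the meaning of `\w` shifts with the locale, assuming GNU grep (where `\w` is an extension) and a made-up two-line input rather than the subtitle sample above. The variable names are only for illustration, and the second command assumes the `en_US.UTF-8` locale is installed.

```shell
# Two sample lines: "café" (the é is the two UTF-8 bytes \303\251) and "cafe".
sample=$(printf 'caf\303\251\ncafe')

# C locale: \w matches only [A-Za-z0-9_] and every byte is its own character,
# so "café" is five bytes, two of them non-word, and fails ^\w{4}$.
ascii_count=$(printf '%s\n' "$sample" | LC_ALL=C grep -c -E '^\w{4}$')
echo "C locale: $ascii_count"

# UTF-8 locale: é is decoded as a single letter, so both lines should match.
# Assumption: en_US.UTF-8 is installed; if it is not, grep behaves as in C.
utf8_count=$(printf '%s\n' "$sample" | LC_ALL=en_US.UTF-8 grep -c -E '^\w{4}$')
echo "UTF-8 locale: $utf8_count"
```

The C-locale count is 1 here. Any difference between the two counts comes from lines containing non-ASCII letters, which is consistent with the 3 versus 4 match counts in the timings above.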

emmelaich · on June 6, 2023

and set it for consistency of ordering (collation) between sort, join, tsort, look, etc.
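
As a rough illustration of the collation point, here is a sketch with a made-up four-line input (the variable names are hypothetical). Under `LC_ALL=C`, `sort` compares raw byte values, so the result does not depend on which locales happen to be installed.

```shell
# Four lines differing only in case, deliberately out of order.
words=$(printf 'b\nA\na\nB')

# The C locale compares raw byte values, so every uppercase ASCII letter
# (0x41-0x5A) sorts before every lowercase one (0x61-0x7A).
sorted_c=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf '%s\n' "$sorted_c"
# prints A, B, a, b, one per line
```

Tools such as `join` expect their input to be sorted in the collation order they themselves run under, so using the same `LC_ALL` value for both the `sort` and the `join` step keeps them in agreement.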