
Somewhat related tip: prefix console commands such as grep with LANG=C to speed them up on large files. They will then assume ASCII input, which is probably what you have in most cases.
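A minimal sketch of the tip, in case it helps. The wrapper name `cgrep` is made up for this example, and it uses LC_ALL rather than LANG because LC_ALL takes precedence over LANG and the individual LC_* variables, so a plain LANG=C prefix has no effect if LC_ALL is already set in your environment.

```shell
# Hypothetical wrapper (the name cgrep is invented for this example):
# run grep with LC_ALL=C so it treats input as single-byte characters.
cgrep() {
    LC_ALL=C grep "$@"
}

# The prefix applies to the grep process only; the shell's own locale is unchanged.
hits=$(printf 'alpha\nbeta\nalphabet\n' | cgrep -c 'alpha')
echo "$hits"   # 2
```

The same prefix works ad hoc on any single command, e.g. `LC_ALL=C grep pattern file`, without touching the rest of the session.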


If you care about speed you would probably be using ripgrep rather than grep anyway, but doesn’t `LANG=en_US.UTF-8` give similar speed on modern systems, without any compromise on consistency of sort ordering, etc., and with support for extended characters?


For GNU grep in particular, no, using a UTF-8 locale can significantly slow it down:

    $ time LC_ALL=C grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    3
    
    real    0.808
    user    0.744
    sys     0.063
    maxmem  10 MB
    faults  0
    
    $ time LC_ALL=en_US.UTF-8 grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    4
    
    real    20.064
    user    19.982
    sys     0.077
    maxmem  10 MB
    faults  0

Whereas ripgrep is Unicode-aware by default, and still about as fast as the ASCII-only GNU grep run above:

    $ time rg '^\w{30}$' OpenSubtitles2018.raw.sample.en -c 
    4
    
    real    1.163
    user    1.132
    sys     0.030
    maxmem  916 MB
    faults  0


For grep, how much of the difference is due to '\w' having a different meaning between the two cases?


That's exactly the point. ripgrep uses the Unicode definition of '\w' by default, so it corresponds to what GNU grep does in the en_US.UTF-8 locale. That is why both report 4 matches where the LC_ALL=C run reports 3.
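To see the change in meaning on a tiny input, here is a rough sketch. It assumes GNU grep (for `\w`) and that `locale -a` lists at least one UTF-8 locale; the locale name is looked up at run time because it varies by system, and the sample word is my own, not from the subtitle file above.

```shell
# 'café' encoded as UTF-8: 4 characters but 5 bytes (é is the bytes 0xC3 0xA9).
sample=$(printf 'caf\303\251')

# C locale: each byte is one character, and bytes >= 0x80 are not word characters,
# so a line of exactly four word characters is not found.
c_count=$(printf '%s\n' "$sample" | LC_ALL=C grep -c -E '^\w{4}$' || true)
echo "C locale matches: $c_count"   # 0

# Look up any installed UTF-8 locale (assumption: one exists; skipped otherwise).
utf8_locale=$(locale -a 2>/dev/null | grep -i -E 'utf-?8' | head -n 1 || true)
if [ -n "$utf8_locale" ]; then
    # Here é is decoded as a single character, so the count may differ from above.
    u_count=$(printf '%s\n' "$sample" | LC_ALL="$utf8_locale" grep -c -E '^\w{4}$' || true)
    echo "$utf8_locale matches: $u_count"
fi
```

So the two grep timings above are not measuring the same regex: the UTF-8 run has to decode multibyte characters and test them against a much larger word-character class.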


Also set it for consistent ordering (collation) across sort, join, tsort, look, etc.
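A small sketch of why the collation point matters, using made-up four-line input. Tools such as join expect their input sorted in the same collation order they use themselves, so pinning one locale for the whole pipeline avoids mismatches.

```shell
# Byte-order collation under the C locale: uppercase ASCII sorts before lowercase.
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | paste -sd' ' -)
echo "$c_sorted"   # A B a b

# Same input under whatever locale the session currently has;
# the order printed here depends on your environment.
printf 'b\nA\na\nB\n' | sort | paste -sd' ' -
```

If the second line prints a different order from the first, any sort | join pipeline that mixes the two settings can silently drop or misreport matches.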



