AFAIK about Julia, Rust and Julia handle strings similarly. i.e., Strings are re...

StefanKarpinski · on May 30, 2019

Julia strings have not been required to be valid Unicode since 1.0; they can hold arbitrary data. Moreover, you can round-trip arbitrary data from a file through strings, through chars, then back to disk, and you will get an identical file regardless of its content. The principle is this: a program should never error because of broken data, only because of programmer error.
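For contrast with the Rust side of the comparison, here is a minimal sketch using only Rust's standard library. The helper names `as_text` and `lossy_round_trip` are hypothetical, chosen for illustration. It shows that a Rust `str` must be valid UTF-8, so arbitrary bytes are either rejected or altered by a lossy conversion rather than round-tripped unchanged.

```rust
/// Try to view raw bytes as a Rust `&str`.
/// Returns `None` when the bytes are not valid UTF-8.
fn as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// Push bytes through a lossy `String` conversion and back to bytes.
/// Each invalid sequence is replaced with U+FFFD, so the output can
/// differ from the input when the input is not valid UTF-8.
fn lossy_round_trip(bytes: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(bytes).into_owned().into_bytes()
}
```

Code that needs a byte-exact round trip in Rust would typically keep the data as `Vec<u8>` or `&[u8]` instead of `String`.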

burntsushi · on May 30, 2019

Oh interesting! TIL. I think that probably means that this is UB then: https://github.com/JuliaLang/julia/blob/d8ff21c69c118e8801e8... --- You can't enable NO_UTF_CHECK in PCRE if you're going to pass data that isn't valid UTF-8.

N.B. As of PCRE 10.33, you can enable the PCRE2_JIT_INVALID_UTF check for JIT matching instead.

It looks like this is coming to the standard interpreter as well: https://lists.exim.org/lurker/message/20190524.173112.0d226a...

StefanKarpinski · on May 30, 2019

Yeah, the PCRE situation is a bit unfortunate. Avoiding crashes on invalid data would be the minimum, and hopefully PCRE does that officially soon. To really make things work well, we would have to patch PCRE to handle what Julia considers invalid characters to be, which is doable. But it may be better to just reimplement regex functionality in Julia. That is non-trivial, but that way we naturally get correct treatment of invalid data, JIT, and support for other string types.

burntsushi · on May 30, 2019

> Yeah the PCRE situation is a bit unfortunate.

Indeed. In all likelihood, it's a CVE waiting to happen.

> we would have to patch PCRE to handle what Julia considers invalid characters to be

Sorry, did you see my links in the previous comment? This is already available in the JIT engine for PCRE 10.33, and appears to be making its way into the standard interpreter as well. So long as both Julia and PCRE implement UTF-8 correctly, both should be on the same page with respect to invalid UTF-8 byte sequences.
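To make "invalid UTF-8 byte sequences" concrete, here is a rough sketch of how one implementation, Rust's standard library, reports them. The `Chunk` type and `split_utf8` function are hypothetical names for illustration; only `std::str::from_utf8`, `Utf8Error::valid_up_to`, and `Utf8Error::error_len` come from std. Whether PCRE2 or Julia draw the boundaries of an invalid sequence in exactly the same places is not something this sketch verifies.

```rust
/// A run of input that is either valid UTF-8 text or one invalid sequence.
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    Valid(&'a str),
    Invalid(&'a [u8]),
}

/// Split arbitrary bytes into valid UTF-8 runs and invalid sequences,
/// using the error positions reported by `std::str::from_utf8`.
fn split_utf8(mut input: &[u8]) -> Vec<Chunk<'_>> {
    let mut out = Vec::new();
    while !input.is_empty() {
        match std::str::from_utf8(input) {
            Ok(text) => {
                // Everything that remains is valid.
                out.push(Chunk::Valid(text));
                break;
            }
            Err(err) => {
                let (valid, rest) = input.split_at(err.valid_up_to());
                if !valid.is_empty() {
                    // This prefix was already validated by the failed call.
                    out.push(Chunk::Valid(std::str::from_utf8(valid).unwrap()));
                }
                // `error_len()` is `None` when the input ends in the middle
                // of a sequence; treat the whole remainder as invalid then.
                let bad_len = err.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                out.push(Chunk::Invalid(bad));
                input = tail;
            }
        }
    }
    out
}
```

For example, the bytes `a`, `0xFF`, `b` split into a valid `"a"`, an invalid one-byte sequence, and a valid `"b"`.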

> but it may be better to just reimplement regex functionality in Julia, which is non-trivial, but that way we naturally get correct treatment of invalid data, JIT, and support for other string types.

Yup, this is what I did for Rust, which can work on both completely valid UTF-8 and arbitrary byte sequences. But it is a ton of work. I'd get as much mileage out of PCRE2 as I could.