From what I know about Julia, Rust and Julia handle strings similarly, i.e., strings are represented as UTF-8 internally, and they are required to be valid UTF-8.
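For the Rust side of that claim, here is a minimal sketch using only the standard library. The helper name `to_string_strict` is made up for illustration; the point is just that `String::from_utf8` refuses bytes that are not valid UTF-8, and that the error hands the original bytes back.

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper: build a `String` from raw bytes.
/// Returns an error if the bytes are not valid UTF-8.
fn to_string_strict(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Hypothetical helper: on failure, recover the untouched input bytes
/// from the error so nothing is lost.
fn bytes_back(bytes: Vec<u8>) -> Vec<u8> {
    match to_string_strict(bytes) {
        Ok(s) => s.into_bytes(),
        Err(e) => e.into_bytes(),
    }
}
```

So in Rust the validity check happens at construction time, and anything that fails it has to stay a `Vec<u8>` or `&[u8]`.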
I've also built strings in Rust that are only conventionally UTF-8, similar to Go. It's still an experiment though: https://docs.rs/bstr --- It turns out that conventionally UTF-8 strings can be quite useful in a lot of cases, since the real world often provides data without any guaranteed encoding (e.g., the contents of files).
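To give a rough idea of what "conventionally UTF-8" means in practice, here is a sketch using only the standard library. It is not bstr's API; `find_bytes` and `display_lossy` are hypothetical names, and the search is a naive one. The idea is that operations work directly on bytes and invalid sequences are only dealt with at the point where text is shown.

```rust
/// Hypothetical helper: find the first occurrence of `needle` in `haystack`
/// by comparing raw bytes. No UTF-8 validation is performed.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // `windows(0)` would panic, so treat the empty needle as matching at 0.
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Hypothetical helper: render bytes for display, substituting U+FFFD
/// for any invalid UTF-8 sequences.
fn display_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For example, searching `b"caf\xFF\xC3\xA9"` for the bytes of "é" still finds the match after the stray `0xFF` byte, even though the haystack as a whole could never become a `String`.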
Since 1.0, Julia strings are not required to be valid Unicode; they can hold arbitrary data. Moreover, you can round-trip arbitrary data from a file through strings, through chars, and back to disk, and you will get an identical file regardless of its content. The principle is this: a program should never error because of broken data, only because of programmer error.
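The round-trip property can be sketched in Rust terms as well. This is not how Julia implements it; it is only an analogy, assuming a made-up `Unit` type that is either a valid `char` or a run of invalid bytes. Decoding never fails, and re-encoding reproduces the input byte for byte.

```rust
/// Hypothetical decoded unit: a valid Unicode scalar value, or a run of
/// bytes that is not valid UTF-8 (kept verbatim).
#[derive(Debug, PartialEq)]
enum Unit {
    Char(char),
    Invalid(Vec<u8>),
}

/// Decode arbitrary bytes into units without ever returning an error.
fn decode(mut input: &[u8]) -> Vec<Unit> {
    let mut units = Vec::new();
    while !input.is_empty() {
        match std::str::from_utf8(input) {
            Ok(s) => {
                units.extend(s.chars().map(Unit::Char));
                break;
            }
            Err(e) => {
                let (valid, rest) = input.split_at(e.valid_up_to());
                // The prefix was reported as valid, so this cannot fail.
                let prefix = std::str::from_utf8(valid).unwrap();
                units.extend(prefix.chars().map(Unit::Char));
                // `error_len()` is `None` when the input ends mid-sequence;
                // in that case the whole remainder is the invalid run.
                let bad_len = e.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                units.push(Unit::Invalid(bad.to_vec()));
                input = tail;
            }
        }
    }
    units
}

/// Re-encode units to bytes; invalid runs are written back unchanged.
fn encode(units: &[Unit]) -> Vec<u8> {
    let mut out = Vec::new();
    for unit in units {
        match unit {
            Unit::Char(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            Unit::Invalid(bytes) => out.extend_from_slice(bytes),
        }
    }
    out
}
```

With this shape, `encode(&decode(bytes))` gives back the same bytes for any input, which is the behavior described above: broken data is carried through rather than rejected.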
Yeah, the PCRE situation is a bit unfortunate. Avoiding crashes on invalid data would be the minimum, and hopefully PCRE does that officially soon. To really make things work well, we would have to patch PCRE to handle what Julia considers invalid characters to be, which is doable, but it may be better to just reimplement regex functionality in Julia, which is non-trivial, but that way we naturally get correct treatment of invalid data, JIT, and support for other string types.
Indeed. In all likelihood, it's a CVE waiting to happen.
> we would have to patch PCRE to handle what Julia considers invalid characters to be
Sorry, did you see my links in the previous comment? This is already available in the JIT engine for PCRE 10.33, and appears to be making its way into the standard interpreter as well. So long as both Julia and PCRE implement UTF-8 correctly, both should be on the same page with respect to invalid UTF-8 byte sequences.
> but it may be better to just reimplement regex functionality in Julia, which is non-trivial, but that way we naturally get correct treatment of invalid data, JIT, and support for other string types.
Yup, this is what I did for Rust's regex engine, which can work on both completely valid UTF-8 and arbitrary byte sequences. But it is a ton of work. I'd get as much mileage out of PCRE2 as I could.