Thanks for the article. I’m curious what the author thinks of string indexing ...

burntsushi · on May 30, 2019

AFAIK, Rust and Julia handle strings similarly, i.e., strings are represented as UTF-8 internally, and they are required to be valid UTF-8.

I've also built strings in Rust that are only conventionally UTF-8, similar to Go. It's still an experiment though: https://docs.rs/bstr --- It turns out that conventionally UTF-8 strings can be quite useful in a lot of cases, since the real world often provides data without any guaranteed encoding (e.g., the contents of files).
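
To illustrate what "conventionally UTF-8" can mean in practice, here is a minimal sketch using only the standard library (not the bstr API). The helper names `split_valid_prefix` and `display_lossy` are hypothetical, made up for this example. The idea is that the data stays a plain byte slice, and UTF-8 is only assumed at the point where you look at it:

```rust
/// Hypothetical helper: returns the longest valid UTF-8 prefix of `bytes`
/// together with the remaining, not-yet-decoded bytes.
fn split_valid_prefix(bytes: &[u8]) -> (&str, &[u8]) {
    match std::str::from_utf8(bytes) {
        // The whole slice is valid UTF-8; nothing is left over.
        Ok(s) => (s, &bytes[bytes.len()..]),
        Err(e) => {
            // `valid_up_to` is the byte offset where validation stopped.
            let (valid, rest) = bytes.split_at(e.valid_up_to());
            // The prefix was just validated, so this second check cannot fail.
            (std::str::from_utf8(valid).unwrap(), rest)
        }
    }
}

/// Hypothetical helper: renders arbitrary bytes for display, replacing
/// each invalid sequence with U+FFFD instead of returning an error.
fn display_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For example, `split_valid_prefix(b"caf\xC3\xA9 \xFF!")` yields the prefix `"café "` and leaves the bytes `[0xFF, b'!']` untouched, so invalid input is data to be handled rather than a reason to fail up front.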

StefanKarpinski · on May 30, 2019

Julia strings have not been required to be valid Unicode since 1.0; they can hold arbitrary data. Moreover, you can round trip arbitrary data from a file through strings, through chars, then back to disk, and you will get an identical file regardless of its content. The principle is this: a program should never error because of broken data, only because of programmer error.

burntsushi · on May 30, 2019

Oh interesting! TIL. I think that probably means that this is UB then: https://github.com/JuliaLang/julia/blob/d8ff21c69c118e8801e8... --- You can't enable NO_UTF_CHECK in PCRE if you're going to pass data that isn't valid UTF-8.

N.B. As of PCRE 10.33, you can enable the PCRE2_JIT_INVALID_UTF check for JIT matching instead.

It looks like this is coming to the standard interpreter as well: https://lists.exim.org/lurker/message/20190524.173112.0d226a...

StefanKarpinski · on May 30, 2019

Yeah the PCRE situation is a bit unfortunate. Avoiding crashes on invalid data would be the minimum and hopefully PCRE does that officially soon. To really make things work well, we would have to patch PCRE to handle what Julia considers invalid characters to be, which is doable, but it may be better to just reimplement regex functionality in Julia, which is non-trivial, but that way we naturally get correct treatment of invalid data, JIT, and support for other string types.

burntsushi · on May 30, 2019

> Yeah the PCRE situation is a bit unfortunate.

Indeed. In all likelihood, it's a CVE waiting to happen.

> we would have to patch PCRE to handle what Julia considers invalid characters to be

Sorry, did you see my links in the previous comment? This is already available in the JIT engine for PCRE 10.33, and appears to be making its way into the standard interpreter as well. So long as both Julia and PCRE implement UTF-8 correctly, both should be on the same page with respect to invalid UTF-8 byte sequences.

> but it may be better to just reimplement regex functionality in Julia, which is non-trivial, but that way we naturally get correct treatment of invalid data, JIT, and support for other string types.

Yup, this is what I did for Rust, which can work on both completely valid UTF-8 and arbitrary byte sequences. But it is a ton of work. I'd get as much mileage out of PCRE2 as I could.

the_mitsuhiko · on May 30, 2019

From my personal experience, I think Rust's string system is hard to beat at the moment. It's pretty darn good from a usability point of view, and it also found a nice solution for working with UCS-2 Windows APIs by providing an OsStr type.

chrismorgan · on May 30, 2019

I’m glad that Rust strings aren’t indexable by integer, but I think that making them indexable by range (of UTF-8 code unit offsets) was an error. `foo[0..10]` should have been `foo.slice(0..10)` or similar instead.
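
A small sketch of why range indexing can surprise people: the range is in UTF-8 code units (bytes), and `&foo[a..b]` panics if either end falls inside a multi-byte character or past the end of the string. The standard library's `str::get` is the non-panicking counterpart. The wrapper name `checked_slice` is hypothetical, used only for illustration:

```rust
/// Hypothetical wrapper: slices `s` by byte offsets without panicking.
/// Returns `None` if the range is out of bounds or if either end does not
/// fall on a character boundary.
fn checked_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    // By contrast, `&s[start..end]` would panic in exactly the cases
    // where this returns `None`.
    s.get(start..end)
}
```

With `"héllo"` (6 bytes, since `é` takes two), `checked_slice("héllo", 0, 3)` gives `Some("hé")`, while `checked_slice("héllo", 0, 2)` gives `None` because byte offset 2 is in the middle of `é`.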

the_mitsuhiko · on May 30, 2019

It’s a bit of a footgun indeed but it’s quite handy in combination with the char index iterator.
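
A sketch of the combination being described, assuming "char index iterator" refers to `str::char_indices`: the iterator yields byte offsets that are guaranteed to sit on character boundaries, so slicing with them cannot panic. The function name `first_n_chars` is hypothetical:

```rust
/// Hypothetical helper: returns the first `n` chars (Unicode scalar values)
/// of `s`, or all of `s` if it has fewer than `n`.
fn first_n_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `idx` is the byte offset where the (n+1)-th char starts, which is
        // always a valid boundary, so this range index is safe.
        Some((idx, _)) => &s[..idx],
        // Fewer than `n + 1` chars: the whole string qualifies.
        None => s,
    }
}
```

So `first_n_chars("héllo", 2)` is `"hé"`, even though that corresponds to the byte range `0..3`.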

chrismorgan · on May 30, 2019

Sure, you do want to be able to index by code unit range, but it shouldn’t have been with the Index trait.

zucker42 · on May 30, 2019

There's no reason this couldn't be added now, though, right?

richardwhiuk · on May 30, 2019

The only annoyance is the occasional unwrap, when something is provably impossible, but the type system can't detect it.
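
One possible instance of this, sketched with a hypothetical `capitalize` function: after an explicit emptiness check the first character must exist, but the type of `chars().next()` is still `Option<char>`, so an `unwrap` is needed.

```rust
/// Hypothetical helper: uppercases the first character of `s`.
fn capitalize(s: &str) -> String {
    if s.is_empty() {
        return String::new();
    }
    // `s` is non-empty here, so `next()` must yield a char. The compiler
    // cannot see that invariant, hence the unwrap.
    let first = s.chars().next().unwrap();
    // `len_utf8` gives the byte length of `first`, which is a valid boundary.
    let rest = &s[first.len_utf8()..];
    // `to_uppercase` returns an iterator because one char may map to several.
    first.to_uppercase().chain(rest.chars()).collect()
}
```

Here `capitalize("élan")` returns `"Élan"`, and the `unwrap` can never fire.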

1f60c · on May 30, 2019

That’s what `unwrap` is for, though.