
While researching this comment I read some of the D library documentation and found what I think is probably a docbug at this URL:

https://dlang.org/phobos/std_utf.html#.byUTF

"Throws: UTFException if invalid UTF sequence and useReplacementDchar is set to UseReplacementDchar.yes"

My guess is that this is a mistake and it should instead say UseReplacementDchar.no, since it makes sense to throw an exception if you can't use U+FFFD here, rather than do both.

Anyway, in my view this is bad the same way the Billion Dollar Mistake is bad, and Rust made the right choice here. Arrays of stuff are great, but they aren't strings. Having to sprinkle "or maybe not" cases all over these libraries, because of course these might not really be strings, results in exception fatigue in your developers, which in turn results in lower quality software and more effort for the conscientious developers who stick it out.

D's strings are less stupid than C's (and thus some of the C++ strings) but they're still just arrays which may or may not actually be text.




Thanks for the bug report. I filed it for you: https://issues.dlang.org/show_bug.cgi?id=23405

Having string be a magic builtin type does not eliminate the problem of dealing with invalid UTF sequences.

Invalid UTF sequences are inherent to the Unicode design, and programmers are left on their own to deal with it. The options are:

1. ignore them

2. use the replacement char

3. throw an exception (or other error indication)

D enables the programmer to pick which they need, on a case by case basis.
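The three options above can be sketched as follows. This is Rust rather than D, and it is not how Phobos implements any of them; the function names are made up for illustration, and "ignore" is read here as "drop the bad bytes", which is only one possible interpretation. Only `std::str::from_utf8`, `Utf8Error::valid_up_to`, `Utf8Error::error_len` and `String::from_utf8_lossy` are standard library calls.

```rust
use std::borrow::Cow;
use std::str::{self, Utf8Error};

/// Option 1 (hypothetical helper): ignore invalid sequences by dropping them.
fn decode_skipping(mut input: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match str::from_utf8(input) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(e) => {
                // Everything before `valid_up_to()` is known to be valid UTF-8.
                let (valid, rest) = input.split_at(e.valid_up_to());
                out.push_str(str::from_utf8(valid).expect("prefix was already validated"));
                match e.error_len() {
                    // Skip the invalid bytes and keep decoding.
                    Some(n) => input = &rest[n..],
                    // Truncated sequence at the end of the input: nothing left.
                    None => return out,
                }
            }
        }
    }
}

/// Option 2 (hypothetical helper): substitute U+FFFD for each invalid sequence.
fn decode_replacing(input: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(input)
}

/// Option 3 (hypothetical helper): report an error instead of producing text.
fn decode_strict(input: &[u8]) -> Result<&str, Utf8Error> {
    str::from_utf8(input)
}
```

For the input `b"a\xFFb"`, the first function yields `"ab"`, the second yields `"a\u{FFFD}b"`, and the third returns an `Err`.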


> Thanks for the bug report. I filed it for you: https://issues.dlang.org/show_bug.cgi?id=23405

#23405 was resolved as fixed a week ago. It isn't fixed. I guess at least I didn't waste my time filing the bug.


The problem does need solving, but it only needs solving once. D's approach means the programmer needs to make this decision over and over again, everywhere they have an alleged "string". Or they must track somehow (by convention perhaps?) whether string A is or is not "really" a string.

If you have type safety, you can make the choice just once.

Rust's String::from_{utf8,utf16}_lossy turn valid UTF-8/16 sequences into strings, and "fix" invalid ones with U+FFFD.

Meanwhile String::from_{utf8,utf16} attempt the same, but return an Err instead of a replacement on failure, if that's what the programmer wants.
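A minimal sketch of those four constructors side by side. The two wrapper functions are hypothetical names used only to pair the results; the `String::from_*` calls themselves are standard library. One detail: `String::from_utf8_lossy` returns a `Cow<str>` rather than a `String`, so the sketch calls `into_owned()` on it.

```rust
use std::string::{FromUtf16Error, FromUtf8Error};

/// Hypothetical helper: decode the same bytes strictly and lossily.
fn decode_utf8_both(bytes: &[u8]) -> (Result<String, FromUtf8Error>, String) {
    // Strict: invalid input becomes an Err that still owns the original bytes.
    let strict = String::from_utf8(bytes.to_vec());
    // Lossy: invalid sequences become U+FFFD.
    let lossy = String::from_utf8_lossy(bytes).into_owned();
    (strict, lossy)
}

/// Hypothetical helper: the same pairing for UTF-16 code units.
fn decode_utf16_both(units: &[u16]) -> (Result<String, FromUtf16Error>, String) {
    (String::from_utf16(units), String::from_utf16_lossy(units))
}
```

Either way the caller ends up with a `String`, or with an explicit error, and nothing in between.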

Imagine if all D's numeric functions took the same attitude as its string functions, insisting on being passed arrays of bytes so that each function can parse those bytes, decide if this is actually a 16-bit unsigned integer (for example) and, if so, do what's expected, otherwise perhaps return an error. We'd spot right away that this was not a practical design.

D's choices here are conventional, but I've come to expect a lot more and so I'm disappointed when I can't have it.


I don't see the difference here. D offers the same options when processing a string.


That's surely the whole point: every D std.string function is also a string decoder, with varying features. But a suitably decoded "string" is still just the same type, whereas Rust has a distinct type for actual UTF-8 strings.


I think the point is that you run the Unicode validation once on your [u8] array, which gives you a &str (or a Cow<str>/String for the lossy variants). From then on, you know you have valid Unicode and don't need to keep checking.
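A small sketch of that "validate once" shape, with made-up function names. Only `parse_input` ever sees raw bytes; the other two functions accept `&str` and so have no invalid-UTF-8 case to handle.

```rust
use std::str::Utf8Error;

/// Boundary (hypothetical name): the only place that inspects raw bytes.
fn parse_input(raw: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(raw)
}

/// Downstream code takes `&str`; the type itself guarantees valid UTF-8.
fn count_code_points(text: &str) -> usize {
    text.chars().count()
}

/// Another downstream function with no error path for bad encoding.
fn shout(text: &str) -> String {
    text.to_uppercase()
}
```

A caller would handle the `Err` from `parse_input` once, then pass the resulting `&str` to as many functions as it likes.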


On the other hand, the sad reality is that even when you have a plethora of string types to accommodate reality, as Rust does, people will just not care, out of convenience. See how Rust build scripts communicate paths to cargo via stdout, and how most of them just use Path::display (or something similar or worse) to do that, which is lossy. Rustc itself doesn't handle paths correctly either. IIRC, all in all, it's basically impossible to compile Rust code from a non-UTF-8 path.
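To make the lossiness of `Path::display` concrete, here is a Unix-only sketch (it relies on `std::os::unix::ffi::OsStrExt` to build a path from arbitrary bytes). The helper names are hypothetical, and this says nothing about what cargo or rustc actually do internally.

```rust
// Unix-only: OsStrExt::from_bytes is not available on other platforms.
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

/// Hypothetical helper: what printing a path with `display()` produces.
fn display_lossy(raw: &[u8]) -> String {
    Path::new(OsStr::from_bytes(raw)).display().to_string()
}

/// Hypothetical helper: the exact text of the path, if it is valid UTF-8.
fn exact_utf8(raw: &[u8]) -> Option<String> {
    Path::new(OsStr::from_bytes(raw)).to_str().map(str::to_owned)
}
```

Two different non-UTF-8 paths, such as `b"\xFF"` and `b"\xFE"`, both display as U+FFFD, so the original bytes cannot be recovered from the printed form, while `to_str()` returns `None` and forces the caller to notice.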


D's string is not text by itself because it is an array of UTF-8 code units. However, we have this infamous feature called auto-decoding in the standard library that presents strings as Unicode code points.

On the other hand, D's dstrings are more like text because they are not only UTF-32 but also random-accessible code points. (D does not address multiple representations of graphemes at the language level. For example, at the language level, ğ is different from "g and combining breve", but there are std.uni and std.utf modules that help.)
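The same grapheme caveat can be shown in Rust, used here only as an illustration and not as a statement about D's implementation: the precomposed ğ (U+011F) and "g" followed by a combining breve (U+0306) are different code point sequences and compare unequal. `code_points` is a hypothetical helper; Rust's standard library does no normalization, so treating the two as equal would need a separate Unicode normalization step.

```rust
/// Hypothetical helper: list the code points of a string as integers.
fn code_points(text: &str) -> Vec<u32> {
    text.chars().map(|c| c as u32).collect()
}

/// Hypothetical helper: plain string equality, with no normalization applied.
fn same_code_points(a: &str, b: &str) -> bool {
    a == b
}
```

Here `code_points("\u{11F}")` is a single value, `code_points("g\u{306}")` is two, and `same_code_points` returns `false` for the pair even though both render as ğ.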


> D's string is not text by itself because it is an array of UTF-8 code units.

Bytes. It's an array of bytes. D's char type isn't actually restricted to UTF-8 code units, char x = '\xFF'; works just fine even though that's not UTF-8.


I see what you mean but array of bytes is something else in D: byte[].



