How many replacement characters? (hsivonen.fi)
93 points by hsivonen on May 31, 2017 | 29 comments


A well laid out case. Given the churn that the change in preferred practice might cause in the ecosystem, it's probably worth reverting, especially since apparently no real benefit beyond "it feels right" (according to this accounting) has been put forth.

It may cause the committee to lose a little face, but less than digging in its heels over a decision that has no ramifications for itself, does have ramifications for others, and has no real justification. Hopefully they see that and acquiesce, or at least come back with a well-thought-out rebuttal that isn't dismissive.


I completely agree with the author on this. The old Unicode 9.0 best practice made sense for a UTF-8 decoder that consumed input using a state machine, and like the article says, if you accept the correct byte ranges in each state as in Table 3-7, your state-machine-based decoder will implicitly reject every kind of invalid sequence, including overlong encodings, encoded UTF-16 surrogates and encoded out-of-range values. A state machine also makes other things trivial, like validating UTF-8 input without outputting characters and writing a streaming UTF-8 to UTF-16 converter. Another property is that it rejects invalid input as soon as possible, i.e. as soon as it consumes a byte that can't possibly be part of a valid sequence.

This is an example of a particularly elegant UTF-8 decoder which uses a state machine: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

The proposed Unicode 11 best practice suits a decoder that checks for overlong sequences, encoded surrogates and out of range values as a sort of post-validation step, after consuming all the bytes of a potentially valid UTF-8 sequence. Not only is this different to the behaviour of every existing decoder except for ICU, it also seems less elegant, more complicated and more error prone to me. If I understand correctly, even bytes like 0xc0 and 0xf5, which never form part of a valid UTF-8 sequence, won't be rejected immediately in this kind of decoder.
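To make the "rejected immediately" point concrete, here's a minimal sketch (not the linked DFA decoder, just an illustration) of the first-byte classification implied by Table 3-7. Bytes such as 0xC0, 0xC1 and 0xF5..0xFF have no outgoing transition from the initial state, so a table-driven decoder faults on them without consuming anything further:

```go
package main

import "fmt"

// validLead reports whether b can start a well-formed UTF-8 sequence
// per Unicode Table 3-7. Continuation bytes (0x80..0xBF), the overlong
// leads 0xC0/0xC1, and 0xF5..0xFF can never begin a valid sequence.
func validLead(b byte) bool {
	switch {
	case b <= 0x7F: // ASCII
		return true
	case b >= 0xC2 && b <= 0xF4: // leads of 2-, 3- and 4-byte sequences
		return true
	default:
		return false
	}
}

func main() {
	for _, b := range []byte{0x41, 0xC0, 0xC2, 0xF4, 0xF5} {
		fmt.Printf("%#02x valid lead: %v\n", b, validLead(b))
	}
}
```

A post-validation decoder, by contrast, would happily consume 0xC0 plus a continuation byte before discovering the sequence is overlong.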

The article makes a pretty solid argument for why this difference in behaviour matters, even though it's just a best practice and not an official requirement. The two key points in this for me are that most existing UTF-8 decoders produce identical results matching the Unicode 9.0 best practice, and that there was an actual bug in Chrome when two internal UTF-8 decoders produced differing results. I think I'd make a stronger conclusion though: Not only should they keep the current recommended best practice, they should also elevate it to a requirement in order to prevent bugs like the one in Chrome from happening in future. Most existing UTF-8 decoders are already compliant, and it would be a nice property of UTF-8 if all byte sequences, including invalid ones, decoded to the same sequence of codepoints in every decoder.


How I have implemented things is that when the UTF-8 decoder encounters an invalid sequence, it retreats to the beginning of that sequence. It then converts the first byte of that sequence to a replacement character, and consumes it. Then it resets to its initial state and begins decoding starting at the following byte.

Basically we are saying "no match occurs for a valid UTF-8 pattern at this input position; let's do error recovery by dropping a byte, emitting it as a replacement character and trying again."

I have used the low surrogate range U+DC00 - U+DCFF for replacement characters. On output, I convert these back to individual bytes. Thus the end-to-end decode+encode is binary transparent: any byte string can be decoded to a sequence of code points, some of which may be replacement characters, and that sequence will encode back to the original byte string. (This requirement cannot be achieved if multiple bogus bytes are collapsed into one replacement character.)

Well, that's not the full story: to have this transparency property, we also need to ensure that when some U+DCXX occurs by means of a valid UTF-8 pattern, we nevertheless treat it as invalid. I.e. there is a rule that if the UTF-8 decode works, but a U+DCXX code-point emerges, then we retreat to the start and drop a byte as a replacement character, as if a bad code had been seen.
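For what it's worth, the same trick can be sketched on top of Go's standard `unicode/utf8` package (a hypothetical illustration of the scheme described above, not the commenter's actual implementation). Go's `DecodeRune` already rejects encoded surrogates byte-by-byte, which takes care of the caveat about structurally valid U+DCXX patterns:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeEscaped maps every byte that can't be decoded as UTF-8 to a
// code point in U+DC00..U+DCFF. Go's DecodeRune returns (RuneError, 1)
// for all invalid input, including structurally valid encodings of
// surrogates, so those are escaped byte-by-byte too, keeping the
// mapping reversible.
func decodeEscaped(b []byte) []rune {
	var out []rune
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size <= 1 {
			out = append(out, 0xDC00+rune(b[0]))
			b = b[1:]
			continue
		}
		out = append(out, r)
		b = b[size:]
	}
	return out
}

// encodeEscaped is the inverse: U+DCXX code points become raw bytes,
// everything else is encoded as normal UTF-8.
func encodeEscaped(rs []rune) []byte {
	var out []byte
	for _, r := range rs {
		if r >= 0xDC00 && r <= 0xDCFF {
			out = append(out, byte(r-0xDC00))
			continue
		}
		out = utf8.AppendRune(out, r)
	}
	return out
}

func main() {
	// An ELF-like prefix with assorted invalid bytes round-trips exactly.
	b := []byte{0x7F, 'E', 'L', 'F', 0xC0, 0x80, 0xFF}
	rt := encodeEscaped(decodeEscaped(b))
	fmt.Println(string(rt) == string(b))
}
```

This mirrors Python's "surrogateescape" error handler, which uses the same U+DC80..U+DCFF-style escape hatch for undecodable bytes.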


Your UTF-8 decoder is identical to just "emit a REPLACEMENT CHARACTER for every bogus byte".

Also, the use of surrogate code points means you actually don't have a UTF-8 decoder at all, you just have something that's similar but produces a sequence of potentially invalid code points rather than Unicode scalar values.

> we also need to ensure that when some U+DCXX occurs by means of a valid UTF-8 pattern

It can't. Trying to encode surrogate codepoints in UTF-8 is strictly invalid.


> It can't.

Why did you cut off "we nevertheless treat it as invalid."? You're misreading that line, and actually in agreement on what to do. It's the bit pattern that is valid, which is how you can parse the code point and see that the code point is invalid for UTF-8.


I cut you off because you basically said "and if we see this thing that is defined as invalid, we treat it as invalid". Which is something that all UTF-8 parsers do, so there's no point in calling it out like this is special behavior.


That's not me.

The point is that this variant parser accepts many things that are normally errors and turns them into faux-surrogates, so it's worth restating that it rejects surrogates in the source file.


My apologies.

In any case, here's what OP said:

> Well, that's not the full story: to have this transparency property, we also need to ensure that when some U+DCXX occurs by means of a valid UTF-8 pattern, we nevertheless treat it as invalid. I.e. there is a rule that if the UTF-8 decode works, but a U+DCXX code-point emerges, then we retreat to the start and drop a byte as a replacement character, as if a bad code had been seen.

But this whole paragraph is wrongheaded. A U+DCXX codepoint cannot occur as a result of a UTF-8 decode, because it is defined as invalid. Even just that last bit there, "as if a bad code had been seen"... a bad code was seen! This paragraph makes me question whether OP actually understands UTF-8 decoding at all.


I think the meaning is clear. They're talking about part of the UTF-8 decoding process, the part that actually decodes leading/trailing bytes into arbitrary 21 bit numbers. What term would you use for that?


Well, I wouldn't even bring it up, because the UTF-8 decoding process by definition cannot produce code points in the surrogate pair range, as those are illegal to encode in UTF-8. But if I must, I might talk about the "bit pattern" and the integral value that results from interpreting it. I certainly wouldn't talk about code points resulting from a UTF-8 decode.


Somebody is apparently upset about what I said. Why?


The author found it hard to "find the right API entry point in Go documentation".

For the record, Go produces one U+FFFD per byte, not per maximal contiguous run, when iterating over bad UTF-8. This is part of the language specification, not just library behavior, although the standard library matches it. For example, in the standard UTF-8 package, https://golang.org/pkg/unicode/utf8/#DecodeRune says that the size returned is 1 (i.e. 1 byte) for invalid UTF-8.

The relevant language spec section is https://golang.org/ref/spec#For_statements; look for "If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string."

Example code: https://play.golang.org/p/OLIWcjLIvF
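In case the playground link rots, the spec behavior quoted above can be shown directly (my own example, not necessarily the linked one):

```go
package main

import "fmt"

func main() {
	// "\xE1\x80" is a truncated three-byte sequence. Ranging over the
	// string yields U+FFFD and advances one byte at a time, so Go emits
	// two replacement characters here, where the Unicode 9.0 "maximal
	// subpart" practice would emit one for the whole truncated sequence.
	for i, r := range "\xE1\x80A" {
		fmt.Printf("byte %d: %U\n", i, r)
	}
}
```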

I'll note that both Go and UTF-8 were invented by Ken Thompson and Rob Pike. I'm sure that the Go authors were aware of UTF-8's details. (Go also involved Robert Griesemer, but that's tangential).


Thank you. I was looking for something that takes a potentially invalid buffer of UTF-8 and returns a guaranteed-valid buffer and failed to find a function like that.

(And, indeed, Go is an interesting case due to its creators being the inventors of UTF-8, too.)


Yeah, there's not really a guaranteed-valid buffer concept in Go. Even if you have valid UTF-8, you still have to iterate over it to e.g. rasterize glyphs, and iterating over possibly-bad UTF-8 is no harder than iterating over known-good UTF-8.

If you want to compare to other UTF-8, validity alone isn't always sufficient. You often have to e.g. normalize anyway, and normalization should fix up bad UTF-8. Again, a guaranteed-valid buffer type wouldn't win you much.


Oh, in case you were wondering, the ubiquitous term "rune" in the Go documentation is simply shorthand for "Unicode codepoint".


Thought: what if Unicode decoding were "lossless" in the face of errors, such that the replacement characters represented the bitstring of the non-decodable bytes? (E.g. 256 reserved codepoints, one for each possible octet value, that render as e.g. "[FF]" in a box; and then another 255 for the set of 7-bit, 6-bit, 5-bit, etc. overhangs.)


been there, done that; it is very useful.

  1> [(file-get-string "/bin/ls") 0..15]
  "\x7F;ELF\x02\x01\x01\xDC00\xDC00\xDC00\xDC00\xDC00\xDC00\xDC00\xDC00"
  2> (file-put-string "foo" (file-get-string "/bin/ls"))
  t
  3> (sh "cmp /bin/ls foo")
  0
  4> (sh "sha256sum /bin/ls foo")
  a90ba058c747458330ba26b5e2a744f4fc57f92f9d0c9112b1cb2f76c66c4ba0  /bin/ls
  a90ba058c747458330ba26b5e2a744f4fc57f92f9d0c9112b1cb2f76c66c4ba0  foo
  0
There is no need to handle any fractional bytes if the original input is a sequence of bytes; so many whole bytes have to be recovered, not so many whole bytes plus three bits or whatever.


I was extremely curious what REPL this is, and it appears to be TXR: http://www.nongnu.org/txr/

(My conclusion was based on finding file-get-string and file-put-string in https://fossies.org/linux/misc/txr-176.tar.gz/txr-176/share/...)


I might be sleep deprived... but what exactly is this script doing?

You're making a copy of /bin/ls into foo, and sha256sum the copy and the original.

    $ head -c 15 /bin/ls
    $ cat > foo < /bin/ls
    $ cmp /bin/ls foo
    $ sha256sum /bin/ls foo
I don't get it.


> You're making a copy of /bin/ls into foo

By getting its contents as a character string formed by passing the binary through a UTF-8 decoder, and writing out that string via the UTF-8 encoder.


> The proposal is ambiguous about whether to do the same thing for five and six-byte sequences whose bit pattern is not defined as existing in Unicode but was defined in now-obsolete RFCs for UTF-8 [...] If five and six-byte sequences are treated according to the logic of the newly-accepted proposal, the newly-accepted proposal matches the behavior of ICU.

Regarding 5- and 6-byte sequences, perhaps the Unicode Consortium in their ambiguity and the ICU in its implementation are allowing for their possible return to Unicode. One day in the far-off future when UTF-16 finally dies, it will be feasible to increase the codepoint repertoire back up from about 1 million to 2 billion, which is easy to represent in both UTF-8 and UTF-32.


This raises what appear to be important points. Has it been formally submitted to Unicode in some way?


Yeah, there's been a long discussion about this on the official Unicode mailing list recently. You can read it at http://unicode.org/pipermail/unicode/



Discussion on the mailing list does not constitute official feedback to the UTC; ultimately, it's nothing more than a discussion forum, and does not mean the issue will necessarily come back to the committee.

The contact form for more formal feedback is at http://www.unicode.org/reporting.html


It seems like the only harm is that "implementations have to explain themselves" and one Chromium bug.


The Chromium bug is a demonstration of the fact that different behaviors in different implementations can lead to real bugs. There's no reason to think that one Chromium bug is the only time this will ever matter.


It says in the spec that the number of replacement characters can vary, so it's hard to blame Unicode for a bug caused by two parsers producing different numbers of them.

(Unless the argument is that there should be an official required number, which is a different discussion entirely.)


No, you're right in that it's perfectly legitimate for multiple parsers to behave differently here. But if there's one behavior that nearly all parsers have standardized on, that's very valuable because it makes it a lot easier to use two different parsers without a problem.



