Hacker News new | past | comments | ask | show | jobs | submit login

> In Unicode's lexicon, this string contains surrogate code points, not surrogate code units. This is wrong, and as I stated, it is surprising that Python permits it. Unicode explicitly warns against this behavior:

> Note that this is the raw JSON/YAML, not a Python repr of it. Those slashes are literal slashes.

I just came to these realization over dinner. Apparently, this is the odd behavior is defined by JSON. So, it's not Python's json module at fault because it is, actually, implemented correctly.

https://en.wikipedia.org/wiki/JSON#Data_portability_issues

This whole time I thought the json module had a bug, but now I am wondering if it's another PyYAML bug.

Maybe not. I'll have to read this section a few more times. http://yaml.org/spec/1.2/spec.html#id2770814

Edit: Going back a bit here.

> Specifically, the spec around escaped unicode characters lacks any mention of surrogates being encoded in two \u sequences,

http://yaml.org/spec/1.2/spec.html#id2771184 makes the statement, "All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair.". And, there's numerous mentions for JSON compatibility. So, I suppose, this is an issue PyYAML (and Ruby's YAML, I checked).

I had never seen this issue before. Ruby's JSON module doesn't breakup utf-8 characters into surrogate pairs, nor does any online json parser that I could find via Google.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: