
JavaScript strings are not UTF-16; if they were, string operations would only ever expose whole code points. JavaScript "strings" are UCS-2, which is trivial to demonstrate: "\ud83c" is a valid JavaScript string, but it is not valid UTF-16.
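For concreteness, a quick sketch (my own example, not part of the original claim; assumes a browser or recent Node.js where TextEncoder is a global):

    // A lone high surrogate is a perfectly legal JavaScript string value.
    const s = "\ud83c";                        // high surrogate, no matching low surrogate
    console.log(s.length);                     // 1
    console.log(s.charCodeAt(0).toString(16)); // "d83c"
    // Only at an encoding boundary does it get rejected: encoding to UTF-8
    // replaces the unpaired surrogate with U+FFFD (0xEF 0xBF 0xBD).
    console.log(new TextEncoder().encode(s));  // Uint8Array [239, 191, 189]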

Here's the relevant section of the Unicode FAQ on the subject:

> UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters.

A correct UTF-16 implementation would interpret surrogate code points, validate that they're paired, and prevent access to either surrogate on its own via string operations.
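A sketch of what JavaScript does instead (my own illustration; the mushroom emoji is just an arbitrary supplementary-plane character):

    // String operations expose and split raw surrogates, which a conforming
    // UTF-16 API would not allow.
    const emoji = "\u{1F344}";                     // U+1F344 MUSHROOM, one code point
    console.log(emoji.length);                     // 2  (two 16-bit code units)
    console.log(emoji.charCodeAt(0).toString(16)); // "d83c" (the high surrogate, exposed directly)
    console.log(emoji.slice(0, 1));                // a lone surrogate, silently produced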



ES6 did get some new functions to correctly deal with surrogate pairs in strings. In the end, JS strings are just a sequence of 16-bit values, with the unfortunate consequence that many string functions interpret them as UCS-2 and only some newer functions interpret them as UTF-16.
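A minimal sketch of the split between the old and new functions (my own example; output assumes any ES6-capable engine):

    const s = "\u{1F344}a";
    // Older, UCS-2-era views of the string: counts 16-bit units.
    console.log(s.length);                      // 3
    console.log(s.charCodeAt(0).toString(16));  // "d83c"
    // ES6 additions that understand surrogate pairs as UTF-16:
    console.log(s.codePointAt(0).toString(16)); // "1f344"
    console.log([...s].length);                 // 2  (the string iterator walks code points)
    console.log(String.fromCodePoint(0x1F344) === "\ud83c\udf44"); // true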

When you come across an invalid sequence while decoding input (like "\ud83c"), you generally have three choices: throw an exception, skip the invalid part, or replace it with a replacement character. The default JavaScript behavior is to be lenient. But if you need more control over the decoding behavior, you can use StringView or TextDecoder, which is part of this spec: https://encoding.spec.whatwg.org/
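A sketch of the TextDecoder side of that (my own example; the bytes are just the lone surrogate from above, naively encoded, which is invalid UTF-8):

    const bad = new Uint8Array([0xed, 0xa0, 0xbc]);
    // Default: lenient, invalid bytes become U+FFFD replacement characters.
    console.log(new TextDecoder("utf-8").decode(bad)); // "\ufffd\ufffd\ufffd"
    // With { fatal: true }: throw instead of replacing.
    try {
      new TextDecoder("utf-8", { fatal: true }).decode(bad);
    } catch (e) {
      console.log(e instanceof TypeError); // true
    }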


> ES6 did get some new functions to correctly deal with surrogate pairs in strings. In the end, JS strings are just a sequence of 16-bit values

Which is exactly why they are not and cannot be UTF-16.

> The default JavaScript behavior is to be lenient.

The JavaScript behaviour is to have UCS-2 "strings".
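Which is easy to see (my own sketch, same lone surrogate as upthread): ordinary string operations never validate surrogates, they just shuffle 16-bit units around.

    const lone = "\ud83c";                          // high surrogate on its own
    console.log(("x" + lone + "y").length);         // 3, the lone surrogate survives intact
    console.log((lone + "\udf44") === "\u{1F344}"); // true, pairing is purely positional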



