Wow! I was working on this issue in our DBMS product today!
Fun suggestion: try making a JSON string with a NUL character somewhere in the middle. It will be encoded as \u0000 and is a valid UTF-8 code, but most C-based systems will truncate the string because they compute its length with strlen.
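A minimal sketch of that truncation, assuming Python with `ctypes` standing in for a C consumer (the variable names are just for illustration). `c_char_p(...).value` reads up to the first NUL, the same way strlen-based code would:

```python
import ctypes
import json

# The escaped form is valid JSON and decodes to a 5-character string.
text = json.loads('"ab\\u0000cd"')

# UTF-8 encodes U+0000 as a single 0x00 byte.
raw = text.encode("utf-8")

# What a NUL-terminated C string view sees: everything before the first 0x00.
c_view = ctypes.c_char_p(raw).value

# Serializing again produces the escape, not a raw NUL byte.
round_trip = json.dumps(text)

print(len(raw), len(c_view), round_trip)
```

The Python side keeps all five bytes; only the C-style view loses the tail.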
The Java community and some other software vendors designed Modified UTF-8, which replaces the zero byte with a 2-byte sequence (0xC0 0x80). Sleek, aside from the fact that you are modifying data that the customer wants to stay consistent.
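A rough sketch of just the NUL part of Modified UTF-8, in Python. The helper names are hypothetical, and real Modified UTF-8 also encodes supplementary characters differently, which this deliberately ignores:

```python
def encode_nul_safe(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the overlong pair C0 80."""
    # In UTF-8 a 0x00 byte can only come from U+0000, so a byte-level
    # replace is safe here.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Reverse of encode_nul_safe: map C0 80 back to 0x00, then decode."""
    # 0xC0 never occurs in valid standard UTF-8, so this cannot collide.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_nul_safe("ab\x00cd")
restored = decode_nul_safe(encoded)

# A strict standard UTF-8 decoder rejects the overlong form, which is the
# "you are modifying the data" problem: the bytes are no longer plain UTF-8.
try:
    encoded.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

The encoded bytes contain no 0x00, so strlen-style consumers see the whole string, but every reader has to know about the convention.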
Postgres explicitly bans such cases in VARCHAR; not sure whether they can fit in its JSON columns. Has anyone tried?
They are valid if escaped, as explicitly noted in Section 7 of RFC 7159 [1]. (Annoyingly enough, it doesn't explicitly say JSON strings are Unicode strings; it just says that a certain subset of JSON strings is interoperable with Unicode.) GP means that the escaped null byte can still cause issues for C interoperability.
> All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped in this way
> Any character may be escaped.
I personally find the JSON spec very explicit while succinct
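The two quoted rules can be checked against a real parser. A small sketch, assuming Python's standard `json` module with its default strict mode (other parsers may be more lenient):

```python
import json

escaped = '"a\\u0000b"'  # JSON text containing the six-character escape \u0000
raw = '"a\x00b"'         # JSON text containing a literal NUL character

# Escaped control characters are accepted.
decoded = json.loads(escaped)

# Unescaped control characters (U+0000 through U+001F) are rejected.
try:
    json.loads(raw)
    raw_accepted = True
except json.JSONDecodeError:
    raw_accepted = False

# "Any character may be escaped": even plain ASCII letters.
any_escaped = json.loads('"\\u0041"')
```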
Oh, actually I made a mistake in the GP. The following sentence:
> Annoyingly enough it doesn't explicitly say JSON strings are Unicode strings, [...]
...is false. I completely missed the very first section (I obviously searched for "Unicode", but failed to check the results thoroughly). I have other valid criticisms of the JSON specification, but that is not one of them, so please ignore that part of my rant since it was based on a wrong assumption. Thank you for (implicitly) pointing it out.