Hacker News

That's not really a Thai character, right? It's way too many bytes! It must be an intentional repetition of stacking diacritics. Some of the ones in that Google result page are 21 code points long (63 bytes in UTF-8).


   $ echo ก็็็็็็็็็็็็็็็็็็็ | hexdump
   0000000 e0 b8 81 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 e0
   0000010 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9
   0000020 87 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87
   0000030 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 0a
   0000040
   $ echo ก็็็็็็็็็็็็็็็็็็็ | wc
       1       1      64


Something I wondered once: If one were to have a go at sanitizing Unicode input (e.g., for a forum), what would be a sensible limit on the number of diacritics to allow, without interfering with languages that need them?
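One possible shape for such a filter, as a minimal sketch rather than a vetted answer: walk the string and drop any combining mark beyond a fixed count in a consecutive run. The function name `cap_combining_marks` and the default of 3 are made up for illustration. The threshold is a placeholder, not a recommendation, since some scripts legitimately stack more than one mark on a base character.

```python
import unicodedata


def cap_combining_marks(text: str, max_marks: int = 3) -> str:
    """Drop combining marks beyond max_marks in any consecutive run.

    max_marks=3 is an arbitrary placeholder, not a recommended limit.
    """
    out = []
    run = 0  # combining marks seen since the last non-mark character
    for ch in text:
        # General categories Mn, Mc and Me are the Unicode "Mark" classes.
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                continue  # skip the excess mark
        else:
            run = 0
        out.append(ch)
    return "".join(out)


stacked = "\u0e01" + "\u0e47" * 20  # KO KAI followed by 20 MAITAIKHU
print(len(cap_combining_marks(stacked)))  # 4: the base plus 3 marks
```

The check uses the general category rather than `unicodedata.combining()`, because the canonical combining class is 0 for some marks even though they are still non-spacing.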



Thanks. I'd asked on SO about Unicode sanitization before and got a very "brush off" answer. Seems I was asking the wrong question.


Yep. With the diacritics spaced out:

     ก ็ ็ ็ ็ ็


Looks like this is the Unicode in HTML:

ก ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็


Yes, it is "THAI CHARACTER KO KAI" (U+0E01), plus 20 occurrences of the non-spacing mark "THAI CHARACTER MAITAIKHU" (U+0E47).
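For anyone who wants to check a breakdown like this themselves, here is a small sketch using Python's standard `unicodedata` module. The string is rebuilt from escapes on the assumption that it is one U+0E01 followed by 20 U+0E47; `describe` is just a helper name made up for this example.

```python
import unicodedata

# Assumed input: KO KAI followed by 20 MAITAIKHU, built from escapes.
text = "\u0e01" + "\u0e47" * 20


def describe(s):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in s
    ]


for row in describe(text[:2]):
    print(row)

print(len(text))                  # 21 code points
print(len(text.encode("utf-8")))  # 63 bytes, 3 per code point
```

The 63 bytes plus the trailing newline from `echo` account for the 64 that `wc` reports.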


It's odd, but I think the HN title has only 5 of those marks. Does HN code limit the stacking?


Indeed. Pasting it from the Google input and inspecting it shows a single U+0E01 ก (THAI CHARACTER KO KAI) followed by 20 U+0E47 ็ (THAI CHARACTER MAITAIKHU), a diacritic.


It's definitely an intentional repetition of stacking diacritics. ก็ has a meaning in Thai, but not ก ็ ็ ็


It's 2 characters.



