
In fact it's trivial to generate a text file of all valid Unicode code points and use that as input to unit tests.
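A minimal sketch of what that could look like in Python, assuming "valid" means every Unicode scalar value (U+0000 through U+10FFFF minus the surrogate range), which is only one possible reading. The names `scalar_values` and `write_all_scalars` are made up for illustration.

```python
def scalar_values():
    """Yield every Unicode scalar value: U+0000..U+10FFFF minus surrogates."""
    yield from range(0x0000, 0xD800)    # everything below the surrogate block
    yield from range(0xE000, 0x110000)  # everything above it, up to U+10FFFF


def write_all_scalars(path):
    """Write all scalar values to `path` as UTF-8, with no separators.

    Assumption: unassigned code points, private-use code points and
    noncharacters are all included; filter them out if your definition
    of "valid" is stricter.
    """
    text = "".join(map(chr, scalar_values()))
    # newline="" stops Python from translating the U+000A already in the data.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return len(text)
```

When reading the file back, open it with `newline=""` as well, otherwise universal-newline handling rewrites the lone U+000D as U+000A.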


I would have to research whether the list of valid code points depends on the Unicode version. For example, can regional indicator code points (https://en.wikipedia.org/wiki/Regional_indicator_symbol) appear in isolation? If not, was that different before Unicode 6, when those code points weren’t assigned yet?
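One way to see the version dependence, at least for assignment, is to ask the Unicode database that ships with your runtime. A rough Python sketch, assuming "assigned" means "general category is not Cn"; `is_assigned` is a hypothetical helper, and the answer depends on which Unicode version your Python build bundles:

```python
import unicodedata


def is_assigned(cp):
    """Return True if `cp` has a general category other than Cn (unassigned).

    Note: noncharacters are also reported as Cn, so they count as
    unassigned under this definition.
    """
    return unicodedata.category(chr(cp)) != "Cn"


# The Unicode version this Python build was compiled against.
print("Unicode database version:", unicodedata.unidata_version)

# U+1F1E6 is the first regional indicator symbol.
print("U+1F1E6 assigned:", is_assigned(0x1F1E6))
```

This says nothing about whether a code point is meaningful in isolation; it only shows that the assigned set is tied to a database version.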

Similarly, what about tags (https://en.wikipedia.org/wiki/Tags_(Unicode_block))? Do these require a U+E007F CANCEL TAG?

The 66 noncharacters certainly need consideration. http://www.unicode.org/faq/private_use.html says:

“Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts”
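For reference, the 66 noncharacters are easy to enumerate, so a test can include or exclude them explicitly. A small Python sketch (the function name `noncharacters` is just for illustration): U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes.

```python
def noncharacters():
    """Return the 66 Unicode noncharacter code points as a list of ints."""
    cps = list(range(0xFDD0, 0xFDF0))  # 32 code points: U+FDD0..U+FDEF
    for plane in range(17):            # planes 0 through 16
        base = plane << 16
        cps.append(base | 0xFFFE)      # U+nFFFE
        cps.append(base | 0xFFFF)      # U+nFFFF
    return cps
```

Whether your code should accept, reject or replace these is exactly the open question the FAQ describes, so the list only helps you make that choice explicit.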

Edit: also, testing all code points is likely overkill, and testing code points in isolation likely isn’t enough. Most tests are better off with something like the Big List of Naughty Strings (https://github.com/minimaxir/big-list-of-naughty-strings).


It may be faster to generate them on the fly. Iterating over ranges of integers is a lot faster than reading files from disk.
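A sketch of the on-the-fly variant in Python, with no file involved. The UTF-8 round trip is only a stand-in for whatever function you actually want to test, and both function names are made up; it also assumes surrogates are the only code points to skip.

```python
def iter_scalar_chars():
    """Yield each Unicode scalar value as a one-character string."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        yield chr(cp)


def count_round_trip_failures():
    """Count characters that do not survive a UTF-8 encode/decode round trip."""
    failures = 0
    for ch in iter_scalar_chars():
        if ch.encode("utf-8").decode("utf-8") != ch:
            failures += 1
    return failures
```

Swap the body of the loop for a call to your own code; the generator stays the same.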



