In CS, most algorithms assume an ASCII character set. I wonder if there are any string-related algorithms that completely break (functionally or complexity-wise) when given UTF-16 or UTF-8 character sets.
Asymptotic complexity can't change based on the character set, since you can just reuse the same algorithm with larger opaque datums. (The exception being algorithms with O(n^8) or O(n^256) complexity, but no one uses those anyway.)
A variable-width encoding can cause issues in principle, but useful algorithms already have to deal with strings whose physical representations vary in length anyway (e.g. "yes" vs "no"), so it tends not to be a problem in practice.
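To make that concrete, here's a minimal sketch (Rust purely as an illustration language, not code from the thread): a naive byte-level substring search written with no knowledge of Unicode still finds the right answer on UTF-8 input, because a valid UTF-8 needle can only match at character boundaries of a valid UTF-8 haystack. (It compares raw bytes, so it deliberately ignores things like Unicode normalization.)

```rust
/// Naive byte-level substring search: returns the byte offset of the first
/// occurrence of `needle` in `haystack`, or None. Knows nothing about Unicode.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack
        .windows(needle.len())
        .position(|window| window == needle)
}

fn main() {
    let haystack = "naïve café"; // UTF-8: 'ï' and 'é' each take two bytes
    let needle = "café";

    // Both strings are treated as opaque byte sequences, yet the match found
    // is the real "café": a valid UTF-8 needle can only line up on character
    // boundaries of a valid UTF-8 haystack.
    let pos = find_bytes(haystack.as_bytes(), needle.as_bytes());
    println!("{:?}", pos); // Some(7): "naïve " is 7 bytes, not 6
}
```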
> In CS, most algorithms assume an ASCII character set.
They most certainly do not. E.g., a Turing machine assumes an alphabet Γ, which is some set of symbols and is defined no further, since any exact definition is meaningless to the theory. (I.e., the algorithm is generic over any alphabet.) The alphabet need not even be text; e.g., for a Turing machine, the set of all octets suffices.
Even for something like Levenshtein distance, the only real requirement the algorithm places on its input is that the abstract "characters" support equality testing. For Unicode text, I'd start with grapheme clusters and then look for counterexamples.
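As a sketch of that idea (again illustrative Rust, nothing from the thread): Levenshtein distance written generically over any slice of equality-comparable items. Whether you feed it bytes, chars, or grapheme clusters (segmented with something like the unicode-segmentation crate) is just a choice of element type; the demo uses chars to stay dependency-free.

```rust
/// Levenshtein distance over any items that support equality testing;
/// the algorithm itself never needs to know what a "character" is.
fn levenshtein<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ai) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, bj) in b.iter().enumerate() {
            let cost = if ai == bj { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost)   // substitute
                .min(prev[j + 1] + 1)        // delete
                .min(curr[j] + 1);           // insert
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

fn main() {
    // Unicode scalar values (char) as the element type here; grapheme
    // clusters would work the same way, only the segmentation step differs.
    let a: Vec<char> = "kitten".chars().collect();
    let b: Vec<char> = "sitting".chars().collect();
    println!("{}", levenshtein(&a, &b)); // 3
}
```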
I guess they won't break correctness, but I do remember that many algorithms (e.g. tries) assume constant-time random access to characters, which AFAIK is not possible in UTF-8.
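A small illustration of that cost, assuming Rust's standard UTF-8 `str`: indexing by character position has to decode from the start (O(n)), while byte indexing is O(1) but can land in the middle of a multi-byte character. That's one reason byte-keyed tries, where each node branches on a `u8` rather than a "character", are a common way to sidestep the issue.

```rust
fn main() {
    let s = "κόσμε"; // 5 characters, 10 bytes in UTF-8

    // O(1): fixed-offset access, but it yields a raw byte, not a character,
    // and offset 1 sits in the middle of the two-byte 'κ'.
    let byte = s.as_bytes()[1];
    println!("byte at offset 1: {:#04x}", byte);

    // O(n): to reach the 4th character, the iterator has to decode every
    // preceding character, because their byte widths vary.
    let fourth = s.chars().nth(3);
    println!("4th char: {:?}", fourth); // Some('μ')

    // str::get refuses byte ranges that would split a character.
    println!("{:?}", s.get(0..1)); // None: byte 1 isn't a char boundary
    println!("{:?}", s.get(0..2)); // Some("κ")
}
```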