
> Notably Rust did the correct thing

In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:

  String.len() == number of bytes
  String.bytes().count() == number of bytes
  String.chars().count() == number of Unicode scalar values
  String.graphemes().count() == number of graphemes (requires the unicode-segmentation crate, which is not in the stdlib)
  String.lines().count() == number of lines

Really my only complaint is that I don't think String.len() should exist; it's too ambiguous. We should have to explicitly state what we want/mean via the iterators.
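
To make the difference concrete, here is a minimal sketch using only stdlib iterators. The `Counts` struct and `count_units` function are hypothetical names made up for this example, and the grapheme count is left out because it needs the external unicode-segmentation crate.

```rust
/// Hypothetical helper type for this sketch; not part of the standard library.
#[derive(Debug, PartialEq)]
struct Counts {
    bytes: usize,
    scalars: usize,
    lines: usize,
}

/// Count one string three ways with stdlib iterators only.
fn count_units(s: &str) -> Counts {
    let bytes = s.bytes().count();
    // len() is defined as the length in UTF-8 bytes, so it always agrees
    // with the explicit byte iterator.
    debug_assert_eq!(bytes, s.len());
    Counts {
        bytes,
        scalars: s.chars().count(), // Unicode scalar values
        lines: s.lines().count(),   // segments split on \n or \r\n
    }
}
```

For "na\u{ef}ve\ncaf\u{e9}" (two accented letters, each a single scalar value encoded as two bytes) this gives 12 bytes, 10 scalar values, and 2 lines. For "e\u{301}" (an e followed by a combining acute accent) it gives 3 bytes and 2 scalar values, even though a reader sees one character.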




Similar to Java:

   String.chars().count(), String.codePoints().count(), and, for historical reasons, String.getBytes(StandardCharsets.UTF_8).length

  String.graphemes().count()

That's a real nice API. (Similarly, Python has the `@` operator for matmul, but nothing in the stdlib implements it. NumPy provides a matmul implementation, so `@` works on its arrays.)

ugrapheme and ucwidth are one way to get the grapheme count from a string in Python.

It's probably possible to get the grapheme cluster count from a string containing emoji characters with ICU?


Any correctly designed grapheme cluster implementation handles emoji characters. It’s part of the spec (says the guy who wrote a Unicode segmentation library for Rust).
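
A rough sketch of why emoji need grapheme-level handling, using only the stdlib. The `FAMILY` constant and `scalars_and_bytes` function are hypothetical names for illustration. With the unicode-segmentation crate, `graphemes(true).count()` should report 1 for the same string, but that part is not run here.

```rust
/// Hypothetical sample: man + ZWJ + woman + ZWJ + girl, which most fonts
/// render as a single family emoji.
const FAMILY: &str = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

/// Returns (Unicode scalar values, UTF-8 bytes) for `s`, stdlib only.
fn scalars_and_bytes(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}
```

`scalars_and_bytes(FAMILY)` returns `(5, 18)`: three emoji at 4 bytes each plus two zero-width joiners at 3 bytes each. Neither number matches the single glyph a user sees, which is the gap a grapheme cluster segmenter closes.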


