Slimmer and faster JavaScript strings in Firefox (blog.mozilla.org)
191 points by evilpie on July 21, 2014 | 66 comments



Lazily converting UTF-8 (or latin1) to UTF-16 as needed is indeed an old trick employed by many string classes.

It's even a bit surprising that a codebase as popular and performance-critical as SpiderMonkey hadn't picked up such a simple, high-yield optimization years ago.

By the way, other implementations are even lazier: the string is kept in its original encoding (UTF-8 or 7-bit ASCII) until someone calls a method requiring indexed access to characters, such as the subscript operator. At that point, you convert to UTF-16 for O(1) random access.
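
A minimal sketch of that, in JavaScript-flavoured pseudocode (purely illustrative; it assumes the narrow storage is Latin-1, so inflation is a straight widening rather than real UTF-8 decoding, and all names are made up):

    function LazyString(latin1Bytes) {      // latin1Bytes: a Uint8Array
        this.bytes = latin1Bytes;           // keep the original narrow storage
        this.wide = null;                   // two-byte copy, built on demand
    }
    LazyString.prototype.charAt = function (i) {
        if (this.wide === null) {           // first indexed access: inflate once
            var s = "";
            for (var k = 0; k < this.bytes.length; k++) {
                s += String.fromCharCode(this.bytes[k]);
            }
            this.wide = s;                  // from here on, O(1) random access
        }
        return this.wide.charAt(i);
    };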

Indexing characters in a localized string is rarely useful to applications and often denotes a bug (did they want the N-th grapheme, glyph or code-point?). It's best to use higher-level primitives for collating, concatenating and splitting localized text.

Granted, a JavaScript interpreter must remain bug-by-bug compatible with existing code, thus precluding some of the most aggressive optimizations.


What do the lazily converting string classes do for characters that don't fit in UTF-16? Would they convert to UTF-32, or just fall back to an O(n) index?

Example: ☃


There are, by definition, no Unicode characters that don't fit in UTF-16.

UTF-16 has surrogate pairs; it's an extension of UCS-2, which doesn't.

Incidentally, this is why UTF-16 is a poor choice for a character encoding: you take twice the memory but you don't actually get O(1) indexing, you only think you do, and then your program breaks badly when someone inputs a surrogate pair.

See also elsewhere in the thread: https://news.ycombinator.com/item?id=8066284


String classes rarely use UTF-16 because it doesn't have a fixed-length code point representation. UCS-2 is often used instead, which uses two bytes to represent all the Unicode code points in the Basic Multilingual Plane (BMP), which is enough for 99.99% of use cases.

One example of this is Python, which used UCS-2 until version 3.3. There was a compile time option to use UCS-4, but UCS-2 was enough for most cases because the BMP contains all the characters of all the languages currently in use.


Which encoding does Python use now?


PEP 393 introduced flexible string representation which can use 1, 2 or 4 bytes depending on the type of the string: http://legacy.python.org/dev/peps/pep-0393/


> Linear-time indexing: operations like charAt require character indexing to be fast. We discussed solving this by adding a special flag to indicate all characters in the string are ASCII, so that we can still use O(1) indexing in this case. This scheme will only work for ASCII strings, though, so it’s a potential performance risk. An alternative is to have such operations inflate the string from UTF8 to TwoByte, but that’s also not ideal.

Perhaps I'm missing something (quite likely, as I am certainly no expert when it comes to Unicode), but I was under the impression that this would already have to be the case, since UTF-16 is also variable-length.


Technically, for characters whose code point exceeds 0xFFFF, JavaScript treats them as two characters. To see that, consider the sushi character "🍣" (U+1F363):

    "🍣".length // 2
    "🍣".charCodeAt(0) // 55356
    "🍣".charCodeAt(1) // 57187


That's a bad interface that allows you to split strings at useless codepoints and get illegal UTF-16 strings as the result.


It's the historical interface which websites now rely on, changing it would be like writing a libc with strcmp operating on Pascal strings.

In any case, a JavaScript String is not actually designed to be UTF-16; it is essentially just a `uint16_t[]`. Even textual strings just store UTF-16 code units, not full UTF-16 data. Relevant snippets from the standard:

The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values ("elements").

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. [...] All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

See also:

- Section 8.4 http://www.ecma-international.org/publications/files/ECMA-ST...

- http://mathiasbynens.be/notes/javascript-encoding


> Although the standard does state that Strings with textual data are supposed to be UTF-16.

No, it doesn't. It states that they're UTF-16 code units, a term defined in Unicode (see D77; essentially an unsigned 16-bit integer), which is not the same as UTF-16. A sequence of 16-bit code units can therefore include lone surrogates, which something encoded in UTF-16 could not.
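
For example, the engine is perfectly happy to hand you half of a pair:

    "🍣".charAt(0)           // a lone high surrogate, not valid UTF-16 on its own
    "🍣".charAt(0).length    // 1; still a perfectly legal JS string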


Oh, yes; I just skimmed 'code unit' bit without actually reading. (I've now removed the misinformation from my previous comment.)


I think JS may be from the time when UCS-2 was all there was and there were only 65535 Unicode characters.


It's needed for compatibility with the Web, unfortunately.


Definitely. Thankfully, ES6 will introduce

    "🍣".codePointAt(0)
and iterators will iterate code points, not code units.

    for (var c of "🍣"){}
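
For reference, roughly what that yields (per the ES6 draft):

    "🍣".codePointAt(0)     // 127843 (0x1F363): the whole code point
    for (var c of "🍣") {
        console.log(c)      // logs "🍣" once: one code point, two code units
    }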


Funny how this Sushi character appears as nigiri or as maki depending on the font.

I can't say I'm really satisfied with the current state of emojis in Unicode.


The name of the character is simply ‘SUSHI’ ( http://www.unicode.org/charts/PDF/U1F300.pdf ). Any pictographic representation of sushi would fulfil that.


Oh I'm not saying it's wrong, just too imprecise (actually, since in France "sushi" is often synonymous with nigiri, when I posted the character earlier in a chatroom, someone made the remark that they were "maki, not sushi").

Also, what about "🏤" which is "U+1F3E4 EUROPEAN POST OFFICE"? I see it here as a box with some kind of horn, Deutsche Post's logo as far as I know. Is this supposed to be localized in the future so that I can see the French Post's bird instead?

What is not satisfying is that the emojis feel both too incomplete (great, there's an eggplant and a tomato, now where's the bell pepper?) and too imprecise (okay, I have this nice maki emoji to show what I'm eating... oh wait, am I sure my friend will actually see maki?).

And sometimes they're just plain weird, what about "😤 U+1F624 FACE WITH LOOK OF TRIUMPH"? In all fonts I can find it looks like someone who's mightily pissed, maybe fuming because he spent so much time looking for the perfect emoji, only for their friend to see something completely different. That doesn't look like triumph to me.


A stylized bugle is a fairly universal symbol for the postal services in Europe, at least historically. I can't find a complete overview, but it looks like France is one of the very few exceptions.


UTF-16 is variable-length in that a Unicode code-point can take up one or two UTF-16 code-units. However, for backward-compatibility reasons "charAt()" is defined to return a UTF-16 code-unit (regardless of whether or not it's a useless half-a-code-point) so effectively it's O(1) indexing.


Why wouldn't UTF-8 offer the same O(1) indexing? I still think they should fix regex/charCodeAt etc to support the newer characters above 0xFFFF as demonstrated, it would see dramatic memory improvements and remove the need for hacks to detect surrogate pairs.


UTF-8 and UTF-16 are variable-length; UTF-32 is not. The JS spec says you can use UCS-2 or UTF-16. I believe the author meant to say: with UTF-16, on average, your operations are faster but use more memory. With UTF-8 you use less memory, but operations are slower in the web environment.


Note that you don't necessarily use less memory using UTF-8. It only saves memory for languages that can be represented in latin1. Non-western languages usually end up using more memory in UTF-8 than in UTF-16.
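
Roughly, bytes per character for a few sample characters:

                   UTF-8   UTF-16
    "A"  U+0041      1       2      (ASCII: UTF-8 smaller)
    "é"  U+00E9      2       2      (Latin-1 range, non-ASCII: same)
    "あ" U+3042      3       2      (most CJK: UTF-16 smaller)
    "🍣" U+1F363     4       4      (outside the BMP: same)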


Of course; that's why I said 'on average', which is mathematically correct but only gives the general sense.


anybody care to explain the downvotes?


UTF-8 apparently usually ends up faster as you’re shifting a smaller amount of data from the RAM to the CPU and back out. This makes it more likely that your strings will fit in the CPU’s cache rather than having to hit the main RAM every time, leading to overall speed improvements.

(this from an interest in the subject, rather than any actual implementation work I’ve done)


My personal ideal representation of strings:

A modified rope-like tree, where each node stores a flat array of characters, with the caveat that all characters within a node must have the same encoding, and the node stores that encoding. Yes, this means that a single string can contain multiple encodings. The internal "default" representation is a flat array in a (modified) UTF-8 encoding at a fixed number of bytes per character (recorded in the encoding), using overlong encodings where necessary. (So if you change a character in a node of two-byte characters into a single-byte character, you don't need to regenerate the node if that isn't worthwhile.)
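
A rough sketch of the node shape being described (all names invented):

    // Leaf: a flat buffer where every character takes the same number of
    // bytes, so indexing within a leaf is buffer[i * width .. (i + 1) * width].
    function Leaf(bytesPerChar, buffer) {
        this.bytesPerChar = bytesPerChar;   // 1-4, using overlong UTF-8 if needed
        this.buffer = buffer;               // Uint8Array of encoded characters
        this.length = buffer.length / bytesPerChar;
    }
    // Interior node: concatenation of two subtrees, as in an ordinary rope.
    function Concat(left, right) {
        this.left = left;
        this.right = right;
        this.length = left.length + right.length;
    }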


Honestly, I really couldn't care how a string is stored internally. It matters much more to me that it exposes, at a bare minimum, the ability to iterate over Unicode code points — something JavaScript and so many other languages do not do. From there I can perhaps build up to something actually useful (like a concept of a character).


Agreed.

Also, a related operation that would be useful is being able to do operations on "logical characters" - I believe the technical term is "grapheme clusters".


Most languages designed today use UTF-8 so in those you should be able to iterate over the code points.


What do you mean by this? UTF-8 != iterating over code points (in fact, UTF-8 is the most complicated of the UTF encodings to iterate over by code point).


For what purposes would you use this?


Any purpose. Just about anything that "iterates over a string", in my opinion, is going to be more correct when considered at the code point level than at the code unit level. Some concrete examples:

- Truncating a string, perhaps to append "…" to it. I definitely don't want to truncate in the middle of a surrogate pair or two UTF-8 code units into a three code unit sequence. In fact, code points isn't good enough here either, as I don't want to clip an accent if someone has the string ("e" + ACUTE_ACCENT).

- Determining the width of a string, for something like columns, or pixels. Code units are meaningless here. Especially with terminal columns, this problem comes up often. Are you rendering the output of a SQL table in a pretty ASCII table? Then you need to append spaces to a short string until you reach the "|", and far too many times I've seen this:

    | 1 | Some text in here. | another column |
    | 2 | résumé           | another value  |
MySQL does this. Here, the rendering is broken because the function thought an acute accent took up a column. I've seen both the need for something higher-level than code points (this example, since some code points really take no space, typically because they combine), and cases where multi-byte UTF-8 strings just got byte-counted. I've had a linter (pylint) tell me a line was too long (over 80 columns) when it was in fact closer to ~70 columns, well under 80. The problem? `if 80 < len(line)`, where line is a bytestring.

My point is that I struggle to come up with an example that is more conceptually valid at the code unit level than at the code point level. Because of this, code points are the _bare minimum_ level of abstraction language designers should deliver to programmers. As many of my examples show, even that might be woefully insufficient, which makes it all the worse to have to work with code units.
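
The truncation case, for instance, becomes straightforward once code points are iterable; a quick sketch using the ES6 string iterator (it still clips combining accents, as noted above):

    // keep at most n code points; never split a surrogate pair
    function truncate(str, n) {
        var out = "";
        var count = 0;
        for (var ch of str) {               // iterates code points, not code units
            if (count++ === n) return out + "…";
            out += ch;
        }
        return out;
    }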


Well, yes: the fact that code points aren't enough, as your first example demonstrates, is part of my point.

Where I'm coming from is this: about a week ago I thought the same as you do. Since then I've been researching how to move a big MBCS codebase (Visual Studio speak for 'std::string = array of chars', mostly) to Unicode (i.e. Microsoft-speak for 'use UTF-16 encoding for strings internally, and call all the UTF-16 APIs rather than the char-based ones, which expect 8-bit strings encoded in the current code page'). (I'm just explaining because I don't know if you have experience with how Visual C++ and Windows handle these things.)

My conclusion is that the people at utf8everywhere.org propose the least wrong approach, which is to use std::string for everything in C++, and assume that the encoding is utf-8.

What is the relationship to this discussion? Well, when you do the above, std::string::length() doesn't return the number of 'characters' any more, just the size in bytes. The same goes for iteration: it iterates over bytes.

Why do I think this is not as big a problem as I thought it was a week ago: the circumstances where you need to iterate over 'characters' (note that 'character' doesn't really mean anything; what you need in your examples isn't iterating over code points, it's iterating over grapheme clusters) are few and far between.

Neither of your examples would be fixed by a string that lets you iterate over code points. What you need instead is a way to ask your rendering engine (albeit in this case, the console): 'when you render this sequence of bytes, how wide will the result be?' (in the units of your output device, be it a fixed-width device like a console or a variable-width one when rendering in a GUI).

On the other hand, when working with strings, what you do often need is the size in bytes: for allocating buffers, for letting others know the 'size' of the string (which they need to know, at the least, in bytes), etc.

So:

- you always need to know the 'size' in bytes, and that has a fixed meaning.

- you very seldom need to know the 'size' in code points, or iterate over it. For the cases where you do, you can use an external library. (Of course it would be convenient if that functionality were baked in, but where do you draw the line? Code-point level access is very rare.)

- you sometimes need to know the 'size' in grapheme clusters, but there is no fixed definition of 'size' in that context; it's something that depends on your output device. No string class or type can account for that. At best you can say 'I want to know the number of units if this string were rendered in a fixed-width font', which is sensible, but not only very complex but also asking (imo) too much of a string representation that is to be used in today's programming languages and library ecosystems.

So while I feel your pain, I think what you're asking for is neither realistic nor very useful today. At some point, 10 or 20 years down the road, when full 'Unicode' support in all software everywhere is the default, maybe yes.

(As an aside, when reading up on this over the last week, I looked at e.g. the Swift string API - https://developer.apple.com/library/prerelease/ios/documenta... - and felt a bit jealous. It is probably a first step towards the bright future I mentioned above, but it still has some weirdnesses that nobody used to 'char = 1 byte, only 7-bit ASCII allowed' strings would expect.)


I agree with you, but I think it proves my point.

I completely agree that it would be best if I could just ask "how wide is this string?" and not worry about code points or grapheme clusters at all. That'd be amazing. But I can't. So I need to iterate over grapheme clusters, but I can't. So, to do that, I need to iterate over code points, but I can't. So, to get around that, I have to decode code units to code points manually, and then build all the way back up to the aforementioned problems, each time I encounter them. It's a PITA, because Unicode support is so piss poor in so many languages.

Or I'm a user, and the experience is just poor because the coder couldn't be bothered to do it right, most likely because it's so difficult.

To some extent, I'm sure there are libraries (is there a library for terminal output width?), but often it's coder ignorance that results in them not getting used. There'd be more awareness if the API forced you to choose the appropriate type of sequence: you'd be forced to think (and maybe seek out a library to avoid thinking). Instead, the default is often wrong.

> On the other hand, when working with strings, what you do often need is the size in bytes

And for this, I'm thankful. Most of the time, it doesn't matter. But when it does, you're in for a world of hurt.


Ropes are great when constructing a large string, but can be much slower and more memory-heavy for comparison, iteration, indexing, and matching. In a typical program, where many of the strings are small literals being matched against or tokenized from a large input, this is often a poor trade-off.

My personal ideal representation of strings, if I were ever doing a new language:

Length-prefixed (in characters) UTF-8, with a redundant null terminator so that the buffer can be passed directly to C libraries without being copied or re-encoded. Assume that heap objects have their size in bytes automatically prefixed (GC requires this), so that it's also possible to get the byte length of the string. Small strings (<7 bytes of UTF-8 on 64-bit, or <3 bytes on 32-bit) are stored immediately; the assumption is the language runtime would use tagged values much like Lisp, so a small string gets a tag value of 0b10 and uses 4 bits or so in that last byte to indicate the length. No null terminator on immediate strings; when they're that small, the cost of copying them to C is negligible (although hmm, the memory management kinda sucks. It'll suck anyway though, as some C libraries expect to take ownership of the string and some don't...and for ones that don't, you could just zero out the tag byte and pass the word directly, then reset the tag when the call completes).

Ropes are available as a standard library data structure with an API identical to strings, but different performance characteristics. All of these are immutable; concatenation and modification result in a copy (in a rope's case, with much shared structure).

Give up on O(1) indexing and slicing, but indexing should always return the complete code point at that position in the string and never a corrupted character. Most of the time, indexing or slicing involves looking at a small number of characters from the beginning or the end, so you can use a linear scan of UTF-8 code points from the nearest endpoint with N ~= 3-10. The exception is finding the midpoint, which should be provided as an API function (bisect?) that takes the midpoint in bytes and then uses UTF-8's self-synchronizing property to find the nearest legal code point. indexOf uses Boyer-Moore on bytes; startsWith/endsWith/equals are also byte-wise (well, word-wise; you can significantly speed up string equality tests by comparing by word and then using Duff's device on the remainder) with some special-casing to handle immediate values above. split() is basically repeated indexOf with a copy; join() allocates the full string length and then copies in the data. Regexps use a DFA on bytes.
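
The self-synchronizing trick for that bisect operation is tiny; a sketch (names invented):

    // Given UTF-8 bytes and an arbitrary byte offset, back up to the start of
    // the code point covering that offset. Continuation bytes look like 10xxxxxx.
    function alignToCodePoint(bytes, i) {   // bytes: Uint8Array
        while (i > 0 && (bytes[i] & 0xC0) === 0x80) {
            i--;                            // skip continuation bytes
        }
        return i;                           // now at a lead byte (or at 0)
    }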

I'm wondering if it's worth introducing a separate data structure (same API) that is analogous to a Slice in Go - a pointer and index into an existing buffer - but my experience is that these often result in subtle memory leaks in GC'd languages. You hold onto a reference to 3 characters in a 300K buffer, and suddenly your program needs to keep all 300K live. It definitely makes split and slice very fast, though.

Also, there should be a separate standard-library type for []bytes, and all I/O should operate on that, with explicit encoding/decoding operations. Encoding happens at program boundaries, all strings inside the language should be UTF-8. Would be cool to allow vectored I/O directly from ropes, too, it'd make for great templating engines & web frameworks.


You can just fall back to a "rope" consisting of a single node for small strings.

One of the things I really like about what I call "pseudo-immutable" ropes (ropes that are not truly immutable, but to all appearances are: when you take a slice, it modifies the original rope to maximize structure-sharing) is that you get immunity to those subtle memory leaks on slicing "for free".

I agree fully on length-prefixed strings. They are a lot better than null-terminated strings, in a bunch of ways. Regexes using DFAs, though, can be... problematic. You really want an NFA that's converted to a DFA on the fly, with caching, or else you can end up with exponential blowup when constructing the DFA (a[<some large character class>]{1000}a, for example). A DFA-based regex also doesn't handle a number of things.

As for slicing, I had a thought on that matter. Make slices reference the main array weakly, and have (weak) backreferences to any slices made of an array. When an array comes up for GC, if it has any slices made of it, copy out the slices to new arrays. There are a bunch of heuristics that can be tuned here, of course (don't do weakref / backref at all on smaller arrays, don't copy out slices if it won't reduce memory usage, etc.), but it prevents the worst case at least.

The big advantage of slices is that it means that programmers don't need to sacrifice readability for efficiency. It's much easier to be able to pass in arr[7:10000] to a function than to either have to modify the function to take index parameters (which you can't always do) or have to do a temporary copy. It also prevents the O(n^2) behavior associated with doing simple tokenizing (the change to Java's substring behaviour, for example)

I disagree strongly with "Encoding happens at program boundaries". The (major) problem with that is that there's then no good way to pass strings through the language without going through an encoding pass. Try to implement cat in Python 3, for example. Encoding should happen when you ask for it.


> For every JS string we allocate a small, fixed-size structure (JSString) on the gc-heap. Short strings can store their characters inline (see the Inline strings section below), longer strings contain a pointer to characters stored on the malloc-heap.

I wonder what the reason is for this roundabout way of doing it - couldn't the whole string be stored as a variable-length block (header with length, and then the content bytes), all on one heap? Incidentally this is also one of the things I think is broken about the malloc() interface; there is no portable way to get the size of an allocated block with only a pointer to it, despite that information being available somewhere - free() has to know, after all. Thus to do it the "correct, portable" way you have to end up essentially duplicating that length somewhere else. The fact that people are getting told that it's not something they need to know (e.g. http://stackoverflow.com/questions/5451104/how-to-get-memory... ) doesn't help either.

I've written a "nonportable" (in reality, all that would be needed is to change the function that gets the length from the block header) string implementation that uses this technique, and it definitely works well.

> Some operations like eval currently inflate Latin1 to a temporary TwoByte buffer, because the parser still works on TwoByte strings and making it work on Latin1 would be a pretty big change.

I haven't looked at their code but if the parser expects the whole string to be available and accesses it randomly it would certainly be a big rewrite; otherwise, if it's more like a getchar(), it wouldn't be so hard to have a function expand each character from the source string as the parser consumes it.

> The main goal was saving memory, but Latin1 strings also improved performance on several benchmarks.

With modern processors having multilevel cache hierarchies and increasing memory latencies, smaller almost always is faster - it's well worth spending a few extra cycles in the core to avoid the few hundred cycles (or more) of a cache miss.


SpiderMonkey uses one of the rather elaborate, carefully-tuned modern allocators (tcmalloc or jemalloc, I forget which) that's designed around clustering allocations into 'size categories', and carving allocations out of those categories. As a result, a given allocation will end up with overhead based on the size bucket it's put into, and certain buckets may perform better.

In this environment it's very useful for each type to be fixed-size (within reason) and be consistently sized. You'll end up with all your JSStrings allocated out of the same size bucket and other things get simpler and faster.

For example, with fixed-size JSStrings allocating out of a fixed-size allocation bucket, perhaps thread-local, you can allocate a string by atomically incrementing a pointer. Speedy!


In general Firefox uses jemalloc, so shipping builds of SpiderMonkey do as well. However, individual JavaScript objects are allocated using a custom SpiderMonkey-specific allocator that uses fixed-size bins, knows about garbage collection information, and (as of this year) supports multiple generations. No production-quality JavaScript engine uses the C malloc interface for individual JS objects; you'd get killed in the benchmarks if you tried.


Remember that malloc is a least-common-denominator interface. It needs to work everywhere. In a lot of places, it could be hard to justify the complexity.

Anyway, on glibc systems, you can use malloc_usable_size. On Windows systems (Windows gets the heap very right), you can use the HeapSize function.


The size of the allocated block is often not the same as the size passed to malloc, so the "duplicated" length is usually actually necessary.


Is this similar to the Flexible String Representation[0] in Python 3.3?

[0] http://legacy.python.org/dev/peps/pep-0393/


Nearly the same, save for the fact that Python also gives an option for UTF-32.


I wouldn't say it's an option: the string's internal representation might be UTF-32, but whether or not it is, is transparent to you, the coder. (Just as the JS change is transparent to the JS coder.)

However, the Python change wasn't entirely transparent: len() on a string now returns the length of the string in code points, whereas previously it returned the length in code units. Further, previously Python could be built with one of two internal string representations, so len(s) for a constant s could return different answers depending on your build. Now it doesn't, and len returning code points is much more useful.


I know this will probably get downvoted into oblivion, but is string space really an issue in the browser?

For example, this page at the time of this post has 68k of HTML, so 68k of text. Let's assume it's all ASCII but we store it at 32 bits per character, so it's 4x in memory, or 270k. Checking Chrome's task manager, this page is using 42 meg of RAM. Do I really care about 68k vs 270k when something else is using 42 meg of RAM for this page? Firefox is also at a similar level.

Why optimize this? It seems like the wrong thing to optimize, especially given the complexity added to the code.


Browsers and JS VMs are so widely deployed now that anything that can be done to make them faster or use less memory has a tremendous effect in the aggregate. What's really cool is that you can update a browser and suddenly the web starts working better for everyone who uses the browser.

Browser memory usage does tend to creep up unless there is active work to keep it under control. So I think it is important to try and trim the fat where possible. Just look at Are we slim yet? to see Firefox's memory usage slowly growing over time.

https://areweslimyet.com/

By the way, JS engine optimisation does seem like it is getting more and more complicated. I suppose most of the easy optimisations have already been done. Look at Safari adding a fourth layer to its JIT!

https://www.webkit.org/blog/3362/introducing-the-webkit-ftl-...

It's kind of crazy, but totally worth it! (IMO)

The string slimming work in IonMonkey sounds like it got some memory improvements without too much more complexity. Remember, the string logic is already complicated, with around 6 different string types. The kinds of pages where I think you'd see real memory improvements would be JS-heavy pages which store lots of data in memory. These types of pages are becoming more and more common.

I imagine it would also be a big improvement for any software running on Firefox OS. It would be especially noticeable because Firefox OS is targeted at low cost mobile devices.


I'd guess that it's actually quite an issue for JS-heavy pages. This would probably benefit anyone doing significant in-browser apps in JS.


Hmmm, checking Gmail for example, which is arguably a heavy page: it's got 4 meg of requests for various JS + HTML files, so 16 meg if expanded to 32 bits per code point. But it's using 160 meg of RAM. Strings are not where all the space is going, it would seem.


If you actually read the linked article, it has measurements for how much RAM strings use in Gmail. For the particular case of the article's author, it was about 11MB of strings before the changes he made; it was about 6-7MB of strings afterward. Your mileage will vary depending on what actual mails you have, of course.

Note also that comparing this to the Chrome numbers for overall Gmail memory usage is comparing apples and oranges: Firefox tends to use less memory than Chrome. You'd want to look at about:memory in Firefox to see how much memory that gmail page is likely using.


The raw source code is not the only string data in an application. Gmail especially will be heavily manipulating the DOM and a variety of other things (JS properties, JSON requests) which use strings internally.


Strings using less memory will also speed up string operations as you can see from their 36% win in the regexp-dna benchmark.


Why not compare with other browsers like Chrome, etc.?


In Firefox you can easily answer questions like "how much memory are strings taking up on this page", thanks to the fine-grained measurements available in about:memory.

I don't know of a way to get these measurements in other browsers. Chrome's about:memory page contains much coarser measurements, for example.


Chrome has the memory profiler for that.


Does that even work for memory usage by native code?


Latin1? I hoped it would die some day.


This is an internal representation. JS strings do and continue to behave as sequences of 16-bit integers.

This change takes advantage of the fact that most JS strings fit into an 8-bit charspace, so for those that do, it uses a more compact representation internally.

This optimization is simply: if we have a string and we know that all of the uint16_ts in the string are <= 255, then just store it as a sequence of uint8_ts.
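
In JS terms, the eligibility check is conceptually just this (illustrative only; the real check lives in the engine's C++):

    // a string qualifies for the one-byte representation iff every
    // code unit is <= 0xFF
    function fitsInLatin1(s) {
        for (var i = 0; i < s.length; i++) {
            if (s.charCodeAt(i) > 0xFF) return false;
        }
        return true;
    }

    fitsInLatin1("café")    // true:  every unit <= 0xFF
    fitsInLatin1("🍣")      // false: surrogates are >= 0xD800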


ES6.1 wishlist: UTF8 strings, full stop.


I want Unicode strings that support

1. Opaque cursors pointing somewhere in the conceptual sequence of code points, with constant-time dereferencing,

2. Ranges, defined by starting and ending cursors, and

3. The ability to move cursors forward or backward by either code points or composed grapheme clusters.

This would be a saner interface than any other I've seen, and it puts very few constraints on the underlying encoding.
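
What that might look like, very roughly (every name here is invented; just illustrating the shape of the API):

    var cur = str.begin();                   // opaque cursor, constant-time dereference
    cur.codePoint();                         // code point the cursor points at
    cur.nextCodePoint();                     // move forward by one code point
    cur.nextGrapheme();                      // or by one composed grapheme cluster
    var range = str.range(cur, str.end());   // a range is a pair of cursors
    range.toString();                        // materialize the range as a string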


1, 2. Grapheme clusters are not normative in Unicode, they can be tailored for specific languages. There's a default cluster finding algorithm but it's not suitable in all cases. There's no "one size fits all" approach.

3. Forward and backward are likewise language and tailoring dependent because they depend on graphemes. There may also be application-specific tailoring such as the handling of combining marks, in some scripts "forward" and "backward" are not clearly defined.


That's great stuff...that should be done after standardizing on UTF8.


Be careful what you wish for. Unicode strings are fucking complex. UTF-8 doubly so.

For example, which of the four Unicode normalization forms interests you most? Or do you need grapheme clusters? Or code points? Or byte values?


You already have UTF-16, which is both complex and inefficient (because of the two-byte representation of Latin characters).


Those are general Unicode issues, not UTF-8 issues.


I don't want UTF32, UCS-2, UTF16, or endian issues--that much I know for sure. :)



