
Unicode strings are really, really hard.

I'm working on a programming language sort of like K or APL. Among other things it has "good" Unicode support (I'm not fully convinced such a thing is even possible). It introduces a truly frustrating amount of complexity, especially in comparison to how simple ASCII strings are in K or APL (just an array of characters, so you get all the powerful array operations for free, as shown in this post).
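To illustrate one of the pitfalls (in Python rather than an array language, purely as a sketch): treating a string as an array of codepoints and reversing it detaches combining marks from their base characters.

```python
# "café" spelled with a combining acute accent (U+0301) following the 'e'
s = "cafe\u0301"
backwards = s[::-1]   # naive codepoint-array reversal
# backwards == "\u0301efac": the combining mark now sits orphaned at the
# front instead of attaching to the 'e' it belonged to
print(backwards)
```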

The approach I've more or less settled on is that strings are not arrays at all, but it's possible to get arrays of codepoints (composed or decomposed) or grapheme clusters from them. Really this just pushes complexity to the caller, but it means you get more control over what exactly you intend to work with. Pretty much all the solutions here could be straightforwardly translated to my programming language, except you would need to choose which definition of "character" you're going with.
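For comparison, Python's standard library exposes the composed/decomposed choice in much the same way. A small sketch, assuming the precomposed 'é' (U+00E9):

```python
import unicodedata

s = "caf\u00e9"                          # precomposed 'é'
nfc = unicodedata.normalize("NFC", s)    # composed codepoints
nfd = unicodedata.normalize("NFD", s)    # decomposed: 'e' + combining acute
print(len(nfc))  # 4 codepoints
print(len(nfd))  # 5 codepoints
```

Same text, two different "arrays of codepoints" — and grapheme clusters are a third view that the standard library doesn't even offer (you need a third-party library such as `regex` with its `\X` pattern).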

I basically like this solution, but the deeper I get the more I become convinced that manipulating an arbitrary user-provided or even user-facing string in any way is a recipe for disaster; there's often simply no sensible way to handle things like directional formatting control codes (or bidirectional text in general).

In other words, I almost think there should be separate types for "computer-facing" strings (like filenames) that one commonly has to manipulate, and "human-facing" strings that one just wants to either display or store somewhere and ideally never touch because you will screw up somewhere.
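A minimal sketch of that split in Python (the type names are made up for illustration): the machine-facing type gets the full byte-level toolkit, while the human-facing type deliberately offers nothing beyond display.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MachineString:
    """Computer-facing text (filenames, keys): compare, slice, pass to the OS."""
    raw: bytes

@dataclass(frozen=True)
class HumanString:
    """Human-facing text: display it or store it, never manipulate it."""
    text: str

    def display(self) -> str:
        return self.text  # deliberately the only operation offered
```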



> In other words, I almost think there should be separate types for "computer-facing" strings (like filenames) that one commonly has to manipulate, and "human-facing" strings that one just wants to either display or store somewhere and ideally never touch because you will screw up somewhere.

That's a fascinating idea, thanks. I imagine it's impractical due to numerous places where machine and human strings overlap, but I'm going to have to ruminate on it for a while.


The GFile API in GTK+'s underlying glib family of libraries kind of has this, at least for your example of filenames. It separates the "actual" name of a file from its display name, and has APIs to support that division.

From the documentation [1]:

All GFiles have a basename (get with g_file_get_basename()). These names are byte strings that are used to identify the file on the filesystem (relative to its parent directory) and there is no guarantees that they have any particular charset encoding or even make any sense at all. If you want to use filenames in a user interface you should use the display name that you can get by requesting the G_FILE_ATTRIBUTE_STANDARD_DISPLAY_NAME attribute with g_file_query_info(). This is guaranteed to be in UTF-8 and can be used in a user interface. But always store the real basename or the GFile to use to actually access the file, because there is no way to go from a display name to the actual name.

This makes implementing things that deal with filenames a lot (like, cough, a file manager) quite interesting.

[1] https://developer.gnome.org/gio/stable/GFile.html
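Python's os module draws a similar line: filenames are fundamentally bytes, and the str view is an escape-hatch encoding that round-trips rather than guaranteed-sensible text. A sketch, assuming a POSIX system with the usual UTF-8/surrogateescape filesystem encoding:

```python
import os

raw = b"caf\xe9.txt"              # Latin-1 bytes: not valid UTF-8
name = os.fsdecode(raw)           # str with surrogate escapes; lossless
assert os.fsencode(name) == raw   # safe to hand back to the OS

# lossy but safe for a UI, like GFile's display name: no way back
display = raw.decode("utf-8", errors="replace")
print(display)                    # 'caf\ufffd.txt'
```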


There's a similar concept in Rust, where filename strings are OsString https://doc.rust-lang.org/std/ffi/struct.OsString.html , which can be converted to the Unicode-like String.


With perhaps Rust's greatest contribution to the public discourse, the WTF-8 format.


It's definitely pretty leaky (just prompting the user for a filename already breaks it; you either leak the rules for machine strings to the user or you pray to god they don't enter something too weird).

However, the problem with interaction is really only one-way, since machine strings can be safely promoted to human strings. Unfortunately even concatenating Unicode strings is not necessarily straightforward: in A+B, if A contains a right-to-left embedding but you want B to display left-to-right, you need to surround B with a left-to-right embedding and a pop-directional-formatting code (this actually solves any issues I can think of off the top of my head, but no one does this).
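A sketch of that wrapping using the legacy embedding controls described above (LRE U+202A, PDF U+202C); note that current Unicode guidance prefers the isolate pair FSI U+2068 / PDI U+2069 for exactly this use case:

```python
LRE, PDF = "\u202a", "\u202c"

def concat_ltr(a: str, b: str) -> str:
    """Append b so that an unterminated right-to-left embedding in a
    cannot change how b is displayed."""
    return a + LRE + b + PDF

rtl = "\u202b\u0645\u0631\u062d\u0628\u0627"  # RLE + Arabic text, never popped
print(concat_ltr(rtl, ": 42 items"))
```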

The really bad problems involve parsing user-provided non-programmer-oriented text (e.g. markdown). I really don't know if there's a robust way to do that.


Hmm. Surely in order to "display" you need to "manipulate" (someone has to write the text rendering library, although ideally only once). And the simple action of "enter filename" crosses both those domains.

Then there's the whole classic domain of parsers and lexers as the front end to programming languages. I appreciate this gets upsettingly difficult if it's also a security boundary, where things like invisible spaces are a threat, but it remains important.

Maybe what we need is to go the other way from asking interns to reverse strings, and ask library writers to provide some slightly higher-level functions that don't rely on regex. Perhaps LINQ for strings? Most languages give you "split", which is the very beginning of a tokeniser, but we need something a bit more powerful.

A good test case might be writing the notoriously difficult "do these two URLs refer to the same resource?" program.
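Even a very rough cut at that problem in Python shows how quickly the decisions pile up — a sketch under deliberately naive assumptions (case-insensitive scheme and host, default ports stripped, empty path treated as "/"; real RFC 3986 equivalence is far subtler):

```python
from urllib.parse import urlsplit

def same_resource(a: str, b: str) -> bool:
    """Very rough URL equivalence check; ignores dot-segments,
    percent-encoding, IDNA, fragments, and much more."""
    def norm(u):
        p = urlsplit(u)
        scheme = p.scheme.lower()
        host = (p.hostname or "").lower()
        port = p.port
        if port == {"http": 80, "https": 443}.get(scheme):
            port = None                  # drop default ports
        return (scheme, host, port, p.path or "/", p.query)
    return norm(a) == norm(b)

print(same_resource("HTTP://Example.com:80/a", "http://example.com/a"))  # True
```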


It sounds a lot like what Go and perhaps Swift do. Perhaps other languages.


Erlang has two different string(ish) value types: binaries and lists of integers. I suppose they could fit into that conceptual model.


Isn't this just what the lisp world calls symbols?


No; symbols are opaque; to manipulate them at all you have to convert them to strings. (For example you can't concatenate or reverse :foo and :bar or iterate over their characters.)


Ah, I see what you mean now. I was taking "computer facing" for strings you don't have to manipulate often.

Though, I am confused: how often are you manipulating a filename? Or even strings in general. Parse? Sure. Manipulate? Seems uncommon.

Formatting, I can grant. But that is different from manipulating a string. More building up from others. And, outside of madlibs, not much you can hope for. Surprising amount of distance from madlibs, I suppose.


(I consider "formatting" under "manipulating". Parsing too, actually. Really anything where you have to consider the parts of the string individually and not just the whole.)

By and large I think you're right; as the root post said you pretty much never actually have to reverse a string.

Specifically though I think manipulating filenames is not really unusual; for instance adding or reading a file extension, filename<number>.<ext>, splitting into directory paths, etc. Filenames are actually an interesting case because they're a very good candidate for having their own type (since they have some internal structure to them and operations on that structure).
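Python's pathlib is a decent existence proof for that: a dedicated path type where the extension and directory operations are built in, rather than done by ad-hoc string slicing. A sketch:

```python
from pathlib import PurePosixPath

p = PurePosixPath("/data/report.tar.gz")
print(p.suffix)                # '.gz'
print(p.suffixes)              # ['.tar', '.gz']
print(p.with_suffix(".zip"))   # '/data/report.tar.zip'
print(p.parent)                # '/data'

# "filename<number>.<ext>" without touching raw strings (Python 3.9+):
print(p.with_stem(p.stem + "2"))   # '/data/report.tar2.gz'
```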

Which leads me to another hypothesis, which is that the number of times you have to do something to truly unstructured data, other than compare it, is really low, and we actually don't need a generic "string" type at all. Unfortunately this is not really feasible in a world where the Unix principle of "everything is a stream of bytes" is ubiquitous.


I was reading a distinction between recognising parts of a string, and changing parts. Such that I can't remember the last time I modified a string. I can think of plenty of times I took one apart and used a piece of it. Usually well structured to avoid the corner cases. Or, again, forced into a madlibs style structure to present to the user.

But yeah, looking for palindromes is a thing I can't recall ever having done. A prefix tree for searches? Done, but that didn't need well formed text/strings for it to work.


>A prefix tree for searches? Done, but that didn't need well formed text/strings for it to work.

I think this does need "well formed" strings and the complexities of Unicode. Is "a" a prefix of "ä"? Is "앉" a prefix of "앉다"?
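The first of those flips depending on normalization form — a quick Python illustration:

```python
import unicodedata

a_umlaut = "\u00e4"                      # precomposed 'ä', one codepoint
print(a_umlaut.startswith("a"))          # False: no 'a' codepoint in sight
# decomposed form is 'a' + combining diaeresis (U+0308)
print(unicodedata.normalize("NFD", a_umlaut).startswith("a"))  # True
```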


In my case, I was fine with binary prefixes.

And searches don't even need Unicode to get difficult. Consider: to, two, too, 2, and II. Should those all find each other? Highly dependent on context. And likely you will be reimplementing NLP before you realize it.


Even worse: is "ﬀ" a prefix of "ﬃ"?
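As raw codepoints, no; after Unicode case folding or compatibility normalization, yes. A Python sketch, assuming the single-codepoint ligatures U+FB00 (ﬀ) and U+FB03 (ﬃ):

```python
import unicodedata

ff, ffi = "\ufb00", "\ufb03"             # each is one codepoint
print(ffi.startswith(ff))                # False: unrelated codepoints
print(ffi.casefold().startswith(ff.casefold()))  # True: 'ffi' vs 'ff'
print(unicodedata.normalize("NFKC", ffi))        # 'ffi' as three letters
```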


Yes, of course.

Computers should be seen as machines for manipulating symbols, not as machines for manipulating a small number of fundamental data types into which symbols are squeezed, usually with associated wreckage.

A file path, a shell command, a time/date, a URL, on-screen text in a browser, on-screen text in a text editor, and code in an editing window are all completely different data types. They can be implemented as char arrays - usually poorly - but that doesn't mean they're smoothly interchangeable with clear interfaces.

So in reality they're neither abstracted nor standardised nor designed properly, and the result is a lot of pain and confusion, because developers default to "So this is a string..." instead of thinking of them as separate types implementing distinct abstractions with hugely different requirements.


Mind sharing a bit more about the programming language you're working on? Sounds interesting!


It's very much a "this is productive for me and fits how I think, maybe to the exclusion of making sense to other people" project.

It's closest to J or Dyalog APL but with a much different syntax that aims to make it more natural to write entirely pointfree code. (Personally, I feel like long trains can become kind of hard to read and refactor in J. The fork and hook syntax is really nice for short trains but IMO does not scale up very well.) I've been fiddling with the syntax on and off for about 5 years and use it as a sort of general notation for algorithms.

There's actually a fair bit of Erlang in there as well, which is honestly kind of an odd combination, but actors + arrays hits a sort of local optimum for me. It might be the first APL-like with really good I/O.


Thanks for sharing, that sounds quite nice. Always interesting to see what kind of new programming ideas are being worked on.



