Hacker News new | past | comments | ask | show | jobs | submit login
Joe Armstrong: "In my opinion Erlang is brilliant at handling text" (groups.google.com)
62 points by drewr on Nov 27, 2009 | hide | past | favorite | 56 comments



Erlang now has a full-featured standard regex module called re that handles utf8 strings and binaries. It now has a standard module for handling unicode.

Yes, regexes aren't part of the language syntax, but then that's not true for, say, python, either. Having to use a library to do regex is not going to be the difference maker in productivity for your app.

Besides, while I love regex, I still use them as a last resort. You're going to want to use a real parser for reliable structured text processing.

Generally, erlang programmers keep strings in binaries, which are compact. Most modules for handling string type tasks allow you to do this, e.g., the re module understands a unicode_binary type.

In 6 months of programming erlang professionally, in a domain dominated by scripting languages like python and ruby, I've certainly never been tempted to bolt over string handling issues. Erlang's flexible distribution, concurrency, and reliability model is just too compelling.

To get competitive performance in a reasonable amount of programmer time for concurrent applications in other languages you're limited to the subset of tasks that, say, twisted or tornado makes easy. The program I'm writing now couldn't have been done with either of them.

Frankly, if it's the choice between built-in support for regex and built-in primitives for distribution, concurrency, and fault tolerance, there's no question in my mind which is more important.


He compares Erlang's string handling with C... And C sucks at string handling. He should compare Erlang to all the modern languages where strings are a first class citizen. Comparing Erlang's string handling to Java, Python or Ruby might change his mind.


I think the main (and almost the only) reason to have distinct string type is performance, not for the ease of programming. Many string operations are useful as a general list operations as well, including regular expression matcher. (Note: Some people emphasize importance of O(1) access of string access by index, but using integer index is also a performance hack. If search operations can return some way to point to the substring you don't need integer indexes.)

Another minor reason is to display; people prefer reading sequence of characters in a string syntax. If you have a statically typed language it is easy to display a list of chars in string syntax instead of list syntax. For a dynamically typed language with heterogeneous lists, it can be a performance penalty to check whether a list entirely consists of characters or not at runtime. So, in a sense, it is also about a performance. (Note: Having a syntax for strings has nothing to do with having distinct type for strings. The string syntax can be just a syntax sugar.)

But performance is important, of course. One thing very common in string (a list of characters) but not very common in general lists is concatenation. To be precise, lazy language programmers use list concatenation without a guilt, but eager language programmers tend to avoid it since it may cause unnecessary copying of lists. So for the eager evaluation languages, it makes sense to have a string type that has very cheap concatenation operation (e.g. using tree representation) internally.


"Many string operations are useful as a general list operations as well, including regular expression matcher."

Except this falls apart when dealing with Unicode -- if your regex wants to match "ä", for example, treating strings as lists means you'll miss at least one possible way of representing that in Unicode (it may be a single code point, or it may be two -- an "a" with a combining diaresis).


That is a sort of tangent problem you have to deal with whichever you have distinct string type or list-of-character strings. You may canonicalize, or you delegate stuff to intermediate library.


It's not a tangent problem. It's at the very core of what a string is. I think you're confusing a representation of a string with what a string actually is, because it is convenient for your mental model of programming to believe - as an act of faith - that strings are lists of characters. Strings are just blobs of data that encode human-interpretable text. There is no underlying symmetry which proves that they fit into a monadic concept of a list, and indeed, they don't fit particularly well there.

Built-in string types can do things like warn that literal strings aren't convertible into target string type, at compile time; that an assignment from one string type to another may lose information, etc. Different string types may use different encodings, etc.


Hm. I'm assuming internal string representation is canonicalized and independent from specific external encodings (or implementation hides it). I'm not sure what you mean by "target string type".


In part it is motivated by my experience in Delphi of moving from a so-called AnsiString (encoded in current Windows code page) to UnicodeString (encoded in UTF-16) between Delphi 2007 and Delphi 2009. I implemented the initial RTL routines and helped with some of the compiler support.

Delphi has multiple string types to handle all the backward compatibility issues. Ancient Pascal strings are limited to 255 characters; AnsiString (current code page), WideString (a COM BSTR), UnicodeString, and things like Utf8String (magic UTF-8 code page), etc. Assignments between the different strings perform conversions, and cause warnings for possible data-loss.

The more you learn about how strings work, including the international aspects, legacy aspects, OS-specific aspects, conversions at source code -> executable -> runtime -> I/O boundaries, etc., the more you appreciate the situation really isn't trivially reducible to simple lists of characters.


The question is, having that many string types are inherent for string handling? Or are they more of residue of historical artifacts?

Some data, represented in lists, may have extra constraints. If elements have meaning in its order relative to each other, reversing such a list doesn't make sense. If every elements need to be power of 2, putting 3 into such list doesn't make sense. Yet, having constraints doesn't mean they need to be a distinct type from lists. You can implement them in a general lists and handle constraints with libraries. Why do strings need to differ? You don't want to reverse string character-by-character, or arbitrary indexing into a string may not make sense, but yet a general list operations like fold, filter, take-while, etc. may be useful.

Now, to avoid further confusion, let's separate CES and CCS. Utf-8, utf-16, ucs-4 are all CES variations of Unicode CCS. They can be converted freely without loss of information, and there are no reason that implementation can handle it implicitly, except performance issues.

If you need extra constraints, such as ascii-only or length-limited, you can create a specialized type wrapping the basic string and implement constraints there. That can be done in user-level and doesn't require such special types to be built-in. E.g. if a legacy library only accepts ascii-only string, it takes an argument of type AsciiOnly [Char]. Check can be enforced at type coercion.

Composed characters are headache, but the library to deal with them can be built on top of list of unicode codepoints (if we adopt unicode codepoints as Char). What's important here is that you can build some algorithm, say normalization, on top of list-of-char view. If you, as a language user, want to try out a new normalization scheme, you can do that, and you can have all the list manipulating tools in your hand. And you can wrap the resulting normalized string in a specialized type if you want to keep integrity.

Incompatible CCSes are more of a problem; e.g. there are several different conversions between Unicode and other Japanese character sets, and that have caused loss of information. I wish I can write CCS-neutral code, but eventually I need to deal with differences. This could be handled by having distinct UnicodeChar and JISChar, maybe.

Actually, there is a more fundamental question: What is a character? Depending on an application you may want different unit as a character. I think that's a good argument against "list-of-character" view. So another way is to have an opaque string type, which can show different level of abstraction depending on what an application asks (e.g. it may return graphemes, fully composed characters, unicode codepoints, etc.). That is plausible. Although I argue that it can be implemented as a library on top of list of basic characters (e.g. Unicode codepoints).


"Many string operations are useful as a general list operations as well, including regular expression matcher."

The way to get around that is to have some concept of interfaces in the language, and then have a generic Sequence type that strings, arrays, lists, and a bunch of other concrete types all implement. This makes your sequence operations even more polymorphic, eg. you can have them work on ropes, iterators, tree traversals, etc.

The real problem with the list representation for strings is memory consumption. English UTF-8 text in a byte array takes one byte per code point. English unicode text in a list where each code point is an immediate 32-bit int takes up 8 bytes per character. That's a factor of 8 difference in memory consumption.

And more memory means you can't fit as much into cache. There's a huge difference between being able to keep all your strings in L2 cache vs. having to go to main memory, particularly for string manipulations, which may need to touch a lot of different memory locations without a lot of locality.

"Some people emphasize importance of O(1) access of string access by index, but using integer index is also a performance hack. If search operations can return some way to point to the substring you don't need integer indexes."

Also, UTF-8 usually means you have to give up O(1) indexing. Your string may have multi-byte characters, which you can't discover unless you iterate through it.

In practice, this doesn't matter. The vast majority of string slices chop a few characters off from either the beginning or the end. You can do Boyer-Moore on byte sequences and return the byte offset instead of the character offset. Most other indexing operations involve some sort of iteration over the string anyway. The only real pathological case is when you need to find the midpoint of a string repeatedly.


Strings aren't lists of characters. In that misconception, many common coding errors lie.

For example, you might think that reversing a string is equivalent to reversing its characters. It's not.

Similarly, you might think that indexing a string is the same as indexing a list. It's not.

Etc etc.


That very much depends on your definition of a String.

I'm not sure what you mean by reversing a string not being equal to reversing its characters. It's certainly not always equal to reversing its bytes (depending on the encoding). Same goes for Strings that are implemented with UTF-16 characters (e.g. Java, win32 wchar_t etc.) all suffer from bad abstractions driven by implementation/efficiency concerns.


His point is that the reverse of the two character unicode string "LATIN-SMALL-LETTER-O COMBINIG-DIAERESIS" is not "COMBINING-DIAERESIS LATIN-SMALL-LETTER-O", but the original "LATIN-SMALL-LETTER-O COMBINIG-DIAERESIS". Blame it on the unicode consortium, but that is the way it is.


I wouldn't necessarily blame it on the unicode consortium but rather on the assumption that a character in a programming language should conform to the notion of a character in UCS-4 or some other unicode encoding.

From a programming language perspective, I think it would be ideal if strings could always be viewed as lists of encoding independent characters, such that reversing this list is equivalent to reversing the string.

If you want to be able to maintain a one-to-one mapping to unicode, you will need to use unique characters for an "ä" and an "a with combining-diaeresis" but ideally that should be hidden from the user of the language.

Thus, in my ideal world, both "LATIN-SMALL-LETTER-O-WITH-DIAERESIS" and "LATIN-SMALL-LETTER-O COMBINING-DIAERESIS" would each be one single element of a list of characters which is a string.


This problem exists even if we ignore unicode. What is the reverse of the following string?

    foo\r\n\bar
I imagine the most useful answer if you are writing a notepad-level application would be:

    rab\r\n\oof


At the level of abstraction I would expect from a String object, the reverse of \r\n should be \n\r. Basically, you example is another encoding issue. I think strings should be free from any encoding concerns. Those should be handled when parsing/serializing.


Even in UCS-4, strings are not reversible by reversing the code points. Are you suggesting that Unicode combining characters are an efficiency concern, rather than a correctness one?

A string is more like a bytecode program that can be executed to evaluate to a piece of human-readable text.


Performance and ease of use are the main reasons why strings should be first class citizens. Strings are one of the most used data structures and most of today's popular languages have very good support for them, encodings of them and manipulation of them. C does not have that good support for them. Ignoring strings and labeling them as "a list of integers" is a step backward, since a lot of the data we have today is textual and will continue to be textual in the future.


Having dynamic typing makes the discussion complicated, so let's assume we have a cheap way to know the type of a given object.

In one world, you get an object of type [Char] ---means a list of characters--- and you can apply all sorts of list operations on it, and all sorts of operations specialized to [Char]. You can add a type alias String to the type [Char]. In the source file you can write "string" and it is read as a list of six characters. On the output the same list is printed as "string".

In another world, you get an object of type String, which is distinct type from a list of characters. Type String has all sorts of useful operations. But if you want to apply a generic list algorithm, you have to either duplicate the code, or coerce the string into a list. Conversely, if you have a list of characters and want to pass it to a string library function, say, regexp matcher, you have to coerce it to a string.

Which is easier?

As nostrademons commented, one way is to implement a generic interface so that you can write a generic algorithm on top of both list of characters and Strings, but that's actually the same thing I'm saying. I say "list" as some data structure on which you can peek the head, the tail, and you can add an element in front of existing one. I don't care how it is represented---if the runtime or the compiler can find out the list only contains ASCII characters, it can freely store the entire list in an octet array. In a sense, I say "list" as "data structures that implement the list interface".

Now, suppose if you have such a smart runtime/compiler. Suppose you can have specialized functions on [Char], apart from generic list. Do you still think having distinct string type is for ease of use?

In reality we don't have such sufficient smart runtime/compiler, so we compromise. That's the distinct string type.


I would prefer if strings were treated as a list of characters and NOT as a list of integers (as they are in Erlang)...! I.e. your reasoning does not really apply to Erlang.


Oh, I thought you were arguing with my proposition: Distinct string type is for performance and not for ease of programming.

I agree that conflating [Char] and [Integer] is not good. That's a different story.


I disagree with your proposition. Languages should have a string type and not only for performance, but also for ease of programming, since strings are an important and special data type. I don't really care how this string type is implemented - if strings are an array of characters (like in Java) or a list of characters (like in Haskell). What I think is important is that a language treats strings as first class citizens - and not just for performance, but also for ease of use. And currently, strings aren't a first class citizen in Erlang (due to Erlang's limited support for custom datatypes).


Let's forget Erlang. We've agreed that conflating [Integer] and string is bad.

I can't find the reasoning to back up your claim in your posts; if I miss it, could you point it? I think I explained a few points that a language does not need distinct string type, except from performance reasons.

Note that I've never said that strings shouldn't be a first class citizen. A list of characters is a first class citizen. You can have rich string library on top of lists of characters, plus generic list operators works on them. So, why do you want a string type disjoint from lists?


It seems the key is having a good "Char" type, so that Unicode bytes turn into something matching the intuition of what a character should be. So the bytes for "a umlaut" or whatever become a single Char in the list. So the Char type is responsible for storing a representing characters at the right level of granularity, and then string operations can all be implemented as list operations.

Is there a case where this still breaks down?


I didn't specify that a string should be a distinct datatype, but merely a datatype and a first class citizen of a language. If a language's type system is strong enough, then obviously a distinct data type isn't needed.


computers exist to serve humans, and strings are the best way for them to communicate to their masters..


You miss the point. Yes, C isn't perfect when it comes to handling strings, but any halfway decent editor isn't based around having a single giant string in-memory anyway. C and Erlang can do just fine if you have, say, a tree of (length, data) tuples, with e.g. a couple of "shadow trees" for undo. C will be somewhat faster and lighter, and Erlang has automatic memory management, but there's no particular reason why either wouldn't work.


I think you miss my point: Comparing Erlang's string handling to one of the first mainstream languages is pretty ignorant of the evolution that has happened the last 30 years. He should compare Erlang to recent languages, before concluding that Erlang's string support is great.


Actually, Armstrong was talking about text handling, which is not the same thing - certainly not in the context of a thread about writing an editor (in Erlang).

I'm by no means an Erlang expert, nor have I ever written an editor; but I don't think it self-evident that, say, language support for regular expressions is a significant aid. (Automatic memory management may be helpful, but Erlang has that and it's not what we're talking about anyway.)


True, I imagine many of the commenters discussing which language to write a text editor in do not know how to write a text editor any more advanced than notepad. Certainly I have not written one.


I don't know very much about erlang but I can understand why it has the stigma that it doesn't handle text well as almost every introduction I have seen has said this is the case due to it being created by telecommunication companie[s].

However my familiarity is limited and whether this is true or not I cannot comment on, but comparing erlang to C in performing text based operations is probably silly if the other person has an option such as say perl available to them.


String processing is just one of those things you can't offload to a second process. Having Perl in the same box doesn't mean you have the luxury to open some IPC channel and pass your texts to Perl. String processing is one of those types of problems that just benefit from a little thought and 10 seconds of planning.

C itself is bogged down by its own horrible string representation, the nul termination, where most operations need to traverse the string up to the terminator before they know where it ends. This causes all sorts of horrible buffering tricks, conditional tests on every character, and other unpleasant things. Pascal had length-prefixed strings from day one, and DJB created Netstrings to make network programming easier on people. See http://c2.com/cgi/wiki?LeasedString


Even though he has a good point: strings are lists, and erlang is good at lists — erlang doesn't have native regexps like perl, js, or ruby, nor is the support for them that good.


Frankly, I think "native regexps" has proved a mistake, not a virtue. If you're using so many regexps in your code that you actually care whether the syntax is optimized for it, the odds of you Doing It Wrong (TM) are very, very high. (And it has to be direct use, too... if you want something like a regexp-based dispatch map like Django uses, Perl doesn't even have a significant character advantage over Python; r"" vs qr//.) Making it easy to do the wrong thing and harder to do the right thing (than the wrong thing) has certainly wrecked up a lot of Perl code I've had to deal with.

The majority of my professional programming is in Perl with many other programmers, and the number of times I see people do something like $settings =~ /read/ over a string containing settings, often complete with more than one setting that has the substring "read" even though they're only looking for one particular one... oi. Makes me sick.


i think this is a cultural thing. ruby is at least as convenient to deal with regexes as perl, but we don't have the same kinds of problems. i think this is because in ruby, the culture is to use the hippest tool for the job. frankly, regular expressions aren't very hip. YAML or JSON is much more hip and therefore likely to fill that niche. Perl-users made regular expressions something that we all expect each other to know. Did they go too far? Seems that way (just like java-users obsession with introducing abstraction layers helped the programming community learn DP.)

Your point implies that exposing a powerful mechanism that is easily abused should be avoided. I disagree. I think this is a matter of culture and not "law".


Picking what approach to use based on whether it is "hip" is just a terrible way to write software.


Aside from the fact that regex are implemented as a library, can you elaborate how support for them is not that good? I think that's wrong:

http://www.erlang.org/doc/man/re.html


Lisp doesn't have native regexps, but of the two main Lisp regexp libraries, one, cl-irregsexp, is 3 times faster than Perl, 6 times faster than Python, 7 times than C, and 8 times faster than Ruby.

Scroll to the bottom of the page http://common-lisp.net/project/cl-irregsexp/

If you have been following some recent papers on regexp performance, there is consensus that things could be a lot faster with better algorithms. I expect the game to change dramatically soon.


The cited benchmark is completely bogus.

The benchmark compares the different implementations based a single trivial regular expression: /indecipherable|undecipherable/. You simply cannot claim a regex engine is faster than another with such a poor experiment. It is evident that Boyer–Moore string search algorithm will outshine any engine on that regular expression.


Ouch! Ineed, it's a lousy benchmark; I only recommended it from memory because it blew me away the first time I saw it.

FWIW, if anybody can recommend a good benchmark, I would be happy to do a write up since I am proud of the regex performance of the other CL library (cl-ppcre.)


This is true - looked at making a very concurrent webcrawler in Erlang, and the regex bit was painful.


From 12Bx that changed, regexps are enjoyable but still not as rich as Perl.


Maybe someone should write a parse transform for dealing with them?


Cool, I'll look again. Thanks.


Hmm... I also wrote a concurrent webcrawler in Erlang, and at no point was I tempted to use regular expressions.


Then we were crawling for different purposes - imagine that! :)


instead of being snarky, can you please provide the reason why regular expressions were required?


I needed to do pattern matching to pull data out of many different formats, then clean the data before processing it. We were parsing radio station song feeds, and the data is varied, chaotic and often unavailable. This was also a port of a perl POE app, and so it was regex intensive in its original implementation and I don't see how you could effectively deal with such noisy text without regexes.

As to snark - his reply was snark, I just replied in kind. Regexes are incredibly useful for all kinds of things, most especially in parsing data from web services. The fact that he didn't need them isn't 'funny,' it means we were doing different things.

In any case - using Erlang for something like this is so much win. The POE, and threaded implementations got real ugly real fast as we scaled it up. I knew Erlang could do it - across boxen, without a problem. It sounds like the regex libs have improved, and I look forward to using Erlang again in the future.

Happy? :D


How does Erlang fare with Unicode in practice?


If I understand it correctly, Erlang fares well here because it stores each character as a 32-bit integer in a list. The implication of that approach is more memory overhead for each character, but that allows you to treat strings as a list of characters and gives you all the benefits of the standard list-manipulation libraries.


But that tells us nothing about the quality of Erlang's Unicode support. Is it easy in Erlang to encode a string to its UTF-8 or Latin-1 representation and vice-versa? Are there utilities in Erlang to check whether a character is a punctuation symbol, a letter or a digit? Are there utilities to normalize Unicode strings? This one is important if you need to compare two strings.

Sorry if I sound a bit pedantic, but there is a lot more to good Unicode support than using wide characters.


Being pedantic: Erlang doesn't have a 32-bit intger type, it only has integers. The implementation uses the fixnum trick known from lisp, i.e. if the value is larger than a fixnum it is implemented as a bignum.


> Erlang fares well here because it stores each character as a 32-bit integer in a list

Which is silly because if you make a list of 32 bit integers that have ASCII equivalents then Erlang assumes it is a string.


The main difference is the way it's displayed by default in the shell. Strings are handled with list operations, and Erlang is great for list operations.


Yeah, but still: it lets you operate on unicode "characters" pretty easily. With a ton of associated downsides (unclear string/list of integers dichotomy, wasted memory for ascii, etc)


Erlang doesn't make any assumptions about what it is.

To Erlang, it is a list of 32 bit integers. There is no difference between a list of integers and a list of characters.

The shell will transform this list of 32 bit integers with ASCII equivalents into text for your convenience, but it doesn't do any sort of conversion or typing.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: