Kill hashtables, get shorter code. (herdrick.tumblr.com)
44 points by JRM on July 19, 2010 | 45 comments



This is actually a pretty common technique in Haskell. If you look at the Alex lexical analyzer generator's source code, it represents CharSets as functions from Char -> Bool: they return true if the character is in the charset or false if it is not. Internally, they work just like the article describes: the union of two CharSets doesn't actually enumerate every possible character in the set, it just calls the two CharSets and performs an OR.
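A minimal sketch of the same idea, written in Clojure (the article's language) with made-up names rather than Alex's actual code: a charset is just a predicate, and union composes predicates instead of enumerating characters.

  ;; A "charset" is nothing but a function from char -> boolean.
  (defn charset [& chars]
    (let [s (set chars)]
      (fn [c] (contains? s c))))

  ;; Union never enumerates the members; it just ORs the two predicates.
  (defn union [cs1 cs2]
    (fn [c] (or (cs1 c) (cs2 c))))

  (def vowels (charset \a \e \i \o \u))
  (def digits (charset \0 \1 \2 \3 \4 \5 \6 \7 \8 \9))
  ((union vowels digits) \e)   ;=> true
  ((union vowels digits) \z)   ;=> false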

There are other languages that do this too, e.g. JavaScript written by functional programmers often ends up looking like this. In Arc, the distinction is completely meaningless, because you "call" a hashtable to index an object anyway.

The problem with it, I've found, is the lack of debuggability. You can enumerate a hashtable, and you can inspect it in the debugger. You generally can't enumerate a function. I've found that with complex systems, the strongest correlate of programming productivity is the amount of debug information available, not the size of the code. Trading off debuggability to get short code is usually a poor trade-off, once you get past a certain code size.


I've found that with complex systems, the strongest correlate of programming productivity is the amount of debug information available, not the size of the code.

I don't think this point is made often enough. Brevity is certainly a virtue, but ultimately what counts is not LOC, bug counts, iterations/sec, etc., but features delivered to users. Worse-is-better reigns supreme here, which is why you so often see engineering and design disasters like PHP, Craigslist, MySpace, and Windows brushing aside far more elegant but less engaging alternatives. For most of us, the pride we take in our craft makes this a very hard pill to swallow, but swallow it we must.


Regarding Haskell, yeah I thought as much. When writing this code I thought, "This would all be better, or maybe nearly automatic, in a purely functional language".

Regarding Arc, there's still a distinction, since as you point out you can't enumerate the domain of a function like you can the keys of a hashtable. (And you also "call" a hashtable in Clojure.)
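For anyone who hasn't seen it, here's roughly what "calling" a hashtable looks like in Clojure, and the enumerability difference mentioned above:

  ;; A Clojure hash-map is itself a function of its keys:
  (def freqs {"the" 3, "cat" 1})
  (freqs "the")        ;=> 3
  (freqs "dog" 0)      ;=> 0 (default when the key is absent)

  ;; Unlike an opaque function, though, its domain is enumerable:
  (keys freqs)         ;=> ("the" "cat")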


"But it’s hard to argue with brevity."

No, it's really not. Brevity != clarity. Clarity is hard to argue with, brevity is not. By his own admission this new code is harder to write and harder to debug.

It's interesting to me that in the article the author dismisses any performance impact of his changes because "we can profile and deal with it later", but fails to realize that by attempting to reduce the length of his code he is at risk of creating exactly the same type of needlessly hard-to-read, hard-to-debug code that premature optimization does.


That's a pretty old debate. I think brevity does a lot for clarity, and the more you work at making your code brief the better you'll be at reading brief code.

Of course if other people are reading your code then you should consider what current programming practice is. But the current practice in Clojure is, for cultural reasons, to make things really succinct. So you can assume those reading your code are ok with that.


> That's a pretty old debate. I think brevity does a lot for clarity

Up to a point, that's true, but if you start shortening words by removing letters, you've gone too far.


I disagree in two senses. First, it's a poor measurement of brevity that depends on identifier length. Even LOC (surely the gold standard of bad metrics) doesn't depend that much on that. Second and more interestingly, as code gets more concise, long explicit identifiers begin to detract from readability. Well-written concise code discloses valuable information just by the shape it takes. With experience reading such code, you start to grok it at a whole-expression level. Lengthy identifiers interfere with this gestalt: they provide an initial boost when reading the code, but at the long-term cost of overwhelming the code with lexical noise.

Concise programs often appear laughably unreadable to people who are used to a definition of "clarity" that derives from more verbose languages. At one time they appeared that way to me too. But with time spent working in less verbose languages, to my surprise I found myself wanting to use shorter and shorter names, not to interfere with clarity but to enhance it.


The length of an identifier should be proportional to its scope. But specifically I'm talking about function names shortened for brevity until they're no longer descriptive: map, filter, join, and split are better names than mp, flt, jn, and sp, no matter how often you use them.

It takes less effort to read a word than to trip over an unfamiliar abbreviation, so while the code may be more concise, it is less clear. If your code relies upon the reader to know a bunch of specific idioms, then it might be concise, but it isn't clear; it's jargon.


All words are arbitrary collections of letters. map, filter, join, and split - for at least 2 of those, the most common meaning is so far removed from the programming meaning that they might as well be something shorter like 'mp'.


That might be true if your users were a blank slate, but they aren't: they have existing language skills, and it's easier to remember words from ordinary language than some library writer's abbreviation.

Nor are all words arbitrary collections of letters; most words have some form of alternating consonant-vowel structure that makes them easy to speak and remember. Write a program that just arbitrarily combines letters and I doubt you'll get much output that looks like words; there are lots of patterns in how we create words, so it isn't arbitrary. If you can't read your code aloud and sound reasonable, then you've probably inappropriately abbreviated your identifiers.

Map is by far a better name than mp; there is no shortage of vowels; map is already a short identifier and a real word.


I dunno. Exceptional though they may be, I can imagine situations in which it would be a net gain to write mp rather than map. I don't think "readability" should only mean "readable by someone who is unfamiliar with the program's conventions and is encountering it for the very first time". It does mean that, but it should also encompass "readable by someone who is working with the code repeatedly, as if it were clay". Devices (like long names) that help you read the code de novo can get in the way after a while. Map itself doesn't mean what it says; it's a convention, one I bet was introduced for brevity. I think it's reasonable for languages and programs to adhere to conventions and ask the reader for patience while first working with them, especially if the notation has been crafted to maximize the intelligibility of the code over the long haul.

One doesn't just sit and speed-read math in a single pass. Nor poetry. Good programs have that kind of density and deserve the same consideration; they are not newspaper columns. The trouble with the shallow notion of readability that says "I should be able to random-access any line of code and understand it right away" is that it results in less intelligible whole programs, which is a net loss of readability and of other things as well.

I wish more good programmers understood this. The bad ones we can write off, but too many good ones have been schooled to focus overly on the line-of-code. If you have a million lines of code, it doesn't much matter how readable an individual line is; nobody's going to be able to read the whole thing. What we should be striving for is to produce an equally functional program with orders of magnitude less code (which would automatically be a more functional program). This requires a quite different notion of readability.


Reducing the size of a program by an order of magnitude requires better abstractions, not shorter selector names. I'm not saying names have to be long, I'm saying they should be whole words, not abbreviations of words. The English language is full of whole words that are still very short.

Math notation is horrendous, by the way: too many implicit assumptions about the reader's background knowledge. See what Gerry Sussman has to say about it. The great thing about code is that it forces one to be explicit, which forces one to actually understand and learn.


It's probably just that I haven't had any coffee yet, but I'd like to point out in mild irritation that this mangles everything I said:

Reducing the size of a program by an order of magnitude requires better abstractions, not shorter selector names.


Not my intention.


I'm sure that's true. It was probably just the drug talking :)


map is already a short identifier and a real word

This is heresy, but I actually think it might be better if it weren't a real word. I find I benefit from inventing names for functions or data abstractions. It helps to clear the mind of what those things might be.

Plus making map into mp or whatever makes it much more googleable. And don't get me started on Clojure having map and a Map.


Concur on both. It sounds like we're in the same tiny statistical bucket.

By the way, there is a variant of "googleable" that I find to be a valuable property for names in a codebase, and that is "greppable". I try to make sure that the name of any important concept is a unique string in the program, so a simple grep will yield all the places it occurs. I'll even rename things that aren't as important, if their name includes this string, to preserve said uniqueness. (And no, the 'find usage' feature of IDEs doesn't come close to satisfying this need, since a concept can appear in many ways that don't yield to code analysis, such as comments.)


I don't care if you invent words, if you call your function zork that's fine, it's still a pronounceable word. When I say real word, I mean it has consonants and vowels and is speakable. It's abbreviations that annoy me.


Really? OK, that's interesting. I guess I prefer speakable ones too.


Agreed all around. Whether or not a name sounds right when you say it is one of the criteria we use on our project for naming design ideas. If you're interested in the rationale behind this, Eric Evans' book Domain Driven Design articulates a concept of design as the "ubiquitous language" of a project that I have found very useful.


One of my favorite features of Coffeescript (and other languages) is destructuring assignment, which lets you use pattern matching over an object to pull out values from nested structures. It's way more useful than this proposal: brief when you want brevity, but you still get the full object returned from the method when you need to debug, and apis don't have to be adapted or rewritten with the usage pattern in mind.


Coffeescript is certainly cool, but Clojure has sophisticated destructuring assignment as well.
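For instance (a quick sketch, not from the post), pulling values out of a nested map with Clojure's destructuring:

  ;; Bind title, author, and the nested word count in one pattern:
  (defn summarize [{:keys [title author] {:keys [words]} :stats}]
    (str title " by " author " (" words " words)"))

  (summarize {:title  "Moby-Dick"
              :author "Melville"
              :stats  {:words 209117}})
  ;=> "Moby-Dick by Melville (209117 words)"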


The point was not "Coffeescript is better than Clojure", it was "destructuring assignment (like in Coffeescript, for instance) is a better solution than the one proposed in the post". If Clojure has sophisticated destructuring assignment, awesome. Use that instead.


I suppose I failed to see how destructuring assignment on hash-maps accomplishes anything interesting at all related to the post. The post eliminates hash-maps altogether.


It eliminates them from the api, but how does memoize work? There are still hashtables, you just never get a hook on them.

I'm just saying that I prefer having that hook, and that destructuring assignment lets me do so with brevity and clarity.
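For reference, clojure.core/memoize keeps its cache in a hash-map hidden inside a closure; it works roughly like this sketch (not the exact source):

  ;; Roughly how memoize works: the cache is a hash-map in a closure,
  ;; keyed by the argument list; the caller just never sees it.
  (defn memoize* [f]
    (let [cache (atom {})]
      (fn [& args]
        (if-let [e (find @cache args)]
          (val e)
          (let [result (apply f args)]
            (swap! cache assoc args result)
            result)))))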


Destructuring is a pretty dangerous practice because it completely destroys encapsulation.


Good stuff. My only comment is to make a macro like defmemo. That way instead of writing something like: (def set-m (memoize set)) you could write (defmemo set-m set).
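A sketch of such a macro (hypothetical, not from the post):

  ;; defmemo: define a memoized version of f under the given name.
  (defmacro defmemo [name f]
    `(def ~name (memoize ~f)))

  (defmemo set-m set)   ; expands to (def set-m (memoize set))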


Yep, that'd be better. I didn't want to have to explain a macro, though I guess anyone bothering to read the code can handle a simple macro. Thanks, maybe I'll change that.


When he talks about performance, he mentions that memoization keeps performance from suffering very much, but what he doesn't mention is that almost the entire performance gain comes from memoizing frequencies... which stores the hashtables anyway; it just does it behind the scenes. This is fine for CPU usage, but it does have a significant storage penalty depending on the characteristics of the input.

There's some low-hanging fruit to cut down on the required storage, namely wrapping to-words and frequencies together, and memoizing that. This doesn't reduce the number of hashtables, but at least there's no need to store the entire word lists. As it stands, this will simply fail if your corpus is too large to fit in your RAM all at once.
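Something along these lines, where to-words and frequencies are the article's functions and the wrapping is just a sketch:

  ;; Memoize the composed step, so only the frequency maps are
  ;; retained, never the full word lists:
  (def word-freqs
    (memoize (fn [doc] (frequencies (to-words doc)))))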


Without profiling it, I'd guess it's memoizing the function that does the disk I/O that makes the biggest difference.


Right, I realized after writing that comment that I was forgetting about that one. Still, memoizing frequencies takes it from O(n^2) to O(n) right now, so it's easily the second-largest gain.


In summary: instead of getting a hash back from a function and pulling out the (single!) desired key, pass the key to the function and get the value back directly.

A limited-use but handy technique, and they're not arguing for dogmatic adherence to the rule. Something to keep in mind, but don't go re-wiring old code unless you're bored. You could also code the function so if you pass no key-args, you get the whole hash back.
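In Clojure terms, the shape of that suggestion might look like this sketch (stats is a hypothetical example, not the article's code):

  ;; Instead of handing back a whole map for the caller to index...
  (defn stats [text]
    {:words (count (re-seq #"\w+" text))
     :chars (count text)})
  (get (stats "the cat sat") :words)   ;=> 3

  ;; ...take the key too, with an arity that still returns the whole
  ;; map when no key is given:
  (defn stat
    ([text]   (stats text))
    ([text k] (get (stats text) k)))
  (stat "the cat sat" :words)          ;=> 3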

I'm personally a fan (sometimes) of code like this when I really want to one-line something but need multiple values:

  # ruby
  value1, value2 = [(x=aFunction(arg))[:key1],x[:key2]]
:)


I hate it when I see code like that. It's so much harder to maintain. You've saved two lines of code but made the intent at least 10x harder to infer. I've been wading through the Rails codebase lately and this kind of thing gets to be annoying fast, particularly when you're trying to debug a function that's littered with these kinds of expressions, often nested.

Honestly it's this kind of thing that makes me wonder if the Pythonistas don't have a point about clarity vs cleverness.


That's just excessive "cleverness." I don't think you should equate that with concise coding. I could write an equally "clever" snippet that takes 12 lines to do the same thing. Really, you're only saving one line of code over a perfectly readable version:

  x = aFunction(arg)
  value1, value2 = x[:key1], x[:key2]
And you can still write a one-liner much better than the original:

  value1, value2 = aFunction(arg).values_at(:key1, :key2)
In essence: That style of coding is not really clever, and I would argue it's not idiomatic. It's doing more work to compensate for a missing tool (in this case, not knowing about the values_at method).


Aside from the use of a temporary variable (which can be used for self-documentation anyway), people aren't saving any operations with the one-liner. They are just cramming more operations into a single line.

If LOC was so important for readability, we could just write in C without using any newlines.


The point isn't to save operations or lines of code — it's to reduce semantic load and make code more readable. In most languages, calling a function actually results in more operations than just doing copy-and-paste with the same code, but I don't think you'd dismiss well-factored code on those grounds.

A rough gauge for better code might be how much a skilled coder can grok by glancing at N lines of code. This means that reducing LOC can be helpful, but it's not a 1:1 correspondence — your example of C with no newlines would make it harder to read. On the other hand, putting what is essentially a hash destructuring on one line actually does tighten up the semantics.


Ooh, yeah, had forgotten about that one. Thanks!


Oh, I know it. And I would never put it in a project I'd expect other people to read, much less in a large project. I've got stuff like that in some of the easier Project Euler solutions, because I don't want to solve them in the most rational fashion.

edit: my most-epic-single-line is a range definition, and I actually use this one as-is with a description of how it works:

  [((left=prefix*10*scale-1)==-1 ? (scale==0 ? prefix-1 : prefix) : left)..((right=prefix*10*scale+(10*scale-1))==-1 ? prefix-1 : (scale > 0 ? right-1 : right))]
If passed a prefix, this iterates over all array entries where the index matches /^prefix/ (one-based, I have a zero-based too), entirely skipping non-matching ones. Just need to increase scale appropriately, and loop until it finds nothing. It was fun to come up with.


pythonistas: we do


Aside from the maintainability of this technique, it sounds as though it's taking one of the least efficient routes towards doing its job. (I'm not a LISPer, and although I should stop and think about the code there until I grok it before commenting, I'm looking for a lazier answer.)

So, let's say we have two documents, and want to calculate their distance. Each document has 2,000 words. We can either scan each word in each document once, counting them in memory and processing a total of 4,000 words, or we can start with a word in one document, scan both documents for all occurrences of that word, move on to the next word, scan both documents for all occurrences of that word, and so on, for a total of ~N! scans. Given a choice between a really elegant function running in factorial time, or a longer, uglier function running in linear time ... I'd kind of prefer the latter.

Am I way off the mark here? Is there something I misunderstood?


Sorry, I don't understand what you're asking. The way it works now is: each document is read and summarized into a bunch of word frequencies, once. It looks like each is being read many times, but since I'm memoizing, it only happens once. Is that helpful or are you suggesting something else?


"That sort of sucks. Further, it makes some functions, like euclidean, hopelessly non-general, since euclidean now has to know what function to call to get a frequency [3]."

No. You can just pass in the appropriate, memoized frequency function to call. In the new code, that just means that "freq" would be an additional argument to euclidean, instead of a globally defined function.
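As a rough sketch (names and signature hypothetical, not the article's exact code), euclidean just takes freq as a parameter:

  ;; freq is whatever (memoized) word-frequency function the caller
  ;; wants to use; euclidean no longer knows or cares how it works.
  (defn euclidean [freq words doc1 doc2]
    (Math/sqrt
      (reduce + (for [w words]
                  (let [d (- (freq doc1 w) (freq doc2 w))]
                    (* d d))))))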


Right. That's what I said in the footnote, [3].


Yup, that's what abstraction does for you. The less policy the caller has to implicitly understand, the more flexibility you have in the underlying representation, and the less code you have to write.


I can't remember which philosopher it was but he said something like: "If you have something that you can actually express, you can then also express it concisely."



