Skip lists in Python

sonoffett · on June 19, 2011

The relevant paper, "Skip Lists: A Probabilistic Alternative to Balanced Trees": http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117...

joe_the_user · on June 19, 2011

Serious question: Is it worth implementing this kind of data structure in Python. Wouldn't Python's overhead overwhelm any speediness you'd get?

mjb · on June 19, 2011

Of course it's worth implementing better data structures in Python. There are two ways to make a Python program faster: make the Python program faster, or implement computationally intensive parts in another language (including using a library written in another language). Which of these approaches is 'best' depends heavily on what your goals are. There are many advantages to keeping code in pure Python rather than moving to C - not least issues like portability and maintainability.

There are also some Python operations that have a large constant time - operations which are much slower, but still O(1), in Python than they are in lower level languages. In these cases, choosing algorithms which reduce the number of operations of that type can be a bigger win than a simple big-O analysis would suggest and allow pure Python code to compete better with code written in lower-level languages.

Better algorithms are the right answer the fast programs more often than faster code. There is room for both, but the "if the code isn't fast you may as well not bother with a good algorithm" claim is generally false.

joe_the_user · on June 21, 2011

There are two ways to make a Python program faster: make the Python program faster, or implement computationally intensive parts in another language (including using a library written in another language).

Sure but why not do both??

Take the most computationally intensive parts, use a different language and use an efficient data structure in that other language. That might seem like a lot of work but fortunately there are handy-dandy libraries easily available to let you do this with even less effort than a simple language extension; just toss your data into SQLite or Kyoto Cabinet and let them deal with it. As is standard practice in, uh, many, many applications?

Your point about speeding up the program by improving the algorithm is valid - but it's only valid when the initial algorithm is doing something more or less unique. If the program is doing which a data base does, you should use a database like SQLite or a data store like Redis, Kyoto Cabinet, etc.

ColinWright · on June 19, 2011

The speediness you get is algorithmic. SkipLists can give you log time operations on a list - Python can largely only give you linear time.

(The above is a serious simplification, but contains enough truth to be worth stating. No doubt others more knowledgeable than I can expand and enhance.)

jerf · on June 19, 2011

"SkipLists can give you log time operations on a list"

So can balanced binary trees, and a number of other structures. I'd demand benchmarks, because Python is not C (and I don't mean just trivially, I mean, it is very not just C); just as one for instance, are you hitting a pessimal case with the GC with this algorithm? You never know.

I'm not actually demanding them, since it didn't seem the point of this exercise. My point is that you can't just assume.

ColinWright · on June 19, 2011

Interesting and valid points. My point is that the SkipList algorithms are, using the assumed primitives, logarithmic. Your point is that in Python the assumptions about the primitives may not be valid and need to be tested.

It's especially interesting to me that the standard large body of work on algorithmic complexity is assumed by many to be getting less and less relevant as other issues come into play. Issues such as memory access time, where it now matters if something is in cache or not, and if so, which cache. Issues such as whether your machine will stop completely at some inconvenient moment to perform a GC when you least expect or desire it. Issues such as whether your language is in fact caching stuff in hash tables, and so what appears in your code to be linear, isn't.

And so on.

More and more it seems necessary to resort to experiment and tests to see what happens, rather than being able to determine things for sure and definite via analysis.

Exciting times, but slightly disappointing.

gchpaco · on June 19, 2011

You're right to be skeptical about the value of skip lists in comparison to other O(log n) type data structures like balanced trees and the like. Skip lists trade ease of implementation for probabilistic bounds, and have some minor cache advantages over balanced binary trees owing to the data structure generally not changing shape all that much, and it's hard to say whether these differences would materialize in Python.

However if I had no O(log n) structures at all and needed to implement some, I'd have to be a maniac to want to do balanced binary trees, which have notoriously delicate balancing algorithms, over skip lists which are very sweet and simple.

liuliu · on June 19, 2011

Besides, if I remember correctly, the most convincing point of skip list is that it provides better concurrency. In the environment where python still have GIL, I would be skeptical about that.

pjscott · on June 19, 2011

Exactly. Skip lists are great for two reasons (and as far as I know, only these two reasons):

1. They're really cool and clever. This is a valid reason.

2. They offer easier concurrency than most alternatives. It's straightforward (though not quite easy, unless you have transactional memory) to make a lock-free concurrent skip list. Try that with a heap or a red-black tree, and you'll quickly run into all sorts of memory conflicts and crazy-complicated locking. The fact that a skip list only provides probabilistic logarithmic time bounds really makes coordination between threads easier.

(Note that it's possible to do some similar stuff with modified versions of other data structures. For example, you can make a good concurrent dictionary by taking a red-black tree, relaxing the invariants, and adding a periodic rebalancing thread to run in the background. But that's a topic too long to fit into this post.)

VBprogrammer · on June 19, 2011

I think the biggest advantage of skip lists is that if a man held a gun to my head and forced me to implement a logarithmic time algorithm I could implement skip lists in an hour or so from memory. Any of the others...maybe given a week and a couple of good books.

ColinWright · on June 20, 2011

  ... if a man held a gun to my head and forced me
  to implement a logarithmic time algorithm I could
  implement skip lists in an hour or so ...

Does this happen to you often?

tshaddox · on June 19, 2011

> The speediness you get is algorithmic.

Don't you mean asymptotic?

ColinWright · on June 19, 2011

No, I mean that the speed increase you get is from a change in the algorithm.

kragen · on June 19, 2011

Three answers.

First, sure, CPython's only about two orders of magnitude slower than C, so on big enough data sets, the asymptotic performance rules. N log N is 100 times better than N^2 as soon as N is over about 1000. (The obvious-to-me way to implement an ordered map in Python would be to maintain a dict and a sorted list of keys; inserting N items into a sorted list is N^2.)

Second, this kind of data structure might be fast in PyPy. Like, C-like fast, or worse in the constant factors by only a factor of 2 or 5. Certainly you could imagine JIT optimizing dynamic language implementations where that was the case. But I haven't tested.

Third, usually not.

godDLL · on June 19, 2011

I regularly see someone make a Python extension in Python and then rewrite the critical parts in C. It might have been made with this kind of process in mind.

goldmab · on June 20, 2011

A skip list written in Python is itself an optimization. You could simply use a sorted list with the bisect module.