Iterators and Streams in Rust and Haskell

runeks · on July 11, 2017

I find it really interesting that the idiomatic Haskell implementations (basically math) are the best-performing, while the Rust-like Haskell implementation (using an IORef) is orders of magnitude slower. This is exactly what I want: describe the logic of the operation, and leave the compiler to optimize for the hardware (in this case a CPU which mutates registers). The Rust implementation makes assumptions about the underlying hardware (has registers we can mutate), and is only about 15% faster than the Haskell implementation which makes no such assumptions.

In essence, this is why I love Haskell, and choose it over Rust: it allows me to write my application logic directly, without having to think about it in terms of mutation, and have the generated code be pretty fast. If GHC becomes well-optimized enough it can render Rust obsolete, since "no runtime overhead" becomes pretty meaningless if it's actually slower than Haskell (e.g. using LinearTypes, which removes need for GC). Rust can't render Haskell obsolete, however, since Haskell's goal is basically allowing you to write logic directly, using types as propositions and values as proofs. So Haskell's goal is a qualitative one (execute logic) while Rust's is a quantitative one (performance/no runtime overhead), which results in Haskell being able to take the place of Rust if GHC gains sufficiently in performance.

adwn · on July 11, 2017

> If GHC becomes well-optimized enough it can render Rust obsolete [...]

Ah yes, the mythical Sufficiently Smart Compiler (TM), scheduled to arrive any day now. Will it also be able to reliably transform non-trivial functions that are accidentally O(n) in memory consumption due Haskell's laziness, to O(1) versions?

The keyword here is reliably. There are many clever but brittle compiler optimizations that aren't used much in the field, because when performance is crucial, you can't rely on the compiler to detect this transformation opportunity in some cases and not in some other cases, so you code it explicitly.

chongli · on July 11, 2017

reliably transform non-trivial functions that are accidentally O(n) in memory consumption due Haskell's laziness, to O(1) versions?

What you're describing is called strictness analysis. [0] I believe that attempting to solve it in general runs up against the halting problem. That being said, GHC does perform some strictness analysis when optimization flags are turned on. [1] The language itself also includes strictness annotations on the fields of data constructors. [2] Experienced Haskell programmers know how to use these to avoid allocating unnecessary thunks.

[0] https://en.wikipedia.org/wiki/Strictness_analysis

[1] https://wiki.haskell.org/Performance/Strictness#Strictness_a...

[2] https://wiki.haskell.org/Performance/Data_types#Strict_field...

lmm · on July 11, 2017

Will Haskell's compiler ever be 100% reliable? Maybe. Will Haskell's compiler ever be more reliable than the average developer? IME it's already there, and has been for a while: for a given functional requirement and a given level of performance, the cheapest way to achieve it is usually Haskell or similar.

rtpg · on July 11, 2017

I understand this frustration, but my understanding is that patterns exist to _ensure_ transformations go through exist in Haskell land. Im thinking about stream fusion for example. Lots of "pipe-y" libraries exist to make sure your program consumes input in a streamed manner.

So while your O(n) code might sometimes unreliably be turned into O(1), if performance is your game you can use these kind of libraries to endure O(1).

jeffdavis · on July 11, 2017

Are you saying there's a way to use haskell without the GC or other runtime overhead?

I like rust because it doesn't have much in the way of runtime requirements (almost zero, like C).

runeks · on July 11, 2017

> Are you saying there's a way to use haskell without the GC or other runtime overhead?

There isn't yet, but there will be when, first, the LinearTypes extension arrives (scheduled for GHC 8.4) and, secondly, GHC's garbage collector is modified to take of advantage of the fact that linear types are always consumed exactly once. This will take 1-2 years or so.

After this it should be possible to write no-runtime-overhead Haskell functions by using the "lollipop" arrow after an argument, and then consuming that argument exactly once in your function. So you will have to split your functions into small enough functions, that consume their arguments exactly once, and then compose them into something that will not need a garbage collector at runtime. In theory you could write an entire application this way, although in practice you probably wouldn't bother avoiding the GC for the outermost constructions of your program, since this would offer little in performance gains and be a lot more difficult (remember, a no-GC function must consume its arguments exactly once).

But time will tell how this will compare to something like Rust's borrow checker.

evincarofautumn · on July 11, 2017

There’s been some work lately by Tweag I/O on a linear types extension to GHC.[1][2]

It would allow us to safely allocate things from malloc memory or a “linear heap” separate from the main GC heap, and that should reduce pause times by reducing GC pressure, but I don’t think we’ll ever be able to avoid the GC entirely.

However, we can still expect some nice performance advantages from the fact that linear types can enforce guarantees about inlining & fusion (which are currently somewhat unpredictable) as well as resource lifetimes.

[1]: https://github.com/tweag/linear-types

[2]: https://github.com/tweag/ghc/tree/linear-types

infinity0 · on July 11, 2017

I can certainly see it's plausible Haskell can beat Rust in the mid-term future on high-level programming away from the bare metal. Less certain that it can beat Rust for system-level programming or low-level library programming, where people using the library don't want to link to a heavy runtime that includes a GC.

pjmlp · on July 11, 2017

Not having issues with using a GC language is also a huge win.

I can grasp Haskell without major issues, follow C++14 and C++17 meta-programming tricks, yet I still don't understand quite well how to deal with the borrow checker on Rust.

Still haven't given up on it, but it shows the learning curve is a bit high, or then it is me that cannot wrap my head around it.

runeks · on July 11, 2017

I'm not that familiar with Rust but, as far as I can see, it makes sense to say that Rust's borrow checker is an interface to a O(1) garbage collector (as fast as no GC at all), that the programmer has to implement at compile-time for all values. All the borrow checker-code is compiled into a no-runtime-overhead garbage collector, that is embedded into the resulting generated code.

In the world of Haskell, the -> arrow embeds a GC (with runtime overhead) into the genrated code. A similar interface to Rust's borrow checker in Haskell (no runtime overhead) would be the ⊸ arrow from LinearTypes (in a future where GHC's GC is able to take advantage of this).

It looks, to me, like the arrow (⊸) notation is a lot simpler inteface to a no-runtime-overhead garbage collector -- or the absense of a traditional GC, in other words -- than Rust's borrow checker. But I'm looking forward to seeing how it plays out (it will probably take years before it becomes reality).

littlestymaar · on July 11, 2017

> I'm not that familiar with Rust but, as far as I can see, it makes sense to say that Rust's borrow checker is an interface to a O(1) garbage collector (as fast as no GC at all), that the programmer has to implement at compile-time for all values. All the borrow checker-code is compiled into a no-runtime-overhead garbage collector, that is embedded into the resulting generated code.

Not really : the borrow checker is just a compile-time analysis tool which checks that you respect some conventions with pointers :

- each value has only one owner

- each pointer to a value must leave less time than the target value

- you must only have one mutable pointer to a value

- you can have as many immutable pointer to a value as you want as long as there is no mutable pointer to the same value (think of a statically-checked version of a RW-lock)

The borrow-checker isn't responsible for freeing memory at all. Freeing memory is done by RAII like in C++, but the conventions enforced by the borrow checker ensure that the deallocation of memory is safe.

pjmlp · on July 11, 2017

The problem I was hinting is that it looks simple on paper and when following the book exercises, but then one picks up a random GUI library, needs to map callbacks and event handlers, it starts to get a bit harder to understand if one needs a lifetime annotation (which one), (A)Rc, RefCell, and starts to jungle them around until it compiles.

Again, maybe it is just me and need to spend more time learning it.

kirillkh · on July 11, 2017

You're right, that's the difficult part about Rust. Every novice has to go through these hoops to acquire the intuition. I'm midway through and feel comfortable working with Rusty libraries, but interfacing with C libraries that widely employ shared state is a major pain. Fortunately, more and more libraries get rewritten in Rust or acquire sane Rust wrappers.

littlestymaar · on July 11, 2017

> Again, maybe it is just me and need to spend more time learning it.

I had the exact same experience : after my first reading of the Rust book, I though I understood what's going on but once I tried to actually write something I was totally lost. I stopped using Rust for several months and when I came back, I read the book again and this time I had no troubles writing stuff and the concepts of ownership and lifetime became really clear.

I think the concepts behind the borrow-checker are really straightforward[] once you've internalized them, yet they are really unintuitive to the beginner. It's quite paradoxical.

[] except when the borrowck is not yet smart enough : http://smallcultfollowing.com/babysteps/blog/2016/04/27/non-...

valarauca1 · on July 11, 2017

    I'm not that familiar with Rust

It is often best to not comment on things you don't understand.

    but, as far as I can see,

Ah you'll power though anyways!

    it makes sense to say that Rust's borrow checker is an
    interface to a O(1) garbage collector [...]

You aren't _that_ far off in terms of CS fundamentals there is a lot of theory you are missing. Rust's _GC_ is called RAII [1] it is used by C++ as well. Effectively the compiler traces the lifetime of an object and inserts a drop/free/destructor call when it leaves scope.

[1] https://en.wikipedia.org/wiki/Resource_acquisition_is_initia...

runeks · on July 11, 2017

It would have been more clear for me to have said: "for the purpose of discussion, let's view Rust's borrow checker as an interface to a O(1) garbage collector [..]".

jmcomets · on July 11, 2017

Don't hesitate to check out #rust / #rust-beginners on irc.mozilla.org, I was struggling with BCK for quite a while and asking dumb questions to experienced rustaceans has been very helpful. I've now been using rust for 2 years and am very happy with it.

pjmlp · on July 11, 2017

The guys on the forums were very nice, my kudos and appreciation to them.

Rust will never be more than an occasional coding exercise for me personally, because doesn't fit the type of work I do (Java/.NET/UWP/Web/Android) where project tooling is dictated by customer IT.

However coding exercises in whatever language X helps to open our mind to new ways of solving problems, that we can eventually use in our daily tools.

When I give these statements my goal isn't to bash the language, rather give another point of view that might help understanding which areas the design team should focus, or not.

After all it is their final call and they need to pay attention to those that are actually using it in production and their real problems, not random dudes like me.

pklausler · on July 10, 2017

> But look again: C is taking 87 nanoseconds, while Rust and Haskell both take about 175 microseconds. It turns out that GCC it able to optimize this into a downward-counting loop, which drastically improves the performance. We can do similar things in Rust and Haskell to get down to nanosecond-level performance, but that's not our goal today. I do have to say: well done GCC.

Downward-counting or not, it is simply impossible for GCC to generate code that executes all 1,000,000 iterations of the loop in 87ns. That would be 87 femtoseconds per iteration, on average.

More likely, GCC figured out how to collapse the entire loop into a closed-form expression that is a function of the loop length.

gizmo686 · on July 10, 2017

I just tested this on my machine (gcc 5.4.0). At -O2, gcc produced normal looking assembly code. At -O3, gcc produced a monstrosity [0] that I don't feel like fully deciphering.

However, from a brief glance, it does not appear to have created a closed form solution. Instead, it contains a single loop:

    .L4:
        addl    $1, %edx
        paddd   %xmm1, %xmm0
        paddd   %xmm2, %xmm1
        cmpl    %edx, %eax
        ja  .L4

which seems to be using a SIMD instruction (paddd[1]) that adds does 4 32-bit integer additions in parallel.

After this loop, it does some "housekeeping" (read, something I don't understand) before proceeding to an unwound version of the last iterations of the loop:

    leal    4(%rdx), %ecx
    addl    %edx, %eax
    cmpl    %ecx, %edi
    jl  .L2 
    addl    %ecx, %eax
    leal    8(%rdx), %ecx
    cmpl    %ecx, %edi
    ...
    jl  .L2 
    addl    %ecx, %eax
    leal    28(%rdx), %ecx
    cmpl    %ecx, %edi
    jl  .L2
    addl    %ecx, %eax
    addl    $32, %edx
    leal    (%rax,%rdx), %ecx
    cmpl    %edx, %edi
    cmovge  %ecx, %eax
    ret

Where .L2 is just:

    .L2:
        rep ret

I assume that this is just some form of return, but the documentation I could find [2] seems to suggest that rep is a prefix for string operations, which doesn't make sense.

[0]https://pastebin.com/raw/Y55gQG7p

[1] http://x86.renejeschke.de/html/file_module_x86_id_226.html

[2] https://c9x.me/x86/html/file_module_x86_id_279.html

gpderetta · on July 11, 2017

the rep in rep ret is ignored, is just used for alignment; the 'housekeeping' code is to handle non-multiple of 8 loop counts.

Still, unless I'm missing something, the code should be executing 8 adds per clock[2]; at 4ghz, that still above 1us for 500k adds.

GCC doesn't seem to be able to fold the loop given a constant expression, unless the function is explicitly declared constexpr; in which case it will complain about the accumulator overflowing, but gcc doesn't seem to be taking advantage of it.

Clang does not vectorize the loop but will replace it with a constant given a constant parameter.

Bottom line, I'm not sure what's going on with the article's measurements.

[2] potentially 12 for skylake or even 24 with avx.

implr · on July 11, 2017

If you pass -march=skylake it will get even more monstrous, with AVX: https://godbolt.org/g/v7iKF3

kccqzy · on July 10, 2017

I don't think so. https://godbolt.org/g/GBZgER

nemetroid · on July 10, 2017

Perhaps the entire call of the function was replaced by a constant, then.

mastax · on July 11, 2017

Yep: https://godbolt.org/g/LCUah9

    calls_cheating():
        movl    $60, %eax
        ret

gizmo686 · on July 11, 2017

No. The "constant" is only provided at runtime from Haskell through its FFI. GCC has no opportunity to optimize that.

pcwalton · on July 11, 2017

Another possibility is that GCC discovered that the loop had no side effects in the benchmark harness and eliminated it. This is a very common thing that happens in microbenchmarks.

gizmo686 · on July 11, 2017

The c code defines a function that returns the sum computed in the loop. This function is than run using Haskell's FFI. GCC has no way of knowing there is no side effects, because other code might link to it and use the result of the function.

GHC, in concept, does know that there are no side effects, but that would apply to all of the functions being tested. Further, since it is using a benchmarking library, I assume the authors of said library were aware of the issue, and wrote it in such a way as to force evalutation of the function.

https://gist.github.com/snoyberg/9b1c77b595c4adf90880213fc49...

gpderetta · on July 10, 2017

I.e. benchmarking is hard.

e12e · on July 10, 2017

Being able to tell that a loop is gone shouldn't be too hard with a diff of the assembler output with optimizations on and off - at least for fairly trivial code.

adrianratnapala · on July 11, 2017

I assume the 87 nanoseconds means the average per loop iteration, not all 1,000,000 iterations.

red75prime · on July 11, 2017

It's unlikely. The entire loop should have taken 0.87 seconds then. That's a bit too much for several million operations, which don't seem to require memory access.

pklausler · on July 11, 2017

0.087, but that still seems high.

iainmerrick · on July 10, 2017

I feel like the author has felt obliged to include the full results, which is noble, but it's mostly obscuring the interesting results.

What does it matter if the "cheating" versions are faster, since they're doing something completely different? (OK, in principle it could be the same with an unrealistically magical optimizer.)

Seems to me the key point is that a bunch of high-level constructs in both Rust and Haskell are very nearly as fast as a tight loop in C. That's great!

The versions that are much slower don't seem very surprising, as they involve boxing the ints. (Edit to add: OK, reading more closely, I guess 'Haskell iterator 5' is interesting to dig into.)

ozataman · on July 11, 2017

No idea why people reacting here so far got fixated on the "cheating" versions - it's clear to me they were included mainly to set a maximal speed baseline/benchmark and are not the main point of the article.

runeks · on July 11, 2017

I find the "cheating" versions peculiar because I don't see the purpose of it. What's the point, and in what way is it cheating? It's just a different algorithm, and doesn't add any useful information to the subject at hand.

chriswarbo · on July 11, 2017

Numerical operations in a loop are often subject to aggressive optimisation by C compilers, which makes them tricky to use in benchmarks: are we measuring the intended loop, or has the work been optimised away? Often comparisons are made of "Blub vs C", where the C result is an order of magnitude smaller, and it's not clear if that's because C is fast or whether it's been optimised away.

Including an "optimised away" version lets us know when this has happened: the "non-cheating" benchmarks take much longer than the "cheating" ones, so we can assume they've not been optimised away.

I assume the author only went into detail about them because they're independently interesting, regardless of the main topic of the post.

runeks · on July 11, 2017

The solution to this is to deliver arguments at runtime, rather than baking them into the program as constants. Describe some computation by a data structure that is delivered at runtime, and see which implementation does best. That way there can be no cheating.

chriswarbo · on July 11, 2017

> The solution to this is to deliver arguments at runtime, rather than baking them into the program as constants.

Functions already take their arguments at runtime. Except when they don't, due to optimisation.

For a benchmark to be automated, reproducible, etc. those constants have to be baked in somewhere, even if it's in a Haskell program using FFI (as in the article), or a shell script, etc. Whilst optimisers don't (yet) cross the language/process boundary, it still makes sense to include such sanity checks, rather than assuming we know what the optimiser will/won't do.

After all, the whole point of a benchmark is to gather evidence to question/ground the assumptions of our mental model. The less we assume, the better. The more evidence we gather, the better.

iainmerrick · on July 11, 2017

Well, the article starts with them and the chart starts with them! But the whole article would be better off without them.