I completely agree. As someone who works on a parallel functional language, I can say it's very hard to sell a parallel language that isn't as fast as parallel Fortran or hand-tuned C code that uses pthreads and the fastest parallel implementations of BLAS and other libraries.
The people who really care about performance are using those. The ones who don't are, honestly, mostly still writing code with large constant factors of _sequential_ performance sitting there as low-hanging fruit. Sure, they'd take free performance, but the rewrite/porting/debug costs (even with automatically parallelizing compilers for the same language) are at least as high as just firing up a profiler.
I'm increasingly of the opinion that if you can't win a top spot in the supercomputer performance fight, you have to have a unique application domain. Erlang's seems to be reliability. I suspect that a parallel variant of JavaScript that runs in every browser will end up being the next compelling parallel language, as opposed to anything from those of us who are either inventing or attempting to resurrect languages that target x86 or GPUs.
We believe so! Our project leader is focused on how we can parallelize general-purpose applications easily. With more and more people writing in statically typed functional languages (Haskell, OCaml, F#, etc.) that have relatively poor parallel scalability without massive program transformations, we think there is an opportunity there.
That said, we're at a "go big or go home" point. We either need to ramp up significantly from the current 1.5 grad students + 2 undergrads per year or wrap things up. Getting to a point where we can be used in general-purpose projects requires a lot of work, none of which results in papers.
If I had to guess, the most probable impact is what you would expect from PL research - the lessons we've learned being folded into other systems down the road:
- CilkPlus has a nice first pass at a work-stealing scheduler, but we showed how to do it without static tuning by the programmer, and with lower overheads to boot (a sketch of the basic mechanism follows this list).
- Vectorization that takes advantage of wider vector hardware requires transforming data structures (e.g., array of structs to struct of arrays; see the layout sketch after this list). We showed how to do that automatically and how to reason about the resulting changes in program performance.
- We have boatloads of papers - at both workshops and conferences - on what has to be done to the compiler and runtime to run efficiently on NUMA multicore systems. Right now, most functional language implementations do not run into these problems because they cannot scale past their own parallel overheads. Once past that bottleneck, the next one will be memory traffic, at least in our experience.
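To make the first bullet concrete for readers who haven't met work stealing: below is a minimal, lock-based C sketch of the per-worker deque the technique revolves around. This is a generic illustration, not our scheduler (and certainly not CilkPlus's lock-free one); every name here (deque_t, push_bottom, steal_top, etc.) is invented for the example.

```c
/* Sketch of a work-stealing deque (lock-based for clarity; real runtimes use
 * lock-free designs such as Chase-Lev). The owner pushes and pops at the
 * bottom (LIFO, good locality); idle workers steal the oldest task from the
 * top, which tends to hand them a large chunk of remaining work. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct { void (*run)(void *); void *arg; } task_t;

#define DEQUE_CAP 1024

typedef struct {
    pthread_mutex_t lock;
    task_t *items[DEQUE_CAP];
    int top;      /* steal end */
    int bottom;   /* owner end */
} deque_t;

static void deque_init(deque_t *d) {
    pthread_mutex_init(&d->lock, NULL);
    d->top = d->bottom = 0;
}

/* Owner pushes freshly spawned work at the bottom. */
static bool push_bottom(deque_t *d, task_t *t) {
    pthread_mutex_lock(&d->lock);
    bool ok = (d->bottom - d->top) < DEQUE_CAP;
    if (ok) d->items[d->bottom++ % DEQUE_CAP] = t;
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Owner pops its most recently pushed task (LIFO). */
static task_t *pop_bottom(deque_t *d) {
    pthread_mutex_lock(&d->lock);
    task_t *t = (d->bottom > d->top) ? d->items[--d->bottom % DEQUE_CAP] : NULL;
    pthread_mutex_unlock(&d->lock);
    return t;
}

/* A thief steals the oldest task (FIFO from its point of view). */
static task_t *steal_top(deque_t *victim) {
    pthread_mutex_lock(&victim->lock);
    task_t *t = (victim->bottom > victim->top)
                    ? victim->items[victim->top++ % DEQUE_CAP]
                    : NULL;
    pthread_mutex_unlock(&victim->lock);
    return t;
}

/* Tiny single-threaded demo of the API; a real scheduler runs one worker
 * loop per core: pop locally, and only when that fails, pick a victim and
 * try steal_top on its deque. */
static void hello(void *arg) { printf("task %s\n", (const char *)arg); }

int main(void) {
    deque_t d;
    deque_init(&d);
    task_t a = { hello, "A" }, b = { hello, "B" };
    push_bottom(&d, &a);
    push_bottom(&d, &b);
    task_t *mine   = pop_bottom(&d);   /* owner gets B (newest) */
    task_t *stolen = steal_top(&d);    /* thief gets A (oldest) */
    mine->run(mine->arg);
    stolen->run(stolen->arg);
    return 0;
}
```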
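And for the second bullet, here is a tiny C sketch of the array-of-structs to struct-of-arrays layout change itself. The particle-style fields and function names are invented for illustration; the point is just that the SoA loop touches memory at unit stride, which is what lets a vectorizer actually use wider vector hardware.

```c
/* AoS vs. SoA: same data, same loop, very different memory access pattern. */
#include <stddef.h>
#include <stdio.h>

#define N 1024

/* Array of structs: x, y, z of one particle are adjacent, so the x's of
 * successive particles are strided through memory. */
typedef struct { float x, y, z; } particle_aos;

void move_aos(particle_aos p[N], float dx) {
    for (size_t i = 0; i < N; i++)
        p[i].x += dx;          /* strided access: awkward to vectorize */
}

/* Struct of arrays: all x's are contiguous, likewise y's and z's. */
typedef struct { float x[N], y[N], z[N]; } particles_soa;

void move_soa(particles_soa *p, float dx) {
    for (size_t i = 0; i < N; i++)
        p->x[i] += dx;         /* unit-stride access: textbook vector loop */
}

int main(void) {
    static particle_aos ps_aos[N];   /* zero-initialized */
    static particles_soa ps_soa;
    move_aos(ps_aos, 1.0f);
    move_soa(&ps_soa, 1.0f);
    printf("%f %f\n", ps_aos[0].x, ps_soa.x[0]);
    return 0;
}
```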
I don't say any of that to fling mud at other systems; we started our project after them all, at the start of the multicore era (2006), with the goal of investigating these specific issues without carrying along the baggage of a pre-existing sequential implementation.
There's also still a lot more to learn. I personally don't buy that full determinism and total chaos are the only two points in the design space of program reasoning. There have to be some interesting midpoints (e.g., histories that are linearizable) that are worth investigating.
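As one toy illustration of the kind of midpoint I mean (my own sketch, not something from our papers): a shared counter bumped with atomic fetch-and-add is linearizable. The interleaving of increments changes from run to run, so execution is not deterministic, but every increment takes effect atomically at some point between its invocation and response, so the history is always equivalent to some sequential order and the final total never changes.

```c
/* A linearizable counter: nondeterministic interleaving, deterministic total. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NINCR    100000

static atomic_long counter = 0;

static void *bump(void *unused) {
    (void)unused;
    for (int i = 0; i < NINCR; i++)
        atomic_fetch_add(&counter, 1);   /* each increment is a linearization point */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, bump, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    /* The order of increments differs every run, but the outcome does not. */
    printf("%ld\n", (long)atomic_load(&counter));  /* always NTHREADS * NINCR */
    return 0;
}
```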