I think Bryan has been doing a follow-up to Copperhead, so it's probably easiest to just ask him :)
I don't know what you mean by predictable performance. Flattening is a direct transformation and seems simple to reason about on SIMD architectures, though the more recent dynamic scheduling (work stealing) approach for multicore/distributed has the usual caveats. (I tend to avoid it for HPC.) Given the 10+ year history of the researchers involved, it seems like a slow-but-steady project.
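To make "direct transformation" a bit more concrete, here is a minimal sketch of the general idea in plain Python. It is not DPH's or Copperhead's actual implementation, and the names `flatten` and `segmented_sum` are made up for illustration. A nested "sum each sublist" is rewritten as one pass over a flat array plus a list of segment lengths, which is the shape that maps well onto SIMD hardware.

```python
from itertools import accumulate

def flatten(nested):
    """Split a list of lists into (flat data, segment lengths)."""
    data = [x for seg in nested for x in seg]
    lengths = [len(seg) for seg in nested]
    return data, lengths

def segmented_sum(data, lengths):
    """Sum each segment of the flat array; one result per segment."""
    # Inclusive prefix sum over the flat data, with a leading 0.
    prefix = [0] + list(accumulate(data))
    # Segment boundary offsets into the flat array, with a leading 0.
    ends = [0] + list(accumulate(lengths))
    # Each segment sum is a difference of two prefix-sum entries.
    return [prefix[e] - prefix[s] for s, e in zip(ends, ends[1:])]

nested = [[1, 2, 3], [], [4, 5]]
data, lengths = flatten(nested)
# The flat version agrees with the obvious nested version.
assert segmented_sum(data, lengths) == [sum(seg) for seg in nested]
```

The work done depends only on the total number of elements, not on how unevenly the sublists are sized, which is roughly why the flattened form is easy to reason about.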
Also, do you know if the DPH folks ever managed to iron out a version of higher order flattening which gives a predictable performance gain?