Partly because it goes through a class's __call__ instead of plain function calls, among other things.
In my tests it's 2x slower, but that might not be the main reason; I haven't profiled it at all.
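
For anyone who wants to check the __call__ overhead on their own machine, here is a minimal sketch of how it could be measured. The names add_one, AddOne and compare are made up for illustration and are not from the code being discussed, and the ratio will vary with the Python version and how much work each call does.

```python
import timeit


def add_one(x):
    """Plain function used as the baseline."""
    return x + 1


class AddOne:
    """Hypothetical callable class doing the same work via __call__."""

    def __call__(self, x):
        return x + 1


def compare(number=1_000_000):
    """Return (function_seconds, callable_seconds) for `number` calls each."""
    # Both statements are identical; only the object bound to `f` differs.
    func_time = timeit.timeit("f(1)", globals={"f": add_one}, number=number)
    call_time = timeit.timeit("f(1)", globals={"f": AddOne()}, number=number)
    return func_time, call_time


if __name__ == "__main__":
    func_time, call_time = compare()
    print(f"function: {func_time:.3f}s")
    print(f"__call__: {call_time:.3f}s")
    print(f"ratio:    {call_time / func_time:.2f}x")
```

A micro-benchmark like this only isolates call dispatch. It says nothing about where the time goes in the real code, so a profiler run (cProfile, for example) would still be needed to confirm the main cause.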
Another thing is that it doesn't seem to use ATLAS to scale across all the cores, even though my Python is linked against it.