Its magic function wrapping comes at a cost, trading runtime performance for ease of use. When you have a single C++ function to call that will run for a "long" time, pybind11 all the way. But pysimdjson tends to call functions that finish very quickly, and the per-call overhead is orders of magnitude higher than with Cython when you are explicit about types and signatures. Wrap a class in both pybind11 and Cython and compare the stack traces, and the difference is startling.
Ah yeah, that makes sense. I would rather call a single C++ function from Python that then calls other C++ functions (or itself). In the case of pysimdjson, however, Cython makes much more sense.
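To make the per-call overhead point concrete, here is a rough sketch using only the standard library. It does not use pybind11 or Cython at all: `operator.add` and the builtin `sum` are stand-ins for "many tiny calls across the Python/native boundary" versus "one call that does all the work natively". The function names are made up for illustration, and the timings will vary by machine.

```python
import operator
import timeit

values = list(range(100_000))

def many_small_calls(xs):
    # One call into compiled code per element: the fixed call cost
    # is paid len(xs) times, and each call does almost no work.
    total = 0
    for x in xs:
        total = operator.add(total, x)
    return total

def one_big_call(xs):
    # A single call into compiled code: the loop itself runs inside
    # the builtin, so the call cost is paid once.
    return sum(xs)

if __name__ == "__main__":
    t_many = timeit.timeit(lambda: many_small_calls(values), number=20)
    t_one = timeit.timeit(lambda: one_big_call(values), number=20)
    print(f"many small calls: {t_many:.4f}s")
    print(f"one big call:     {t_one:.4f}s")
```

Both functions return the same total; only the number of boundary crossings differs. How large the gap is for a real binding depends on the wrapper, which is the part this sketch cannot show.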
Overall this is way better than writing everything in Rust.