Same here, I guess it's a complex function of hardware and what you're trying to do. Personally I ended up writing my own autodiffing tensor library in C++, because all existing solutions had abysmal performance on my problem (lots of local updates in large tensors). The speedup is >50x compared to TF, PyTorch, Julia, and JAX.
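Concretely, the access pattern looks roughly like this (numbers invented, just to show that the handful of touched entries per step is tiny compared to the tensor size):

    #include <cstddef>
    #include <vector>

    // Shape of the workload: each step touches only k entries of an
    // N-element tensor, with k << N, and afterwards needs gradients
    // for exactly those entries.
    int main() {
        const std::size_t N = 100'000'000;                  // large tensor
        std::vector<float> params(N, 0.0f);
        for (int step = 0; step < 1000; ++step) {
            const std::size_t touched[] = {3, 17, 123456};  // k = 3 this step
            for (std::size_t i : touched) params[i] += 0.01f;   // local update
            // ...the backward pass should also only visit `touched`...
        }
    }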
> Personally I ended up writing my own autodiffing tensor library in C++
But this won't be a general library, right? You must have included only a certain subset of the functions in TF or PyTorch or whatever. Autodiff is also included in certain proprietary libraries, like the ones from NAG. But I doubt it's possible to achieve a 50x speedup without compromising on functionality.
Of course it's a herculean task to write a library with that many features. But I don't think that's the issue; it's more that the devs of TF can't possibly optimize for every use case. For me, I knew what kinds of ops I needed, so I could focus on getting those as fast as possible.
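To give a flavour of the kind of thing I mean, here's a stripped-down sketch (not my actual library; Tensor, Tape and local_axpy are made-up names for the example). The point is that each recorded backward op only touches the indices the forward update actually wrote, so both the forward update and the backward pass cost O(touched entries) rather than O(tensor size):

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Dense tensor as one flat buffer; no per-op allocations.
    struct Tensor {
        std::vector<float> data, grad;
        explicit Tensor(std::size_t n) : data(n, 0.0f), grad(n, 0.0f) {}
    };

    // Reverse-mode tape: each entry knows how to push gradients back
    // for the few indices its forward op touched.
    struct Tape {
        std::vector<std::function<void()>> ops;
        void backward() {
            for (auto it = ops.rbegin(); it != ops.rend(); ++it) (*it)();
        }
    };

    // Local update: out[i] += w * x[i] for a handful of indices of a
    // huge tensor. A generic framework tends to dispatch a whole-tensor
    // (or scatter) op here; done by hand it's a short loop.
    void local_axpy(Tensor& out, Tensor& x, float w,
                    const std::vector<std::size_t>& idx, Tape& tape) {
        for (std::size_t i : idx) out.data[i] += w * x.data[i];
        tape.ops.push_back([&out, &x, w, idx] {
            // d(out[i])/d(x[i]) = w, so only the touched entries get gradient.
            for (std::size_t i : idx) x.grad[i] += w * out.grad[i];
        });
    }

    int main() {
        Tensor x(10'000'000), y(10'000'000);   // large tensors
        Tape tape;
        x.data[42] = 3.0f;
        local_axpy(y, x, 2.0f, {7, 42, 99}, tape);
        y.grad[42] = 1.0f;   // seed dL/dy at one entry
        tape.backward();     // x.grad[42] == 2.0f; nothing else touched
    }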
Does that mean you have efficient "CRUD" operators? This interests me because I'm building a relational language and toyed with the idea of using tensors as the "table" structure, but dropped it because of updates...