It looks like this is still missing many matrix operations like QR, SVD, einsum, etc. Is there a clear route to using these on the GPU in Python on Apple Silicon? Last I checked, the PyTorch MPS backend was still missing at least QR...
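For what it's worth, the usual workaround when an op isn't implemented on MPS is to round-trip through the CPU. A minimal sketch, assuming PyTorch; the helper name is my own:

```python
import torch

def qr_with_cpu_fallback(a: torch.Tensor):
    """QR decomposition that falls back to the CPU when the input
    lives on a backend (like MPS) that lacks torch.linalg.qr."""
    if a.device.type == "mps":
        # Compute on CPU, then move the factors back to the original device.
        q, r = torch.linalg.qr(a.cpu())
        return q.to(a.device), r.to(a.device)
    return torch.linalg.qr(a)
```

It works, but the transfer cost defeats the point of running on the GPU in the first place, which is why native support matters.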
Factorization methods are fairly uncommon in deep learning (the likely target of this framework), and they have compute properties (approximate outputs, a non-deterministic number of iterations) that make them unlike the standard BLAS++ APIs.
einsum seems like a reasonable thing to request, but it's hard to stay performant across the entire surface the operation exposes.
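To illustrate how wide that surface is: one spec string dispatches to what would otherwise be many distinct kernels, each with its own performance characteristics. A quick sketch using NumPy as a stand-in for any array framework:

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)
b = np.arange(12.0).reshape(3, 4)

# Matrix multiply: contraction over a shared index.
matmul = np.einsum("ij,jk->ik", a, b)

# Trace: repeated index on a single operand.
trace = np.einsum("ii->", np.eye(3))

# Outer product: no contraction at all.
outer = np.einsum("i,j->ij", np.ones(2), np.ones(3))

# Batched matmul: a leading batch index carried through.
batched = np.einsum("bij,bjk->bik",
                    np.ones((5, 2, 3)), np.ones((5, 3, 4)))
```

A fast matmul kernel says nothing about how fast the trace, outer-product, or arbitrary multi-operand contractions will be, which is why covering einsum well is harder than it looks.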
Exactly right that this targets a narrower surface to enable many deep learning models. I wonder how uncommon it actually is to hit an operation that isn't included, though. Judging from the PyTorch MPS tracking issue, it seems pretty common:
NVIDIA's moat is not just in providing BLAS++ operations, but in extending them across a wider range of libraries: cuSPARSE, cuSOLVER, cuTENSOR, etc. Without these, it feels like Apple is just playing catch-up with whatever happens to be popular and unsupported...