
I think there's a subtlety here about what makes (e.g. English) tokens different from points in latent space. Everything is still differentiable (at least in the ML sense) until you do random sampling, and even then you can exclude the sampling step when calculating the gradient (or is that what's meant by the "manifold"?).
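
One concrete way to exclude the sampling from the gradient is a straight-through estimator. A minimal sketch, assuming a PyTorch setup; the logits, loss, and vocab size here are stand-ins, not anyone's actual model:

  import torch
  import torch.nn.functional as F

  vocab = 5
  logits = torch.randn(1, vocab, requires_grad=True)  # stand-in for a model's output
  probs = F.softmax(logits, dim=-1)

  # Random sampling: the non-differentiable step
  idx = torch.multinomial(probs, num_samples=1)
  hard = F.one_hot(idx.squeeze(-1), num_classes=vocab).float()

  # Straight-through trick: forward pass uses the hard sample, backward pass
  # routes gradients through the soft probabilities instead
  token = hard + probs - probs.detach()

  loss = token.pow(2).sum()   # stand-in for any downstream loss
  loss.backward()
  print(logits.grad)          # non-zero: the sampling was excluded from the gradient

So in that sense the discreteness of tokens per se isn't what breaks differentiability; it's whether you let the sampling into the backward pass.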

I don't see a priori why it would be better or worse to reason with the "superposition" of arguments in the pre-sampling phase rather than with the concrete realizations of those arguments that exist only after a token has been chosen. It may well be a contingent rather than a necessary fact.
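
Roughly, the contrast is between feeding back the probability-weighted mixture of candidate token embeddings ("superposition") and feeding back the embedding of one sampled token (concrete realization). A toy sketch, again assuming PyTorch, with purely illustrative names (emb, vocab, d_model):

  import torch
  import torch.nn.functional as F

  vocab, d_model = 5, 8
  emb = torch.randn(vocab, d_model)              # toy embedding table
  logits = torch.randn(1, vocab)                 # stand-in for a model's output
  probs = F.softmax(logits, dim=-1)

  # "Superposition": a mixture over all candidate tokens, never committing to one
  soft_input = probs @ emb                       # shape (1, d_model)

  # Concrete realization: commit to one sampled token and use only its embedding
  idx = torch.multinomial(probs, num_samples=1)
  hard_input = emb[idx.squeeze(-1)]              # shape (1, d_model)

Whether the first of these is actually a better substrate for reasoning than the second seems like an empirical question, which is what I mean by contingent rather than necessary.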



