It’s genuinely hard to compare optimizers fairly. Common architectures and default hyperparameters were discovered alongside Adam, so a “fair” comparison would mean redoing the hyperparameter sweeps for every competing optimizer. If you had infinite compute, you’d sweep every combination of optimizer and hyperparameters and pick whichever gives the best result; in practice nobody does this, so everyone just uses Adam.
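
To make the “infinite compute” idea concrete, here’s a minimal sketch of a per-optimizer sweep on a toy regression problem: each optimizer gets its own hyperparameter grid, and you keep the best result for each before comparing. The model, data, and grid values are placeholders I made up for illustration, not a real benchmark.

```python
import itertools
import torch
import torch.nn as nn

def make_data(n=512, d=16, seed=0):
    # Toy linear-regression data; stands in for a real training set.
    g = torch.Generator().manual_seed(seed)
    X = torch.randn(n, d, generator=g)
    w = torch.randn(d, 1, generator=g)
    y = X @ w + 0.1 * torch.randn(n, 1, generator=g)
    return X, y

def train(opt_name, lr, momentum_or_beta1, steps=200):
    torch.manual_seed(0)  # same initialization for every trial
    X, y = make_data()
    model = nn.Linear(X.shape[1], 1)
    if opt_name == "adam":
        opt = torch.optim.Adam(model.parameters(), lr=lr,
                               betas=(momentum_or_beta1, 0.999))
    else:
        opt = torch.optim.SGD(model.parameters(), lr=lr,
                              momentum=momentum_or_beta1)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# Each optimizer gets its own grid: comparing them at one shared setting
# would quietly favor whichever optimizer the defaults were tuned around.
grids = {
    "adam": {"lr": [1e-3, 3e-3, 1e-2], "momentum_or_beta1": [0.9, 0.95]},
    "sgd":  {"lr": [1e-2, 3e-2, 1e-1], "momentum_or_beta1": [0.0, 0.9]},
}

for name, grid in grids.items():
    best = min(
        (train(name, lr, m), lr, m)
        for lr, m in itertools.product(grid["lr"], grid["momentum_or_beta1"])
    )
    print(f"{name}: best loss {best[0]:.4f} at lr={best[1]}, beta/momentum={best[2]}")
```

Even this toy version makes the cost obvious: the grid multiplies across every hyperparameter, optimizer, architecture, and dataset you care about, which is exactly why nobody runs the “fair” version at scale.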