It's a great resource and all, but I find it insane that hyperparameter tuning isn't more automated at this point. This artisanal approach doesn't seem scalable.
Random question from someone who's never touched ML professionally:
Can we optimize hyper-params the same way we optimize weights and biases (e.g. gradient descent)? Or would that be too expensive (since you have to optimize the weights and biases for each hyper-param configuration)?
Of course that might lead to hyper-hyper-params...
In the general case, no, because the final metric we care about (e.g. accuracy or AUC) is not differentiable (i.e. we cannot compute its gradient) with respect to the hyperparameters, especially the discrete ones.
However, recent work at this year's NeurIPS [1] did use an outer gradient descent to tune an inner gradient descent, so in some cases your idea does work.
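To make the idea concrete, here's a rough sketch (in JAX, not that paper's actual method, and only for a continuous hyperparameter like an L2 penalty): unroll the inner training loop and let autodiff push the gradient of a validation loss back to the hyperparameter. Everything here (the ridge objective, step counts, toy data) is made up for illustration.

    # Hypergradient sketch: outer gradient descent on a hyperparameter
    # by differentiating through an unrolled inner gradient descent.
    import jax
    import jax.numpy as jnp

    def train_loss(w, lam, X, y):
        # inner (training) objective; lam is the L2 hyperparameter
        return jnp.mean((X @ w - y) ** 2) + lam * jnp.sum(w ** 2)

    def val_loss(w, Xv, yv):
        # outer (validation) metric we actually care about
        return jnp.mean((Xv @ w - yv) ** 2)

    def inner_train(lam, X, y, steps=50, lr=0.1):
        # inner gradient descent on the weights, unrolled so it stays differentiable
        w = jnp.zeros(X.shape[1])
        g = jax.grad(train_loss)
        for _ in range(steps):
            w = w - lr * g(w, lam, X, y)
        return w

    def outer_objective(lam, X, y, Xv, yv):
        # validation loss as a function of the hyperparameter alone
        return val_loss(inner_train(lam, X, y), Xv, yv)

    # toy data, just so the sketch runs end to end
    w_true = jnp.arange(1.0, 6.0)
    X = jax.random.normal(jax.random.PRNGKey(0), (64, 5))
    y = X @ w_true + 0.5 * jax.random.normal(jax.random.PRNGKey(1), (64,))
    Xv = jax.random.normal(jax.random.PRNGKey(2), (32, 5))
    yv = Xv @ w_true + 0.5 * jax.random.normal(jax.random.PRNGKey(3), (32,))

    # outer gradient descent on the hyperparameter itself
    lam = 0.5
    hypergrad = jax.grad(outer_objective)
    for _ in range(20):
        lam = lam - 0.05 * hypergrad(lam, X, y, Xv, yv)
    print("tuned lambda:", lam)

This only works because the hyperparameter is continuous and the validation loss is differentiable; for discrete choices (layer counts, optimizer type, etc.) you're back to black-box search.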
There's also the (older) "Devil in the Details" papers, focused on computer vision. I'd love to read something like this on modern methods like transformers. https://arxiv.org/abs/1405.3531
Does anyone know of an updated version for the age of Transformers?