Google simply has the computational resources to cover thousands of different hyperparameter combinations. If you don't have that, you won't ever be able to do systematic exploration, so you might as well rely on intuition and luck.
This is not accurate. Chess alone is so complex that brute force would still take an eternity, and they certainly have no incentive to waste money just to show off (that would reflect badly on them).
But how does it work? Having the resources to outpace other implementations is one thing, but if I remember correctly, the model even runs on a consumer machine.
I have only read a few abstract descriptions and I have no idea about deep learning specifically. So the following is more musing than summary:
They use Monte Carlo tree search to explore a sparse search space. The data structure is likely highly optimized to begin with. And it's not just a single network (if you will, any abstract syntax tree is a network, but that's not the point), but a whole architecture of networks: modules from different lines of research pieced together, each probably with different settings. I would be surprised if that works completely unsupervised; after all, it took months to get from beating Go to beating chess. They can run it without training the weights, but likely because the parameters and layouts are already optimized, and, to the point of the OP, because some of the optimization is automatic. I guess what I'm trying to say is: if they extracted features from their own thought process (i.e. domain knowledge) and mirrored that in code, then we are back at expert systems.
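To make the Monte Carlo part concrete, here is a minimal sketch of plain Monte Carlo tree search in Python. The game interface (legal_moves, play, is_terminal, result) is made up for illustration, and AlphaZero replaces the random rollout with the network's value estimate and biases selection with its policy head, so treat this as the bare skeleton only:

    import math
    import random

    class Node:
        def __init__(self, state, parent=None):
            self.state = state      # opaque game state
            self.parent = parent
            self.children = {}      # move -> Node
            self.visits = 0
            self.value = 0.0        # accumulated rollout reward

    def ucb(child, parent_visits, c=1.4):
        # Upper confidence bound: balance average value against
        # curiosity about rarely tried moves.
        if child.visits == 0:
            return float("inf")
        return (child.value / child.visits
                + c * math.sqrt(math.log(parent_visits) / child.visits))

    def mcts(root, game, iterations=1000):
        for _ in range(iterations):
            node = root
            # 1. Selection: descend along the highest-UCB child.
            while node.children and not game.is_terminal(node.state):
                node = max(node.children.values(),
                           key=lambda ch: ucb(ch, node.visits))
            # 2. Expansion: add children the first time we reach a leaf.
            if not game.is_terminal(node.state) and not node.children:
                for move in game.legal_moves(node.state):
                    node.children[move] = Node(game.play(node.state, move), node)
                node = random.choice(list(node.children.values()))
            # 3. Rollout: random play to the end. AlphaZero skips this and
            #    uses the network's value estimate instead.
            state = node.state
            while not game.is_terminal(state):
                state = game.play(state, random.choice(game.legal_moves(state)))
            reward = game.result(state)
            # 4. Backpropagation: update statistics along the path. (A real
            #    two-player version flips the reward sign at each ply.)
            while node is not None:
                node.visits += 1
                node.value += reward
                node = node.parent
        # The move most explored from the root is the one to play.
        return max(root.children, key=lambda m: root.children[m].visits)

The "sparse" part is that only a vanishingly small fraction of the game tree ever gets materialized as Node objects; the statistics steer the search toward the promising branches.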
PS: Instead of letting processors grind away at small networks, take advantage of the huge neural network that experts carry in their heads and guide the artificial one in the right direction. Mostly, information processing follows insight from other fields rather than delivering explanations; the explanations have to be there already. It would be particularly interesting to hear how the chess play of the developers involved has evolved since, and how much of the model they actually understand.
I'm curious why you believe you can tell that my comment is not accurate when you yourself admit that you have no idea about deep learning.
Note that I'm not saying that Google is doing something stupid or leaving potential gains on the table. What I'm saying is that their methods make sense when you are able to perform enough experiments to actually make data-driven decisions. There is just no way to emulate that when you don't even have the budget to try more than one value for some hyperparameters.
And since you mentioned chess: The paper https://arxiv.org/pdf/1712.01815.pdf doesn't go into detail about hyperparameter tuning, but does say that they used Bayesian optimization. Although that's better than brute force, AFAIK its sample complexity is still exponential in the number of parameters.
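For a feel of what that means in practice, here is a toy sketch of Bayesian optimization over a single hyperparameter (the learning rate on a log scale). The objective function is invented for illustration; in reality every evaluation is a complete training run, which is exactly why sample efficiency matters:

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    # Toy objective: "validation loss" as a function of log10(learning rate).
    # Invented for illustration; in reality each call is a full training run.
    def val_loss(log_lr):
        return (log_lr + 3.0) ** 2 + 0.1 * np.random.randn()

    bounds = (-6.0, 0.0)
    X = np.random.uniform(*bounds, size=(3, 1))   # a few random initial trials
    y = np.array([val_loss(x[0]) for x in X])

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

    for _ in range(15):
        gp.fit(X, y)                              # model the loss surface
        cand = np.linspace(*bounds, 500).reshape(-1, 1)
        mu, sigma = gp.predict(cand, return_std=True)
        # Expected improvement over the best trial so far (we are minimizing).
        imp = y.min() - mu
        z = imp / (sigma + 1e-9)
        ei = imp * norm.cdf(z) + sigma * norm.pdf(z)
        x_next = cand[np.argmax(ei)]              # most promising next trial
        X = np.vstack([X, x_next])
        y = np.append(y, val_loss(x_next[0]))

    print("best log10(lr) found:", X[np.argmin(y), 0])

With one dimension this converges in a handful of trials; the trouble is that the number of trials needed grows very quickly as you add hyperparameters, which is the exponential sample complexity mentioned above.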
Your comment reminded me of myself, so maybe I read a bit too much into it. Even given Google's resources, I wouldn't be able to "solve" chess any time soon. And it's a fair guess that this applies to most people, maybe a slightly smaller share here, so I took the opportunity to provoke informed answers correcting my assumptions. I did search for papers afterwards, so your link is appreciated, but it's all lost on me.
> they used Bayesian optimization. Although that's better than brute force, AFAIK its sample complexity is still exponential in the number of parameters.
I guess the trick is to cull the search tree by making the right moves, forcing the opponent's hand?
I think you are confused about the thing being optimized.
Hyperparameters are things like the number of layers in a model, which activation functions to use, the learning rate, the strength of momentum and so on. They control the structure of the model and the training process.
This is in contrast to "ordinary" parameters, which describe, e.g., how strongly neuron #23 in layer #2 is activated in response to the activation of neuron #57 in layer #1. The important difference is that the influence of hyperparameters on the final model quality is hard to determine, since you need to run the complete training process before you know it.
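In code, the distinction might look like this (a hypothetical toy setup, just to pin down the terminology):

    import numpy as np

    # Hyperparameters: fixed *before* training, chosen by humans or by search.
    hyperparams = {
        "n_hidden": 64,         # number of neurons in the hidden layer
        "learning_rate": 1e-3,  # step size for gradient descent
    }

    # Ordinary parameters: initialized randomly, then learned during training.
    # W1[57, 23] is the weight from neuron #57 in layer #1 to neuron #23
    # in layer #2, to use the numbering above.
    rng = np.random.default_rng(0)
    W1 = rng.normal(0.0, 0.1, size=(128, hyperparams["n_hidden"]))
    b1 = np.zeros(hyperparams["n_hidden"])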
To address your chess example specifically: there are actually three different optimization problems involved. The first is choosing which move to make in a given position in order to win the game. That's what the neural network is supposed to solve.
But then you have a second problem, which is to choose the right parameters for the neural network to be good at its task. To find these parameters, most neural network models are trained with some variation of gradient descent.
And then you have the third problem of choosing the correct hyperparameters for gradient descent to work well. Some choices will just make the training process take a little longer, and others will cause it to fail completely, e.g. by getting "stuck" with bad parameters. The best ways we know to choose hyperparameters are still a combination of rules of thumb and systematic exploration of possibilities.
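A toy illustration of the third problem: gradient descent on f(w) = w^2, where the only hyperparameter is the learning rate. The numbers are made up, but the failure mode is real; too large a step size makes every update overshoot, and training diverges instead of merely slowing down:

    # Gradient descent on f(w) = w^2 (gradient: 2w), showing how the
    # learning-rate hyperparameter alone decides between success and failure.
    def train(lr, steps=30, w0=1.0):
        w = w0
        for _ in range(steps):
            w -= lr * 2 * w  # one gradient step
        return w

    for lr in (0.01, 0.1, 1.1):
        print(f"lr={lr}: final w = {train(lr):.4g}")

    # lr=0.01 -> w ~ 0.55  (converging, but slowly)
    # lr=0.1  -> w ~ 0.001 (converged)
    # lr=1.1  -> |w| ~ 237 (diverged: each step multiplies w by -1.2)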
Google did do this. It took ungodly amounts of computing power and only did slightly better than random search. They didn't even compare against old-fashioned hill climbing or Bayesian optimization.
I've considered gradient descent for optimizing parameters on toy problems at university a few times. Never actually did it, though; it's a lot of hassle for the advantage of less manual interaction, at the cost of no longer building up intuition.