Nice work on this! I was behind the implementation at Oracle which you referenced in the tutorial. I still keep tabs on the Lc0 crowd, which seems to be pushing into new ideas. Did you pull anything else from the Leela crowd besides prior temperature? It looks like maybe you also tried a WDL output head?



What do you mean by a WDL output head?

So far, the main idea I have pulled from the Lc0 crowd is indeed to use a prior temperature.
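For anyone unfamiliar with the trick: the network's policy priors are flattened (or sharpened) before the search uses them. A minimal sketch, with my own naming (this is not code from Lc0 or from my project):

    import numpy as np

    def apply_prior_temperature(priors, tau):
        # tau > 1 flattens the priors (more exploration),
        # tau < 1 sharpens them (more trust in the network).
        p = np.power(priors, 1.0 / tau)
        return p / p.sum()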

The next thing I am planning to add is the ability to batch inference requests across game simulations instead of relying on asynchronous MCTS. In your blog series, you anticipate the problem of virtual loss introducing some exploration bias into the search, but ultimately conclude that it does not change much:

[Quote from your blog series]: "Technically, virtual loss adds some degree of exploration to game playouts, as it forces move selection down paths that MCTS may not naturally be inclined to visit, but we never measured any detrimental (or beneficial) effect due to its use."
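To make the mechanism concrete, here is roughly what virtual loss does during selection; the field and constant names are made up for illustration, not taken from either project:

    VIRTUAL_LOSS = 1.0

    def add_virtual_loss(path):
        # Pretend every node on an in-flight simulation path was
        # visited and lost, so concurrent workers see a lower value
        # there and get pushed toward other branches.
        for node in path:
            node.virtual_losses += 1

    def revert_virtual_loss(path):
        # Undo the penalty once the real evaluation is backed up.
        for node in path:
            node.virtual_losses -= 1

    def q_value(node):
        n = node.visit_count + node.virtual_losses
        if n == 0:
            return 0.0
        return (node.value_sum - VIRTUAL_LOSS * node.virtual_losses) / n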

Interestingly, it seems that the Lc0 team had a different experience here. I ran some tests myself, and going from 32 to 4 workers (for 600 MCTS simulations per turn) on my connect-four agent results in a significant increase in playing strength. This may be because I use a much smaller neural network than yours, which is ultimately not as strong.
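Concretely, the batched alternative I have in mind looks something like this; pending_leaf, apply_evaluation, and evaluate_batch are placeholder names, not real APIs:

    def selfplay_with_batched_inference(searches, network):
        # One strictly sequential MCTS per ongoing game, each paused
        # at a leaf that needs a network evaluation. No virtual loss
        # is needed: the batch comes from simulating many games at
        # once, not from parallelizing a single tree.
        while searches:
            leaves = [s.pending_leaf() for s in searches]
            priors, values = network.evaluate_batch(leaves)  # one GPU call
            for s, p, v in zip(searches, priors, values):
                s.apply_evaluation(p, v)
            searches = [s for s in searches if not s.done()]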

Related to this, there is a question I have wanted to ask you since I found your blog series: did you run experiments with smaller networks, and what were the results? What is the smallest architecture you tried, and how did it perform?


The Lc0 group switched the result prediction to predict win, draw, and loss probabilities instead of a single win/loss value. Some information can be found at https://lczero.org/blog/2020/04/wdl-head/
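Schematically, it is a three-way softmax in place of the usual single tanh-squashed scalar. A sketch in PyTorch (my own code, not Lc0's):

    import torch
    import torch.nn as nn

    class WDLHead(nn.Module):
        def __init__(self, in_features):
            super().__init__()
            # One logit each for win, draw, and loss.
            self.fc = nn.Linear(in_features, 3)

        def forward(self, x):
            wdl = torch.softmax(self.fc(x), dim=-1)
            value = wdl[..., 0] - wdl[..., 2]  # expected score in [-1, 1]
            return wdl, value

The head is trained with a cross-entropy loss against the actual game outcome, and the scalar value the search needs is recovered as P(win) - P(loss).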


We did a lot of our early experimentation with small networks. I don't think we went any smaller than 5 layers of 64 filters, as we mentioned here: https://medium.com/oracledevs/lessons-from-alpha-zero-part-5...
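For scale, a trunk that small would look roughly like the following; the kernel size, batch norm, and activation here are generic choices, not necessarily exactly what we ran:

    import torch.nn as nn

    def small_trunk(in_channels):
        # 5 convolutional layers of 64 filters each; the policy and
        # value heads sit on top of this.
        layers = []
        c = in_channels
        for _ in range(5):
            layers += [nn.Conv2d(c, 64, kernel_size=3, padding=1),
                       nn.BatchNorm2d(64),
                       nn.ReLU()]
            c = 64
        return nn.Sequential(*layers)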


And what were the results of these experiments? What error rate could you reach with the smallest network architecture you tried, for example?


Unfortunately, I don't remember the exact numbers, but I think it was a couple of percentage points worse than what we were able to get with the large models.


This is interesting, thanks! Is there anything else you can tell me about the results of your experiments with small networks? I am really curious about this.

For example: did you notice that increasing or decreasing the network size required significant changes to other hyperparameters? Do small networks learn faster at the beginning of training before they start to plateau?



