I completely agree with you. Let me just add two remarks. First, although picking a 9x9 board does indeed make connect four intractable for brute-force search, I would be surprised if it made the game much more difficult for AlphaZero, which relies on the generalization capabilities of the network anyway. Second, using a solved game for the tutorial is a feature, not a bug: it allows precise benchmarking of the resulting agent, as a ground truth is known.
That's really cool, and I hadn't thought of that. I just wanted clarification: that means you train the agent without the deterministic solution, and your "validation/test" phases (I'm not sure what those are called in reinforcement learning) are done with the deterministic solution?
I did not see an evaluation of how close to perfect the agent becomes. Did you compute any sort of error rate (by finding moves that turn a won position into a non-won one, or a drawn position into a lost one)? And how does this error rate drop as learning advances? That would indeed be very interesting to see.
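Concretely, here is the kind of measurement I have in mind, as a rough Julia sketch. Note that `solve`, `agent_move` and `apply` are placeholder names for a perfect solver, the agent's move choice and the move-application function; they are not actual AlphaZero.jl APIs:

    # Fraction of test positions on which the agent's move lowers the
    # game-theoretic value (won -> drawn/lost, or drawn -> lost).
    # `solve(pos)` is assumed to return +1/0/-1 from the viewpoint of
    # the player to move.
    function blunder_rate(positions; solve, agent_move, apply)
        blunders = 0
        for pos in positions
            before = solve(pos)
            # After the move, `solve` scores from the opponent's side,
            # hence the negation.
            after = -solve(apply(pos, agent_move(pos)))
            blunders += after < before
        end
        return blunders / length(positions)
    end

Tracking this quantity across training checkpoints would give exactly the learning curve I'm asking about.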
My team did an implementation of AlphaZero for Connect Four a couple of years ago. Our findings are in a series of blog posts starting at https://medium.com/oracledevs/lessons-from-implementing-alph.... We didn't manage to reach a perfect policy either, but we got pretty close. You can play against some versions of the network here: https://azfour.com
Your series of blog articles has been an important source of inspiration in writing AlphaZero.jl and I cite it frequently in the documentation. Thanks to you and your team!
Admittedly, the connect four agent is still far from perfect, but there is a lot of room for improvement, as I have done very little hyperparameter tuning so far.