Here's a slightly longer answer. When training machine learning models it's the done thing to test them on held-out data, and to use the error on that held-out test data to estimate the accuracy of the model on truly unseen data that we really don't have, such as observations that are still in the future, like the weather tomorrow (as in 12/12/24) [1].
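For concreteness, here's a minimal sketch of that standard held-out evaluation. The library (scikit-learn) and the synthetic dataset are just my choices for illustration; nothing above implies any particular tooling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "the data we actually have".
X, y = make_classification(n_samples=1000, random_state=0)

# Hold out a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The error on the held-out set stands in for the error we'd get on truly
# unseen data; that substitution only works if the test set stays untouched.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```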
The problem is that held-out test data is not really unseen. When a model doesn't perform very well on it, it is common to tweak the model, tweak the hyperparameters, tweak the initialisation, and so on, until the model performs well on the held-out test data [2]. That ends up optimising the model against the held-out test data, and therefore destroys any ability to estimate its accuracy when predicting truly unseen data [3].
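To see how badly that kind of re-use can bias the estimate, here's a toy sketch of my own construction (not anything from the paper): the labels are pure noise, so no model can genuinely beat 50% accuracy, yet picking the best of fifty "tweaks" on the same held-out set still produces a respectable-looking score.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = rng.integers(0, 2, size=400)           # random labels: there is no signal to learn

X_train, y_train = X[:200], y[:200]
X_test,  y_test  = X[200:300], y[200:300]  # the "held-out" set we keep coming back to
X_fresh, y_fresh = X[300:], y[300:]        # truly unseen data, touched exactly once

best_score, best_model = 0.0, None
for seed in range(50):                     # fifty rounds of "tweaking" (here, just the seed)
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    if score > best_score:
        best_score, best_model = score, model

# The re-used test set rewards whichever tweak happened to fit its noise, so
# the "best" score is typically well above chance, while the untouched data
# falls back towards 0.5, which is all these features can ever support.
print("best accuracy on the re-used test set:", best_score)
print("same model on truly untouched data:  ",
      accuracy_score(y_fresh, best_model.predict(X_fresh)))
```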
You can check DeepMind's paper and see if you can find where, in the description of their methodology, they explain what they did to mitigate this effect. You won't find it.
This is enough of a problem when the model is, say, an image classifier, but when it's a model that's supposed to predict the weather 15 days from now, the best you can say from the results on held-out test data is that the model does fine at predicting the weather we already had.
____________
[1] Yes, that's why we test models on held-out data. Not so we can brag about their "accuracy", or so we can put a little leaderboard in our papers (a little table with all the datasets on one side, all the systems on the other side, and all our entries in bold, or else we don't submit the paper) and brag about that. We're trying to estimate generalisation error.
[2] "The first iteration of my model cost hundreds of man-hours and thousands of dollars to code and train but if it doesn't perform well on the first try on my held-out test data I'm going to scrap it and start all over again from scratch".
Yeah. Right.
[3] Even worse: everyone splits their experimental dataset into 80/20 partitions, 80% for training and 20% for testing, and that already screws up the accuracy of any error estimate. Not only are we predicting 20% of our data from the other 80%, we're also predicting a tiny amount of data in absolute terms, compared to the true distribution.
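A rough back-of-the-envelope sketch of the "tiny amount of data" point, with my own numbers and the simplifying assumption that each test prediction is an independent coin flip (which real data won't be):

```python
import math

def accuracy_interval(acc, n_test):
    """Approximate 95% confidence interval for a measured accuracy."""
    se = math.sqrt(acc * (1 - acc) / n_test)   # binomial standard error
    return acc - 1.96 * se, acc + 1.96 * se

# 20% test splits of datasets with 1k, 10k and 100k examples.
for n_test in (200, 2000, 20000):
    lo, hi = accuracy_interval(0.90, n_test)
    print(f"n_test={n_test:6d}: a measured 90% accuracy is really somewhere in [{lo:.3f}, {hi:.3f}]")
```

Even before any test-set re-use, a 20% split of a small dataset leaves the error estimate with percentage points of wiggle room from sampling noise alone.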