> your network has 100000 iterations, while the parent's has 1000, but they both only use x / y positions
Correct, but keep in mind that their method appears to use batch descent while mine does not. Batch descent often converges more quickly. There are other differences I can spot between my net and the GP's as well (e.g., the activation function, the learning rate, and regularization).
Also keep in mind that I threw this together over breakfast, and did not spend much time tweaking parameters :)
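To make the batch-vs-per-example distinction concrete, here's a minimal sketch (not either of our actual nets; the toy linear model, data, and learning rate are made up for illustration): batch descent averages the gradient over the whole dataset before each update, while per-example descent applies one update per sample.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))      # x / y positions
y = (X[:, 0] * X[:, 1] > 0).astype(float)  # arbitrary toy target
lr = 0.1

def grad(w, xi, yi):
    # Gradient of squared error for one example under a linear model.
    return 2 * (xi @ w - yi) * xi

# Batch descent: one update per pass, using the mean gradient over all data.
w_batch = np.zeros(2)
for _ in range(1000):
    g = np.mean([grad(w_batch, xi, yi) for xi, yi in zip(X, y)], axis=0)
    w_batch -= lr * g

# Per-example descent: one update per sample, so far more (noisier) updates
# per pass, which is roughly where the iteration-count gap comes from.
w_sgd = np.zeros(2)
for _ in range(10):
    for xi, yi in zip(X, y):
        w_sgd -= lr * grad(w_sgd, xi, yi)
```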