Okay, so it works by minimizing (or, equivalently, maximizing) some function. But that doesn't say much about how it "learns" the gradient. What function does it care about? Average squared error, (predict_prob - Z_i)^2? Average absolute error? The likelihood function of some assumed distribution? The maximum distance between the classification border and the closest observed points? If I saw someone carrying a bag of blueberries and some bread home from the grocery store and asked how they chose those items, and they replied, "I had a list of characteristics which I thought were important for groceries to have on this trip to the store. For each grocery item, I recorded a vector of degrees to which the item possesses each of those characteristics. Finally, I chose the group of groceries that had the best combination of degree vectors," I still wouldn't really know anything about why they bought the blueberries and bread.
The function it minimizes is called the "loss function", and its values for the training and test sets are shown in the upper right area. AFAICT the site doesn't say how it's computed, but I think it's average squared error. The gradient is not learned; if you think of the loss function as a real-valued function of the weights, the gradient is just the vector of partial derivatives with respect to the weights.
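For concreteness, here's a tiny sketch of that idea. It uses a plain linear model with made-up data, and it assumes the loss really is mean squared error (which, again, the site doesn't confirm); the point is just what "the loss as a function of the weights" and its gradient look like:

```python
import numpy as np

# Hypothetical toy data: 8 points with 2 features, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
z = np.sign(X[:, 0] + X[:, 1])
w = rng.normal(size=2)  # the weights the loss is a function of

def loss(w):
    # Mean squared error: average of (prediction - z_i)^2 over the data.
    pred = X @ w
    return np.mean((pred - z) ** 2)

def grad(w):
    # The gradient is just the partial derivatives of the loss w.r.t.
    # each weight; for MSE the chain rule gives (2/n) * X^T (Xw - z).
    pred = X @ w
    return 2.0 / len(z) * X.T @ (pred - z)

# One gradient-descent step: nudge the weights against the gradient.
lr = 0.1
print("loss before:", loss(w))
w = w - lr * grad(w)
print("loss after: ", loss(w))
```

Nothing is "learned" about the gradient itself: given the loss and the current weights, the gradient is fully determined, and training just keeps stepping the weights downhill along it.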