
That's an interesting (and relevant) question. In fact, if you're using something like stochastic gradient descent to optimize the weights of your network, it might be very hard for the network to escape the general local minimum (or basin of attraction) in which it ended up after training over a large dataset, even if you present it with the same examples but with the labels flipped (which would be the easiest way to "unlearn").

In theory, stochastic gradient descent allows you to escape local minima: the noise in the stochastic error surface will likely be enough for the network to escape whatever minimum it is in. In practice, because the weights tend to have large magnitudes and because most of the time you'll be using a saturating non-linearity (such as a sigmoid), the gradients in the saturated regime are tiny, and the number of steps required to escape that local minimum can be prohibitively large.
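To see why saturation slows things down, here's a minimal sketch (my own toy example: plain NumPy, a single sigmoid unit) showing how quickly the sigmoid's gradient shrinks as the pre-activation grows:

    # Gradient of a sigmoid unit at increasingly large pre-activations.
    # Once |z| is large, each SGD step barely moves the weights.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    for z in [0.5, 2.0, 5.0, 10.0]:
        print(f"pre-activation {z:5.1f}: sigmoid' = {sigmoid_grad(z):.6f}")
    # pre-activation   0.5: sigmoid' = 0.235004
    # pre-activation   2.0: sigmoid' = 0.104994
    # pre-activation   5.0: sigmoid' = 0.006648
    # pre-activation  10.0: sigmoid' = 0.000045

With large weights the pre-activations sit far out on the flat parts of the sigmoid, so the error signal that would have to push the network out of its basin gets multiplied by numbers like those above.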

Presumably, you could use second-order optimization methods to escape from minima, since they allow you to take "bigger" steps, but those come with their own set of problems (negative curvature being one of them).
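Here's a toy illustration of the negative-curvature problem (a 1-D double-well of my own choosing, not anything from the thread): where the curvature is negative, the raw Newton step points toward the nearby stationary point, which is a maximum, so practical second-order methods damp or modify the curvature term first.

    # Raw Newton step vs. a damped step on f(w) = w^4 - 3w^2 (minima near w = ±1.22,
    # local maximum at w = 0). At w = 0.3 the curvature is negative.
    f = lambda w: w**4 - 3 * w**2
    g = lambda w: 4 * w**3 - 6 * w        # gradient
    h = lambda w: 12 * w**2 - 6           # second derivative (1-D "Hessian")

    w = 0.3
    newton_step = -g(w) / h(w)            # ~ -0.344: moves toward the maximum at w = 0
    damped_step = -g(w) / (abs(h(w)) + 1) # ~ +0.286: moves downhill toward the minimum
    print(h(w), newton_step, damped_step)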

I encourage you to actually test these hypotheses: train a simple network on something stupid like MNIST, and make it achieve a reasonable error with many passes through the data. Then change the labels of 10, 20, or 50% of your inputs and continue training (with the same learning rate... or not!) to see how long it takes for the network to get to another minimum.
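A rough sketch of that experiment in PyTorch (the framework, architecture, flip fraction, and hyper-parameters are all my own choices, not something prescribed above): train a small sigmoid MLP on MNIST, then flip a fraction of the labels in place and keep training with the same optimizer to watch how slowly the accuracy on the new labels recovers.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def make_loader(dataset, batch_size=64):
        return DataLoader(dataset, batch_size=batch_size, shuffle=True)

    def train_epochs(model, loader, optimizer, epochs, tag):
        loss_fn = nn.CrossEntropyLoss()
        for epoch in range(epochs):
            total, correct = 0, 0
            for x, y in loader:
                optimizer.zero_grad()
                logits = model(x.view(x.size(0), -1))
                loss = loss_fn(logits, y)
                loss.backward()
                optimizer.step()
                correct += (logits.argmax(dim=1) == y).sum().item()
                total += y.size(0)
            print(f"[{tag}] epoch {epoch}: accuracy on current labels = {correct / total:.3f}")

    # Small MLP with a saturating non-linearity, to match the setting above.
    model = nn.Sequential(nn.Linear(784, 128), nn.Sigmoid(), nn.Linear(128, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    train_set = datasets.MNIST("data", train=True, download=True,
                               transform=transforms.ToTensor())

    # Phase 1: normal training to a reasonable error.
    train_epochs(model, make_loader(train_set), optimizer, epochs=5, tag="clean")

    # Phase 2: flip the labels of 20% of the examples (random new labels,
    # occasionally equal to the original) and continue training.
    flip_fraction = 0.2
    idx = torch.randperm(len(train_set))[: int(flip_fraction * len(train_set))]
    new_labels = train_set.targets.clone()
    new_labels[idx] = torch.randint(0, 10, (len(idx),))
    train_set.targets = new_labels

    train_epochs(model, make_loader(train_set), optimizer, epochs=5, tag="flipped")

Comparing how many "flipped" epochs it takes to reach the same accuracy the "clean" phase reached gives you a concrete measure of how sticky the original basin is; you can then repeat the run with a larger learning rate or a different flip fraction.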



