- You can think of error, loss, and cost functions as the same. In fact, two textbooks in front of me say that the loss function is a measure of error. If "loss" is a confusing word, think of it as the "information loss" of the model -- if your model is not perfect, you lose some of the information inherent in the data.
- There is no particular function used for error and loss. Different functions can be chosen based on the model, problem type, ease of theoretical analysis, etc. In practice, the final loss function is often experimentally determined by whatever yields the best accuracy.
- The perceptron uses a different loss function because it is a binary classifier, not a regressor. In this case, because there are only two classes (1 and -1), the loss function max(0, -xy) is 0 if x and y are the same class and 1 if they are different. The error function then sums these losses together. (Note that this sum-of-per-example-losses structure is quite similar to MSE; see the first sketch after this list.)
- RMSE is also valid -- adding the square root will not affect minimization, since the square root is monotonically increasing and so preserves where the minimum lies (see the second sketch below). MSE is likely more common for minor reasons, such as slightly better efficiency and cleaner theoretical proofs.
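
For concreteness, here is a minimal sketch of the loss and error functions described above (in Python; the setup and names are mine, not the slides'), assuming x is the thresholded prediction and y the target, both in {-1, 1}:

```python
def loss(x, y):
    """Per-example perceptron loss, assuming x, y are in {-1, 1}.
    Returns 0 when the prediction matches the target, 1 otherwise."""
    return max(0, -x * y)

def error(predictions, targets):
    """Error function: the sum of per-example losses, i.e. the
    number of misclassified examples."""
    return sum(loss(x, y) for x, y in zip(predictions, targets))

# Three predictions, one of which is wrong.
print(error([1, -1, 1], [1, -1, -1]))  # -> 1
```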
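
And a quick numerical check of the RMSE point: because the square root is monotonically increasing, MSE and RMSE are minimized by the same parameter. This is a toy 1-D example of my own, fitting y = w * x over a grid of candidate w:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.1, 3.9, 6.2])

candidates = np.linspace(0.0, 4.0, 401)          # candidate values of w
mse = np.array([np.mean((w * xs - ys) ** 2) for w in candidates])
rmse = np.sqrt(mse)                              # sqrt preserves the ordering

# Both criteria pick the same w.
assert candidates[mse.argmin()] == candidates[rmse.argmin()]
print(candidates[mse.argmin()])
```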
"the loss function max(0, -xy) is 0 if x and y are the same class and 1 if they are different"
Not exactly, because then you would be optimizing the number of correctly classified examples.
Instead, you minimize the sum of abs(WX) over the misclassified examples.
In the case of these slides, the loss function is max(0, -xy) and the error function is the sum of these. So, the error function is the number of incorrectly classified examples (if x and y are different, it adds 1 to the error), which is exactly what we hope to minimize.
The transfer function is applied only at evaluation.
In the formulas of the slides (and in the code), for training I compute the loss of an example X and its expected target as L(XW, target).
What you define is minimizing L(transfer(XW), target), which is not easily optimizable.
In the case of perceptrons, point taken -- I agree. However, my original statement still holds. The loss and error functions presented on the slides are still valid. Whether or not they are easily optimizable, they are still examples of loss and error functions.
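
To make the distinction concrete, here is a small sketch of the two formulations discussed in this thread (my own illustration, not the slides' code). On the raw score, max(0, -y * XW) equals abs(XW) on a misclassified example and is piecewise linear in W; applying the transfer function first yields the 0/1 misclassification count, which is flat almost everywhere and therefore hard to optimize:

```python
import numpy as np

def transfer(score):
    """Sign transfer function, applied only at evaluation time."""
    return 1 if score >= 0 else -1

def training_loss(W, X, target):
    """Loss on the raw score XW: equals abs(X @ W) when the example
    is misclassified, 0 otherwise. Piecewise linear in W."""
    return max(0.0, -target * (X @ W))

def zero_one_loss(W, X, target):
    """Loss on the transferred output: 0 or 1. Constant almost
    everywhere in W, so it gives no gradient signal."""
    return max(0, -target * transfer(X @ W))

W = np.array([0.5, -0.25])
X = np.array([2.0, 1.0])          # raw score XW = 0.75
print(training_loss(W, X, -1))    # 0.75 == abs(XW); shrinks as W improves
print(zero_one_loss(W, X, -1))    # 1: only says "wrong", nothing more
```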