> by some miracle you do, you just use a subgradient
This is the most succinct comment I have encountered on how people think about non-differentiability in deep learning.
This helped me reconcile my experiences with the deep learning paradigm. Thank you.
You see, in the numerical optimization of general mathematical models (e.g. where the model is a general nonlinear -- often nonconvex -- system of equations and constraints), you often do hit non-differentiable points by chance. This is why in mathematical modeling one is taught various techniques to promote model convergence. For instance, a formulation like x/y = k is rewritten as x = k * y to avoid division by zero during iteration (even if the final value of y is nonzero), and nonsmooth functions such as max(), min() and abs() are replaced with "smooth" approximations. In a general nonlinear/nonconvex model, when you encounter non-differentiability, you are liable to lose your descent direction and often end up losing your way (sometimes ending up with an infeasible solution).
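To make the two modeling tricks above concrete, here is a minimal sketch in plain Python. The function names and the smoothing parameter `eps` are my own illustrative choices, not taken from any particular modeling tool, and the square-root smoothing is just one common option among several.

```python
import math

def smooth_abs(x, eps=1e-6):
    """Smooth stand-in for abs(x): sqrt(x^2 + eps^2).

    Differentiable everywhere, including x == 0, at the cost of an
    error of at most eps (hypothetical smoothing parameter).
    """
    return math.sqrt(x * x + eps * eps)

def smooth_max(a, b, eps=1e-6):
    """Smooth stand-in for max(a, b), via max(a, b) = (a + b + |a - b|) / 2."""
    return 0.5 * (a + b + smooth_abs(a - b, eps))

def ratio_residual(x, y, k):
    """Residual of the original form x / y = k; fails when y == 0."""
    return x / y - k

def product_residual(x, y, k):
    """Residual of the reformulated form x = k * y; defined for every y."""
    return x - k * y
```

If an iterate happens to pass through y = 0, `ratio_residual` raises `ZeroDivisionError`, while `product_residual` still returns a finite value the solver can work with.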
However, it seems to me that the deep learning problem is an unconstrained optimization problem with chained basis functions (ReLU), so the chance of this happening is smaller, and subgradients provide a recovery method that lets the algorithm continue gracefully.
This is often not the experience for general nonlinear models, but I guess deep learning problems have a special form that lets you get away with it. This is very interesting.
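As a rough illustration of the "just use a subgradient" recovery, here is a small sketch in plain Python. The toy loss, the step-size schedule and all function names are assumptions I made for the example. Returning 0 at the kink is one common convention; any value in [0, 1] is a valid subgradient of ReLU at zero.

```python
def relu(x):
    return max(x, 0.0)

def relu_subgrad(x, at_kink=0.0):
    """Return one element of the subdifferential of ReLU at x.

    At x == 0 any value in [0, 1] is valid; at_kink picks one
    (0.0 here, as an assumed convention).
    """
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return at_kink

def loss(w):
    # |w| written with ReLUs, so the kink sits exactly at the minimizer w = 0.
    return relu(w) + relu(-w)

def loss_subgrad(w):
    # Chain rule applied with the chosen subgradient of each ReLU term.
    return relu_subgrad(w) - relu_subgrad(-w)

def subgradient_descent(w0, steps=50, lr0=0.5):
    """Subgradient descent with a diminishing step size lr0 / (t + 1)."""
    w = w0
    for t in range(steps):
        w -= lr0 / (t + 1) * loss_subgrad(w)
    return w
```

Calling `subgradient_descent(1.0)` keeps iterating across the kink at zero without any special handling; the iterate oscillates around the minimizer with shrinking amplitude rather than stalling or failing.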
I don't know why you think subgradient is that important. It's just a shorthand for anything reasonable. DNNs are overwhelmingly underdetermined and have many, many minimizers. It's not so important to find the best one (an impossible task for SGD) as to find one that is good enough.
> I don't know why you think subgradient is that important.
I underquoted. What interests me is more the approach to handling nondifferentiability in deep learning problems, whether it involves subgradients or some other recovery approach.
These approaches typically do not work well in general nonlinear systems, but they seem to be OK in deep learning problems. I hadn't read any attempts to explain this until I read the parent comment.
> It's just a shorthand for anything reasonable. DNNs are overwhelmingly underdetermined and have many many minimizers.
This is not true for general nonlinear systems, hence my interest.