Why does limiting the weight size prevent overfitting in machine learning?
The most popular way to prevent overfitting in machine learning (in logistic regression, neural networks, linear regression, etc.) is weight decay (L2, L1). The purpose of weight decay is to keep the weights from getting large.
My question is: why do small weights prevent overfitting?
And what if I do weight normalization after each iteration?
Solution 1:[1]
Imagine the parabola y = ax^2 + bx + c. The larger the coefficient a is, the skinnier the parabola and the more closely it can hug the data points. Overfitting happens when the curve fitted to the data follows the individual data points too closely (using large coefficients). Therefore, making the coefficients smaller, and generally sparser, can prevent overfitting.
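A quick way to see this numerically (a minimal NumPy sketch of my own; the polynomial degree, noise level, and penalty strength `lam` are arbitrary illustrative choices): fit a high-degree polynomial to noisy parabola data with plain least squares and with an L2 (ridge) penalty, and compare the size of the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 15)
y = x**2 + 0.1 * rng.standard_normal(x.size)   # noisy parabola data

# Design matrix for a degree-9 polynomial fit
X = np.vander(x, 10)

# Ordinary least squares: coefficients are free to grow large to chase the noise
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge (L2 weight decay): the penalty lam * ||w||^2 shrinks the coefficients
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("max |coefficient|, least squares:", np.abs(w_ols).max())
print("max |coefficient|, ridge:        ", np.abs(w_ridge).max())
```

The ridge coefficients should come out much smaller, and the corresponding curve should wiggle far less between the data points.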
Solution 2:[2]
A large subset of machine learning techniques have mathematical models that require large coefficients/weights to represent sudden changes, incoherence, or other high-dimensional phenomena exhibited by individual data points in the training data. By limiting the coefficients, one essentially limits the expressiveness of the model to "smooth" or low-dimensional solutions, which (depending on the specific problem you are trying to solve) might fit real-world data better under most metrics. In this sense it can be considered a smoothness prior, one that we establish heuristically by observing real-world data and then incorporate into the training process of the mathematical model as a regularization term.
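As a concrete illustration of such a regularization term (a sketch of my own, not from the answer): the L2 penalty is simply added onto the data-fitting part of the training objective.

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Squared-error data term plus an L2 penalty (weight decay).

    The lam * (w @ w) term acts as the smoothness prior described above:
    it trades a little training-set fit for smaller, smoother weights.
    """
    residual = X @ w - y
    return residual @ residual + lam * (w @ w)
```

Minimizing this with lam > 0 pulls the optimum toward smaller weights than the unpenalized loss would.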
Solution 3:[3]
A small example using logistic regression that explains the concept:
In logistic regression, the probability of y given x is
Pr(y | x) = 1 / (1 + exp(-y * w·x))
For y = 1, Pr(1 | x) goes to 1 as w·x goes to +infinity, and to 0 as w·x goes to -infinity.
You overfit your data if the probabilities on your training set are all 1 or 0 (or very close to it).
Therefore, adding a regularization term prevents your weights from going to infinity.
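To see this numerically, here is a small sketch of my own (a single trivially separable training point and an arbitrary lam): the unregularized negative log-likelihood keeps improving as w grows, so the optimizer would push w toward infinity, while the L2-penalized loss grows again and is minimized at a finite w.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 1.0      # one training point, label y = 1
lam = 0.1            # L2 regularization strength (arbitrary)

for w in [1.0, 5.0, 20.0, 100.0]:
    p = sigmoid(y * w * x)             # Pr(y | x) under weight w
    nll = -np.log(p)                   # unregularized loss: keeps shrinking as w grows
    penalized = nll + lam * w**2       # regularized loss: grows again for large w
    print(f"w={w:6.1f}  Pr(y|x)={p:.6f}  nll={nll:.6f}  nll + lam*w^2={penalized:.2f}")
```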
Solution 4:[4]
Yes. Consider two models:
yhat=x1+x2
yhat=10x1+10x2
The result of yhat when passed through an activation function gives us the probability. We shall use a sigmoid activation function for our case.
We shall make predictions for two points (1,1) and (-1,-1) just for the sake of simplicity.
In the first model's case:
For (1,1)
yhat=(1+1)=2
sigmoid(2)=0.880797
For (-1,-1)
yhat=(-1-1)=-2
sigmoid(-2)=0.1192
Now for our second model
For (1,1):
yhat=(10(1)+10(1))=20
sigmoid(20)=0.99999
For (-1,-1):
yhat=(10(-1)+10(-1))=-20
sigmoid(-20)=0.000000002
You might be tempted to believe that since the second model gives us better probabilities, it is the better model. But that is not the case. Here is some sample code along with the plotted models.
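A rough sketch of such a comparison (not the original answer's code; it assumes NumPy and Matplotlib) that plots each model's probability and its derivative along the direction x1 = x2 = t:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Walk along the direction x1 = x2 = t and compare the two models
t = np.linspace(-3, 3, 300)

for name, w in [("Model 1 (w=1)", 1.0), ("Model 2 (w=10)", 10.0)]:
    score = w * t + w * t              # yhat = w*x1 + w*x2 with x1 = x2 = t
    p = sigmoid(score)
    grad = 2 * w * p * (1 - p)         # d/dt of sigmoid(2*w*t): spiky when w is large
    plt.plot(t, p, label=f"{name}: probability")
    plt.plot(t, grad, "--", label=f"{name}: derivative")

plt.xlabel("x1 = x2 = t")
plt.legend()
plt.show()
```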

From the above two models, you can see that Model 2 no doubt gives us better probabilities. But it runs into trouble when we perform gradient descent: its derivative is huge in the middle and almost zero at the edges, whereas with the first model we get usable derivatives throughout. Model 2 is too certain about everything and doesn't generalize well.
Hence the weights should be optimal: not too large and not too small.
Reference: https://www.quora.com/Why-will-a-large-number-of-weights-cause-overfit-in-a-multi-layered-perceptron
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | rahulm |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | Professor |

