The Problem of Overfitting
Regularization is designed to address the problem of overfitting.
High bias, or underfitting, occurs when the form of our hypothesis function maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.
At the other extreme, overfitting, or high variance, occurs when the hypothesis function fits the available data closely but does not generalize well to predict new data. It is usually caused by an overly complicated function that creates a lot of unnecessary curves and angles unrelated to the underlying trend of the data.
This terminology is applied to both linear and logistic regression.
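The two failure modes above can be illustrated with a small experiment. The following is a minimal sketch, not part of the original notes: it assumes synthetic data drawn from a quadratic trend plus noise (the function `true_fn`, the noise level, and the chosen degrees are all hypothetical), and fits polynomials of increasing degree with NumPy's `polyfit`. Training error can only go down as the degree grows; error on held-out points typically does not follow it.

```python
import numpy as np

def true_fn(x):
    # Hypothetical underlying trend: a simple quadratic.
    return 1.0 + 2.0 * x - 3.0 * x ** 2

def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Least-squares polynomial fit; returns (train_mse, test_mse)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 12)
x_test = np.linspace(0.04, 0.96, 12)   # held-out points between the training points
y_train = true_fn(x_train) + rng.normal(0.0, 0.1, x_train.size)
y_test = true_fn(x_test) + rng.normal(0.0, 0.1, x_test.size)

# degree 1: too simple (high bias); degree 2: matches the trend;
# degree 8: flexible enough to chase the noise (high variance).
results = {}
for degree in (1, 2, 8):
    results[degree] = fit_and_score(degree, x_train, y_train, x_test, y_test)
    train_mse, test_mse = results[degree]
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Comparing the two columns is the point of the sketch: a model that underfits has a high error on both sets, while a model that overfits shows a low training error that is not matched on the held-out points.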
There are two main options to address the issue of overfitting:
- Reduce the number of features.
  - Manually select which features to keep.
  - Use a model selection algorithm (studied later in the course).
- Regularization.
  - Keep all the features, but reduce the magnitude of the parameters θj.
Regularization works well when we have a lot of slightly useful features.
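To make "keep all the features, but reduce the parameters" concrete, here is a minimal sketch under stated assumptions. It is not the method given in these notes: it assumes a squared-error cost with an added penalty `lam * sum(theta_j^2)` for j ≥ 1, solved in closed form with `np.linalg.solve`. The helper name `fit_regularized`, the synthetic data, and the value `lam=10.0` are hypothetical choices for illustration.

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Minimize ||X @ theta - y||^2 + lam * sum(theta[1:] ** 2).

    Assumption: column 0 of X is the intercept column of ones,
    and the intercept theta[0] is left unpenalized.
    """
    n = X.shape[1]
    penalty = lam * np.eye(n)
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

rng = np.random.default_rng(0)
m = 20
features = rng.normal(size=(m, 8))            # eight slightly useful features
X = np.column_stack([np.ones(m), features])   # prepend the intercept column
y = features @ np.full(8, 0.3) + rng.normal(0.0, 0.5, m)

theta_plain = fit_regularized(X, y, lam=0.0)    # no regularization
theta_shrunk = fit_regularized(X, y, lam=10.0)  # hypothetical penalty strength

print("parameter norm without penalty:", np.linalg.norm(theta_plain[1:]))
print("parameter norm with penalty:   ", np.linalg.norm(theta_shrunk[1:]))
```

Every feature still has a parameter in `theta_shrunk`; none is dropped. The penalty only pulls the parameters toward zero, so their overall magnitude is smaller than in the unpenalized fit.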
Cost Function
If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.
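As a sketch of what "increasing their cost" can look like, the example below adds a penalty on the squares of selected parameters to an ordinary squared-error cost. The function name `penalized_cost`, the fourth-degree polynomial features, and the penalty weight of 1000 on the cubic and quartic terms are assumptions made for illustration, not values taken from the text above.

```python
import numpy as np

def penalized_cost(theta, X, y, penalty):
    """Squared-error cost plus a per-parameter penalty on theta_j ** 2."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m) + np.sum(penalty * theta ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([x ** p for p in range(5)])   # columns: 1, x, x^2, x^3, x^4
y = 1.0 + 2.0 * x + 0.5 * x ** 2                  # data that is purely quadratic

# Hypothetical weights: make the x^3 and x^4 terms expensive, leave the rest free.
penalty = np.array([0.0, 0.0, 0.0, 1000.0, 1000.0])

theta_quadratic = np.array([1.0, 2.0, 0.5, 0.0, 0.0])  # no cubic or quartic term
theta_cubic = np.array([1.0, 2.0, 0.5, 0.1, 0.0])      # small cubic term added

cost_quadratic = penalized_cost(theta_quadratic, X, y, penalty)
cost_cubic = penalized_cost(theta_cubic, X, y, penalty)  # about 10.99, mostly penalty

print("cost with quadratic parameters:", cost_quadratic)
print("cost with a small cubic term:  ", cost_cubic)
```

Even a cubic coefficient of 0.1 adds 1000 × 0.1² = 10 to the cost, so any procedure that minimizes this cost is pushed toward values of those parameters near zero, which leaves a hypothesis that behaves almost like a quadratic.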