Week 6 Machine Learning

If your machine learning algorithm is not working well, you can try the following.

Machine learning diagnostic: a test you can run to gain insight into what is or isn't working with a learning algorithm, and to get guidance on how best to improve its performance.

Overfitting
- Hold out part of the training examples as a test set.
- Typically a random 30% of the training examples are used as the test set.
- If the hypothesis overfits, the error on the training set will be low but the error on the test set will be high.
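
The hold-out check above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the helper names (train_test_split, mean_squared_error), the toy data, and the "memorizer" hypothesis are all assumptions made up for the example. The memorizer is a deliberately extreme overfitter, so its training error is zero while its test error is large.

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the examples and hold out test_fraction of them."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def mean_squared_error(predict, examples):
    """Average of (prediction - target)^2 over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in examples) / len(examples)

# Toy data (assumed for illustration): y = 2x plus Gaussian noise.
rng = random.Random(1)
data = [(x, 2 * x + rng.gauss(0, 1)) for x in range(100)]
train, test = train_test_split(data)

# An extreme overfitter: memorize the training targets and
# fall back to the training mean for any unseen input.
memory = dict(train)
mean_y = sum(y for _, y in train) / len(train)

def memorizer(x):
    return memory.get(x, mean_y)

train_error = mean_squared_error(memorizer, train)  # zero by construction
test_error = mean_squared_error(memorizer, test)    # much larger
print(f"train error = {train_error:.3f}, test error = {test_error:.3f}")
```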
Degree of polynomial
- Linear, quadratic, cubic, etc.
- Keep 20% of the examples as a cross validation set, and use it to choose the degree.
- A typical split: 60% training set, 20% cross validation set, 20% test set.
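
A rough sketch of choosing the degree with a 60/20/20 split follows. Everything named here is an assumption for illustration: the synthetic quadratic data, the helper fit_polynomial (a plain normal-equations solver that is only reasonable for small degrees), and the candidate degrees 1 to 5. Each candidate is fit on the training set only, the degree with the lowest cross validation error is chosen, and the test set is used once at the end to estimate generalization error.

```python
import random

def fit_polynomial(points, degree):
    """Least-squares polynomial fit via the normal equations (small degrees only)."""
    n = degree + 1
    # A^T A and A^T y for a design matrix with columns x^0 .. x^degree.
    ata = [[sum(x ** (i + j) for x, _ in points) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in points) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(ata[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (aty[i] - tail) / ata[i][i]
    return coeffs

def mse(coeffs, points):
    """Average squared error of the polynomial with the given coefficients."""
    def predict(x):
        return sum(c * x ** i for i, c in enumerate(coeffs))
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

# Synthetic data (assumed): a noisy quadratic on [-1, 1].
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
data = [(x, 1 + 2 * x - 3 * x * x + rng.gauss(0, 0.1)) for x in xs]
rng.shuffle(data)
train, cv, test = data[:120], data[120:160], data[160:]  # 60% / 20% / 20%

# Fit each degree on the training set, compare on the cross validation set.
cv_errors = {d: mse(fit_polynomial(train, d), cv) for d in range(1, 6)}
best_degree = min(cv_errors, key=cv_errors.get)

# Report generalization error on the untouched test set.
test_error = mse(fit_polynomial(train, best_degree), test)
print(cv_errors, best_degree, test_error)
```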

Bias problem (underfitting) - training set and cross validation set errors will both be high.
Variance problem (overfitting) - training set error is low, but cross validation error is high.

Bias or variance problem

Plot the average squared error against the training set size m (a learning curve).
Plot both the cross validation error (J_CV) and the training error (J_train).
As the training set gets larger, the training error of a fixed hypothesis such as a quadratic function increases, because it becomes harder to fit every example.
The error will plateau after a certain training set size m.
High bias - J_CV and J_train will both be high, and will converge to similar values for a sufficiently large training set. More data will not help.
High variance - J_train << J_CV, with a large gap between them at the current training set size. The gap narrows as the training set grows, so more data will likely help.
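
As a small sketch of a learning curve in the high bias case, the code below fits a straight line to noisy quadratic data for increasing m and records J_train and J_CV. The data, the helper names (fit_line, half_mse, sample), and the chosen values of m are assumptions for illustration. With m = 2 the line passes through both points, so J_train is essentially zero; as m grows, J_train rises and both errors settle at a similar, high value.

```python
import random

def fit_line(points):
    """Closed-form least-squares line y = a + b*x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def half_mse(model, points):
    """J = 1/(2m) * sum of squared errors, with no regularization term."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in points) / (2 * len(points))

rng = random.Random(0)

def sample(n):
    """Draw n noisy points from y = x^2 on [-1, 1] (assumed toy data)."""
    points = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        points.append((x, x * x + rng.gauss(0, 0.05)))
    return points

train_pool, cv_set = sample(200), sample(100)

curve = []  # rows of (m, J_train, J_cv)
for m in (2, 5, 10, 20, 50, 100, 200):
    model = fit_line(train_pool[:m])  # train on the first m examples only
    curve.append((m, half_mse(model, train_pool[:m]), half_mse(model, cv_set)))

for m, j_train, j_cv in curve:
    print(f"m={m:3d}  J_train={j_train:.4f}  J_cv={j_cv:.4f}")
```

Swapping the line for a high-degree polynomial on a small training set would show the high variance picture instead: a low J_train and a much higher J_CV.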

Smaller neural networks are prone to underfitting.
Larger neural networks are prone to overfitting, which can be addressed with regularization.
The number of hidden layers plays a role similar to the degree of the polynomial, and the same technique (comparing cross validation errors) can be used to choose it.

How should you spend your time to reduce the error of your algorithm?
- Collect lots of data.
- Develop sophisticated features.
Error analysis
- Start with a simple algorithm.
- Plot learning curves to decide whether you need more features or more data.
- Eyeball the examples where the algorithm goes wrong.
Use a single numerical evaluation metric for your algorithm, such as the error rate.
- This metric helps you decide whether your changes are useful or not.

Data may be skewed, containing far more examples of one class than another.
For example - far more patients (99%) who don't have cancer than patients who do.
Throwing out data to reduce the skew is incorrect, as the skew is representative of the real world.
The plain error rate is not a useful measure of performance for such an algorithm: always predicting "no cancer" would already give a 1% error rate.
In classification, you can raise the threshold above 0.5 so the algorithm predicts the positive class only when very confident (higher precision, lower recall), or lower it to avoid missing positives (higher recall, lower precision).
Precision(P) - true positives / (true positives + false positives)
- What fraction of predicted positives were actually positive.
- A good algorithm will have precision close to 1.
Recall(R) - true positives / (true positives + false negatives)
- What fraction of actual positives were predicted correctly.
- A good algorithm will have recall close to 1.
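
To make the two definitions concrete, here is a small sketch on a made-up skewed data set of 1000 patients with 1% positives. The function names and the two hypothetical classifiers are assumptions for illustration, as is the convention of returning 0.0 when a ratio would divide by zero. The always-negative classifier reaches 99% accuracy but has zero recall, which is why accuracy alone misleads on skewed classes.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision(tp, fp):
    # Assumed convention: 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Assumed convention: 0.0 when there are no actual positives.
    return tp / (tp + fn) if tp + fn else 0.0

# Made-up skewed data: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990

# Classifier A (hypothetical): always predicts negative.
always_negative = [0] * 1000
tp_a, fp_a, fn_a, tn_a = confusion_counts(y_true, always_negative)
accuracy_a = (tp_a + tn_a) / len(y_true)

# Classifier B (hypothetical): finds 8 of the 10 positives, with 4 false alarms.
model_b = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 986
tp_b, fp_b, fn_b, tn_b = confusion_counts(y_true, model_b)

print("A: accuracy", accuracy_a, "recall", recall(tp_a, fn_a))
print("B: precision", precision(tp_b, fp_b), "recall", recall(tp_b, fn_b))
```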
The trade-off between precision and recall is hard and depends on context.
Trying out different algorithms and comparing P and R for them is a good way to determine which algorithm is better.
F score - 2 * (P * R)/(P + R) - the higher the value, the better.
- This score is used to compare the performance of various algorithms.
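
The sketch below shows how the F score can compare several thresholds for one classifier. The predicted probabilities, labels, and candidate thresholds are made up for illustration, and the helper names are assumptions. Raising the threshold increases precision and lowers recall on this toy data, and the F score reduces each pair to a single number for comparison.

```python
def f_score(p, r):
    """F score = 2 * P * R / (P + R); taken as 0.0 when both are zero (assumed convention)."""
    return 2 * p * r / (p + r) if p + r else 0.0

def precision_recall_at(threshold, probs, labels):
    """Predict positive when the probability is >= threshold, then compute (P, R)."""
    preds = [1 if q >= threshold else 0 for q in probs]
    tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Made-up classifier outputs on a cross validation set.
probs = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

results = {}
for threshold in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(threshold, probs, labels)
    results[threshold] = (p, r, f_score(p, r))
    print(f"threshold={threshold}: P={p:.3f} R={r:.3f} F={results[threshold][2]:.3f}")

# Choose the threshold with the highest F score on this data.
best_threshold = max(results, key=lambda t: results[t][2])
```

In practice the threshold would be picked on the cross validation set, not on the test set.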

A very large data set makes even an “inferior algorithm” work better, almost on par with better algorithms.
However, having a large data set is not sufficient on its own.
Does the data contain the information needed to predict the output accurately?
Can a human expert confidently predict the output from an example's features?
For example - you can’t predict a house's price from data that contains only the size of the house, irrespective of how many examples you have.