Week 11 Machine Learning

Train a classifier to detect when an image patch is centered on the gap between two adjoining characters.
If it classifies a patch as the middle of two characters, mark that position as a boundary.
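
A minimal sketch of that boundary detection as a one-dimensional sliding window, assuming the text line is a small grid of pixel intensities. The names find_boundaries and blank_centre_column are hypothetical, and blank_centre_column is only a stand-in for a trained classifier.

```python
from typing import Callable, List, Sequence

Image = Sequence[Sequence[float]]  # rows of pixel intensities


def find_boundaries(
    image: Image,
    window_width: int,
    is_gap: Callable[[List[Sequence[float]]], bool],
    step: int = 1,
) -> List[int]:
    """Slide a fixed-width window left to right across a text-line image.

    Returns the x-coordinate of the centre of every window that the
    classifier `is_gap` labels as the middle of two characters.
    """
    width = len(image[0])
    boundaries = []
    for x in range(0, width - window_width + 1, step):
        patch = [row[x:x + window_width] for row in image]
        if is_gap(patch):
            boundaries.append(x + window_width // 2)
    return boundaries


def blank_centre_column(patch: List[Sequence[float]]) -> bool:
    """Hypothetical stand-in for a trained classifier:
    outputs y=1 when the centre column of the patch contains no ink."""
    centre = len(patch[0]) // 2
    return all(row[centre] == 0 for row in patch)


# Toy 2x7 "image": 1 is ink, 0 is background. Columns 2 and 5 are gaps.
image = [
    [1, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1],
]
boundaries = find_boundaries(image, window_width=3, is_gap=blank_centre_column)
print(boundaries)
```

In a real pipeline, is_gap would be the trained classifier's prediction on each patch rather than a hand-written rule.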

Photo OCR

Generate data using off-the-shelf fonts and other image processing techniques to mimic real-world data.
Alternatively, start from a small real-world training set and use image processing techniques to generate more samples from it.
Introduce Gaussian noise or distortions to generate more samples.
Make sure the classifier has low bias first; adding more samples only helps a low-bias (high-variance) classifier.
If you can get more real-world data, that might be a better use of resources than spending time on artificial data synthesis.
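
A minimal sketch of amplifying one labeled sample with noise and a simple distortion, assuming images are lists of rows with pixel values in [0, 1]. The function names, the horizontal shift used as the distortion, and the default parameters are all illustrative assumptions.

```python
import random
from typing import List, Tuple

Image = List[List[float]]


def add_gaussian_noise(image: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise to every pixel, clipped to [0, 1]."""
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


def shift_right(image: Image, pixels: int) -> Image:
    """A simple distortion: shift the image right, padding with background (0.0)."""
    return [[0.0] * pixels + list(row[:len(row) - pixels]) for row in image]


def synthesize(
    image: Image,
    label: str,
    n_copies: int,
    sigma: float = 0.05,
    max_shift: int = 1,
    seed: int = 0,
) -> List[Tuple[Image, str]]:
    """Generate n_copies distorted, noisy variants of one labeled sample.

    The label is carried over unchanged, so the distortions must be mild
    enough that the character is still recognizable.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_copies):
        shifted = shift_right(image, rng.randint(0, max_shift))
        samples.append((add_gaussian_noise(shifted, sigma, rng), label))
    return samples


# Toy 3x3 glyph standing in for a real character image.
seed_image = [
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]
samples = synthesize(seed_image, "A", n_copies=5)
print(len(samples), samples[0][1])
```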

Mimic a stage working at 100% accuracy by taking the training set labels and using them directly as the output of that stage.
So if the label is y=1, the stage outputs 1; otherwise it outputs 0.
This is similar to mocking in classical unit testing.
Mock out each stage of the pipeline step by step and observe the overall pipeline performance at each step.
If mocking a stage at 100% accuracy does not change the overall performance by much, that stage is not an immediate priority to work on.
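
A minimal sketch of this procedure, assuming each stage is a function and every example carries the correct output for every stage. Here stages are mocked cumulatively from first to last, which is one reasonable reading of "step by step". The toy stages detect, segment and recognize are hypothetical, deliberately flawed stand-ins, not a real OCR pipeline.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

Stage = Tuple[str, Callable[[Any], Any]]
Example = Tuple[Any, Dict[str, Any]]  # (raw input, correct output per stage)


def run_pipeline(stages: List[Stage], raw: Any,
                 truth: Dict[str, Any], mocked: Set[str]) -> Any:
    """Run the pipeline, replacing each mocked stage's output with its label."""
    value = raw
    for name, fn in stages:
        value = truth[name] if name in mocked else fn(value)
    return value


def ceiling_analysis(stages: List[Stage],
                     examples: List[Example]) -> List[Tuple[str, float]]:
    """Overall accuracy at baseline, then with each stage cumulatively mocked."""
    final_name = stages[-1][0]
    mocked: Set[str] = set()
    results = []
    for step in ["baseline"] + [name for name, _ in stages]:
        if step != "baseline":
            mocked.add(step)
        correct = sum(
            run_pipeline(stages, raw, truth, mocked) == truth[final_name]
            for raw, truth in examples
        )
        results.append((step, correct / len(examples)))
    return results


# Hypothetical toy stages with deliberate flaws.
def detect(raw: str) -> str:
    return raw.lstrip()  # flaw: leaves trailing whitespace


def segment(text: str) -> List[str]:
    return list(text)


def recognize(chars: List[str]) -> str:
    return "".join(chars).upper().replace("L", "1")  # flaw: misreads L as 1


def make_example(raw: str) -> Example:
    text = raw.strip()
    return raw, {"detect": text, "segment": list(text), "recognize": text.upper()}


stages = [("detect", detect), ("segment", segment), ("recognize", recognize)]
examples = [make_example(r) for r in [" ab", "cd ", "lo", "ef"]]
results = ceiling_analysis(stages, examples)
for step, accuracy in results:
    print(f"{step:10s} {accuracy:.2f}")
```

In this toy run, mocking segment adds nothing over mocking detect, so segment would be the lowest priority, while detect and recognize each account for a quarter of the examples.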

Ceiling Analysis