Character Separation

Character separation is a simple step in scanning an image in NewOCR. Since each piece of a character (the dot of an i, the top part of an equals sign, the two separate circles of a percent sign, etc.) is defined as a completely separate character, no character merging is required at this stage.

Line Separation

Before characters are separated, a process called line separation occurs. It detects horizontal breaks between rows of pixels to find each line of characters in the training image. Sometimes a character or fragment, such as the dot of an i or a _, sits above or below all other characters and is detected as a separate line. This is handled by the OCROptions#setMaxPercentDiffToMerge(double) method, which sets how far away from the main line, as a percentage of that line's height in pixels, another line may be and still be merged into it, since such a line is most likely part of the same line of text. This merging can only be done during training, because the training image has a known, consistent format of characters and lines.
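
The line separation and merging described above can be sketched roughly as follows. This is a minimal illustration, not NewOCR's actual implementation: the class and method names are hypothetical, the image is assumed to be already binarized into a boolean grid (true means black), the threshold is assumed to be a fraction between 0 and 1, and the taller of two neighbouring lines is assumed to be the "main" line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from NewOCR.
class LineSeparationSketch {

    /** Returns {top, bottomExclusive} ranges of consecutive rows that contain a black pixel. */
    static List<int[]> findLines(boolean[][] black) {
        List<int[]> lines = new ArrayList<>();
        int start = -1; // top row of the line being tracked, or -1 if between lines
        for (int y = 0; y < black.length; y++) {
            boolean hasInk = false;
            for (boolean pixel : black[y]) {
                if (pixel) {
                    hasInk = true;
                    break;
                }
            }
            if (hasInk && start < 0) {
                start = y;                       // a new line begins
            } else if (!hasInk && start >= 0) {
                lines.add(new int[] {start, y}); // a horizontal break ends the line
                start = -1;
            }
        }
        if (start >= 0) {
            lines.add(new int[] {start, black.length});
        }
        return lines;
    }

    /**
     * Merges a line into the one above it when the gap between them is at most
     * maxPercentDiff (assumed to be a fraction) of the taller line's height.
     */
    static List<int[]> mergeCloseLines(List<int[]> lines, double maxPercentDiff) {
        List<int[]> merged = new ArrayList<>();
        for (int[] line : lines) {
            if (!merged.isEmpty()) {
                int[] prev = merged.get(merged.size() - 1);
                int gap = line[0] - prev[1];
                int mainHeight = Math.max(prev[1] - prev[0], line[1] - line[0]);
                if (gap <= maxPercentDiff * mainHeight) {
                    prev[1] = line[1]; // extend the previous line downwards
                    continue;
                }
            }
            merged.add(new int[] {line[0], line[1]});
        }
        return merged;
    }
}
```

With a threshold of 0.5, a one-pixel dot sitting one row above a four-row body would be merged into that body, while a line four rows below it would stay separate.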

Character Separation

The first thing the OCR does is go through all black pixels of the input image (after image binarization). For every black pixel it finds, it recursively collects all touching black pixels.

Each group of connected pixels is then compiled into a single character object.
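
The pixel grouping described above amounts to a flood fill over connected black pixels. The sketch below shows one way it could look, under several assumptions: the names are hypothetical and not taken from NewOCR, the Blob class is only a stand-in for the real character object, diagonal neighbours are assumed to count as touching, and an explicit stack is used in place of literal recursion so that large shapes cannot overflow the call stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; none of these names come from NewOCR.
class CharacterSeparationSketch {

    /** Stand-in for a character object: a bounding box plus a pixel count. */
    static final class Blob {
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE;
        int maxX = Integer.MIN_VALUE, maxY = Integer.MIN_VALUE;
        int pixelCount = 0;

        void add(int x, int y) {
            minX = Math.min(minX, x);
            minY = Math.min(minY, y);
            maxX = Math.max(maxX, x);
            maxY = Math.max(maxY, y);
            pixelCount++;
        }
    }

    /** Groups touching black pixels (true cells) into blobs, scanning row by row. */
    static List<Blob> findBlobs(boolean[][] black) {
        int height = black.length;
        int width = height == 0 ? 0 : black[0].length;
        boolean[][] visited = new boolean[height][width];
        List<Blob> blobs = new ArrayList<>();

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (!black[y][x] || visited[y][x]) {
                    continue;
                }
                // Unvisited black pixel: flood-fill everything connected to it.
                Blob blob = new Blob();
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[] {x, y});
                visited[y][x] = true;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    blob.add(p[0], p[1]);
                    for (int dy = -1; dy <= 1; dy++) {
                        for (int dx = -1; dx <= 1; dx++) {
                            int nx = p[0] + dx;
                            int ny = p[1] + dy;
                            if (nx < 0 || ny < 0 || nx >= width || ny >= height) {
                                continue; // outside the image
                            }
                            if (!black[ny][nx] || visited[ny][nx]) {
                                continue; // white, already collected, or the pixel itself
                            }
                            visited[ny][nx] = true;
                            stack.push(new int[] {nx, ny});
                        }
                    }
                }
                blobs.add(blob);
            }
        }
        return blobs;
    }
}
```

Run on a small image of the letter i, this sketch would report the dot and the stem as two separate blobs, which matches the note above that each piece of a character is treated as its own character at this stage.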