Character Mergence

After characters are identified, many of them are still split into separate pieces, such as the two parts of an = or a :, or the dots of characters such as ! or ?. This is solved with merge rules, which specify which characters to merge together after everything has been detected.

Defining Merge Rules

Merge rules are loaded from a list in the default configuration file. Each entry is the canonical path to a class extending MergeRule, which provides the basic methods the system uses to merge characters. These rules are added in the DefaultMergenceManager class. The individual rules rely on data collected during training, primarily the distances between character parts. Each distance is divided by the width to make it scale-independent and is averaged during training, so a rule can generate a projected distance after fetching the ratio from the database.
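The ratio bookkeeping described above can be sketched roughly as follows. This is only an illustration: the class `DistanceRatioSketch` and its method names are hypothetical and are not taken from the project, and each training sample is assumed to be a `{distance, width}` pair.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of the project's API.
final class DistanceRatioSketch {
    // Normalizes a measured distance by the width so the value is scale-independent.
    static double ratio(double distance, double width) {
        return distance / width;
    }

    // Averages the ratios of all training samples; each sample is {distance, width}.
    static double averageRatio(List<double[]> samples) {
        double sum = 0;
        for (double[] sample : samples) {
            sum += ratio(sample[0], sample[1]);
        }
        return sum / samples.size();
    }

    // Scales a trained ratio back up to the width of the character being examined.
    static double projectedDistance(double trainedRatio, double width) {
        return trainedRatio * width;
    }
}
```

A rule would then compare the projected distance against the distance it actually measures between two parts.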

Each merge rule is either horizontal or vertical. Horizontal rules are given a list of characters that lie on the same horizontal line, and vertical rules are given a list of characters that overlap one another vertically.

After the rules are processed, any remaining individual parts of characters (such as the top dot of an `i`, a single part of a `"`, or a single part of an `=`) are redefined as their next closest match from recognition. The system then checks that the new match is not a lone part again, and if it is not, it continues.

The following sections briefly describe what each default merge rule does.

ApostropheMergeRule

The ApostropheMergeRule merges two vertical lines together to become a ".

The rule is horizontal. It first makes sure both inputs are unmerged vertical lines, then checks that their heights are within 25% of each other, so they cannot be an apostrophe and a pipe, for instance. After this, it goes through the other characters on the current horizontal line, finds an alphanumeric character, and makes sure the height of the apostrophe is no more than 50% of that character's height, to ensure the two vertical lines of the quote don't fill the line the way two pipes would. Finally, if the distance between the two characters is within the projected length (from a ratio collected during training), the characters are merged together and labeled as a quote.
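A minimal sketch of these checks, under several assumptions: the `Part` type, the `shouldMerge` signature, and all field names are hypothetical stand-ins rather than the project's classes. "Within 25% of each other" is interpreted here relative to the taller line, and the trained ratio is assumed to be scaled by the left line's width.

```java
// Hypothetical sketch; the real ApostropheMergeRule may be structured differently.
final class QuoteMergeSketch {
    // Minimal stand-in for a detected character part.
    static final class Part {
        final double x, width, height;
        final boolean verticalLine, merged;

        Part(double x, double width, double height, boolean verticalLine, boolean merged) {
            this.x = x;
            this.width = width;
            this.height = height;
            this.verticalLine = verticalLine;
            this.merged = merged;
        }
    }

    // referenceHeight: height of an alphanumeric character found on the same line.
    // trainedRatio: averaged (gap / width) ratio for a quote from training.
    static boolean shouldMerge(Part left, Part right, double referenceHeight, double trainedRatio) {
        // Both inputs must be unmerged vertical lines.
        if (!left.verticalLine || !right.verticalLine || left.merged || right.merged) {
            return false;
        }
        double taller = Math.max(left.height, right.height);
        double shorter = Math.min(left.height, right.height);
        // Heights must be within 25% of each other (assumed: relative to the taller line).
        if (taller - shorter > 0.25 * taller) {
            return false;
        }
        // The lines may be at most 50% as tall as the reference character.
        if (taller > 0.5 * referenceHeight) {
            return false;
        }
        // The horizontal gap must not exceed the distance projected from training.
        double gap = right.x - (left.x + left.width);
        double projected = trainedRatio * left.width;
        return gap <= projected;
    }
}
```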

EqualVerticalMergeRule

The EqualVerticalMergeRule merges two identical pieces vertically. At the moment this covers only : and =.

This is a vertical rule, which first ensures both inputs are unmerged. Then, the estimated vertical distance is set from either the trained distance ratio of an equals sign, if both inputs are horizontal lines, or a colon's distance ratio, if both are dots. If the vertical distance between the characters is within the distance projected from that ratio, they are merged together.
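The ratio selection can be sketched as below. The `Shape` enum, the method names, and the choice to scale the ratio by a single `width` argument are assumptions made for illustration, not the project's actual API.

```java
import java.util.OptionalDouble;

// Hypothetical sketch of how an equals sign or colon might be chosen and checked.
final class EqualVerticalSketch {
    enum Shape { HORIZONTAL_LINE, DOT, OTHER }

    // Picks the trained ratio that applies to this pair, or empty if the pair is not a candidate.
    static OptionalDouble ratioFor(Shape top, Shape bottom, double equalsRatio, double colonRatio) {
        if (top == Shape.HORIZONTAL_LINE && bottom == Shape.HORIZONTAL_LINE) {
            return OptionalDouble.of(equalsRatio);
        }
        if (top == Shape.DOT && bottom == Shape.DOT) {
            return OptionalDouble.of(colonRatio);
        }
        return OptionalDouble.empty();
    }

    // Merges only unmerged, matching pieces whose gap is within the projected distance.
    static boolean shouldMerge(Shape top, Shape bottom, boolean eitherMerged,
                               double verticalGap, double width,
                               double equalsRatio, double colonRatio) {
        OptionalDouble ratio = ratioFor(top, bottom, equalsRatio, colonRatio);
        if (eitherMerged || !ratio.isPresent()) {
            return false;
        }
        return verticalGap <= ratio.getAsDouble() * width;
    }
}
```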

OverDotMergeRule

The OverDotMergeRule merges dots above a character with their lower bases. This includes i, j, and ;.

This rule is vertical. It starts by making sure the base is valid and the upper character is a dot, and then selects the correct vertical distance ratio from training according to what the base is. If the difference between the heights and the projected distance from the training ratio is within 50% of the projected value, the characters are merged together and labeled as the correct character.
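One way to read the 50% tolerance check is sketched below. The names are hypothetical, the trained ratios are assumed to be keyed by the character the merge would produce, and the measured value is assumed to be the vertical gap between the dot and its base.

```java
import java.util.Map;

// Hypothetical sketch of the tolerance check; not the project's implementation.
final class OverDotSketch {
    // True when the measured gap deviates from the projection by at most
    // the given fraction of the projected value.
    static boolean withinTolerance(double actualGap, double projectedGap, double tolerance) {
        return Math.abs(actualGap - projectedGap) <= tolerance * projectedGap;
    }

    // result: the character the merge would produce (for example 'i', 'j' or ';').
    static boolean shouldMerge(char result, boolean upperIsDot, double actualGap,
                               double baseWidth, Map<Character, Double> trainedRatios) {
        Double ratio = trainedRatios.get(result);
        // An unknown base or a non-dot upper part is never merged.
        if (!upperIsDot || ratio == null) {
            return false;
        }
        double projectedGap = ratio * baseWidth;
        return withinTolerance(actualGap, projectedGap, 0.5);
    }
}
```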

PercentMergeRule

The PercentMergeRule merges the circles of a percent sign with its forward slash.

This is a horizontal rule that takes three inputs. It ensures that two of them are empty circles and the third is a forward slash character. Then, if all of them are overlapping, it merges them.
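A rough sketch of this check follows. The `Box` and `Kind` types are hypothetical, and "all are overlapping" is assumed here to mean that each circle's bounding box intersects the slash's bounding box; the project may define overlap differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real PercentMergeRule may test overlap differently.
final class PercentMergeSketch {
    enum Kind { CIRCLE, FORWARD_SLASH, OTHER }

    // Axis-aligned bounding box of a detected part.
    static final class Box {
        final Kind kind;
        final double x, y, width, height;

        Box(Kind kind, double x, double y, double width, double height) {
            this.kind = kind;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        boolean intersects(Box o) {
            return x < o.x + o.width && o.x < x + width
                && y < o.y + o.height && o.y < y + height;
        }
    }

    static boolean shouldMerge(List<Box> parts) {
        if (parts.size() != 3) {
            return false;
        }
        List<Box> circles = new ArrayList<>();
        Box slash = null;
        for (Box part : parts) {
            if (part.kind == Kind.CIRCLE) {
                circles.add(part);
            } else if (part.kind == Kind.FORWARD_SLASH && slash == null) {
                slash = part;
            } else {
                return false; // a second slash or an unrelated shape
            }
        }
        // Exactly two circles and one slash are required.
        if (circles.size() != 2 || slash == null) {
            return false;
        }
        return circles.get(0).intersects(slash) && circles.get(1).intersects(slash);
    }
}
```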

UnderDotMergeRule

The UnderDotMergeRule merges dots below a base character. This currently includes ? and !.

This rule is vertical. It begins by ensuring the base is valid, the character below is a dot, and neither has been merged before. Following this, it gets the correct vertical distance ratio for the base, and merges and labels the characters if the projected distance from the ratio and the actual vertical distance are at least 75% of the projected distance apart.