OCR (Optical Character Recognition)

The OCR model is an LSTM engine with an alphabet Lambda. Each character position has as many candidates as there are elements in the Lambda alphabet. The candidates can be filtered using fine-tuning.

Patterns

Let’s say the field is a date of birth with an expected pattern of dd/mm/yyyy. Pattern matching means we try to make sure the candidates match that format. For example, the 3rd position is expected to be a slash (‘/’) character. We may have ‘7’ or ‘I’ as candidates, and the disambiguator will lean toward ‘/’ if pattern matching is enabled.
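
The idea can be sketched as follows. This is a minimal illustration, not the SDK's actual implementation: the function names, the per-position score dictionaries and the `boost` factor are all assumptions made for the example.

```python
import string

# Hypothetical mapping: the letters d, m and y in the pattern stand for a digit.
PATTERN_CLASSES = {
    "d": set(string.digits),
    "m": set(string.digits),
    "y": set(string.digits),
}

def allowed_chars(symbol):
    # Any symbol that is not d/m/y is treated as a literal character (e.g. '/').
    return PATTERN_CLASSES.get(symbol, {symbol})

def disambiguate(candidates, pattern, boost=10.0):
    """Pick one character per position.

    candidates: list with one dict {character: score} per position (assumed format).
    pattern:    expected format, e.g. "dd/mm/yyyy".
    boost:      assumed multiplier applied to characters that match the pattern.
    """
    result = []
    for scores, symbol in zip(candidates, pattern):
        allowed = allowed_chars(symbol)
        # Characters compatible with the pattern get their score multiplied,
        # so the choice leans toward them without forbidding the others.
        best = max(scores, key=lambda c: scores[c] * (boost if c in allowed else 1.0))
        result.append(best)
    return "".join(result)

# First three positions of a date: the raw scores favour '7' at position 3,
# but the pattern expects '/'.
candidates = [
    {"1": 0.9, "I": 0.1},
    {"2": 1.0},
    {"7": 0.5, "/": 0.3, "I": 0.2},
]
print(disambiguate(candidates, "dd/mm/yyyy"))  # prints 12/
```

A multiplicative boost is used here because the text says the disambiguator "leans toward" the expected character instead of strictly rejecting the alternatives.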

Dictionaries

We know that the word “The” is more frequent than “7he” in English. The dictionary is used to disambiguate word-level candidates. Patterns are character-level while dictionaries are word-level.
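
A word-level disambiguation of this kind could look like the sketch below. It is only an assumption about how such a step might work: the frequency table, the `floor` value for unknown words and the function name `pick_word` are hypothetical and do not come from the SDK.

```python
# Hypothetical relative word frequencies; a real dictionary would be much larger.
WORD_FREQ = {"the": 0.05, "of": 0.03}

def pick_word(word_candidates, word_freq, floor=1e-6):
    """Choose the most plausible word.

    word_candidates: dict {word: OCR score} (assumed format).
    word_freq:       dict {lower-case word: relative frequency}.
    floor:           assumed frequency given to words missing from the dictionary.
    """
    # Weight each OCR score by how frequent the word is in the dictionary.
    return max(
        word_candidates,
        key=lambda w: word_candidates[w] * word_freq.get(w.lower(), floor),
    )

# The OCR slightly prefers "7he", but the dictionary makes "The" win.
print(pick_word({"7he": 0.55, "The": 0.45}, WORD_FREQ))  # prints The
```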

Please note that we don’t offer a configuration entry to enable/disable this feature. For now it’s always disabled.

Alphabet

The alphabet is used to restrict the output characters.

White-list

Let’s say the field is a date of birth with an expected pattern of dd/mm/yyyy. The white-list will be 0123456789/. For example, the 1st position is expected to be a digit. We may have ‘1’ or ‘I’ as candidates, and the disambiguator will pick ‘1’ if white-listing is enabled, since ‘I’ is not in the white-list.

It’s very important not to use fake documents to test the SDK. For example, US and British passports use UPPERCASE characters for the Surname and GivenNames, so testing with a fake document that uses lower-case names will fail by design.

The OCR result will never contain a character not present in the white-list.
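
This guarantee can be illustrated with a minimal sketch. The candidate format and the function name `apply_whitelist` are assumptions made for the example, not the SDK's API.

```python
def apply_whitelist(candidates, whitelist):
    """Keep only white-listed characters, then take the best one per position.

    candidates: list with one dict {character: score} per position (assumed format).
    whitelist:  string containing every character allowed in the output.
    """
    allowed = set(whitelist)
    out = []
    for scores in candidates:
        # Discard every candidate that is not in the white-list.
        kept = {c: s for c, s in scores.items() if c in allowed}
        if not kept:
            # Assumption: a position with no allowed candidate is dropped.
            continue
        out.append(max(kept, key=kept.get))
    return "".join(out)

# 'I' has the higher raw score at position 1, but it is not white-listed.
candidates = [{"I": 0.6, "1": 0.4}, {"2": 1.0}, {"/": 0.7, "7": 0.3}]
print(apply_whitelist(candidates, "0123456789/"))  # prints 12/
```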

Black-list

Let’s say the field is an address. The address could be in English, French, Arabic, Hebrew… It’s not possible to use a white-list because it’s not possible to list every possible character. However, it’s possible to add a black-list. For example, we know that an address will not contain a dollar sign ($), a percent sign (%), an exclamation mark (!)… This is where a black-list is needed.

The OCR result will never contain a character present in the black-list.
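
As a minimal sketch of this behaviour (the candidate format and the function name `apply_blacklist` are assumptions made for the example, not the SDK's API):

```python
def apply_blacklist(candidates, blacklist):
    """Remove black-listed characters, then take the best one per position.

    candidates: list with one dict {character: score} per position (assumed format).
    blacklist:  string containing every character forbidden in the output.
    """
    banned = set(blacklist)
    out = []
    for scores in candidates:
        # Discard every candidate that appears in the black-list.
        kept = {c: s for c, s in scores.items() if c not in banned}
        if not kept:
            # Assumption: a position with only banned candidates is dropped.
            continue
        out.append(max(kept, key=kept.get))
    return "".join(out)

# '$' has the higher raw score at position 1, but it is black-listed.
candidates = [{"$": 0.6, "S": 0.4}, {"t": 1.0}]
print(apply_blacklist(candidates, "$%!"))  # prints St
```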

Characters segmentation

We use an LSTM engine for OCR, which doesn’t require segmenting the characters. The text segmenter is used to detect the baselines in order to “de-rotate” and “de-skew” the text before OCR. This is very important and significantly improves the accuracy. Text deskew paper: Efficient character skew rectification in scene text images.

We support 2 types of segmentation: