Today we will briefly walk through how OCR technology works, without going into the interpretation or derivation of specific algorithms. After all, each algorithm could fill a long article of its own.
Broadly speaking, OCR is divided into two stages: image processing and text recognition.
Before recognizing characters, we need to preprocess the original image to prepare it for feature extraction and learning. This process usually includes several sub-steps: graying, binarization, noise reduction, tilt (skew) correction, and text segmentation, each involving different algorithms. Let's go through these steps one by one, using an original picture as a running example.
Graying: in the RGB model, when R = G = B the color is a shade of gray, and the shared value of R, G, and B is called the gray value (also called the intensity or brightness value). Each pixel of a grayscale image therefore needs only one byte to store its gray value, which ranges from 0 to 255. Put plainly, graying turns a color picture into a black-and-white one.
Generally speaking, there are four methods for graying a color image: the component method, the maximum method, the average method, and the weighted-average method.
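As a rough sketch of the last of these (assuming the image is stored as an H×W×3 NumPy array with channels in RGB order), the weighted-average method can be written as:

```python
import numpy as np

def to_grayscale(rgb):
    """Weighted-average graying: combine R, G, B into one gray value
    per pixel, weighting green most heavily because human vision is
    most sensitive to it (classic ITU-R BT.601 luma coefficients)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    # one byte per pixel, gray values in 0-255
    return np.round(gray).astype(np.uint8)
```

The component, maximum, and average methods differ only in how the three channels are combined (pick one channel, take the max, or take the plain mean).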
An image contains the target object, the background, and noise. The most common way to extract the target object directly from a multi-valued digital image is to set a threshold T and divide the image data into two groups: pixels greater than T and pixels less than T. This is a special case of gray-level transformation, called image binarization.
A binarized black-and-white picture contains no gray, only pure white and pure black.
The most important part of binarization is the choice of threshold, which is generally either fixed or adaptive. Commonly used binarization methods include the bimodal method, the P-parameter method, the iterative method, and OTSU's method.
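As an illustration, here is a minimal NumPy sketch of OTSU's method, which picks the threshold that maximizes the between-class variance of the two pixel groups:

```python
import numpy as np

def otsu_threshold(gray):
    """OTSU's method: try every threshold t and keep the one that
    maximizes the between-class variance of the two gray-level groups."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # group weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # group means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2                  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(gray):
    """Split pixels into pure black (below threshold) and pure white."""
    t = otsu_threshold(gray)
    return (gray >= t).astype(np.uint8) * 255
```

Fixed-threshold methods simply hard-code T; adaptive methods like this one derive it from the image's own histogram.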
During digitization and transmission, real-world digital images are often corrupted by noise from the imaging equipment and the external environment; such images are called noisy images. The process of reducing noise in a digital image is called image denoising.
Image noise has many sources, such as acquisition, transmission, and compression, and it comes in different types, such as salt-and-pepper noise and Gaussian noise. Different kinds of noise call for different processing algorithms.
In the image obtained in the previous step, we can see many scattered small black dots. These are noise, and they seriously interfere with the program's cutting and recognition of the image, so we need to remove them. Denoising is critical at this stage: the quality of the denoising algorithm has a great influence on feature extraction.
Common image-denoising methods include mean filtering, adaptive Wiener filtering, median filtering, morphological noise filtering, wavelet denoising, and so on.
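As a simple illustration of one of these, here is a naive median filter in NumPy. It is deliberately unoptimized; real code would call a library routine rather than loop in Python, but the idea is just "replace each pixel with the median of its neighborhood":

```python
import numpy as np

def median_filter(gray, size=3):
    """Median filtering: replace each pixel by the median of its
    size x size neighborhood. Effective against salt-and-pepper
    noise while keeping edges relatively sharp."""
    pad = size // 2
    padded = np.pad(gray, pad, mode='edge')   # replicate borders
    out = np.empty_like(gray)
    h, w = gray.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = np.median(padded[y:y + size, x:x + size])
    return out
```

An isolated black or white dot (salt-and-pepper noise) is never the median of its neighborhood, so it simply disappears.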
A user can never hold the camera perfectly level when taking a picture, so we need to rotate the image programmatically to find the position considered most horizontal; only then will the cut-out images be at their best.
The most commonly used skew-correction method is the Hough transform. The idea is to dilate the picture so that intermittent characters merge into a straight line, which makes line detection easier. After computing the angle of the line, we can use a rotation algorithm to turn the tilted picture back to the horizontal position.
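The idea can be sketched as a coarse Hough transform in NumPy. This is a simplified illustration, not a full implementation: it sweeps only near-horizontal angles and skips the dilation step described above:

```python
import numpy as np

def estimate_skew(binary, max_skew=15.0, step=0.5):
    """Coarse Hough transform for skew estimation: for each candidate
    angle, vote every foreground pixel into rho bins, where
    rho = x*cos(theta) + y*sin(theta). The angle whose strongest bin
    collects the most votes carries the dominant text line."""
    ys, xs = np.nonzero(binary)                 # foreground pixel coordinates
    diag = int(np.hypot(*binary.shape)) + 1     # bound on |rho|
    best_angle, best_votes = 0.0, -1
    for skew in np.arange(-max_skew, max_skew + step, step):
        # a line with slope tan(skew) has Hough angle theta = 90 deg + skew
        theta = np.deg2rad(90.0 + skew)
        rho = (xs * np.cos(theta) + ys * np.sin(theta)).round().astype(int)
        votes = np.bincount(rho + diag, minlength=2 * diag + 1).max()
        if votes > best_votes:
            best_votes, best_angle = votes, skew
    return best_angle   # rotate the image by -best_angle to deskew
```

Rotating the image by the negative of the returned angle (e.g. with an affine-warp routine from an image library) completes the correction.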
For text with multiple lines, text segmentation involves two steps: line segmentation and character segmentation, and skew correction is a prerequisite for both. We project the skew-corrected text onto the Y axis and accumulate all the values, which gives us a histogram along the Y axis.
The valleys of the histogram are the background, and the peaks are the regions where the foreground (text) lies. This tells us the position of each line of text.
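The line-segmentation idea above can be sketched in a few lines of NumPy: project the binarized image onto the Y axis and take each contiguous run of non-empty rows as one text line. (A minimal illustration; a real system would use a small noise threshold instead of a strict zero test.)

```python
import numpy as np

def segment_lines(binary):
    """Project foreground pixels onto the Y axis; rows that sum to zero
    are background valleys, and each contiguous run of non-zero rows
    is one text line, returned as a half-open interval [start, end)."""
    profile = binary.sum(axis=1)        # one foreground count per row
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                   # entering a text line
        elif count == 0 and start is not None:
            lines.append((start, y))    # leaving it
            start = None
    if start is not None:               # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines
```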
Character segmentation is similar to line segmentation, except that this time we project each line of text onto the X axis.
Note, however, that two characters in the same line are often very close together and sometimes overlap vertically, so their projections merge and they get cut as a single character (this happens mostly with English characters). Conversely, the left and right components of a single character sometimes leave a small gap between their projections on the X axis, so one character is wrongly cut into two (mostly with Chinese characters). Character segmentation is therefore harder than line segmentation.
To handle this, we can set an expected character width in advance. If a cut-out segment's projection is much wider than expected, we treat it as two characters; if it is much narrower than expected, we ignore the gap and merge the "characters" on either side of it into one character for recognition.
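A minimal sketch of this width heuristic follows (the ratios 1.6 and 0.4, and splitting an over-wide run simply in half, are illustrative choices of mine, not standard values):

```python
import numpy as np

def segment_chars(line, expected_width, max_ratio=1.6, min_ratio=0.4):
    """X-axis projection plus a width prior: runs much wider than the
    expected character width are split in half (touching characters);
    runs much narrower are merged with the next run (split radicals)."""
    profile = line.sum(axis=0)
    # collect raw runs of consecutive non-empty columns
    runs, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            runs.append([start, x])
            start = None
    if start is not None:
        runs.append([start, len(profile)])
    # apply the expected-width heuristic to the raw runs
    chars, i = [], 0
    while i < len(runs):
        s, e = runs[i]
        width = e - s
        if width > max_ratio * expected_width:            # two touching chars
            chars.append((s, s + width // 2))
            chars.append((s + width // 2, e))
        elif width < min_ratio * expected_width and i + 1 < len(runs):
            chars.append((s, runs[i + 1][1]))             # merge with next run
            i += 1                                        # skip the merged run
        else:
            chars.append((s, e))
        i += 1
    return chars
```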
After preprocessing comes the character-recognition stage. This stage involves some artificial-intelligence concepts, which are abstract and hard to illustrate with pictures; I'll try to keep it easy to understand.
Features are the key information used to recognize characters; each character can be distinguished from the others by its features. For digits and English letters, feature extraction is relatively easy: there are only 10 + 26 × 2 = 62 characters in total, a small character set. For Chinese characters, feature extraction is harder: first, Chinese is a large character set; second, the national standard lists 3,755 first-level Chinese characters; finally, Chinese characters have complex structures and many look-alikes, so the feature dimension tends to be large.
After deciding which features to use, we may also need to reduce their dimensionality. If the feature dimension is too high, the classifier's efficiency suffers greatly, so dimensionality reduction is often needed to improve the recognition rate. This step is also important: it must not only shrink the feature dimension but also keep enough information in the reduced feature vector to distinguish different characters.
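One common way to do this is principal component analysis (PCA); here is a minimal NumPy sketch, assuming the training features are stacked as the rows of a matrix:

```python
import numpy as np

def pca_reduce(features, n_components):
    """Reduce feature dimension with PCA: keep only the directions of
    largest variance, so characters stay distinguishable in far fewer
    dimensions. features has one sample per row."""
    mean = features.mean(axis=0)
    centered = features - mean
    # SVD of the centered data; rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # project the data, and return what's needed to project new samples
    return centered @ components.T, components, mean
```

A new sample `x` is then reduced with `(x - mean) @ components.T` before being handed to the classifier.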
For a character image, we extract its features and hand them to a classifier, which classifies them and tells us which character the features should be recognized as. Designing the classifier is our task. Common design approaches include template matching, discriminant functions, neural-network classification, rule-based reasoning, and so on; we won't describe them in detail here. Before actual recognition, the classifier usually has to be trained, which is a supervised-learning process. There are many mature classifiers, such as SVMs and CNNs.
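As a toy example of the simplest listed approach, template matching: average the training images of each class into one normalized template and predict the label whose template correlates best with the input. (Real systems would use an SVM or CNN, as noted; this is only a sketch of the idea.)

```python
import numpy as np

class TemplateMatcher:
    """Toy template-matching classifier: one normalized template per
    character class; prediction is the class with highest correlation."""

    def __init__(self):
        self.labels, self.templates = [], []

    def train(self, label, images):
        """Average the training images of one class into a template."""
        tpl = np.mean([img.astype(float).ravel() for img in images], axis=0)
        tpl /= np.linalg.norm(tpl) + 1e-9   # unit length for correlation
        self.labels.append(label)
        self.templates.append(tpl)

    def predict(self, image):
        v = image.astype(float).ravel()
        v /= np.linalg.norm(v) + 1e-9
        scores = [tpl @ v for tpl in self.templates]
        return self.labels[int(np.argmax(scores))]
```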
Post-processing, in essence, optimizes the classification results of the classifier, and generally falls into the category of natural-language understanding.
The first task is handling similar-looking characters: for example, 分 (fen) and 兮 (xi) are similar in shape, but when the classifier meets the word 分数 ("fraction"), it should not read it as 兮数, because 分数 is a normal word and 兮数 is not. This kind of mistake needs to be corrected with a language model.
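A toy sketch of such a correction follows. The confusion set and vocabulary are hand-written, the 分/兮 pair is my reading of the "fen"/"xi" example above, and the 2-gram dictionary lookup stands in for what a real system would do with a statistical language model:

```python
def correct_text(text, vocabulary, confusion):
    """For each character, if swapping it for a visually similar one
    (from the confusion set) turns an adjacent 2-gram into a known
    word, apply the swap."""
    chars = list(text)
    for i, ch in enumerate(chars):
        for alt in confusion.get(ch, []):
            left = chars[i - 1] + alt if i > 0 else ""
            right = alt + chars[i + 1] if i + 1 < len(chars) else ""
            if left in vocabulary or right in vocabulary:
                chars[i] = alt   # the corrected reading forms a real word
                break
    return "".join(chars)
```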
The second task is handling text layout: some books, for example, are set in two columns, and the left and right columns on the same line do not belong to the same sentence, so there is no grammatical connection between them. If we cut strictly by line, the end of a line in the left column gets joined to the beginning of the same line in the right column, which is not what we want. This situation needs special handling.
That is the general principle of OCR. As you can see, OCR involves many steps, and the algorithms involved are complex; every step and every algorithm has plenty of dedicated research papers, which this article cannot explore in depth. Building OCR from scratch would be a huge project. My own knowledge is limited, and I am still at an early stage in pattern recognition and machine learning, so please point out any mistakes.