The CRNN algorithm adopts a three-stage CNN+RNN+CTC network structure which, from bottom to top, consists of:
(1) Convolutional layer, which uses a CNN to extract a feature sequence from the input image;
(2) Recurrent layer, which uses an RNN to predict the label (ground-truth) distribution for the feature sequence obtained from the convolutional layer;
(3) Transcription layer, which uses CTC to convert the label distribution produced by the recurrent layer into the final recognition result through operations such as de-duplication.
The convolutional layer contains seven convolutional layers whose basic structure follows VGG. The input is a grayscale image scaled to W×32, i.e., the height is fixed at 32. The third and fourth pooling layers use a kernel size of 1×2 (rather than 2×2) to better preserve the wide aspect ratio of text lines. BN layers are introduced to accelerate convergence. The feature maps extracted by the CNN are split into columns, and the 512-dimensional feature of each column is fed into a two-layer bidirectional LSTM with 256 units per layer for classification. During training, under the guidance of the CTC loss function, an approximate soft alignment between character positions and class labels is achieved.
As shown in the figure:
We now need to extract a sequence of feature vectors from the feature map generated by the CNN. Each feature vector (red box) is taken column by column from left to right on the feature map, and each column contains a 512-dimensional feature; that is, the i-th feature vector is the concatenation of all the pixels in the i-th column of the feature map, and these feature vectors form a sequence.
Because the convolutional layers, max-pooling layers and activation functions operate on local regions, they are translation invariant. Therefore, each column of the feature map (i.e., each feature vector) corresponds to a rectangular region of the original image (called its receptive field), and these rectangular regions appear in the same left-to-right order as their corresponding columns on the feature map. Each vector in the feature sequence is thus associated with a receptive field.
The vectors in the extracted feature sequence are generated from left to right on the feature map, and each feature vector represents a feature over a certain width of the image. The width used in this paper is 1, i.e., a single column of pixels.
If an image containing 10 characters has size 100×32, the feature map obtained through the CNN above has size 25×1 (the number of channels is ignored here), giving a sequence in which each column of features corresponds to a rectangular region of the original image (as shown in the figure below); this sequence serves as the input to the RNN for the subsequent computation.
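This map-to-sequence step can be sketched with NumPy. The 512×1×25 shape follows the example above, and the random array merely stands in for a real CNN output:

```python
import numpy as np

# Stand-in for the CNN output of a 100x32 input image:
# 512 channels, height 1, width 25 (shape taken from the example above).
feature_map = np.random.rand(512, 1, 25)   # (C, H, W)

# Map-to-sequence: each of the 25 columns becomes one 512-d feature vector,
# ordered left to right; the result is the RNN's input sequence.
sequence = feature_map.squeeze(1).T        # (W, C) = (25, 512)
print(sequence.shape)                      # (25, 512)
```

Each row `sequence[i]` is exactly the i-th column of the feature map, matching the column-wise extraction described in the text.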
As can be seen from the figure above, VGG is adjusted as follows:
1. So that the features extracted by the CNN can be fed into the RNN, the kernel size of the third and fourth max-pooling layers is changed from 2×2 to 1×2.
2. To speed up training, BN layers are added after the fifth and sixth convolutional layers.
Why change the kernel size of the third and fourth max-pooling layers from 2×2 to 1×2? To make the features extracted by the CNN suitable as RNN input. Note first that the input to this network is W×32: the network places no particular requirement on the width of the input image, but the height must be scaled to 32.
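The effect of this change can be checked with a little arithmetic, here for the 100×32 example input (a sketch of the pooling stages only; the final convolution that collapses the height from 2 to 1 is ignored here):

```python
# Track the feature-map size through the four max-pooling layers.
w, h = 100, 32
for kw, kh in [(2, 2), (2, 2), (1, 2), (1, 2)]:   # pools 3 and 4 use 1x2
    w, h = w // kw, h // kh
print(w, h)   # 25 2 -> the width is only halved twice, keeping ~25 time steps

# With 2x2 pooling everywhere the width would be 100 // 16 = 6: far too few
# columns to cover a 10-character line with one feature slice per character.
```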
Suppose an image is given as input. To feed its features into the recurrent layer, the following steps are performed:
For a detailed explanation of the principles of CNNs, please refer to /writer #/notebooks/46006121/notes/71kloc-0/56459.
Because plain RNNs suffer from the vanishing-gradient problem and therefore cannot capture much context, CRNN uses the LSTM, whose special design enables it to capture long-range dependencies.
For the feature sequence x = x1, ..., xT output by the CNN, the RNN produces an output yt at each time step. To prevent gradients from vanishing during training, LSTM units are used as the RNN cells. The paper argues that both the forward and the backward context of the sequence are helpful for prediction, so a bidirectional RNN is adopted. The structure of the LSTM cell and of the bidirectional RNN is shown in the figure below.
Example:
Through the steps above, we obtain 40 feature vectors, each of length 512. The LSTM consumes one feature vector per time step for classification, giving 40 time steps in total.
We know that each feature vector corresponds to a small rectangular region of the original image, and the goal of the RNN is to predict which character that region contains: from the input feature vector it predicts a softmax probability distribution over all characters. This distribution, a vector whose length equals the number of character classes, is the input to the CTC layer.
Since each time step takes one feature vector as input and outputs a probability distribution over all characters, the result is a posterior probability matrix of 40 vectors, each of length equal to the number of character classes. This posterior matrix is then passed to the transcription layer.
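The shape of this posterior matrix can be sketched as follows. The class count of 37 (26 letters + 10 digits + 1 CTC blank) is an assumed example, and the random logits stand in for the BiLSTM output:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_classes = 40, 37          # 37 = 26 letters + 10 digits + blank (assumed)
logits = rng.standard_normal((T, n_classes))   # stand-in for the BiLSTM output

# Per-time-step softmax: each row is a distribution over all character classes.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
posterior = exp / exp.sum(axis=1, keepdims=True)

print(posterior.shape)                 # (40, 37): one distribution per step
```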
At test time, there are two transcription modes: lexicon-based and lexicon-free.
Lexicon-based means that a lexicon (a dictionary of candidate strings) is available at test time; the probability of every lexicon entry is computed from the network output, and the entry with the highest probability is taken as the final predicted string.
Lexicon-free means that no candidate strings are given for the test set; the string with the highest output probability is chosen directly as the final prediction.
The difficulty of end-to-end OCR recognition lies in how to align sequences of variable length! (Because the sequences have variable length, the loss cannot be computed with the earlier fixed-length methods, and forcing a fixed length easily loses information and is far too restrictive!)
Transcription is the process of converting the RNN's per-frame predictions into a label sequence. Mathematically, transcription means finding the label sequence with the highest probability given the per-frame predictions.
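Lexicon-free transcription is typically the best-path (greedy) form of this search: take the argmax class at every frame, then merge repeats and drop blanks. The class ids below are illustrative, with 0 as the CTC blank:

```python
def ctc_greedy_decode(probs, blank=0):
    """Best-path decoding: per-frame argmax, then merge repeats, drop blanks."""
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# Frames whose argmaxes are 1, 1, 0, 2, 2, 0, 2 collapse to [1, 2, 2]:
# the repeated 1s and 2s merge, while the blank (0) separates the two 2s.
probs = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05],
         [0.1, 0.1, 0.8], [0.1, 0.2, 0.7], [0.8, 0.1, 0.1],
         [0.1, 0.1, 0.8]]
print(ctc_greedy_decode(probs))   # [1, 2, 2]
```

The blank-then-merge rule is what lets the network emit the same character twice in a row ("ll", "oo") while still de-duplicating consecutive identical predictions.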
See /writer #/notebooks/46006121/notes/71156474 for details.