[ECCV2020] Paper Translation: Character Region Attention for Text Spotting (CRAFTS)
A scene text spotter is composed of text detection and recognition modules. Much research has been done to unify these modules into an end-to-end trainable model for better performance. A typical architecture places the detection and recognition modules in separate branches, and RoI pooling is commonly used to let the branches share visual features. However, there is still an opportunity to establish a more complementary connection between the modules when adopting a recognizer that uses an attention-based decoder and a detector that represents the spatial information of character regions. This is possible because the two modules share a common subtask: finding the location of character regions. Based on these insights, we construct a tightly coupled single-pipeline model. This architecture is formed by using the detection outputs as the recognizer's input and by propagating the recognition loss through the detection stage. The character score map helps the recognizer attend better to character center points, and the recognition loss propagated to the detector module enhances the localization of character regions. Furthermore, the strengthened shared stage allows feature rectification and boundary localization of arbitrarily shaped text regions. Extensive experiments demonstrate state-of-the-art performance on publicly available benchmarks of straight and curved texts.

Scene text spotting, which includes text detection and recognition, has attracted much attention in recent years due to its wide range of applications in instant translation, image retrieval, and scene parsing. Although existing text detectors and recognizers work well on horizontal texts, spotting curved text instances in scene images remains a challenge.

To spot curved texts in an image, a classic approach is to cascade existing detection and recognition models, letting each handle the text instances on its own side. Detectors [32, 31, 2] try to capture the geometric attributes of curved texts by applying complex post-processing techniques, while recognizers improve their accuracy on curved texts through multi-directional encoding [6] or rectification modules [37, 46, 11].

With the development of deep learning, combining detectors and recognizers into a jointly trainable end-to-end network has been studied [14, 29]. A unified model not only improves size efficiency and speed, but also helps the model learn shared features, improving overall performance. To benefit from this property, end-to-end models [32, 34, 10, 44] have also attempted to handle curved text instances. However, most existing works only adopt RoI pooling to share low-level features between the detection and recognition branches. During training, the shared feature layer is trained with both the detection and recognition losses, rather than the whole network being trained end to end.

As shown in Figure 1, we propose a novel end-to-end text spotting model with character region attention, called CRAFTS. Instead of isolating the detection and recognition modules in two separate branches, we establish a single pipeline by building complementary connections between the modules. We observe that a recognizer [1] using an attention-based decoder and a detector [2] encapsulating character spatial information share a common subtask: localizing character regions. By tightly integrating the two modules, the outputs of the detection stage help the recognizer attend better to the character center points, and the loss propagated from the recognizer to the detector stage enhances the localization of character regions. Furthermore, the network can maximize the quality of the feature representations used in the common subtask. To the best of our knowledge, this is the first end-to-end work that establishes a tightly coupled loss.

Our contributions are summarized as follows:

(1) We propose an end-to-end network that can detect and recognize text of arbitrary shapes.

(2) We build a complementary relationship between the modules by utilizing the spatial character information from the detector in the rectification and recognition modules.

(3) We establish a single pipeline by propagating the recognition loss through all the features of the entire network.

(4) We achieve state-of-the-art performance on the IC13, IC15, IC19-MLT, and TotalText [20, 19, 33, 7] datasets, which contain numerous horizontal, curved, and multilingual texts.

Text detection and recognition methods

Detection networks use regression-based [16, 24, 25, 48] or segmentation-based [9, 31, 43, 45] methods to generate text bounding boxes. Some recent methods, such as [17, 26, 47], take Mask-RCNN [13] as the base network and gain the advantages of both regression and segmentation methods through multi-task learning. In terms of the unit of text detection, methods can also be sub-classified into word-level and character-level [16, 2] prediction.

Text recognizers typically adopt a CNN-based feature extractor and an RNN-based sequence generator, and are classified by their sequence generators: connectionist temporal classification (CTC) [35] and attention-based sequence decoders [21, 36]. The detection model provides the information of text regions, but it is still challenging for a recognizer to extract useful information from arbitrarily shaped texts. To help recognition networks handle irregular texts, some studies [36, 28, 37] use a spatial transformer network (STN) [18]. The papers [11, 46] further extend the use of STN by performing the rectification iteratively. These studies show that running STN recursively helps a recognizer extract useful features from extremely curved texts. In [27], a recurrent RoIWarp layer is proposed to crop individual characters before recognizing them. This work proves that the task of finding character regions is closely related to the attention mechanism used in attention-based decoders.

One way to build a text spotting model is to place the detection and recognition networks sequentially. A well-known two-stage architecture couples a TextBoxes++ detector with a CRNN [35] recognizer. Simple as it is, this approach achieves reasonable results.

End-to-end models using RNN-based recognizers

EAA [14] and FOTS [29] are end-to-end models based on the EAST detector [49]. The difference between the two networks lies in the recognizer: the FOTS model uses a CTC decoder [35], while the EAA model uses an attention decoder [36]. Both works implement an affine transformation layer for feature sharing. The proposed affine transformation works well on horizontal texts, but shows limitations when handling arbitrarily shaped texts. TextNet [42] proposed a spatially-aware text recognizer with a perspective RoI transformation in the feature pooling layer. The network keeps an RNN layer to recognize text sequences in 2D feature maps, but due to the limited expressiveness of quadrilaterals, the network still shows limitations in detecting curved texts.

Qin et al. [34] proposed an end-to-end network based on Mask-RCNN [13]. Given proposal boxes, features are pooled from the shared feature layer, and an RoI-masking layer filters out background clutter. The proposed method improves performance by ensuring that attention is placed only on the text region. Busta et al. proposed the Deep TextSpotter network and extended their work in E2E-MLT. The network consists of an FPN-based detector and a CTC-based recognizer, and the model predicts multiple languages in an end-to-end manner.

End-to-end models using CNN-based recognizers

Most CNN-based models that recognize text at the character level have advantages in handling arbitrarily shaped texts. MaskTextSpotter [32] is a model that recognizes text with a segmentation approach. Although it has strengths in detecting and recognizing individual characters, it is difficult to train the network because character-level annotations are usually not provided in public datasets. CharNet [44] is another segmentation-based method that makes character-level predictions. It overcomes the lack of character-level annotations with a weakly supervised training scheme: during training, the method performs iterative character detection to create pseudo ground truths.

Although segmentation-based recognizers have achieved great success, this approach suffers when the number of target characters increases. As the character set grows, segmentation-based models need more output channels, which increases memory requirements. The journal version of MaskTextSpotter [23] expanded the character set to handle multiple languages, but the authors added an RNN-based decoder instead of using their original CNN-based recognizer. Another limitation of segmentation-based recognizers is the lack of contextual information in the recognition branch. Due to the absence of sequential modeling like RNNs, the accuracy of the models drops on noisy images.

TextDragon [10] is another segmentation-based method for localizing and recognizing text instances. However, it does not guarantee that a predicted character segment covers a single character region. To solve this, the model incorporates CTC to remove overlapping characters. The network shows good detection performance but, due to the lack of sequential modeling, shows limitations in the recognizer.

The CRAFT detector [2] is chosen as the base network because of its capability to represent the semantic information of character regions. The outputs of the CRAFT network represent center probabilities of character regions and the linkage between them. Since the goal of the two modules is to locate the center positions of characters, we conjecture that this character-center information can be used to support the attention module in the recognizer. In this work, we make three changes to the original CRAFT model: backbone replacement, link representation, and orientation estimation.
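For intuition, CRAFT-style region scores are typically generated by warping an isotropic 2D Gaussian onto each character box. The sketch below illustrates that general recipe with NumPy and OpenCV; the template size and sigma are illustrative assumptions, not the authors' exact values.

```python
import numpy as np
import cv2

def gaussian_template(size=64, sigma_ratio=0.3):
    """Isotropic 2D Gaussian rendered on a square canvas."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    sigma = size * sigma_ratio
    return np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2)).astype(np.float32)

def render_region_score(canvas, char_box):
    """Warp the Gaussian template into one character quadrilateral.

    canvas:   (H, W) float32 region score map, updated in place
    char_box: (4, 2) array of quad corners in image coordinates
    """
    template = gaussian_template()
    size = template.shape[0]
    src = np.float32([[0, 0], [size, 0], [size, size], [0, size]])
    M = cv2.getPerspectiveTransform(src, char_box.astype(np.float32))
    warped = cv2.warpPerspective(template, M, (canvas.shape[1], canvas.shape[0]))
    np.maximum(canvas, warped, out=canvas)  # keep the strongest score per pixel
```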

Backbone replacement

Recent studies show that using ResNet50 allows well-defined feature representations to be captured for both the detector and the recognizer [30, 1]. We therefore change the backbone network from VGG-16 [40] to ResNet50 [15].
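As a minimal sketch of the backbone swap, torchvision's ResNet50 can serve as the shared feature extractor; the exact layers kept and the feature-fusion decoder on top (CRAFT uses a U-Net-style scheme) are omitted here.

```python
import torch
import torchvision

# Keep ResNet50 up to its last convolutional stage; drop avgpool and fc.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

x = torch.randn(1, 3, 512, 512)
features = backbone(x)
print(features.shape)  # torch.Size([1, 2048, 16, 16]) -- a stride-32 feature map
```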

Link representation

Vertical texts are uncommon in Latin script, but they appear frequently in East Asian languages such as Chinese, Japanese, and Korean. In this work, a binary center line is used to connect consecutive character regions. The reason for this change is that employing the original affinity map on vertical texts often produces ill-posed perspective transformations, generating invalid box coordinates. To generate the ground truth link map, a line segment of thickness t is drawn between adjacent characters, where t = max((d1 + d2)/2 · α, 1), d1 and d2 are the diagonal lengths of the adjacent character boxes, and α is a scaling coefficient. This equation makes the width of the center line proportional to the character size. In our implementation, α is set to 0.1.
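A small sketch of the link map generation under these definitions (the box-center and diagonal conventions are assumptions for illustration):

```python
import numpy as np
import cv2

def draw_link_centerline(link_map, box1, box2, alpha=0.1):
    """Draw the ground-truth center-line segment between two adjacent character boxes.

    link_map:   (H, W) uint8 binary link map, updated in place
    box1, box2: (4, 2) arrays of quad corners for neighboring characters
    """
    d1 = np.linalg.norm(box1[2] - box1[0])  # diagonal length of the first box
    d2 = np.linalg.norm(box2[2] - box2[0])  # diagonal length of the second box
    t = max(int(round((d1 + d2) / 2 * alpha)), 1)  # t = max((d1 + d2)/2 * alpha, 1)
    c1 = tuple(int(v) for v in np.round(box1.mean(axis=0)))
    c2 = tuple(int(v) for v in np.round(box2.mean(axis=0)))
    cv2.line(link_map, c1, c2, color=255, thickness=t)
```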

Orientation estimation

Obtaining the correct orientation of text boxes is important, because well-defined box coordinates are needed in the recognition stage to recognize the text correctly. To this end, we add a two-channel output to the detection stage; the channels are used to predict the character angles along the x- and y-axes, and the ground truth of the orientation map is generated from the angles of the annotated boxes.
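One plausible encoding for the two channels, assumed here for illustration (the exact formula is not given in this translation), stores the normalized sine and cosine of each box angle θ inside the box mask:

```python
import numpy as np
import cv2

def fill_orientation_gt(sin_map, cos_map, box, theta):
    """Write normalized sin/cos of a box angle into the two orientation channels.

    sin_map, cos_map: (H, W) float32 maps, updated in place
    box:              (4, 2) quad corners; theta: box angle in radians
    """
    mask = np.zeros(sin_map.shape, dtype=np.uint8)
    cv2.fillPoly(mask, [box.astype(np.int32)], 1)
    sin_map[mask == 1] = (np.sin(theta) + 1.0) / 2.0  # map [-1, 1] to [0, 1]
    cos_map[mask == 1] = (np.cos(theta) + 1.0) / 2.0
```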

The shared stage contains two modules: a text rectification module and a character region attention (CRA) module. A thin-plate spline (TPS) [37] transformation is used to rectify arbitrarily shaped text regions. Inspired by [46], our rectification module employs iterative TPS to obtain a better representation of the text region. By iteratively updating the control points, the curved geometry of the text in the image is progressively rectified. Through empirical study, we find that three TPS iterations are sufficient for rectification.

A typical TPS module takes a word image as input, but we feed in the character region map and the link map, since they encapsulate the geometric information of the text region. We use 20 control points to tightly cover the curved text region. To use these control points as a detection result, they are transformed back into the original input image coordinates. Optionally, we perform 2D polynomial fitting to smooth the boundary polygon. Examples of iterative TPS and the final smoothed polygon output are shown in Figure 4.
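To illustrate the optional smoothing step, a low-order polynomial can be fit through the control points along one boundary (top or bottom) and resampled; the degree and the x-parametrization below are assumptions that hold for roughly horizontal boundaries:

```python
import numpy as np

def smooth_boundary(points, degree=3, n_samples=50):
    """Fit a polynomial through one side's control points and resample it.

    points: (N, 2) control points along the top (or bottom) text boundary
    """
    x, y = points[:, 0], points[:, 1]
    coeffs = np.polyfit(x, y, deg=degree)          # least-squares polynomial fit
    xs = np.linspace(x.min(), x.max(), n_samples)  # dense resampling of the curve
    ys = np.polyval(coeffs, xs)
    return np.stack([xs, ys], axis=1)              # smoothed boundary samples
```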

The modules in the recognition stage are formed following the results reported in [1]. The recognition stage has three components: feature extraction, sequence modeling, and prediction. Since the feature extraction module takes high-level semantic features as input, it is lighter than that of a standalone recognizer.

The detailed architecture of the feature extraction module is shown in Table 1. After extracting the features, a bidirectional LSTM is applied for sequence modeling, followed by an attention-based decoder that makes the final text prediction.
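A condensed PyTorch sketch of this stage is shown below: a BiLSTM over column features followed by a Bahdanau-style attention decoder. The layer sizes, the single-layer decoder, and the class count (94 characters plus specials) are illustrative assumptions, not the exact configuration of Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnRecognitionHead(nn.Module):
    """Sequence modeling (BiLSTM) + attention-based decoding over column features."""

    def __init__(self, feat_dim=256, hidden=256, num_classes=97):
        super().__init__()
        self.seq = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.attn_score = nn.Linear(2 * hidden + hidden, 1)  # scores ctx against state
        self.decoder = nn.LSTMCell(2 * hidden, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats, max_len=25):
        ctx, _ = self.seq(feats)                 # (B, T, 2*hidden) context sequence
        B, T, _ = ctx.shape
        h = ctx.new_zeros(B, self.decoder.hidden_size)
        c = ctx.new_zeros(B, self.decoder.hidden_size)
        logits = []
        for _ in range(max_len):
            # Attention weights over the T positions, conditioned on the state h.
            h_exp = h.unsqueeze(1).expand(-1, T, -1)
            scores = self.attn_score(torch.cat([ctx, h_exp], dim=-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=1)                  # (B, T)
            glimpse = (alpha.unsqueeze(-1) * ctx).sum(dim=1)  # attended feature
            h, c = self.decoder(glimpse, (h, c))
            logits.append(self.classifier(h))
        return torch.stack(logits, dim=1)        # (B, max_len, num_classes)
```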

At each time step, the attention-based recognizer decodes textual information by masking the attention output onto the features. Although the attention module works well in most cases, it fails to predict characters when the attention points are misplaced or vanish [5, 14]. Figure 5 shows the effect of using the CRA module: well-placed attention points enable reliable text prediction.
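A plausible realization of the CRA input fusion, assumed here for illustration, simply concatenates the character region score map with the rectified features before recognition; the paper's exact fusion scheme may differ.

```python
import torch

def apply_cra(rectified_feats, rectified_char_map):
    """Fuse the character region map into the recognizer input.

    rectified_feats:    (B, C, H, W) features after TPS rectification
    rectified_char_map: (B, 1, H, W) character region scores on the same grid
    """
    # Channel-wise concatenation lets the decoder attend near character centers.
    return torch.cat([rectified_feats, rectified_char_map], dim=1)  # (B, C+1, H, W)
```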

The final training loss L is composed of the detection loss and the recognition loss: L = L_det + L_rec. The overall flow of the recognition loss is shown in Figure 6. The loss flows through the weights in the recognition stage and propagates into the detection stage through the character region attention module.

The detection loss, on the other hand, is used as an intermediate loss, so the weights before the detection stage are updated with both the detection and recognition losses.
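The following toy example (stand-in modules, not the real CRAFTS stages) demonstrates this gradient flow: a single backward pass on L = L_det + L_rec updates the detector weights with both losses.

```python
import torch

detector = torch.nn.Linear(8, 8)    # stand-in for the detection stage
recognizer = torch.nn.Linear(8, 4)  # stand-in for the recognition stage

x = torch.randn(2, 8)
shared = detector(x)                         # shared features feeding both heads
det_loss = shared.pow(2).mean()              # stand-in for L_det (intermediate loss)
rec_loss = recognizer(shared).pow(2).mean()  # stand-in for L_rec

(det_loss + rec_loss).backward()             # L = L_det + L_rec
print(detector.weight.grad is not None)      # True: both losses reach the detector
```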

English datasets. IC13 [20] consists of high-resolution images: 229 for training and 233 for testing. Word-level text instances are annotated with rectangular boxes. IC15 [19] contains 1000 training images and 500 test images; word-level text instances are annotated with quadrilateral boxes. TotalText [7] has 1255 training images and 300 test images. Unlike the IC13 and IC15 datasets, it contains curved text instances and is annotated with polygon points.

Multilingual dataset. IC19 [33] contains 10,000 training images and 10,000 test images. The dataset contains texts in seven different languages and is annotated with quadrilateral points.

We jointly train the detector and the recognizer in the CRAFTS model. To train the detection stage, we follow the weakly supervised training method described in [2]. The recognition loss is computed on cropped word features randomly sampled within each image in a batch. The maximum number of words per image is set to 16 to prevent out-of-memory errors. Data augmentation for the detector applies techniques such as cropping, rotation, and color variation. For the recognizer, the corner points of the ground truth boxes are perturbed within a range of 0% to 10% of the shorter side of the box.

The model is first trained on the SynthText dataset [12] for 50k iterations, and then we further train the network on the target datasets. The Adam optimizer is used, and online hard negative mining (OHEM) [39] is applied to enforce a 1:3 ratio of positive to negative pixels in the detection loss. When fine-tuning the model, the SynthText dataset is mixed in at a 1:5 ratio. We use 94 characters to cover alphanumerics and special characters, and 4267 characters for the multilingual dataset.
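A sketch of the OHEM recipe with the stated 1:3 ratio is shown below; the per-pixel loss function here is a generic stand-in, not the authors' exact detection loss.

```python
import torch
import torch.nn.functional as F

def ohem_detection_loss(pred, target, pos_mask, neg_ratio=3):
    """Pixel-wise loss with online hard negative mining at a 1:3 pos:neg ratio.

    pred, target: (N,) flattened score maps; pos_mask: (N,) bool, True at positives.
    """
    per_pixel = F.mse_loss(pred, target, reduction="none")
    pos_loss = per_pixel[pos_mask]
    neg_loss = per_pixel[~pos_mask]
    k = min(neg_ratio * max(int(pos_mask.sum()), 1), neg_loss.numel())
    hard_neg, _ = neg_loss.topk(k)  # keep only the hardest negative pixels
    return (pos_loss.sum() + hard_neg.sum()) / (pos_loss.numel() + k)
```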

Horizontal datasets (IC13, IC15)

To evaluate on the IC13 benchmark, we take the model trained on the SynthText dataset and fine-tune it on the IC13 and IC19 datasets. During inference, we resize the longer side of the input to 1280.

The results show significant performance improvement over the previous state of the art.

The model trained on the IC13 dataset is then fine-tuned on the IC15 dataset. During evaluation, the input size of the model is set to 2560×1440. Note that we perform the generic evaluation without any vocabulary set. The quantitative results on the IC13 and IC15 datasets are listed in Table 2.

Heatmaps are used to illustrate the character region map and the link map, and the weighted pixel-wise angle values are visualized in HSV color space.

As shown in the figure, the network successfully localizes polygonal regions and recognizes characters in curved text regions. The two figures in the upper-left corner show successful examples of fully rotated and highly curved text instances.

Attention assisted by character region attention

In this section, we study how character region attention (CRA) affects the recognizer's performance by training a separate network without CRA.

Table 5 shows the effect of using CRA on the benchmark datasets. Without CRA, we observe performance degradation on all datasets. Especially on the perspective dataset (IC15) and the curved dataset (TotalText), the gap is larger than on the horizontal dataset (IC13). This implies that feeding character attention information improves the recognizer's performance when dealing with irregular texts. (Translator's note: the experimental data in the table seem more indicative for longer texts; I wonder how this conclusion was reached.)

Importance of orientation estimation

Orientation estimation is important because scene text images contain many multi-oriented texts. Our pixel-wise averaging scheme is very useful for the recognizer to receive well-defined features. We compare the results of the model when the orientation information is not used. On the IC15 dataset, the performance drops from 74.9% to 74.1% (−0.8%), and on the TotalText dataset, the H-mean drops from 78.7% to 77.5% (−1.2%). The results show that using correct angle information improves the performance on rotated texts.

Inference speed

Since inference speed varies with input image size, we measure FPS at different input resolutions, with the longer side of each resolution being 960, 1280, 1600, and 2560, respectively. The measured FPS are 9.9, 8.3, 6.8, and 5.4, respectively. In all experiments, we use an NVIDIA P40 GPU and an Intel Xeon CPU. Compared with the 8.6 FPS of the VGG-based CRAFT detector, the ResNet-based CRAFTS network achieves higher FPS for the same input size. In addition, directly using the control points from the rectification module alleviates the need for post-processing to generate polygons.
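For reference, a rough FPS measurement at a given resolution could look like the hypothetical helper below (the 16:9 aspect ratio and run count are arbitrary choices):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, long_side, n_runs=50, device="cuda"):
    """Average forward-pass throughput at a given longer-side resolution."""
    x = torch.randn(1, 3, long_side * 9 // 16, long_side, device=device)
    model(x)                      # warm-up pass
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(x)
    torch.cuda.synchronize()
    return n_runs / (time.time() - start)
```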

Granularity difference issue

We conjecture that the granularity difference between ground truth and predicted boxes leads to the comparatively low detection performance on the IC15 dataset. Character-level segmentation methods tend to generalize character connectivity based on spatial and color cues rather than capturing all the features of word instances. As a result, the outputs do not follow the box annotation style required by the benchmark. Fig. 9 shows failure cases on the IC15 dataset, demonstrating that detection results are marked incorrect even when we observe acceptable qualitative results.

In this paper, we propose an end-to-end trainable single-pipeline model that tightly couples the detection and recognition modules. Character region attention in the shared stage fully utilizes the character region map, helping the recognizer rectify and better attend to text regions. In addition, we design the recognition loss to propagate through the detection stage, which enhances the detector's ability to localize characters. Moreover, the rectification module in the shared stage enables accurate localization of curved texts without any hand-crafted post-processing. The experimental results validate the state-of-the-art performance of CRAFTS on various datasets.