Interpretation of the LipNet paper
Paper: LipNet: End-to-End Sentence-level Lipreading

Because detailed analyses of lip-reading papers were hard to find on Chinese websites when I surveyed this field, I spent a great deal of time and energy on it. This post analyzes a pioneering sentence-level work in the area and walks through the main points of the paper. Before this paper, most lipreading work focused on recognizing letters, words, digits, or short phrases, which is rather limited. Although the dataset used in this paper has restricted sentence patterns and a fairly small vocabulary, the model still performs recognition at the sentence level and achieves quite good results.

First, the dataset. GRID is a sentence-level dataset containing more than 30,000 samples. Each sample is a video in which a speaker utters a sentence of a fixed pattern, paired with a text label, and the label also marks the start and end time of every word. The sentence pattern is fixed rather than logically natural, namely:

command^4 + color^4 + preposition^4 + letter^25 + digit^10 + adverb^4

In other words, each sentence consists of six words drawn from fixed categories, and the superscript indicates how many distinct words of that category appear in the dataset. For example, color^4 means this position holds a color word (such as "blue"), and the dataset contains four color words in total.
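To make the sentence template concrete, here is a minimal Python sketch that samples one sentence of this form; the word lists follow the standard description of the GRID corpus, and the function name is mine:

```python
import random

# Word categories of the GRID grammar; the superscript in the text is the size of each list.
COMMANDS = ["bin", "lay", "place", "set"]            # command^4
COLORS = ["blue", "green", "red", "white"]           # color^4
PREPOSITIONS = ["at", "by", "in", "with"]            # preposition^4
LETTERS = list("abcdefghijklmnopqrstuvxyz")          # letter^25 ("w" is excluded)
DIGITS = [str(d) for d in range(10)]                 # digit^10
ADVERBS = ["again", "now", "please", "soon"]         # adverb^4

def sample_grid_sentence():
    """Return one sentence following the fixed GRID pattern."""
    return " ".join(random.choice(words) for words in
                    (COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS))

print(sample_grid_sentence())  # e.g. "place blue at f two now"
```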

Furthermore, it helps to understand how the dataset is organized: the videos are stored in 34 folders, corresponding to recordings of 34 different speakers, and each folder contains roughly a thousand videos, all recorded by the same person. In the experiments, the author trains and tests in two different ways: (1) training on the videos of 30 speakers and testing on the remaining 4, i.e., unseen speakers; (2) overlapped speakers, where 255 videos from each speaker are randomly held out as test data and the rest are used for training.
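A minimal sketch of the two splitting schemes, assuming each sample is indexed as a (speaker_id, video_path) pair; the function names and the per-speaker count parameter are mine:

```python
import random

def split_unseen_speakers(samples, test_speakers):
    """samples: list of (speaker_id, video_path). Hold out whole speakers for testing."""
    train = [s for s in samples if s[0] not in test_speakers]
    test = [s for s in samples if s[0] in test_speakers]
    return train, test

def split_overlapped_speakers(samples, per_speaker=255, seed=0):
    """Randomly hold out `per_speaker` videos of every speaker for testing."""
    rng = random.Random(seed)
    by_speaker = {}
    for spk, path in samples:
        by_speaker.setdefault(spk, []).append((spk, path))
    train, test = [], []
    for vids in by_speaker.values():
        rng.shuffle(vids)
        test.extend(vids[:per_speaker])
        train.extend(vids[per_speaker:])
    return train, test
```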

First, the data is divided into training and test sets according to the two grouping schemes introduced at the end of the dataset section. Then, using an existing face landmark detector, each video frame is cropped to a small fixed-size patch (100x50 pixels in the paper) containing only the mouth. Finally, each frame is standardized.
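A minimal preprocessing sketch, assuming dlib's 68-point landmark model; the model file path, crop size, and per-frame standardization here are illustrative rather than the paper's exact pipeline:

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def crop_mouth(frame, out_w=100, out_h=50):
    """Crop a mouth-centered patch from one BGR frame and standardize it."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 cover the mouth region in the 68-point scheme.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = pts.mean(axis=0).astype(int)
    crop = frame[cy - out_h // 2: cy + out_h // 2, cx - out_w // 2: cx + out_w // 2]
    crop = cv2.resize(crop, (out_w, out_h)).astype(np.float32)
    return (crop - crop.mean()) / (crop.std() + 1e-8)  # per-frame standardization
```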

Three kinds of data augmentation are applied during training:

(1) Both the regular image sequence and its horizontally flipped version are used for training;

(2) Because the dataset provides the start and end time of each word, the model can additionally be trained on the frame subsequences corresponding to individual words;

(3) Individual frames are randomly deleted or duplicated, each with a probability of 0.05 (a sketch of this frame-level jitter follows the list).
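A minimal sketch of the third augmentation, assuming the video is simply a list of frames; the probability value follows the paper, while the function name is mine:

```python
import random

def jitter_frames(frames, p=0.05, seed=None):
    """Randomly delete or duplicate each frame with probability p."""
    rng = random.Random(seed)
    out = []
    for frame in frames:
        r = rng.random()
        if r < p:          # delete this frame
            continue
        out.append(frame)
        if r > 1 - p:      # duplicate this frame
            out.append(frame)
    return out
```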

Having seen how the data is organized, you can tell that this is a sequence-to-sequence problem, much like the usual setup in speech recognition. To a large extent, the lip-reading pipeline is therefore a combination of computer vision and machine-translation-style sequence modeling.

The model structure itself is nothing special, and the paper spends quite a lot of space on it. To summarize: 3D (spatiotemporal) convolutions extract features from the image frames, two layers of bidirectional GRUs model the temporal sequence, and a final fully connected layer outputs the prediction probabilities at each time step. Generally speaking, the structure is not complicated, although it does include a few refinements.
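A minimal PyTorch sketch of this kind of architecture; the layer sizes, kernel shapes, and the spatial pooling at the end are illustrative assumptions rather than the exact configuration from the paper:

```python
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    """3D conv front-end + bidirectional GRUs + per-frame classifier (CTC-style output)."""

    def __init__(self, num_classes=28, hidden=256):  # e.g. 26 letters + space + CTC blank
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse spatial dims, keep the time axis
        )
        self.gru = nn.GRU(input_size=64, hidden_size=hidden,
                          num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                    # x: (batch, 3, time, height, width)
        f = self.frontend(x)                 # (batch, 64, time, 1, 1)
        f = f.squeeze(-1).squeeze(-1).permute(0, 2, 1)  # (batch, time, 64)
        out, _ = self.gru(f)                 # (batch, time, 2 * hidden)
        return self.fc(out)                  # per-frame logits, fed to CTC

# Example: a batch of 2 clips, 75 frames of 50x100 RGB mouth crops.
logits = LipReadingNet()(torch.randn(2, 3, 75, 50, 100))
print(logits.shape)  # torch.Size([2, 75, 28])
```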

In addition, the loss function is worth noting. The paper uses the CTC loss, a classic loss function from speech recognition that avoids having to explicitly align frames with characters. Please refer to this article for details.
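A minimal sketch of how CTC loss is typically applied to the per-frame logits, using PyTorch's nn.CTCLoss; the tensor shapes, lengths, and blank index are assumptions for illustration:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Per-frame logits, e.g. the output of the model sketched above.
batch, time, num_classes = 4, 75, 28
logits = torch.randn(batch, time, num_classes, requires_grad=True)

# nn.CTCLoss expects log-probabilities of shape (time, batch, num_classes).
log_probs = logits.log_softmax(dim=-1).permute(1, 0, 2)

targets = torch.randint(1, num_classes, (batch, 30))          # label indices; 0 is the blank
input_lengths = torch.full((batch,), time, dtype=torch.long)  # number of frames per sample
target_lengths = torch.full((batch,), 30, dtype=torch.long)   # number of characters per sample

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```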

The evaluation metrics are WER and CER, i.e., word error rate and character error rate; lower is better, of course. The results are reported in two columns, unseen speakers and overlapped speakers, corresponding to the two data-splitting schemes introduced in the dataset section. As you can see, LipNet achieved the best results at the time on every metric for the GRID dataset. Many subsequent works have pushed the error rate on GRID down to about 1.0%~2.0%, but their performance on datasets such as LRS is far from what they achieve on GRID, because GRID has a single sentence pattern and the face always directly faces the camera, so it can only serve as basic research. There is still a long way to go for sentence-level lip reading in natural scenes.
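For reference, here is a minimal sketch of how WER and CER are usually computed from edit distance; the helper functions are mine, not code from the paper:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings of characters)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref_sentence, hyp_sentence):
    """Word error rate: word-level edit distance divided by the reference length."""
    ref, hyp = ref_sentence.split(), hyp_sentence.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(ref_sentence, hyp_sentence):
    """Character error rate: character-level edit distance divided by the reference length."""
    return edit_distance(ref_sentence, hyp_sentence) / max(len(ref_sentence), 1)

print(wer("place blue at f two now", "place blue by f two now"))  # 1/6 ≈ 0.167
```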

My knowledge is limited, so criticism and corrections are welcome. If you have any questions, we can discuss them together.