Introduction: This paper proposes Deep Speaker, a neural speaker embedding system. The system maps utterances onto a hypersphere, where the similarity between speakers is measured by cosine similarity.
Application scenarios: speaker identification, verification, and clustering.
Methods: frame-level acoustic features are extracted with ResCNN and GRU architectures, utterance-level speaker embeddings are produced by mean pooling, and the model is trained with a triplet loss based on cosine similarity.
Results: experiments on three distinct datasets show that Deep Speaker outperforms a DNN-based i-vector baseline. For example, on a text-independent dataset, it reduces the verification equal error rate by 50% (relative) and improves identification accuracy by 60% (relative). The experiments also show that a model trained on Mandarin can improve the recognition accuracy for English speakers.
1. Introduction
Basic knowledge point 1: speaker recognition
The task is to identify speakers from audio data. There are two main variants: speaker verification, a binary classification task (is this utterance from the claimed speaker or not?), and speaker identification, a multi-class classification task (which speaker is talking?).
Basic knowledge point 2: text-dependent vs. text-independent recognition
Speaker recognition can be divided into two categories according to the input: text-dependent recognition, which requires the speaker to utter a specific phrase, and text-independent recognition, which places no constraint on the content of the speech.
Industry quote 1: speaker recognition is still a challenging task.
Basic knowledge point 3: the traditional pipeline
Traditional speaker recognition is based on i-vectors and probabilistic linear discriminant analysis (PLDA). The framework has three steps: 1. collect sufficient statistics; 2. extract the speaker embedding (the i-vector); 3. classify (PLDA).
Basic knowledge point 4: sufficient statistics (also called Baum-Welch statistics) are computed from a Gaussian Mixture Model-Universal Background Model (GMM-UBM), which is fit on sequences of feature vectors (e.g., mel-frequency cepstral coefficients, MFCCs). Recently, deep neural networks (DNNs) have also been used to extract the sufficient statistics.
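To make step 1 concrete, here is a rough numpy/scipy sketch (not the paper's code; the function and argument names are hypothetical) of the zeroth- and first-order Baum-Welch statistics of one utterance under a diagonal-covariance GMM-UBM:

```python
import numpy as np
from scipy.stats import multivariate_normal

def baum_welch_stats(frames, weights, means, covs):
    """Zeroth/first-order Baum-Welch statistics of one utterance.

    frames:  (T, D) sequence of MFCC vectors
    weights: (C,)   GMM-UBM mixture weights
    means:   (C, D) GMM-UBM component means
    covs:    (C, D) diagonal covariances of the components
    """
    C = weights.shape[0]
    # Per-frame, per-component weighted likelihoods -> posteriors.
    lik = np.stack([
        weights[c] * multivariate_normal.pdf(frames, means[c], np.diag(covs[c]))
        for c in range(C)
    ], axis=1)                                   # (T, C)
    post = lik / lik.sum(axis=1, keepdims=True)  # responsibilities, rows sum to 1
    N = post.sum(axis=0)                         # zeroth order: soft frame counts (C,)
    F = post.T @ frames                          # first order: weighted frame sums (C, D)
    return N, F
```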
Basic knowledge point 5: the three steps of the traditional method are optimized independently of each other. DNN-based methods can train steps 1 and 2 jointly, and the frame-level vectors produced by an intermediate bottleneck layer generalize to speakers not seen in training. However, this approach has at least two main problems: (1) steps 1 and 2 are not directly optimized for speaker recognition; (2) training and testing are mismatched: training uses frame-level labels, while evaluation is at the utterance level.
Overview of the algorithm structure 1: (1) a DNN (ResCNN or GRU) extracts frame-level features from utterances. (2) A pooling layer and a length-normalization layer produce utterance-level speaker embeddings. (3) The model is trained with a triplet loss, which minimizes the distance between embedding pairs of the same speaker and maximizes the distance between pairs of different speakers. (4) Pre-training with a softmax layer and cross entropy improves performance. A minimal sketch of how these stages compose is given below.
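In this sketch all names are hypothetical; the frame encoder stands in for the ResCNN or GRU detailed in Section 3:

```python
import torch
import torch.nn.functional as F

def deep_speaker_embed(frame_encoder, projection, utterances):
    """Sketch of stages (1)-(2) for a batch of utterances.

    utterances:    (B, T, D) frame-level acoustic features
    frame_encoder: any module mapping (B, T, D) -> (B, T, H),
                   standing in for the ResCNN or GRU of Section 3
    projection:    affine layer mapping H -> embedding size
    """
    h = frame_encoder(utterances)      # (1) frame-level features
    pooled = h.mean(dim=1)             # (2) temporal average pooling
    e = projection(pooled)             # affine projection
    # (3)/(4): training applies a cosine triplet loss (after optional
    # softmax pre-training) to the length-normalized embeddings below.
    return F.normalize(e, p=2, dim=1)  # (2) length normalization
```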
Basic knowledge point 6: CNNs can effectively reduce the spectral variation of acoustic features and model their spectral correlations.
Algorithm detail 1: unlike a PLDA-style back-end, the loss function here is computed directly on the cosine similarity of the DNN-trained embeddings, so the quantity optimized in training is the same similarity used at test time.
Algorithm detail 2: negative examples are sampled globally across the training data rather than only from within the same mini-batch, which speeds up training.
Conclusion 1: Deep Speaker clearly outperforms the DNN-based i-vector baseline on text-independent speaker recognition. On text-dependent recognition it reaches the baseline, and fine-tuning a model trained on text-independent data further improves text-dependent recognition.
Conclusion 2: (1) Deep Speaker performs well on large-scale data; (2) it transfers well across languages.
2. Related work
Basic knowledge point 7: PLDA can be used to compute vector similarity; its variants include heavy-tailed PLDA and Gaussian PLDA.
3. Deep Speaker
Overall structure: (the paper's architecture diagram; figure not reproduced in these notes)
3.1 DNN structure
3.1.1 Residual CNN
Batch normalization: sequence-wise batch normalization (BN) is applied between the convolution and the nonlinearity, following [18].
Activation: the clipped rectified linear unit (clipped ReLU), $\sigma(x) = \min(\max(x, 0), 20)$.
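A minimal PyTorch sketch of one residual block in this style, with BN between each convolution and the clipped ReLU (channel counts and kernel sizes are illustrative, not the paper's exact ResCNN configuration):

```python
import torch
import torch.nn as nn

def clipped_relu(x, clip=20.0):
    """Clipped ReLU: min(max(x, 0), 20)."""
    return x.clamp(min=0.0, max=clip)

class ResBlock(nn.Module):
    """One residual block: conv -> BN -> clipped ReLU -> conv -> BN, plus skip."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = clipped_relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return clipped_relu(out + x)   # identity shortcut, then activation
```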
3.1.2 GRU
The GRU variant uses only forward (unidirectional) GRU layers.
BN and the clipped ReLU are also applied between the GRU layers.
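A sketch of this variant, assuming (as a simplification) that BN is applied to each GRU layer's output before the clipped ReLU; the layer count and sizes are illustrative:

```python
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    """Stacked forward-only GRUs with BN + clipped ReLU between layers."""
    def __init__(self, input_dim, hidden_dim, num_layers=3):
        super().__init__()
        self.grus = nn.ModuleList(
            nn.GRU(input_dim if i == 0 else hidden_dim, hidden_dim,
                   batch_first=True, bidirectional=False)
            for i in range(num_layers)
        )
        self.bns = nn.ModuleList(nn.BatchNorm1d(hidden_dim) for _ in range(num_layers))

    def forward(self, x):                              # x: (B, T, D)
        for gru, bn in zip(self.grus, self.bns):
            x, _ = gru(x)                              # (B, T, H)
            x = bn(x.transpose(1, 2)).transpose(1, 2)  # BN over the feature dim
            x = x.clamp(min=0.0, max=20.0)             # clipped ReLU
        return x                                       # frame-level features
```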
3.2 Speaker embedding
The utterance-level speaker embedding is produced by mean-pooling the frame-level features and length-normalizing the result (see the pipeline sketch in the overview above).
3.3 Triplet loss and triplet selection
Similarity calculation: since the embeddings are length-normalized, cosine similarity reduces to a dot product:

$s_i^{ap} = \cos(a_i, p_i) = a_i \cdot p_i$

Loss function:

$L = \sum_{i=1}^{N} \left[ s_i^{an} - s_i^{ap} + \alpha \right]_+$

Among them, $a_i$ is the anchor embedding, $p_i$ a positive example (same speaker as the anchor), $s_i^{an}$ the anchor-negative similarity (different speaker), $\alpha$ the margin, and $[x]_+ = \max(x, 0)$.
Important: negative examples are searched for globally across the training data, not just within the current mini-batch.
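A minimal sketch of the triplet loss on unit-norm embeddings, with the global search reduced to picking the hardest negative from a pool of candidates gathered beyond the current mini-batch (names and the margin value are illustrative):

```python
import torch

def cosine_triplet_loss(anchor, positive, negatives, margin=0.1):
    """anchor, positive: (B, E) unit-norm embeddings of the same speakers.
    negatives: (B, K, E) unit-norm candidate embeddings of other speakers,
    e.g. gathered globally rather than only from the current mini-batch.
    """
    s_ap = (anchor * positive).sum(dim=1)                 # (B,) cos = dot product
    s_an = torch.einsum('be,bke->bk', anchor, negatives)  # (B, K) anchor-negative sims
    hardest = s_an.max(dim=1).values                      # hardest negative per anchor
    return torch.clamp(hardest - s_ap + margin, min=0.0).mean()
```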
For more background on the triplet loss, refer to /jcjx0315/article/details/77160273.
3.4 Softmax pre-training
Pre-training (the weights obtained from pre-training initialize the weights for the formal triplet training): the length-normalization and triplet-loss layers are replaced with a classification layer (softmax + cross entropy).
Benefits of pre-training: training with the triplet loss from random initialization converges slowly and can be unstable; softmax + cross-entropy pre-training provides a well-initialized embedding, after which triplet training converges faster and to a better optimum. A sketch of the two-phase schedule follows.
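A self-contained sketch of the two-phase schedule (the epoch counts match the note below; the encoder stand-in, dimensions, and speaker count are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_speakers = 1000          # illustrative size of the training-speaker set
emb_dim, feat_dim = 512, 64  # illustrative dimensions

encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)  # stand-in frame encoder
classifier = nn.Linear(emb_dim, num_speakers)          # pre-training head

def pretrain_logits(x):
    """Phase 1: utterance -> class logits (softmax + cross entropy)."""
    h, _ = encoder(x)                 # (B, T, E) frame-level features
    return classifier(h.mean(dim=1))  # mean pool, then classify; no length norm

def triplet_embedding(x):
    """Phase 2: classifier dropped, length normalization restored."""
    h, _ = encoder(x)
    return F.normalize(h.mean(dim=1), dim=1)  # unit-norm embedding for triplet loss

# Phase 1 (~10 epochs): minimize nn.CrossEntropyLoss() on pretrain_logits(x).
# Phase 2 (~15 epochs): reuse the encoder weights and minimize the triplet
# loss on triplet_embedding(x); the classifier is discarded.
```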
Note: in the training curves there is a pre-training boundary: the first 10 epochs are softmax pre-training and the last 15 are formal triplet training, which causes the abrupt jump in ACC and EER at the transition.