Recently I started building my own Chinese speech recognition system. Since I was only at the very beginning, it was quite difficult, until I found an open source project on GitHub that was doing what I had in mind, which gave me the motivation to keep going. Here is the original author's project on GitHub: a Chinese speech recognition system based on deep learning.
The author is very generous, and the project gave me a lot of inspiration. My own project is here: ASR.
The project is still in its infancy. Some results have been obtained, but they are not very good yet, and I am still making adjustments; GitHub will be updated when there are better results. For now, this article sorts out the ideas behind building the system.
First, let me introduce the data set I use.
It is the THCHS-30 Chinese corpus released by Tsinghua University, whose labels are pinyin sequences.
Download: data_thchs30.tgz (OpenSLR domestic mirror / OpenSLR overseas mirror).
For an introduction to this data set, see THCHS-30: A Free Chinese Speech Corpus.
The data set is already split into a training set, a validation set and a test set (in the train, dev and test folders respectively). The training set has 10,000 samples, the validation set 893 samples and the test set 2,495 samples; each sample is a speech segment of about 10 seconds.
The thchs30 folder contains the index files (the cv and dev indexes appear to be identical).
The wav.txt index gives the relative paths of the audio files.
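As a quick illustration of how such an index can be consumed, here is a small sketch that reads the relative wav paths from it. The file name thchs30/train.wav.txt and the one-path-per-line format are assumptions made for illustration, not the exact layout of the corpus.

```python
# Illustrative sketch only: the index file name and its one-relative-path-per-line
# format are assumptions, not necessarily the exact layout shipped with THCHS-30.
def load_wav_index(index_path="thchs30/train.wav.txt"):
    """Read an index file that lists one relative wav path per line."""
    with open(index_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

if __name__ == "__main__":
    wav_paths = load_wav_index()
    print(len(wav_paths), wav_paths[0])  # expect 10,000 entries for the training set
```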
Generally speaking, the features commonly used for speech recognition are MFCCs, Fbank features and spectrograms.
In this project, 80-dimensional Fbank features are used for now; the python_speech_features library is used to extract them, and the extracted features are saved as .npy files.
Feature extraction was covered in detail in a previous article: using python_speech_features to extract features from audio files.
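As a concrete sketch of this step (not the project's exact code), the snippet below extracts 80-dimensional log-Fbank features with python_speech_features and writes them to an .npy file. The 25 ms window, 10 ms hop, example file name and output directory are all assumptions made for illustration.

```python
# A minimal sketch of the feature-extraction step described above.
# File names, directories and window settings are illustrative assumptions.
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import logfbank

def extract_fbank(wav_path, n_filters=80):
    """Read a 16 kHz wav file and return (frames, 80) log-Fbank features."""
    sample_rate, signal = wav.read(wav_path)
    # 25 ms windows with a 10 ms shift, 80 mel filter banks
    feats = logfbank(signal, samplerate=sample_rate,
                     winlen=0.025, winstep=0.01,
                     nfilt=n_filters, nfft=512)
    return feats.astype(np.float32)

if __name__ == "__main__":
    feats = extract_fbank("data_thchs30/train/A11_0.wav")  # example file name
    np.save("features/A11_0.npy", feats)                   # saved as an .npy file
    print(feats.shape)  # (number_of_frames, 80)
```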
The pinyin syllables in the labels are converted into numbers; for example, a1 becomes 0, a2 becomes 1, and so on.
Take the first sample as an example: its pinyin label (a sequence of tone-numbered syllables, e.g. yang2 chun1 ...) converted into the corresponding list of numbers is:
597 9 10 1 126 159 1 12 1 45 1 19 1 505 105 1 1 209 208 2 15 874 939 1 168 208 570 599 325 9 10 597 208 1072 420 1099 634 907 1 140 14 829
Similarly, the labels are also saved as .npy files.
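Below is a minimal sketch of this label-encoding step. The vocabulary is built here by sorting the distinct syllables, which is only one possible convention, so the resulting indices (and the toy labels used) will not match the project's actual mapping.

```python
# A minimal sketch of the label step: build a pinyin vocabulary and map each
# tone-numbered syllable to an integer index. The sorted-vocabulary convention
# and the toy labels are assumptions for illustration only.
import numpy as np

def build_vocab(label_sequences):
    """Collect every distinct pinyin syllable (e.g. 'a1', 'a2') into an index map."""
    syllables = sorted({syl for seq in label_sequences for syl in seq})
    return {syl: idx for idx, syl in enumerate(syllables)}

def encode_label(pinyin_seq, vocab):
    """Convert a list of pinyin syllables into a list of integer ids."""
    return [vocab[syl] for syl in pinyin_seq]

if __name__ == "__main__":
    # toy labels; the real labels come from the THCHS-30 transcriptions
    labels = [["a1", "a2", "ba1"], ["a2", "ba3"]]
    vocab = build_vocab(labels)
    encoded = encode_label(labels[0], vocab)
    np.save("labels/sample_0.npy", np.array(encoded, dtype=np.int32))
    print(vocab, encoded)
```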
The deep learning model used in this system is the deep fully convolutional neural network (DFCNN) proposed by iFLYTEK. Paper: Research Progress and Prospects of Speech Recognition Technology.
Its structure diagram is as follows:
CTC loss is chosen as the loss function.
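Since neither the exact layer configuration nor the training framework is spelled out here, the following is only a rough PyTorch sketch of a DFCNN-style model: stacked 3x3 convolutions with max pooling over the time-frequency "image" of the Fbank features, followed by a projection to the pinyin vocabulary plus a CTC blank, trained with CTC loss. All layer sizes, the vocabulary size of 1210 and the choice of PyTorch are placeholder assumptions, not the project's actual configuration.

```python
# A simplified, assumed PyTorch sketch of a DFCNN-style acoustic model with CTC
# loss. Layer sizes, depth and vocabulary size are illustrative placeholders.
import torch
import torch.nn as nn

class DFCNNLike(nn.Module):
    def __init__(self, n_mels=80, vocab_size=1210):  # vocab_size is illustrative
        super().__init__()
        # Convolution blocks over the (time, frequency) "image" of Fbank features
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # halves the time and frequency axes
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * (n_mels // 4), vocab_size + 1)  # +1 for the CTC blank

    def forward(self, feats):                     # feats: (batch, time, n_mels)
        x = feats.unsqueeze(1)                    # -> (batch, 1, time, n_mels)
        x = self.conv(x)                          # -> (batch, 64, time/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)      # -> (batch, time/4, 64 * n_mels/4)
        return self.fc(x).log_softmax(-1)         # per-frame log-probabilities

model = DFCNNLike()
ctc_loss = nn.CTCLoss(blank=model.fc.out_features - 1, zero_infinity=True)

feats = torch.randn(2, 400, 80)                   # two fake utterances of 400 frames
log_probs = model(feats).transpose(0, 1)          # CTCLoss expects (time, batch, classes)
targets = torch.randint(0, 1210, (2, 30))         # fake pinyin id sequences
input_lengths = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((2,), 30, dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

Note that the two pooling layers shrink the time axis by a factor of four, so the CTC input lengths must be the pooled frame counts rather than the raw Fbank frame counts.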
To be updated...