Recently I started building my own Chinese speech recognition system. Since I was only at the very beginning, it was quite difficult, until I found an open source project on GitHub that was doing what I had in mind, which gave me the motivation to keep going. Here is the original author's project on GitHub: a Chinese speech recognition system based on deep learning.
The author is very generous, and the project gave me a lot of inspiration. My own project is here: ASR.
The project is still in its infancy. Some results have been obtained, but they are not very good yet, and I am still making adjustments; GitHub will be updated when there are better results. For now, this article sorts out the ideas behind building the system.
First, let me introduce the data set I use.
It is the THCHS-30 Chinese corpus released by Tsinghua University, whose labels are pinyin sequences.
Download: data_thchs30.tgz (OpenSLR domestic mirror / OpenSLR overseas mirror).
For an introduction to this data set, see THCHS-30: A Free Chinese Speech Corpus.
The data set is already split into a training set, a validation set and a test set (in the train, dev and test folders respectively). The training set has 10,000 samples, the validation set 893 samples and the test set 2,495 samples; each sample is a speech segment of about 10 seconds.
The thchs30 folder contains the index files (the cv and dev indexes appear to be identical).
The wav.txt index gives the relative paths of the audio files.
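As a quick illustration of how such an index can be consumed, here is a small sketch that reads the relative wav paths from it. The file name thchs30/train.wav.txt and the one-path-per-line format are assumptions made for illustration, not the exact layout of the corpus.

```python
# Illustrative sketch only: the index file name and its one-relative-path-per-line
# format are assumptions, not necessarily the exact layout shipped with THCHS-30.
def load_wav_index(index_path="thchs30/train.wav.txt"):
    """Read an index file that lists one relative wav path per line."""
    with open(index_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

if __name__ == "__main__":
    wav_paths = load_wav_index()
    print(len(wav_paths), wav_paths[0])  # expect 10,000 entries for the training set
```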
Generally speaking, the features commonly used for speech recognition are MFCCs, Fbank features and spectrograms.
In this project, 80-dimensional Fbank features are used for now; the python_speech_features library is used to extract them, and the extracted features are saved as .npy files.
Feature extraction was covered in detail in a previous article: using python_speech_features to extract features from audio files.
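As a concrete sketch of this step (not the project's exact code), the snippet below extracts 80-dimensional log-Fbank features with python_speech_features and writes them to an .npy file. The 25 ms window, 10 ms hop, example file name and output directory are all assumptions made for illustration.

```python
# A minimal sketch of the feature-extraction step described above.
# File names, directories and window settings are illustrative assumptions.
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import logfbank

def extract_fbank(wav_path, n_filters=80):
    """Read a 16 kHz wav file and return (frames, 80) log-Fbank features."""
    sample_rate, signal = wav.read(wav_path)
    # 25 ms windows with a 10 ms shift, 80 mel filter banks
    feats = logfbank(signal, samplerate=sample_rate,
                     winlen=0.025, winstep=0.01,
                     nfilt=n_filters, nfft=512)
    return feats.astype(np.float32)

if __name__ == "__main__":
    feats = extract_fbank("data_thchs30/train/A11_0.wav")  # example file name
    np.save("features/A11_0.npy", feats)                   # saved as an .npy file
    print(feats.shape)  # (number_of_frames, 80)
```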
The pinyin syllables in the labels are converted into numbers; for example, a1 becomes 0, a2 becomes 1, and so on.
Take the first sample as an example: its pinyin label (a sequence of tone-numbered syllables, e.g. yang2 chun1 ...) converted into the corresponding list of numbers is:
597 9 10 1 126 159 1 12 1 45 1 19 1 505 105 1 1 209 208 2 15 874 939 1 168 208 570 599 325 9 10 597 208 1072 420 1099 634 907 1 140 14 829
Similarly, the labels are also saved as .npy files.
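Below is a minimal sketch of this label-encoding step. The vocabulary is built here by sorting the distinct syllables, which is only one possible convention, so the resulting indices (and the toy labels used) will not match the project's actual mapping.

```python
# A minimal sketch of the label step: build a pinyin vocabulary and map each
# tone-numbered syllable to an integer index. The sorted-vocabulary convention
# and the toy labels are assumptions for illustration only.
import numpy as np

def build_vocab(label_sequences):
    """Collect every distinct pinyin syllable (e.g. 'a1', 'a2') into an index map."""
    syllables = sorted({syl for seq in label_sequences for syl in seq})
    return {syl: idx for idx, syl in enumerate(syllables)}

def encode_label(pinyin_seq, vocab):
    """Convert a list of pinyin syllables into a list of integer ids."""
    return [vocab[syl] for syl in pinyin_seq]

if __name__ == "__main__":
    # toy labels; the real labels come from the THCHS-30 transcriptions
    labels = [["a1", "a2", "ba1"], ["a2", "ba3"]]
    vocab = build_vocab(labels)
    encoded = encode_label(labels[0], vocab)
    np.save("labels/sample_0.npy", np.array(encoded, dtype=np.int32))
    print(vocab, encoded)
```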
The deep learning model used in this system is the deep fully convolutional neural network (DFCNN) proposed by iFLYTEK. Paper: Research Progress and Prospects of Speech Recognition Technology.
Its structure diagram is as follows:
CTC loss is chosen as the loss function.
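Since neither the exact layer configuration nor the training framework is spelled out here, the following is only a rough PyTorch sketch of a DFCNN-style model: stacked 3x3 convolutions with max pooling over the time-frequency "image" of the Fbank features, followed by a projection to the pinyin vocabulary plus a CTC blank, trained with CTC loss. All layer sizes, the vocabulary size of 1210 and the choice of PyTorch are placeholder assumptions, not the project's actual configuration.

```python
# A simplified, assumed PyTorch sketch of a DFCNN-style acoustic model with CTC
# loss. Layer sizes, depth and vocabulary size are illustrative placeholders.
import torch
import torch.nn as nn

class DFCNNLike(nn.Module):
    def __init__(self, n_mels=80, vocab_size=1210):  # vocab_size is illustrative
        super().__init__()
        # Convolution blocks over the (time, frequency) "image" of Fbank features
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # halves the time and frequency axes
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * (n_mels // 4), vocab_size + 1)  # +1 for the CTC blank

    def forward(self, feats):                     # feats: (batch, time, n_mels)
        x = feats.unsqueeze(1)                    # -> (batch, 1, time, n_mels)
        x = self.conv(x)                          # -> (batch, 64, time/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)      # -> (batch, time/4, 64 * n_mels/4)
        return self.fc(x).log_softmax(-1)         # per-frame log-probabilities

model = DFCNNLike()
ctc_loss = nn.CTCLoss(blank=model.fc.out_features - 1, zero_infinity=True)

feats = torch.randn(2, 400, 80)                   # two fake utterances of 400 frames
log_probs = model(feats).transpose(0, 1)          # CTCLoss expects (time, batch, classes)
targets = torch.randint(0, 1210, (2, 30))         # fake pinyin id sequences
input_lengths = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((2,), 30, dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

Note that the two pooling layers shrink the time axis by a factor of four, so the CTC input lengths must be the pooled frame counts rather than the raw Fbank frame counts.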
To be updated...