The application of speech recognition dictation machines in several fields was rated by the American press as one of the top ten events in computer development in 1997. Many experts believe that speech recognition is one of the ten most important technology developments in the information technology field between 2000 and 2010.
The fields involved in speech recognition technology include signal processing, pattern recognition, probability and information theory, the mechanisms of speech production and hearing, artificial intelligence, and so on.

Classification and application of tasks

According to the recognition object, speech recognition tasks can be roughly divided into three categories: isolated word recognition, keyword spotting, and continuous speech recognition. The task of isolated word recognition is to recognize isolated words known in advance, such as "power on" and "power off". The task of continuous speech recognition is to recognize arbitrary continuous speech, such as a sentence or a paragraph. Keyword spotting in a continuous speech stream is also aimed at continuous speech, but it does not recognize every word; it only detects where certain known keywords appear, such as "computer" and "world" in a passage.
According to the speaker, speech recognition technology can be divided into speaker-dependent and speaker-independent speech recognition. The former can only recognize the voices of one or a few specific people, while the latter can be used by anyone. Obviously, a speaker-independent system better meets practical needs, but it is much harder to build than recognition for specific speakers.
In addition, according to the device and channel, speech recognition can be divided into desktop (PC) speech recognition, telephone speech recognition, and speech recognition on embedded devices (mobile phones, PDAs, etc.). Different acquisition channels distort the acoustic characteristics of human speech in different ways, so a separate recognition system needs to be built for each.
Speech recognition has a wide range of applications. Common application systems include: voice input systems, which fit people's everyday habits better than keyboard input and are more natural and more efficient; voice control systems, which use speech to control the operation of equipment, are faster and more convenient than manual control, and can be used in industrial control, voice dialing, smart home appliances, voice-activated smart toys, and many other fields; and intelligent dialogue query systems, which respond to the customer's voice and provide natural, friendly database retrieval services such as home services, hotel services, travel agency services, booking systems, medical services, banking services, and stock inquiry services.

Front-end

Front-end processing refers to processing the original speech before feature extraction, partially eliminating noise and speaker-dependent effects so that the processed signal better reflects the essential features of the speech. The most commonly used front-end processing steps are endpoint detection and speech enhancement. Endpoint detection distinguishes the speech and non-speech segments of a signal and accurately determines the starting point of the speech. After endpoint detection, only the speech segments need to be processed, which plays an important role in improving both model accuracy and recognition accuracy. The main task of speech enhancement is to eliminate the influence of environmental noise on the speech. The most commonly used method is the Wiener filter, which performs better than other filters when the noise is strong.

Processing of acoustic features

The extraction and selection of acoustic features is an important link in speech recognition. Feature extraction is both an information compression process and a signal deconvolution process, intended to make the classes easier for the pattern classifier to separate. Because the speech signal is time-varying, features must be extracted from short segments of the signal, i.e., by short-time analysis. The interval over which the signal is considered stationary is called a frame, and the offset between frames is usually 1/2 or 1/3 of the frame length. The signal is usually pre-emphasized to boost the high frequencies, and windowed to avoid edge effects at the boundaries of the short speech segment.

Some commonly used acoustic features

* Linear prediction coefficients (LPC): Linear prediction analysis starts from the mechanism of human speech production. Through the study of a model of the vocal tract as a cascade of short tubes, the transfer function of the system is taken to have the form of an all-pole digital filter, so that the signal at time n can be estimated as a linear combination of the signals at earlier times. The linear prediction coefficients are obtained by minimizing the mean square error between the actual speech samples and the linearly predicted samples. Their computation methods include the autocorrelation method (Durbin's method), the covariance method, and the lattice method; fast and effective computation has ensured the wide use of this acoustic feature. Similar to the LPC parameter model are acoustic features such as the line spectrum pair (LSP) and reflection coefficients. (The short-time analysis and the autocorrelation method are sketched in code after this list.)
* Cepstrum coefficients (CEP): Cepstrum coefficients can be obtained by homomorphic processing: take the discrete Fourier transform (DFT) of the speech signal, take the logarithm, and then apply the inverse transform (IDFT). For the LPC cepstrum (LPCCEP), once the linear prediction coefficients of the filter are obtained, the cepstrum can be computed by a recursive formula. Experiments show that the cepstrum improves the stability of the feature parameters. (The plain cepstrum computation is included in the sketch after this list.)
* Mel-frequency cepstrum coefficients (MFCC) and perceptual linear prediction (PLP): Unlike LPC and other acoustic features obtained by studying the human speech production mechanism, MFCC and PLP are acoustic features derived from research on the human auditory system. Studies of human hearing show that when two tones of similar frequency are played at the same time, a person hears only one tone. The critical bandwidth is the bandwidth boundary at which this subjective perception changes abruptly: when the frequency difference between two tones is less than the critical bandwidth, they are heard as a single tone; this is the so-called masking effect. The Mel scale is one way of measuring this critical bandwidth.
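As a minimal sketch, assuming 16 kHz mono audio in a numpy array, of the short-time analysis and two of the features above (LPC via the autocorrelation/Durbin method, and the plain cepstrum); the frame length, shift, and LPC order below are illustrative choices, not values from the text:

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """High-frequency boost: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len=400, frame_shift=200):
    """Overlapping Hamming-windowed frames; a 400-sample frame with a
    200-sample shift (1/2 of the frame length, as suggested above)."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len] * window
                     for i in range(n_frames)])

def lpc(frame, order=12):
    """Autocorrelation method (Durbin's recursion) for the coefficients of
    the prediction error filter A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
    Assumes a non-silent frame (r[0] > 0)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1 : len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])) / err  # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1 : 0 : -1]
        a[i] = k
        err *= 1.0 - k * k  # prediction error shrinks at each order
    return a

def real_cepstrum(frame):
    """Cepstrum as described above: DFT -> log magnitude -> inverse DFT.
    The small floor avoids log(0) on silent frames."""
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    return np.fft.irfft(log_mag, n=len(frame))
```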
The computation of MFCC first transforms the time-domain signal into the frequency domain with an FFT, then filters its logarithmic energy spectrum with a bank of triangular filters distributed according to the Mel scale, and finally applies a discrete cosine transform (DCT) to the vector of filter outputs, keeping the first N coefficients. PLP still uses Durbin's method to compute the LPC parameters, but it obtains the autocorrelation parameters from the auditory spectrum by a DCT.
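A hedged sketch of the MFCC pipeline just described; the filter count, FFT size, and sample rate are illustrative assumptions, not values from the text:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    """Mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Triangular filters with centers spaced evenly on the Mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling edge
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(frame, n_coeffs=13, n_fft=512, sample_rate=16000):
    """FFT -> Mel filterbank energies -> log -> DCT, keeping the first N."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    energies = mel_filterbank(n_fft=n_fft, sample_rate=sample_rate) @ power
    return dct(np.log(energies + 1e-10), type=2, norm="ortho")[:n_coeffs]
```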
Acoustic model

The model of a speech recognition system usually consists of an acoustic model and a language model, corresponding respectively to the computation of speech-to-syllable probabilities and syllable-to-word probabilities. This section and the next introduce acoustic modeling and language modeling respectively.

HMM acoustic modeling: Conceptually, a Markov model is a discrete finite-state automaton in the time domain. The "hidden" in hidden Markov model (HMM) means that the internal states of the Markov model are invisible to the outside world, which sees only an output value at each moment. For a speech recognition system, the output values are usually the acoustic features computed from each frame. Describing the speech signal with an HMM rests on two assumptions: that each state transition depends only on the previous state, and that each output value depends only on the current state (or the current state transition). These assumptions greatly reduce the complexity of the model. The algorithms for scoring, decoding, and training an HMM are the forward algorithm, the Viterbi algorithm, and the forward-backward algorithm, respectively.
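As an illustration of the scoring step, here is a minimal forward algorithm for an HMM with discrete outputs; `pi`, `A`, and `B` are assumed inputs, and a real recognizer would replace the discrete emission table with per-state acoustic likelihoods (e.g., Gaussian mixtures over the frame features):

```python
import numpy as np

def forward_score(pi, A, B, observations):
    """Forward algorithm (the 'scoring' step): total probability of an
    observation sequence under an HMM with initial distribution pi,
    transition matrix A (A[i, j] = P(state j | state i)) and discrete
    emission matrix B (B[i, o] = P(output o | state i))."""
    alpha = pi * B[:, observations[0]]  # initialize with the first output
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate one step, then emit
    return alpha.sum()                  # sum over all final states
```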
In speech recognition, the HMM is usually modeled with a one-way, left-to-right topology with self-loops and skips. A phoneme is a three- to five-state HMM, a word is an HMM formed by concatenating the HMMs of several phonemes, and the whole model of continuous speech recognition is a combination of words and silence.

Context-dependent modeling: Coarticulation refers to the change a sound undergoes under the influence of adjacent sounds. In terms of the speech production mechanism, the human vocal organs can only change gradually as one sound turns into another, which makes the spectrum of the latter sound differ from its spectrum in other contexts. Context-dependent modeling takes this influence into account, so that the model describes speech more accurately. A bi-phone considers only the influence of the preceding sound, while a tri-phone considers both the preceding and the following sound.
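Returning to the topology just described, a minimal sketch of a left-to-right transition matrix with self-loops and skips; all probabilities here are illustrative, not values from the text:

```python
import numpy as np

def left_to_right_transitions(n_states=3, p_stay=0.6, p_next=0.3, p_skip=0.1):
    """Transition matrix for a left-to-right HMM with self-loops and skips,
    as in a 3- to 5-state phoneme model (exit transitions omitted; rows are
    renormalized where a step or skip would leave the model)."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = p_stay                        # self-loop
        if i + 1 < n_states:
            A[i, i + 1] = p_next                # step to the next state
        if i + 2 < n_states:
            A[i, i + 2] = p_skip                # skip one state
    return A / A.sum(axis=1, keepdims=True)     # renormalize each row
```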
English context-dependent modeling is usually based on phonemes. Because some phonemes have similar effects on the following phoneme, model parameters can be shared by clustering the HMM states of the phoneme models; the result of the clustering is called a senone. A decision tree is used to realize an effective mapping from triphones to senones: by answering a series of questions about phonetic categories (vowel/consonant, voiced/unvoiced, etc.), it determines which senone each HMM state should use. A classification and regression tree (CART) model can be used to label a word's pronunciation as a phoneme sequence.

Language models

Language models are mainly divided into rule-based models and statistical models. A statistical language model reveals the inherent statistical regularities of language units in the form of probabilities; among these, the N-gram is simple, effective, and widely used.
N-gram: This model is based on the assumption that the occurrence of the n-th word depends only on the preceding N-1 words and on no others, so the probability of a whole sentence is the product of the conditional probabilities of its words. These probabilities can be estimated by directly counting how often groups of N words occur together in a corpus. Bigram and trigram models are the most commonly used.
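A minimal sketch of maximum-likelihood bigram estimation by direct counting, as described above; the sentence markers `<s>`/`</s>` are a common convention, not from the text:

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram estimates counted directly from a corpus.
    '<s>' and '</s>' are conventional sentence-boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens[:-1])            # count histories only
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda h, w: bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0

def sentence_prob(p, words):
    """Sentence probability as the product of conditional word probabilities."""
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for h, w in zip(tokens, tokens[1:]):
        prob *= p(h, w)
    return prob

# e.g. p = train_bigram([["open", "the", "door"], ["close", "the", "door"]])
#      sentence_prob(p, ["open", "the", "door"])  # 0.5 * 1.0 * 1.0 * 1.0 = 0.5
```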
The performance of a language model is usually measured by cross entropy and perplexity. Cross entropy reflects how difficult text is to recognize with the model or, from a compression point of view, how many bits on average are needed to encode each word. Perplexity represents the average number of branches the model assigns to the text, and its reciprocal can be regarded as the average probability of each word. Smoothing assigns probability to unseen N-grams, ensuring that every word sequence receives a nonzero probability from the language model. Commonly used smoothing techniques include Good-Turing estimation, deleted interpolation, Katz smoothing, and Kneser-Ney smoothing.
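Perplexity can then be computed from the same kind of conditional model; this sketch assumes `p` has been smoothed so that no probability is zero:

```python
import math

def perplexity(p, sentences):
    """Perplexity = 2 ** cross-entropy: the average branching factor per
    word. `p(history, word)` is a conditional model such as the bigram
    above, and must be smoothed so that no probability is zero."""
    log_prob, n_words = 0.0, 0
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for h, w in zip(tokens, tokens[1:]):
            log_prob += math.log2(p(h, w))
            n_words += 1
    return 2.0 ** (-log_prob / n_words)
```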
Search

The search in continuous speech recognition looks for a sequence of word models that describes the input speech signal, from which the decoded word sequence is obtained. The search is based on the acoustic model and language model scores. In practice, a high weight, set empirically, is usually given to the language model, and a penalty is imposed on long words.

Viterbi: Based on dynamic programming, the Viterbi algorithm computes, at each time point, the posterior probability of each decoded state sequence given the observation sequence, keeps the path with the highest probability, and records the corresponding state information at each node so that the decoded word sequence can finally be recovered in reverse. Without losing the optimal solution, the Viterbi algorithm simultaneously solves the nonlinear time alignment between the HMM state sequence and the acoustic observation sequence, word boundary detection, and word recognition in continuous speech recognition, which makes it the basic strategy of speech recognition search.
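A minimal Viterbi decoder over an HMM with discrete outputs, matching the description above; it assumes strictly positive probabilities (otherwise apply a small floor before taking logs):

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Viterbi decoding: the single most likely HMM state sequence for an
    observation sequence, with backpointers recorded for the traceback."""
    T = len(observations)
    delta = np.log(pi) + np.log(B[:, observations[0]])
    backpointers = np.zeros((T, len(pi)), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)  # scores[i, j]: best path ending in j via i
        backpointers[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, observations[t]])
    # Trace the recorded backpointers from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]
```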
Because speech recognition cannot foresee the signal after the current time point, heuristic pruning based on an objective function is hard to apply. However, because the Viterbi algorithm is time-synchronous, all paths at a given time correspond to the same observation sequence and are therefore directly comparable. A beam search exploits this by keeping only the few highest-probability paths at each moment, which greatly improves search efficiency. This time-synchronous Viterbi beam algorithm is currently the most effective search algorithm in speech recognition.
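A sketch of the beam pruning step, applied as `delta = prune(delta)` after each frame update in a Viterbi loop like the one above; the beam width is an illustrative value:

```python
import numpy as np

def prune(delta, beam_width=200.0):
    """Time-synchronous beam pruning: drop every partial path whose log
    score falls more than `beam_width` below the best score at this frame,
    so only the surviving states are extended at the next frame."""
    return np.where(delta >= delta.max() - beam_width, delta, -np.inf)
```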
N-best search and multi-pass search: To bring various knowledge sources into the search, multiple search passes are usually needed. The first pass uses inexpensive knowledge sources to generate a candidate list or a word lattice, and on this basis a second pass uses expensive knowledge sources to obtain the best path. The knowledge sources introduced earlier, namely the acoustic model, the language model, and the pronunciation dictionary, can be used in the first pass. To achieve more advanced speech recognition or spoken language understanding, more expensive knowledge sources are often needed for rescoring, such as 4th- or 5th-order N-grams, context-dependent models of 4th or higher order, inter-word correlation models, segmentation models, or grammar analysis. Many of the latest real-time large-vocabulary continuous speech recognition systems use this multi-pass search strategy.

N-best search generates a candidate list; every node would have to keep its n best paths, so the computational complexity grows to n times the original. A simplification is to keep only a few candidate words at each node, at the risk of losing sub-optimal candidates. A compromise is to consider only paths through pairs of words and keep k of them. A word lattice gives multiple candidates in a more compact form, and by modifying the N-best search algorithm one can obtain an algorithm for generating word lattices.
The forward-backward search algorithm is an example of multi-pass search. When simple knowledge sources are applied in a forward Viterbi search, the forward probabilities obtained during that search can be used to compute the objective function of a backward search, so a heuristic A* search can be used for the backward pass, economically producing N candidates.

System implementation

The requirements on the choice of recognition unit in a speech recognition system are an accurate definition, enough data for training, and generality. English is usually modeled with context-dependent phonemes; in Chinese, coarticulation is not as severe as in English, so syllable modeling can be used. The amount of training data a system needs is related to the complexity of the model: a model whose designed complexity exceeds what the provided training data can support will make performance drop sharply.
Dictation machine: A large-vocabulary, speaker-independent, continuous speech recognition system is usually called a dictation machine. Its architecture is the HMM topology built from the acoustic and language models described above. During training, the model parameters of each recognition unit are obtained with the forward-backward algorithm. During recognition, the units are concatenated into words, a silence model is inserted between words, and a language model is introduced as the inter-word transition probability, forming a looped structure that is decoded with the Viterbi algorithm. Given how easily Chinese can be segmented, segmenting first and then decoding each segment is a simplification that improves efficiency.
Dialogue system: A system that realizes spoken human-machine dialogue is called a dialogue system. Limited by current technology, dialogue systems are usually narrow-domain systems with limited vocabularies, covering topics such as travel inquiries, booking, and database retrieval. The front end is a speech recognizer that produces an N-best candidate list or a word lattice; a parser then performs semantic analysis, a dialogue manager determines the response, and a speech synthesizer outputs it. Because current systems often have limited vocabularies, semantic information can also be obtained by spotting keywords.

Adaptation and robustness

The performance of a speech recognition system is affected by many factors, including different speakers, speaking styles, environmental noise, and transmission channels. Improving robustness means improving the system's ability to overcome these factors so that it stays stable across different application environments and conditions. The purpose of adaptation is to adjust the system automatically and specifically to a given influence, gradually improving performance during use. The following are solutions targeted at the different factors that affect system performance.
The solutions can be divided into two categories: those that operate on the speech features (hereinafter, feature methods) and those that adjust the models (hereinafter, model methods). The former look for feature parameters that are better and more robust, or apply specific processing to the existing feature parameters. The latter use a small amount of adaptation speech to modify or transform the original speaker-independent (SI) model into a speaker-adaptive (SA) model.
Feature methods for speaker adaptation include speaker normalization and speaker subspace methods; model methods include Bayesian methods, transformation methods, and model merging methods.
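As one concrete instance of the Bayesian model method, a hedged sketch of MAP re-estimation of a single Gaussian mean; `tau` is an illustrative prior weight, not a value from the text:

```python
import numpy as np

def map_adapt_mean(si_mean, adaptation_frames, tau=10.0):
    """MAP (Bayesian) re-estimate of a Gaussian mean: interpolate between
    the speaker-independent mean (the prior) and the sample mean of the
    speaker's adaptation frames, weighted by the amount of data n."""
    n = len(adaptation_frames)
    sample_mean = np.mean(adaptation_frames, axis=0)
    return (tau * si_mean + n * sample_mean) / (tau + n)
```

With little adaptation data the estimate stays close to the speaker-independent prior; as n grows it approaches the speaker's own sample mean.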
The noise in a speech system includes environmental noise and electronic noise introduced during recording. Feature methods to improve robustness against noise include speech enhancement and finding features that are insensitive to noise; model methods include parallel model combination (PMC) and adding noise artificially during training. Channel distortion arises from, among other things, the distance to the microphone, microphones of different sensitivities, preamplifiers of different gains, and different filter designs. Feature methods include subtracting the long-term average from the cepstral vectors and RASTA filtering; model methods include cepstral shift.
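The cepstral-mean subtraction just mentioned is simple enough to show directly; `cepstra` is assumed to be a frames-by-coefficients array:

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the long-term average from each cepstral vector (rows are
    frames). A fixed channel multiplies the spectrum, hence adds a constant
    in the cepstral domain, so removing the mean removes the channel."""
    return cepstra - cepstra.mean(axis=0)
```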
Microsoft speech recognition engine

Microsoft uses its own speech recognition engine in both Office and Vista. The Microsoft speech recognition engine is completely free to use, so many speech recognition applications based on it have appeared, such as Voice Game Master, Voice Control Expert, and Open Sesame.

Performance indicators of speech recognition systems

There are four main performance indicators of a speech recognition system. (1) Vocabulary range: the range of words or phrases the machine can recognize; without any restriction, the vocabulary can be considered unlimited. (2) Speaker restriction: whether the system recognizes only the speech of designated speakers or that of any speaker. (3) Training requirements: whether training is needed before use, i.e., whether the machine must first "listen" to given speech, and how many training passes are needed. (4) Correct recognition rate: the average percentage of correct recognitions, which is related to the first three indicators.

Summary
The preceding sections introduced the technologies behind each aspect of a speech recognition system. These techniques have achieved good results in practical use, but how to overcome the many factors that affect speech still needs deeper analysis. At present, dictation systems cannot completely replace keyboard input, but the maturity of recognition technology has driven research into higher-level speech understanding. Because English and Chinese have different characteristics, how to apply in Chinese the techniques developed for English is an important research topic, and problems unique to Chinese, such as the four tones, also need to be solved.