Appendix: English original

Chinese Journal of Electronics, Vol.15, No.3, July 2006

A Speaker-Independent Continuous Speech Recognition System Using Biomimetic Pattern Recognition

WANG Shoujue and QIN Hong
(Laboratory of Artificial Neural Networks, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China)

Abstract: In speaker-independent speech recognition, the disadvantage of the most widely used technology (HMMs, or Hidden Markov models) is that it requires not only many training samples but also a long training time. This paper describes the use of Biomimetic pattern recognition (BPR) for recognizing Mandarin continuous speech in a speaker-independent manner. A speech database was developed for this study. Its vocabulary consists of 15 Chinese dish names, each four Chinese words long. Neural networks (NNs) based on the multi-weights neuron (MWN) model are used to train and recognize the speech sounds, and the number of MWNs was investigated to achieve the optimal performance of the NN-based BPR. The system, which is based on BPR and carries out recognition in real time, reaches a recognition rate of 98.14% for the first option and 99.81% for the first two options for speakers of standard Chinese from different provinces of China. Experiments were also carried out to compare continuous-density hidden Markov models (CDHMM), dynamic time warping (DTW) and BPR for speech recognition. The results show that BPR outperforms CDHMM and DTW, especially when only a finite number of training samples is available.

Key words: Biomimetic pattern recognition, Speech recognition, Hidden Markov models (HMMs), Dynamic time warping (DTW).

I. Introduction

The main goal of automatic speech recognition (ASR) is to produce a system that will accurately recognize normal human speech from any speaker. A recognition system may be classified as speaker-dependent or speaker-independent. Speaker dependence requires that the system be personally trained with the speech of the person who will operate it, in order to achieve a high recognition rate. For applications in public facilities, on the other hand, the system must be capable of recognizing speech uttered by many different people of different gender, age, accent, etc. Speaker independence therefore has many more applications, primarily in the general area of public facilities.

The most widely used technology in speaker-independent speech recognition is Hidden Markov models; its disadvantage is that it requires not only many training samples but also a long training time. Since Biomimetic pattern recognition (BPR) was first proposed by Wang Shoujue, it has been applied to object recognition, face identification, face recognition, etc., and has achieved much better performance. With some adaptations, such modeling techniques can easily be used for speech recognition as well. In this paper, a real-time Mandarin speech recognition system based on BPR is proposed, which outperforms HMMs especially when only a finite number of training samples is available. The system is a small-vocabulary, speaker-independent, continuous speech recognition system. The whole system is implemented on a PC under the Windows 98/2000/XP environment with the CASSANN-II neurocomputer, and it supports a standard 16-bit sound card.
II. Introduction of Biomimetic Pattern Recognition and Multi-Weights Neuron Networks

1. Biomimetic pattern recognition

Traditional pattern recognition aims at the optimal classification of different classes of samples in the feature space, whereas BPR intends to find the optimal coverage of the samples of the same type. It is based on the Principle of Homology-Continuity: if two samples belong to the same class, the difference between them must change gradually, so a gradual-change sequence must exist between the two samples. In BPR theory, the construction of the sample subspace of each type of samples depends only on the type itself. More specifically, the subspace of a certain type of samples is constructed by analyzing the relations between the training samples of that type and by using methods for the "coverage of objects with complicated geometrical forms in the multidimensional space".

2. Multi-weights neuron and multi-weights neuron networks

A multi-weights neuron can be described as follows:

Y = f[Φ(W_1, W_2, ..., W_m, X) - θ]

where W_1, W_2, ..., W_m are m weight vectors, X is the input vector, Φ is the neuron's computation function, θ is the threshold, and f is the activation function. According to dimension theory, in the feature space R^n with X ∈ R^n, the equation Φ(W_1, W_2, ..., W_m, X) = θ defines an (n-1)-dimensional hypersurface in the n-dimensional space, determined by the weight vectors W_1, W_2, ..., W_m, and this hypersurface divides the n-dimensional space into two parts. If Φ(W_1, W_2, ..., W_m, X) = θ is a closed hypersurface, it encloses a finite subspace. According to the principle of BPR, the subspace of a certain type of samples is determined from that type of samples alone. If we can find a set of multi-weights neurons (a multi-weights neuron network) that covers all the training samples, the subspace covered by the network represents the sample subspace. When an unknown sample falls inside this subspace, it can be determined to be of the same type as the training samples. Moreover, if a new type of samples is added, it is not necessary to retrain any of the previously trained types; the training of a certain type of samples has nothing to do with the other types.
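The paper does not specify the computation function Φ. As an illustration only, the following Python sketch implements one simple possibility from the BPR literature: a two-weight neuron in which Φ(W_1, W_2, X) is the Euclidean distance from X to the line segment joining W_1 and W_2, so that Φ = θ is a closed "hyper-sausage" surface and f is a hard threshold. The function names and the value of θ are assumptions, not the authors' implementation.

    import numpy as np

    def segment_distance(x, w1, w2):
        """One choice of Phi: distance from x to the line segment [w1, w2]."""
        d = w2 - w1
        denom = max(float(np.dot(d, d)), 1e-12)             # guard against w1 == w2
        t = np.clip(np.dot(x - w1, d) / denom, 0.0, 1.0)
        return float(np.linalg.norm(x - (w1 + t * d)))      # distance to the closest point

    def mw_neuron(x, w1, w2, theta):
        """Multi-weights neuron Y = f(Phi(W1, W2, X) - theta) with a step activation:
        returns 1 if x lies inside the closed hypersurface Phi = theta, else 0."""
        return 1 if segment_distance(x, w1, w2) - theta <= 0.0 else 0

    # Toy usage in the 512-dimensional feature space used later in the paper
    rng = np.random.default_rng(0)
    w1, w2 = rng.normal(size=512), rng.normal(size=512)
    x = w1 + 0.3 * (w2 - w1) + 0.01 * rng.normal(size=512)  # a point near the segment
    print(mw_neuron(x, w1, w2, theta=0.5))                  # -> 1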
III. System Description

The speech recognition system is divided into two main blocks. The first is the signal pre-processing and speech feature extraction block; the other is the multi-weights neuron network, which performs the task of BPR.

1. Speech feature extraction

Mel-based cepstral coefficients (MFCC) are used as speech features. They are calculated as follows: A/D conversion; endpoint detection using short-time energy and zero-crossing rate (ZCR); pre-emphasis and Hamming windowing; fast Fourier transform; DCT transform. The number of features extracted for each frame is 16, and 32 frames are chosen for every utterance, so a 512-dimensional Mel-cepstral feature vector (16 × 32 numerical values) represents the pronunciation of every word.
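The paper gives only the list of steps above; the sketch below is a rough, assumed implementation of such a front end. Pre-emphasis, Hamming windowing, an FFT, a triangular Mel filterbank and a DCT yield 16 coefficients per frame, and 32 frames are kept to form the 512-dimensional vector. The frame length (400 samples), hop (160 samples) and number of Mel filters (26) are assumptions, and the energy/ZCR endpoint detection step is omitted for brevity.

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters, n_fft, sr):
        """Triangular Mel filterbank, shape (n_filters, n_fft // 2 + 1)."""
        mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, center, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, center):
                fb[i - 1, k] = (k - left) / max(center - left, 1)
            for k in range(center, right):
                fb[i - 1, k] = (right - k) / max(right - center, 1)
        return fb

    def mfcc_512(signal, sr=16000, n_ceps=16, n_frames=32,
                 frame_len=400, hop=160, n_fft=512, n_filters=26):
        """16 cepstral coefficients per frame, 32 frames -> one 512-dim vector."""
        signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])   # pre-emphasis
        fb = mel_filterbank(n_filters, n_fft, sr)
        n = np.arange(n_filters)
        dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2.0 * n_filters))
        window = np.hamming(frame_len)
        feats = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * window
            power = np.abs(np.fft.rfft(frame, n_fft)) ** 2               # power spectrum
            log_mel = np.log(fb @ power + 1e-10)                         # log Mel energies
            feats.append(dct @ log_mel)                                  # cepstrum via DCT
        feats = np.array(feats)
        idx = np.linspace(0, len(feats) - 1, n_frames).astype(int)       # keep 32 frames
        return feats[idx].flatten()                                      # shape (512,)

    # Toy usage: one second of noise as a stand-in utterance sampled at 16 kHz
    print(mfcc_512(np.random.default_rng(1).normal(size=16000)).shape)   # (512,)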
2. Multi-weights neuron network architecture

As a new general-purpose theoretical model of pattern recognition, BPR is realized here by multi-weights neuron networks. To train a certain class of samples, a multi-weights neuron subnetwork is established. The subnetwork consists of one input layer, one multi-weights neuron hidden layer and one output layer. Such a subnetwork can be considered as a mapping F: R^512 → R,

F(X) = min(Y_1, Y_2, ..., Y_m)

where Y_i is the output of the i-th multi-weights neuron, i = 1, 2, ..., m, m is the number of hidden multi-weights neurons, and X ∈ R^512 is the input vector.
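The paper does not show how the per-class subnetworks are compared at recognition time; the sketch below is one plausible reading of F(X) = min(Y_1, ..., Y_m), reusing the segment-distance Φ from the earlier sketch: each class subnetwork returns the minimum neuron output, and the classes are ranked by F(X), the smallest value giving the "first option". The class names and weights are toy values.

    import numpy as np

    def segment_distance(x, p, q):
        """Phi of a two-weight neuron: distance from x to the segment [p, q]."""
        d = q - p
        t = np.clip(np.dot(x - p, d) / max(float(np.dot(d, d)), 1e-12), 0.0, 1.0)
        return float(np.linalg.norm(x - (p + t * d)))

    class SubNetwork:
        """One subnetwork per class: F(X) = min_i Y_i(X) over its hidden neurons."""
        def __init__(self, neuron_weights):
            self.neuron_weights = neuron_weights            # list of (w1, w2) pairs

        def F(self, x):
            return min(segment_distance(x, w1, w2) for w1, w2 in self.neuron_weights)

    def recognize(x, subnetworks):
        """Rank the classes by F(X); the first entry is the 'first option'."""
        scores = {label: net.F(x) for label, net in subnetworks.items()}
        return sorted(scores, key=scores.get)

    # Toy usage with two hypothetical dish-name classes in R^512
    rng = np.random.default_rng(2)
    nets = {
        "yu_xiang_rou_si": SubNetwork([(rng.normal(size=512), rng.normal(size=512))]),
        "gong_bao_ji_ding": SubNetwork([(rng.normal(size=512), rng.normal(size=512))]),
    }
    print(recognize(rng.normal(size=512), nets)[:2])        # first two options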
IV. Training for MWN Networks

1. Basics of MWN network training

Training one multi-weights neuron subnetwork requires calculating the weights of the multi-weights neuron layer. The multi-weights neuron and the training algorithm used are those of Ref.[4]. In this algorithm, if the number of training samples of each class is N, we can use 2N neurons; in this paper, N = 30. Y_i = f(s_{i1}, s_{i2}, s_{i3}, x) is a function with multi-vector input and one scalar output.

2. Optimization method

According to the comments in IV.1, if there are many training samples the number of neurons becomes very large, which reduces the recognition speed. When several classes of samples are learned, knowledge of the class membership of the training samples is available, and we use this information in a supervised training algorithm to reduce the network scale. When training class A, we regard the training samples of the other 14 classes as class B. So there are 30 training samples in set A: {a_1, a_2, ..., a_30}, and 420 training samples in set B: {b_1, b_2, ..., b_420}. Let A^(0) = A. First, select three samples from A and build a neuron Y_1 = f(a_{k1}, a_{k2}, a_{k3}, x). Let Y_{A1_i} = f(a_{k1}, a_{k2}, a_{k3}, a_i) for i = 1, 2, ..., 30, and Y_{B1_j} = f(a_{k1}, a_{k2}, a_{k3}, b_j) for j = 1, 2, ..., 420, and let V_1 = min_j(Y_{B1_j}). We specify a value r, 0 < r < 1. If Y_{A1_i} < r·V_1, a_i is removed from set A, giving a new set A^(1). We continue in this way until the set A^(k) is empty; the training is then ended, and the subnetwork of class A has a hidden layer of k neurons.
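A possible implementation of this supervised construction is sketched below. Since the paper does not define the geometry of f(a_{k1}, a_{k2}, a_{k3}, x), the sketch uses the minimum distance from x to the edges of the triangle spanned by the three chosen samples as a stand-in, and it simply takes the first three remaining samples as seeds; the seed-selection heuristic, the stand-in for f and the default r = 0.5 are all assumptions.

    import numpy as np

    def seg_dist(x, p, q):
        d = q - p
        t = np.clip(np.dot(x - p, d) / max(float(np.dot(d, d)), 1e-12), 0.0, 1.0)
        return float(np.linalg.norm(x - (p + t * d)))

    def phi3(x, w1, w2, w3):
        """Stand-in for f(w1, w2, w3, x): distance from x to the triangle's edges."""
        return min(seg_dist(x, w1, w2), seg_dist(x, w2, w3), seg_dist(x, w1, w3))

    def train_class(A, B, r=0.5):
        """A: (30, 512) samples of the class being trained; B: (420, 512) other samples.
        Returns the hidden layer as a list of neurons, each a triple of weight vectors."""
        remaining = list(A)                               # A^(0)
        neurons = []
        while remaining:
            seeds = (remaining + remaining[:2])[:3]       # three seed samples from A^(k)
            w1, w2, w3 = seeds
            neurons.append((w1, w2, w3))
            V = min(phi3(b, w1, w2, w3) for b in B)       # closest class-B sample
            kept = [a for a in remaining if phi3(a, w1, w2, w3) >= r * V]
            if len(kept) == len(remaining):               # guard so the loop always shrinks A
                kept = remaining[3:]
            remaining = kept                              # A^(1), A^(2), ...
        return neurons

    # Toy usage: class A clustered away from the pooled samples of the 14 other classes
    rng = np.random.default_rng(3)
    A = rng.normal(size=(30, 512)) + 5.0
    B = rng.normal(size=(420, 512))
    print(len(train_class(A, B)), "hidden-layer neurons for class A")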
V. Experiment Results

A speech database consisting of 15 Chinese dish names was developed for this study. Each name is four Chinese words long, that is to say, each speech sample is a continuous string of four words, such as "yu xiang rou si", "gong bao ji ding", etc. It was organized into two sets: a training set and a test set. The speech signal is sampled at 16 kHz with 16-bit resolution.