WORD | SPEAKER 1 | SPEAKER 2 | SPEAKER 3 | OUTSIDE SPEAKER |
UNU | 100.00% | 100.00% | 100.00% | 100.00% |
DOI | 80.00% | 100.00% | 100.00% | 80.00% |
TREI | 80.00% | 100.00% | 100.00% | 100.00% |
PATRU | 80.00% | 100.00% | 100.00% | 100.00% |
CINCI | 100.00% | 80.00% | 80.00% | 60.00% |
ȘASE | 100.00% | 100.00% | 100.00% | 80.00% |
ȘAPTE | 100.00% | 100.00% | 100.00% | 80.00% |
OPT | 80.00% | 60.00% | 100.00% | 80.00% |
NOUĂ | 100.00% | 100.00% | 100.00% | 100.00% |
ZECE | 100.00% | 100.00% | 100.00% | 100.00% |
TOTAL | 92.00% | 94.00% | 98.00% | 88.00% |
3.3. Experiments using Artificial Neural Networks for automatic speech recognition
There are three different neural-network-based approaches to speech recognition: (a) feed-forward networks, such as the MLP, which transform a set of input signals into a set of output signals; (b) feedback networks, in which the input information defines a state of a feedback system and, after a sequence of transitions, the asymptotic final state is identified as the outcome of the computation; and (c) self-organizing feature maps (SOFM), in which mutual lateral interconnections among neighboring cells in a neural network develop specific detectors for different signal patterns [3,9].
3.3.1. Experiments on speech recognition with Multilayer Perceptrons
In this case we focus on feed-forward networks known as Multilayer Perceptrons (MLP). The MLP structure we propose has ni neurons in the input layer, nh neurons in the hidden layer, and no neurons in the output layer. The network is fed at the input with spectral information extracted from the speech wave and produces a coded output for the recognized word. The appropriate number of neurons in each layer has been studied. For the word recognition task, all training examples, consisting of spectral patterns extracted from words collected from different speakers, are presented cyclically until the error over the entire set is acceptably low. After training, an MLP is able to respond properly to input patterns not presented during the learning process [9].
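As a minimal illustrative sketch (our illustration, not the original implementation; the random weight initialization is an assumption, while the 600/50/10 layer sizes follow Table 6 below), the forward pass of such an ni/nh/no MLP can be expressed in Python with NumPy:

import numpy as np

def sigmoid(x):
    # Logistic activation f(x) = 1 / (1 + e^(-x)), the function selected in the experiments.
    return 1.0 / (1.0 + np.exp(-x))

class MLP:
    """One-hidden-layer perceptron: ni inputs -> nh hidden -> no outputs."""
    def __init__(self, ni, nh, no, seed=0):
        rng = np.random.default_rng(seed)
        # Small random initial weights (an illustrative choice, not from the paper).
        self.W1 = rng.normal(0.0, 0.1, (nh, ni))
        self.b1 = np.zeros(nh)
        self.W2 = rng.normal(0.0, 0.1, (no, nh))
        self.b2 = np.zeros(no)

    def forward(self, x):
        # x: fixed-length spectral feature vector extracted from one word.
        h = sigmoid(self.W1 @ x + self.b1)   # hidden-layer activations
        y = sigmoid(self.W2 @ h + self.b2)   # one output unit per vocabulary word
        return h, y

# Example: the 600/50/10 topology of Table 6
# (600 LPC spectral inputs, 50 hidden units, 10 digit outputs).
net = MLP(ni=600, nh=50, no=10)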
Applying the MLP to the automatic word recognition task means defining an output layer with a number of units equal to the size of the vocabulary (10 neurons, one for each of the digits 1, 2, ..., 10 uttered in Romanian). One hidden layer with a variable number of processing units has been used, and the input layer is fed either with the LPC spectral representation or with the VQ codewords of each digit. When MLPs are used for speech recognition, an immediate problem arises: the variable length of speech patterns does not suit the fixed dimension of the MLP input. In this particular case the input layer has 600 or 50 processing units, each corresponding to one spectral band derived from the LPC spectrum or to a VQLPC codeword, respectively. Different tests have been carried out in order to find the activation function that gives the best recognition rate with the MLP; finally, the sigmoid function was selected for all experiments, because it encodes the speech spectral information better (Table 6). Training has been performed with the Back-Propagation algorithm, and the entire speech database has been divided into two parts (one for training and one for testing), each containing 200 patterns [3,10].
Table 6
MLP input/hidden/output (spectral info) | RECOGNITION [%] f(x)=1/(1+e^-x) | RECOGNITION [%] f(x)=tanh(x) |
600/40/10 (LPC) | 84.3 | 20.0 |
600/50/10 (LPC) | 82.5 | 47.5 |
50/140/10 (VQLPC) | 81.5 | 48.0 |
50/120/10 (VQLPC) | 75.0 | 42.5 |
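For concreteness, a sketch of the per-pattern Back-Propagation loop for the sigmoid MLP above follows; the learning rate, epoch limit, and stopping threshold are illustrative assumptions, not values reported here. It mirrors the cyclic presentation of the training patterns until the error over the whole set is acceptably low:

def train(net, X, T, lr=0.5, epochs=1000, target_error=0.01):
    """Back-Propagation (squared-error loss) for the sigmoid MLP sketched above.

    X: (n_patterns, ni) spectral patterns; T: (n_patterns, no) coded word outputs.
    """
    for _ in range(epochs):
        total_error = 0.0
        for x, t in zip(X, T):
            h, y = net.forward(x)
            e = y - t
            total_error += float(e @ e)
            # Sigmoid derivative: f'(a) = f(a) * (1 - f(a)).
            delta_out = e * y * (1.0 - y)
            delta_hid = (net.W2.T @ delta_out) * h * (1.0 - h)
            # Per-pattern (stochastic) weight updates.
            net.W2 -= lr * np.outer(delta_out, h)
            net.b2 -= lr * delta_out
            net.W1 -= lr * np.outer(delta_hid, x)
            net.b1 -= lr * delta_hid
        if total_error < target_error:
            break   # error over the entire set is acceptably low
    return total_error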
Network | Inputs | Hidden | Outputs |
net40sw.nnw | 600 LPC | 40 | 10 |
net50sw.nnw | 600 LPC | 50 | 10 |
nvq120t.nnw | 50 VQLPC | 120 | 10 |
nvq140t.nnw | 50 VQLPC | 140 | 10 |
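To tie such trained networks to recognition rates like those in Table 6, a small evaluation helper (again illustrative, not taken from the paper) can score the 200-pattern test set by picking the most active output unit:

def recognition_rate(net, X_test, T_test):
    # A word counts as recognized when the most active output unit
    # matches the unit that codes the spoken digit.
    correct = sum(
        int(np.argmax(net.forward(x)[1]) == np.argmax(t))
        for x, t in zip(X_test, T_test)
    )
    return 100.0 * correct / len(X_test)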