Mircea Giurgiu * Results on Automatic Speech Recognition in Romanian




Table 8. Recognition performance [%] for each Romanian spoken digit versus the number of reference vectors (recognition with training set A, and with testing set T)
WORD   SET       20      30      60      90     150     200     300     400
1      A        100     100     100     100     100     100     100     100
       T        100     100     100     100     100     100     100     100
2      A      76.92   76.92   84.62     100     100     100     100     100
       T      66.67   66.67   66.67   75.11   66.67   66.67   66.67   66.67
3      A      84.62   84.62     100     100     100     100     100     100
       T      76.92   76.92   92.31   84.62   92.31   92.31   92.31   92.31
4      A        100     100     100     100     100     100     100     100
       T        100     100   75.00   75.00   75.00   75.00   75.00   75.00
5      A        100     100     100     100     100     100     100     100
       T        100     100   92.31     100     100     100     100     100
6      A      61.54   61.54     100     100     100     100     100     100
       T      53.85   53.85     100     100   92.31   92.31   92.31   92.31
7      A      61.54   61.54   76.92   76.92     100     100     100     100
       T      57.14   57.14   50.00   50.00   71.43   71.43   71.43   71.43
8      A      46.14   46.15   76.92   84.62   84.62   84.62   84.62   84.62
       T      69.23   69.23   76.92   69.23   69.23   69.23   69.23   69.23
9      A        100     100     100     100     100     100     100     100
       T      83.33   83.33   91.67   91.67   91.67   91.67   91.67   91.67
10     A      91.67   91.67   91.67   91.67   91.67   91.67   91.67   91.67
       T      61.54   61.54   84.62   84.62   84.62   84.62   84.62   84.62
TOTAL  A      81.89   81.89   92.91   95.28   97.64   97.64   97.64   97.64
       T      76.38   76.38   82.68   83.46   84.25   84.25   84.25   84.25
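
The TOTAL rows of Table 8 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the numbers are copied from the table, while the function names plateau_size and gaps are hypothetical helpers introduced here and are not part of the software used in the experiments.

```python
# Overall recognition rates [%] copied from the TOTAL rows of Table 8.
SIZES = [20, 30, 60, 90, 150, 200, 300, 400]
TOTAL_A = [81.89, 81.89, 92.91, 95.28, 97.64, 97.64, 97.64, 97.64]
TOTAL_T = [76.38, 76.38, 82.68, 83.46, 84.25, 84.25, 84.25, 84.25]


def plateau_size(sizes, rates):
    """Smallest number of reference vectors that already reaches the best rate."""
    best = max(rates)
    return next(s for s, r in zip(sizes, rates) if r == best)


def gaps(train, test):
    """Training-minus-test difference for each column, in percentage points."""
    return [round(a - t, 2) for a, t in zip(train, test)]


if __name__ == "__main__":
    print(plateau_size(SIZES, TOTAL_A), plateau_size(SIZES, TOTAL_T))
    print(gaps(TOTAL_A, TOTAL_T))
```

Both rows reach their best value at 150 reference vectors, while the difference between the training and testing rates widens from 5.51 points at 20 vectors to 13.39 points from 150 vectors onward.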



4. Conclusions

The experiments reported here concern practical results on Romanian spoken digit recognition using three different approaches: Dynamic Time Warping (DTW), Hidden Markov Models (HMM) and Artificial Neural Networks (ANN). They were carried out over the last three years in the Laboratory of Speech Processing of the Department of Communications, Technical University of Cluj-Napoca, using software packages developed entirely in our laboratory.

The DTW approach was used mainly to gain a better understanding of the ASR problem, but it has proved to be an efficient technique for small vocabularies, where it offers a very good temporal description. Recognition is accurate when the average length of the references is approximately the same as that of the input words. We proposed a DTW system that incorporates vector quantization (VQ) of the LPC patterns in order to reduce the amount of computation. Unfortunately, the algorithms cannot model long-term correlation in the speech wave, and adapting the system to a new speaker is time-consuming.
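
For readers unfamiliar with template matching, the following is a minimal sketch of DTW-based isolated word recognition. It assumes that each word is a sequence of equal-dimension feature vectors, uses a plain Euclidean local distance and the basic symmetric step pattern, and omits the VQ stage; the function names and the word labels in the usage note are hypothetical, and the sketch should not be read as the system evaluated in this work.

```python
import math


def dtw_distance(ref, inp):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(ref), len(inp)
    # cost[i][j] = best accumulated cost of aligning ref[:i] with inp[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], inp[j - 1])  # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # stay on input frame j
                                 cost[i][j - 1],      # stay on reference frame i
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(templates, inp):
    """Return the label of the reference template with the lowest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(templates[label], inp))
```

A call such as recognize({"unu": ref_1, "doi": ref_2}, frames) returns the label of the closest template. The accumulated cost is not normalized by path length in this sketch, which is one reason why references and inputs of similar length compare more reliably.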

Because they are based on a statistical model of the production of each word, with a well-defined mathematical description, HMMs are able to recognize words in a speaker-independent manner. The vocabulary can be extended simply by adding a new HMM to the database. The main problems of HMMs are the estimation of parameters from insufficient training data, the scaling of probabilities, and the computing time required for training.
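
The scaling problem arises because the forward probabilities shrink geometrically with the number of frames and eventually underflow. A common remedy, sketched below for a discrete-observation HMM, is to normalize the forward variables at every frame and accumulate the logarithms of the normalization factors. The parameter names (pi, A, B, obs) are illustrative assumptions, and the sketch is not the implementation used in our packages.

```python
import math


def scaled_forward_log_likelihood(pi, A, B, obs):
    """log P(obs | model) for a discrete HMM, using per-frame scaling.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[i][k] : probability of emitting symbol k in state i
    obs     : non-empty list of symbol indices
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Forward recursion applied to the already scaled alpha.
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        c = sum(alpha)            # scaling factor for frame t
        if c == 0.0:
            return -math.inf      # sequence impossible under this model
        alpha = [a / c for a in alpha]
        log_lik += math.log(c)    # product of factors equals P(obs | model)
    return log_lik
```

With one such model per word, recognition amounts to choosing the word whose model gives the highest log-likelihood, which is why extending the vocabulary only requires adding a model.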

The application of ANNs to speech recognition demonstrates the ability of MLP structures to recognize speech patterns. To cope with the variable duration of spoken words at the input of the MLP, we proposed and tested the spectral segmentation of the speech sequence. This is important for evaluating the stationary segments of the speech wave, with a possible impact on the phonetics of Romanian digits. Moreover, with segmentation the recognition performance increases by 6% and the training time decreases, compared with the standard MLP approach.

The influence of the ANN parameters on the recognition rate has also been investigated. We studied the influence of the number of reference vectors in the SOFM: increasing this number reduces the recognition error, but there is no further gain beyond 150 vectors. The results demonstrate the power of the SOFM when applied to speech classification and the applicability of Constrained Clustering Segmentation (CCS) to segmenting speech into significant acoustic segments. Training is faster than with the backpropagation algorithm used for the MLP and is unsupervised, but a labeling process is needed for each entry in the training set. The experiments presented in this framework show that the SOFM recognizes speech patterns better, and faster, when CCS is used.
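
The two-stage nature of the SOFM approach (unsupervised training followed by a labeling pass) can be illustrated with a minimal sketch. It assumes a one-dimensional map, a rectangular neighbourhood, linearly decaying learning rate and radius, and majority-vote labeling; all of these choices, the function names and the parameter defaults are assumptions made for illustration, and the CCS step is not shown.

```python
import math
import random
from collections import Counter, defaultdict


def nearest(codebook, x):
    """Index of the reference vector closest to x."""
    return min(range(len(codebook)), key=lambda k: math.dist(codebook[k], x))


def train_sofm(data, n_units, epochs=20, lr0=0.5, seed=0):
    """Unsupervised training of a 1-D map with n_units reference vectors."""
    rng = random.Random(seed)
    codebook = [list(rng.choice(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac                             # decaying learning rate
        radius = max(1, round(n_units / 2 * frac))  # shrinking neighbourhood
        for x in rng.sample(data, len(data)):
            win = nearest(codebook, x)
            lo, hi = max(0, win - radius), min(n_units, win + radius + 1)
            for k in range(lo, hi):
                # Move the winner and its neighbours towards the input.
                codebook[k] = [w + lr * (xi - w) for w, xi in zip(codebook[k], x)]
    return codebook


def label_units(codebook, data, labels):
    """Labeling pass: each unit takes the majority label of the inputs it wins."""
    votes = defaultdict(Counter)
    for x, y in zip(data, labels):
        votes[nearest(codebook, x)][y] += 1
    return {k: c.most_common(1)[0][0] for k, c in votes.items()}


def classify(codebook, unit_labels, x):
    """Label of the nearest reference vector that received a label."""
    best = min(unit_labels, key=lambda k: math.dist(codebook[k], x))
    return unit_labels[best]
```

The class labels are used only in label_units; train_sofm never sees them, which is what makes the training itself unsupervised. The parameter n_units plays the role of the number of reference vectors varied in the experiments.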

Building on the experience accumulated in ASR for isolated words, we would like to continue this work towards continuous speech recognition and to develop a practical application, possibly using hardware platforms with dedicated signal processors.

Acknowledgments. The author would like to thank Prof. Toderean Gavril from the Technical University of Cluj-Napoca for supervising the PhD work and Dr. Dan Tufiş for his continuous efforts to highlight the research potential in language technology. Special thanks are also due to Prof. Antonio Rubio and Prof. Antonio Peinado from Granada University, Spain, for their expert and valuable points of view during the author's visits to their department.


