Results on Automatic Speech Recognition in Romanian

Mircea Giurgiu


1. Introduction

Research in speech processing has recently witnessed remarkable progress. Such progress has enabled the wide use of speech technologies in many fields (speech-accessed databases, secure-access systems, data and music compression, interactive text-to-speech systems, language translation, etc. [1]). Speech recognition is the process of automatically extracting and determining the linguistic information conveyed by a speech wave, using computers or electronic circuits. The recognition of isolated spoken words by machine has long been accepted as a challenging and important problem and has been viewed as a benchmark for Automatic Speech Recognition (ASR) [2]. In particular, one of the most difficult problems in speech technology is the speaker-independent recognition of isolated digits. This subset of sounds was chosen because it incorporates sufficient variability to be a good test for a recognition system, whilst avoiding the complications associated with time-varying data.

In typical speech recognition systems, the input speech is compared with stored reference templates of phonemes or words, as in the Dynamic Time Warping (DTW) technique, and the most similar reference template is selected as the candidate phoneme or word [1]. The short-time spectrum is therefore usually extracted at short intervals and compared with the reference templates. The comparison is performed in terms of spectral distance: the speech is first processed with an analysis technique such as the Fast Fourier Transform (FFT) or Linear Predictive Coding (LPC) in order to obtain the short-time spectrum. Chapter 2 is dedicated to a general presentation of such speech analysis results - time-domain and spectral analysis, and especially Vector Quantization (VQ), as a technique for bit-rate reduction. DTW recognition (sub-chapter 3.1) solves the problem of the different durations of spoken words, and when used together with VQ the recognition algorithm speeds up [3].
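The template-matching idea above can be sketched as follows. The Euclidean frame distance and the three classical path moves are illustrative choices only, not necessarily those used in the system described here:

```python
import numpy as np

def dtw_distance(ref, test):
    """DTW distance between two feature sequences (frames x coefficients),
    e.g. short-time LPC or FFT spectra of two utterances.

    Returns the accumulated spectral distance along the optimal
    (monotonic) alignment path, which tolerates different durations.
    """
    n, m = len(ref), len(test)
    # Local spectral distances between every pair of frames
    # (Euclidean here, purely for illustration).
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    acc = np.full((n, m), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                 # stretch ref
                acc[i, j - 1] if j > 0 else np.inf,                 # stretch test
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # diagonal match
            )
            acc[i, j] = local[i, j] + best_prev
    return acc[-1, -1]
```

Recognition then amounts to computing this distance against every stored template and selecting the template with the smallest value.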

To avoid the word-length problem, in sub-chapter 3.2 we focus on the statistical modeling of speech production and deal with discrete Hidden Markov Models (HMM) for speech recognition. The parameter set of the HMM has been estimated from the acoustic signals with the Baum-Welch algorithm, and VQ has been used to obtain the sequence of speech observations.
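Once the models are trained, each word HMM is scored on the VQ symbol sequence with the forward algorithm and the best-scoring model wins. A minimal sketch of that scoring step, with illustrative names and shapes (not the paper's notation), and with per-frame scaling to avoid numerical underflow:

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Log-likelihood log P(O | model) of a discrete HMM.

    pi  : initial state probabilities, shape (N,)
    A   : state transition matrix, shape (N, N)
    B   : emission probabilities over VQ codebook symbols, shape (N, M)
    obs : sequence of codebook indices produced by the vector quantizer
    """
    alpha = pi * B[:, obs[0]]          # forward variables at t = 0
    scale = alpha.sum()
    log_lik = np.log(scale)
    alpha /= scale
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate, then weight by emission
        scale = alpha.sum()
        log_lik += np.log(scale)       # product of scales = total probability
        alpha /= scale
    return log_lik

# Recognition: evaluate every word model on the same observation
# sequence and pick the one with the highest log-likelihood.
```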

The motivation for using an Artificial Neural Network (ANN) structured as a Multilayer Perceptron (MLP) (sub-chapter 3.3) is to explore the ability of ANNs to recognize the spectral patterns of words and to extract invariant features from highly variable data, and further to test their performance on the specific task of word recognition. We propose an MLP structure with one hidden layer; the patterns presented at the input of the Neural Network (NN) are computed by Linear Predictive Coding (LPC) analysis over 25.6 ms frames of the words. At this point some problems regarding the different word lengths could arise. To avoid them, the idea is to segment the input speech into a sequence of stationary segments, each of them represented by its centroid [4,5]. In this way, the sequence of segment centroids feeds the MLP and the amount of data is reduced. A fixed number of segments suits the fixed input dimensionality of the MLP. Results will be presented on CCS and its performance, and on ASR using MLP and SOFM with or without CCS, with or without VQ, showing that a high recognition rate and less computational time are obtained when CCS is included in the recognition process. The effects of the number of processing units in the hidden layer, the learning factor, the activation function and the momentum have also been monitored. Finally, some conclusions on the recognition performance of the mentioned techniques are given.
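The centroid idea can be illustrated with a simple uniform segmentation: a variable-length sequence of LPC frames is split into a fixed number of contiguous segments, and each segment is replaced by its mean vector. This is only a sketch of the principle; the CCS method in this paper may place the segment boundaries differently (e.g. at spectral change points rather than uniformly):

```python
import numpy as np

def segment_centroids(frames, n_segments):
    """Collapse a (T x d) frame sequence into (n_segments x d) centroids.

    Each of the n_segments contiguous slices of the sequence is
    represented by its mean feature vector, giving a fixed-size
    input pattern for the MLP regardless of the word's duration.
    Requires T >= n_segments so that no segment is empty.
    """
    frames = np.asarray(frames, dtype=float)
    # Uniform segment boundaries (an assumption of this sketch).
    bounds = np.linspace(0, len(frames), n_segments + 1).astype(int)
    return np.stack([frames[a:b].mean(axis=0)
                     for a, b in zip(bounds[:-1], bounds[1:])])
```

Whatever the word length, the MLP then always receives `n_segments * d` input values, matching its fixed input layer.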



