Mircea Giurgiu * Results on Automatic Speech Recognition in Romanian




2. Speech analysis and preprocessing

2.1. Time-domain analysis

Time-domain analysis rests on the observation that the properties of the speech waveform remain roughly invariant over segments of about 10-30 ms. Signal energy provides a good measure for separating voiced speech segments from unvoiced ones; it also enables speech/nonspeech detection (Fig. 1 and 2). The zero-crossing rate is a rough estimate of the frequency content of a speech signal and has been used together with energy for speech endpointing [3]. Short-time autocorrelation has also been used to determine the fundamental frequency [1,3].
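The three time-domain measures named above can be illustrated with a minimal sketch. The function names, the 60-400 Hz pitch search range and the 8 kHz sampling rate of the synthetic tone are assumptions made for illustration only; they are not taken from the system described here.

```python
import math


def frames(signal, frame_len, hop):
    """Split a sequence of samples into (possibly overlapping) frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def autocorr_pitch(frame, fs, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) from the lag of the
    largest short-time autocorrelation value in the allowed range.
    The search range f_min..f_max is an assumed default."""
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(frame) - 1)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    return fs / best_lag


# Usage on a synthetic 30 ms frame: a 100 Hz tone sampled at 8 kHz (assumed).
fs = 8000
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(240)]
print(short_time_energy(tone), zero_crossing_rate(tone), autocorr_pitch(tone, fs))
```

A voiced frame typically shows high energy and a low zero-crossing rate, while an unvoiced frame shows the opposite pattern, which is why the two measures are combined.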

Fig. 1. - Silence/speech/silence for the word "cinci" (five).

Fig. 2. - Endpointed speech.
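Endpointing of the kind shown in Fig. 1 and 2 can be sketched with a simple energy threshold. This is a simplified illustration that uses frame energy only and leaves out the zero-crossing rate; the function name `endpoint`, the frame length and the threshold ratio are hypothetical choices, not the procedure of [3].

```python
def endpoint(signal, frame_len=80, energy_ratio=0.1):
    """Return (start, end) sample indices of the region whose frame
    energy exceeds energy_ratio times the peak frame energy, or None
    if no frame qualifies. Both parameters are assumed defaults."""
    energies = [sum(x * x for x in signal[i:i + frame_len])
                for i in range(0, len(signal) - frame_len + 1, frame_len)]
    if not energies:
        return None
    threshold = energy_ratio * max(energies)
    active = [k for k, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    # The first and last active frames delimit the speech segment.
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```

Applied to a silence/speech/silence recording, the returned indices can be used to cut the signal down to the speech portion before spectral analysis.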

2.2. Short-time spectrum analysis

Short-time spectrum analysis is the key element of speech processing. It has been carried out using both FFT and LPC analyses [3], applied to frames of 256 speech samples weighted with a Hamming window; the predictor coefficients are computed with the Levinson-Durbin recursive algorithm. From each speech segment a 12-component vector is obtained by averaging the LPC spectrum (Fig. 3 and 4). These vectors are then vector-quantized in order to compress the speech data and to reduce the computation time. The predictor can be written as

(1)

and the transfer function of the vocal tract

(2)

where a_k (k = 1, 2, ..., p) are the predictor coefficients and the magnitude of this transfer function on the unit circle is the LPC spectrum.
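In the conventional textbook formulation of linear prediction, the two relations are usually written as shown below. The symbols \hat{s}(n), G and H(z), as well as the sign convention in the denominator, are assumptions taken from common usage rather than from this text.

```latex
\hat{s}(n) = \sum_{k=1}^{p} a_k \, s(n-k)

H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}

\left| H(e^{j\omega}) \right| = \frac{G}{\left| 1 - \sum_{k=1}^{p} a_k e^{-j\omega k} \right|}
```

Under this convention the predictor estimates each sample as a weighted sum of the p preceding samples, and the third expression gives the smooth spectral envelope referred to as the LPC spectrum.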

Fig. 3. - A speech frame.

Fig. 4. - LPC and FFT spectra.
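The LPC analysis chain described in this section (Hamming window, autocorrelation, Levinson-Durbin recursion, LPC spectrum) can be sketched as follows. The prediction order, the number of spectrum points, the gain taken as the square root of the residual energy, and all function names are assumptions for illustration; the sign convention is the common one in which the predictor is a positive weighted sum of past samples.

```python
import cmath
import math


def hamming(n_samples):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def autocorrelation(x, max_lag):
    """Short-time autocorrelation values r[0..max_lag]."""
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.
    Returns (a, err): a[k-1] is the coefficient a_k of the predictor
    s_hat(n) = sum_k a_k * s(n-k); err is the residual energy."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def lpc_spectrum(a, gain, n_points=128):
    """Magnitude gain / |1 - sum_k a_k e^{-jwk}| at n_points
    frequencies evenly spaced in [0, pi)."""
    spectrum = []
    for m in range(n_points):
        w = math.pi * m / n_points
        denom = 1.0 - sum(ak * cmath.exp(-1j * w * (k + 1))
                          for k, ak in enumerate(a))
        spectrum.append(gain / abs(denom))
    return spectrum


def analyze_frame(frame, order=12):
    """Window one frame and return (coefficients, gain).
    The default order of 12 is an assumed value."""
    window = hamming(len(frame))
    windowed = [s * w for s, w in zip(frame, window)]
    r = autocorrelation(windowed, order)
    a, err = levinson_durbin(r, order)
    return a, math.sqrt(max(err, 0.0))
```

For a 256-sample frame, `analyze_frame(frame)` followed by `lpc_spectrum(a, gain)` yields a smooth envelope that can be compared with the FFT magnitude of the same windowed frame, as in Fig. 4.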


