Romanian Language Technology

Mircea Giurgiu * Results on Automatic Speech Recognition in Romanian

2. Speech analysis and preprocessing

2.1. Time-domain analysis

Time-domain analysis is based on the observation that the properties of the speech waveform remain roughly invariant over segments of about 10-30 ms. Signal energy provides a good measure for separating voiced speech segments from unvoiced ones; it also enables speech/nonspeech detection (Fig. 1 and 2). The zero-crossing rate is a gross estimate of the frequency content of a speech signal, and it has been used together with energy for speech endpointing [3]. Short-time autocorrelation has also been used to determine the fundamental frequency [1,3].
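The short-time energy and zero-crossing measures described above can be illustrated with a minimal sketch. It is not the author's implementation: the 8 kHz sampling rate, the 20 ms frames with 10 ms hop, the synthetic test signal, and the peak-relative energy threshold in the hypothetical helper `find_endpoints` are all assumptions made for illustration.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames.astype(float) ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def find_endpoints(energy, threshold_ratio=0.05):
    """Return (first, last) frame index whose energy exceeds a fraction
    of the peak energy, or None if no frame does (assumed rule)."""
    active = np.flatnonzero(energy > threshold_ratio * energy.max())
    if active.size == 0:
        return None
    return int(active[0]), int(active[-1])

# Synthetic silence/"speech"/silence signal (assumed 8 kHz sampling rate).
fs = 8000
rng = np.random.default_rng(0)
silence = 0.001 * rng.standard_normal(int(0.2 * fs))
t = np.arange(int(0.3 * fs)) / fs
tone = 0.5 * np.sin(2 * np.pi * 200 * t)  # stand-in for a voiced segment
signal = np.concatenate([silence, tone, silence])

frames = frame_signal(signal, frame_len=160, hop=80)  # 20 ms frames, 10 ms hop
energy = short_time_energy(frames)
zcr = zero_crossing_rate(frames)
start, end = find_endpoints(energy)
```

In this toy signal the low-level noise frames have a much higher zero-crossing rate than the 200 Hz segment, which is why the two measures are usually combined rather than used alone.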

Fig. 1. - Silence/speech/silence for the word "cinci" (five).

Fig. 2. - Endpointed speech.
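The use of short-time autocorrelation for fundamental-frequency estimation, mentioned in Section 2.1, can be sketched as follows. This is a generic peak-picking approach under stated assumptions, not necessarily the method of [1,3]: the search range of 60-400 Hz, the 30 ms frame, and the function name `estimate_f0_autocorr` are illustrative choices.

```python
import numpy as np

def estimate_f0_autocorr(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency of one frame from the lag of the
    largest autocorrelation peak inside an assumed pitch range."""
    frame = frame - np.mean(frame)
    # Autocorrelation for non-negative lags (index 0 is lag 0).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(r) - 1)
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / lag

# Synthetic voiced-like frame: 125 Hz fundamental plus its second harmonic.
fs = 8000                      # assumed sampling rate
n = np.arange(240)             # 30 ms at 8 kHz
frame = np.sin(2 * np.pi * 125 * n / fs) + 0.5 * np.sin(2 * np.pi * 250 * n / fs)
f0 = estimate_f0_autocorr(frame, fs)  # expected to be close to 125 Hz
```

The resolution is limited to integer lags, so the estimate is only approximate; interpolation around the peak would refine it.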

2.2. Short-time spectrum analysis

Short-time spectrum analysis is the key feature of speech processing. It has been accomplished by using both FFT and LPC analyses [3], applied to frames of 256 speech samples weighted with a Hamming window. The predictor coefficients are computed with the Levinson-Durbin recursive algorithm. From each segment of speech a 12-component vector is obtained by averaging the LPC spectrum (Fig. 3 and 4). These vectors are then vector-quantized in order to compress the speech data and to reduce the computation time. The predictor can be written as

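$$\hat{s}(n) = \sum_{k=1}^{p} a_k \, s(n-k) \tag{1}$$

and the transfer function of the vocal tract

$$H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{2}$$

where $a_k$ ($k = 1, 2, \ldots, p$) are the predictor coefficients and $|H(e^{j\omega})|$ is the LPC spectrum.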
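The LPC analysis described in this section (Hamming-windowed frame of 256 samples, autocorrelation, Levinson-Durbin recursion, LPC spectrum) can be sketched as below. Several things here are assumptions rather than details from the paper: the predictor order of 12, the sign convention s_hat(n) = sum of a_k s(n-k), the unit-gain all-pole model, and the synthetic second-order test signal.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns (a, err): a[k-1] is a_k in s_hat(n) = sum_k a_k * s(n - k),
    err is the final prediction-error energy.
    """
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k = acc / err
        a_prev = a[:i].copy()
        a[:i] = a_prev - k * a_prev[::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def lpc_spectrum(a, n_fft=256):
    """Magnitude of 1 / (1 - sum_k a_k e^{-jwk}) on n_fft // 2 + 1 frequency bins."""
    A = np.fft.rfft(np.concatenate(([1.0], -a)), n_fft)
    return 1.0 / np.abs(A)

# Synthetic test signal: a stable second-order resonance driven by noise.
rng = np.random.default_rng(0)
excitation = rng.standard_normal(456)
s = np.zeros_like(excitation)
for n in range(2, len(s)):
    s[n] = 1.3 * s[n - 1] - 0.8 * s[n - 2] + excitation[n]

frame = s[-256:]                       # one 256-sample frame
windowed = frame * np.hamming(256)     # Hamming window
order = 12                             # assumed predictor order
# Autocorrelation for lags 0..order (lag 0 sits at index 255 of the full output).
r = np.correlate(windowed, windowed, mode="full")[255:255 + order + 1]

a, err = levinson_durbin(r, order)
spectrum = lpc_spectrum(a)                         # smooth LPC envelope
fft_spectrum = np.abs(np.fft.rfft(windowed, 256))  # raw FFT magnitude for comparison
```

To overlay the two curves on a common scale, the LPC envelope is usually multiplied by a gain term such as the square root of `err`; that scaling is left out here for brevity.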

Fig. 3. - A speech frame.

Fig. 4. - LPC and FFT spectra.
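The reduction of each LPC spectrum to a 12-component vector and its subsequent vector quantization might look like the sketch below. The paper does not say how the averaging is done or how the codebook is built, so the 12 equal-width frequency bands, the Euclidean distance, the random stand-in spectra, and the two-entry hypothetical codebook are all assumptions.

```python
import numpy as np

def band_average(spectrum, n_bands=12):
    """Average a magnitude spectrum over n_bands contiguous, roughly
    equal-width frequency bands (assumed averaging scheme)."""
    bands = np.array_split(spectrum, n_bands)
    return np.array([band.mean() for band in bands])

def quantize(vectors, codebook):
    """Return, for each vector, the index of the nearest codeword
    under squared Euclidean distance."""
    diff = vectors[:, None, :] - codebook[None, :, :]
    distances = np.sum(diff ** 2, axis=2)  # shape (n_vectors, n_codewords)
    return np.argmin(distances, axis=1)

# Stand-in data: five random "spectra" of 129 bins (256-point analysis).
rng = np.random.default_rng(0)
spectra = rng.random((5, 129))
features = np.array([band_average(sp) for sp in spectra])  # shape (5, 12)

# Hypothetical codebook made of two of the feature vectors themselves.
codebook = features[[0, 3]]
indices = quantize(features, codebook)
```

Each frame is then represented by a single codebook index instead of 12 real numbers, which is where the compression and the saving in computation time come from.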
