Romanian Language Technology

Corneliu Burileanu & al * Text-to-Speech Synthesis for Romanian Language

For the pitch detection we used a parallel processing approach based on an algorithm first proposed by Gold and then modified by Gold and Rabiner [19]. The basic principles of the algorithm are the following:

the signal is processed in such a manner as to retain the periodicity of the original signal while discarding the features that are irrelevant to pitch detection;
the period of the resulting impulse train must be very simple to estimate;
the processing time must be reasonable.

The steps in pitch detection are:

The speech signal is applied to the input of a lowpass filter with a cutoff of about 800 Hz. Special attention was paid to this filtering because it is crucial both for the accuracy of pitch detection and for the processing time. Figure 6 shows the response (magnitude in dB) of the FIR filter we used.

Fig. 6. - The FIR filter response
After filtering, six impulse trains are derived from the local maxima and minima of the signal: a train with impulses equal to the amplitude of each maximum, at the location of each peak; a train with impulses equal to the difference between the peak amplitude and the previous minimum amplitude, at the location of each maximum; a train with impulses equal to the difference between two consecutive peak amplitudes, at the location of each peak; and three corresponding trains built from the minima, starting with the negative of the amplitude of each minimum, at the location of each valley.
Each of the six impulse trains is processed by its own pitch period estimator, which uses a "peak detecting exponential window" algorithm.
The six current estimates are combined with the two most recent estimates from each detector. These estimates are then compared, and the value with the most occurrences is declared the pitch period at this time. If there is an obvious lack of consistency among the estimates, the frame is declared "unvoiced" and the value of the pitch is set to zero. Figure 7 shows the contour of the pitch (in samples) during the utterance: Detecția perioadei fundamentale folosind metoda Rabiner ("Pitch detection using the Rabiner method"). The most important remaining problem is the voiced/unvoiced frame detection.

Fig. 7. - Pitch contour for a sentence in Romanian language.
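
The steps listed above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the lowpass filter is omitted, the fifth and sixth impulse trains are assumed to mirror the second and third (using minima instead of maxima), and the blanking interval, decay rate, tolerance and agreement threshold (blank, decay, tol, min_agree) are hypothetical fixed values chosen only for illustration. The vote is also taken once over a whole signal segment rather than frame by frame.

```python
import math


def impulse_trains(x):
    """Build six impulse trains, as lists of (index, value), from the extrema of x."""
    trains = [[] for _ in range(6)]
    last_peak = 0.0    # amplitude of the previous local maximum
    last_valley = 0.0  # amplitude of the previous local minimum
    for n in range(1, len(x) - 1):
        if x[n - 1] < x[n] >= x[n + 1]:        # local maximum
            p = x[n]
            trains[0].append((n, max(p, 0.0)))                # peak amplitude
            trains[1].append((n, max(p - last_valley, 0.0)))  # peak minus previous minimum
            trains[2].append((n, max(p - last_peak, 0.0)))    # peak minus previous peak
            last_peak = p
        elif x[n - 1] > x[n] <= x[n + 1]:      # local minimum
            v = x[n]
            trains[3].append((n, max(-v, 0.0)))               # negative of the minimum
            trains[4].append((n, max(last_peak - v, 0.0)))    # assumed: previous peak minus minimum
            trains[5].append((n, max(last_valley - v, 0.0)))  # assumed: previous minimum minus minimum
            last_valley = v
    return trains


def detected_periods(train, blank=20, decay=0.02):
    """Simplified peak-detecting exponential window with fixed (assumed) parameters.

    After each detection, impulses are ignored for `blank` samples; afterwards
    the threshold decays exponentially from the amplitude of the last detection.
    """
    periods = []
    last_n, level = None, 0.0
    for n, v in train:
        if last_n is None:
            if v > 0.0:
                last_n, level = n, v           # first detection starts the window
            continue
        dt = n - last_n
        if dt <= blank:
            continue                           # still inside the blanking interval
        if v > level * math.exp(-decay * (dt - blank)):
            periods.append(dt)                 # spacing between detections, in samples
            last_n, level = n, v
    return periods


def vote(per_detector, tol=2, min_agree=6):
    """Pick the period supported by most candidates; return 0 for 'unvoiced'."""
    # Current estimate plus the two most recent ones from each detector.
    candidates = [p for periods in per_detector for p in periods[-3:]]
    best, best_count = 0, 0
    for c in candidates:
        count = sum(1 for d in candidates if abs(d - c) <= tol)
        if count > best_count:
            best, best_count = c, count
    return best if best_count >= min_agree else 0


def pitch_period(x):
    """Pitch period of x in samples, or 0 if the estimates are inconsistent."""
    return vote([detected_periods(t) for t in impulse_trains(x)])


# Synthetic voiced-like test signal with a period of 80 samples.
signal = [math.sin(2 * math.pi * n / 80) + 0.5 * math.sin(4 * math.pi * n / 80)
          for n in range(800)]
print(pitch_period(signal))
```

For this synthetic signal the sketch is expected to report a period of 80 samples, while an all-zero input yields 0, matching the "unvoiced" convention described above. On real speech the fixed parameters would have to be tuned, or adapted to a running estimate of the period, which this sketch does not attempt.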

5. Conclusions

General considerations concerning a text-to-speech system for the Romanian language were presented. First, we introduced the terms and definitions related to the automatic synthesis of speech. The most important techniques for implementing speech synthesis were presented in a tutorial manner. Then, a general approach to a system for text-to-speech synthesis was detailed. We described two different approaches to this task:

a text-to-speech system based on diphone concatenation and a neural network approach;
a system based on syllable concatenation, with the syllables used either in their stored form or through an LPC-based speech synthesizer.
