Corneliu Burileanu & al *
Text-to-Speech Synthesis for Romanian Language
For pitch detection we used
a parallel-processing approach based on an algorithm first proposed
by Gold and later modified by Gold and Rabiner
[19]. The basic
principles of the algorithm are the following:
- the signal is processed so as
to retain the periodicity of the original signal
while discarding the features that are irrelevant for pitch detection;
- the period of the resulting
impulse train must be very simple to estimate;
- the processing time must be
reasonable.
The steps in pitch detection are:
- The speech signal is applied to the input of a lowpass filter
with a cutoff of about 800 Hz. Special attention was paid to this
filtering because it is crucial both for the accuracy of pitch
detection and for the processing time. Figure 6 shows the
response (magnitude in dB) of the FIR filter we used.
Fig. 6. - The FIR filter response
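As a rough illustration of this filtering step, the sketch below designs a windowed-sinc lowpass FIR filter with an 800 Hz cutoff and applies it by direct convolution. The sampling rate (8000 Hz), the filter length (101 taps), the Hamming window, and all function names are assumptions made for the example; the text does not state how the filter shown in Fig. 6 was actually designed.

```python
import math

def design_lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Windowed-sinc lowpass FIR (Hamming window), scaled to unity DC gain.

    num_taps is assumed odd so that the filter has a central tap.
    """
    fc = cutoff_hz / fs_hz                 # cutoff in cycles per sample
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        k = n - mid
        # Ideal lowpass impulse response, with the k = 0 limit handled apart.
        if k == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def fir_filter(taps, signal):
    """Direct-form convolution; the output has the same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            if i - j >= 0:
                acc += t * signal[i - j]
        out.append(acc)
    return out

def gain_at(taps, freq_hz, fs_hz):
    """Magnitude of the filter's frequency response at a single frequency."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    re = sum(t * math.cos(w * n) for n, t in enumerate(taps))
    im = sum(t * math.sin(w * n) for n, t in enumerate(taps))
    return math.hypot(re, im)
```

Evaluating `20 * math.log10(gain_at(taps, f, fs))` over a grid of frequencies would give a magnitude curve in dB of the same kind as the one plotted in Fig. 6, although not necessarily the same curve.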
- Following the filter, six
impulse trains are obtained, as follows: a train with impulses equal
to the amplitude of each maximum, at the location of each peak; another train with
impulses equal to the difference between the peak amplitude and the
previous minimum amplitude, at the location of each maximum; a
train with impulses equal to the difference between two consecutive
peak amplitudes, at the location of each peak; and, similarly, three
trains built from the negative of the amplitude of each minimum, at the location
of each valley.
- The six impulse trains are processed
by six individual pitch period estimators, one per train, each using a "peak detecting
exponential window" algorithm.
- The six estimates are combined
with the two most recent estimates from each detector. These
estimates are then compared, and the value with the most occurrences
is declared the pitch period at that time. If there is an obvious
lack of consistency among the estimates, the frame is declared
"unvoiced" and the value of the pitch is set to zero.
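The impulse-train and estimator steps above can be sketched roughly as follows. This is a minimal illustration and not the implementation used for the system. The three valley-based trains follow the usual Gold-Rabiner formulation (mirror images of the peak-based ones), and negative measurements are clipped to zero; both points are assumptions, since the text only summarizes them. The estimator uses a fixed blanking interval and a fixed decay rate (the hypothetical parameters `blank` and `decay`), whereas a practical detector would typically adapt them to the running pitch estimate.

```python
import math

def impulse_trains(x):
    """Build six sparse impulse trains from the filtered signal x."""
    n = len(x)
    m = [[0.0] * n for _ in range(6)]
    last_peak = 0.0
    last_valley = 0.0
    for i in range(1, n - 1):
        if x[i - 1] < x[i] >= x[i + 1]:        # local maximum (peak)
            m[0][i] = max(0.0, x[i])                 # peak amplitude
            m[1][i] = max(0.0, x[i] - last_valley)   # peak minus previous valley
            m[2][i] = max(0.0, x[i] - last_peak)     # peak minus previous peak
            last_peak = x[i]
        elif x[i - 1] > x[i] <= x[i + 1]:      # local minimum (valley)
            m[3][i] = max(0.0, -x[i])                # negated valley amplitude
            m[4][i] = max(0.0, last_peak - x[i])     # previous peak minus valley
            m[5][i] = max(0.0, last_valley - x[i])   # previous valley minus valley
            last_valley = x[i]
    return m

def estimate_period(train, blank=20, decay=0.02):
    """Peak-detecting exponential window applied to one impulse train.

    After an impulse is accepted, impulses are ignored for `blank` samples;
    afterwards the acceptance threshold decays exponentially from the
    accepted amplitude. Returns the intervals (in samples) between
    accepted impulses.
    """
    periods = []
    last_pos = None
    level = 0.0
    for i, v in enumerate(train):
        if v <= 0.0:
            continue
        if last_pos is None:               # first impulse starts the window
            last_pos, level = i, v
            continue
        elapsed = i - last_pos
        if elapsed <= blank:
            continue                       # still inside the blanking interval
        threshold = level * math.exp(-decay * (elapsed - blank))
        if v > threshold:
            periods.append(elapsed)
            last_pos, level = i, v
    return periods
```

For a strictly periodic input the third and sixth trains are nearly empty, because consecutive peaks (or valleys) have almost equal amplitudes; this is one reason why several detectors are run in parallel rather than relying on a single one.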
Figure 7 shows the contour of the pitch (in samples) during the
utterance: Detecția perioadei fundamentale folosind metoda
Rabiner ("Pitch detection using Rabiner method"). The
most important problem remains the voiced/unvoiced frame detection.
Fig. 7. - Pitch
contour for a sentence in the Romanian language.
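The final decision step of the pitch detector, including the fallback to "unvoiced", might be sketched as below. The function name, the matching tolerance of 2 samples, and the minimum number of agreeing estimates are hypothetical values chosen for illustration, and taking the candidates only from the six current estimates is likewise an assumption; the text does not give the actual thresholds.

```python
def decide_pitch(current, previous, older, tolerance=2, min_votes=6):
    """Pick the period supported by the most estimates, or 0 if unvoiced.

    current, previous, older: six period estimates each (in samples),
    one per detector, for the current frame and the two preceding ones.
    A value of 0 means that the detector produced no estimate.
    """
    pool = [p for p in list(current) + list(previous) + list(older) if p > 0]
    best_period, best_votes = 0, 0
    for candidate in current:
        if candidate <= 0:
            continue
        # Count every pooled estimate that agrees with this candidate.
        votes = sum(1 for p in pool if abs(p - candidate) <= tolerance)
        if votes > best_votes:
            best_period, best_votes = candidate, votes
    # Too little agreement is treated as an unvoiced frame (pitch = 0).
    return best_period if best_votes >= min_votes else 0
```

Raising `min_votes` makes the detector more willing to declare a frame unvoiced, and lowering it has the opposite effect, so this single threshold directly affects the voiced/unvoiced decision that the text identifies as the main remaining problem.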
5. Conclusions
General considerations concerning
a text-to-speech system for the Romanian language were presented.
First of all, we pointed out the terms and definitions related
to the automatic synthesis of speech. The most important
speech synthesis techniques were presented in a tutorial
manner. Then, a general approach to a system for text-to-speech
synthesis was detailed. We described two different approaches
for this task:
- a text-to-speech system based
on diphone concatenation and a neural network approach;
- a system based on syllable
concatenation, using the syllables either in their stored form or through an LPC-based
speech synthesizer.