Romanian Language Technology

Corneliu Burileanu & al * Text-to-Speech Synthesis for Romanian Language

3.3.2. Concatenative synthesis

This synthesis method uses pieces of natural speech (usually encoded) as building blocks to reconstruct an arbitrary utterance. The principle of this method is to store parameters that characterize acoustical units or transitions between them, rather than to model these units (as the previous method does) [6].

Concatenative synthesis methods use various representations of the speech units that are concatenated to form connected speech; the representation chosen also determines the technique used by the speech synthesizer. Two important techniques are used today:

The most widely used technique is linear prediction analysis-synthesis. Systems based on this technique typically represent speech as a sequence of LPC frames. This frame sequence, retrieved from storage, is smoothed at the unit boundaries (diphone boundaries, for example). The resulting set of frames is then converted to the appropriate coefficients for a digital LPC model, and these are loaded into the digital synthesizer at the frame rate; the synthesizer then produces output speech samples, which are finally smoothed to form the analog waveform.
A relatively new approach for a concatenative-type system is the algorithm named TD-PSOLA ("Time Domain Pitch Synchronous Overlap and Add"); it allows the speech prosody to be changed directly in the time domain while the acoustical units (usually diphones) are concatenated [7, 8, 9, 10].
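The linear prediction technique described above can be illustrated for a single frame. The following is a minimal sketch, not the implementation of any particular synthesizer: it assumes the autocorrelation method with the Levinson-Durbin recursion for analysis and a direct-form all-pole filter for synthesis. The function names are hypothetical, and the smoothing of frames at unit boundaries is omitted.

```python
def autocorrelation(frame, max_lag):
    """r[k] = sum over n of frame[n] * frame[n + k], for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r.

    Returns (a, error) with a[0] == 1, where the predictor is
    x[n] ~ -sum_{k=1..order} a[k] * x[n - k] and `error` is the
    residual prediction energy.
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        error *= (1.0 - k * k)
    return a, error


def lpc_synthesize(a, excitation):
    """All-pole synthesis filter: y[n] = e[n] - sum_k a[k] * y[n - k]."""
    order = len(a) - 1
    out = []
    for n, e in enumerate(excitation):
        y = e - sum(a[k] * out[n - k]
                    for k in range(1, order + 1) if n - k >= 0)
        out.append(y)
    return out
```

In a frame-based synthesizer of the kind described, a coefficient set such as `a` would be updated once per frame, while the excitation would be a pulse train for voiced frames and noise for unvoiced ones; both choices are outside this sketch.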

The TD-PSOLA synthesis scheme involves the following three steps: an analysis of the original speech waveform to produce an intermediate non-parametric representation of the signal, prosodic modifications applied to this intermediate representation, and finally the synthesis of the modified signal from the modified intermediate representation. We present here the basic features of this algorithm.

The waveform x(n) is decomposed into a sequence of overlapping short-term signals x_m(n). Each short-term signal is obtained by multiplying the waveform by an analysis window h_m(n) centered on the instant t_m:

$$x_m(n) = h_m(t_m - n)\, x(n)$$

The successive instants t_m, called pitch-marks, are set at a pitch-synchronous rate on the voiced portions of the signal and at a constant rate on the unvoiced portions.
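The analysis step can be sketched as follows. This is an illustration under stated assumptions rather than the windowing used by the authors: each window is built from two raised-cosine halves, rising from the previous pitch-mark to the current one and falling to the next, so that neighbouring windows sum to one between marks. The pitch-marks are assumed to be given as strictly increasing sample indices; the function names and the `(center_offset, samples)` layout are hypothetical.

```python
import math


def window_weight(n, left, center, right):
    """Asymmetric raised-cosine window: 0 at `left`, 1 at `center`, 0 at `right`."""
    if n <= center:
        return 0.5 - 0.5 * math.cos(math.pi * (n - left) / (center - left))
    return 0.5 + 0.5 * math.cos(math.pi * (n - center) / (right - center))


def short_term_signals(x, pitch_marks):
    """Cut x into overlapping, windowed segments, one per interior pitch-mark.

    Returns a list of (center_offset, samples) pairs, where
    samples[center_offset] is the sample located at the pitch-mark itself.
    The first and last marks only delimit their neighbours' windows.
    """
    segments = []
    for m in range(1, len(pitch_marks) - 1):
        left = pitch_marks[m - 1]
        center = pitch_marks[m]
        right = pitch_marks[m + 1]
        samples = [x[n] * window_weight(n, left, center, right)
                   for n in range(left, right + 1)]
        segments.append((center - left, samples))
    return segments
```

With this window choice, overlap-adding the unmodified segments at their original pitch-marks reproduces the input between the first and last interior marks, which is a convenient sanity check before any prosodic modification is applied.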

The stream of analysis short-term signals is processed to produce a stream of modified synthesis short-term signals $\tilde{x}_q(n)$, synchronized on a new set of synthesis pitch-marks f_q. The algorithm simultaneously determines the synthesis pitch-marks, according to the pitch-scale and time-scale modification factors, and the mapping between the synthesis and analysis pitch-marks. This mapping specifies which analysis short-term signal is to be copied to obtain any given synthesis short-term signal; the stream of synthesis pitch-marks indicates the delay to be applied to each synthesis short-term signal.
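One simple way to compute the synthesis pitch-marks and the mapping is sketched below. The selection rule is an assumption made for illustration, not necessarily the one used in [7, 8, 9, 10]: each synthesis mark is projected back onto the original time axis and paired with the nearest analysis mark, and the next synthesis mark is placed one scaled pitch period later. The names `time_scale` (duration multiplier) and `pitch_scale` (fundamental-frequency multiplier) are hypothetical.

```python
def map_pitch_marks(analysis_marks, time_scale=1.0, pitch_scale=1.0):
    """Return (synthesis_marks, mapping) for the requested modifications.

    mapping[q] is the index of the analysis pitch-mark whose short-term
    signal is copied to the q-th synthesis pitch-mark.
    """
    # Local pitch period at each analysis mark (the last one is repeated).
    periods = [b - a for a, b in zip(analysis_marks, analysis_marks[1:])]
    periods.append(periods[-1])

    synthesis_marks, mapping = [], []
    s = float(analysis_marks[0]) * time_scale   # position on the new time axis
    end = analysis_marks[-1] * time_scale
    while s <= end:
        tau = s / time_scale                    # same instant on the old axis
        m = min(range(len(analysis_marks)),
                key=lambda i: abs(analysis_marks[i] - tau))
        synthesis_marks.append(round(s))
        mapping.append(m)
        s += periods[m] / pitch_scale           # advance by one new period
    return synthesis_marks, mapping
```

Stretching the duration or raising the pitch makes some analysis indices appear more than once in `mapping`, while shortening the duration or lowering the pitch makes some indices disappear from it.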

Generally, this mapping is not one-to-one, so that some analysis short-term signals are duplicated and others are eliminated. The synthetic speech $\tilde{x}(n)$ is obtained by overlap-adding the stream of synthesis short-term signals $\tilde{x}_q(n)$, which in its simplest form is a plain sum:

$$\tilde{x}(n) = \sum_q \tilde{x}_q(n)$$

Because the synthesis process is synchronized with the synthesis pitch period, the algorithm simultaneously controls the value of the synthesized pitch and the duration of the synthesized signal.
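The final overlap-add step can be sketched as below. This is a minimal illustration that assumes the simplest scheme, a plain sum without window normalization. Each analysis short-term signal is assumed to be stored as a hypothetical `(center_offset, samples)` pair, where `samples[center_offset]` is the sample that sat on the analysis pitch-mark; `mapping[q]` names the analysis signal copied to the q-th synthesis pitch-mark.

```python
def overlap_add(segments, synthesis_marks, mapping, length):
    """Sum the selected short-term signals, each re-centered on its synthesis mark.

    segments[m] = (center_offset, samples) for analysis pitch-mark m.
    Samples falling outside [0, length) are discarded.
    """
    out = [0.0] * length
    for s_mark, m in zip(synthesis_marks, mapping):
        center_offset, samples = segments[m]
        start = s_mark - center_offset      # delay that aligns the segment
        for i, value in enumerate(samples):
            n = start + i
            if 0 <= n < length:
                out[n] += value
    return out
```

The spacing of `synthesis_marks` sets the pitch of the output, and their number and extent set its duration, which is how both prosodic parameters are controlled at once.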

This algorithm is an efficient method for modifying the prosodic parameters while preserving most of the naturalness of the voice timbre. Moreover, it can be combined in a flexible manner with low-bit-rate speech coding in order to reduce the memory needed to store the inventory of acoustical units.

4. Text-to-speech synthesis for Romanian language

Our research group has long experience in signal and speech processing; theoretical studies and intensive experiments have been carried out on speech recognition and synthesis for limited vocabulary [11, 12], on speech coding methods, and also on speaker verification and identification.

In recent years, considerable work has gone into text-to-speech synthesis for the Romanian language. At the start of the research, the synthesizer was intended to have the following basic features:

good intelligibility for the speech produced;
fair naturalness (syllabic stress, intonation at the end of the word);
low memory requirements;
a complete software version running on 486DX-equipped PCs, with acoustic output produced through a standard SoundBlaster audio interface;
real-time speech synthesis for any word typed from the keyboard;
unlimited vocabulary.

We will discuss two different approaches for this task.
