From a linguistic point of view, prosody involves not only segmental structure but also a "suprasegmental" one (units larger than phonetic segments); there are local perturbations of intonation, timing and intensity due to particular vowels and consonants, but there are also more global prosodic patterns that result from the suprasegmental structure of the utterance at the syllable, word and phrase levels. Some existing commercial TTS systems do not tailor an intonation contour to each synthesized sentence; they either use a monotone with a falling or rising pitch at the end of the sentence, or they synthesize the sentence like a list of words, treating each word as a declarative sentence in which pitch first rises and then falls at the end of the word. The better systems generate an intonation contour that is tailored to each sentence [5].
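As a toy illustration (not taken from any of the cited systems), the following Python sketch generates the simplest kind of contour mentioned above: a flat fundamental frequency with a terminal fall for statements or a terminal rise for questions; the frame count, pitch values and tail length are assumptions made only for the example.

# Minimal sketch of the simplest intonation scheme mentioned above:
# a flat F0 contour with a terminal fall or rise over the last part of
# the sentence. Pitch values and tail length are illustrative assumptions.

def simple_contour(n_frames, base_f0=120.0, terminal="fall", span=30.0, tail=0.2):
    """Return one F0 value (Hz) per frame: flat, then a fall or rise
    over the final `tail` fraction of the sentence."""
    contour = [base_f0] * n_frames
    start = int(n_frames * (1.0 - tail))
    for i in range(start, n_frames):
        frac = (i - start) / max(1, n_frames - start - 1)
        delta = span * frac
        contour[i] = base_f0 - delta if terminal == "fall" else base_f0 + delta
    return contour

declarative = simple_contour(100, terminal="fall")    # statement: pitch falls
interrogative = simple_contour(100, terminal="rise")  # question: pitch rises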
Speech timing is another important aspect of prosody that affects both intelligibility and naturalness; segmental duration reflects many effects: the inherent duration of the various phonemes, the influence of syntactic, phonological and phonetic context, and non-linguistic factors such as speaking rate. Duration rules vary in complexity from system to system; for example, the simplest systems may use only the inherent duration of phonemes, without modeling contextual effects.
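A minimal sketch of such a rule-based duration model is given below; the inherent and minimum durations and the contextual factors are illustrative assumptions, not values taken from any particular system.

# Minimal sketch of a rule-based duration model: each phoneme has an
# inherent and a minimum duration, and contextual rules stretch or shrink
# only the compressible part (inherent minus minimum). All values assumed.

INHERENT_MS = {"a": 120, "e": 110, "i": 100, "t": 70, "s": 100}
MINIMUM_MS  = {"a": 60,  "e": 55,  "i": 50,  "t": 40, "s": 55}

def phoneme_duration(ph, phrase_final=False, speaking_rate=1.0):
    inherent, minimum = INHERENT_MS[ph], MINIMUM_MS[ph]
    percent = 100.0
    if phrase_final:
        percent *= 1.4            # phrase-final lengthening (assumed factor)
    percent /= speaking_rate      # a faster rate shortens the compressible part
    return (inherent - minimum) * percent / 100.0 + minimum

print(phoneme_duration("a"))                      # neutral context: 120.0 ms
print(phoneme_duration("a", phrase_final=True))   # lengthened: 144.0 ms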
V. It is important to observe that, in the earlier discussion of text-to-speech synthesis (also illustrated in Figure 3), the notion of "unit" refers to an "acoustic segment" or "phonetic segment". We emphasize that although, at the end of the first stages of the system, the pronunciation is represented by a string of allophones (an allophone being usually defined as a "reasonably" different realization of a given phoneme) with stress and juncture marking, this sequence can then be mapped onto a number of other classes of units, depending on the requirements of the synthesis scheme. The choice and representation of units must allow a sufficient degree of freedom to control all significant aspects of the generated waveform, and it must also be possible to analyze speech in terms of these units and their parameters.
The size of the units is limited by various constraints. Whole words as units are impractical because there are too many words to store and access readily; at the other end of the spectrum of possible units is the phoneme itself, of which there are only about 34 basic units in the Romanian language. While the size of the phoneme inventory would be easily manageable, the strong coarticulatory effects between adjacent phonemes make the phoneme an unsatisfactory unit for a concatenative inventory. Other units that have been used in TTS systems are the diphone (the acoustic stretch of speech from the middle of one phoneme to the middle of the next), the demi-syllable (either a syllable-initial consonant or consonant cluster plus the first half of the vowel, or the second half of the vowel plus the syllable-final consonant or consonant cluster), the syllable and the morpheme.
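As an illustration of the diphone as an inventory unit, the short Python sketch below maps a phoneme string onto the diphone names a concatenative inventory would have to supply; the phoneme symbols and the padding convention for utterance boundaries are assumptions made for the example.

# Minimal sketch: map a phoneme sequence onto diphone units (middle of one
# phoneme to the middle of the next). The "#" padding symbol marks silence
# at the utterance boundaries and is an assumed convention.

def to_diphones(phonemes, pad="#"):
    seq = [pad] + list(phonemes) + [pad]
    return ["{}-{}".format(a, b) for a, b in zip(seq, seq[1:])]

# Romanian word "casa" (the house), roughly /k a s a/
print(to_diphones(["k", "a", "s", "a"]))
# ['#-k', 'k-a', 'a-s', 's-a', 'a-#']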
We therefore assume that, after this step, one obtains a detailed phonetic transcription of the input text, expressed as a sequence of suitable acoustic units, together with the prosodic information.
VI. The last two stages carry out the speech synthesis according to the synthesis method of the TTS system and the technique used by the speech synthesizer; they perform the basic tasks described below.
3.3. Speech synthesis methods for text-to-speech systems
We pointed out in the previous chapter that the general synthesis method used in text-to-speech systems is segmental synthesis-by-rules [3]. However, we can distinguish two different concepts used in today's practical implementations; we will briefly discuss these concepts and, because the techniques used by speech synthesizers are strongly related to these principles, give some examples of the most widely used techniques.
3.3.1. Synthesis-by-rules
For this approach, a phoneme-sized unit is used. The most important abstraction introduced is the notion of target. For each phonetic segment, target values for the individual formants are stored in a table; a parameter generation algorithm then operates on the target sequence to provide the spectral frames required by the synthesis model at regular intervals. The speech synthesizer usually employs a formant-type synthesis technique controlled by heuristic acoustic-domain rules. The rule-generated acoustic parameters are updated every 5 to 10 ms and include formant frequencies and bandwidths, fundamental frequency, source specifications etc.
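A minimal sketch of this target-based parameter generation is given below; the formant targets, segment durations and the simple linear interpolation toward the next target are assumptions made only for illustration, not the actual rules of any system.

# Minimal sketch of target-based parameter generation: each phonetic segment
# stores formant targets, and the generator produces one spectral frame every
# 5 ms by interpolating linearly toward the next segment's targets. The
# target values, durations and interpolation rule are illustrative assumptions.

FRAME_MS = 5

# (duration in ms, [F1, F2, F3] targets in Hz) -- assumed values
TARGETS = {
    "a": (120, [700, 1200, 2500]),
    "i": (100, [300, 2200, 2900]),
}

def generate_frames(segments):
    frames = []
    for idx, ph in enumerate(segments):
        dur, current = TARGETS[ph]
        nxt = TARGETS[segments[idx + 1]][1] if idx + 1 < len(segments) else current
        n = dur // FRAME_MS
        for k in range(n):
            t = k / n   # 0.0 at segment start, approaching 1.0 at the boundary
            frames.append([c + t * (x - c) for c, x in zip(current, nxt)])
    return frames

print(len(generate_frames(["a", "i"])))   # (120 + 100) / 5 = 44 frames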
In the case of formant synthesis, the synthesizer is made up of resonators connected either in cascade or in parallel (or both), excited by a glottal pulse train and a noise source.
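The sketch below shows how such a cascade structure can be simulated with second-order digital resonators driven by a crude impulse-train source; the sampling rate, formant frequencies and bandwidths are assumptions made for the example, not rules from this paper.

import math

# Minimal sketch of a cascade formant synthesizer: second-order digital
# resonators connected in series, excited here by a crude impulse-train
# "glottal" source. Sampling rate, formant frequencies and bandwidths are
# illustrative assumptions.

FS = 16000  # sampling rate in Hz (assumed)

def resonator(signal, freq, bw):
    """Two-pole resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    T = 1.0 / FS
    C = -math.exp(-2.0 * math.pi * bw * T)
    B = 2.0 * math.exp(-math.pi * bw * T) * math.cos(2.0 * math.pi * freq * T)
    A = 1.0 - B - C
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = A * x + B * y1 + C * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(n_samples, f0=120.0):
    period = int(FS / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

source = impulse_train(FS // 10)                     # 100 ms of voicing
speech = source
for f, bw in [(700, 80), (1200, 90), (2500, 120)]:   # assumed /a/-like formants
    speech = resonator(speech, f, bw)                # cascade connection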
Advantages of formant-based synthesis are its flexibility, its ability to generate smooth transitions between segments and its relatively small storage requirements. The major disadvantage is the difficulty of specifying sufficiently detailed rules to synthesize some acoustically rich segments, especially consonants.