Romanian Language Technology

Corneliu Burileanu & al. * Text-to-Speech Synthesis for Romanian Language

3.2. Stages of text-to-speech conversion

A general text-to-speech system includes two main stages:

First, the system must analyze the input text and provide a phonetic and prosodic transcription of this text.
Then, this abstract linguistic representation must be transformed into a speech waveform, using techniques that depend on the speech synthesizer and the selected acoustic segments.

The basic steps involved in converting text to speech are illustrated in Figure 3.

Fig. 3. - Principal elements of a text-to-speech conversion system.

I. Linguistic preprocessing must convert the input text into a "normalized" form that can then be processed correctly (we assume that an ASCII representation of each input sentence is available as input to the text analysis stage):

word and sentence boundaries are detected (a period may indicate either an abbreviation or the end of a sentence, and the two cases must be distinguished);
common abbreviations are expanded;
numbers (including integers, decimal numbers, times and dates) are rewritten as words and punctuation;
some punctuation marks are also interpreted (they are used later in the determination of prosody).
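
The normalization steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the abbreviation table, the function names `number_to_words` and `normalize`, and the restriction to English words and to integers from 0 to 99 are all assumptions made only for illustration.

```python
import re

# Hypothetical abbreviation table; a real system would use a much larger one.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the toy range 0..99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def normalize(text):
    """Expand known abbreviations and spell out short integers."""
    out = []
    for tok in text.lower().split():
        if tok in ABBREVIATIONS:
            # The period belongs to the abbreviation, so it is dropped here.
            out.append(ABBREVIATIONS[tok])
            continue
        # One or two digits, optionally followed by punctuation to keep.
        match = re.fullmatch(r"(\d{1,2})([.,;:!?]*)", tok)
        if match:
            out.append(number_to_words(int(match.group(1))) + match.group(2))
        else:
            out.append(tok)
    return " ".join(out)


print(normalize("Dr. Smith lives at 42 Main St."))
# doctor smith lives at forty two main street
```

Note that the final "St." is treated as an abbreviation, so the sentence-final period is lost. This is exactly the period ambiguity mentioned in the list, and a real preprocessor would need an explicit rule to resolve it.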

II. Syntactic analysis is important both for word pronunciation and as one of several components in the determination of prosody; at least a partial syntactic analysis is performed in order to make some decisions about the syntactic structure of the sentence (identification of the part of speech of each word).

III. The next stage performs the letter-to-phoneme conversion; the phonemic representation is obtained with the use of a dictionary and letter-to-sound rules, so that orthographic characters are mapped into the appropriate strings of phonemes and their associated lexical stress markers.

In text-to-speech synthesis, a reliable grapheme-to-phoneme conversion tool is an important component, whatever the synthesis method is. Most existing commercial TTS systems translate input text into a phonetic transcription by making a compromise between the use of letter-to-sound rules (sometimes hundreds of ordered rules) and a pronouncing dictionary (for the most common irregular words, for example). Developing a large set of rules for a given language is usually a difficult and laborious task; likewise, storing a large dictionary and the time required to look up a word raise many problems.
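
The compromise between a pronouncing dictionary and ordered letter-to-sound rules can be sketched as follows. This is only an illustration under stated assumptions: the exception entry, the ASCII phoneme symbols (such as `tS`), the handful of rules for the letter "c", and the function name `to_phonemes` are hypothetical and far simpler than any real rule set.

```python
# Hypothetical exception dictionary: words the rules below would get wrong.
EXCEPTIONS = {"weekend": ["w", "i", "k", "e", "n", "d"]}

# Ordered letter-to-sound rules (toy subset). Earlier rules take priority,
# so the longer, more specific patterns are listed first.
RULES = [
    ("che", ["k", "e"]),
    ("chi", ["k", "i"]),
    ("ce", ["tS", "e"]),
    ("ci", ["tS", "i"]),
    ("c", ["k"]),
]


def to_phonemes(word):
    """Look the word up in the dictionary first, then fall back to rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes = []
    i = 0
    while i < len(word):
        for graphemes, sounds in RULES:
            if word.startswith(graphemes, i):
                phonemes.extend(sounds)
                i += len(graphemes)
                break
        else:
            # No rule matched: assume the letter maps to itself.
            phonemes.append(word[i])
            i += 1
    return phonemes


print(to_phonemes("cine"))     # ['tS', 'i', 'n', 'e']
print(to_phonemes("chef"))     # ['k', 'e', 'f']
print(to_phonemes("weekend"))  # ['w', 'i', 'k', 'e', 'n', 'd']
```

Keeping the dictionary small and letting the rules cover regular words limits storage and lookup time, at the cost of having to order the rules carefully.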

Stress assignment rules are also an important constituent of TTS systems; incorrect lexical stress assignment is very unpleasant for the listener, because the stress pattern (which affects several acoustic dimensions) is a basic feature in the recognition of speech. Stress assignment rules are generally modeled after linguistic theories, but some simpler stress rules may be used if one knows only the number of syllables and basic syllabic structure.
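
A simple stress rule of the kind mentioned above, one that uses only the number of syllables and the basic syllabic structure, might look like the sketch below. The rule itself (stress the final syllable if it ends in a consonant, otherwise the penultimate one) is an assumed heuristic chosen for illustration, not the rule set used by the authors; the names `assign_stress` and `mark_stress` are hypothetical, and the input is assumed to be already syllabified.

```python
# ASCII vowels only; letters with diacritics are omitted in this sketch.
VOWELS = set("aeiou")


def assign_stress(syllables):
    """Return the index of the stressed syllable in a syllabified word."""
    if not syllables:
        raise ValueError("expected at least one syllable")
    if len(syllables) == 1:
        return 0
    last = syllables[-1]
    if last[-1].lower() not in VOWELS:
        # Closed final syllable: assume it carries the stress.
        return len(syllables) - 1
    # Open final syllable: assume penultimate stress.
    return len(syllables) - 2


def mark_stress(syllables):
    """Join syllables with dots, prefixing the stressed one with an apostrophe."""
    stressed = assign_stress(syllables)
    return ".".join(
        ("'" + s) if i == stressed else s for i, s in enumerate(syllables)
    )


print(mark_stress(["ca", "sa"]))   # 'ca.sa
print(mark_stress(["co", "pil"]))  # co.'pil
```

Words that violate such a heuristic would have to be listed with their stress pattern in the pronouncing dictionary.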
