Romanian Language Technology

Corneliu Burileanu & al * Text-to-Speech Synthesis for Romanian Language

This method forms the basis for converting unrestricted text to speech, which is essentially the ultimate goal of speech synthesis. It will be discussed in detail in the next chapter.

We must emphasise here, however, that two concepts related to text-to-speech systems should not be confused; we therefore define:

a text-to-speech system as being "the system that can produce speech from a written text or from certain concepts";
a speech synthesizer as representing "the stage of an automatic speech synthesis system (which may be a TTS system or not) that performs the conversion of a parameter sequence into speech".

For a TTS system this difference is shown in Figure 2.

Fig. 2. - Text-to-speech system principle.
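
The distinction between the two concepts can be illustrated with a minimal sketch, under stated assumptions. The `Frame` fields, the function names and the placeholder logic are all hypothetical and do not come from the system described here; the only point is that the speech synthesizer is the stage that receives a parameter sequence, while the TTS system covers the whole path from text to speech.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One hypothetical element of the parameter sequence (field names are assumptions)."""
    phoneme: str
    duration_ms: float
    f0_hz: float


def text_to_parameters(text: str) -> List[Frame]:
    """Placeholder front end: one frame per letter, with fixed duration and pitch.

    A real front end would perform text analysis, letter-to-sound
    conversion and prosody generation; none of that is modelled here.
    """
    return [Frame(ch, 80.0, 120.0) for ch in text.lower() if ch.isalpha()]


def synthesize(frames: List[Frame], sample_rate: int = 8000) -> List[float]:
    """Placeholder speech synthesizer: parameter sequence in, samples out.

    It only emits silence of the right length; a formant or linear
    prediction synthesizer would generate the actual waveform here.
    """
    samples: List[float] = []
    for frame in frames:
        n_samples = int(sample_rate * frame.duration_ms / 1000.0)
        samples.extend([0.0] * n_samples)
    return samples


def text_to_speech(text: str, sample_rate: int = 8000) -> List[float]:
    """The TTS system as a whole: front end followed by the speech synthesizer."""
    return synthesize(text_to_parameters(text), sample_rate)
```

In this sketch `synthesize` could equally be driven by a parameter sequence that does not come from text, which is why a speech synthesizer may belong to a system that is not a TTS system.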

For a text-to-speech system, the speech synthesizer usually uses either a formant-based synthesis technique or linear prediction analysis and synthesis techniques. These will also be discussed in the next chapter, together with a description of a rather new technique for speech synthesis: the pitch-synchronous overlap-add (PSOLA) technique.

For now we only note that another synthesis technique being developed for experimental TTS systems is articulatory synthesis. This technique represents phonemes as articulatory targets and employs rules that model how the articulators move in time and space to generate connected speech. These synthesizers are controlled by articulatory control parameters rather than by text input or acoustic parameters. It is hoped that the articulatory approach will lead to simpler and more elegant rules that model human speech more closely; however, the lack of complete, detailed articulatory data makes the optimization of articulatory TTS systems difficult.
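
The idea of phonemes as articulatory targets can be sketched as follows. This is only an illustration under strong assumptions: the parameter names (`jaw_opening`, `lip_rounding`), their values and the use of plain linear interpolation are all hypothetical, and the movement rules of an actual articulatory synthesizer would be considerably more elaborate.

```python
from typing import Dict, List

# Hypothetical articulatory targets in arbitrary units between 0 and 1.
# The values are invented for illustration, not measured data.
TARGETS: Dict[str, Dict[str, float]] = {
    "a": {"jaw_opening": 0.9, "lip_rounding": 0.1},
    "u": {"jaw_opening": 0.3, "lip_rounding": 0.9},
}


def trajectory(phonemes: List[str], steps_per_transition: int = 4) -> List[Dict[str, float]]:
    """Move the articulators from one phoneme target to the next.

    Linear interpolation stands in for the movement rules; it is the
    simplest possible assumption about how articulators move in time.
    """
    path = [dict(TARGETS[phonemes[0]])]
    for prev, nxt in zip(phonemes, phonemes[1:]):
        start, end = TARGETS[prev], TARGETS[nxt]
        for k in range(1, steps_per_transition + 1):
            t = k / steps_per_transition
            path.append({name: (1 - t) * start[name] + t * end[name] for name in start})
    return path
```

Calling `trajectory(["a", "u"])` yields a short sequence of articulatory control parameters that starts at the target for "a" and ends at the target for "u"; in a full system such a sequence, rather than text or acoustic parameters, would drive the synthesizer.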

3. Text-to-speech conversion

3.1. Introduction

From the above description, it is clear that speech can be produced from a wide variety of inputs. These range from stored parts of speech, through encoded speech waveforms and frame-parameterized speech, to segmental synthesis symbols and a complete phonetic characterization including prosodic analysis. All of these inputs have been used in applications, the choice depending on many factors. Typically, the simpler systems are used where the number of messages and the message lengths are small and it is desirable to preserve the speaker identity. The more complex procedures provide the flexibility of abstraction needed to accommodate a large (potentially infinite) set of possible utterances, flexibility of output and the possibility of several voices, as in text-to-speech systems. The price paid for this increased flexibility and control is a loss of speech quality, although the quality remains acceptable for many applications [5].

Text-to-speech systems can be evaluated and compared with respect to intelligibility, naturalness and suitability for particular applications. One can measure the intelligibility of individual phonemes, words, or words in sentence context, and one can also estimate the listening comprehension of synthetic speech. What is even more important in designing such a system, however, is the predicted performance for a specific application. No existing TTS system is good enough to fully replace a human, but such a system can be perfectly acceptable if it is part of an application that provides direct access to information stored in a computer, permits easier or cheaper access to an existing service because more telephone lines can be handled at a given cost, or helps blind persons to have full access to their computers (possibly connected to a large network).
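
As an illustration of how word intelligibility might be quantified, the sketch below computes the fraction of test words that listeners identified correctly. The scoring scheme, the function name and the data layout are assumptions made for this example; they are not an evaluation protocol taken from the text.

```python
from typing import List


def word_intelligibility(reference: List[str], responses: List[List[str]]) -> float:
    """Fraction of reference words that listeners wrote down correctly.

    `reference` holds the words that were synthesized; `responses` holds one
    list per listener, aligned position by position with `reference`.
    Words a listener did not answer count as errors.
    """
    total = len(reference) * len(responses)
    if total == 0:
        raise ValueError("need at least one reference word and one listener")
    correct = 0
    for response in responses:
        for ref_word, heard in zip(reference, response):
            if ref_word.strip().lower() == heard.strip().lower():
                correct += 1
    return correct / total
```

With four test words and two listeners, one of whom misidentifies two words, the score is 6 correct answers out of 8. The same counting could be applied to phonemes or to words in sentence context by changing what the lists contain.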

What we want to suggest is that even for the Romanian language, the design of a TTS system, which is obviously a very difficult task and must involve the most diverse interdisciplinary efforts, needs to be oriented towards a specific application.
