Attila Ferencz & al. * A Text-To-Speech System for the Romanian Language




Fig. 1. - The Building Elements of a Text-to-Speech System.



3. The sound database

The sound database contains the set of elementary sounds (speech units). By concatenating elementary sounds we can generate a sound signal corresponding to any text.
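
The concatenation step can be sketched as follows. This is only a minimal illustration, not the system's actual implementation: the function name `concatenate_units`, the representation of the database as a dictionary of sample lists, and the unit names are all assumptions made for the example.

```python
def concatenate_units(unit_names, database):
    """Join the stored samples of the requested units into one signal.

    `database` is assumed to map a unit name to a list of samples;
    this layout is hypothetical and chosen only for illustration.
    """
    signal = []
    for name in unit_names:
        if name not in database:
            raise KeyError(f"unit {name!r} is not in the sound database")
        signal.extend(database[name])
    return signal


# Toy database with made-up sample values.
toy_database = {"a": [1, 2], "b": [3]}
signal = concatenate_units(["a", "b", "a"], toy_database)
```

A real system would also have to smooth the joins between units; the sketch only shows the lookup and concatenation.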

The speech units can be chosen from among words, sentences, morphemes, syllables, phonemes, demisyllables, etc., according to the requirements of the application. Using words and sentences as basic units (recorded with their intonation and articulation) we can obtain high-quality speech, but only for restricted domains (for example, portable dictionaries).

Morphemes are an alternative unit. The English language contains, for example, 12,000 morphemes (such as book, ed, have, s).

If we want an unrestricted vocabulary, the storage space needed to record every word becomes too large, and recording all the words is no longer practical. The solution is to use more elementary sounds, such as phonemes, as speech units. The disadvantage is that a phoneme corresponds to an infinite, but specified, class of temporal and spectral variants. The physical features of a phoneme vary from one speaker to another, and even for the same speaker they can change depending on the speaker's state, the position of the phoneme and of the accent in the word, the intonation and the accent in the phrase, the duration of pronunciation, etc. Although the number of phonemes is relatively small (approximately equal to the number of letters in the alphabet, Romanian being a quasi-phonetic language), synthesis based on phonemes produces unnatural speech that is difficult to understand.

When phonemes are used as basic units, interpolation is needed at the transition from one phoneme to another, because the vocal tract does not change shape abruptly but glides smoothly from one articulation position to the next. The effect of this transition must be incorporated into the algorithm by inserting sets of interpolated parameters between neighboring phonemes. This works well for slow transitions, as in the case of vowels, but for consonants the transition is too fast and the acoustic effect is lost. To overcome this problem we can use diphones or demisyllables.
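
The insertion of interpolated parameter sets can be illustrated with a small sketch. The text does not say which interpolation scheme is meant, so linear interpolation is assumed here; the function name `interpolate_parameters` and the two-element parameter vectors are hypothetical.

```python
def interpolate_parameters(start, end, steps):
    """Return `steps` parameter sets lying strictly between `start` and `end`.

    Linear interpolation is an assumption made for this sketch; the
    parameter vectors stand for whatever synthesis parameters describe
    a phoneme (not specified in the text).
    """
    if len(start) != len(end):
        raise ValueError("parameter sets must have the same length")
    frames = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # fraction of the way from start to end
        frames.append([a + t * (b - a) for a, b in zip(start, end)])
    return frames


# Three intermediate frames between two made-up parameter sets.
frames = interpolate_parameters([0.0, 10.0], [4.0, 20.0], 3)
```

With few intermediate frames the glide is coarse, which is consistent with the remark that fast consonant transitions are hard to capture this way.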

If we want a trade-off between storage space and intelligible speech, we can use diphones as database elements. A diphone is a sound consisting of the two neighboring halves of two adjacent phonemes: it starts in the middle of the first phoneme and ends in the middle of the second. Almost any combination of two phonemes can make up a diphone, so the number of diphones in a language is at most equal to the square of the number of phonemes in that language.
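
As a rough illustration, the sketch below splits a phoneme sequence into the diphones that span each pair of neighbors and computes the upper bound on the size of a diphone inventory. The function names, the phoneme symbols, and the use of a silence marker `_` at the edges of the utterance are assumptions made for the example, not details taken from the system described here.

```python
def to_diphones(phonemes, silence="_"):
    """List the diphones covering a phoneme sequence.

    Each diphone runs from the middle of one phoneme to the middle of
    the next. Padding with a silence marker at both ends is an
    assumption of this sketch.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def max_diphones(n_phonemes):
    """Upper bound on the number of diphones: the square of the phoneme count."""
    return n_phonemes ** 2


# Hypothetical phoneme sequence, written with plain letters.
units = to_diphones(["c", "a", "s", "a"])
```

Here `units` lists five diphones for four phonemes, because the silence marker adds one unit at each edge of the utterance.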

In diphone-based synthesis, the sound database consists of all the diphones in the language.

The sound database of our system consists of 900 diphones.


