Attila Ferencz & al. * A Text-To-Speech System for the Romanian Language
Fig. 1. - The Building Elements of a Text-to-Speech System.
The speech units can be chosen
from among words, sentences, morphemes, syllables, phonemes, demisyllables,
etc., according to the requirements of the application. Using
words and sentences as basic units (recorded with their
intonation and articulation) we can obtain high-quality speech,
but only for restricted domains (for example, portable dictionaries).
Morphemes are an alternative unit that can be used.
The English language, for example, contains 12,000 morphemes (such as
book, ed, have, s).
If we want an unrestricted
vocabulary, the storage space becomes too large and the idea of
recording all the words becomes impractical. The solution is to
use more elementary sounds, such as phonemes, as speech units.
But here we meet the disadvantage that a phoneme corresponds to
an infinite - but specified - class of temporal or frequency-domain
variants. The physical features of a phoneme vary from one speaker
to another, and even for the same speaker there can be changes
depending on the speaker's state, the position of the phoneme and
of the accent in the word, the intonation and the accent in the
phrase, the duration of pronunciation, etc. Although the number of
phonemes is relatively small (approximately equal to the number
of signs in the alphabet, the Romanian language being a quasi-phonetic
language), synthesis based on phonemes produces unnatural
speech that is difficult to understand.
Using phonemes as basic units,
we need interpolation at the transition from one phoneme to the next,
because the vocal tract does not change shape abruptly but glides
smoothly from one articulation position to another. The effect
of this transition must be incorporated into the algorithm by
inserting sets of interpolated parameters between neighboring
phonemes. This works well with slow transitions, as in the case of vowels,
but in the case of consonants the transition is too fast and the
acoustic effect is lost. To overcome this problem we can use diphones
or demisyllables.
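As a minimal sketch of the interpolation step described above, assume that each phoneme is represented by a single target vector of synthesis parameters. The function name interpolate_frames, the content of the vectors and the choice of linear interpolation are illustrative assumptions, not details of the system described in this paper.

```python
def interpolate_frames(start, end, n_steps):
    """Return n_steps parameter frames lying strictly between start and end.

    start, end: equal-length lists of floats (hypothetical target parameter
    vectors of two neighboring phonemes).
    Linear interpolation is assumed purely for illustration.
    """
    if len(start) != len(end):
        raise ValueError("parameter vectors must have the same length")
    frames = []
    for k in range(1, n_steps + 1):
        t = k / (n_steps + 1)  # fraction of the way from start to end
        frames.append([a + t * (b - a) for a, b in zip(start, end)])
    return frames


# Three intermediate frames between two made-up two-parameter targets.
print(interpolate_frames([0.0, 10.0], [4.0, 20.0], 3))
# [[1.0, 12.5], [2.0, 15.0], [3.0, 17.5]]
```

With a fixed frame rate, a fast consonant transition leaves room for very few such intermediate frames, which is one way to picture why the acoustic effect of the transition is lost.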
If we want a trade-off between
storage space and the intelligibility of the speech produced,
we can use diphones as database elements. A diphone is a sound
consisting of the two neighboring halves of two adjacent phonemes:
it starts in the middle of the first phoneme and ends
in the middle of the second. Almost any combination of two
phonemes can make up a diphone, so the number of diphones in
a language is at most equal to the square of the number of phonemes
in that language.
In the case of diphone synthesis, the sound
database consists of all the diphones in the language.
The sound database of our system
consists of 900 diphones.
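The diphone definition and the upper bound on the size of the inventory can be illustrated with a short sketch. The function names to_diphones and max_diphone_count and the example phoneme sequence are hypothetical and are used only for illustration; they do not describe the actual implementation of the system.

```python
def to_diphones(phonemes):
    """Split a phoneme sequence into diphones.

    Each diphone is written as a pair (first, second): it runs from the
    middle of the first phoneme to the middle of the second.
    """
    phonemes = list(phonemes)
    return list(zip(phonemes, phonemes[1:]))


def max_diphone_count(n_phonemes):
    """Upper bound on the number of diphones: every ordered phoneme pair."""
    return n_phonemes ** 2


# Illustrative four-phoneme sequence (an assumed transcription).
print(to_diphones(["m", "a", "s", "a"]))
# [('m', 'a'), ('a', 's'), ('s', 'a')]

# A toy inventory of 4 phonemes allows at most 16 diphones.
print(max_diphone_count(4))
# 16
```

Since not every pair of phonemes actually occurs in a language, the real inventory can be smaller than this bound.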