Attila Ferencz & al. * A Text-To-Speech System for the Romanian Language




8. The synthesizer module

Our research team experimented with several synthesis methods: synthesis based on sampling in the amplitude-time domain, linear prediction synthesis, and formant synthesis.

The first method to give very good results, although not in real time, was a software simulation of the linear prediction algorithm. The only hardware it required was a D/A converter and the audio equipment.
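The core of a software linear prediction synthesizer can be sketched as an all-pole filter driven by an excitation signal. This is a minimal illustration, not the authors' implementation: the function names, the sign convention s[n] = gain*e[n] + sum_k a_k*s[n-k], and the impulse-train excitation for voiced sounds are assumptions made for the example.

```python
def lpc_synthesize(coeffs, excitation, gain=1.0):
    """All-pole synthesis: s[n] = gain*e[n] + sum_k coeffs[k-1] * s[n-k].

    `coeffs` holds the predictor coefficients a_1..a_p (assumed convention).
    """
    order = len(coeffs)
    out = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for k in range(1, order + 1):
            if n - k >= 0:  # samples before the start are taken as zero
                acc += coeffs[k - 1] * out[n - k]
        out.append(acc)
    return out


def impulse_train(length, period):
    """Simplified voiced excitation: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


# Usage: a first-order filter excited by a single pulse decays geometrically.
samples = lpc_synthesize([0.5], [1.0, 0.0, 0.0, 0.0])
```

Every output sample needs one multiply-accumulate per coefficient, which is why a general-purpose processor of that period could run the algorithm offline but not necessarily in real time.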

To obtain speech in real time, we tried the formant synthesis method, using the dedicated PCF8200 processor produced by Philips. The synthesizer was an external module connected to the computer through a parallel interface.

The speech quality obtained was not good enough, because the chip accepted only quantized values of the parameters (formant frequencies and bandwidths). The values computed by our algorithms for the formants of the Romanian diphones had to be adjusted to values the formant synthesizer chip accepted. This approximation led to a loss of intelligibility in the resulting voice.
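The effect of feeding computed formant values to a chip that accepts only a fixed set of values can be illustrated as follows. The table below is hypothetical and is not the PCF8200's actual parameter table, and rounding to the nearest allowed value is an assumed adjustment strategy; the text only states that the values had to be adjusted.

```python
# Hypothetical table of allowed first-formant frequencies in Hz.
# Illustration only: NOT the PCF8200's actual parameter table.
ALLOWED_F1_HZ = [200, 300, 400, 500, 600, 700, 800]


def quantize(value, allowed):
    """Return the allowed value closest to the computed one (assumed rule)."""
    return min(allowed, key=lambda v: abs(v - value))


def quantization_errors(values, allowed):
    """Pair each computed value with its quantized version and the error."""
    result = []
    for value in values:
        q = quantize(value, allowed)
        result.append((value, q, q - value))
    return result


# Usage: computed formants of 730 Hz and 260 Hz are forced onto the grid.
report = quantization_errors([730, 260], ALLOWED_F1_HZ)
```

With a coarse grid the error can reach half a step, which shifts formants away from the values measured on the diphones.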

The high quality of speech obtained with the linear prediction algorithm led us to implement it on a digital signal processor, the TMS320C25. This Texas Instruments chip can work in parallel with the computer's processor to synthesize the vocal signal in real time.



9. Experimental results and conclusions

The diphone database was built by semi-automatically cutting the diphones out of words pronounced by a human speaker. The recorded words could be either real words or meaningless phoneme groups. A complex segmentation and analysis algorithm was applied to every diphone to obtain its specific acoustic parameters (energy, the voiced/unvoiced decision, the pitch of voiced sounds, formant frequencies and bandwidths, etc.).
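Three of the parameters listed above (energy, the voiced/unvoiced decision, and pitch) can be estimated per frame with textbook methods, sketched below. The paper does not describe its own analysis algorithm, so the function names, the zero-crossing voicing rule, the thresholds, and the autocorrelation pitch search range are all assumptions chosen for illustration.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def is_voiced(frame, energy_threshold=0.01, zcr_threshold=0.25):
    """Crude voicing rule: enough energy and few zero crossings.

    Both thresholds are hypothetical and would need tuning on real data.
    """
    return (frame_energy(frame) > energy_threshold
            and zero_crossing_rate(frame) < zcr_threshold)


def autocorr_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Pitch in Hz from the autocorrelation peak; 0.0 if no peak is found."""
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0


# Usage: a 50 ms frame of a 100 Hz sine sampled at 8 kHz.
demo_frame = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(400)]
demo_pitch = autocorr_pitch(demo_frame, 8000)
```

Formant frequencies and bandwidths need a spectral or linear prediction model of the frame and are not covered by this sketch.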

Our analysis-synthesis algorithm allows the vocal signal to be decomposed into its basic components and recomposed from them. The small distortion that results is due to the mathematical imperfection of the vocal tract model used in the formant or LPC synthesizer, and to the compression applied when the data are stored.
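For the LPC case, the decomposition step can be sketched with the autocorrelation method and the Levinson-Durbin recursion, which split a frame into predictor coefficients (the vocal tract model) and a prediction residual (the excitation). This is a generic textbook formulation under the convention s[n] ≈ sum_k a_k*s[n-k], not necessarily the algorithm the authors used; all names are illustrative.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a_1..a_order.

    Returns (coefficients, final prediction error energy).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def residual(frame, coeffs):
    """Prediction error e[n] = s[n] - sum_k a_k * s[n-k]."""
    out = []
    for n, s in enumerate(frame):
        pred = sum(c * frame[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(s - pred)
    return out


# Usage: autocorrelation values of a first-order process with a_1 = 0.5.
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

Driving the all-pole filter s[n] = e[n] + sum_k a_k*s[n-k] with the exact residual recovers the frame; distortion appears only once the residual is replaced by a simplified excitation or the parameters are compressed for storage.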

The system allows some characteristics of the vocal signal to be modified in order to obtain prosodic effects (including effects such as whispered, sung, or hoarse sounds).
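In a source-filter synthesizer, effects of this kind are commonly obtained by changing the excitation while leaving the vocal tract parameters untouched. The sketch below shows that general idea only; the paper does not say how its effects were produced, and the function name, the `mode` and `pitch_scale` parameters, and the noise model are assumptions.

```python
import random


def make_excitation(length, sample_rate, pitch_hz, mode="normal",
                    pitch_scale=1.0, seed=0):
    """Build an excitation signal for an LPC or formant filter.

    mode="normal":  impulse train at pitch_hz * pitch_scale (voiced speech;
                    a sung note would fix this product to the note frequency).
    mode="whisper": white noise instead of pulses, so no pitch is perceived.
    All parameter names are hypothetical.
    """
    rng = random.Random(seed)  # seeded for reproducible output
    if mode == "whisper" or pitch_hz <= 0:
        return [rng.gauss(0.0, 1.0) for _ in range(length)]
    period = max(1, round(sample_rate / (pitch_hz * pitch_scale)))
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


# Usage: 25 ms at 8 kHz, first at 100 Hz, then one octave higher.
normal = make_excitation(200, 8000, 100.0)
raised = make_excitation(200, 8000, 100.0, pitch_scale=2.0)
whisper = make_excitation(200, 8000, 100.0, mode="whisper")
```

A hoarse quality could be approximated in the same framework by mixing noise into the pulse train, though that variant is not shown here.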

The naturalness and quality of the generated speech are quite acceptable, but further improvements are necessary. A morphemic analysis for grapheme-to-phoneme conversion has to be tried, and more rules for stress, intonation and pitch have to be implemented.

Synthesis by rule, especially for the Romanian language, is an open research subject.






