- As mentioned, synthesis
based on waveform coding is the method in which words or phrases
of human speech are stored (usually in digital form) and the desired
speech is synthesised by reading out and connecting the appropriate
units. Synthesis systems based on this method are simple and can
provide high-quality speech. However, this quality requires
64-128 kb/s to represent digitally sampled speech; thus, the approach
is satisfactory mainly if the message set is small and not likely
to change. From the research point of view, little knowledge
of speech and language is needed, and only a modest amount of
signal processing is required.
In this method, the quality of
the synthesised speech is still influenced by the continuity of
acoustic features at the connections between units.
Since the pitch pattern of each word changes according to its
position in different sentences, it is probably necessary to store
variations of the same word with rising, flat and falling inflection;
but even if several variations of the same word are stored, the
sentence stress pattern, rhythm and intonation, which depend
on syntax and semantics, are very unnatural when words are simply
concatenated.
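To make the unit-connection step concrete, the following minimal sketch (illustrative Python, not from this paper; the sampling rate, fade length and unit inventory are our assumptions) concatenates stored waveform units with a short linear cross-fade at each junction, one simple way to smooth the acoustic seam; it does nothing for the prosodic mismatch just described.

```python
import numpy as np

def concat_units(units, fs=16000, fade_ms=10):
    """Concatenate stored waveform units, cross-fading each junction.

    `units`: list of 1-D NumPy arrays holding pre-recorded words/phrases
    (each assumed longer than the fade region). A linear cross-fade over
    `fade_ms` milliseconds reduces the discontinuity at each join; it
    cannot repair sentence stress, rhythm or intonation.
    """
    n = int(fs * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, n)
    out = units[0].astype(float)
    for u in (v.astype(float) for v in units[1:]):
        out[-n:] = out[-n:] * (1.0 - ramp) + u[:n] * ramp  # blend the overlap
        out = np.concatenate([out, u[n:]])
    return out
```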
In order to reduce memory
requirements, the units are often compressed by waveform coding
methods such as ADPCM ("Adaptive Differential Pulse-Code
Modulation"); the required encoding and decoding algorithms
can easily be implemented on a single integrated circuit, and high-quality
speech at 16-32 kb/s can be obtained.
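The ADPCM idea can be sketched as follows. This is a deliberately simplified toy, not the standardised IMA or ITU algorithm: a first-order predictor with an adaptive quantiser step, where the decoder reconstructs the signal from the small integer codes alone; the initial step and adaptation factors are illustrative assumptions.

```python
import numpy as np

def adpcm_encode_decode(x, bits=4):
    """Toy ADPCM: quantise the prediction error with an adaptive step size.

    Returns the integer codes (what would be stored/transmitted) and the
    reconstruction the decoder would produce from those codes alone.
    """
    qmax = 2 ** (bits - 1) - 1
    step, pred = 16.0, 0.0              # assumed initial step and predictor state
    codes, decoded = [], np.empty(len(x))
    for n, s in enumerate(x):
        code = int(np.clip(round((s - pred) / step), -qmax - 1, qmax))
        codes.append(code)
        pred += code * step             # the decoder applies this identical update
        decoded[n] = pred
        # adapt the step: expand after large codes, contract after small ones
        step = float(np.clip(step * (1.5 if abs(code) > qmax // 2 else 0.9),
                             1.0, 2048.0))
    return codes, decoded
```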
- In synthesis based on the
analysis-synthesis method, words of human speech are usually
analysed according to a speech production model and stored as time
sequences of feature parameters. The parameter sequences of the appropriate
units are connected and supplied to a speech synthesizer to produce
the desired spoken message.
The source-filter model,
based on the acoustic theory of speech production, is widely used here:
the acoustic generation of the waveform is modelled as the excitation
of a linear filter by an appropriate excitation function. The
filter that simulates the vocal tract is characterised in
terms of its resonances, the frequencies of which are called formants.
These formants can be computed by Fourier transform spectral analysis
or by linear predictive coding (LPC) analysis, both widely used
techniques.
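The LPC route can be sketched briefly (illustrative Python; the model order, window and plausibility thresholds are conventional assumptions, not taken from this paper). The autocorrelation method yields the prediction polynomial A(z); the angles of its complex poles give candidate formant frequencies and the pole radii give their bandwidths.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, fs, order=12):
    """Estimate formant frequencies (Hz) of one speech frame via LPC."""
    frame = frame * np.hamming(len(frame))            # taper the frame edges
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])     # autocorrelation normal equations
    roots = np.roots(np.concatenate(([1.0], -a)))     # poles of 1/A(z)
    roots = roots[np.imag(roots) > 0]                 # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)        # pole angle -> frequency
    bands = -fs / np.pi * np.log(np.abs(roots))       # pole radius -> bandwidth
    keep = (freqs > 90) & (bands < 400)               # plausible formants only
    return np.sort(freqs[keep])
```

Applied to successive short frames, this produces exactly the kind of slowly varying parameter tracks described next.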
Speech production models allow
the speech waveform to be generated from a
small set of parameters that vary slowly in time. Formant- and
LPC-based models are typically controlled by a set of parameters
that is updated every 10 to 30 ms. Each set of parameter values
is called a frame, and a speech utterance can be represented
by a succession of such frames, which are then interpreted by
the corresponding speech production model. The two basic techniques
used here are the following:
- The formant-type speech
synthesis technique
(or the terminal-analogue technique), which simulates the
speech production mechanism using an electrical structure consisting
of the cascade or parallel connection of several resonant (formant)
and antiresonant (antiformant) circuits, each formant being encoded
by its centre frequency, amplitude and bandwidth; the typical
application is the formant vocoder (see the resonator sketch
after this list).
- The synthesis technique
based on linear prediction,
in which the next sample of a speech waveform is estimated
as a linear combination of several previous samples; the algorithm
calculates the model coefficients so as to minimise the total
error between the original signal and the predicted signal over
each predetermined signal frame. The typical control parameters
of the speech synthesis circuit are the linear predictor coefficients,
the voiced/unvoiced decision, the pulse amplitude and, for voiced
signals, the fundamental period of the voice source. The LPC method
allows the natural speech samples to be modified in fundamental
frequency, amplitude and duration (see the LPC synthesis sketch
after this list).
Although this method can reduce
the bit rate to around 2.4 kb/s, the quality of the synthesised speech
is limited, and in recent years other versions of this basic
model have been investigated for bit rates from 2.4 to 9.6 kb/s;
these include the residual-excited, multi-pulse, mixed-excitation
(MELP) and code-excited (CELP) LPC methods.
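The resonator sketch referenced above: a cascade of second-order digital resonators, with coefficients in the style of Klatt's formant synthesizer, excited by a crude impulse-train source. The formant and bandwidth values are assumed, roughly /a/-like, purely for illustration.

```python
import numpy as np
from scipy.signal import lfilter

def resonator(x, f, bw, fs):
    """One formant: second-order IIR resonator (unity gain at DC)."""
    t = 1.0 / fs
    c = -np.exp(-2 * np.pi * bw * t)
    b = 2 * np.exp(-np.pi * bw * t) * np.cos(2 * np.pi * f * t)
    return lfilter([1 - b - c], [1, -b, -c], x)

def formant_synth(f0, formants, bws, dur=0.4, fs=10000):
    """Cascade formant synthesis: impulse-train source through the resonators."""
    exc = np.zeros(int(dur * fs))
    exc[::int(fs / f0)] = 1.0                     # crude glottal source at F0
    y = exc
    for f, bw in zip(formants, bws):
        y = resonator(y, f, bw, fs)
    return y / np.max(np.abs(y))

# assumed /a/-like formants: F1 ~ 700 Hz, F2 ~ 1200 Hz, F3 ~ 2600 Hz
vowel = formant_synth(f0=120, formants=[700, 1200, 2600], bws=[80, 90, 120])
```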
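And the LPC synthesis sketch: one frame is re-synthesised by driving the all-pole filter 1/A(z) with either a pulse train at the fundamental period (voiced) or white noise (unvoiced), scaled by a gain, exactly the parameter set listed above. Frame length and gain handling are simplified assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synth_frame(a, gain, f0, fs=8000, n=200, voiced=True):
    """Re-synthesise one frame from LPC parameters: excitation through 1/A(z).

    `a` holds the predictor coefficients (s[n] ~ sum a_k * s[n-k]), so the
    synthesis filter denominator is A(z) = 1 - sum a_k z^{-k}.
    """
    if voiced:
        exc = np.zeros(n)
        exc[::int(fs / f0)] = 1.0       # pulse train at the fundamental period
    else:
        exc = np.random.randn(n)        # white noise for unvoiced frames
    return lfilter([gain], np.concatenate(([1.0], -np.asarray(a))), exc)
```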
- Segmental synthesis-by-rules
is a method for producing any word or sentence from sequences
of phonetic symbols or letters. In this method, feature parameters
for fundamental small units of speech, such as morphemes, syllables,
diphones or even phonemes, are stored and connected by rules.
At the same time, prosodic features such as pitch and amplitude
are also controlled by rules. The quality of the fundamental units
for synthesis, as well as the control rules (control information and
control mechanism) for the acoustic parameters, plays a crucial
role in this method; both must be based on the phonetic and linguistic
characteristics of natural speech. Furthermore, to produce natural
and distinct speech, the temporal transitions of pitch, stress and
spectrum must be smoothed, and other features such as pause location
and duration must be appropriate. Feature parameters for the fundamental
units are usually extracted from natural speech.
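A schematic sketch of the parameter-domain connection this method relies on (illustrative only; the unit inventory, frame layout and overlap length are hypothetical): per-unit parameter tracks, for example LPC vectors for stored diphones, are joined with a short linear interpolation at each boundary, standing in for the far richer transition-smoothing rules described above.

```python
import numpy as np

def join_parameter_tracks(units, overlap=3):
    """Connect per-unit parameter tracks (frames x params), smoothing joins.

    `units` is a list of 2-D arrays, one row per frame (e.g. LPC vectors
    for stored diphones, each unit longer than `overlap` frames); each
    junction is linearly interpolated over `overlap` frames, a stand-in
    for genuine rule-based smoothing of pitch, stress and spectrum.
    """
    out = units[0].astype(float)
    for u in (v.astype(float) for v in units[1:]):
        w = np.linspace(0.0, 1.0, overlap)[:, None]
        blend = (1 - w) * out[-overlap:] + w * u[:overlap]  # cross-fade frames
        out = np.vstack([out[:-overlap], blend, u[overlap:]])
    return out
```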