- As mentioned, synthesis
based on waveform coding is the method in which words or phrases
of human speech are stored (usually in digital form) and the desired
speech is synthesised by reading out and connecting the appropriate
units. Synthesis systems based on this method are simple and can
provide high-quality speech. However, this quality requires
64-128 kb/s to represent digitally sampled speech; thus, the approach
is satisfactory mainly if the message set is small and not likely
to change. From the research point of view, little knowledge
of speech and language is needed, and only a modest amount of
signal processing is required.
In this method, the quality of
the synthesised speech is still influenced by the continuity of
acoustic features at the connections between units.
Since the pitch pattern of each word changes according to its
position in different sentences, it is probably necessary to store
variations of the same word with rising, flat and falling inflection;
but even if several variations of the same word are stored, the
sentence stress pattern, rhythm and intonation, which depend
on syntax and semantics, are very unnatural when words are simply
concatenated.
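To make the unit-connection step concrete, the following minimal sketch (illustrative Python, not from this paper; the sampling rate, fade length and unit inventory are our assumptions) concatenates stored waveform units with a short linear cross-fade at each junction, one simple way to smooth the acoustic seam; it does nothing for the prosodic mismatch just described.

```python
import numpy as np

def concat_units(units, fs=16000, fade_ms=10):
    """Concatenate stored waveform units, cross-fading each junction.

    `units`: list of 1-D NumPy arrays holding pre-recorded words/phrases
    (each assumed longer than the fade region). A linear cross-fade over
    `fade_ms` milliseconds reduces the discontinuity at each join; it
    cannot repair sentence stress, rhythm or intonation.
    """
    n = int(fs * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, n)
    out = units[0].astype(float)
    for u in (v.astype(float) for v in units[1:]):
        out[-n:] = out[-n:] * (1.0 - ramp) + u[:n] * ramp  # blend the overlap
        out = np.concatenate([out, u[n:]])
    return out
```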
In order to reduce memory
requirements, the units are often compressed by waveform coding
methods such as ADPCM ("Adaptive Differential Pulse-Code
Modulation"); the required encoding and decoding algorithms
can easily be implemented on a single integrated circuit, and high-quality
speech at 16-32 kb/s can be obtained.
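The ADPCM idea can be sketched as follows. This is a deliberately simplified toy, not the standardised IMA or ITU algorithm: a first-order predictor with an adaptive quantiser step, where the decoder reconstructs the signal from the small integer codes alone; the initial step and adaptation factors are illustrative assumptions.

```python
import numpy as np

def adpcm_encode_decode(x, bits=4):
    """Toy ADPCM: quantise the prediction error with an adaptive step size.

    Returns the integer codes (what would be stored/transmitted) and the
    reconstruction the decoder would produce from those codes alone.
    """
    qmax = 2 ** (bits - 1) - 1
    step, pred = 16.0, 0.0              # assumed initial step and predictor state
    codes, decoded = [], np.empty(len(x))
    for n, s in enumerate(x):
        code = int(np.clip(round((s - pred) / step), -qmax - 1, qmax))
        codes.append(code)
        pred += code * step             # the decoder applies this identical update
        decoded[n] = pred
        # adapt the step: expand after large codes, contract after small ones
        step = float(np.clip(step * (1.5 if abs(code) > qmax // 2 else 0.9),
                             1.0, 2048.0))
    return codes, decoded
```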
- In synthesis based on the
analysis-synthesis method, words of human speech are usually
analysed according to a speech production model and stored as time
sequences of feature parameters. The parameter sequences of the appropriate
units are connected and supplied to a speech synthesizer to produce
the desired spoken message.
The source-filter model,
based on the acoustic theory of speech production, is widely used here:
the acoustic generation of the waveform is modelled as the excitation
of a linear filter by an appropriate excitation function. The
filter that simulates the vocal tract is characterised in
terms of its resonances, the frequencies of which are called formants.
These formants can be computed by Fourier transform spectral analysis
or by linear predictive coding (LPC) analysis, both widely used
techniques.
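The LPC route can be sketched briefly (illustrative Python; the model order, window and plausibility thresholds are conventional assumptions, not taken from this paper). The autocorrelation method yields the prediction polynomial A(z); the angles of its complex poles give candidate formant frequencies and the pole radii give their bandwidths.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, fs, order=12):
    """Estimate formant frequencies (Hz) of one speech frame via LPC."""
    frame = frame * np.hamming(len(frame))            # taper the frame edges
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])     # autocorrelation normal equations
    roots = np.roots(np.concatenate(([1.0], -a)))     # poles of 1/A(z)
    roots = roots[np.imag(roots) > 0]                 # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)        # pole angle -> frequency
    bands = -fs / np.pi * np.log(np.abs(roots))       # pole radius -> bandwidth
    keep = (freqs > 90) & (bands < 400)               # plausible formants only
    return np.sort(freqs[keep])
```

Applied to successive short frames, this produces exactly the kind of slowly varying parameter tracks described next.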
Speech production models allow
the speech waveform to be generated from a
small set of parameters that vary slowly in time. Formant- and
LPC-based models are typically controlled by a set of parameters
that is updated every 10 to 30 ms. Each set of parameter values
is called a frame, and a speech utterance can be represented
by a succession of such frames, which are then interpreted by
the corresponding speech production model. The two basic techniques
used here are the following:
- The formant-type speech
synthesis technique
(or the terminal-analogue technique), which simulates the
speech production mechanism using an electrical structure consisting
of the cascade or parallel connection of several resonant (formant)
and antiresonant (antiformant) circuits, each formant being encoded
by its centre frequency, amplitude and bandwidth; the typical
application is the formant vocoder (see the resonator sketch
after this list).
- The synthesis technique
based on linear prediction,
in which the next sample of a speech waveform is estimated
as a linear combination of several previous samples; the algorithm
calculates the model coefficients so as to minimise the total
error between the original signal and the predicted signal over
each predetermined signal frame. The typical control parameters
of the speech synthesis circuit are the linear predictor coefficients,
the voiced/unvoiced decision, the pulse amplitude and, for voiced
signals, the fundamental period of the voice source. The LPC method
allows the natural speech samples to be modified in fundamental
frequency, amplitude and duration (see the LPC synthesis sketch
after this list).
Although this method can reduce
the bit rate to around 2.4 kb/s, the quality of the synthesised speech
is limited, and in recent years other versions of this basic
model have been investigated for bit rates from 2.4 to 9.6 kb/s;
these include the residual-excited, multi-pulse, mixed-excitation
(MELP) and code-excited (CELP) LPC methods.
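The resonator sketch referenced above: a cascade of second-order digital resonators, with coefficients in the style of Klatt's formant synthesizer, excited by a crude impulse-train source. The formant and bandwidth values are assumed, roughly /a/-like, purely for illustration.

```python
import numpy as np
from scipy.signal import lfilter

def resonator(x, f, bw, fs):
    """One formant: second-order IIR resonator (unity gain at DC)."""
    t = 1.0 / fs
    c = -np.exp(-2 * np.pi * bw * t)
    b = 2 * np.exp(-np.pi * bw * t) * np.cos(2 * np.pi * f * t)
    return lfilter([1 - b - c], [1, -b, -c], x)

def formant_synth(f0, formants, bws, dur=0.4, fs=10000):
    """Cascade formant synthesis: impulse-train source through the resonators."""
    exc = np.zeros(int(dur * fs))
    exc[::int(fs / f0)] = 1.0                     # crude glottal source at F0
    y = exc
    for f, bw in zip(formants, bws):
        y = resonator(y, f, bw, fs)
    return y / np.max(np.abs(y))

# assumed /a/-like formants: F1 ~ 700 Hz, F2 ~ 1200 Hz, F3 ~ 2600 Hz
vowel = formant_synth(f0=120, formants=[700, 1200, 2600], bws=[80, 90, 120])
```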
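And the LPC synthesis sketch: one frame is re-synthesised by driving the all-pole filter 1/A(z) with either a pulse train at the fundamental period (voiced) or white noise (unvoiced), scaled by a gain, exactly the parameter set listed above. Frame length and gain handling are simplified assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synth_frame(a, gain, f0, fs=8000, n=200, voiced=True):
    """Re-synthesise one frame from LPC parameters: excitation through 1/A(z).

    `a` holds the predictor coefficients (s[n] ~ sum a_k * s[n-k]), so the
    synthesis filter denominator is A(z) = 1 - sum a_k z^{-k}.
    """
    if voiced:
        exc = np.zeros(n)
        exc[::int(fs / f0)] = 1.0       # pulse train at the fundamental period
    else:
        exc = np.random.randn(n)        # white noise for unvoiced frames
    return lfilter([gain], np.concatenate(([1.0], -np.asarray(a))), exc)
```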
- Segmental synthesis-by-rules
is a method for producing any word or sentence from sequences
of phonetic symbols or letters. In this method, feature parameters
for fundamental small units of speech, such as morphemes, syllables,
diphones or even phonemes, are stored and connected by rules.
At the same time, prosodic features such as pitch and amplitude
are also controlled by rules. The quality of the fundamental units
for synthesis, as well as the control rules (control information and
control mechanism) for the acoustic parameters, plays a crucial
role in this method; both must be based on the phonetic and linguistic
characteristics of natural speech. Furthermore, to produce natural
and distinct speech, the temporal transitions of pitch, stress and
spectrum must be smoothed, and other features such as pause location
and duration must be appropriate. Feature parameters for the fundamental
units are usually extracted from natural speech.
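A schematic sketch of the parameter-domain connection this method relies on (illustrative only; the unit inventory, frame layout and overlap length are hypothetical): per-unit parameter tracks, for example LPC vectors for stored diphones, are joined with a short linear interpolation at each boundary, standing in for the far richer transition-smoothing rules described above.

```python
import numpy as np

def join_parameter_tracks(units, overlap=3):
    """Connect per-unit parameter tracks (frames x params), smoothing joins.

    `units` is a list of 2-D arrays, one row per frame (e.g. LPC vectors
    for stored diphones, each unit longer than `overlap` frames); each
    junction is linearly interpolated over `overlap` frames, a stand-in
    for genuine rule-based smoothing of pitch, stress and spectrum.
    """
    out = units[0].astype(float)
    for u in (v.astype(float) for v in units[1:]):
        w = np.linspace(0.0, 1.0, overlap)[:, None]
        blend = (1 - w) * out[-overlap:] + w * u[:overlap]  # cross-fade frames
        out = np.vstack([out[:-overlap], blend, u[overlap:]])
    return out
```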