Marian Boldea * Speech Technology Research at Computer Science Department, "Politehnica" University of Timișoara
The first of these is a database for concatenative speech synthesis, designed to cover all practically possible Romanian diphones [9] by taking into account phonetic phenomena that occur in connected speech; it is already used in our Romanian text-to-speech system, presented in the next section. As an evaluation companion, speech material to be used as benchmark data in the administration of a Romanian version of the Modified Rhyme Test [11] was also recorded by the same speaker.
The second, and most important, is a database of Romanian continuous speech, whose initially envisaged use was the training, development and evaluation of continuous speech recognizers at the phoneme and word levels, using speaker-independent phoneme models with some context-dependence modelling capability. During its design phase, the need for acoustic phonetics and phonology studies also became very clear to us, and we hope it will be used for that purpose too. It is intended to be compatible with the EUROM-1 speech database [12] and will consist of high-quality recordings from 100 speakers (50 men and 50 women), with equal distributions for each sex across the age ranges under 20, 21-30, 31-40, 41-50, and over 50. Every speaker records mainly read material (four passages of a few logically connected sentences; two or three independent sentences designed to increase the number of occurrences of the least frequent phonemes; three to seven independent sentences automatically selected to increase the number of read diphones; four independent sentences, each designed to elicit, when correctly read, at least one occurrence of every Romanian phoneme; 26 integers between 0 and 9999; the Romanian alphabet), as well as some semi-spontaneously spoken personal information (name, spoken and spelled; series and number of the personal identification document; telephone number; birth date; address). More details on its design, collection and processing can be found in [10]. Sixty-seven speakers (38 men and 29 women) have been recorded so far2, and work is in progress towards the automatic labelling of the signal and its use for continuous speech recognition experiments (see Section 4).
Converting texts into synthetic speech encompasses, on the one hand, a natural language processing (NLP) module able to transform the input text into an appropriate intermediate representation, and on the other, a signal processing (SP) part capable of turning this representation into an output signal. Ideally, the NLP module should "understand" the input text the same way a perfectly trained human reader would, in order to produce not merely phonemic symbols but also information about prosodic (intonation, duration, intensity) and phonetic (coarticulation) phenomena, while the SP component has to apply its own phonetic rules to generate an output signal that sounds as natural as possible. Even with human readers, though, the way a text is read can vary (e.g., poetic or dramatic texts), and contemporary text-to-speech systems are quite limited as far as the incorporation of appropriate linguistic knowledge is concerned. Consequently, a typical structure was chosen for our system, as illustrated in Fig. 1.
The first four blocks in the diagram [8,9] are the constituents of the NLP component, while the last two are actually collapsed into a single overlap-and-add concatenative synthesiser which uses the database presented in Section 1 and allows phonetic phenomena to be accounted for through diphone segmentation [9,13].
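The core operation of an overlap-and-add concatenative synthesiser is to join stored waveform units (here, diphones) while smoothing the transitions between them. The following sketch illustrates the principle only; it is not the system's actual implementation, and the linear cross-fade and function name are our own illustrative assumptions.

```python
import numpy as np

def ola_concatenate(units, overlap):
    """Join waveform units by overlap-and-add with a linear cross-fade.

    units:   list of 1-D numpy arrays (e.g. diphone waveforms)
    overlap: number of samples over which consecutive units are blended
    """
    out = units[0].astype(float)
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    for unit in units[1:]:
        unit = unit.astype(float)
        # Blend the tail of the signal so far with the head of the next unit.
        out[-overlap:] = out[-overlap:] * fade_out + unit[:overlap] * fade_in
        out = np.concatenate([out, unit[overlap:]])
    return out
```

Joining two 100-sample units with a 20-sample overlap yields a 180-sample signal; because the fade-in and fade-out weights sum to one, a steady signal passes through the join unchanged.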
The text normalisation stage replaces numbers and abbreviations with full orthographic forms, the spell-checking stage tries to spot mistyped words in order to prevent failures of the grapheme-to-phoneme translation, and all of these first three stages use rules and dictionaries to implement their tasks [8].
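A dictionary-and-rule normalisation stage of this kind can be sketched as follows. The abbreviation entries and digit names below are toy examples of our own (with Romanian diacritics omitted); the system's actual rules and dictionaries [8] are more extensive.

```python
# Hypothetical illustrative entries, not the system's actual dictionaries.
ABBREVIATIONS = {"dl.": "domnul", "str.": "strada"}
DIGIT_NAMES = ["zero", "unu", "doi", "trei", "patru",
               "cinci", "sase", "sapte", "opt", "noua"]

def normalise(text):
    """Expand abbreviations and single-digit numbers into orthographic words."""
    words = []
    for token in text.split():
        if token.lower() in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token.lower()])
        elif token.isdigit() and len(token) == 1:
            words.append(DIGIT_NAMES[int(token)])
        else:
            words.append(token)
    return " ".join(words)
```

For example, "dl. Popescu are 3 copii" would be expanded to "domnul Popescu are trei copii" before grapheme-to-phoneme translation.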
The prosodic component [9] makes
three passes (prosodic parsing and phoneme duration assignment,
duration adjustments according to the prosodic structure, and
fundamental frequency contour generation) over the representation
it receives, and transmits to the synthesis module the appropriate
control parameters.
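The three prosodic passes can be pictured as successive transformations over a phoneme-level representation. The sketch below uses toy rules (intrinsic durations, phrase-final lengthening, a declining F0 line) and invented names purely to illustrate the pass structure; the actual rules of the prosodic component [9] are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Phone:
    symbol: str
    duration_ms: float = 0.0   # set in pass 1, adjusted in pass 2
    f0_hz: float = 0.0         # set in pass 3

def assign_durations(phones, base_ms=80.0):
    # Pass 1: intrinsic duration per phoneme class (toy rule: vowels longer).
    for p in phones:
        p.duration_ms = base_ms * (1.5 if p.symbol in "aeiou" else 1.0)
    return phones

def adjust_for_structure(phones, final_lengthening=1.3):
    # Pass 2: lengthen the phrase-final phoneme (toy prosodic adjustment).
    if phones:
        phones[-1].duration_ms *= final_lengthening
    return phones

def generate_f0(phones, start_hz=120.0, end_hz=90.0):
    # Pass 3: a simple declining fundamental frequency contour.
    n = max(len(phones) - 1, 1)
    for i, p in enumerate(phones):
        p.f0_hz = start_hz + (end_hz - start_hz) * i / n
    return phones
```

After the three passes, each phoneme carries the duration and F0 targets that a synthesis module would receive as control parameters.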
Experimental intelligibility evaluations of the system, using a Modified Rhyme Test [11] adapted to Romanian [9] and natural benchmark signals recorded by the same speaker as the diphone database, although limited to 6 subjects, demonstrated good performance (90.89% average intelligibility for synthetic speech vs. 98.21% for natural speech). No attempt has been made to evaluate the system from the viewpoint of the naturalness of the synthesised speech.
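In a Modified Rhyme Test, listeners identify each spoken word from a set of rhyming alternatives, and intelligibility is reported as the percentage of items identified correctly. The one-line sketch below shows this basic percent-correct computation; whether the figures above include any further correction (e.g. for guessing) is not specified here, so this is only the simplest reading of the score.

```python
def mrt_intelligibility(responses, answers):
    """Percentage of Modified Rhyme Test items identified correctly."""
    correct = sum(r == a for r, a in zip(responses, answers))
    return 100.0 * correct / len(answers)
```

For instance, a listener who identifies 2 of 3 items correctly scores about 66.67%.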
2 As of January 20, 1997.