Marian Boldea * Speech Technology Research at Computer Science Department, "Politehnica" University of Timișoara




Pronunciation dictionaries are useful both for application development and for more elaborate speech database designs, and their creation has to be supported by appropriate procedures and dedicated tools. In our laboratory, an ASCII computer-compatible phonetic transcription system was defined for Romanian, starting from the SAMPA phonetic alphabet [6,7]; using these conventions, grapheme-to-phoneme transcription rules for a text-to-speech system [8] were iteratively refined on a lexicon of about 24,000 words, and the corresponding pronunciation dictionary was obtained as an outcome of this process. To facilitate its manipulation and future evolution, an ad hoc maintenance system was built [9], and its utility has already been proved during the design of some components of a larger Romanian speech database [10].
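The grapheme-to-phoneme rules mentioned above can be pictured as an ordered list of context rewrite rules applied left to right. The fragment below is a minimal sketch of this idea; the rule set is hypothetical and does not reproduce the actual Romanian rules of [8], which are context-sensitive and backed by an exceptions dictionary.

```python
# Hypothetical rule fragment: longer graphemes are listed before their
# prefixes so that, e.g., "che" is tried before "ce" and "c".
# SAMPA-style symbols (tS = voiceless affricate) are used for output.
RULES = [
    ("che", "ke"),   # 'ch' before 'e'/'i' is /k/
    ("chi", "ki"),
    ("ce", "tSe"),   # 'c' before 'e'/'i' is an affricate
    ("ci", "tSi"),
    ("c", "k"),
]

def g2p(word):
    """Greedy left-to-right application of ordered rewrite rules."""
    phones = []
    i = 0
    while i < len(word):
        for graph, phone in RULES:
            if word.startswith(graph, i):
                phones.append(phone)
                i += len(graph)
                break
        else:
            phones.append(word[i])  # letter not covered by a rule: copy it
            i += 1
    return " ".join(phones)
```

Iterative refinement then amounts to running such rules over the lexicon, inspecting the mistranscribed entries, and adding or reordering rules until the output stabilises.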

The first of these is a database for concatenative speech synthesis, designed to cover all practically possible Romanian diphones [9] by taking into account phonetic phenomena that occur in connected speech; it is already used in our Romanian text-to-speech system, presented in the next section. As an evaluation companion, speech material to be used as benchmark data in the administration of a Romanian version of the Modified Rhyme Test [11] was also recorded by the same speaker.

The second, and most important, is a database of Romanian continuous speech, initially envisaged for the training, development and evaluation of continuous speech recognizers at the phoneme and word levels, using speaker-independent phoneme models with some context-dependence modelling capability. During its design phase, the need for acoustic phonetics and phonology studies also became very clear to us, and we hope that it will be used for this purpose, too. It is intended to be compatible with the EUROM-1 speech database [12] and will consist of high-quality recordings from 100 speakers (50 men and 50 women), with equal distributions for each sex across the age ranges under 20, 21-30, 31-40, 41-50, and over 50. Every speaker records mainly read material (four passages of a few logically connected sentences; two or three independent sentences designed to increase the number of occurrences of the least frequent phonemes; three to seven independent sentences automatically selected to increase the number of read diphones; four independent sentences, each designed to produce - when correctly read - at least one occurrence of every Romanian phoneme; 26 integer numbers between 0 and 9999; the Romanian alphabet), as well as some semi-spontaneously spoken personal information (name - spoken and spelled; series and number of the personal identification document; telephone number; birth date; address). More details on its design, collection and processing can be found in [10]. Sixty-seven speakers (38 men and 29 women) have been recorded so far², and work is in progress towards the automatic labelling of the signal and its use in continuous speech recognition experiments (see Section 4).
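One plausible way to realise the "automatically selected" sentences mentioned above is greedy set-cover selection: repeatedly pick the candidate sentence that contributes the most diphones not yet covered. The paper does not specify the actual algorithm, so the sketch below is an assumption.

```python
def diphones(phones):
    """All adjacent phoneme pairs occurring in a phoneme sequence."""
    return {(a, b) for a, b in zip(phones, phones[1:])}

def select_sentences(candidates, n):
    """Greedily pick up to n sentences that add the most unseen diphones.
    candidates: list of (sentence, phoneme_list) pairs."""
    covered = set()
    chosen = []
    pool = list(candidates)
    for _ in range(n):
        best = max(pool, key=lambda c: len(diphones(c[1]) - covered),
                   default=None)
        if best is None or not (diphones(best[1]) - covered):
            break  # no remaining sentence adds a new diphone
        chosen.append(best[0])
        covered |= diphones(best[1])
        pool.remove(best)
    return chosen, covered
```

With three to seven sentences per speaker, such a procedure quickly saturates the rarer diphone pairs that the fixed read passages would miss.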



3. Text-to-speech conversion

Given the wide area of applications that can use it, the characteristics of the human auditory system that allow synthetic speech to be understood even when it does not sound natural, and the need for realistic answer-back capabilities during data collection for spoken dialogue systems development, speech synthesis from written text has lately been given special attention in our group.

Converting text into synthetic speech requires, on one side, a natural language processing (NLP) module able to transform the input text into an appropriate intermediate representation and, on the other side, a signal processing (SP) part capable of turning this representation into an output signal. Ideally, the NLP module should "understand" the input text the same way a perfectly trained human reader would, in order to produce not merely phonemic symbols, but also information about prosodic (intonation, duration, intensity) and phonetic (coarticulation) phenomena, and the SP component has to apply its own phonetic rules to generate an output signal that sounds as natural as possible. Even with human readers, though, the way a text is read can vary (e.g., poetic or dramatic texts), so contemporary text-to-speech systems are quite limited as far as the incorporation of appropriate linguistic knowledge is concerned. Consequently, a typical structure was chosen for our system, as illustrated in Fig. 1.
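The two-sided architecture can be pictured as a chain of stages, each mapping one representation to the next: text in, phonemes plus prosodic markup out of the NLP front end, waveform out of the SP back end. The stage names and data shapes below are purely illustrative, not the actual interfaces of our system.

```python
def nlp_front_end(text, stages):
    """Apply NLP stages in order; each maps one representation to the next."""
    rep = text
    for stage in stages:
        rep = stage(rep)
    return rep

# Hypothetical stand-in stages, each a plain function:
normalise = lambda t: t.lower()                    # e.g. expand "Dr." etc.
to_phonemes = lambda t: list(t.replace(" ", ""))   # stand-in for real G2P
add_prosody = lambda p: {"phones": p,
                         "f0": "declining",        # intonation contour label
                         "durations": [80] * len(p)}  # ms per phoneme

rep = nlp_front_end("ab ba", [normalise, to_phonemes, add_prosody])
```

The intermediate representation `rep` is then the only interface between the linguistic and the signal-processing halves, which is what lets each side evolve independently.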

The first four blocks in the diagram [8,9] are the constituents of the NLP component, while the last two are actually collapsed into a single overlap-and-add concatenative synthesiser, which uses the database presented in Section 1 and allows phonetic phenomena to be accounted for through diphone segmentation [9,13]. The text normalisation stage replaces numbers and abbreviations with full orthographic forms, and the spell-checking stage tries to spot mistyped words in order to prevent failures of the grapheme-to-phoneme translation; all of the first three stages use rules and dictionaries to implement their tasks [8]. The prosodic component [9] makes three passes over the representation it receives (prosodic parsing and phoneme duration assignment, duration adjustments according to the prosodic structure, and fundamental frequency contour generation), and transmits the appropriate control parameters to the synthesis module.
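The core of overlap-and-add concatenation is joining consecutive diphone waveforms by cross-fading over a short overlap region, which smooths the junctions between units. The sketch below uses plain Python lists as stand-ins for audio sample arrays and a simple linear cross-fade; the actual synthesiser's windowing and pitch-synchronous details are not reproduced here.

```python
def overlap_add(units, overlap):
    """Concatenate sample lists, cross-fading `overlap` samples at each join."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # linear fade-in weight for new unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])       # append the non-overlapped remainder
    return out
```

For example, joining two units of 4 samples with a 2-sample overlap yields 6 output samples, with the two junction samples interpolated between the units.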

Experimental intelligibility evaluations of the system, using a Modified Rhyme Test [11] adapted to Romanian [9] and natural benchmark signals recorded by the same speaker as for the diphone database, demonstrated good performance (90.89% average intelligibility for synthetic speech vs. 98.21% for natural speech), although they were limited to 6 subjects. No attempt has been made to evaluate the naturalness of the synthesised speech.
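The intelligibility percentages above follow directly from Modified Rhyme Test scoring: in each trial the subject hears one word and picks it from a set of rhyming candidates, and the score is the proportion of correct choices. A minimal sketch, with made-up response data:

```python
def mrt_score(trials):
    """trials: list of (spoken_word, chosen_word) pairs -> percent correct."""
    correct = sum(1 for spoken, chosen in trials if spoken == chosen)
    return 100.0 * correct / len(trials)

# Hypothetical responses from one subject: 3 of 4 trials correct.
responses = [("pat", "pat"), ("bat", "pat"), ("cat", "cat"), ("rat", "rat")]
score = mrt_score(responses)
```

Averaging such per-subject scores over the 6 listeners gives figures of the kind reported (90.89% synthetic vs. 98.21% natural).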



² January 20, 1997.

