Speech Technology Research at Computer Science Department, "Politehnica" University of Timişoara

Marian Boldea


1. Introduction

Begun more than half a century ago, research on building devices capable of artificial speech recognition or generation has become increasingly prominent as computing systems appeared and developed. The present spread of these systems into practically every human activity raises, among other issues, that of making user interfaces as comfortable as possible. If the information input or output rate is considered alongside comfort, spoken language emerges as one of the main options.

The efforts towards building human-computer voice interfaces have now reached a stage where, for a few languages, existing commercial products can automatically recognise speech from vocabularies of thousands to tens of thousands of words, uttered with short pauses between them, in a speaker-dependent or speaker-adaptive manner, and can synthesise speech from a wide range of input texts.

The speed at which this area is evolving makes us believe that speech technology must be a priority of Romanian research, and the projects undertaken by our group¹, briefly described in what follows, are attempts to materialise this belief.

A first and essential step towards spoken language computer interfaces is the design and collection of appropriate linguistic resources (text archives, pronunciation dictionaries, and speech databases) aimed at facilitating fundamental research in phonetics and phonology, as well as application development. Section 2 gives some details of our work in this area. Text-to-speech conversion, described in Section 3, is an essential component of a conversational computer system, together with speech recognition, which is the topic of Section 4. The last section draws some conclusions from the work so far and outlines future plans.



2. Linguistic resources

From a strictly technological point of view, building automatic speech recognition or synthesis systems depends on the availability of the largest possible quantities of linguistic material, in both written (text archives, lexica, pronunciation dictionaries) and spoken (speech databases) form, on computer-readable media (nowadays usually CD-ROM). In a broader view, these resources are also an essential condition for more fundamental research in phonetics and phonology, and their development should incorporate to a great extent the existing linguistic expertise whose expansion they are supposed to facilitate. Although they are needed before applications can be built, all these resources will continue to evolve together with the applications. The importance of this process was best put by Robert Mercer [1]: "There is no data like more data". Our activities in this area illustrate well this joint evolution of resources and applications.

To explore the use of statistical modelling methods (hidden Markov models [2]) in speech recognition [3] and the impact of speech-processing methods on their performance [4], the first linguistic resource we collected was a small speech database [5]. It consists of 300 signal files containing isolated Romanian utterances of the digits 0 to 9, repeated three times by 100 speakers (50 men and 50 women), plus a file recording, for each speaker, personal data on factors that influenced or could have influenced their voice quality: sex, height, weight, age, mother tongue, smoking habits, etc. Organised into a training set and a test set of 68 and 32 speakers respectively, and partly hand-labelled at the word level, this database has been used for speech recognition studies, but its use is by no means limited to them, and we can make it available to other interested laboratories.
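The arithmetic behind this organisation can be illustrated with a minimal sketch. It assumes that each of the 300 signal files holds one repetition of the digits 0 to 9 by one speaker (100 speakers x 3 repetitions), and that the 68/32 partition is made by speaker, so that no speaker appears in both sets. The file naming scheme, the speaker numbering and the rule assigning the first 68 speakers to training are hypothetical and serve only as illustration; the actual layout of the database is not described here.

```python
from dataclasses import dataclass

N_SPEAKERS = 100        # from the text: 50 men and 50 women
N_REPETITIONS = 3       # from the text: each speaker repeats the digits three times
N_TRAIN_SPEAKERS = 68   # from the text: 68 training and 32 test speakers


@dataclass(frozen=True)
class SignalFile:
    speaker: int     # hypothetical speaker number, 1..100
    repetition: int  # repetition number, 1..3

    @property
    def name(self) -> str:
        # Hypothetical naming scheme, not the one used by the real database.
        return f"spk{self.speaker:03d}_rep{self.repetition}.wav"


def build_index():
    """List one signal file per (speaker, repetition) pair."""
    return [
        SignalFile(speaker, repetition)
        for speaker in range(1, N_SPEAKERS + 1)
        for repetition in range(1, N_REPETITIONS + 1)
    ]


def split_by_speaker(files, n_train=N_TRAIN_SPEAKERS):
    """Assumed rule: the first n_train speakers train, the rest test."""
    train = [f for f in files if f.speaker <= n_train]
    test = [f for f in files if f.speaker > n_train]
    return train, test


files = build_index()
train, test = split_by_speaker(files)
print(len(files), len(train), len(test))  # 300 204 96
```

Under these assumptions the training set holds 204 files and the test set 96. Splitting by speaker rather than by file keeps the test speakers unseen during training, which is the usual way to evaluate speaker-independent recognition.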



1 This paper is based in part on activities funded by the European Commission through Contract Copernicus 1304/1994 and the Romanian Ministry of Education through Contracts 4004/1995 and 5004/1996, implementing NURC Grants 56/1995, 354/1996 and 355/1996.
