The second point to make is that we have made a fundamental and, I believe, permanent change in the way we train computers. In earlier times, it seemed obvious that we humans, having learned the principal characteristics of the speech sounds to be recognised, must then instruct the computer in how to perform the same task: such systems were known as knowledge-based systems. More recently, however, we have been able to work with computer systems that learn for themselves what characteristics distinguish the various units of speech. In the case of knowledge-based systems, the contribution of the speech scientist was to provide the engineer with the best possible descriptions of the data; in self-teaching systems, that contribution is completely different. Since this is a matter of great concern for present-day research, let us look at it in some detail.

There are two main computational techniques which could be called self-teaching. One makes use of Hidden Markov Models (see for example [8]), a technique which essentially exploits transitional probabilities to construct automatically a statistical model of each unit to be recognised. This technique has been of enormous value in improving the design and performance of recognition systems. The other is that of Artificial Neural Networks [14], a technique which has fascinated academic researchers with its promise of computational simulation of the interaction between neurons in a nervous system. Although many people have questioned the appropriateness of equating this computational technique with the functioning of real nervous systems, there are many striking parallels [15].

The two techniques have many points in common, but the most important is that in order to learn, they must be given very large amounts of appropriate training data [16]. If we look at how a human child learns to understand speech, we can see that the process is one of repeated exposure to the data, day after day, with regular feedback on whether understanding has taken place correctly; there is no sudden transition in the child's learning equivalent to the moment when a complex computer program begins to perform correctly. Training a recogniser is similar: the computer must be provided with very large bodies of carefully prepared data, so that it becomes familiar with each unit it must learn (such as a phoneme or syllable), in every context in which that unit may occur, spoken by a large and representative set of speakers. If the data is badly prepared, the learning will never be successful.

As a result, there has been enormous growth in the development of speech databases to be used for training (and later for testing) recognition systems. These databases comprise carefully controlled recordings of many speakers (sometimes made under a variety of conditions), together with expert transcriptions produced in such a way that the computer can link each symbol in the transcription to a particular part of the sound recording [17, 18, 19, 20]. Any present-day attempt to develop an effective speech recognition system therefore requires a suitable speech database, and if such a database does not exist, it must be created. For the languages which have been extensively worked on (such as English, French, German and Japanese), general databases already exist, and much effort is going into constructing more specialised databases of particular kinds of speech (for example, in my laboratory we are working on a database of emotional speech). At the same time, there is strong growth in the compilation of speech databases for languages which have received less attention in the past. This is the main reason for the existence of the BABEL project, a three-year project funded by the European Union (COPERNICUS Project #1304), which brings together speech technology and speech science specialists in many different European countries and is putting together a database of languages of Central and Eastern Europe [21].
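To make the idea of exploiting transitional probabilities concrete, the following minimal Python sketch scores an observation sequence against a discrete Hidden Markov Model using the standard forward algorithm. The two hidden states, three observation symbols and all probability values are invented purely for illustration; a real recogniser would train one model per unit and choose the model that assigns the highest likelihood to the input.

```python
# Minimal sketch of scoring an observation sequence with a discrete HMM.
# All states, symbols and probabilities are invented for illustration.

# Model for one hypothetical speech unit: 2 hidden states, 3 observation symbols.
initial = [0.7, 0.3]                    # P(state at time 0)
transition = [[0.8, 0.2],               # P(next state | current state)
              [0.3, 0.7]]
emission = [[0.5, 0.4, 0.1],            # P(observed symbol | state)
            [0.1, 0.3, 0.6]]

def forward_likelihood(observations):
    """Return P(observations | model) computed by the forward algorithm."""
    n_states = len(initial)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[s][obs] * sum(alpha[r] * transition[r][s]
                                   for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)

# A short sequence of (vector-quantised) acoustic symbols.
print(forward_likelihood([0, 1, 2, 1]))
```

In a complete system, one such model would be trained for each phoneme or word, and recognition would amount to comparing the likelihoods the competing models assign to the incoming speech.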
We hope that the number of languages in BABEL will grow, but the present list comprises Bulgarian, Estonian, Hungarian, Polish and Romanian. It is a great pleasure for me to be speaking in one of the BABEL partner nations, and to mention the valuable contribution of my colleague Marian Boldea, of the Technical University of Timisoara, who is leading the work on Romanian. The BABEL database, which is based on the design of the speech database EUROM1 [22], should be of great value in the development of Romanian-language speech technology applications.
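As an illustration of how a transcription symbol can be linked to a particular stretch of recording, the sketch below reads a time-aligned label file in the TIMIT style, where each line gives a start sample, an end sample and a phoneme symbol. The format, the 16 kHz sampling rate and the file name are assumptions made for illustration, not the actual BABEL or EUROM1 specification.

```python
# Sketch of reading a time-aligned transcription, assuming a TIMIT-style
# label file where each line is: <start sample> <end sample> <phoneme>.
# This is illustrative only, not the BABEL/EUROM1 file format.

def read_labels(path, sample_rate=16000):
    """Yield (phoneme, start_seconds, end_seconds) for each labelled segment."""
    with open(path) as f:
        for line in f:
            start, end, phoneme = line.split()
            yield phoneme, int(start) / sample_rate, int(end) / sample_rate

# Hypothetical label file; print which part of the waveform each symbol covers.
for phoneme, t0, t1 in read_labels("sa1.phn"):
    print(f"{phoneme}: {t0:.3f}s to {t1:.3f}s")
```

It is precisely this kind of symbol-to-signal alignment that lets a training procedure extract, for every unit, all the acoustic material labelled with that unit across many speakers and contexts.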

Before leaving the subject, I would like to mention the importance of prosody: factors in speech such as intonation, stress and rhythm. There is much evidence that humans make extensive use of prosody in speech understanding, yet artificial systems so far make little use of this information. Many other non-segmental factors need to be considered [23]. We can now turn to Speech Synthesis.



