Speech Technology: a Look into the Future

Peter Roach


1. Introduction

Technology is changing all our lives, and one aspect of technology is soon to become a leading influence in this revolution. One of the most fundamental attributes of the human intellect is the ability to communicate by speaking, and we have at last achieved the goal of developing machines that are themselves capable of spoken communication. To many ordinary people, speech technology seems an idea out of science fiction, yet the dedicated work of scientists around the world involved in this area of technology in recent decades has resulted in products with real commercial and industrial potential at relatively low cost.

Many surveys of the field have been written (see for example [1, 2, 3, 4, 5, 6, 7]), and innumerable conferences such as ICASSP, EUROSPEECH, ICSLP and meetings of the ASA have reported developments, some dramatic, others more mundane, in the scientific work which has enabled the growth of this technology. It is not my purpose here to write yet another general survey, nor am I (as a specialist in phonetics and speech science, rather than in electronic engineering or computer science) competent to write a detailed summary of the latest technological advances. My aim is simply to list the most obvious ways in which speech technology is likely to affect our lives, and to look at the challenges which face researchers in this field today.

For most speech scientists, speech technology comprises two principal areas: automatic speech recognition, and speech synthesis. We should not underestimate the importance of another area, that of speech compression and coding, but the involvement of conventional speech science in this area is rather less obvious. In principle, any area where technology is involved in the process of spoken communication should be regarded as an example of speech technology. Recognition and synthesis are, to a large extent, complementary fields, and might be thought of as two sides of the same coin. There are, however, major differences. One of the most influential figures in the development of speech technology in Britain, Dr. John Holmes, who has made major advances in both fields, once jokingly said that if speech synthesis is comparable with getting toothpaste out of a tube, speech recognition is like trying to get the toothpaste back in. Speech recognition has the greatest potential for economic success, but also presents the greatest technical challenges [8, 9, 10, 11]. Let us first look at speech recognition.

2. Speech recognition

2.1. Applications of speech recognition

The most frequently quoted application for speech recognition is in office dictation systems. It is believed that there will be major economic benefits when a fully reliable system is on the market. Currently, users must choose between systems which recognise a small vocabulary (one or two thousand words) reliably and in a reasonably natural speaking style, and systems which recognise a large vocabulary (tens of thousands of words) only in an unnatural speaking style in which words are separated by pauses. The DragonDictate [12] system has an active vocabulary of 30,000 words, with the capability of using an additional 80,000 words from a well-known dictionary. It adapts to individual speakers, whereas most systems have to be trained to work with one particular person's voice. We can expect to see soon an office dictation system capable of taking dictation from many speakers using a large (though not unlimited) vocabulary and more or less natural connected speech. Such a system will receive the spoken input, and produce a letter or report with proper formatting and spelling. It must be remembered that correct spelling is not easy to achieve in English, and the difficulty of converting spelling to sound and sound to spelling is one of the problems that receives most effort in English-speaking countries, a problem that could be avoided if English spelling were reformed. In this context we should note that most people can speak more rapidly than they can type, so a speech-input system is likely to speed up work in some areas.

