Many surveys of the field have
been written (see for example
[1, 2, 3, 4, 5, 6, 7]), and innumerable
conferences such as ICASSP, EUROSPEECH, ICSLP and meetings of
the ASA have reported developments, some dramatic, others more
mundane, in the scientific work which has enabled the growth of
this technology. It is not my purpose here to write yet another
general survey, nor am I (as a specialist in phonetics and speech
science, rather than in electronic engineering or computer science)
competent to write a detailed summary of the latest technological
advances. My aim is simply to list the most obvious ways in which
speech technology is likely to affect our lives, and to look at
the challenges which face researchers in this field today.
For most speech scientists, speech technology comprises
two principal areas: automatic speech recognition, and speech
synthesis. We should not underestimate the importance of another
area, that of speech compression and coding, but the involvement
of conventional speech science in this area is rather less obvious.
In principle, any area where technology is involved in the process
of spoken communication should be regarded as an example of Speech
Technology. Recognition and synthesis are, to a large extent,
complementary fields, and might be thought of as two sides of
the same coin. There are, however, major differences. One of the
most influential figures in the development of speech technology
in Britain, Dr. John Holmes, who has made major advances in both
fields, once jokingly said that if speech synthesis is comparable
with getting toothpaste out of a tube, speech recognition is like
trying to get the toothpaste back in. Of the two, speech recognition
has the greater potential for economic success, but presents the
greater technical challenges [8, 9, 10, 11].
Let us first look at speech recognition.
2.1. Applications of speech recognition
The most frequently quoted application for speech
recognition is in office dictation systems. It is believed that
there will be major economic benefits when a fully reliable system
is on the market. Currently, users must choose between systems
which recognise a small vocabulary (one or two thousand words)
reliably and in a reasonably natural speaking style, and systems
which recognise a large vocabulary (tens of thousands of words)
but only in an unnatural speaking
style in which words are separated by pauses. The DragonDictate
[12] system has an active vocabulary of 30,000 words, with the
capability of using an additional 80,000 words from a well-known
dictionary. It adapts to individual speakers, whereas most systems
have to be trained to work with one particular person's voice.
It is clear that we can expect soon to see an office dictation
system which will be capable of taking dictation from many speakers
using a large (though not unlimited) vocabulary and more or less
natural connected speech. Such a system will receive the spoken
input, and produce a letter or report with proper formatting and
spelling. It must be remembered that correct spelling is not easy
to achieve in English, and the difficulty of converting
spelling to sound and sound to spelling is one of the problems
that receives most effort in English-speaking countries, a problem
that could be avoided if English spelling were reformed. In this
context we should note that most people can speak more rapidly
than they can type, so a speech-input system is likely to speed
up work in some areas.
2. Speech recognition