In typical speech recognition systems, the input speech is compared with stored reference templates of phonemes or words, as in the Dynamic Time Warping (DTW) technique, and the most similar reference template is selected as the candidate phoneme or word for the input speech [1]. To this end, the short-time spectral density is usually extracted at short intervals and used for comparison with the reference templates. The comparison is made in terms of spectral distance: the speech is first processed with an analysis technique such as the Fast Fourier Transform (FFT) or Linear Predictive Coding (LPC) in order to obtain the short-time spectrum. Chapter 2 gives a general presentation of such speech analysis results, both time-domain and spectral, but is devoted mainly to Vector Quantization (VQ) as a technique for bit-rate reduction. DTW recognition (sub-chapter 3.1) solves the problem of the differing durations of spoken words, and combining it with VQ speeds up the recognition algorithm [3].
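The two ideas combine naturally: VQ replaces each spectral frame by the index of its nearest codebook vector, and DTW then aligns the resulting index sequences against the reference templates. The sketch below is a minimal illustration of both steps, assuming a precomputed codebook and a caller-supplied frame distance; it is not the exact procedure of the cited works.

```python
import numpy as np

def quantize(frames, codebook):
    """Vector quantization: map each spectral frame to the index of its
    nearest codebook vector (Euclidean distance)."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

def dtw_distance(a, b, dist):
    """Classic DTW: minimal cumulative distance between sequences a and b,
    allowing match, insertion, and deletion steps."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i, j] = c + min(D[i - 1, j],      # insertion
                              D[i, j - 1],      # deletion
                              D[i - 1, j - 1])  # match
    return D[n, m]
```

Recognition then amounts to computing `dtw_distance` between the quantized input and each reference template and choosing the template with the smallest distance; with VQ, the frame distance can be read from a small precomputed table of inter-codeword distances, which is where the speed-up comes from.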
To avoid the word-length problem altogether, sub-chapter 3.2 focuses on statistical modeling of speech production, dealing with discrete Hidden Markov Models (HMMs) for speech recognition. The HMM parameter set for the acoustic signals is estimated with the Baum-Welch algorithm, and VQ is used to obtain the sequence of speech observations.
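At the core of both recognition and Baum-Welch re-estimation lies the forward algorithm, which computes the probability of an observation sequence (here, a sequence of VQ codeword indices) under a discrete HMM. A minimal sketch, with illustrative parameter names of my own choosing:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm for a discrete HMM: returns P(obs | model).

    A  : (N, N) state-transition probability matrix
    B  : (N, M) emission probabilities over the VQ codebook indices
    pi : (N,)   initial state distribution
    obs: sequence of VQ codeword indices
    """
    alpha = pi * B[:, obs[0]]          # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate one step and emit
    return alpha.sum()
```

Baum-Welch iterates this forward pass together with a symmetric backward pass to re-estimate `A`, `B`, and `pi` until the likelihood of the training observations stops improving; that re-estimation loop is omitted here for brevity.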
The motivation for using an Artificial Neural Network (ANN) structured as a Multilayer Perceptron (MLP) (sub-chapter 3.3) is to explore the ability of ANNs to recognize the spectral patterns of words and, in particular, to extract invariant features from highly variable data, and then to test their performance on the specific task of word recognition. An MLP structure with one hidden layer is proposed, and the patterns presented to the input of the Neural Network (NN) are computed by LPC analysis over 25.6 ms frames of the words. At this point, problems related to the different word lengths could arise. To avoid them, the idea is to segment the input speech into a sequence of stationary segments, each represented by its centroid [4,5]. In this way, the sequence of segment centroids feeds the MLP and the amount of data is reduced; a fixed number of segments matches the fixed input dimensionality of the MLP. Results are presented on CCS and its performance, and on ASR using MLP and SOFM with or without CCS, and with or without VQ, showing that a higher recognition rate and lower computational time are obtained when CCS is included in the recognition process. The effects of the number of processing units in the hidden layer, the learning factor, the activation function, and the momentum have also been monitored. Finally, some conclusions on the recognition performance of the mentioned techniques are given.
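The centroid-segmentation idea can be sketched as follows. For simplicity this sketch splits the frame sequence into equal-length segments, whereas the approach described above places segment boundaries at stationary regions of the speech; the function name and the uniform split are illustrative assumptions, not the method of [4,5].

```python
import numpy as np

def segment_centroids(frames, n_segments):
    """Split a variable-length sequence of LPC frame vectors into a fixed
    number of contiguous segments and represent each segment by its
    centroid (mean vector), yielding a fixed-size input pattern for the MLP."""
    pieces = np.array_split(frames, n_segments)          # uneven word lengths allowed
    return np.vstack([p.mean(axis=0) for p in pieces]).ravel()
```

Because the number of segments is fixed, every word, regardless of its duration, maps to a vector of the same dimensionality (`n_segments` times the number of LPC coefficients per frame), which is exactly what the fixed-size MLP input layer requires.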