The first step in preparing the signal for comparison (in which the first feature is extracted from the signal) is the delimitation of silence portions between speech units. The markers delimiting silence inside a signal are established by processing the loudness, according to the following criteria:
The algorithm proceeds in a "look-ahead" manner: whenever the loudness indicates a frame in the original signal whose local maximum amplitude lies inside a narrow interval centered on zero, this frame is assumed to constitute the beginning of a silence segment (according to the second criterion). The subsequent analysis must confirm or refute this hypothesis. Any point in the loudness indicating high energy in the signal is first suspected of being an irrelevant (noise) peak. If it is (i.e., no longer than 50 ms), it is simply appended to the current silence segment; if it is not, the silence interval is ended. Finally, if the length of the detected silence is significant according to the first criterion, the signal is marked with SILENCE and END_SILENCE marks.
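Only one numeric detail is fixed by the description above (noise peaks no longer than 50 ms), so the following Python sketch illustrates the look-ahead strategy with hypothetical values for everything else: the frame length, the width of the zero-centered amplitude band (second criterion), and the minimum significant silence length (first criterion). It approximates the loudness by the peak absolute amplitude per frame and is only a minimal illustration, not the PROSODICS implementation.

import numpy as np

FRAME_MS = 10           # hypothetical frame length
ZERO_BAND = 0.02        # hypothetical "narrow zero-centered interval" (2nd criterion)
MIN_SILENCE_MS = 120    # hypothetical minimum significant silence (1st criterion)
MAX_NOISE_PEAK_MS = 50  # noise peaks up to 50 ms are absorbed into the silence

def mark_silences(signal, sample_rate):
    """Return a list of (SILENCE, END_SILENCE) marker positions in samples."""
    frame_len = int(sample_rate * FRAME_MS / 1000)
    n_frames = len(signal) // frame_len
    # loudness proxy: local maximum absolute amplitude inside each frame
    loudness = np.array([np.max(np.abs(signal[i * frame_len:(i + 1) * frame_len]))
                         for i in range(n_frames)])
    markers = []
    i = 0
    while i < n_frames:
        if loudness[i] < ZERO_BAND:               # hypothesis: a silence segment starts here
            start = i
            while i < n_frames:
                if loudness[i] < ZERO_BAND:
                    i += 1
                    continue
                # high-energy run: suspected noise peak, look ahead to measure its length
                j = i
                while j < n_frames and loudness[j] >= ZERO_BAND:
                    j += 1
                if (j - i) * FRAME_MS <= MAX_NOISE_PEAK_MS:
                    i = j                          # short peak: append it to the silence
                else:
                    break                          # real speech: the silence interval ends
            if (i - start) * FRAME_MS >= MIN_SILENCE_MS:
                markers.append((start * frame_len, i * frame_len))
        else:
            i += 1
    return markers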
Fig. 2. - Dialog box in PROSODICS for editing the master signal.
3.3. Fricatives
The detection of fricatives precedes pitch determination, as a first indication of the voiced/unvoiced feature. To characterize fricatives, PROSODICS uses the zero-crossing rate (ZCR) function and the loudness function. Using ZCR, the segments of the signal are classified into one of three categories, depending on whether their zero-crossing values are high, middle, or low. High-ZCR segments are marked as frications, while for middle-value segments the loudness is also inspected, and they are marked as frications only when the loudness indicates rather low energy. As the algorithm has a slight tendency to overestimate the fricative zones (marked between FRICATIVE and END_FRICATIVE markers), the final decision is taken at pitch detection and, finally, in the signal-to-text alignment step: pitch detection establishes the final voiced/unvoiced decision, while signal-to-text alignment selects from the proposed zones those that actually correspond to fricative-tagged phonemes.
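The three-way ZCR classification combined with the loudness check can be sketched as below. The thresholds (ZCR_HIGH, ZCR_MID, LOW_ENERGY), the frame length, and the use of RMS energy as a loudness proxy are all assumptions introduced for illustration; the actual PROSODICS values are not given in the text.

import numpy as np

FRAME_MS = 10
ZCR_HIGH = 0.40     # hypothetical: fraction of sign changes per frame counted as "high"
ZCR_MID = 0.25      # hypothetical lower bound of the "middle" ZCR range
LOW_ENERGY = 0.05   # hypothetical RMS threshold for "rather low energy"

def mark_fricatives(signal, sample_rate):
    """Return a list of (FRICATIVE, END_FRICATIVE) marker positions in samples."""
    frame_len = int(sample_rate * FRAME_MS / 1000)
    n_frames = len(signal) // frame_len
    markers, start = [], None
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate
        rms = np.sqrt(np.mean(frame ** 2))                  # loudness proxy
        # high ZCR: frication; middle ZCR: frication only if the energy is low
        is_fric = zcr >= ZCR_HIGH or (ZCR_MID <= zcr < ZCR_HIGH and rms < LOW_ENERGY)
        if is_fric and start is None:
            start = i * frame_len                 # open a FRICATIVE zone
        elif not is_fric and start is not None:
            markers.append((start, i * frame_len))  # close it with END_FRICATIVE
            start = None
    if start is not None:
        markers.append((start, n_frames * frame_len))
    return markers

In this sketch the proposed zones are deliberately kept generous; as stated above, pitch detection and signal-to-text alignment are the steps that later discard the overestimated ones.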