The first step in preparing the signal for comparison (in which the first feature is extracted from the signal) is the delimitation of silence portions between speech units. The markers delimiting silence inside a signal are established by processing the loudness, according to the following criteria:
The algorithm proceeds in a "look-ahead" manner: whenever the loudness indicates a frame in the original signal whose local maximum amplitude lies inside a narrow interval centered on zero, this frame is assumed to constitute the beginning of a silence segment (according to the second criterion). The subsequent analysis must confirm or refute this hypothesis. Any point in the loudness indicating high energy in the signal is first suspected of being an irrelevant (noise) peak. If it is (i.e., no longer than 50 ms), it is simply appended to the current silence segment; if it is not, the silence interval is ended. Finally, if the length of the detected silence is significant according to the first criterion, the signal is marked with SILENCE and END_SILENCE marks.
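Only one numeric detail is fixed by the description above (noise peaks no longer than 50 ms), so the following Python sketch illustrates the look-ahead strategy with hypothetical values for everything else: the frame length, the width of the zero-centered amplitude band (second criterion), and the minimum significant silence length (first criterion). It approximates the loudness by the peak absolute amplitude per frame and is only a minimal illustration, not the PROSODICS implementation.

import numpy as np

FRAME_MS = 10           # hypothetical frame length
ZERO_BAND = 0.02        # hypothetical "narrow zero-centered interval" (2nd criterion)
MIN_SILENCE_MS = 120    # hypothetical minimum significant silence (1st criterion)
MAX_NOISE_PEAK_MS = 50  # noise peaks up to 50 ms are absorbed into the silence

def mark_silences(signal, sample_rate):
    """Return a list of (SILENCE, END_SILENCE) marker positions in samples."""
    frame_len = int(sample_rate * FRAME_MS / 1000)
    n_frames = len(signal) // frame_len
    # loudness proxy: local maximum absolute amplitude inside each frame
    loudness = np.array([np.max(np.abs(signal[i * frame_len:(i + 1) * frame_len]))
                         for i in range(n_frames)])
    markers = []
    i = 0
    while i < n_frames:
        if loudness[i] < ZERO_BAND:               # hypothesis: a silence segment starts here
            start = i
            while i < n_frames:
                if loudness[i] < ZERO_BAND:
                    i += 1
                    continue
                # high-energy run: suspected noise peak, look ahead to measure its length
                j = i
                while j < n_frames and loudness[j] >= ZERO_BAND:
                    j += 1
                if (j - i) * FRAME_MS <= MAX_NOISE_PEAK_MS:
                    i = j                          # short peak: append it to the silence
                else:
                    break                          # real speech: the silence interval ends
            if (i - start) * FRAME_MS >= MIN_SILENCE_MS:
                markers.append((start * frame_len, i * frame_len))
        else:
            i += 1
    return markers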
Fig. 2. - Dialog box in PROSODICS for editing the master signal.
3.3. Fricatives
The detection of fricatives precedes pitch determination, as a first indication of the voiced/unvoiced feature. To characterize fricatives, PROSODICS uses the zero-crossing rate (ZCR) function and the loudness function. Using ZCR, the segments of the signal are classified into one of three categories, depending on whether their zero-crossing values are high, middle, or low. High-ZCR segments are marked as frications, while for middle-value segments the loudness is also inspected, and they are marked as frications only when the loudness indicates rather low energy. As the algorithm has a slight tendency to overestimate the fricative zones (marked between FRICATIVE and END_FRICATIVE markers), the final decision is taken at pitch detection and, finally, in the signal-to-text alignment step: pitch detection establishes the final voiced/unvoiced decision, while signal-to-text alignment selects from the proposed zones those that actually correspond to fricative-tagged phonemes.
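The three-way ZCR classification combined with the loudness check can be sketched as below. The thresholds (ZCR_HIGH, ZCR_MID, LOW_ENERGY), the frame length, and the use of RMS energy as a loudness proxy are all assumptions introduced for illustration; the actual PROSODICS values are not given in the text.

import numpy as np

FRAME_MS = 10
ZCR_HIGH = 0.40     # hypothetical: fraction of sign changes per frame counted as "high"
ZCR_MID = 0.25      # hypothetical lower bound of the "middle" ZCR range
LOW_ENERGY = 0.05   # hypothetical RMS threshold for "rather low energy"

def mark_fricatives(signal, sample_rate):
    """Return a list of (FRICATIVE, END_FRICATIVE) marker positions in samples."""
    frame_len = int(sample_rate * FRAME_MS / 1000)
    n_frames = len(signal) // frame_len
    markers, start = [], None
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate
        rms = np.sqrt(np.mean(frame ** 2))                  # loudness proxy
        # high ZCR: frication; middle ZCR: frication only if the energy is low
        is_fric = zcr >= ZCR_HIGH or (ZCR_MID <= zcr < ZCR_HIGH and rms < LOW_ENERGY)
        if is_fric and start is None:
            start = i * frame_len                 # open a FRICATIVE zone
        elif not is_fric and start is not None:
            markers.append((start, i * frame_len))  # close it with END_FRICATIVE
            start = None
    if start is not None:
        markers.append((start, n_frames * frame_len))
    return markers

In this sketch the proposed zones are deliberately kept generous; as stated above, pitch detection and signal-to-text alignment are the steps that later discard the overestimated ones.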