Maria-Mirela Petrea, Dan Cristea * Dealing with Prosody. A Computer-Assisted Language Learning Approach




3.4. Segmentation

The steps described above prepare the signal for two further main phases: segmentation and F0 estimation. During segmentation the silence portions previously detected are skipped over. The algorithm is essentially based on the convex-hull method (see [9]). The differences come mainly from the length of the frame chosen to compute the loudness. Since the frame is rather small, a direct implementation of the convex-hull method would lead to a large number of segments, even when the thresholds are enlarged accordingly. Indeed, a small frame means a jittery loudness contour, hence many hill-valley alternations, and it may become difficult to decide which minimum to retain as a possible speech-unit delimiter. The PROSODICS segmentator therefore retains only loudness local maxima adjacent to significant valleys. During segmentation a loudness tagging is also performed: each segment marker receives a tag that characterizes the slope of the following segment: stationary, steep/sweet ascending, or steep/sweet descending.
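A minimal Python sketch of this idea, assuming a per-frame loudness contour as input; the function names, the valley-depth threshold, and the slope thresholds are illustrative assumptions, not the PROSODICS implementation:

```python
def segment_boundaries(loudness, depth_threshold):
    """Return indices of loudness local minima kept as candidate segment
    delimiters: a valley survives only if the maxima on both sides rise
    above it by at least depth_threshold (a 'significant' valley)."""
    boundaries = []
    for i in range(1, len(loudness) - 1):
        # local minimum in the loudness contour
        if loudness[i] <= loudness[i - 1] and loudness[i] <= loudness[i + 1]:
            left_max = max(loudness[:i])
            right_max = max(loudness[i + 1:])
            if (left_max - loudness[i] >= depth_threshold and
                    right_max - loudness[i] >= depth_threshold):
                boundaries.append(i)
    return boundaries

def slope_tag(loudness, start, end, steep=3.0, flat=0.5):
    """Tag the segment [start, end) by the slope of its loudness:
    stationary, steep/sweet ascending, or steep/sweet descending."""
    slope = (loudness[end - 1] - loudness[start]) / max(end - start, 1)
    if abs(slope) < flat:
        return "stationary"
    direction = "ascending" if slope > 0 else "descending"
    return ("steep " if abs(slope) >= steep else "sweet ") + direction
```

With a contour such as `[0, 10, 1, 12, 0]` and a depth threshold of 5, only the valley at index 2 is kept as a boundary; shallower dips between the hills would be discarded.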

At the end of segmentation, the fricative markers are matched against the segmentation markers, in order to further classify segments according to their inclusion in a fricative class.

The PROSODICS segmentator was not designed to point out syllables. Instead, the resulting segments can be at phone level or even lower. This feature is exploited in the editing phase: the MASTER can choose the speech units she considers valid and relevant. Segment tags are necessary when the signal is automatically aligned with its phonemic transcription, that is, when scoring the match between the concatenation of two or more segments and a phoneme or a group of phonemes. For instance, the stationary tag of a segment indicates the presence of a nasal or low fricative, which can influence the segment(s)-to-phoneme(s) matching score.

In the editing phase the master uses a dialog box like that in figure 2. She has to listen to the segments already detected, to join them, where appropriate, into valid speech units, and to attach text information to each unit decided upon. The text information describes the unit to which it is attached, and may be expressed as a phonemic or as an orthographic transcription, depending on the type of alignment used. When pitch-to-pitch alignment is to be performed, the orthographic transcription is entirely introduced by the master. When signal-to-text alignment is used, the edit-master dialog box displays the automatically detected alignment markers and, for each portion between two consecutive markers, the phoneme-group label; the master may join consecutive portions, in which case the label is automatically updated, or may correct the label currently attached to a portion.

3.5. Cut-silence and beginning/final noise elimination

The cut-silence and noise elimination procedure is mainly concerned with deleting, in recorded signals, the noise that precedes and follows the speech. Usually, the speaker leaves a silence before producing speech and before stopping the recording process, silence that must be cut off. The procedure also deals with the frequent noise caused by moving the microphone or by breathing. In detecting noise, the most important criterion is the absence of pitch on portions with high energy.

The procedure acts as follows (sketched here for the final portion of the signal only): after the last pitch marker, it looks ahead in the signal for a sufficiently long silence portion (greater than 250 ms); if one is encountered, anything coming after it is simply regarded as noise and eliminated, the silence portion included.
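The final-portion clean-up can be sketched as follows; this is an illustrative reconstruction, assuming marker positions are expressed in samples and that silence runs have already been detected as (start, end) pairs:

```python
def trim_trailing_noise(signal, last_pitch_marker, silence_runs,
                        sample_rate, min_silence_ms=250):
    """After the last pitch marker, find the first silence run longer than
    min_silence_ms and cut everything from its start onwards (the silence
    itself included), treating the remainder as noise."""
    min_silence = int(sample_rate * min_silence_ms / 1000)
    for start, end in silence_runs:          # (start, end) sample indices
        if start >= last_pitch_marker and end - start > min_silence:
            return signal[:start]            # drop silence and trailing noise
    return signal                            # nothing to trim
```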

3.6. Pitch detection

3.6.1. Description of the method

The pitch detector of PROSODICS belongs to the time-domain analysis class (see also [5, 6]) and is based on a search for the periodicity of peaks in the original waveform signal. It processes the original signal (no preliminary low-pass filtering is performed) using a peak-picking-like strategy.

The input of the pitch detection algorithm is a waveform signal, fundamentally characterized by its sampling frequency, a flag indicating whether the signal belongs to a male or a female speaker, and a list of markers delimiting the silence and fricative zones inside the signal. The already detected silence zones are skipped. The fricative zones tagged as high-valued in the zero-crossing domain are also skipped, but not the middle-valued ones.

The method we propose takes as pitch period the time interval between two significant peaks (see figure 3). Informally, the algorithm consists of three basic actions: deciding whether a peak is an epoch, trying to start a new pitch trace, and trying to continue an already detected trace with a new epoch. The peak controller may propose peaks that are not epochs, so it simultaneously develops pitch traces that may intersect with each other.
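The last two actions can be illustrated as follows. This is a simplified sketch, not the PROSODICS algorithm itself: the continuity tolerance `max_deviation` and the list-of-epoch-times representation of a trace are assumptions made for the example.

```python
def extend_traces(traces, peak_time, max_deviation=0.2):
    """Try to continue each existing pitch trace with the new peak; if no
    trace accepts it, start a new one-peak trace. A trace is a list of
    epoch times; a peak fits a trace if the period it implies stays within
    max_deviation (relative) of the trace's last period."""
    continued = False
    for trace in traces:
        if len(trace) >= 2:
            last_period = trace[-1] - trace[-2]
            new_period = peak_time - trace[-1]
            if abs(new_period - last_period) <= max_deviation * last_period:
                trace.append(peak_time)
                continued = True
        elif peak_time > trace[-1]:
            trace.append(peak_time)   # second epoch fixes the first period
            continued = True
    if not continued:
        traces.append([peak_time])    # candidate start of a new trace
    return traces
```

Because every peak either extends some trace or opens a new one, several competing traces can grow at once, matching the observation above that traces may intersect; a later scoring step would have to choose among them.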



