Romanian Language Technology

Teodor Vușcan, Emma Tamâianu, Sanda Cherata * SILEX - a Lexico-Morphological Software for Romanian

3. The function of lemmatization

Lemmatization is the process by which each word form occurring in a text is assigned its lemma and its main attributes. The SILEX system is designed so that, for each lemma, all the word forms in its paradigm can be found without a separate dictionary entry for each form. Lemmatization performs the inverse function: given a word form, the system determines the lemma(s) in whose paradigm that form appears, together with the attributes pertaining to the lemma(s).

Note. The paradigm of a non-inflected word contains only one word form, which coincides with the lemma.

The lemmatization process will give, for each word form, the following results:

the lemma, as the representative of the paradigm;
the lexico-grammatical class of the word (noun, adjective, verb, etc.);
the lexico-grammatical sub-class of the word, for some grammatical categories (common or proper for nouns, predicative, auxiliary or copula for verbs, etc.);
in the case of inflected words, some other attributes are given, according to the morphological class:
- for nouns and adjectives: gender, number, case;
- for verbs: mood, tense, person, number;
- for pronouns: type of pronoun, gender, number, case, person;
- for numerals: type of numeral, gender, case (when it exists).
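
The result of lemmatizing one word form can be pictured as a small record. The sketch below is only an illustration: the class name Analysis, its field names and the sample values are assumptions, not the representation actually used in SILEX.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Analysis:
    """One possible analysis of a word form (hypothetical layout)."""
    lemma: str                      # representative of the paradigm
    word_class: str                 # noun, adjective, verb, ...
    sub_class: str = ""             # e.g. common/proper; empty if not applicable
    # Attributes that depend on the morphological class (gender, number, ...).
    attributes: Dict[str, str] = field(default_factory=dict)


# An inflected form: "casei", a form of the noun "casă" (house).
casei = Analysis(
    lemma="casă",
    word_class="noun",
    sub_class="common",
    attributes={"gender": "feminine", "number": "singular",
                "case": "genitive-dative"},
)

# A non-inflected word: its only form coincides with the lemma.
si = Analysis(lemma="și", word_class="conjunction")
```

A non-inflected word simply carries no class-specific attributes, which matches the remark that its paradigm contains a single form.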

Note.
In the first version of SILEX, the analysis of word forms was strictly context-free. That is, for each occurrence of a form the system found all the possible morphological determinations, i.e. all the lemmas in whose paradigm the given form appears, with all of their specific attributes.

At present, various elements of contextual analysis have already been implemented, functioning as a filter that points to the most probable lemma in the given structure.

In the lemmatization process, for each analyzed word, the system tries all the possible decompositions of the form:

word = r + t

where r corresponds to a possible root and t corresponds to a possible word ending. In the initial decomposition, t is the empty string.

For each such decomposition, the system searches for a matching lemma. For this purpose, the SILEX dictionary is searched for an entry corresponding to r. If such an entry exists, the system checks whether t is one of the word endings in the list associated with r in the dictionary. If this is the case, the lemma and its attributes are determined.

Note.
For a root r, two or more matching dictionary entries may be found (because the same invariant word segment may appear in the paradigms of several lemmas). That is why, for a given r, all the matching dictionary entries are analyzed. For each entry, a check on the ending t is performed. In this way, for a given decomposition, all the lemmas in whose paradigm the word form appears are detected.

Then the system tries another decomposition of the form:

word = r' + t'

where r = r' + u and t' = u + t, u being the last letter of r, and r and t being the elements of the previous decomposition. The checking process is then repeated. More than one decomposition may be valid (in the case of homographs). By this procedure, all the possible determinations are found (that is, all the lemmas in whose paradigm the word form appears are identified). The decomposition process comes to an end either when r' is the empty string, or when t' does not belong to any list of word endings.
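
The decomposition procedure described above can be sketched as follows. This is a minimal illustration, not the SILEX implementation: the Entry type, the toy dictionary keyed by root, and the attribute strings are all assumptions made for the example. The toy endings are chosen so that every shorter suffix of a listed ending is itself listed, since the stated stopping rule halts as soon as an ending is not found in any list.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    """Hypothetical dictionary entry attached to a root."""
    lemma: str
    word_class: str
    endings: Dict[str, str]   # word ending -> attributes of that form


# Toy dictionary (assumed layout): root -> all entries sharing that root.
TOY_DICTIONARY: Dict[str, List[Entry]] = {
    "cas": [Entry("casă", "noun", {
        "ă": "singular",
        "e": "plural",
        "ei": "singular, genitive-dative, definite",
    })],
    "cer": [
        Entry("cer", "noun", {"": "singular"}),
        Entry("cere", "verb", {
            "": "indicative present, 1st sg. or 3rd pl.",
            "i": "indicative present, 2nd sg.",
            "e": "indicative present, 3rd sg.",
        }),
    ],
}


def lemmatize(word: str,
              dictionary: Dict[str, List[Entry]]) -> List[Tuple[str, str, str]]:
    """Return every (lemma, class, attributes) whose paradigm contains word."""
    # Union of all ending lists, used for the stopping rule.
    known_endings = {ending
                     for entries in dictionary.values()
                     for entry in entries
                     for ending in entry.endings}
    results = []
    root, ending = word, ""            # initial decomposition: t is empty
    while root:                        # stop when the root becomes empty
        # Analyze ALL entries matching this root, not just the first one.
        for entry in dictionary.get(root, []):
            if ending in entry.endings:
                results.append((entry.lemma, entry.word_class,
                                entry.endings[ending]))
        # Move the last letter of the root to the front of the ending.
        root, ending = root[:-1], root[-1] + ending
        if ending not in known_endings:
            break                      # t' is in no list of word endings
    return results


homographs = lemmatize("cer", TOY_DICTIONARY)
inflected = lemmatize("casei", TOY_DICTIONARY)
unknown = lemmatize("xyz", TOY_DICTIONARY)
```

In this toy setting, "cer" yields two analyses (a noun and a verb form sharing the same root), "casei" yields one analysis found at the third decomposition (cas + ei), and an unknown string yields an empty list.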

4. Paradigm generation

Generating the paradigm of a given lemma (inflection) means building all the forms belonging to the paradigm, given its representative (the lemma) and its lexico-grammatical class. Because of homographs, one and the same lemma may have different paradigms, belonging to the same or to different morphological classes. The distinction between two homograph lemmas belonging to different morphological classes is made by the attribute 'lexico-morphological class' attached to the lemma (and to its corresponding entry in the SILEX dictionary). The distinction between two homograph lemmas of the same morphological class is made through the list of word endings attached to the lemma in the SILEX dictionary; in the case of nouns, the distinction is also made through the gender attribute. For homograph lemmas of the same lexico-morphological class, all the paradigms are generated.

Inflection is performed as a symmetric operation to lemmatization (from the point of view of both design and implementation).
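
Under the same assumptions as a root-plus-endings dictionary, paradigm generation can be sketched as concatenating the root of each matching entry with every ending in its list. The Entry layout, the function name generate_paradigms and the sample data are hypothetical and serve only to illustrate the symmetry with lemmatization.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    """Hypothetical dictionary entry: a root plus its list of endings."""
    root: str
    lemma: str
    word_class: str
    endings: Dict[str, str]   # word ending -> attributes of that form


TOY_ENTRIES: List[Entry] = [
    Entry("cer", "cer", "noun", {"": "singular"}),
    Entry("cer", "cere", "verb", {
        "": "indicative present, 1st sg. or 3rd pl.",
        "i": "indicative present, 2nd sg.",
        "e": "indicative present, 3rd sg.",
    }),
]


def generate_paradigms(lemma: str, word_class: str,
                       entries: List[Entry]) -> List[List[Tuple[str, str]]]:
    """Build one paradigm per entry matching the lemma and its class.

    Several entries may match (homograph lemmas of the same class);
    in that case all the paradigms are returned.
    """
    return [
        [(entry.root + ending, attrs)
         for ending, attrs in entry.endings.items()]
        for entry in entries
        if entry.lemma == lemma and entry.word_class == word_class
    ]


verb_paradigms = generate_paradigms("cere", "verb", TOY_ENTRIES)
noun_paradigms = generate_paradigms("cer", "noun", TOY_ENTRIES)
```

Here the class argument is what separates the noun "cer" from the verb "cere", even though both entries share the root "cer".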