Teodor Vușcan, Emma Tamâianu, Sanda Cherata * SILEX - a Lexico-Morphological Software for Romanian
Note. The paradigm of a non-inflected word contains only one word form,
which coincides with the lemma.
The lemmatization process will
give, for each word form, the following results:
Note.
At present, various elements of
contextual analysis have already been implemented, functioning
as a filter that points to the most probable lemma in the given
structure.
In the lemmatization process,
for each analyzed word, the system tries all the possible decompositions
of the form:
word = r + t
where r corresponds to
a possible root and t corresponds to a possible word ending.
In the initial decomposition, t is the empty string.
For each such decomposition, the
system searches for a matching lemma. For this purpose, the
SILEX dictionary is searched for an entry corresponding to r.
If such an entry exists, the system checks whether t corresponds
to a word ending in the list of word endings associated with r
in the dictionary. If this is the case, the lemma and its attributes
are determined.
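The check performed for a single decomposition can be sketched as follows. This is only an illustration under stated assumptions: the dictionary layout (a mapping from a root to a list of lemma, attributes and endings), the function name check_decomposition and the toy entry for the noun "casă" are hypothetical and are not taken from the SILEX implementation.

```python
# Hypothetical toy dictionary: root -> list of (lemma, attributes, endings).
# The real SILEX dictionary format is not described here; this is an assumption.
TOY_DICTIONARY = {
    "cas": [("casă", "noun, feminine", ["ă", "e", "a", "ei", "ele", "elor"])],
}


def check_decomposition(root, ending, dictionary):
    """Return the (lemma, attributes) pairs whose ending list contains `ending`.

    All entries stored under `root` are examined, so several lemmas
    may be returned for the same decomposition.
    """
    matches = []
    for lemma, attributes, endings in dictionary.get(root, []):
        if ending in endings:
            matches.append((lemma, attributes))
    return matches
```

For the decomposition casei = cas + ei, the call check_decomposition("cas", "ei", TOY_DICTIONARY) returns the single pair for the lemma "casă"; an ending that is not listed under the root gives an empty result.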
Note.
Then the system tries another
decomposition of the form:
word = r' + t'
where:
r = r' + u, t' = u + t
Inflection is performed as the symmetric
counterpart of lemmatization (in terms of both design
and implementation).
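One way to picture this symmetry is that lemmatization splits a word form into a root and an ending, while inflection joins a root with each ending listed for it. The sketch below assumes exactly that and nothing more; the function name generate_paradigm and the sample root and endings are hypothetical, and the actual SILEX generation procedure may differ.

```python
def generate_paradigm(root, endings):
    """Build word forms by appending each listed ending to the root.

    Assumption: a paradigm is the set of root + ending concatenations,
    mirroring the word = r + t decomposition used in lemmatization.
    """
    return [root + ending for ending in endings]
```

With the illustrative root "cas" and the endings "ă", "e" and "ei", the function yields "casă", "case" and "casei".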
- for nouns and adjectives: gender, number, case;
- for verbs: mood, tense, person, number;
- for pronouns: type of pronoun, gender, number, case, person;
- for numerals: type of numeral, gender, case (when it exists).
In the first version of SILEX, the analysis of word forms
was strictly context-free. This meant that the system found, for
each occurrence of a form, all the possible morphological determinations,
i.e. all the lemmas in whose paradigm the given form appears,
with all of their specific attributes.
For a given root r, two or more matching
dictionary entries may be found (because the same
invariant word segment may appear in the paradigms of several lemmas).
That is why, for a given r, all the matching dictionary
entries are analyzed. For each entry, a check on the ending t
is performed. In this way, for a given decomposition, all the
lemmas in whose paradigm the word form appears are detected.
u being the last letter of r, and r and t being
the elements of the previous decomposition, and the checking process
is repeated. More than one decomposition
may be valid (in the case of homographs). By this procedure, all
the possible determinations are found (that is, all the lemmas
in whose paradigm the word form appears are identified).
The decomposition process comes to an end either when r'
is the empty string, or when t' does not belong to any
list of word endings.
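The whole decomposition loop, including the stopping rule just stated, can be sketched as below. This is a minimal illustration, not the SILEX code: the dictionary layout, the function name lemmatize and the two toy entries are assumptions. The stopping rule is applied literally, so the result depends on how complete the inventory of endings is.

```python
# Hypothetical toy dictionary: root -> list of (lemma, attributes, endings).
TOY_DICTIONARY = {
    "cas": [("casă", "noun, feminine", ["ă", "e", "a", "ei", "ele", "elor"])],
    "lup": [("lup", "noun, masculine", ["", "i", "ul", "ii", "ului", "ilor"])],
}


def lemmatize(word, dictionary):
    """Try the decompositions word = r + t, moving one letter at a time from r to t."""
    if not word:
        return []
    # Every ending that occurs in at least one ending list of the dictionary.
    all_endings = {
        ending
        for entries in dictionary.values()
        for _lemma, _attributes, endings in entries
        for ending in endings
    }
    results = []
    root, ending = word, ""  # initial decomposition: t is the empty string
    while True:
        # Examine every entry stored under this root, not only the first one.
        for lemma, attributes, endings in dictionary.get(root, []):
            if ending in endings:
                results.append((lemma, attributes, root, ending))
        # Next decomposition: r = r' + u, t' = u + t, with u the last letter of r.
        root, ending = root[:-1], root[-1] + ending
        # Stop when r' is empty or t' belongs to no list of word endings.
        if not root or ending not in all_endings:
            break
    return results
```

For the form "casei", the loop tries casei + (empty), case + i and cas + ei, records the lemma "casă" for the last of these, and stops at ca + sei because "sei" is not a known ending.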
4. Paradigm generation