Teodor Vușcan, Emma Tamâianu, Sanda Cherata *
SILEX - a Lexico-Morphological Software for Romanian
2. The dictionaries of SILEX
By SILEX dictionary
we understand all the lexico-morphological data used by the system
in order to analyze and recognize word forms. In its first stage,
the SILEX dictionary must contain information
regarding
almost all the words of present-day Romanian (which means approximately
100,100 words). The solutions regarding the organization and the
representation of linguistic information were chosen so as to
meet the following requirements:
- every word form has to be
recognized, either directly, the word having its own entry in the
SILEX dictionary, or through effective algorithms,
using the information in the dictionary; for each word form, the
system has to specify its appropriate lemma;
- the information in the dictionary
must allow for the reverse process, that of generating the whole
paradigm of any specified word, starting from its lemma;
- the dictionary search time
must be as short as possible, so that the applications using the
SILEX dictionary should be run without noticeable delays;
- the memory storage needed
for the large amount of dictionary data has to be minimized;
- dictionary maintenance must be performed efficiently and easily;
this implies the existence of several facilities for entering
new words, for error correction and for updating and enriching
the information in the dictionary;
- the information must be organized
in a way that would offer the possibility of word selection by
various lexico-morphological criteria, as well as by multiple
criteria combinations.
The information in the machine
dictionary is organized in three main subunits:
- the internal dictionary (SILEX dictionary), the one that is operational
during word analysis or paradigm generation;
- lists of word endings;
- auxiliary dictionaries.
2.1. The SILEX dictionary
From an abstract point of view,
the internal dictionary is a set of records, each one being associated
with a lemma. The records in the dictionary contain the following types of
information:
- information for establishing
the morphological attributes of an inflectional form which appears
in the paradigm of a given lemma;
- information for retrieving
any inflectional form associated with a given lemma;
- information for generating
the whole paradigm of a given lemma.
Due to the peculiarities of the
Romanian language, for processing Romanian texts the machine
dictionary has to contain information referring to word inflection.
These data are used both for recognizing word forms and for developing
the routines that generate the inflectional forms of any given
lemma. As the inflectional procedures are rather cumbersome -
especially because of the changes that occur in the word root
(stem) - the encoding used by SILEX does not always conform
to linguistic criteria, being conceived, instead, according to
purely formal criteria. Consequently, the inflectional (sub)classes,
the "roots" and the sets of word endings do not fully
coincide with the subcategorizations found in linguistic descriptions.
This encoding pertains exclusively to the internal organization
of the information in SILEX, while the final result of
word analysis or of paradigm generation fully agrees
with linguistic reality.
Thus, from the point of view of
machine analysis, an inflectional word has the following structure:
'root' + 'ending'
where 'root' means the
string that remains constant during word inflection (for the whole
paradigm or for a subset of the paradigm) and 'ending'
means the string added to the 'root' to obtain one of the inflectional
forms of the word.
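The 'root' + 'ending' view suggests that analysis can start by listing every way a word form splits into a root followed by a known ending. The following is only a minimal sketch of that idea, not the SILEX implementation: the function name `candidate_splits` and the ending set `ENDINGS` are assumptions introduced for illustration, and the Romanian endings shown are a small hand-picked sample.

```python
def candidate_splits(form, known_endings):
    """Return every (root, ending) split of `form` whose ending is known.

    The empty string may be listed as an ending, so that an uninflected
    form can match with the whole word as its root.
    """
    splits = []
    # Move the cut point from the end of the word towards its start;
    # the root always keeps at least one letter.
    for cut in range(len(form), 0, -1):
        root, ending = form[:cut], form[cut:]
        if ending in known_endings:
            splits.append((root, ending))
    return splits


# Hypothetical ending set, chosen only for illustration.
ENDINGS = {"", "ă", "a", "e", "ei", "ele", "elor"}

print(candidate_splits("fetele", ENDINGS))
```

Each candidate root would then have to be looked up in the dictionary; only the splits whose root is an actual entry, and whose ending belongs to the list attached to that entry, are valid analyses.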
In the case of phonetic alternation
the solution chosen for the machine dictionary is to introduce,
for such a lemma, two or more entries ('roots'), specifying
for each 'root' the subparadigm it belongs to.
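As an illustration of this solution, consider the noun "fată" ('girl'), whose stem vowel alternates between "fat-" and "fet-". The sketch below stores one entry per 'root' and records which part of the paradigm it covers. The type `RootEntry`, its field names and the subparadigm labels are hypothetical; the source does not describe the actual SILEX record layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RootEntry:
    root: str         # string that stays constant within the subparadigm
    lemma: str        # dictionary form the root belongs to
    subparadigm: str  # hypothetical label for the forms this root covers


# Two entries for one lemma, one per alternating 'root'.
ROOT_ENTRIES = [
    RootEntry("fat", "fată", "singular nominative-accusative"),
    RootEntry("fet", "fată", "singular genitive-dative and plural"),
]


def lemmas_for_root(root, entries=ROOT_ENTRIES):
    """Return (lemma, subparadigm) for every entry stored under `root`."""
    return [(e.lemma, e.subparadigm) for e in entries if e.root == root]


print(lemmas_for_root("fet"))
```

Both roots point back to the same lemma, so a form built on either of them is reported under "fată".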
Therefore, for each entry the
dictionary contains the 'root', a code specifying the list
of 'endings' associated with the root, and a set of attributes
specific to each lexico-grammatical class; attaching the
'endings' in the list to the 'root' yields
the paradigm of the given word.
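The entry structure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the SILEX data format: the ending-list codes "E1" and "E2", the `Entry` fields, the attribute names and the sample noun are all invented for the example.

```python
from dataclasses import dataclass, field

# Hypothetical ending lists, each identified by an invented code.
ENDING_LISTS = {
    "E1": ["ă", "a"],
    "E2": ["e", "ei", "ele", "elor"],
}


@dataclass
class Entry:
    root: str
    ending_code: str  # key into ENDING_LISTS
    lemma: str
    attributes: dict = field(default_factory=dict)


# Assumed sample entries: one lemma with two alternating 'roots'.
ENTRIES = [
    Entry("fat", "E1", "fată", {"class": "noun", "gender": "feminine"}),
    Entry("fet", "E2", "fată", {"class": "noun", "gender": "feminine"}),
]


def generate_paradigm(lemma, entries=ENTRIES, ending_lists=ENDING_LISTS):
    """Attach each listed ending to each 'root' stored for `lemma`."""
    forms = []
    for entry in entries:
        if entry.lemma == lemma:
            for ending in ending_lists[entry.ending_code]:
                forms.append(entry.root + ending)
    return forms


print(generate_paradigm("fată"))
```

Storing only a short code per entry, and keeping each ending list once, is one way to serve the memory requirement stated earlier, since many words can share the same list.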