Romanian Language Technology

Teodor Vușcan, Emma Tamâianu, Sanda Cherata * SILEX - a Lexico-Morphological Software for Romanian

2. The dictionaries of SILEX

By SILEX dictionary we understand all the lexico-morphological data used by the system in order to analyze and recognize word forms. In its first stage, the SILEX dictionary must contain information regarding almost all the words of present-day Romanian (which means approximately 100,100 words). The solutions regarding the organization and the representation of linguistic information were chosen so as to meet the following requirements:

every word form has to be recognized, either directly, the word having its own entry in the SILEX dictionary, or through effective algorithms, using the information in the dictionary; for each word form, the system has to specify its appropriate lemma;
the information in the dictionary must allow for the reverse process, that of generating the whole paradigm of any specified word, starting from its lemma;
the dictionary search time must be as short as possible, so that applications using the SILEX dictionary run without noticeable delays;
the memory storage needed for the large amount of dictionary data has to be minimized;
dictionary maintenance must be performed efficiently and easily; this implies the existence of several facilities for entering new words, for error correction and for updating and enriching the information in the dictionary;
the information must be organized in a way that would offer the possibility of word selection by various lexico-morphological criteria, as well as by multiple criteria combinations.

The information in the machine dictionary is organized in three main subunits:

the internal dictionary (SILEX dictionary) - the one that is operational during word analysis or paradigm generation;
lists of word endings;
auxiliary dictionaries.

2.1. The SILEX dictionary

From an abstract point of view, the internal dictionary is a set of records, each one being associated with a lemma. The records in the dictionary contain the following types of information:

information for establishing the morphological attributes of an inflectional form which appears in the paradigm of a given lemma;
information for retrieving any inflectional form associated with a given lemma;
information for generating the whole paradigm of a given lemma.

Due to the peculiarities of the Romanian language, for processing Romanian texts the machine dictionary has to contain information referring to word inflection. These data are used both for recognizing word forms and for developing the routines that generate the inflectional forms of any given lemma. As the inflectional procedures are rather cumbersome - especially because of the changes that occur in the word root (stem) - the encoding used by SILEX does not always conform to linguistic criteria, being conceived, instead, according to purely formal criteria. Consequently, the inflectional (sub)classes, the "roots" and the sets of word endings are not totally coincident with the subcategorizations found in linguistic descriptions. This encoding pertains exclusively to the internal organization of the information in SILEX, while the final result of word analysis or of paradigm generation is totally coincident with linguistic reality.

Thus, from the point of view of machine analysis, an inflectional word has the following structure:

'root' + 'ending'

where 'root' means the string that remains constant during word inflection (for the whole paradigm or for a subset of the paradigm) and 'ending' means the string added to the 'root' to obtain one of the inflectional forms of the word.

In the case of phonetic alternation the solution chosen for the machine dictionary is to introduce, for such a lemma, two or more entries ('roots'), specifying for each 'root' the subparadigm it belongs to.
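The multi-root solution can be illustrated with a minimal Python sketch. The layout below is an assumption made for illustration only: the slot labels, the `ENTRIES` table and the `generate_paradigm` function are hypothetical and do not reproduce the actual SILEX encoding. The example uses the noun "fată" (girl), whose stem alternates between "fat-" and "fet-".

```python
# Hypothetical encoding (not the actual SILEX format): a lemma with
# phonetic alternation gets one entry per 'root', and each entry lists
# the endings of the subparadigm that the root covers.
ENTRIES = {
    "fată": [
        ("fat", {
            "sg.nom-acc.indef": "ă",
            "sg.nom-acc.def": "a",
        }),
        ("fet", {
            "sg.gen-dat.indef": "e",
            "sg.gen-dat.def": "ei",
            "pl.nom-acc.indef": "e",
            "pl.nom-acc.def": "ele",
            "pl.gen-dat.def": "elor",
        }),
    ],
}


def generate_paradigm(lemma):
    """Build the paradigm by joining each 'root' with the endings of its subparadigm."""
    paradigm = {}
    for root, endings in ENTRIES[lemma]:
        for slot, ending in endings.items():
            paradigm[slot] = root + ending
    return paradigm


paradigm = generate_paradigm("fată")
```

Here the two 'roots' are purely formal strings: neither "fat" nor "fet" needs to match the stem a linguistic description would posit, as long as their subparadigms together cover the paradigm of the lemma.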

Therefore, for each entry the dictionary contains the 'root', a code specifying the list of 'endings' associated with the root and a set of attributes specific to each lexico-grammatical class; attaching the 'endings' in the list to the 'root' yields the paradigm of the given word.
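The entry structure described above might be sketched as follows. This is an illustration under stated assumptions, not the SILEX implementation: the ending-list code "F1", the `Entry` fields (including the explicit `lemma` field), and the `generate` and `analyze` functions are all hypothetical names. The sketch shows both directions: generating a paradigm from an entry, and recognizing a word form by trying every 'root' + 'ending' split.

```python
from dataclasses import dataclass, field

# Hypothetical ending lists, shared by all entries that carry the same code.
ENDING_LISTS = {
    "F1": ["ă", "a", "e", "ei", "ele", "elor"],
}


@dataclass
class Entry:
    root: str          # string that stays constant during inflection
    ending_code: str   # key into ENDING_LISTS
    lemma: str         # assumption: the lemma is stored with the entry
    attributes: dict = field(default_factory=dict)


ENTRIES = [
    Entry("cas", "F1", "casă", {"class": "noun", "gender": "feminine"}),
    Entry("mam", "F1", "mamă", {"class": "noun", "gender": "feminine"}),
]

# Index the entries by root so that lookup during analysis is a dict access.
ROOT_INDEX = {}
for entry in ENTRIES:
    ROOT_INDEX.setdefault(entry.root, []).append(entry)


def generate(entry):
    """Attach every ending in the entry's list to its root."""
    return [entry.root + ending for ending in ENDING_LISTS[entry.ending_code]]


def analyze(form):
    """Return the lemmas whose root + ending combinations produce the form."""
    lemmas = []
    for i in range(len(form) + 1):
        root, ending = form[:i], form[i:]
        for entry in ROOT_INDEX.get(root, []):
            if ending in ENDING_LISTS[entry.ending_code]:
                lemmas.append(entry.lemma)
    return lemmas


forms = generate(ENTRIES[0])
found = analyze("caselor")
```

Storing each ending list once and referring to it by code keeps the entries small, which is in line with the memory requirement stated earlier; the root index is one simple way to keep lookup time short.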