Dan Tufiş, Ana-Maria Barbu * A Reversible and Reusable Morpho-Lexical Description of Romanian
# Lexicon lemmas
...
abroga * v/n/adj
...
# Lexicon stems
...
abrog v @abroga/lemmas !predcases !prefix (none) $verb1
abrog v @abroga/lemmas !nonpredcases !prefix (ne/none) $verb1
abrog n/adj @abroga/lemmas !rol(denom) !prefix (ne/none) $verb_suf4 {+part} $verb_suf_part1
...
Fig. 5. - Morphological description of a lemma and related stems.
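To make the layout of the stem entries in Fig. 5 easier to follow, here is a minimal Python sketch that splits one entry into fields. The field interpretation is an assumption read off the figure, not the documented EGLU syntax: the first token is taken as the stem, the second as a slash-separated category list, `@` as a lemma reference, `!` as a restriction (optionally followed by a parenthesised argument), `$` as a paradigm name, and `{...}` as a feature marker. The names `StemEntry` and `parse_stem_entry` are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StemEntry:
    """Hypothetical container for one stem-lexicon line from Fig. 5."""
    stem: str
    categories: list
    lemma_ref: str = ""
    restrictions: list = field(default_factory=list)
    paradigms: list = field(default_factory=list)
    features: list = field(default_factory=list)


# A restriction such as "!prefix (ne/none)" keeps its parenthesised argument,
# even when a space separates the two (assumed from the figure's layout).
TOKEN = re.compile(r"![^\s(]+(?:\s*\([^)]*\))?|\$\S+|@\S+|\{[^}]*\}|\S+")


def parse_stem_entry(line):
    """Split a stem entry into fields according to its marker characters."""
    tokens = TOKEN.findall(line)
    entry = StemEntry(stem=tokens[0], categories=tokens[1].split("/"))
    for tok in tokens[2:]:
        if tok.startswith("@"):
            entry.lemma_ref = tok[1:]
        elif tok.startswith("!"):
            entry.restrictions.append(tok[1:])
        elif tok.startswith("$"):
            entry.paradigms.append(tok[1:])
        elif tok.startswith("{"):
            entry.features.append(tok.strip("{}"))
    return entry


entry = parse_stem_entry(
    "abrog n/adj @abroga/lemmas !rol(denom) !prefix (ne/none) "
    "$verb_suf4 {+part} $verb_suf_part1"
)
print(entry)
```

Under these assumptions, the third entry of the figure yields the categories `n` and `adj`, two restrictions, two paradigm names and one feature marker.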
The nominal declension has been covered by a set of paradigms that apply to nouns, adjectives and numerals. The common noun declension has been modelled by means of 52 global inflectional paradigms: 14 for masculine nouns, 22 for feminine nouns and 16 for neuter nouns. With appropriate restrictions on the lexical entries, these paradigms may also be used for proper nouns. Additionally, there are 7 paradigms specific to proper nouns. These global paradigms result from combining 78 partial paradigms, which correspond to grammatical suffixes (providing information on gender, number, case and definiteness). There are also partial semi-lexical paradigms describing the formation of diminutives (47) and augmentatives (7), plus gender-changing suffixes (11). For the most productive proper lexical suffixes applying to nominal stems, we encoded 46 lexical paradigms. The declension of the adjectives and numerals that contextually turn into nouns or adjectives follows the patterns of the inflectional paradigms for common nouns. The other inflecting parts of speech (the article, the indefinite pronouns and adjectives, the demonstrative pronouns and adjectives) have been modelled with a smaller number of inflectional paradigms (19).
Unlike the inflectional parts of speech, the non-inflectional ones (adverbs, prepositions, conjunctions, interjections) have flat feature structures with few attributes. This is why all the information specific to a given word is written in the appropriate entry of the lemma lexicon (where the lemma coincides with the occurrence form of the word). Consequently, the paradigmatic information field is empty.
Special cases are the personal pronoun, the possessive pronoun and the possessive adjective. Although these categories are inflectional (with various forms depending on gender, number, person, and the strong-weak form distinction), they have been encoded as non-inflectional items. This choice was simply a matter of efficiency: they form a closed set, and isolating their stems turned out to be too expensive an operation.
The present morphological description has been tested against about 35,000 lexical entries covering all parts of speech of Romanian and all the paradigms.
Using the EGLU facility for morpho-lexical generation, we were able to expand the 35,000-lemma dictionary into more than 1,500,000 word-form feature structures, and we could establish a set of average values of the spanning factor (i.e. the number of word-forms a lemma stands for) for each stem, according to the grammatical category of its default thematic lemma. Considering only suffixation (both grammatical and lexical), the average spanning factors are as follows: nouns 32, verbs (except for impersonal, unipersonal and defective verbs) 78, adjectives 14. If derivations by prefixation are also taken into account, the values above must be multiplied by N+1, where N is the number of appropriate prefixes for a given stem [16].
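The N+1 rule can be illustrated with a short Python sketch. It only restates the arithmetic given in the text: the suffixation-only averages (32, 78, 14) come from the paragraph above, while the function name and the dictionary layout are hypothetical and do not reflect how EGLU computes these values.

```python
# Average spanning factors for suffixation only, as reported in the text.
AVERAGE_SPANNING_FACTOR = {"noun": 32, "verb": 78, "adjective": 14}


def spanning_with_prefixes(category, n_prefixes):
    """Return the average spanning factor once prefixation is included.

    Each of the N applicable prefixes contributes one more full set of
    word-forms on top of the unprefixed set, hence the factor N + 1.
    """
    if n_prefixes < 0:
        raise ValueError("the number of prefixes cannot be negative")
    return AVERAGE_SPANNING_FACTOR[category] * (n_prefixes + 1)


# A verb stem that accepts a single prefix: 78 * (1 + 1) word-forms on average.
print(spanning_with_prefixes("verb", 1))
```

With N = 0 the function returns the suffixation-only averages unchanged, which matches the case in which no prefix applies to the stem.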