Romanian Language Technology

Dan Tufiș, Ana-Maria Barbu * A Reversible and Reusable Morpho-Lexical Description of Romanian

Recall the monotonic character of unification in EGLU discussed earlier. Because of it, in order to deal with suffixes that change the category, the category cannot be fixed in the entry of a lemma that may take a lexical suffix. The best that can be done is to specify a disjunctive set of possible grammar categories and let the affix (be it grammatical or lexical) decide on the final category. This is exactly how we solved the problem. When the grammar category is left ambiguous as an ordered disjunction, the first category in the disjunction specifies the default category, which we call the thematic grammar category. The next Section follows more closely how this works; here we make some preliminary remarks.

The lemmas are collected in a separate lexicon (our implementation splits the lemmas lexicon into several files). The lemmas lexicon contains the basic information shared by every inflectional or derivational word-form related to a given lemma (this is the place for lexical semantics). A different lexicon, containing stems, describes the morphology of each lemma. The coreference between a lemma and a stem is explicitly given in the stems lexicon. In Figure 5, the notation @abroga/lemmas is to be read as "unify the feature structure of the current entry with the feature structure that corresponds to the entry abroga in the lemmas lexicon". The lemma abroga (thematically, a verb) has two stems, a verbal one (generating verbs) and a nominal one (generating either nouns or adjectives). The verbal stem had to be duplicated in order to distinguish the predicative forms, which do not take the negative prefix, from the non-predicative forms, which may take it (for a detailed description of the EGLU syntax see [7] and [8]).

# Lexicon lemmas
...
	abroga  * v/n/adj
...
# Lexicon stems
...
abrog	v	@abroga/lemmas !predcases !prefix (none) $verb1
abrog	v	@abroga/lemmas !nonpredcases !prefix (ne/none) $verb1
abrog	n/adj	@abroga/lemmas !rol(denom) !prefix (ne/none) $verb_suf4 {+part} $verb_suf_part1
...

Fig. 5. - Morphological description of a lemma and related stems.
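The interplay between the lemma entry and its stems in Figure 5 can be illustrated with a minimal Python sketch. This is not EGLU code: it assumes, for illustration only, that a category disjunction such as v/n/adj can be modelled as an ordered list and that unifying two disjunctions amounts to an order-preserving intersection. The names `unify_category`, `resolve`, `LEMMAS` and `STEMS` are hypothetical, and the paradigm references ($verb1 and so on) are left out.

```python
# Hypothetical sketch: category disjunctions as ordered lists.
# Unification is monotonic: it can only narrow a disjunction, never overwrite it.

def unify_category(lemma_cats, stem_cats):
    """Keep the lemma categories that the stem also allows, in lemma order."""
    common = [c for c in lemma_cats if c in stem_cats]
    if not common:
        raise ValueError("unification failure: no common category")
    return common

# Simplified transcription of Figure 5 (paradigm references omitted).
LEMMAS = {"abroga": {"cat": "v/n/adj".split("/")}}

STEMS = [
    {"stem": "abrog", "cat": ["v"], "lemma": "abroga", "prefix": ["none"]},
    {"stem": "abrog", "cat": ["v"], "lemma": "abroga", "prefix": ["ne", "none"]},
    {"stem": "abrog", "cat": ["n", "adj"], "lemma": "abroga", "prefix": ["ne", "none"]},
]

def resolve(stem_entry, lemmas):
    """Mimic '@lemma/lemmas': merge the lemma information into the stem entry."""
    lemma = lemmas[stem_entry["lemma"]]
    merged = dict(stem_entry)
    merged["cat"] = unify_category(lemma["cat"], stem_entry["cat"])
    # The first category of the lemma disjunction is the thematic (default) one.
    merged["thematic_cat"] = lemma["cat"][0]
    return merged

nominal = resolve(STEMS[2], LEMMAS)
print(nominal["cat"], nominal["thematic_cat"])  # ['n', 'adj'] v
```

In this simplified picture the nominal stem still carries the disjunction n/adj after unification with the lemma; the choice between noun and adjective is left to the affix, as described above.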

The nominal declension is covered by a set of paradigms that apply to nouns, adjectives and numerals. The common noun declension has been modelled by means of 52 global inflectional paradigms: 14 for masculine nouns, 22 for feminine nouns and 16 for neuter nouns. With appropriate restrictions on the lexical entries, these paradigms may also be used for proper nouns. Additionally, there are 7 paradigms specific to proper nouns. The global paradigms result from combinations of 78 partial paradigms, which correspond to grammatical suffixes (providing information on gender, number, case and definiteness). There are also partial semi-lexical paradigms describing the formation of diminutives (47) and augmentatives (7), plus gender-changing suffixes (11). For the most productive proper lexical suffixes applying to nominal stems, we encoded 46 lexical paradigms. The declension of the adjectives and numerals, which contextually turn into nouns or adjectives, follows the patterns of the inflectional paradigms for common nouns. The other inflecting parts of speech (the article, the indefinite pronouns and adjectives, the demonstrative pronouns and adjectives) have been modelled with a smaller number of inflectional paradigms (19).

Unlike the inflectional parts of speech, the non-inflectional ones (adverbs, prepositions, conjunctions, interjections) have flat feature structures with few attributes. For this reason, all the information specific to a given word is written in the appropriate entry of the lemmas lexicon (where the lemma coincides with the occurrence form of the word), and the paradigmatic information field is empty.

Special cases are the personal pronoun, the possessive pronoun and the possessive adjective. Although these categories are inflectional (their forms vary with gender, number, person and the strong-weak distinction), they have been encoded as non-inflectional items. This choice was simply a matter of efficiency: they form a closed set, and isolating their stems turned out to be too expensive an operation.

The present morphological description has been tested against about 35,000 lexical entries covering all parts of speech of Romanian and all the paradigms.

Using the morpho-lexical generation facility of EGLU, we were able to expand the 35,000-lemma dictionary into more than 1,500,000 word-form feature structures. This allowed us to establish average values of the spanning factor (i.e. the number of word-forms a lemma stands for) for each stem, according to the grammar category of its default thematic lemma. Considering only suffixation (both grammatical and lexical), the average spanning factors are as follows: nouns 32, verbs (except for impersonal, unipersonal and defective verbs) 78, adjectives 14. If derivations by prefixation are also taken into account, these values must be multiplied by N+1, where N is the number of prefixes appropriate for a given stem [16].
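The effect of prefixation on the spanning factor can be made concrete with a short Python sketch. It only restates the arithmetic given above; the function name `spanning_with_prefixes` and the dictionary `AVERAGE_SUFFIX_SPANNING` are hypothetical names introduced for illustration, and the numbers of prefixes used in the calls are example values, not figures from the lexicon.

```python
# Average spanning factors for suffixation only, as reported in the text.
AVERAGE_SUFFIX_SPANNING = {"noun": 32, "verb": 78, "adjective": 14}

def spanning_with_prefixes(category, n_prefixes):
    """Average number of word-forms per stem when each of the n_prefixes
    appropriate prefixes (plus the unprefixed form) is combined with
    every suffixed form: suffix-only factor times (N + 1)."""
    if n_prefixes < 0:
        raise ValueError("number of prefixes must be non-negative")
    return AVERAGE_SUFFIX_SPANNING[category] * (n_prefixes + 1)

# Example values of N chosen for illustration only.
print(spanning_with_prefixes("verb", 0))       # 78
print(spanning_with_prefixes("noun", 1))       # 64
print(spanning_with_prefixes("adjective", 2))  # 42
```

With N = 0 the suffix-only average is recovered, while a noun stem that accepts a single prefix would, on average, stand for 64 word-forms.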