Romanian Language Technology

Dan Tufiș, Ana-Maria Barbu * A Reversible and Reusable Morpho-Lexical Description of Romanian

Describing a prefix (from the morpho-lexical point of view) is relatively simple: for each prefix, one specifies the grammatical category or categories of the stems it can attach to, together with a set of restrictions that rule out illegal combinations. Suffixes are described through lexical or grammatical global paradigms. The global grammatical paradigms are composed of partial paradigms, which encode all the grammatical suffixes along with the appropriate information and restrictions. The partial paradigms capture partial regularities in the declension or conjugation of the inflecting parts of speech. Finally, each stem is associated with one or more feature-structures providing context-free grammatical and lexical information: the grammatical category or categories, the lemma form (associated with the implicit theme), and the restrictions on affixation (lexical or grammatical). In addition, the implicit themes provide syntactic information (for instance, subcategorisation lists) and semantic information (theta-role structures, selectional restrictions).

The feature-structure for a word-occurrence, that is, the output of the morpho-lexical analysis or the input expected by the morpho-lexical generator, contains information congruently provided by the heading, the stem and the ending. The congruency of this information is ensured through feature-structure unification.
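
To make the role of unification concrete, the following is a minimal sketch, not the EGLU mechanism itself. It assumes feature-structures are nested Python dictionaries with atomic values at the leaves and ignores reentrancy (shared variables); the feature names `cat`, `lemma`, `tense` and `agr` are hypothetical.

```python
def unify(a, b):
    """Return the unification of two feature-structures, or None on a clash."""
    # If either side is atomic, the two values unify only when they are equal.
    if not isinstance(a, dict) or not isinstance(b, dict):
        return a if a == b else None
    result = dict(a)  # shallow copy, so the inputs are never modified
    for key, b_value in b.items():
        if key in result:
            merged = unify(result[key], b_value)
            if merged is None:
                return None  # incompatible values for the same feature
            result[key] = merged
        else:
            result[key] = b_value
    return result


# Hypothetical information contributed by a stem and by an ending.
stem_info = {"cat": "verb", "lemma": "cânta"}
ending_info = {
    "cat": "verb",
    "tense": "past-perfect",
    "agr": {"number": "singular", "person": "1"},
}
word_info = unify(stem_info, ending_info)
```

In this sketch a stem marked as a verb combines with a verbal ending, while `unify(stem_info, {"cat": "noun"})` returns `None`, which is how an incongruent stem and ending pair would be rejected.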

As previously mentioned, in the EGLU implementation the Romanian lexicon is a finite-state automaton whose states contain information (equations and relational abstractions) and whose transitions are labelled with strings of characters corresponding to word segments. Within a state, successor states are indicated with the special character '$'. A lexicon is thus distributed among a number of 'sublexicons', some of which are starting points for analysis and generation, while others are 'continuation classes'.
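
The organisation into sublexicons and continuation classes can be illustrated with a small sketch. It is an approximation under stated assumptions rather than the EGLU data format: the dictionary layout, the sublexicon names `verb_stems` and `pluperfect_endings`, the attached feature names and the stem "cânt" are all hypothetical.

```python
# Toy lexicon: each sublexicon maps a word segment to a pair
# (information attached to the segment, names of continuation classes).
# An empty continuation list marks a point where a word may end.
LEXICON = {
    "verb_stems": {
        "cânt": ({"cat": "verb"}, ["pluperfect_endings"]),
    },
    "pluperfect_endings": {
        "asem": ({"number": "singular", "person": "1"}, []),
        "ase": ({"number": "singular", "person": "3"}, []),
    },
}


def analyse(word, sublexicon="verb_stems"):
    """Return every segmentation of `word` licensed by the toy lexicon."""
    analyses = []
    for segment, (info, continuations) in LEXICON[sublexicon].items():
        if not word.startswith(segment):
            continue
        rest = word[len(segment):]
        # The word is accepted when nothing is left and the state is final.
        if not rest and not continuations:
            analyses.append([(segment, info)])
        # Otherwise follow each continuation class with the remaining string.
        for next_sublexicon in continuations:
            for tail in analyse(rest, next_sublexicon):
                analyses.append([(segment, info)] + tail)
    return analyses
```

Calling `analyse("cântasem")` walks from the starting sublexicon into the continuation class and returns a single segmentation, stem followed by ending; a bare stem such as "cânt" yields no analysis because its state is not final in this toy lexicon.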

After analysing the specific attributes of each part of speech, we adopted 20 grammatical categories for the morpho-syntactic description of Romanian. The classification took into account not only the requirements of morpho-lexical processing, but also the granularity needed for syntactic parsing and generation. The grammatical and derivational morphology of Romanian is specified by means of several global paradigms, each of them a combination of partial paradigms.

Verb morphology is encoded by means of 48 global grammatical paradigms (see Figure 2), three of them specific to the auxiliaries. A global paradigm is made up of several partial paradigms, each corresponding to a simple mood and tense (there are 107 such partial paradigms). Figure 3 shows the encoding of one of the six partial paradigms corresponding to the indicative past-perfect. Besides the grammatical verbal paradigms (exhaustively encoded), we considered the 27 most frequent and productive lexical paradigms, which attach lexical suffixes to verbal stems. These suffixes change the grammatical category of the verb, yielding nouns, adjectives and adverbs. Figure 4 shows such a lexical paradigm, which derives a noun from a verbal stem.

# paradigm verb1
- !Verb !type(main)
  $indic_prez_1
  $indic_imperf_1
  $indic_perfsim_1
  $indic_mmcperf_1
  $conj_prez_1
  $imper_prez_1
  $infin_prez_1
  $part_1
  $gerund_2
Fig. 2. - Global verbal paradigm.

# paradigm indic_mmcperf_1
- !VTensed(past-perfect,indicative)
  asem v {+past} !my_Vagr(singular,1,_)
  aseși v {+past} !my_Vagr(singular,2,_)
  ase v {+past} !my_Vagr(singular,3,_)
  aserăm v {+past} !my_Vagr(plural,1,_)
  aserăți v {+past} !my_Vagr(plural,2,_)
  aseră v {+past} !my_Vagr(plural,3,_)
Fig. 3. - Partial verbal paradigm.
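
As an illustration of how a partial paradigm such as the one in Fig. 3 can be used for generation, here is a minimal sketch. The endings and agreement values are transcribed from Fig. 3; the tuple layout, the `generate` helper and the stem "cânt" (from the verb "a cânta") are assumptions made for the example, and the `{+past}` restriction is not modelled.

```python
# Endings of the partial paradigm indic_mmcperf_1 with their agreement values
# (number, person), transcribed from Fig. 3.
INDIC_MMCPERF_1 = [
    ("asem", "singular", 1),
    ("aseși", "singular", 2),
    ("ase", "singular", 3),
    ("aserăm", "plural", 1),
    ("aserăți", "plural", 2),
    ("aseră", "plural", 3),
]


def generate(stem, paradigm):
    """Attach every ending of a partial paradigm to the given stem."""
    return {
        (number, person): stem + ending
        for ending, number, person in paradigm
    }


# "cânt" is an assumed example stem, not taken from the paper.
forms = generate("cânt", INDIC_MMCPERF_1)
```

Because the same table can be read from ending to agreement values as well as the other way round, a description of this kind serves both analysis and generation, which is the sense in which the description is reversible.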

# paradigm verb_suf4
!Encl(_,imperson) !common !rol(denom)
ar n {+a}{-b} $nom_fem8
ăr n {-a}{+b} $nom_fem8
Fig. 4. - A lexical paradigm for verbal stems.
