Romanian Language Technology

Dan Tufiș, Ana-Maria Barbu * A Reversible and Reusable Morpho-Lexical Description of Romanian

Based on the set of inflectional paradigms that resulted from several learning sessions with our PARADIGM learning system [9,10,12,13], five years ago we began an exhaustive description of Romanian morphology using a neutral and reversible language (FAVR, Flat Attribute-Value Representation [14]) that does not commit itself to a particular formalism. The rationale for this representation was to ensure not only the formal accuracy required by computer applications, but also readability and easy migration to any computational representation.

As a result of the systematic tests applied over the last two years, the implementation of the morphological model has stabilized. Some errors and omissions were detected (mostly idiosyncratic paradigms with few lexical representatives, which were often regional or bookish forms). The corrections and additions made to the EGLU implementation have also been carried over to the underlying FAVR description, which, to the best of our knowledge, constitutes the most complete computational linguistic resource for Romanian morphology. Figure 1 shows a verbal paradigm description in the FAVR model. The syntax of FAVR is described in detail in [14].

The aim of the FAVR modelling was to fully cover inflectional (grammatical) morphology. However, once we decided to migrate the FAVR descriptions into the EGLU formalism, we went further and considered the derivational (lexical) processes too. The lexical suffixes (about 600 in Romanian) and the prefixes are very productive [15]. Of these, our implementation deals only with the most common ones.

Reversibility is an important characteristic of this encoding. The same linguistic description can be used in both analysis and generation, and reversibility is preserved at every level of linguistic processing at which the linguistic information is required.
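The idea of a single description serving both directions can be illustrated with a minimal Python sketch. It is not the FAVR or EGLU implementation: the table layout, the function names `generate` and `analyse`, and the example stem "lucr" are assumptions made for illustration only. The endings are a few present-indicative entries of the verb21 paradigm in Fig. 1.

```python
# A few present-indicative endings taken from the verb21 paradigm (Fig. 1).
# The key layout (tense, plural, person) is an assumption of this sketch.
VERB21_PRESENT = {
    ("present", False, 1): "ez",
    ("present", False, 2): "ezi",
    ("present", False, 3): "ează",
    ("present", True, 2): "ați",
    ("present", True, 3): "ează",
}


def generate(stem, features, table=VERB21_PRESENT):
    """Generation: stem plus feature bundle -> word form."""
    return stem + table[features]


def analyse(form, stem, table=VERB21_PRESENT):
    """Analysis: word form -> every feature bundle whose ending matches."""
    if not form.startswith(stem):
        return []
    ending = form[len(stem):]
    return [feats for feats, term in table.items() if term == ending]


# "lucr" is a hypothetical example stem chosen for this sketch.
print(generate("lucr", ("present", False, 1)))
print(analyse("lucrează", "lucr"))
```

Both functions read the same table, so a correction to an ending is picked up by analysis and generation alike. The analysis of an ambiguous form returns all matching feature bundles, here the third person singular and plural.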

According to our description, a word is an entity composed of three types of fundamental units: the heading, the stem and the ending. While the heading and the ending of a word may be empty, the stem is always present. The heading of a word is a sequence of zero or more prefixes which change the lexical meaning of the stem. The ending of a word is made up of zero or more lexical suffixes and a grammatical suffix, which can be empty. The stem together with the first adjacent lexical suffix makes up a lexical theme. If this theme is followed by a further lexical suffix, the two make up a new theme, and so on. When the structure of a word has no lexical suffix, the stem stands for what we call the implicit lexical theme.
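The way successive lexical themes are built can be sketched in Python as follows. The class `Word`, its field names, and the placeholder morphemes are hypothetical; plain string concatenation is also an assumption, since it ignores any phonological alternation at morpheme boundaries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    stem: str                                                  # always present
    prefixes: List[str] = field(default_factory=list)          # the heading
    lexical_suffixes: List[str] = field(default_factory=list)  # part of the ending
    grammatical_suffix: str = ""                               # may be empty

    def themes(self) -> List[str]:
        """Return the lexical themes, innermost first."""
        if not self.lexical_suffixes:
            return [self.stem]  # the implicit lexical theme
        result, current = [], self.stem
        for suffix in self.lexical_suffixes:
            current += suffix   # each new suffix extends the previous theme
            result.append(current)
        return result

    def surface(self) -> str:
        """Concatenate heading, outermost theme and grammatical suffix."""
        return "".join(self.prefixes) + self.themes()[-1] + self.grammatical_suffix


# Placeholder morphemes, not real Romanian data.
w = Word(stem="S", prefixes=["p+"], lexical_suffixes=["+a", "+b"], grammatical_suffix="+g")
print(w.themes())
print(w.surface())
```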

PRDM: verb21
 [VOICE:{active reflexive}
  [FIN:+ PRD:+ 
   [VFORM:indicative
    [TENSE:present
     [PLU:- [PERS:1 TERM:"ez"] [PERS:2 TERM:"ezi"] [PERS:3 TERM:"ează"]]
     [PLU:+ [PERS:1 TERM:"am"] [PERS:2 TERM:"ați"] [PERS:3 TERM:"ează"]]]
    [TENSE:imperfect
     [PLU:- [PERS:1 TERM:"am"] [PERS:2 TERM:"ai"] [PERS:3 TERM:"a"]]
     [PLU:+ [PERS:1 TERM:"am"] [PERS:2 TERM:"ați"] [PERS:3 TERM:"au"]]]
    [TENSE:simple-perfect
     [PLU:- [PERS:1 TERM:"ai"] [PERS:2 TERM:"ași"] [PERS:3 TERM:"ă"]]
     [PLU:+ [PERS:1 TERM:"arăm"] [PERS:2 TERM:"arăți"] [PERS:3 TERM:"ară"]]]
    [TENSE:past-perfect
     [PLU:- [PERS:1 TERM:"asem"] [PERS:2 TERM:"aseși"] [PERS:3 TERM:"ase"]]
     [PLU:+ [PERS:1 TERM:"aserăm"][PERS:2 TERM:"aserăți"][PERS:3 TERM:"aseră"]]]]
    [VFORM:conjunctive TENSE:present
     [PLU:- [PERS:1 TERM:"ez"] [PERS:2 TERM:"ezi"] [PERS:3 TERM:"eze"]]
     [PLU:+ [PERS:1 TERM:"am"] [PERS:2 TERM:"ați"] [PERS:3 TERM:"eze"]]]
    [VFORM:imperative TENSE:present PERS:2 [PLU:- TERM:"ează"][PLU:+ TERM:"ați"]]]
  [FIN:- PRD:- VFORM:infinitive TENSE:present TERM:"a"]
  [FIN:- PRD:- [VFORM:supine TERM:"at"]
               [VFORM:participle
                [GEN:masculine [PLU:- TERM:"at"] [PLU:+ TERM:"ați"]]
                [GEN:feminine [PLU:- TERM:"ată"] [PLU:+ TERM:"ate"]]]
               [VFORM:gerund TERM:"ând"]]]

Fig. 1. - Inflectional paradigm in FAVR format.
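One way to read the nesting in Fig. 1 is that attributes stated on an outer bracket are shared by everything inside it, so that each innermost bracket corresponds to one flat attribute-value record. The Python sketch below follows that reading for the imperfect block only. This interpretation, the `(attributes, children)` tuple encoding and the function name `flatten` are assumptions of the sketch; the actual FAVR syntax is the one defined in [14].

```python
def flatten(node, inherited=None):
    """Expand a nested (attributes, children) node into flat records.

    Attributes of an outer node are copied into every record below it.
    """
    attrs, children = node
    merged = {**(inherited or {}), **attrs}
    if not children:
        return [merged]
    records = []
    for child in children:
        records.extend(flatten(child, merged))
    return records


# The imperfect block of Fig. 1, transcribed by hand into nested tuples.
imperfect = (
    {"VFORM": "indicative", "TENSE": "imperfect"},
    [
        ({"PLU": "-"}, [
            ({"PERS": 1, "TERM": "am"}, []),
            ({"PERS": 2, "TERM": "ai"}, []),
            ({"PERS": 3, "TERM": "a"}, []),
        ]),
        ({"PLU": "+"}, [
            ({"PERS": 1, "TERM": "am"}, []),
            ({"PERS": 2, "TERM": "ați"}, []),
            ({"PERS": 3, "TERM": "au"}, []),
        ]),
    ],
)

for record in flatten(imperfect):
    print(record)
```

Each printed record carries the full set of attributes for one ending, which is the form a lookup table for analysis or generation would need.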

The three types of morpho-lexical constituents are described in distinct "spaces", and the possible combinations of the different constituents of a word are indicated by the properties and restrictions specified at the level of every constitutive element. Together, the three informational "spaces" make up the computational lexicon of the Romanian language. The stem "space" is, of course, the largest of all and it represents what is usually called the "lexicon" or, more precisely, the "lexical stock" of a language. In the following, each "space" will be referred to as a sublexicon, and the term lexicon will be used for the entire collection defining both the morphological rules and the lexical stock.

The generic structure of a word is described by the following regular expression, where the usual conventions (e.g. Kleene's star) have been used (in addition, the empty suffix is considered a proper grammatical suffix):

(prefix)* stem (lexical_suffix)* grammatical_suffix.
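The expression above can be checked mechanically once a word has been segmented into labelled constituents. The following Python sketch assumes a hypothetical one-letter tag per constituent (P for prefix, S for stem, L for lexical suffix, G for grammatical suffix) and a segmentation supplied by the caller; neither is part of the original description. Because the empty suffix counts as a proper grammatical suffix, the G slot is always present, possibly with an empty string.

```python
import re

# (prefix)* stem (lexical_suffix)* grammatical_suffix, over one-letter tags.
WORD_PATTERN = re.compile(r"P*SL*G")


def is_well_formed(segments):
    """segments: list of (tag, text) pairs with tag in {'P', 'S', 'L', 'G'}."""
    tags = "".join(tag for tag, _ in segments)
    return WORD_PATTERN.fullmatch(tags) is not None


# Placeholder texts; only the order of the tags is checked.
print(is_well_formed([("S", "x"), ("G", "")]))
print(is_well_formed([("P", "a"), ("P", "b"), ("S", "x"), ("L", "c"), ("G", "d")]))
print(is_well_formed([("L", "c"), ("S", "x"), ("G", "")]))
```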