Romanian Language Technology

Dan Tufiş, Ana-Maria Barbu * A Reversible and Reusable Morpho-Lexical Description of Romanian

2. The environment for language resource development

Based on the UD implementation [4], the group at ISSCO developed ELU in Allegro Common Lisp under SunOS 4. The initial UD environment included the basic unification machinery, the continuation-based lexicon, the parser and the generator; it was extended with a transfer module and a new lexical component based on inheritance hierarchies [5]. As a unification-based system, ELU can be characterised as weakly typed, with negation (~val) and disjunction (val1/val2/.../valn) over atomic constants, and as allowing recursive and parametric relational abstractions [6], [7]. During 1993-1994, a bilateral cooperation between ISSCO and RACAI [8] succeeded in porting the ELU environment to Macintosh computers under MCL. This port, which preserved full compatibility with the SUN version, required rewriting large parts of the code for efficiency and portability. The new code was developed so that the same files compile both under MCL on MacOS and under AllegroCL on SunOS.
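To make the negation and disjunction over atomic constants more concrete, the following Python sketch shows one way such constraints could be unified. It is an illustration only, not ELU code: the names Atomic, parse and unify_atomic are hypothetical, and the only notation taken from the text is ~val and val1/val2/.../valn. We assume that a constraint is either a set of allowed constants or a set of excluded ones.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Atomic:
    """Constraint over atomic constants (hypothetical representation).

    negated=False: the value is one of `values` (a disjunction val1/val2/...).
    negated=True:  the value is none of `values` (a negation ~val).
    """
    values: FrozenSet[str]
    negated: bool = False


def parse(text: str) -> Atomic:
    """Parse 'sg', 'nom/acc' or '~pl' into an Atomic constraint."""
    negated = text.startswith("~")
    body = text[1:] if negated else text
    return Atomic(frozenset(body.split("/")), negated)


def unify_atomic(a: Atomic, b: Atomic) -> Optional[Atomic]:
    """Return the constraint satisfied by both inputs, or None on failure."""
    if a.negated and b.negated:
        # Two exclusions combine into a larger exclusion.
        return Atomic(a.values | b.values, True)
    if a.negated:
        a, b = b, a  # make `a` the positive constraint
    # Positive vs. negative: remove excluded values; positive vs. positive: intersect.
    allowed = a.values - b.values if b.negated else a.values & b.values
    return Atomic(allowed) if allowed else None
```

Under these assumptions, unifying nom/acc with acc/dat leaves only acc, unifying sg/pl with ~pl leaves sg, and unifying sg with pl fails.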

As RACAI has made extensive use of this environment for developing the Romanian morphology and a large lexicon (see Chapter 3) meant for the public domain, we decided to make another port, this time using a public domain LISP. Our choice was CMU-LISP running under SOLARIS, probably the best freeware implementation of Common Lisp. The decision was supported not only by the efficiency of this CL implementation, but also by the API facilities offered by CMU-CL, which allow future ports to other platforms (HP, SGI, Linux, etc.) at almost no cost. The full access that CMU-CL gives to X graphical functions was another very attractive feature of this environment (we plan to add a graphical interface and a feature-structure browser). This new port, which we named EGLU (Environnement Generique Linguistique d'Unification), maintains full compatibility with the initial ELU and, thanks to conditional compilation forms, its code compiles and runs unchanged under MCL (MacOS), AllegroCL (SunOS) and CMU-CL (SOLARIS). It takes better advantage of the operating system: for instance, a command not recognized by the interpreter is sent to an operating system shell and, if the return code is an error code, a complaint is addressed to the user. Building applications and patching have been optimised (with appropriate conditional compilation forms) for CMU-CL. For instance, a dumped application initially needed 21 MB of hard disk space; after the code not needed for running EGLU applications was filtered out, the requirement decreased to 6 MB. Besides the ISO-Latin1 character set supported by the initial ELU implementation, ISO-Latin2 (covering Romanian diacritics) was incorporated.
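The shell fall-back described above can be sketched as follows. This is a minimal Python illustration of the idea, not the EGLU implementation (which is written in Common Lisp): the function make_dispatcher, the table of built-in commands and the wording of the complaint are all assumptions made for the example.

```python
import subprocess
from typing import Callable, Dict


def make_dispatcher(builtins: Dict[str, Callable[[str], str]]):
    """Build a dispatcher that tries known commands first, then the OS shell."""

    def dispatch(line: str) -> str:
        name, _, rest = line.strip().partition(" ")
        if name in builtins:
            return builtins[name](rest)
        # Not recognized by the interpreter: hand the whole line to the system shell.
        done = subprocess.run(line, shell=True, capture_output=True, text=True)
        if done.returncode != 0:
            # A non-zero return code is treated as an error and reported to the user.
            return f"Unknown command or shell error ({done.returncode}): {line}"
        return done.stdout

    return dispatch
```

A dispatcher built with a single hypothetical built-in, for example make_dispatcher({"parse": lambda s: "parsed " + s}), handles "parse casa" itself and forwards anything else, such as "echo hi", to the shell.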

3. Encoding linguistic knowledge in EGLU

Several language resources have been developed within the framework of EGLU and embedded into applications ranging from toy reversible machine translation systems to wide-coverage lexical processors. In the remainder of this paper we shall dwell on our encoding of Romanian morphology and of the associated lexicon. The description fully covers the inflectional morphology of the Romanian language and deals with the most productive lexical affixes. The lexicon contains almost 35,000 lemma entries, from which, in accordance with the morphological description, more than 1.5 million word-forms can be interpreted or generated¹.

Although not discussed here, EGLU has been used for developing a comprehensive grammar of the noun phrase in Romanian. This implementation purposely avoided generalizations which would have committed it to a specific linguistic theory. Therefore, the grammar is a gloss over the most frequent NP structures rather than an efficient description. However, being operational, it can serve as a testbed for a future HPSG-based encoding of a Romanian computational grammar.

3.1. Encoding Romanian morphology

One of the assumptions in the computational linguistics of the 1980s was that procedural formalization is the most appropriate one for describing the phonological, morphological and orthographic levels. This opinion spread thanks to the influential work of Koskenniemi and his colleagues on the two-level phono-morphological model. On the other hand, the spread of unification-based formalisms, which advocate descriptive uniformity and adequacy, declarativeness, reversibility and reusability, has naturally led to new modelling alternatives based on feature-structure theories. The feature-structure paradigmatic morphology, defined at the end of the 1980s [9,10,11,12,13], is becoming more widely adopted because, besides the advantages noted above, it blurs the distinction between inflectional and derivational morphology as far as the necessary processing tools are concerned. Another characteristic of paradigmatic morphology, deriving from the underlying unification theory, is worth mentioning: the combination of partial information provided by different knowledge sources (relevant for the morpho-lexical acquisition process, as well as for natural language parsing and generation) is strictly monotonic.
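The monotonic combination of partial information can be illustrated with a small Python sketch of feature-structure unification. It is a simplified illustration under our own assumptions, not the EGLU machinery: feature structures are modelled as nested dictionaries with atomic string values, there is no structure sharing, disjunction or negation, and the feature names (lemma, cat, agr, num, case) are hypothetical.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Unify two atoms or feature structures (nested dicts); None signals a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # copy, so the inputs are never modified
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # incompatible values for the same feature
                result[feature] = merged
            else:
                result[feature] = value  # new information is only ever added
        return result
    # Atoms (or an atom against a structure) unify only if they are equal.
    return a if a == b else None


# Three partial descriptions coming from different knowledge sources.
lexicon = {"lemma": "vin", "cat": "noun"}
paradigm = {"cat": "noun", "agr": {"num": "sg"}}
context = {"agr": {"case": "nom"}}

combined = unify(unify(lexicon, paradigm), context)
```

In this sketch the result contains everything each source contributed, it does not depend on the order in which the sources are combined, and information is never retracted: adding a conflicting description such as {"cat": "verb"} makes unification fail instead of overwriting the existing value.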

¹ In estimating this number, we counted homographs and homonyms as individual word-forms; for instance, the string "vin" counts as three word-forms: wine, (to) come_I and come_they.