A Reversible and Reusable Morpho-Lexical Description of Romanian

Dan Tufiș, Ana-Maria Barbu

1. Introduction

Constructing a natural language dictionary and/or a grammar for computational use is a far-reaching project that requires substantial human and material resources. The spread of lexical approaches in natural language modelling gives the dictionary an essential role in every system for automatic natural language processing. More and more information that was traditionally encoded by grammar rules (syntax) now enters the dictionary. Thus, specific linguistic knowledge is encapsulated in lexical structures, while the syntax encodes only generic rules and conceptual restrictions, which are instantiated through specific lexical elements. Modern linguistic theories, such as Lexical-Functional Grammar (LFG), Categorial Grammars (CG), Head-driven Phrase Structure Grammar (HPSG) or Lexicalized Tree Adjoining Grammar (LTAG), emphasise the essential contribution of lexical specifications to defining and representing grammatical restrictions.

Research and development to date in natural language processing (NLP), carried out in the context of unification-based formalisms, together with the wide acceptance of these approaches, has made the lexicon one of the central topics of the 1990s. This is not at all surprising: any attempt to construct something more than a toy NLP system immediately calls for a dictionary with wide linguistic coverage and, implicitly, for huge material and human resources. Theoretical linguistics faces no problem of this kind, as it uses only a few example lexical entries to argue for one theory or another. Linguistics in general, and computational linguistics in particular, are far from a methodological consensus that would allow a single theory or formalism to be chosen for encoding the linguistic knowledge an NLP system needs. It is therefore very important that the criterion of linguistic reusability should govern language description. In other words, easy "migration" (automatic, if possible) of a linguistic description from one formalism to another is a goal whose failure can breed "conservatism", so that conceptual progress in linguistic theory is missed.

For a long time, NLP practitioners treated parsing and generation as distinct and independent processes, with different tools, techniques and methodologies. Consequently, the linguistic knowledge required by these two modes of natural language processing has mostly been designed and implemented unidirectionally, incorporating procedural knowledge sensitive to the processing direction (analysis or generation).

In the mid-1980s, successful research was carried out on a uniform treatment of the two processes and on unifying the knowledge representation for both [1, 2, 3]. Research into machine translation proved very productive for the linguistic resource technology of the 1990s, which could be called reversibility technology. Linguistic descriptions become reversible through the adoption of declarative formalisms, mainly those based on the unification of attribute-value structures.
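The link between unification and reversibility can be illustrated with a minimal sketch. It is not taken from this paper: the function `unify`, the attribute names (`form`, `lemma`, `cat`, `agr`, `def`) and the single lexical entry are assumptions chosen for illustration, and real unification formalisms also handle reentrancy and variables, which are omitted here. The point is only that one declarative entry serves both analysis and generation, depending on which attributes the query supplies.

```python
def unify(a, b):
    """Unify two attribute-value structures given as nested dicts.

    Returns the merged structure, or None when two atomic values clash.
    (Simplified: no reentrancy, no variables.)
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # clash somewhere below this attribute
                result[key] = merged
            else:
                result[key] = b_val
        return result
    # Atomic values (or a dict against an atom) unify only if they are equal.
    return a if a == b else None


# Hypothetical lexical entry: "casele", definite plural of "casă" (house).
ENTRY = {
    "form": "casele",
    "lemma": "casă",
    "cat": "noun",
    "agr": {"num": "pl", "gen": "fem"},
    "def": "yes",
}

# Analysis: the word form is known, the description is recovered.
analysis = unify({"form": "casele"}, ENTRY)

# Generation: lemma and features are known, the word form is recovered.
generation = unify({"lemma": "casă", "agr": {"num": "pl"}, "def": "yes"}, ENTRY)

# Incompatible request: a singular reading of a plural form fails.
clash = unify({"form": "casele", "agr": {"num": "sg"}}, ENTRY)
```

Because the entry contains no procedural information about processing direction, the same call answers both kinds of query; only the partial structure supplied by the caller differs.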

Given the rapid technological and conceptual advances of recent years, the reusability of linguistic knowledge is a natural requirement. If one compares the effort invested in implementing a processing environment with the time spent on a wide-coverage language description (according to some estimates the ratio is at least 1:100), the reusability criterion in encoding language descriptions should normally prevail over efficiency. A linguistic description closely tied to a particular formalism, however competitive, runs the risk of partial or even complete reformulation if the programming environment has to be changed. This is the first aspect of reusability. A second, subtler aspect concerns caution in making fundamental decisions about the modelling of linguistic phenomena on the basis of a formalism or linguistic theory that happens to be in fashion at a given time. Although more difficult to cope with, this aspect of reusability can be addressed by localising very precisely all the elements specific to a given formalism or linguistic theory.