Romanian Lexicon: Tools, Implementation, Usage

Svetlana Cojocaru

1. Introduction

This article surveys the results obtained over the last five years in computational morphology at the Institute of Mathematics of the Academy of Sciences of Moldova.

Morphological inflexion is a necessary part of creating computational lexicons. Some ideas for solving this problem for the Romanian language are described in [1,2,3]. The static method, described in [1,2], proceeds from knowledge of the base word and its inflexional group, according to the classification given in [4]. The dynamic method [3] proceeds from the base word and its morphological category (the part of speech, the gender for nouns, etc.). We discuss both methods in Section 2.

One of the problems with natural language processing software is how to integrate it into various environments and how to develop an application for a specific platform. We propose the Romanian Spelling Pack (RomPW), which is implemented under Windows as several DLLs (dynamic link libraries).

The Romanian Spelling Pack consists of the Hyphenation function, the Checking function, the Romanian word inflexion function, and the Vocabulary support function. These components are described in Section 3.

Section 4 shows the integration of RomPW into MS Word 6.0 as the word processing environment (Romanian Spelling Checker).

RomPW is still under development; the prospects for its further development are presented in the final Section 5.

2. Romanian word inflection

Because Romanian is a highly inflexional language, representing its vocabulary V compactly is a difficult problem. One well-known approach is to split the words into roots R and endings E.

Binary decompositions, in the sense defined in [1,2], can be constructed in various ways. At one extreme, R=V and all endings are empty words; at the other, there is a single root (the empty word) and all the elements of V serve as endings. If V is the vocabulary of word-forms of a language, there is some hope that a natural decomposition into E and R leads to a reasonable map f, which assigns to every root r the subset f(r) of endings that can be attached to it. This means that the list L of all the distinct values of f(r) would not be large compared to the size of V. In this case it would be sufficient to keep, for every root r, only the index of its subset f(r) in the list L. The memory needed for the vocabulary would then consist of two main parts: memory for the root set R (plus an index for every root) and memory for the list L of possible sets of endings.
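The storage scheme above can be pictured with a small sketch. It is only an illustration under stated assumptions, not the implementation from [1,2]: the function names build_decomposition and expand, and the three sample nouns, are introduced here for the example. Each root is stored once with the index of its ending set in L, and roots that inflect alike share one entry of L.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Illustrative sample only: root -> word-forms of V built on that root.
SAMPLE: Dict[str, List[str]] = {
    "pom": ["pom", "pomul", "pomului", "pomi", "pomii", "pomilor"],
    "lup": ["lup", "lupul", "lupului", "lupi", "lupii", "lupilor"],
    "cas": ["casă", "casa", "case", "casei", "casele", "caselor"],
}


def build_decomposition(
    forms_by_root: Dict[str, List[str]],
) -> Tuple[Dict[str, int], List[FrozenSet[str]]]:
    """Return (root -> index of f(root) in L, L)."""
    ending_sets: List[FrozenSet[str]] = []     # the list L
    position: Dict[FrozenSet[str], int] = {}   # ending set -> index in L
    root_index: Dict[str, int] = {}            # root r -> index of f(r)
    for root, forms in forms_by_root.items():
        # f(r): every ending e such that r + e is a word-form in V.
        endings = frozenset(form[len(root):] for form in forms)
        if endings not in position:
            position[endings] = len(ending_sets)
            ending_sets.append(endings)
        root_index[root] = position[endings]
    return root_index, ending_sets


def expand(root_index: Dict[str, int],
           ending_sets: List[FrozenSet[str]]) -> Set[str]:
    """Rebuild the full vocabulary V from the compact representation."""
    return {root + ending
            for root, idx in root_index.items()
            for ending in ending_sets[idx]}


root_index, L = build_decomposition(SAMPLE)
```

In this sample "pom" and "lup" take the same endings, so L holds two ending sets for three roots, and expand recovers all eighteen word-forms. The saving grows with the number of roots that share an inflexional pattern.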

The starting point for this approach was the book [4], in which most Romanian inflected words are classified according to the way their inflexions are formed. The book contains 100 groups of masculine nouns, 273 groups of verbs, etc., and lists about 30,000 words with their group numbers. The classification was made from a linguistic point of view and, for example, takes accents into account. Nevertheless, it proved useful and led to the idea of introducing a special grammar that formalises the production of word-forms. We present it in subsection 2.1. Using these grammar rules, we can formalise the creation of the decomposed vocabulary.

The above method of decomposition relies on knowing the morphological group of a given word. However, it must also be possible to add the set of word-forms of a new item without this knowledge. In that case the group number has to be detected dynamically.

First of all, the word-forms themselves must be obtained. A special program that facilitates this tedious work is presented in subsection 2.2.
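Once the word-forms of a new item are available, detecting its group can be pictured as a lookup against the known ending sets. The sketch below is a simplification under stated assumptions, not the algorithm of [3]: it assumes the root is the longest common prefix of the forms, and the names split_forms, detect_group and KNOWN_ENDING_SETS are hypothetical.

```python
import os
from typing import FrozenSet, List, Optional, Tuple

# Hypothetical list L of known ending sets; the position is the group index.
KNOWN_ENDING_SETS: List[FrozenSet[str]] = [
    frozenset({"", "ul", "ului", "i", "ii", "ilor"}),
    frozenset({"ă", "a", "e", "ei", "ele", "elor"}),
]


def split_forms(forms: List[str]) -> Tuple[str, FrozenSet[str]]:
    """Assumption: the root is the longest common prefix of the forms."""
    root = os.path.commonprefix(forms)
    return root, frozenset(form[len(root):] for form in forms)


def detect_group(forms: List[str],
                 ending_sets: List[FrozenSet[str]]) -> Optional[int]:
    """Return the index of the matching ending set, or None if unknown."""
    _, endings = split_forms(forms)
    for index, known in enumerate(ending_sets):
        if known == endings:
            return index
    return None


elev = ["elev", "elevul", "elevului", "elevi", "elevii", "elevilor"]
codru = ["codru", "codrul", "codrului", "codri", "codrii", "codrilor"]
group_elev = detect_group(elev, KNOWN_ENDING_SETS)
group_codru = detect_group(codru, KNOWN_ENDING_SETS)
```

Here "elev" matches the first ending set, while "codru" yields endings not present in the list and would require a new entry. A longest-common-prefix root also fails for words whose stem alternates between forms, so a practical detector would need rules beyond this sketch.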