Svetlana Cojocaru * Romanian Lexicon: Tools, Implementation, Usage




One more difficult problem, which we can solve only partially, is the hyphenation of compound words. Compound words are of two types:

  1. words which are always written with a hyphen: baba-oarba, prim-ministru, anglo-franco-italian;
  2. words consisting of several words, which are written as a single word: feldmareșal, concertmaistru, binefacere.

We cannot yet correctly divide all the words of the second group, for lack of the necessary information. For example, the words bi-ne-fa-ce-re and bu-nă-vo-in-ță are divided correctly by our algorithm, but compound words such as feldmareșal and concertmaistru are divided incorrectly. Each word of the first group is processed as a sequence of independent words, between which the hyphen position is already known: ba-ba-oar-ba, prim-mi-nis-tru, an-glo-fran-co-i-ta-li-an.
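The treatment of the first group can be illustrated with a minimal sketch. It is not the implementation used in the project: the names `hyphenate_compound`, `lookup_hyphenate` and `KNOWN_DIVISIONS` are hypothetical, and the real syllable-division algorithm is replaced here by a small lookup table filled only with the divisions quoted above.

```python
# Stand-in for the real hyphenation algorithm (an assumption for this sketch):
# a table holding only the component divisions quoted in the text.
KNOWN_DIVISIONS = {
    "baba": "ba-ba",
    "oarba": "oar-ba",
    "prim": "prim",
    "ministru": "mi-nis-tru",
    "anglo": "an-glo",
    "franco": "fran-co",
    "italian": "i-ta-li-an",
}


def lookup_hyphenate(part):
    """Return the known division of a simple word, or the word unchanged."""
    return KNOWN_DIVISIONS.get(part, part)


def hyphenate_compound(word, hyphenate_part):
    """Divide a hyphenated compound by dividing each component separately.

    The hyphens already present in the word are kept as division points;
    every component between them is passed to hyphenate_part on its own.
    """
    components = word.split("-")
    return "-".join(hyphenate_part(component) for component in components)


if __name__ == "__main__":
    for compound in ("baba-oarba", "prim-ministru", "anglo-franco-italian"):
        print(compound, "->", hyphenate_compound(compound, lookup_hyphenate))
```

Passing the component hyphenator as a parameter keeps the compound handling independent of whichever division algorithm is actually used.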

Tests showed that 70% of the words in scientific and literary texts are divided correctly.

3.2. Checking function

The Checking function checks a word against the base and opens a dialogue if the word is not found there. The dialogue proposes the following possibilities [9]:

3.3. Function of Romanian words inflection

This function inflects Romanian words using the word-forming procedures described above (section 2.1).

Six of the eleven grammatical categories are inflected by special procedures. Nouns, adjectives, articles, numerals and pronouns are declined according to case, number and form. Verbs are conjugated according to tense, mood, person, etc. The remaining categories (adverb, preposition, conjunction, particle and interjection) are invariable; because they are not numerous, they are entered into the vocabulary directly, without passing through the inflecting procedures.

3.4. Vocabulary support function

The Vocabulary support function is carried out by a separate interactive program that maintains the vocabulary data base [8]. This program has the following functions:

These Vocabulary support functions are implemented by DLLs that provide the vocabulary data base support. The DLL does not use vocabularies. Other DLLs are of a higher level and can be called directly from applications.

Two functions need further explanation: "to compress the base" and "to make triades".

Compressing means that all words stored in the user vocabulary, which is less efficient, are moved to the main vocabulary.

"Triades" is the table of all three-letter combinations that occur in the vocabulary entries. It is organised as an array of bitmaps. We use it when generating suggestions.

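One possible layout for such a table is sketched below. The text states only that the table is an array of bitmaps. The indexing by the first two letters, the alphabet, the names `build_triades`, `has_triade` and `plausible`, and the use of the table to reject suggestion candidates are all assumptions made for this illustration.

```python
# Assumed alphabet: lowercase Romanian letters. The real table may differ.
ALPHABET = "aăâbcdefghiîjklmnopqrsștțuvwxyz"
INDEX = {letter: position for position, letter in enumerate(ALPHABET)}
SIZE = len(ALPHABET)


def build_triades(words):
    """Build an array of bitmaps from the vocabulary entries.

    There is one integer bitmap per (first, second) letter pair. Bit k is
    set when ALPHABET[k] follows that pair somewhere in the vocabulary.
    """
    table = [0] * (SIZE * SIZE)
    for word in words:
        w = word.lower()
        for a, b, c in zip(w, w[1:], w[2:]):
            if a in INDEX and b in INDEX and c in INDEX:
                table[INDEX[a] * SIZE + INDEX[b]] |= 1 << INDEX[c]
    return table


def has_triade(table, triade):
    """Tell whether a three-letter combination occurs in the vocabulary."""
    a, b, c = triade
    if a not in INDEX or b not in INDEX or c not in INDEX:
        return False
    return bool((table[INDEX[a] * SIZE + INDEX[b]] >> INDEX[c]) & 1)


def plausible(table, candidate):
    """Accept a candidate only if all its three-letter windows are known."""
    w = candidate.lower()
    return all(has_triade(table, w[i:i + 3]) for i in range(len(w) - 2))


if __name__ == "__main__":
    triades = build_triades(["binefacere", "ministru", "feldmareșal"])
    print(has_triade(triades, "bin"), plausible(triades, "binm"))
```

With this layout a lookup costs one array access and one bit test, so a suggestion generator could discard many impossible candidates before consulting the vocabulary itself.
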
"Pages" are not printed pages but those vocabulary data base pages described in [1].

Synonym and translation support is an experimental part of our project. At present it is not included in the distributed version.

The output list of words can be used later to reconstruct the whole vocabulary from scratch. We have also used it to collect some statistical information.

