Romanian Language Technology

Dan Tufiș, Ana-Maria Barbu * A Reversible and Reusable Morpho-Lexical Description of Romanian

4. Applications of the EGLU Lexicon

Through morpho-lexical generation, we obtained more than 1,000,000 word-form feature structures from the 35,000-lemma dictionary. With a small set of filters we converted them to pairs of the form <word-form MSD>, where the MSD (Morpho-Syntactic Description) is a flat linear encoding of the corresponding feature structure. The MSD specification was a concerted effort within MULTEXT-EAST to extend the EAGLES specifications to cover several Central and Eastern European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene; for a detailed specification see http://nl.ijs.si/ME). This word-form lexicon was developed as a deliverable for the MULTEXT-EAST Copernicus Project. However, its use goes beyond that initial purpose.
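The conversion from a feature structure to a flat MSD string, and back again, can be illustrated with a minimal sketch. The attribute table below is a simplified, hypothetical subset for nouns only, and the function names `encode_msd` and `decode_msd` are invented for this illustration; the actual filters and the full MULTEXT-EAST tables are not reproduced here.

```python
# Hypothetical, simplified positional table for nouns only. The real
# MULTEXT-EAST specification defines more categories, attributes and values.
NOUN_ATTRIBUTES = [
    ("type", {"common": "c", "proper": "p"}),
    ("gender", {"masculine": "m", "feminine": "f", "neuter": "n"}),
    ("number", {"singular": "s", "plural": "p"}),
    ("case", {"direct": "r", "oblique": "o", "vocative": "v"}),
    ("definiteness", {"no": "n", "yes": "y"}),
]

CATEGORY_CODES = {"noun": "N"}
POSITIONAL_TABLES = {"noun": NOUN_ATTRIBUTES}


def encode_msd(features):
    """Flatten a feature structure (a dict) into a positional MSD string."""
    category = features["cat"]
    codes = [CATEGORY_CODES[category]]
    for attribute, values in POSITIONAL_TABLES[category]:
        value = features.get(attribute)
        # "-" marks an attribute that is not specified for this word-form.
        codes.append(values[value] if value is not None else "-")
    # Trailing unspecified positions are dropped.
    return "".join(codes).rstrip("-")


def decode_msd(msd):
    """Rebuild the feature structure from a positional MSD string."""
    category = next(
        name for name, code in CATEGORY_CODES.items() if code == msd[0]
    )
    features = {"cat": category}
    for (attribute, values), code in zip(POSITIONAL_TABLES[category], msd[1:]):
        if code != "-":
            inverse = {c: v for v, c in values.items()}
            features[attribute] = inverse[code]
    return features


if __name__ == "__main__":
    fs = {
        "cat": "noun", "type": "common", "gender": "feminine",
        "number": "singular", "case": "direct", "definiteness": "no",
    }
    print(encode_msd(fs))
```

Because each position in the string corresponds to exactly one attribute, the encoding loses no information relative to the table it was built from, which is what makes the round trip possible in this sketch.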

Within its Research Working Group, the TELRI Copernicus Concerted Action constructed a multilingual corpus consisting of translations of Plato's "Republic". The text has been segmented, lemmatized and manually morpho-syntactically annotated. We chose manual annotation because we intend to use this text (together with texts from other registers, amounting in all to almost half a million words of hand-tagged text) as a test-bed for training and evaluating a probabilistic tagger. In the manual annotation of the "Republica" text (more than 130,000 words), the EGLU morphology and lexicon again proved extremely useful. The words present in "Republica" but missing from the MULTEXT-EAST word-form lexicon (about 1,350) were added to the EGLU lexicon. For these new words we generated, as described above, new entries for the word-form lexicon. The extended lexicon was then used as the main language resource for tokenizing the text of "Republica" and for MSD tagging of each lexical token. From the lemmatized and morpho-syntactically annotated text (not only "Republica" but the whole corpus) we extracted several statistics on word frequencies, MSD ambiguity classes, etc., which we hope will guide the design of a proper tagset for automatic tagging of Romanian.
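Two of the steps described above, finding corpus words that are missing from the word-form lexicon and counting MSD ambiguity classes, can be sketched as follows. This is only an illustration under stated assumptions: the lexicon is taken to be a list of (word-form, MSD) pairs, look-up is assumed to be case-insensitive, and the function names and sample entries are hypothetical rather than taken from the actual tools or data.

```python
from collections import Counter, defaultdict


def build_index(pairs):
    """Map each word-form to the sorted tuple of MSDs it can carry."""
    index = defaultdict(set)
    for word_form, msd in pairs:
        index[word_form.lower()].add(msd)
    return {form: tuple(sorted(msds)) for form, msds in index.items()}


def missing_words(tokens, index):
    """Return the corpus tokens that have no entry in the word-form lexicon."""
    return sorted({token.lower() for token in tokens} - set(index))


def ambiguity_class_counts(tokens, index):
    """Count how often each ambiguity class (set of possible MSDs) occurs."""
    counts = Counter()
    for token in tokens:
        msds = index.get(token.lower())
        if msds is not None:
            counts[msds] += 1
    return counts


if __name__ == "__main__":
    # Illustrative entries only; the MSD strings are not checked against
    # the real lexicon.
    lexicon = [("vin", "Ncnsrn"), ("vin", "Vmip1s"), ("casa", "Ncfsry")]
    corpus = ["Vin", "casa", "vin", "cetate"]
    index = build_index(lexicon)
    print(missing_words(corpus, index))
    print(ambiguity_class_counts(corpus, index).most_common())
```

An ambiguity class here is simply the set of MSDs the lexicon allows for a word-form; classes that occur often in the corpus are natural candidates to examine when deciding which distinctions a reduced tagset should keep.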

The full word-form lexicon was the basis for building a spell-checker for Romanian within the Ispell framework for UNIX. This spell-checker works with several encodings of the Romanian diacritics: SGML entities, ISO-LAT2 8-bit codes and TeX-like sequences.
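To make the three encodings concrete, the sketch below maps each of them to a common Unicode form. The exact notations accepted by the Ispell-based spell-checker are not given in the text, so the entity names (taken from the standard ISO entity sets) and the TeX-like sequences are assumptions, only lowercase letters are covered, and the function names are hypothetical.

```python
# (Unicode character, SGML entity name, TeX-like sequence)
# Assumed notations; lowercase letters only, to keep the sketch short.
DIACRITICS = [
    ("\u0103", "abreve", r"\u{a}"),   # a with breve
    ("\u00e2", "acirc", r"\^{a}"),    # a with circumflex
    ("\u00ee", "icirc", r"\^{\i}"),   # i with circumflex
    ("\u015f", "scedil", r"\c{s}"),   # s with cedilla
    ("\u0163", "tcedil", r"\c{t}"),   # t with cedilla
]


def from_sgml(text):
    """Replace SGML entities such as &abreve; with Unicode characters."""
    for char, entity, _ in DIACRITICS:
        text = text.replace("&" + entity + ";", char)
    return text


def from_tex(text):
    """Replace TeX-like accent sequences with Unicode characters."""
    for char, _, sequence in DIACRITICS:
        text = text.replace(sequence, char)
    return text


def from_latin2(raw):
    """Decode ISO 8859-2 (ISO Latin-2) bytes into a Unicode string."""
    return raw.decode("iso8859_2")


if __name__ == "__main__":
    print(from_sgml("Tufi&scedil;"))
    print(from_tex(r"C\u{a}lin"))
    print(from_latin2(b"Mois\xe3"))
```

Normalizing every input to one internal form means the dictionary itself needs to be stored only once, whichever notation the checked text uses.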

Acknowledgments. Several former colleagues also contributed to the reported results. Among them, special thanks are due to Octav Popescu, Lidia Diaconu, Călin Diaconu, Loreta Moisă and Cosmin Popovici.