Dan Tufiş, Ana-Maria Barbu * A Reversible and Reusable Morpho-Lexical Description of Romanian
Within its Research Working Group, the TELRI Copernicus Concerted
Action constructed a multilingual corpus consisting of translations
of Plato's "Republic".
The text has been segmented, lemmatized and manually morpho-syntactically
annotated. We chose manual annotation because we intend to use
this text (together with other texts from different registers,
amounting in all to almost half a million hand-tagged words) as
a test-bed for training and evaluating a probabilistic tagger.
In the manual annotation of the "Republica" text (more than
130,000 words), the EGLU morphology and lexicon again proved
extremely useful. The words missing from the MULTEXT-EAST
word-form lexicon, but present in the "Republica" (about
1350) were fed into the EGLU lexicon. For these new words we generated,
as described above, a number of new entries for the word-form
lexicon. The new lexicon was then used as the main language resource
for tokenizing the text of the "Republica" and for MSD tagging
of each lexical token. From the lemmatized and morpho-syntactically
annotated text (not only the "Republica", but the whole corpus) we
extracted several statistics on word frequencies, MSD ambiguity
classes, etc., which we hope will give hints for designing a
proper tagset for automatic tagging of Romanian.
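The statistics mentioned above can be illustrated with a minimal sketch.
It assumes the annotated corpus is available as (word form, lemma, MSD)
triples. The sample tokens, the MSD strings and the function names are
hypothetical and serve only as illustration; they are not taken from the
actual corpus or tools. An ambiguity class is taken here to be the set of
MSDs observed for one word form.

```python
from collections import Counter, defaultdict

# Hypothetical toy sample of (word form, lemma, MSD) triples.
# The MSD strings are illustrative, not taken from the real corpus.
TAGGED_TOKENS = [
    ("vin", "vin", "Ncms-n"),
    ("vin", "veni", "Vmip1s"),
    ("casa", "casă", "Ncfsry"),
    ("vin", "veni", "Vmip1s"),
    ("dreptate", "dreptate", "Ncfsrn"),
]


def word_frequencies(tokens):
    """Count how often each lower-cased word form occurs."""
    return Counter(form.lower() for form, _lemma, _msd in tokens)


def ambiguity_classes(tokens):
    """Group word forms by the sorted tuple of MSDs observed for them."""
    msds_by_form = defaultdict(set)
    for form, _lemma, msd in tokens:
        msds_by_form[form.lower()].add(msd)
    classes = defaultdict(list)
    for form, msds in msds_by_form.items():
        classes[tuple(sorted(msds))].append(form)
    return {cls: sorted(forms) for cls, forms in classes.items()}


def missing_words(tokens, lexicon_forms):
    """Return the word forms in the text that the lexicon does not list."""
    return sorted({form.lower() for form, _lemma, _msd in tokens
                   if form.lower() not in lexicon_forms})


if __name__ == "__main__":
    print(word_frequencies(TAGGED_TOKENS).most_common())
    for cls, forms in ambiguity_classes(TAGGED_TOKENS).items():
        print(cls, forms)
    print(missing_words(TAGGED_TOKENS, {"vin", "casa"}))
```

Large ambiguity classes that occur frequently are the natural candidates
to examine when deciding which MSD distinctions a reduced tagset should
keep or collapse.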
The "full words" lexicon
was the basis for building a spell-checker for Romanian within
the Ispell framework for UNIX. This spell-checker works with
several encodings for the Romanian diacritics: SGML entities,
ISO-LAT2 8-bit codes and TeX-like notation.
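To make the three encodings concrete, the following sketch converts the
lower-case Romanian diacritics between them. It is an illustration under
stated assumptions, not the spell-checker's own code: the SGML names are
the standard ISO 8879 entity names, the TeX-like forms are the usual
LaTeX accent commands (the exact notation accepted by the spell-checker
is not given here), and the function names are hypothetical. Upper-case
letters are omitted for brevity.

```python
# Lower-case Romanian diacritics (cedilla forms, as found in ISO 8859-2)
# mapped to (SGML entity, TeX-like notation). The TeX-like forms are an
# assumption; the spell-checker's own convention may differ.
DIACRITICS = {
    "\u0103": ("&abreve;", r"\u{a}"),   # a with breve
    "\u00e2": ("&acirc;", r"\^{a}"),    # a with circumflex
    "\u00ee": ("&icirc;", r"\^{\i}"),   # i with circumflex
    "\u015f": ("&scedil;", r"\c{s}"),   # s with cedilla
    "\u0163": ("&tcedil;", r"\c{t}"),   # t with cedilla
}


def to_sgml(text):
    """Replace each diacritic with its SGML entity."""
    return "".join(DIACRITICS[ch][0] if ch in DIACRITICS else ch
                   for ch in text)


def to_tex(text):
    """Replace each diacritic with a TeX-like accent command."""
    return "".join(DIACRITICS[ch][1] if ch in DIACRITICS else ch
                   for ch in text)


def to_latin2(text):
    """Encode text as ISO 8859-2 (Latin-2) 8-bit codes."""
    return text.encode("iso8859_2")


def from_latin2(data):
    """Decode ISO 8859-2 (Latin-2) bytes back into text."""
    return data.decode("iso8859_2")


if __name__ == "__main__":
    word = "\u0163ar\u0103"  # "country"
    print(to_sgml(word))
    print(to_tex(word))
    print(to_latin2(word))
```

Note that ISO 8859-2 contains only the cedilla forms of s and t; the
comma-below forms (U+0219, U+021B) cannot be encoded with it and would
need to be normalized first.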
Acknowledgments.
Several former colleagues also contributed to the results reported
here. Among them, special thanks are due to Octav Popescu,
Lidia Diaconu, Călin Diaconu, Loreta Moisă and
Cosmin Popovici.