CONCORD - Software System for Concordances of Romanian Poetical Texts

Sanda Cherata

1. Introduction

A concordance organizes, in dictionary form, all the words occurring in a given author's work, specifying for each word its lemma, its relevant attributes and information about its position in the text (its contexts). The lemmas are ordered alphabetically and, under each lemma, all its occurrences are listed together with their context information. Each word is often quoted in context, showing the words in the text that precede and follow it. Such a KWIC (Key Word In Context) listing can be used in several ways. Concordances make possible the scientific study of an author's language and the quick retrieval of a large amount of morpho-lexical, historical and literary information.

There are various ways to specify the context of a word occurrence: the word may always be placed at the center of the context, or the text may be segmented into a sequence of contextual units, identified by delimiters chosen by the user, such as changes of reference (new line, verse, paragraph, etc.) or punctuation marks.
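The two strategies can be illustrated with a minimal Python sketch. It is not taken from CONCORD; the function names, the window width and the default delimiter set are assumptions chosen for illustration.

```python
import re

def centered_context(tokens, index, width=3):
    """Keyword-centered context: up to `width` tokens on each side."""
    left = tokens[max(0, index - width):index]
    right = tokens[index + 1:index + 1 + width]
    return left, tokens[index], right

def segment_units(text, delimiters=r"[.;!?\n]"):
    """Unit-based context: split the text at user-chosen delimiters.

    The default pattern (sentence punctuation and line breaks) is only
    an example; any regular expression can be supplied.
    """
    return [unit.strip() for unit in re.split(delimiters, text) if unit.strip()]
```

With the first strategy the context of every occurrence has a fixed shape; with the second, every word in a unit shares that unit as its context.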

CONCORD is a system for developing concordances for poetical works, so we decided that the context of a word occurrence is always the verse in which it appears. In the electronic release of the concordance, the user will be able to view a dynamic context, i.e. the verses above and below the one in which the word actually appears; this may be useful in the case of very short verses. In the printed release of the concordance, however, the context is strictly the verse within which the word actually occurs.
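As a rough illustration of the difference between the fixed and the dynamic context, the following Python sketch returns the verse of an occurrence together with an optional number of neighbouring verses. The function name and the `window` parameter are hypothetical and do not describe CONCORD's actual interface.

```python
def verse_context(verses, line_no, window=0):
    """Return the verse at `line_no` plus `window` verses on each side.

    window=0 corresponds to the printed concordance (the verse alone);
    window>0 imitates the dynamic context of the electronic release.
    The slice is clamped at the beginning and end of the poem.
    """
    start = max(0, line_no - window)
    end = min(len(verses), line_no + window + 1)
    return verses[start:end]
```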

As a concordance is essentially a dictionary, the stages needed to produce it are those needed to construct a dictionary, that is:

acquisition of documentation (information on the given author, the texts of his works, etc.);
preparation of the lexical entries (lemmas, their attributes and their contexts);
printing and publication.

The CONCORD system assists the user in all these stages. For certain functions, such as spelling-checking the input texts and interactive lemmatization of the poetical texts, CONCORD uses the lexical analyzer of another software system, SILEX, a morpho-lexical system for the Romanian language, implemented at the Centre of Text Analysis (see [7,8,22]). The lexical analyzer of SILEX receives a word as input and returns its analysis, i.e. its lexico-grammatical class and its main attributes, depending on the class: number, case, gender, tense, person, etc.

The information processed by CONCORD is organized as databases. First of all, the information concerning the author, his works, and the texts of his (poetical) works is structured, coded and checked. As we do not have a scanner, the texts are typed, so it is necessary to collate them with the reference edition. In this stage, the user is provided with a menu which allows the choice of various options for editing and checking the input texts; in addition, the system validates the codes used in the databases. When checking the correctness of the input texts, the user may choose to activate the spelling-checker of SILEX.

As a second step, each poem has to be lemmatized, which means that for each occurrence of a word, its lemma and its relevant attributes must be established. The system automatically assigns the context to the lemmatized word. This stage proceeds interactively: the user has to validate the result returned by the lexical analyzer of SILEX, and, in the case of multiple results (for homographs), to choose the one validated by the context.
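The interactive loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Analysis` record, the toy English lookup table and the `analyze` and `choose` callbacks are all hypothetical, and the real SILEX analyzer and its interface are not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Analysis:
    lemma: str
    word_class: str

# Toy lookup table standing in for a lexical analyzer (hypothetical data).
TOY_ANALYZER = {
    "lakes": [Analysis("lake", "noun")],
    "saw": [Analysis("see", "verb"), Analysis("saw", "noun")],  # homograph
}

def lemmatize_verse(verse, verse_no, analyze, choose):
    """Lemmatize one verse, attaching the verse as context to each word.

    `analyze(word)` returns a list of candidate analyses;
    `choose(word, candidates, verse)` plays the role of the user, who
    validates a single result or picks one of several for a homograph.
    """
    entries = []
    for word in verse.lower().split():
        candidates = analyze(word)
        if not candidates:
            continue  # unknown word: a real system would ask the user
        selected = choose(word, candidates, verse)
        entries.append((selected.lemma, selected.word_class, word, verse_no, verse))
    return entries
```

In an interactive setting `choose` would display the verse and the candidates and wait for the user's decision; passing it as a callback keeps the loop itself independent of the user interface.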

As a third step, the results obtained after the lemmatization of all the poems of the treated volume/author are merged, the lemmas are ordered alphabetically, and for each lemma its relative frequency and all its occurrences in the texts, together with the context information, are given. The information obtained in this stage is saved in databases and may serve for various queries on the lemmas occurring in the treated texts.
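A minimal Python sketch of this merging step is given below. It assumes that each poem has already been reduced to a list of (lemma, context) pairs and that the relative frequency of a lemma is its number of occurrences divided by the total number of lemmatized occurrences; both the data layout and this definition are assumptions, not a description of CONCORD's databases.

```python
from collections import defaultdict

def build_concordance(entries_per_poem):
    """Merge per-poem (lemma, context) pairs into one ordered concordance.

    Returns a list of (lemma, relative_frequency, contexts) tuples,
    sorted by lemma.
    """
    occurrences = defaultdict(list)
    total = 0
    for entries in entries_per_poem:
        for lemma, context in entries:
            occurrences[lemma].append(context)
            total += 1
    # sorted() orders by Unicode code point; correct alphabetical order
    # for Romanian diacritics would need a locale-aware sort key.
    return [
        (lemma, len(contexts) / total, contexts)
        for lemma, contexts in sorted(occurrences.items())
    ]
```

For example, two poems contributing three occurrences in total, two of them of the same lemma, give that lemma a relative frequency of 2/3.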

The main components of the CONCORD system are presented in the following sections.