CONCORD - Software System for Concordances of
Romanian Poetical Texts
1. Introduction
A concordance is the organization,
as a dictionary, of all the words occurring in a certain author's
work, specifying, for each word, its lemma, its relevant attributes
and information about its position in text (its contexts). The
lemmas are alphabetically ordered and under each lemma, all its
occurrences, together with their context information, are listed.
Each word is often quoted in context, showing the words in the
text which precede and follow the one in question. Such KWIC
(Key Word In Context) listings can be used
in several ways. Concordances make possible the scientific study
of the language of an author and the quick retrieval of a large
amount of morpho-lexical, historical and literary information.
There are various ways
to specify the context of a word occurrence: the word may
always be placed at the center of the context, or the text
may be segmented into a sequence of contextual units, delimited
by markers chosen by the user, such as changes of reference unit
(new line, verse, paragraph, etc.) or punctuation marks.
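The two ways of delimiting a context can be sketched as follows. This is only a minimal illustration, not part of CONCORD: the function names, the window width and the default delimiter set are assumptions chosen for the example.

```python
import re

def centered_context(tokens, index, width=3):
    """Keyword-centered context: up to `width` tokens on each side."""
    left = tokens[max(0, index - width):index]
    right = tokens[index + 1:index + 1 + width]
    return left, tokens[index], right

def segment_units(text, delimiters=r"[.;!?\n]"):
    """Segmented context: split the text into units at the chosen
    delimiters (here, some punctuation marks and line breaks)."""
    return [unit.strip() for unit in re.split(delimiters, text) if unit.strip()]
```

In the first approach every occurrence gets a context of roughly the same size; in the second, the size of the context depends on where the chosen delimiters fall in the text.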
CONCORD
is a system for developing concordances for poetical works, so
we decided that the context of a word occurrence is always the
verse in which it appears. In the electronic edition of
the concordance, the user will be able to view a dynamic
context, i.e. the verses above and below the one in which the
word actually appears; this may be useful in the case
of very short verses. In the printed edition of the concordance, however,
the context is strictly the verse within which the word actually
occurs.
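The difference between the fixed and the dynamic context can be illustrated with a short sketch. It assumes that a poem is held as a list of verses; the function name and the `extra` parameter are hypothetical and do not come from CONCORD.

```python
def verse_context(verses, i, extra=0):
    """Return verse `i` together with up to `extra` verses above and below.

    extra=0 gives the fixed context (the verse itself); a larger value
    gives a dynamic context, clipped at the boundaries of the poem.
    """
    start = max(0, i - extra)
    end = min(len(verses), i + extra + 1)
    return verses[start:end]
```

Calling `verse_context(verses, i)` returns the single verse used as the printed context, while `verse_context(verses, i, extra=1)` adds one neighbouring verse on each side where they exist.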
As a concordance is essentially
a dictionary, the stages needed to produce it are those needed
to construct a dictionary, that is:
- acquisition of documentation
(information on the given author, the texts of his works, etc.);
- preparation of the lexical
entries (lemmas, their attributes and their contexts);
- printing and publication.
The CONCORD system assists
the user in all these stages. For certain functions, such as spell-checking
the input texts and interactive lemmatization of the poetical
texts, CONCORD uses the lexical analyzer of another software system,
SILEX, a morpho-lexical system for the Romanian language,
implemented at the Centre of Text Analysis (see
[7,8,22]).
The lexical analyzer of SILEX receives as input a word
and returns its analysis, i.e. its lexico-grammatical class and
its main attributes, depending on the class: number, case, gender,
tense, person, etc.
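The kind of result such an analyzer returns can be pictured as a small record. The sketch below is an assumption made for illustration only: the `Analysis` record, the `analyze` function and the one-word toy lexicon are hypothetical and do not reproduce the actual interface or data of SILEX.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Analysis:
    lemma: str
    word_class: str        # lexico-grammatical class
    attributes: tuple = () # (name, value) pairs that depend on the class

# Toy lexicon with one Romanian homograph: "vin" is both a noun ("wine")
# and a present-tense form of the verb "veni" ("to come").
TOY_LEXICON = {
    "vin": [
        Analysis("vin", "noun",
                 (("number", "singular"), ("gender", "neuter"))),
        Analysis("veni", "verb",
                 (("tense", "present"), ("person", "1"), ("number", "singular"))),
    ],
}

def analyze(word):
    """Return every candidate analysis of `word` (empty list if unknown)."""
    return TOY_LEXICON.get(word.lower(), [])
```

A word with several entries in the list is a homograph, and its analysis cannot be settled without looking at the context in which it occurs.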
The information processed by CONCORD
is organized as databases. First of all, the information concerning
the author, his works, and the texts of his (poetical) works is
structured, coded and checked. As we do not have a scanner, the
texts are typed, so it is necessary to collate them with the reference
edition. In this stage, the user is provided with a menu which
allows the choice of various options for editing and checking
the input texts; in addition, the system validates the codes
used in the databases. When checking the correctness of the input texts,
the user may choose to activate the spell-checker of SILEX.
As a second step, each poem has
to be lemmatized, which means that for each occurrence of a word,
its lemma and its relevant attributes must be established. The
system automatically assigns the context to the lemmatized word.
This stage proceeds interactively: the user has to validate the
result returned by the lexical analyzer of SILEX and,
in the case of multiple results (for homographs), to choose the
one that fits the context.
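The lemmatization step can be sketched as a loop over the words of a verse. This is a simplified illustration under stated assumptions: `analyze` stands in for the lexical analyzer and `choose` for the user's decision, both passed in as hypothetical callbacks, and words are split on whitespace without any handling of punctuation.

```python
def lemmatize_verse(verse, analyze, choose):
    """Attach one validated analysis and the verse context to each word.

    analyze(word) -> list of candidate analyses
    choose(word, verse, candidates) -> the candidate accepted by the user
    """
    entries = []
    for position, word in enumerate(verse.split()):
        candidates = analyze(word)
        if candidates:
            # The user validates a single result or picks among several.
            chosen = choose(word, verse, candidates)
        else:
            chosen = None  # unknown word, left for manual treatment
        entries.append({
            "word": word,
            "position": position,
            "analysis": chosen,
            "context": verse,  # the context is assigned automatically
        })
    return entries
```

In an interactive session `choose` would display the verse and the candidates and wait for the user's answer; in a test it can be replaced by any function that returns one of the candidates.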
As a third step, the results obtained
after the lemmatization of all the poems of the treated volume/author
are merged, the lemmas are alphabetically ordered and for each
lemma its relative frequency and all its occurrences in texts,
together with the context information, are given. The information
obtained in this stage is saved in databases and may serve for
various queries on the lemmas occurring in the treated texts.
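The merging step can be sketched as follows. The sketch assumes that each lemmatized poem is a list of (lemma, context) pairs and that relative frequency means the number of occurrences of a lemma divided by the total number of lemmatized occurrences; both the data layout and this definition are assumptions made for the example, not taken from CONCORD.

```python
from collections import defaultdict

def build_concordance(poems):
    """Merge lemmatized poems into (lemma, relative frequency, contexts)
    entries, ordered by lemma."""
    occurrences = defaultdict(list)
    total = 0
    for poem in poems:
        for lemma, context in poem:
            occurrences[lemma].append(context)
            total += 1
    # sorted() compares strings by code point; a correct Romanian
    # alphabetical order for letters with diacritics would need a
    # locale-aware collation key instead.
    return [
        (lemma, len(contexts) / total, contexts)
        for lemma, contexts in sorted(occurrences.items())
    ]
```

Queries on the lemmas, such as listing all contexts of a given lemma or all lemmas above a frequency threshold, then reduce to simple filters over the returned list.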
The main components of the CONCORD
system are presented in the following sections.