Wolfgang Teubert * Language Resources for Language
Technology
John Sinclair has recently developed
a corpus typology, and I am using it in my brief account of corpus
types [2]:
- Special corpus. Special corpora are assembled for a
specific purpose, and they
vary in size and composition according to their purpose. By intention,
special corpora are not balanced (except within the scope of their
given purpose) and, if used for other purposes, give a distorted
view of the language segment. They can have a number of advantages
compared with balanced corpora. Their main advantage is that the
texts can be selected in such a way that the phenomena one is
looking for occur much more frequently in special corpora than
in a balanced corpus. A corpus that is enriched in such a way
can be much smaller (perhaps by a factor of ten) than a balanced corpus
providing the same data.
- Reference corpus. Reference corpora come closest
to the old concept of a representative
corpus. They are composed on the basis of relevant parameters
agreed upon by the linguistic community and should include spoken
and written, formal and informal language representing various
social and situational strata. The idea behind reference corpora
is that they can be used for a large variety of purposes, thus
rendering most special corpora unnecessary. They are also the
point of reference when it comes to measuring the distortion of
special corpora. They are used as benchmarks for lexicons and
for the performance of generic tools and specific language technology
applications. They are large in size; 50 million words is considered
to be the absolute minimum; 100 million will become the European
standard in a few years.
- Monitor corpus. Language changes, and these
changes should be reflected in a constant
growth rate of the corpus, leaving untouched the relative weight
of its components (i.e., the balance) as defined by the parameters.
The same composition schema should be followed year by year, the
basis being a reference corpus with texts spoken or written in
one single year.
- Opportunistic corpus. The opportunistic corpus
is an inexpensive alternative to the
reference corpus. It is a collection of electronic texts that
can be obtained, converted, and used free or at a very modest
price; and its composition principle is that one should take all
one can get and try to fill in blank spots as soon as they are
recognized. Its place is in environments where size and corpus
access do not pose a problem. The opportunistic corpus is a virtual
corpus in the sense that the selection of an actual corpus (from
the opportunistic corpus) is up to the needs of a particular project.
Today's monitor corpora are usually opportunistic corpora.
- Comparable corpus. For multilingual research and
applications, corpora in each language
are needed that follow the same composition pattern and, thus,
can be used for language comparison. Opportunistic corpora cannot
meet this requirement. The focus is, therefore, on reference corpora.
The Commission of the European Community is funding a project
whose main goal is the creation of comparable reference corpora
(of 50 million words each) for all the official languages of the
European Union including Catalan and Irish. Comparable corpora
are an indispensable source for bilingual and multilingual lexicons
and a new generation of dictionaries [3].
- Parallel corpus. Texts in one language and their
translations into other languages
constitute parallel corpora. They are the source for the detection
of translation equivalents and, thus, can play an important role
in the development of multilingual lexicons. In order to do this,
parallel corpora must be aligned at least sentence by sentence,
preferably phrase by phrase. Their disadvantage is that the language
of translations is distorted and does not contain the full range
of vocabulary and syntax. To compensate for this deficiency, one
can set up reciprocal parallel corpora, that is, corpora containing
authentic texts as well as translations in each of the languages involved.
This allows double-checking translation equivalents: a
collocation or phraseologism is counted as an acceptable equivalent
only if it also occurs in authentic texts.
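The double-checking step for reciprocal parallel corpora can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name acceptable_equivalents is hypothetical, candidate equivalents are plain strings, and "occurs in authentic texts" is reduced to a case-insensitive substring match, whereas a real system would work on lemmatized and aligned data.

```python
def acceptable_equivalents(candidates, authentic_texts):
    """Keep only candidate equivalents attested in authentic text.

    candidates: phrases proposed from the translated side of a parallel corpus.
    authentic_texts: untranslated texts in the same language.
    Matching is a naive, case-insensitive substring test (an assumption
    made for this sketch, not a recommended matching strategy).
    """
    lowered = [text.lower() for text in authentic_texts]
    return [
        phrase
        for phrase in candidates
        if any(phrase.lower() in text for text in lowered)
    ]


# Hypothetical example data: two candidate equivalents found in translations.
candidates = ["take a decision", "make a decision"]
authentic = ["They had to make a decision quickly."]
attested = acceptable_equivalents(candidates, authentic)
```

Here only the candidate that is also found in the authentic text survives the filter; the other one is treated as a possible artifact of translation.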
4. Standardization and validation
Anyone who wants to build up a
corpus has to make a number of decisions about how to encode the
text. A text is more than a stream of words; it contains much
more information. It may have one or more authors, a title, chapter
headings, tables and figures, footnotes, foreign language citations,
font shifts, paragraphs, and many other features. What should
be retained? How should it be encoded? Should one aim at reconstruction
of the source text complete with its specific layout? There are
no strict rules. Whatever we decide, we have to abide by that decision
throughout the entire corpus. If we mark footnotes in text A,
we should also mark them in exactly the same way in text B. There
must be an explicit list of the features to be encoded, and there
must be a set of unambiguous codes. Only a text collection that
has been encoded in such a consistent way should be called a corpus.
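The requirement of an explicit feature list and unambiguous codes lends itself to an automatic check. The following Python sketch assumes, purely for illustration, that features are marked with angle-bracket tags and that the agreed inventory is the hypothetical set ALLOWED_CODES; neither the tag names nor the function names come from an existing standard or tool.

```python
import re

# Hypothetical inventory of the codes agreed upon for the whole corpus.
ALLOWED_CODES = {"title", "author", "head", "note", "p"}

# Matches opening and closing tags and captures the tag name.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def unknown_codes(text, allowed=ALLOWED_CODES):
    """Return the set of tag names in text that are not in the inventory."""
    return {match.group(1) for match in TAG.finditer(text)} - set(allowed)


def inconsistent_texts(corpus, allowed=ALLOWED_CODES):
    """Map each text name to its sorted list of codes outside the inventory."""
    report = {}
    for name, text in corpus.items():
        bad = unknown_codes(text, allowed)
        if bad:
            report[name] = sorted(bad)
    return report


# Text A marks footnotes as <note>, text B uses a different code.
corpus = {
    "A": "<p>Some text<note>a footnote</note></p>",
    "B": "<p>Some text<footnote>a footnote</footnote></p>",
}
report = inconsistent_texts(corpus)
```

In this example text B is reported because it marks footnotes with a code that is not in the agreed list, which is exactly the kind of inconsistency between text A and text B described above.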
Why do we need consistency in
the creation of corpora? Corpora are raw data waiting for linguistic
analysis. To extract information from corpora, we have to use
software. The more complicated our query is, the more sophisticated
our software must be. If we want to extract all headlines of
newspaper texts containing a certain
phraseologism, we must make certain that headlines are marked
consistently.
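A query of this kind might look as follows in Python. The sketch assumes an XML-like encoding in which every headline is marked with a head element; the element names, the sample data, and the function headlines_containing are hypothetical and serve only to show why consistent marking matters.

```python
import xml.etree.ElementTree as ET


def headlines_containing(xml_text, phrase, tag="head"):
    """Return the text of all elements named tag that contain phrase.

    The comparison is case-insensitive and whitespace is normalized.
    The default tag name "head" is an assumption of this sketch.
    """
    root = ET.fromstring(xml_text)
    needle = phrase.lower()
    hits = []
    for element in root.iter(tag):
        text = " ".join("".join(element.itertext()).split())
        if needle in text.lower():
            hits.append(text)
    return hits


# Hypothetical sample: the third headline is not marked as <head>.
sample = (
    "<corpus>"
    "<text><head>Talks break the ice</head><p>The talks break the ice.</p></text>"
    "<text><head>Markets fall</head><p>Prices dropped.</p></text>"
    "<text><p>HEADLINE: Leaders break the ice</p></text>"
    "</corpus>"
)
found = headlines_containing(sample, "break the ice")
```

The third text contains the phrase in what is evidently a headline, but because it was not encoded as one, the query cannot retrieve it. Body text containing the phrase is correctly ignored.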