Wolfgang Teubert: Language Resources for Language Technology
As I stated before, the linguistic knowledge available, that is, the knowledge formulated in existing grammars and dictionaries, is unsuitable for language technology tools for two reasons. First, most of it is not corpus-based; rather, it reflects the individual linguist's (or lexicographer's) competence based on a collection of data (citations), and, however large that collection may be, it is permeated by a bias that cannot be avoided. Second, traditional grammars and dictionaries have been devised for human users, and human users differ substantially from machines. Humans use inductive reasoning and can draw analogies easily; faculties like these are taken for granted and are reflected in the traditional arrangement and presentation of processed linguistic data. Language technology tools cannot have recourse to common sense. For language technology applications, all knowledge has to be spelled out in the form of rules, lists, and probabilities.
This task is sufficiently demanding
in itself. In order to carry it out, we have to go back to the
sources; and the only source for linguistic data is the corpus,
the authentic and actual texts in their unannotated representation.
But anyone who has gone to the sources has also experienced the problem that, when we start analyzing language as it occurs in a corpus, we gain evidence that shows existing grammars and dictionaries to be very unreliable repositories of linguistic knowledge.
We discover that our traditional linguistic knowledge gives us
a very biased view of language, a view that has its roots in the
contingency of over two thousand years of linguistic theorizing.
We are so accustomed to this view that we take it for the truth,
for reality, not just for an interpretation of hard data. It is
true that traditional grammars and dictionaries have helped us,
fairly satisfactorily, to overcome the linguistic problems we
humans have to deal with. But they will not suffice for language
technology applications.
This is why, however cumbersome
and expensive it may be, language has to be described in a way
that will be appropriate for language engineering. In the Council of Europe corpus project, the Multilingual Dictionary Experiment (project leader: John Sinclair; participants from Croatia, England, Germany, Hungary, Italy, and Sweden), it has been demonstrated that monolingual and bilingual dictionaries are of little or no use when it comes to automatically translating a word from one language into another in cases where there is more than one alternative. A close analysis of the problems involved in
the translation of nominalizations between German, French, and
Hungarian (also corpus-based) has shown that all the descriptions
available in traditional dictionaries and grammars are inadequate,
incomplete, and ultimately useless [1].
To reduce the cost of a corpus-based language analysis from scratch, which is indispensable, corpus exploitation tools have to be developed which arrange the hard facts (including statistics-driven devices for context analysis) and which process them (with a great deal of human intervention for the semantic interpretation of data) into algorithmic linguistic knowledge, into rules derived from objective data rather than from individual competence. Perhaps this will result in the finding that traditional categories like noun, verb, and adjective do not, after all, reflect categories useful for NLP.
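What such a statistics-driven device for context analysis might look like can be sketched in a few lines. The following Python fragment is only a sketch under the assumption of plain tokenized text: it ranks the words found near a given node word by pointwise mutual information, with the association measure, window size, and frequency threshold chosen purely for illustration rather than taken from this paper.

    import math
    from collections import Counter

    def collocates(tokens, node, window=4, min_freq=3):
        """Rank words co-occurring with `node` by pointwise mutual information."""
        word_freq = Counter(tokens)
        n = len(tokens)
        co_freq = Counter()
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            # count every word within `window` tokens of an occurrence of the node
            for j in range(max(0, i - window), min(n, i + window + 1)):
                if j != i:
                    co_freq[tokens[j]] += 1
        scores = {}
        for w, f in co_freq.items():
            if f < min_freq:
                continue
            # observed co-occurrence frequency compared with chance expectation
            expected = word_freq[node] * word_freq[w] * (2 * window) / n
            scores[w] = math.log2(f / expected)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

The ranked list is still only a hard fact; turning it into a rule, say a sense distinction tied to particular collocates, is exactly the semantic interpretation that, as noted above, still requires a great deal of human intervention.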
Corpora are the basic language
resources. They have been used by linguists for about thirty years.
We have learned that the early discussion on the representativeness
of corpora led in the wrong direction. Corpora represent nothing
but the texts they consist of, and certainly not a language universe.
The analysis of corpora has given rise to new insights, particularly
concerning the vocabulary. In the sixties, we were accustomed to thinking that we could define the core of general language as
the intersection of special languages and that such a definition
would make it possible to define a finite general language vocabulary
of perhaps 50,000 to 100,000 lemmata. Today, we are much less
sure about the usefulness of such constructs. Instead, we talk
about balanced corpora, corpora composed according to parameters
like text type (e.g., 3rd person, present tense only), genre (e.g.,
romance, instruction, free asymmetrical conversation) and domain
(e.g., angling, crime, stock exchange).
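Purely as an illustration, the composition scheme just described can be written down as a small data structure; the class and function names below are hypothetical and serve only to make the parameters text type, genre, and domain concrete.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CorpusChunk:
        text_type: str    # e.g. "3rd person, present tense only"
        genre: str        # e.g. "romance", "instruction", "free asymmetrical conversation"
        domain: str       # e.g. "angling", "crime", "stock exchange"
        tokens: tuple     # the running text of the chunk, as word tokens

    def select(corpus, **criteria):
        """Return the chunks of a balanced corpus matching the given parameter values."""
        return [chunk for chunk in corpus
                if all(getattr(chunk, key) == value for key, value in criteria.items())]

A call such as select(corpus, genre="romance", domain="angling") would then isolate one cell of the balanced design, the kind of chunk whose vocabulary the next paragraph proposes to measure for saturation.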
We know that a balanced corpus, regardless of its size, the number of parameters, and the number of values assigned to each parameter, does not represent general language; instead, we say it can be used for a number
of different purposes. It does not represent the vocabulary of
general language because a general language vocabulary is not
a meaningful concept. All we can say is that we aim at a corpus
that is 'saturated' in terms of the vocabulary. This means that
a particular chunk of our balanced corpus representing one text
type, one genre, and one domain (e.g., texts in the 1st and 3rd person, past and present tense; newspaper diary; angling) is saturated once the growth rate of the vocabulary stops decreasing and becomes constant. There is no point beyond which no hitherto unrecorded words will appear, but there is a point from which
there will be perhaps eight new words (types) for each 10,000
additional words of text (tokens). Saturation of corpora is a
fairly new concept, and no one knows yet what it leads to in terms
of corpus size.
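The saturation criterion can at least be made operational in a sketch. Assuming a chunk of tokenized text, the following fragment records how many previously unseen types each successive slice of 10,000 tokens contributes and applies a crude test for the growth rate having levelled off; the slice size and the tolerance are assumptions made only for illustration.

    def vocabulary_growth(tokens, slice_size=10_000):
        """New word types contributed by each successive slice of the chunk."""
        seen, growth = set(), []
        for start in range(0, len(tokens), slice_size):
            new_types = 0
            for tok in tokens[start:start + slice_size]:
                if tok not in seen:
                    seen.add(tok)
                    new_types += 1
            growth.append(new_types)
        return growth

    def looks_saturated(growth, tail=5, tolerance=2):
        """Crude check: the last few slices add a roughly constant number of new types."""
        if len(growth) < tail:
            return False
        last = growth[-tail:]
        return max(last) - min(last) <= tolerance

On a chunk that is saturated in the sense used above, the values returned by vocabulary_growth would eventually hover around a small constant, for instance the eight new types per 10,000 tokens given as an example.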