If we add linguistic information
of this kind to a corpus, we may defeat the very objective
we wanted to achieve by using the corpus. Corpora are used for
obtaining new linguistic knowledge. But if we rely on traditional
knowledge when extracting evidence from a corpus,
we may never reach the point where we are prepared to discard old
categories and replace them with new ones confirmed by corpus evidence.
It is from looking at unannotated corpora that we have learned
a lot about the fuzziness of the concept of lexical units and
that we now see a continuum from a word segment like -euro-
via single words, multiword units, and collocations, to whole phraseologisms.
Particularly in multilingual applications, it is obvious that
the translation equivalents we are looking for are only rarely
single words and may sometimes be complete phrases or even sentences.
Any linguistic information added to a text or corpus tends to
replicate the bias inherent in the categories it is based on. Therefore, we must
be careful that standardization of linguistic features originally
not present in a text but added by hand or automatically does
not obscure the evidence we expect to obtain from corpora. Normative
so-called language-independent tagsets and categories for word
class, morphosyntactic or even semantic information, whether in
corpora or in lexicons, are useful only insofar as they can establish
comparability between resources of different origin. If these
tagsets are used for extracting evidence for linguistic phenomena
from corpora, results have to be evaluated with great care.
The creation of language resources
is expensive. The more effort has been spent on the composition,
documentation, and encoding of textual features, on the annotation
of added linguistic information, and on comparability with
related resources, the more valuable a corpus (or a lexicon) will
be. To use it for just one specific purpose would be a waste; it
should be reused as often as possible, covering a wide range of
applications. As we said above, standardization is an absolutely
necessary precondition if the resources are not only to be used
internally within one single institution but are also to be made
available to the NLP community. While in the past the exchange
of resources and resource-oriented tools was based on bilateral
agreements between provider and user, distribution centers today
serve as platforms for the language resources market. They will
be accepted only if the material they offer for distribution proves
valuable to clients; it therefore has to be fully documented
and standardized. The first such center was the Linguistic Data Consortium
(LDC) at the University of Pennsylvania, which was created more than
five years ago. Recently, the European Language Resources Association
(ELRA) was founded for the (Western) European market. It will
be complemented by national language resource centers. The Trans-European
Language Resources Infrastructure (TELRI), a Concerted Action
funded by the European Commission, is serving as a bridge between
Central and Eastern Europe (including the Commonwealth of Independent
States) on the one side and Western Europe on the other. These
distribution centers will pass on submitted resources to academic
and industrial users and will collect license fees. It is no longer
necessary to contact the institution where the resources were
created in the first place.
Such an arrangement works only
if the resources are not only documented but validated as well.
What does validation mean for written resources? For corpora,
it means that the corpus has the size it claims to have, that it is
composed and encoded as its documentation states, that
all encoded features can be used for retrieval, that the annotations
conform to a given standard, and, above all, that the error
rate for encoding and annotation does not exceed a certain level.
Validation guarantees the client that he gets what he ordered
and that he can rely on the resources to the extent stated by
the validation certificate.
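The language-independent part of such checks can be automated to a large extent. The following sketch, written in Python purely for illustration, covers only two of the criteria listed above: whether the corpus roughly has the size its documentation claims, and whether its markup parses so that encoded features remain retrievable. The file name, the declared token count, and the tolerance are invented for the example and do not come from any actual validation standard.

    # Illustrative sketch of two language-independent validation checks.
    # The file name, the declared figure, and the tolerance are hypothetical.
    import xml.etree.ElementTree as ET
    from pathlib import Path

    DECLARED_TOKENS = 10_000_000   # size claimed in the documentation
    TOLERANCE = 0.01               # accept up to 1% deviation from the claim

    def validate(path: Path) -> None:
        try:
            # a parse failure means encoded features cannot be retrieved
            tree = ET.parse(path)
        except ET.ParseError as err:
            print(f"markup check failed: {err}")
            return
        # whitespace token count over the text content, a rough proxy for size
        actual = sum(len(text.split()) for text in tree.getroot().itertext())
        deviation = abs(actual - DECLARED_TOKENS) / DECLARED_TOKENS
        verdict = "passed" if deviation <= TOLERANCE else "failed"
        print(f"markup check passed; size check {verdict} "
              f"({actual} tokens found, {DECLARED_TOKENS} declared)")

    validate(Path("corpus.xml"))   # hypothetical resource file

A real validation procedure would of course also check the composition of the corpus and the conformance of its annotations to the chosen standard; the point here is only that declared properties can be tested against the files themselves.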
Validation has to be carried out on an unbiased and neutral basis,
which means not by the institution where the resources were
created. Some validation features are language-independent,
like the conformance of the text representation with accepted
rules and standards for encoding. But some are language-specific,
e.g., a claim that a lexicon really covers the core vocabulary
of a given language, that the meanings assigned to a lemma have
been gained from corpus evidence, and that the error rate of morphosyntactic
tagging does not exceed a certain level. Therefore, it makes sense
that validation for written resources should be carried out by
institutions where there is sufficient competence for the language
in question. On the other hand, to make validation reliable, all
recognized validation centers should follow a validation procedure
prescribed by a standard that is accepted by the NLP community.
Some projects for the development of validation procedures are
under preparation; the European Commission has included validation
as a new work item in the new Telematics Programme.
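A language-specific criterion like the claimed tagging error rate can only be estimated by sampling: tokens are drawn at random from the tagged resource, re-tagged by hand by someone competent in the language, and compared with the tags in the resource. The sketch below assumes a simple list of (tag in the resource, manually verified tag) pairs and a hypothetical acceptance threshold; neither reflects any prescribed standard.

    # Hedged sketch: estimating whether the morphosyntactic tagging error
    # rate of a resource stays below a claimed level, based on a manually
    # checked random sample. Sample data and threshold are invented.

    def error_rate(pairs):
        """pairs: (tag_in_resource, tag_verified_by_hand) per sampled token."""
        errors = sum(1 for automatic, gold in pairs if automatic != gold)
        return errors / len(pairs)

    # In practice the sample would be drawn at random from the corpus and
    # re-tagged by a competent annotator; a toy sample stands in for it here.
    sample = [("NN", "NN"), ("VVFIN", "VVINF"), ("ART", "ART"), ("ADJA", "ADJA")]
    claimed_maximum = 0.03   # the "certain level" a certificate might state

    rate = error_rate(sample)
    print(f"estimated error rate: {rate:.1%}; within claim: {rate <= claimed_maximum}")

The reliability of such an estimate depends on the size of the sample and on the competence of the annotator, which is exactly why validation of this kind belongs with institutions that have sufficient competence for the language in question.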