That corpora must be marked up
consistently is self-evident. All large centers for language resources
have developed or obtained corpus exploitation software that demands
explicit coding of such linguistic and extra-linguistic features.
Within the limitations imposed by the software, the centers were
free to choose their own codes. This was adequate as long as there
was little exchange of corpora and software. However, as demand
for language resources grew and a new generation of corpus software
had to be developed, one able to handle corpora of 100 million words
or more and still deliver results in interactive access within a
reasonable time, cooperation, exchange, and distribution
became more and more important.
In recent years, the Text Encoding
Initiative (TEI), an international project with strong North American
and Western European participation, has developed standards and
guidelines for the encoding of all sorts of texts (spoken and
written) as corpora to be used as language resources. Likewise,
standards and guidelines were developed for the set-up and exchange
of lexical and terminological data [4].
The TEI recommendations
are based on the SGML standard of the International Organization
for Standardization (ISO). In their present form, they demand much
extra work from the corpus compiler if they are to be followed step
by step, probably more than most institutions are willing to invest.
Therefore, European language resources projects such as the Network
of European Reference Corpora (NERC) and the first PAROLE project
singled out subsets of the TEI standards to be adopted
by any new corpus (and lexicon) project
[5]. ISO is preparing
a number of SGML-based standards for terminology databases and
the exchange of terminological data [6].
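To give a rough impression of what such SGML/TEI-style encoding looks like in practice, the following Python sketch (using the standard xml.etree.ElementTree module) builds a heavily simplified, XML-flavoured document with a TEI-style header and a marked-up text body. The reduced element subset shown here is an assumption made for illustration only and falls far short of the actual guidelines.

```python
import xml.etree.ElementTree as ET

# Minimal sketch of TEI-style corpus encoding, assuming a heavily reduced
# element set; real TEI documents require considerably richer structure.
tei = ET.Element("TEI")

# Header: bibliographic description of the encoded text.
header = ET.SubElement(tei, "teiHeader")
file_desc = ET.SubElement(header, "fileDesc")
title_stmt = ET.SubElement(file_desc, "titleStmt")
ET.SubElement(title_stmt, "title").text = "An example novel"
ET.SubElement(title_stmt, "author").text = "Jane Doe"  # invented author

# Body: explicit markup of chapters, paragraphs, and sentences.
text = ET.SubElement(tei, "text")
body = ET.SubElement(text, "body")
chapter = ET.SubElement(body, "div", type="chapter", n="1")
paragraph = ET.SubElement(chapter, "p")
ET.SubElement(paragraph, "s").text = "It was a dark and stormy night."

print(ET.tostring(tei, encoding="unicode"))
```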
For some years now, the European
Expert Group on Language Engineering Standards (EAGLES), an activity
funded by the European Commission, has been developing recommendations
for the linguistic (and to some extent also extra-linguistic)
annotation of corpora and for the data categories used in lexicons
[7]. For some application areas,
there are only preliminary reports;
in other fields, operational guidelines are already available.
Building up an infrastructure
of language resources only makes sense if these resources are
standardized. As we said before, this is the only way to ensure
the reusability of resources for unlimited applications within
the global NLP community. Standardization also helps in making
resources comparable and in building links between resources.
Only if monolingual lexicons follow the same architecture and
use the same categories is it possible to merge them into a multilingual
lexicon (see the sketch below). All alignment software for parallel corpora operates
on the assumption that all relevant phenomena are encoded consistently.
Standardization is a precondition for an operational language resources
infrastructure, and its importance can hardly be overstated. On the other
hand, it is not necessary to adopt these standards for internal
use within an institution. Whoever has assembled a consistently encoded
corpus and has developed proprietary software for corpus exploitation
is under no obligation to convert it to the TEI or EAGLES standards.
These common international standards have to be used when it comes
to data interchange. If we want to exchange or distribute our
resources or our tools, we must convert them so that they conform to,
or are at least compatible with, the narrow set of recommendations
agreed upon by the NLP community. This task can be carried out
by appropriate conversion software, some of which is offered by
companies or is available in the public domain.
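To make the point about shared categories concrete, the following sketch shows how two monolingual lexicon entries that use the same data categories can be merged mechanically into a rudimentary multilingual record. The field names, the concept identifier used as the linking key, and the entries themselves are invented for illustration and are not drawn from any of the standards discussed here.

```python
# Hypothetical monolingual entries sharing the same data categories
# (lemma, part of speech, concept identifier); all values are invented.
english = {"lemma": "language technology", "pos": "noun", "concept": "C042"}
german = {"lemma": "Sprachtechnologie", "pos": "noun", "concept": "C042"}


def merge(entries):
    """Group entries from different monolingual lexicons by their shared
    concept identifier, yielding a minimal multilingual record."""
    multilingual = {}
    for lang, entry in entries.items():
        multilingual.setdefault(entry["concept"], {})[lang] = entry["lemma"]
    return multilingual


print(merge({"en": english, "de": german}))
# {'C042': {'en': 'language technology', 'de': 'Sprachtechnologie'}}
```

If the two lexicons used diverging category inventories or incompatible architectures, no such mechanical merge would be possible; the entries would first have to be mapped onto a common scheme.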
What exactly should be standardized?
How far standardization should stretch is a hotly debated issue.
Basically, there are two fields in which standards and guidelines
are offered for corpora. The first concerns the encoding of all
relevant features of a text that we want to include in a corpus,
features that are already present, e.g., in the printed text of a novel.
We can usually identify the author(s), find the titles, and determine
whether there are chapters, paragraphs, and other layout features serving
a purpose for the text. Features like these have to be encoded.
But what about full stops? These dots at the end of a sentence
have to be preserved; do they also have to be disambiguated?
On the surface, they do not differ from the little dot at the
end of an abbreviation such as viz. or etc. There are good reasons to
disambiguate them: we have to know where a sentence starts and
ends. If we want to search for the co-occurrence of specific words
forming a phraseologism, it is useful to confine the search to
sentences as linguistic units. If we want to align parallel texts,
the minimal parameter for alignment is the sentence, marked by
the full stop and not to be confused with the little dot indicating
that the preceding string is an abbreviation (a small sketch of such
a disambiguation procedure is given at the end of this section).
But if we decide in favor of disambiguation, we add linguistic
information to the text, using our linguistic knowledge and competence.
Any kind of corpus annotation means adding even more, and potentially
more questionable, linguistic information. Admittedly, there is a good
deal of agreement among experts about whether a given word is a noun,
verb, adjective, or something else. Yet how should we tag the first
element of language technology, the English equivalent of the German
word Sprachtechnologie? Is it really a noun, or a noun
used as a modifier, like an adjective, or is it just the first
constituent of a compound?
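To illustrate what the disambiguation of full stops involves in practice, here is a minimal sketch of a rule-based sentence splitter. The abbreviation list is a tiny assumed sample; a production tokenizer would need a far more elaborate treatment (initials, ordinals, abbreviations at sentence end, and so on).

```python
# Minimal sketch: a dot ends a sentence only if the token carrying it is
# not a known abbreviation. The abbreviation list is an assumed sample.
ABBREVIATIONS = {"e.g.", "i.e.", "viz.", "etc.", "Dr.", "vol."}


def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Abbreviations, viz. short forms, end in a dot. "
                      "The sentence boundary is elsewhere."))
# ['Abbreviations, viz. short forms, end in a dot.',
#  'The sentence boundary is elsewhere.']
```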