Wolfgang Teubert * Language Resources for Language
Technology
Validation is important for generic
resources designed for reusability. These resources will always
be necessary for language technology. However, the more sophisticated
the applications being designed, the more specific the language
resources backing them will have to be. These resources will have
to be produced either from scratch or by adding value to existing
generic resources. In either case, specific resources of this
kind will probably not be distributed by large centers, but produced
by one institution and then passed on directly to the user
under a bilateral agreement. This trend will lead to a reassessment
of the validation issue in the long run.
5. The Trans-European Language Resources Infrastructure (TELRI)
The boom in the language industry
has brought with it a growing demand for more and better monolingual
and multilingual resources. In Europe, only a joint effort of
existing focal language institutions could be expected to harmonize
existing resources and design standardized new ones in compliance
with the needs of dictionary makers and developers of language
technology applications.
The European Commission, realizing
the central role of language engineering in the emergent information
and communication technology market, has supported a number of
relevant infrastructure activities, serving the needs of the three
'colleges': speech (spoken language), terminology, and written
resources. These projects helped to set up a common infrastructure
for the countries of the European Union and the European Economic
Area (formerly EFTA), and at the same time encouraged the formation
of national language resources networks. After years of preparation,
the new PAROLE II project with partners from all European Union
countries will produce a first generation of harmonized, comparable
generic textual and lexical reusable resources, meeting the basic
demands of language technology.
But Europe is larger than the
European Union. All European countries must be given the opportunity
to participate on an equal level in academic and industrial research
and development. In the COPERNICUS Programme, the European Commission
provided a framework of projects aiming at the integration of
activities in Central and Eastern Europe with complementary ones
in Western Europe. Several projects currently underway deal with
various aspects of speech, terminology, and written resources.
One of these projects dealing primarily with written resources
is the Trans-European Language Resources Infrastructure (TELRI)
[8]. Since it aims at including as many partners in Central and
Eastern Europe as possible, it is set up as a Concerted Action
rather than as a project proper. Its partners are 22 focal language
and language technology institutions in 17 countries, six Western
European and 11 Central and Eastern European countries. The partners
in the West are also cooperating in the PAROLE projects, thus
closely linking TELRI activities with Western European developments.
For the time being, there are no partners from former Yugoslavia
(with the exception of Slovenia) or from the Commonwealth of Independent
States. However, formalized links have been established with the
leading institutions in Croatia, Serbia, and Russia; and these
associated partners are participating in TELRI activities.
The Concerted Action TELRI has
an initial duration of three years, beginning in early 1995, and
is working on a budget of about half a million ECU. It is not
a research project: rather, its goal is to create a viable infrastructure
in order to establish a permanent platform for industry, research
institutes and universities, and to supply the NLP community with
precompetitive or public domain monolingual and multilingual language
resources. These resources are: corpora, machine-readable dictionaries
and lexicons, lexical databases, and software tools for the creation,
reuse, maintenance, valorization, and exploitation of linguistic data.
The activities of TELRI are organized in Working Groups for specific
tasks. The collection, documentation, and dissemination of relevant
information on language resources, providers and users, their
potentials, and their needs is a basic activity. TELRI will promote
the formation of national language resource networks, and TELRI
partners will act as focal nodes. They will also design small-scale
joint ventures with private industry in order to foster
cooperation between academic research and development. TELRI will
pool and enhance existing service activities, providing resources,
expertise, consulting and training facilities. The central platform
will be annual seminars directed at the needs of small- and medium-sized
enterprises. TELRI will engage in European and global standardization
and validation activities and contribute to the harmonization
of already existing resources. It is organizing joint research
in the field of corpus-based multilingual lexicography and the
use of parallel aligned texts.