Wolfgang Teubert * Language Resources for Language
Technology
In most European countries, we find one or more focal academic
institutions dedicated to documenting, analyzing, and describing the
national language(s) and to conveying linguistic knowledge to
the public. Language was regarded as the paramount cultural asset
responsible for ethnic or national identity. In all research work,
the main emphasis was, therefore, on the national language; it was important
to demonstrate that it was as well researched as any
other major world language. Communication across borders, from
language to language, if encouraged at all, was rarely a topic
for these institutions. The transnational flow of information was limited
by the number of translators and translations, which served as a useful
filter of incoming and outgoing information.
Today, the situation has changed: European integration is now
imperative to our societies. A precondition is unrestricted
access to information. National economies can successfully
compete on the global market only if all relevant data are available.
However, although it is relatively easy to obtain all the information
wanted, it is usually written in a foreign language, unless you
happen to speak English or perhaps French as your mother tongue.
So, the large majority of people who do not speak a foreign language
are put at a disadvantage. German companies regularly complain
that calls for tender issued by the Commission of the European
Communities are initially, and at times only, published in French
and English: this makes it much more difficult for them to compete.
International TV channels first thought that they could reach
their audience by transmitting their programmes in English; however,
programme sponsors now insist on subtitles or dubbing in the local
languages. Most viewers prefer programmes in their native language
even if they understand English.
Language, once a cultural asset,
has now become an economic commodity. The focal national language
institutes have acquired new responsibilities. Now, their task
is to provide the means by which information from abroad can be made
available in the national language and locally produced information
can be distributed world-wide in major international languages
and in the languages of the neighboring countries. Training and
employing more translators is not sufficient. We must also ensure
that the necessary language technology is developed.
2. Current issues in Language Engineering
Spelling checkers were among the
earliest successful language technology applications. They have
been accepted as useful devices and are still being sold today
in ever-improved versions. More ambitious projects, on the other hand,
have often failed. The majority of early machine translation systems,
particularly the more sophisticated ones, have not survived. SYSTRAN
is still kept alive by the European Commission's translation services,
but many others have disappeared without leaving a trace.
Spelling checkers do not need semantics. Even in the seventies,
they were based on little more than a list of the most frequent
word forms, i.e., linguistic knowledge widely available or easy
enough to generate from corpora. Machine translation, however,
needs semantics. Our understanding of word meanings or lexical
semantics in the seventies was contained in dictionaries, and
it was arranged in a form that an experienced human user
could understand by drawing on a great deal of implicit knowledge about
the world, on inductive reasoning, and on the ability to draw
analogies: all assets computers usually do not possess. Therefore,
it is no surprise that early applications involving semantics
were not very successful.
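The word-list principle behind such early spelling checkers can be illustrated with a minimal sketch. It assumes nothing more than a plain-text corpus from which the list of word forms is generated; the function names `build_word_list` and `unknown_forms` and the one-sentence toy corpus are invented for illustration and do not describe any particular product.

```python
import re
from collections import Counter

WORD = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware

def build_word_list(corpus_text, min_count=1):
    """Collect the word forms that occur at least min_count times in a corpus."""
    counts = Counter(WORD.findall(corpus_text.lower()))
    return {form for form, n in counts.items() if n >= min_count}

def unknown_forms(text, word_list):
    """Return the word forms in text that are absent from the word list."""
    return [form for form in WORD.findall(text) if form.lower() not in word_list]

# Invented toy corpus, for illustration only.
corpus = "the snail eats the leaf and the slug eats the salad"
word_list = build_word_list(corpus)
flagged = unknown_forms("The snial eats the salad", word_list)
print(flagged)
```

Here the misspelt form snial is flagged simply because it is not in the list. No knowledge of meaning is involved, which is why such a checker would also accept a correctly spelt word used in the wrong place.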
Today, we know that the semantic
data needed for sophisticated language technology cannot (or only
to a very small extent) be derived from dictionaries designed
for human users. Rather, it has to be generated from scratch,
namely, on the basis of corpora. Language technology can work
with two kinds of semantic information. One is rule-based
and presupposes an intellectual semantic analysis of the phenomenon
in question. The other kind uses statistics and is not really
semantic at all: it records the five or ten words or word forms
preceding or following the word in question and relates this information
to the different translation equivalents found in parallel corpora.
The German word Schnecke, for example, is translated into English
as either snail or slug, where snail refers
to the creature with a 'house' and slug to the one without. The
rule-based approach states just that and searches the German text
for clues from which we can infer the correct translation equivalent.
The statistics-based approach does not look at the meaning at all.
It looks for words and other traces that frequently occur when
Schnecke is translated as snail and for other patterns
co-occurring with Schnecke being translated as slug.
In the context of slug, we would probably find words like
vegetable (garden), lane, wet, and various
forms of get rid of, while in the case of snail
we would expect words like table, course, wine,
but also vineyard and sunny.
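The statistics-based selection described above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any existing system: the four German example sentences are invented, each is assumed to come from a parallel corpus in which the English equivalent of Schnecke is already known, and the names `context`, `train`, and `choose` are hypothetical. The sketch counts the words within five positions of the target word for each equivalent and then picks the equivalent whose recorded contexts overlap most with a new sentence.

```python
from collections import Counter, defaultdict

WINDOW = 5  # words on either side of the target (the text mentions five or ten)

def context(tokens, target, window=WINDOW):
    """Return the words within `window` positions of each occurrence of target."""
    words = []
    for i, token in enumerate(tokens):
        if token == target:
            words.extend(tokens[max(0, i - window):i])
            words.extend(tokens[i + 1:i + 1 + window])
    return words

def train(examples, target):
    """Count context words per translation equivalent.

    examples: iterable of (source_tokens, equivalent) pairs taken from
    aligned sentences of a parallel corpus.
    """
    counts = defaultdict(Counter)
    for tokens, equivalent in examples:
        counts[equivalent].update(context(tokens, target))
    return counts

def choose(counts, tokens, target):
    """Pick the equivalent whose training contexts best match the new context."""
    ctx = context(tokens, target)
    scores = {eq: sum(seen[w] for w in ctx) for eq, seen in counts.items()}
    return max(scores, key=scores.get), scores

# Invented toy data, for illustration only (lowercased, pre-tokenized).
examples = [
    ("im nassen beet frisst die schnecke den salat".split(), "slug"),
    ("wie werde ich die schnecke im garten los".split(), "slug"),
    ("als ersten gang gab es schnecke mit wein".split(), "snail"),
    ("die schnecke sitzt im sonnigen weinberg".split(), "snail"),
]
counts = train(examples, "schnecke")

dinner, dinner_scores = choose(
    counts, "zum wein wurde schnecke als gang serviert".split(), "schnecke")
garden, garden_scores = choose(
    counts, "die schnecke im nassen garten".split(), "schnecke")
print(dinner, garden)
```

With this toy data the dinner sentence is assigned snail and the garden sentence slug. Raw counts let frequent function words such as die and im contribute to the score; a realistic system would need far more data and some weighting of the counts, which this sketch leaves out.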