Wolfgang Teubert * Language Resources for Language
Technology
Today's successful applications
involving semantics work with an amalgam of the rule- and the
statistic-based approach. The statistical approach has some attractive
advantages: the data required can be generated from corpora with
little or no human intervention; and since it is just an
emulation of semantics, one does not have to be able to state
explicitly what a lexical item means. Indeed, it leaves out the
entire question of meaning. Its inherent shortcoming is its limited
accuracy. Even a rate of 95% correct translation equivalents
implies that, on average, every twentieth word in a text is mistranslated,
roughly one error per sentence, and certainly more than most people
would be willing to live with. On the other hand, the rule-based approach
can be a very expensive alternative. It presupposes something
like a bilingual dictionary that would enable a text to be translated
correctly into an unknown foreign language. The reason why
such dictionaries do not exist, either for human users or for machines, is
that the explicit linguistic knowledge they would have to contain
is not yet available and is extremely expensive to produce.
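The corpus-based generation of translation data mentioned above can be illustrated with a minimal sketch. It extracts translation equivalents from a sentence-aligned parallel corpus by scoring source/target word pairs with the Dice coefficient; the tiny English-German corpus is invented purely for illustration, and real systems would of course use far larger corpora and more refined association measures:

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned corpus (invented for illustration only).
aligned = [
    ("the house is red", "das haus ist rot"),
    ("the house is big", "das haus ist gross"),
    ("the car is red", "das auto ist rot"),
]

src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
for src, tgt in aligned:
    s_words, t_words = set(src.split()), set(tgt.split())
    src_freq.update(s_words)
    tgt_freq.update(t_words)
    # Count every source/target co-occurrence within an aligned pair.
    pair_freq.update(product(s_words, t_words))

def dice(s, t):
    """Dice coefficient: how strongly s and t co-occur across alignments."""
    return 2 * pair_freq[(s, t)] / (src_freq[s] + tgt_freq[t])

# Take the best-scoring target word as the translation equivalent.
best = max(tgt_freq, key=lambda t: dice("red", t))  # "rot"
```

No human needs to state what "red" means: the statistics alone single out "rot" as its equivalent, which is exactly the sense in which such methods emulate semantics without addressing meaning.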
The new generation of language technology applications (monolingual
and bilingual or multilingual ones) deals with semantic problems.
These applications recognize that computers cannot understand spoken
or written texts in the way humans can. Therefore, these text
processing systems can only emulate the human faculty of 'understanding'
by a mixture of rules and probabilities. To find the right mix
is less a question of theory and principles than of calibration
and learning by doing. The crucial point is the performance of
an application under real life conditions. The application has
to prove its cost efficiency, i.e., it must demonstrate that it
can complete a task more cheaply than a trained human. After two decades
of experimental and pilot systems, the emphasis today is on robust
applications for which there is a real market, i.e., one where
users are willing to pay a fair price.
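The mixture of rules and probabilities described above can be sketched very simply: hand-written rules fire when a strong contextual cue is present, and a corpus-derived probability table decides otherwise. The rules, cue words, and probabilities below are all invented for illustration:

```python
# Minimal sketch of a rule/probability mix for lexical choice, here
# for translating the ambiguous English noun "bank" into German.

RULES = {
    # A hand-written rule fires when a strong contextual cue is present.
    "bank": [({"river", "shore", "water"}, "Ufer")],
}

# Fallback: translation probabilities, as if derived from a corpus.
PROBS = {
    "bank": {"Bank": 0.8, "Ufer": 0.2},
}

def translate(word, context):
    for cues, translation in RULES.get(word, []):
        if cues & set(context):      # rule fires: deterministic choice
            return translation
    table = PROBS.get(word, {})
    # Otherwise fall back on the statistically most likely equivalent.
    return max(table, key=table.get) if table else word

translate("bank", ["the", "river", "bank"])  # rule applies -> "Ufer"
translate("bank", ["a", "savings", "bank"])  # statistics  -> "Bank"
```

Finding the right balance between the two components, how many rules to write and when to trust the statistics, is precisely the calibration problem described above.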
This new state of the art is reflected
in the policy of the European Commission on information technology.
Under the Telematics 4th Framework Programme, only such projects
are funded where the usefulness of the application under development
is verified and attested by an industrial interest group, representing
the end users of the application.
3. The importance of language resources
The quality of language technology applications rests foremost
with the comprehensiveness and reliability of the language data
with which the tools are used. For traditional applications like
a spelling checker, 'hard' data are needed: word forms (including
orthographic variants) and a morphosyntactic analysis leading
to lemmatization. For more sophisticated applications that somehow
involve meaning, we need 'soft' data, data which are interpretations
of hard data by competent linguists. Finally, for multilingual
applications, we need soft data for various languages and procedures
allowing the mapping of these data onto each other, again necessitating
interpretation by competent linguists.
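The 'hard' data described above, word forms with orthographic variants and a morphosyntactic analysis leading to lemmatization, can be pictured as a simple lexicon lookup. The entries and tags below are a tiny invented sample, not a real lexicon:

```python
# 'Hard' data: attested word forms, each mapped to its lemma and a
# morphosyntactic tag (Penn-Treebank-style tags, sample entries only).
LEXICON = {
    "house":  ("house", "NN"),    # noun, singular
    "houses": ("house", "NNS"),   # noun, plural
    "ran":    ("run", "VBD"),     # verb, past tense
    "colour": ("colour", "NN"),   # orthographic variants
    "color":  ("colour", "NN"),   # share a single lemma
}

def check(word):
    """Return (is_known, lemma, tag); unknown forms are flagged."""
    entry = LEXICON.get(word.lower())
    return (True, *entry) if entry else (False, None, None)

check("Houses")  # known form, lemmatized to "house"
check("houss")   # unknown form, flagged as a misspelling
```

A spelling checker needs nothing beyond such hard data; it is only when applications must choose between readings or equivalents that the interpreted 'soft' data become indispensable.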
Hard and soft data together constitute our linguistic knowledge.
We have argued that today's knowledge is neither reliable enough
nor sufficient for the development of robust, heavy-duty tools
performing noticeably better than the existing toy or pilot systems.
Many computer linguists, particularly
those with an engineering background, take it for granted that,
with the linguistic knowledge available today, a steady improvement
of language technology systems is possible to the point where,
say, an operational, robust system for machine translation can
be developed. They are impressed by the fact that seemingly astonishing
results can be achieved with stochastic methods necessitating
little knowledge and even less interpretation of soft data of
the language(s) involved. They believe their conviction is justified
because they regard natural languages as nothing
but a set of particularly complex formal languages. However, there
is a fundamental difference. Given basic conditions, formal languages
can be translated into each other. But anyone who has translated
from one natural language to another knows it takes more than
just the hard linguistic data plus a few statistical operations.
Translators must also understand the content of the text and must
be aware that they will understand it only if they have sufficient
knowledge of the world. This is not knowledge of facts, but
rather knowledge of how we interpret the objective world that we perceive.
To some extent, this (community agreed) interpretation will have
to be integrated into the language data.