Wolfgang Teubert * Language Resources for Language Technology


In most European countries, we find one or more focal academic institutions dedicated to documenting, analyzing, and describing the national language(s) and to conveying linguistic knowledge to the public. Language was regarded as the paramount cultural asset, responsible for ethnic or national identity. In all research work, the main emphasis was therefore on the national language; it was important to demonstrate that it was as well researched as any other major world language. Communication across borders, from language to language, was rarely a topic for these institutions, if it was encouraged at all. The transnational flow of information was limited by the number of translators and translations, which served as a useful filter for incoming and outgoing information.

Today, the situation has changed: European integration is now imperative to our societies. A precondition is unrestricted access to information. National economies can compete successfully on the global market only if all relevant data are available. However, although it is relatively easy to obtain all the information one wants, it is usually written in a foreign language, unless one happens to speak English or perhaps French as a mother tongue. The large majority of people, who do not speak a foreign language, are thus put at a disadvantage. German companies regularly complain that calls for tender issued by the Commission of the European Communities are initially, and at times only, published in French and English, which makes it much more difficult for them to compete. International TV channels first thought that they could reach their audience by transmitting their programmes in English; however, programme sponsors now insist on subtitles or dubbing in the local languages. Most viewers prefer programmes in their native language even if they understand English.

Language, once a cultural asset, has now become an economic commodity. The focal national language institutes have acquired new responsibilities. Their task now is to provide the means by which information from abroad can be made available in the national language and locally produced information can be distributed world-wide in the major international languages and in the languages of the neighboring countries. Training and employing more translators is not sufficient. We must also ensure that the necessary language technology is developed.

2. Current issues in Language Engineering

Spelling checkers were among the earliest successful language technology applications. They have been accepted as useful devices and are still being sold today in ever-improved versions. More ambitious projects, on the other hand, have often failed. The majority of early machine translation systems, particularly the more sophisticated ones, have not survived. SYSTRAN is still kept alive by the European Commission's translation services, but many others have disappeared without leaving a trace.

Spelling checkers do not need semantics. Even in the seventies, they were based on little more than a list of the most frequent word forms, i.e., linguistic knowledge that was widely available or easy enough to generate from corpora. Machine translation, however, needs semantics. In the seventies, our understanding of word meanings, or lexical semantics, was contained in dictionaries. It was arranged in a form that a human user with some experience could understand by drawing on a great deal of implicit knowledge about the world, on inductive reasoning, and on the ability to draw analogies, all assets that computers usually do not possess. It is therefore no surprise that early applications involving semantics were not very successful.
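The word-list principle described above can be illustrated with a minimal sketch. It is not a reconstruction of any particular product: the function names, the frequency threshold, and the toy corpus are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Simple tokenizer: runs of lowercase letters (including German umlauts and ß).
TOKEN = re.compile(r"[a-zäöüß]+")

def build_word_list(corpus_text, min_count=2):
    """Collect the word forms that occur at least `min_count` times in a corpus."""
    counts = Counter(TOKEN.findall(corpus_text.lower()))
    return {form for form, n in counts.items() if n >= min_count}

def flag_unknown(text, word_list):
    """Return, in order, the word forms of `text` that are not in the list."""
    return [tok for tok in TOKEN.findall(text.lower()) if tok not in word_list]

# Toy corpus, invented for this example.
corpus = "the snail eats the leaf . the slug eats the leaf . the snail sleeps"
known = build_word_list(corpus, min_count=2)
suspects = flag_unknown("the snial eats the leaf", known)
print(suspects)
```

The checker knows nothing about meaning: a word form is flagged only because it is missing from the list, which is why rare but correct forms (here, slug) are flagged as well.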

Today, we know that the semantic data needed for sophisticated language technology cannot (or only to a very small extent) be derived from dictionaries designed for human users. Rather, it has to be generated from scratch, namely on the basis of corpora. Language technology can work with two kinds of semantic information. One is rule-based and presupposes an intellectual semantic analysis of the phenomenon in question. The other uses statistics and is not really semantic at all: it collects the five or ten words or word forms preceding or following the word in question and relates this information to the different translation equivalents found in parallel corpora. The German word Schnecke, for example, is translated into English as either snail or slug, where snail refers to the creature with a 'house' and slug to the one without. The rule-based approach states just that and searches the German text for clues from which the correct translation equivalent can be inferred. The statistics-based approach does not look at the meaning at all. It looks for words and other traces that frequently occur when Schnecke is translated as snail, and for other patterns that co-occur with Schnecke being translated as slug. In the context of slug, we would probably find words like vegetable (garden), lane, wet, and various forms of get rid of, while in the case of snail we would expect words like table, course, and wine, but also vineyard and sunny.
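The statistics-based approach can be sketched in a few lines. The aligned sentences below are invented for illustration, and the scoring (raw co-occurrence counts, without smoothing or normalization) is a deliberately simple assumption, not a description of any existing system. The names ALIGNED, context, train, and choose are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: German sentences paired with the English equivalent
# that a parallel corpus shows for Schnecke. Invented for this example.
ALIGNED = [
    ("die schnecke frisst den salat im garten", "slug"),
    ("nach dem regen war eine schnecke auf dem nassen weg", "slug"),
    ("als vorspeise gab es schnecke mit wein", "snail"),
    ("im sonnigen weinberg sass eine schnecke", "snail"),
]

def context(tokens, target, window=5):
    """Word forms up to `window` positions before or after each target occurrence."""
    found = []
    for i, tok in enumerate(tokens):
        if tok == target:
            found.extend(tokens[max(0, i - window):i])
            found.extend(tokens[i + 1:i + 1 + window])
    return found

def train(aligned, target, window=5):
    """Count context word forms separately for each translation equivalent."""
    profiles = defaultdict(Counter)
    for sentence, equivalent in aligned:
        profiles[equivalent].update(context(sentence.split(), target, window))
    return profiles

def choose(sentence, profiles, target, window=5):
    """Pick the equivalent whose context profile overlaps most with the sentence."""
    ctx = context(sentence.split(), target, window)
    scores = {eq: sum(counts[w] for w in ctx) for eq, counts in profiles.items()}
    return max(scores, key=scores.get), scores

profiles = train(ALIGNED, "schnecke")
best, scores = choose("eine schnecke im garten nach dem regen", profiles, "schnecke")
print(best, scores)
```

Nothing in this sketch refers to shells or houses: the choice rests entirely on which word forms were seen near Schnecke for each equivalent. With realistic corpora, function words such as dem or im would dominate the raw counts, so a practical system would need some form of weighting.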

