Wolfgang Teubert * Language Resources for Language
Technology
Today's successful applications
involving semantics work with an amalgam of the rule- and the
statistic-based approach. The statistical approach has some attractive
advantages: the data required can be generated from corpora with
little or no human intervention; and since it is just an
emulation of semantics, one does not have to be able to state
explicitly what a lexical item means. Indeed, it leaves out the
entire question of meaning. Its inherent shortcoming is its limited
accuracy. Even a rate of 95% correct translation equivalents
implies that, on average, every twentieth word in a text is mistranslated,
roughly one error per sentence, and certainly more than most people
would be willing to live with. On the other hand, the rule-based approach
can be a very expensive alternative. It presupposes something
like a bilingual dictionary that would enable a text to be translated
correctly into an unknown foreign language. The reason why
such dictionaries do not exist, either for human users or for machines, is
that the explicit linguistic knowledge they would have to contain
is not yet available and is extremely expensive to produce.
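The corpus-based generation of translation data mentioned above can be illustrated with a minimal sketch. It extracts translation equivalents from a sentence-aligned parallel corpus by scoring source/target word pairs with the Dice coefficient; the tiny English-German corpus is invented purely for illustration, and real systems would of course use far larger corpora and more refined association measures:

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned corpus (invented for illustration only).
aligned = [
    ("the house is red", "das haus ist rot"),
    ("the house is big", "das haus ist gross"),
    ("the car is red", "das auto ist rot"),
]

src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
for src, tgt in aligned:
    s_words, t_words = set(src.split()), set(tgt.split())
    src_freq.update(s_words)
    tgt_freq.update(t_words)
    # Count every source/target co-occurrence within an aligned pair.
    pair_freq.update(product(s_words, t_words))

def dice(s, t):
    """Dice coefficient: how strongly s and t co-occur across alignments."""
    return 2 * pair_freq[(s, t)] / (src_freq[s] + tgt_freq[t])

# Take the best-scoring target word as the translation equivalent.
best = max(tgt_freq, key=lambda t: dice("red", t))  # "rot"
```

No human needs to state what "red" means: the statistics alone single out "rot" as its equivalent, which is exactly the sense in which such methods emulate semantics without addressing meaning.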
The new generation of language technology applications (monolingual
and bilingual or multilingual ones) deals with semantic problems.
These applications recognize that computers cannot understand spoken
or written texts in the way humans can. Therefore, these text
processing systems can only emulate the human faculty of 'understanding'
by a mixture of rules and probabilities. To find the right mix
is less a question of theory and principles than of calibration
and learning by doing. The crucial point is the performance of
an application under real life conditions. The application has
to prove its cost efficiency, i.e., it must demonstrate that it
can complete a task more cheaply than a trained human. After two decades
of experimental and pilot systems, the emphasis today is on robust
applications for which there is a real market, i.e., one where
users are willing to pay a fair price.
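The mixture of rules and probabilities described above can be sketched very simply: hand-written rules fire when a strong contextual cue is present, and a corpus-derived probability table decides otherwise. The rules, cue words, and probabilities below are all invented for illustration:

```python
# Minimal sketch of a rule/probability mix for lexical choice, here
# for translating the ambiguous English noun "bank" into German.

RULES = {
    # A hand-written rule fires when a strong contextual cue is present.
    "bank": [({"river", "shore", "water"}, "Ufer")],
}

# Fallback: translation probabilities, as if derived from a corpus.
PROBS = {
    "bank": {"Bank": 0.8, "Ufer": 0.2},
}

def translate(word, context):
    for cues, translation in RULES.get(word, []):
        if cues & set(context):      # rule fires: deterministic choice
            return translation
    table = PROBS.get(word, {})
    # Otherwise fall back on the statistically most likely equivalent.
    return max(table, key=table.get) if table else word

translate("bank", ["the", "river", "bank"])  # rule applies -> "Ufer"
translate("bank", ["a", "savings", "bank"])  # statistics  -> "Bank"
```

Finding the right balance between the two components, how many rules to write and when to trust the statistics, is precisely the calibration problem described above.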
This new state of the art is reflected
in the policy of the European Commission on information technology.
Under the Telematics 4th Framework Programme, only such projects
are funded where the usefulness of the application under development
is verified and attested by an industrial interest group, representing
the end users of the application.
3. The importance of language resources
The quality of language technology applications rests foremost
with the comprehensiveness and reliability of the language data
with which the tools are used. For traditional applications like
a spelling checker, 'hard' data are needed: word forms (including
orthographic variants) and a morphosyntactic analysis leading
to lemmatization. For more sophisticated applications that somehow
involve meaning, we need 'soft' data, data which are interpretations
of hard data by competent linguists. Finally, for multilingual
applications, we need soft data for various languages and procedures
allowing the mapping of these data onto each other, again necessitating
interpretation by competent linguists.
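The 'hard' data described above, word forms with orthographic variants and a morphosyntactic analysis leading to lemmatization, can be pictured as a simple lexicon lookup. The entries and tags below are a tiny invented sample, not a real lexicon:

```python
# 'Hard' data: attested word forms, each mapped to its lemma and a
# morphosyntactic tag (Penn-Treebank-style tags, sample entries only).
LEXICON = {
    "house":  ("house", "NN"),    # noun, singular
    "houses": ("house", "NNS"),   # noun, plural
    "ran":    ("run", "VBD"),     # verb, past tense
    "colour": ("colour", "NN"),   # orthographic variants
    "color":  ("colour", "NN"),   # share a single lemma
}

def check(word):
    """Return (is_known, lemma, tag); unknown forms are flagged."""
    entry = LEXICON.get(word.lower())
    return (True, *entry) if entry else (False, None, None)

check("Houses")  # known form, lemmatized to "house"
check("houss")   # unknown form, flagged as a misspelling
```

A spelling checker needs nothing beyond such hard data; it is only when applications must choose between readings or equivalents that the interpreted 'soft' data become indispensable.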
Hard and soft data together constitute our linguistic knowledge.
We have argued that today's knowledge is neither reliable enough
nor sufficient for the development of robust, heavy-duty tools
performing noticeably better than the existing toy or pilot systems.
Many computer linguists, particularly
those with an engineering background, take it for granted that,
with the linguistic knowledge available today, a steady improvement
of language technology systems is possible to the point where,
say, an operational, robust system for machine translation can
be developed. They are impressed by the fact that seemingly astonishing
results can be achieved with stochastic methods necessitating
little knowledge and even less interpretation of soft data of
the language(s) involved. They believe their conviction is justified
because they regard natural languages as nothing
but a set of particularly complex formal languages. However, there
is a fundamental difference. Given basic conditions, formal languages
can be translated into each other. But anyone who has translated
from one natural language to another knows it takes more than
just the hard linguistic data plus a few statistical operations.
Translators must also understand the content of the text and must
be aware that they will understand it only if they have sufficient
knowledge of the world. This is not knowledge of facts, but
rather knowledge of how we interpret the objective world that we perceive.
To some extent, this (community agreed) interpretation will have
to be integrated into the language data.