Wolfgang Teubert · Language Resources for Language Technology


As I stated before, the linguistic knowledge available, that is, the knowledge formulated in existing grammars and dictionaries, is unsuitable for language technology tools for two reasons. First, most of it is not corpus-based; rather, it reflects the individual linguist's (or lexicographer's) competence based on a collection of data (citations), and however large that collection may be, it is permeated by a bias that cannot be avoided. Second, traditional grammars and dictionaries have been devised for human users, and human users differ substantially from machines. Humans use inductive reasoning and can draw analogies easily; faculties like these are taken for granted and are reflected in the traditional arrangement and presentation of processed linguistic data. Language technology tools cannot have recourse to common sense. For language technology applications, all knowledge has to be spelled out in the form of rules, lists, and probabilities.

This task is sufficiently demanding in itself. In order to carry it out, we have to go back to the sources, and the only source of linguistic data is the corpus: the authentic and actual texts in their unannotated representation. But anyone who has gone to the sources has also experienced the problem that, when we start analyzing language as it occurs in a corpus, we gain evidence that shows existing grammars and dictionaries to be very unreliable repositories of linguistic knowledge. We discover that our traditional linguistic knowledge gives us a very biased view of language, a view that has its roots in the contingency of over two thousand years of linguistic theorizing. We are so accustomed to this view that we take it for the truth, for reality, not just for an interpretation of hard data. It is true that traditional grammars and dictionaries have helped us, fairly satisfactorily, to overcome the linguistic problems we humans have to deal with. But they will not suffice for language technology applications.

This is why, however cumbersome and expensive it may be, language has to be described in a way that is appropriate for language engineering. In the Council of Europe corpus project, the Multilingual Dictionary Experiment (project leader John Sinclair, with participants from Croatia, England, Germany, Hungary, Italy, and Sweden), it has been demonstrated that monolingual and bilingual dictionaries are of no (or only little) use when it comes to automatically translating a word from one language into another in cases where there is more than one alternative. A close analysis of the problems involved in the translation of nominalizations between German, French, and Hungarian (also corpus-based) has shown that all the descriptions available in traditional dictionaries and grammars are inadequate, incomplete, and ultimately useless [1]. To reduce the cost of the corpus-based language analysis from scratch that is indispensable, corpus exploitation tools have to be developed which arrange the hard facts (including statistics-driven devices for context analysis) and which process them (with a great deal of human intervention for the semantic interpretation of the data) into algorithmic linguistic knowledge, into rules derived from objective data rather than from individual competence. Perhaps this will result in the finding that traditional categories like noun, verb, and adjective do not, after all, reflect categories useful for NLP.
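As a rough illustration of what such a statistics-driven device for context analysis might look like, the following minimal sketch ranks the collocates of a node word in a raw, unannotated corpus by a window-based pointwise mutual information score. The tokenization, the window size, and the choice of association measure are illustrative assumptions only, not prescribed by the argument above; the semantic interpretation of the ranked list is still left to the human analyst.

    import math
    import re
    from collections import Counter

    def collocates(text, node, window=4, min_freq=3):
        """Rank the words co-occurring with `node` within +/- `window` tokens,
        using a window-based pointwise mutual information (PMI) score."""
        tokens = re.findall(r"\w+", text.lower())
        n = len(tokens)
        word_freq = Counter(tokens)

        # Count co-occurrences of the node word with its neighbours.
        co_freq = Counter()
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            for j in range(max(0, i - window), min(n, i + window + 1)):
                if j != i:
                    co_freq[tokens[j]] += 1

        scored = []
        for word, f_co in co_freq.items():
            if f_co < min_freq:
                continue
            # Expected co-occurrence frequency if node and word were independent.
            expected = word_freq[node] * word_freq[word] * 2 * window / n
            scored.append((word, f_co, math.log2(f_co / expected)))

        # Highest association first.
        return sorted(scored, key=lambda item: item[2], reverse=True)

Run over a plain-text corpus file, e.g. collocates(open("corpus.txt", encoding="utf-8").read(), "bank"), this yields the kind of hard, objective data from which rules can then be derived, with human intervention, as described above.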

Corpora are the basic language resources. They have been used by linguists for about thirty years. We have learned that the early discussion on the representativeness of corpora led in the wrong direction: corpora represent nothing but the texts they consist of, and certainly not a language universe. The analysis of corpora has given rise to new insights, particularly concerning the vocabulary. In the sixties, we were accustomed to thinking that we could define the core of general language as the intersection of special languages and that such a definition would make it possible to define a finite general-language vocabulary of perhaps 50,000 to 100,000 lemmata. Today, we are much less sure about the usefulness of such constructs. Instead, we talk about balanced corpora, corpora composed according to parameters like text type (e.g., 3rd person, present tense only), genre (e.g., romance, instruction, free asymmetrical conversation), and domain (e.g., angling, crime, stock exchange).
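To make the idea of composition parameters concrete, a balanced corpus can be thought of as a collection of chunks, each keyed by one value per parameter. The following sketch is only an illustration, with the parameter names and example values taken from the list above and the class and method names chosen freely.

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class ChunkKey:
        text_type: str   # e.g. "3rd person, present tense only"
        genre: str       # e.g. "romance", "instruction", "free asymmetrical conversation"
        domain: str      # e.g. "angling", "crime", "stock exchange"

    @dataclass
    class BalancedCorpus:
        chunks: dict = field(default_factory=dict)   # ChunkKey -> list of raw texts

        def add_text(self, text, text_type, genre, domain):
            key = ChunkKey(text_type, genre, domain)
            self.chunks.setdefault(key, []).append(text)

        def chunk(self, text_type, genre, domain):
            return self.chunks.get(ChunkKey(text_type, genre, domain), [])

Each chunk, one combination of text type, genre, and domain, is then the unit to which the saturation criterion discussed below applies.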

We know that a balanced corpus, regardless of size, number of parameters, and number of values assigned to each parameter, does not represent general language; instead, we say it can be used for a number of different purposes. It does not represent the vocabulary of general language because a general-language vocabulary is not a meaningful concept. All we can say is that we aim at a corpus that is 'saturated' in terms of the vocabulary. This means that a particular chunk of our balanced corpus representing one text type, one genre, and one domain (e.g., texts in the 1st and 3rd person, past and present tense; newspaper diary; angling) is saturated once the growth rate of the vocabulary stops decreasing and becomes constant. There is no point beyond which no hitherto unrecorded words will appear, but there is a point from which there will be perhaps eight new words (types) for each 10,000 additional words of text (tokens). Saturation of corpora is a fairly new concept, and no one knows yet what it implies in terms of corpus size.
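The saturation criterion can be checked operationally by tracking how many new types each successive slice of tokens contributes to a chunk. A minimal sketch, assuming the chunk has already been reduced to a list of word forms; the slice size of 10,000 tokens simply mirrors the figure quoted above.

    def vocabulary_growth(tokens, step=10_000):
        """Return the number of previously unseen types contributed by each
        successive slice of `step` tokens."""
        seen = set()
        growth = []
        for start in range(0, len(tokens), step):
            segment = tokens[start:start + step]
            new_types = {t for t in segment if t not in seen}
            growth.append(len(new_types))
            seen.update(segment)
        return growth

    # The chunk counts as saturated, in the sense above, once the tail of this
    # curve stops falling and levels off (e.g. at roughly eight new types per
    # 10,000 tokens) rather than ever reaching zero.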

