Romanian Language Technology

Wolfgang Teubert * Language Resources for Language Technology

John Sinclair has recently developed a corpus typology, and I am using it in my brief account of corpus types [2]:

Special corpus. Special corpora are assembled for a specific purpose, and they vary in size and composition according to that purpose. By intention, special corpora are not balanced (except within the scope of their given purpose) and, if used for other purposes, give a distorted view of the language segment. They can have a number of advantages over balanced corpora. Their main advantage is that the texts can be selected in such a way that the phenomena one is looking for occur much more frequently than in a balanced corpus. A corpus enriched in this way can be much smaller (perhaps by a factor of ten) than a balanced corpus providing the same data.

Reference corpus. Reference corpora come closest to the old concept of a representative corpus. They are composed on the basis of relevant parameters agreed upon by the linguistic community and should include spoken and written, formal and informal language representing various social and situational strata. The idea behind reference corpora is that they can be used for a large variety of purposes, thus rendering most special corpora unnecessary. They are also the point of reference when it comes to measuring the distortion of special corpora. They are used as benchmarks for lexicons and for the performance of generic tools and specific language technology applications. They are large in size: 50 million words is considered to be the absolute minimum, and 100 million will become the European standard in a few years.
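One simple way to quantify the distortion (or enrichment) of a special corpus against a reference corpus is to compare normalized frequencies. The following is a minimal sketch only. The function names and the toy token lists are my own assumptions for illustration, and a frequency ratio is just one of several possible measures.

```python
from collections import Counter

def per_million(tokens):
    """Map each word to its frequency per million tokens."""
    counts = Counter(tokens)
    n = len(tokens)
    return {word: count * 1_000_000 / n for word, count in counts.items()}

def frequency_ratio(word, special_tokens, reference_tokens):
    """Ratio of a word's normalized frequency in the special corpus
    to its normalized frequency in the reference corpus."""
    special = per_million(special_tokens).get(word, 0.0)
    reference = per_million(reference_tokens).get(word, 0.0)
    if reference == 0.0:
        # Unattested in the reference corpus: no finite ratio exists.
        return float("inf") if special > 0.0 else 0.0
    return special / reference

# Toy data (hypothetical): a legal special corpus and a general reference corpus.
special_tokens = ["appeal"] * 2 + ["the"] * 8
reference_tokens = ["appeal"] * 1 + ["the"] * 49

ratio = frequency_ratio("appeal", special_tokens, reference_tokens)
```

In this toy case the word is ten times as frequent in the special corpus, which is the kind of enrichment that lets a special corpus be much smaller than a balanced one.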

Monitor corpus. Language changes, and corpora should reflect these changes by growing at a constant rate while leaving untouched the relative weight of their components (i.e., the balance) as defined by the parameters. The same composition schema should be followed year by year, the basis being a reference corpus with texts spoken or written in one single year.
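The principle of constant growth with an unchanged balance can be sketched as follows. The component names, their weights, the yearly size, and the years are hypothetical values chosen for illustration, not figures from any actual corpus.

```python
from collections import Counter

# Hypothetical composition schema: relative weights of the components.
SCHEMA = {"press": 40, "fiction": 20, "spoken": 10, "academic": 30}

def yearly_quota(schema, words_per_year):
    """Number of words to collect per component in a single year."""
    total = sum(schema.values())
    return {name: round(words_per_year * weight / total)
            for name, weight in schema.items()}

def shares(sizes):
    """Relative weight of each component in the accumulated corpus."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}

# Apply the same schema year by year (assumed: one million words per year).
corpus = Counter()
for year in (1995, 1996, 1997):
    corpus.update(yearly_quota(SCHEMA, 1_000_000))

balance = shares(corpus)
```

Because every yearly slice follows the same schema, the accumulated corpus grows while the share of each component stays where the parameters put it.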

Opportunistic corpus. The opportunistic corpus is an inexpensive alternative to the reference corpus. It is a collection of electronic texts that can be obtained, converted, and used free of charge or at a very modest price. Its composition principle is that one should take all one can get and try to fill in blank spots as soon as they are recognized. Its place is in environments where size and corpus access do not pose a problem. The opportunistic corpus is a virtual corpus in the sense that the selection of an actual corpus (from the opportunistic corpus) depends on the needs of a particular project. Today's monitor corpora are usually opportunistic corpora.

Comparable corpus. For multilingual research and applications, corpora are needed in each language that follow the same composition pattern and can thus be used for language comparison. Opportunistic corpora cannot meet this requirement. The focus is, therefore, on reference corpora. The Commission of the European Community is funding a project whose main goal is the creation of comparable reference corpora (of 50 million words each) for all the official languages of the European Union including Catalan and Irish. Comparable corpora are an indispensable source for bilingual and multilingual lexicons and a new generation of dictionaries [3].

Parallel corpus. Texts in one language and their translations into other languages constitute parallel corpora. They are the source for the detection of translation equivalents and can thus play an important role in the development of multilingual lexicons. For this purpose, parallel corpora must be aligned at least sentence by sentence, preferably phrase by phrase. Their disadvantage is that the language of translations is distorted and does not contain the full range of vocabulary and syntax. To compensate for this deficiency, one can set up reciprocal parallel corpora, i.e., corpora containing authentic texts as well as translations in each of the languages involved. This allows translation equivalents to be double-checked. A collocation or phraseologism is counted as an acceptable equivalent only if it also occurs in authentic texts.
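The double-checking of translation equivalents against authentic texts can be sketched as follows. This is a minimal illustration under strong assumptions: the candidate equivalents are taken as already extracted from the aligned translations, attestation is tested by naive case-insensitive substring matching (no lemmatization), and the function name and the English-German example data are hypothetical.

```python
def accepted_equivalents(candidates, authentic_target_texts):
    """Keep only those candidate equivalents that are attested
    in authentic (non-translated) target-language texts."""
    lowered = [text.lower() for text in authentic_target_texts]
    accepted = {}
    for source_phrase, target_phrases in candidates.items():
        accepted[source_phrase] = [
            phrase for phrase in target_phrases
            if any(phrase.lower() in text for text in lowered)
        ]
    return accepted

# Candidates found in translations (hypothetical example data).
candidates = {
    "make a decision": ["eine Entscheidung treffen", "eine Entscheidung machen"],
}
# Authentic German texts from the reciprocal part of the corpus.
authentic_german = [
    "Der Vorstand muss bald eine Entscheidung treffen.",
]

result = accepted_equivalents(candidates, authentic_german)
```

Here only "eine Entscheidung treffen" survives, because the second candidate occurs in translated text alone. A realistic check would have to match inflected and discontinuous forms as well.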

4. Standardization and validation

Anyone who wants to build up a corpus has to make a number of decisions about how to encode the text. A text is more than a stream of words; it contains much more information. It may have one or more authors, a title, chapter headings, tables and figures, footnotes, foreign language citations, font shifts, paragraphs, and many other features. What should be retained? How should it be encoded? Should one aim at reconstruction of the source text complete with its specific layout? There are no strict rules. Whatever we decide, we have to abide by that decision throughout the entire corpus. If we mark footnotes in text A, we should also mark them in exactly the same way in text B. There must be an explicit list of the features to be encoded, and there must be a set of unambiguous codes. Only a text collection that has been encoded in such a consistent way should be called a corpus.
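A consistency check of this kind can be sketched as follows, assuming an SGML-style markup with angle-bracket tags. The tag names in the list and the two sample texts are hypothetical. The point is only that every code found in a text is compared against one explicit list.

```python
import re

# The explicit list of features to be encoded (hypothetical code set).
ALLOWED_CODES = {"title", "author", "head", "p", "note", "foreign"}

# Matches opening and closing tags and captures the tag name.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")

def unknown_codes(text, allowed=ALLOWED_CODES):
    """Return the set of codes used in the text that are not on the list."""
    return {match.group(1).lower() for match in TAG.finditer(text)} - allowed

text_a = "<head>Budget talks stall</head><p>Text<note>Source: agency.</note></p>"
text_b = "<headline>Rates rise</headline><p>Text<fn>See page 2.</fn></p>"

problems_a = unknown_codes(text_a)   # empty: text A follows the code list
problems_b = unknown_codes(text_b)   # text B uses its own codes for the same features
```

Text B marks headlines and footnotes, but not in the same way as text A, so it would have to be recoded before the two texts could belong to the same corpus.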

Why do we need consistency in the creation of corpora? Corpora are raw data waiting for linguistic analysis. To extract information from corpora, we have to use software. The more complicated our query is, the more sophisticated our software must be. If, for example, we want to extract all headlines of newspaper texts containing a certain phraseologism, we must make certain that headlines are marked consistently.
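A query of this kind might look as follows. This is a sketch only: the `<head>` code for headlines, the sample corpus, and the function name are my assumptions, and the phraseologism is matched as a plain case-insensitive string.

```python
import re

# Assumed convention: every headline is enclosed in <head>...</head>.
HEAD = re.compile(r"<head>(.*?)</head>", re.DOTALL)

def headlines_containing(corpus_text, phrase):
    """Return all consistently marked headlines that contain the phrase."""
    phrase = phrase.lower()
    return [head.strip() for head in HEAD.findall(corpus_text)
            if phrase in head.lower()]

corpus_text = (
    "<head>Fraud case is only the tip of the iceberg</head>"
    "<p>Investigators expect further charges.</p>"
    "<head>Markets close higher</head>"
    "<p>Analysts say this is the tip of the iceberg.</p>"
    "<headline>Tip of the iceberg, says minister</headline>"
    "<p>The minister declined to comment further.</p>"
)

hits = headlines_containing(corpus_text, "tip of the iceberg")
```

The query finds the first headline and correctly ignores the occurrence in running text, but it misses the third headline because that one was marked with a different code. Inconsistent encoding thus leads directly to incomplete results.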