That corpora must be marked up
consistently is self-evident. All large centers for language resources
have developed or obtained corpus exploitation software that demands
explicit coding of such linguistic and extra-linguistic features.
Within the limitations imposed by the software, the centers were
free to choose their own codes. This was adequate as long as there
was little exchange of corpora and software. However, as demand
for language resources grew and a new generation of corpus software
had to be developed, one able to handle corpora of 100 million words
or more and still deliver results in interactive access within a
reasonable time, cooperation, exchange, and distribution
became more and more important.
In recent years, the Text Encoding
Initiative (TEI), an international project with strong North American
and Western European participation, has developed standards and
guidelines for the encoding of all sorts of texts (spoken and
written) as corpora to be used as language resources. Likewise,
standards and guidelines were developed for the set-up and exchange
of lexical and terminological data [4].
The TEI recommendations
are based on the SGML standard of the International Organization
for Standardization (ISO). In their present form, they demand much
extra work from the corpus compiler if they are to be followed step
by step, probably more than most institutions are willing to invest.
Therefore, European language resources projects such as the Network
of European Reference Corpora (NERC) and the first PAROLE project
singled out subsets of the TEI standards to be adopted
by any new corpus (and lexicon) project
[5]. ISO is preparing
a number of SGML-based standards for terminology databases and
the exchange of terminological data [6].
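To give a rough impression of what such SGML/TEI-style encoding looks like in practice, the following Python sketch (using the standard xml.etree.ElementTree module) builds a heavily simplified, XML-flavoured document with a TEI-style header and a marked-up text body. The reduced element subset shown here is an assumption made for illustration only and falls far short of the actual guidelines.

```python
import xml.etree.ElementTree as ET

# Minimal sketch of TEI-style corpus encoding, assuming a heavily reduced
# element set; real TEI documents require considerably richer structure.
tei = ET.Element("TEI")

# Header: bibliographic description of the encoded text.
header = ET.SubElement(tei, "teiHeader")
file_desc = ET.SubElement(header, "fileDesc")
title_stmt = ET.SubElement(file_desc, "titleStmt")
ET.SubElement(title_stmt, "title").text = "An example novel"
ET.SubElement(title_stmt, "author").text = "Jane Doe"  # invented author

# Body: explicit markup of chapters, paragraphs, and sentences.
text = ET.SubElement(tei, "text")
body = ET.SubElement(text, "body")
chapter = ET.SubElement(body, "div", type="chapter", n="1")
paragraph = ET.SubElement(chapter, "p")
ET.SubElement(paragraph, "s").text = "It was a dark and stormy night."

print(ET.tostring(tei, encoding="unicode"))
```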
For some years now, the European
Expert Group on Language Engineering Standards (EAGLES), an activity
funded by the European Commission, has been developing recommendations
for the linguistic (and to some extent also extra-linguistic)
annotation of corpora and for the data categories used in lexicons
[7]. For some application areas,
there are only preliminary reports;
in other fields, operational guidelines are already available.
Building up an infrastructure
of language resources only makes sense if these resources are
standardized. As we said before, this is the only way to ensure
the reusability of resources for unlimited applications within
the global NLP community. Standardization also helps in making
resources comparable and in building links between resources.
Only if monolingual lexicons follow the same architecture and
use the same categories is it possible to merge them into a multilingual
lexicon (see the sketch below). All alignment software for parallel corpora operates
on the assumption that all relevant phenomena are encoded consistently.
Standardization is a precondition for an operational language resources
infrastructure, and its importance can hardly be overstated. On the other
hand, it is not necessary to adopt these standards for internal
use within an institution. Whoever has assembled a consistently encoded
corpus and has developed proprietary software for corpus exploitation
is under no obligation to convert it to the TEI or EAGLES standards.
These common international standards have to be used when it comes
to data interchange. If we want to exchange or distribute our
resources or our tools, we must convert them so that they conform to,
or are at least compatible with, the narrow set of recommendations
agreed upon by the NLP community. This task can be carried out
by appropriate conversion software, some of which is offered by
companies or is available in the public domain.
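To make the point about shared categories concrete, the following sketch shows how two monolingual lexicon entries that use the same data categories can be merged mechanically into a rudimentary multilingual record. The field names, the concept identifier used as the linking key, and the entries themselves are invented for illustration and are not drawn from any of the standards discussed here.

```python
# Hypothetical monolingual entries sharing the same data categories
# (lemma, part of speech, concept identifier); all values are invented.
english = {"lemma": "language technology", "pos": "noun", "concept": "C042"}
german = {"lemma": "Sprachtechnologie", "pos": "noun", "concept": "C042"}


def merge(entries):
    """Group entries from different monolingual lexicons by their shared
    concept identifier, yielding a minimal multilingual record."""
    multilingual = {}
    for lang, entry in entries.items():
        multilingual.setdefault(entry["concept"], {})[lang] = entry["lemma"]
    return multilingual


print(merge({"en": english, "de": german}))
# {'C042': {'en': 'language technology', 'de': 'Sprachtechnologie'}}
```

If the two lexicons used diverging category inventories or incompatible architectures, no such mechanical merge would be possible; the entries would first have to be mapped onto a common scheme.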
What exactly should be standardized?
How far standardization should stretch is a hotly debated issue.
Basically, there are two fields in which standards and guidelines
are offered for corpora. The first concerns the encoding of all
relevant features of a text that we want to include in a corpus,
features that are already present, e.g., in the printed text of a novel.
We can usually identify the author(s), find the titles, and determine
whether there are chapters, paragraphs, and other layout features serving
a purpose for the text. Features like these have to be encoded.
But what about full stops? These dots at the end of a sentence
have to be preserved; do they also have to be disambiguated?
On the surface, they do not differ from the little dot at the
end of an abbreviation such as viz. or etc. There are good reasons to
disambiguate them: we have to know where a sentence starts and
ends. If we want to search for the co-occurrence of specific words
forming a phraseologism, it is useful to confine the search to
sentences as linguistic units. If we want to align parallel texts,
the minimal parameter for alignment is the sentence, marked by
the full stop and not to be confused with the little dot indicating
that the preceding string is an abbreviation (a small sketch of such
a disambiguation procedure is given at the end of this section).
But if we decide in favor of disambiguation, we add linguistic
information to the text, using our linguistic knowledge and competence.
Any kind of corpus annotation means adding even more, and potentially
more questionable, linguistic information. Admittedly, there is a good
deal of agreement among experts about whether a given word is a noun,
verb, adjective, or something else. Yet how should we tag the first
element of language technology, the English equivalent of the German
word Sprachtechnologie? Is it really a noun, or a noun
used as a modifier, like an adjective, or is it just the first
constituent of a compound?
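To illustrate what the disambiguation of full stops involves in practice, here is a minimal sketch of a rule-based sentence splitter. The abbreviation list is a tiny assumed sample; a production tokenizer would need a far more elaborate treatment (initials, ordinals, abbreviations at sentence end, and so on).

```python
# Minimal sketch: a dot ends a sentence only if the token carrying it is
# not a known abbreviation. The abbreviation list is an assumed sample.
ABBREVIATIONS = {"e.g.", "i.e.", "viz.", "etc.", "Dr.", "vol."}


def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Abbreviations, viz. short forms, end in a dot. "
                      "The sentence boundary is elsewhere."))
# ['Abbreviations, viz. short forms, end in a dot.',
#  'The sentence boundary is elsewhere.']
```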