Wolfgang Teubert · Language Resources for Language Technology


If we add linguistic information of this kind to a corpus, we may defeat the very purpose for which we use the corpus in the first place. Corpora are used for obtaining new linguistic knowledge. But if we rely on traditional knowledge when extracting evidence from a corpus, we may never reach the point where we can discard old categories and replace them with new ones confirmed by corpus evidence. It is from looking at unannotated corpora that we have learned a great deal about the fuzziness of the concept of lexical units, and that we now see a continuum ranging from a word segment such as -euro- through single words, multiword units, and collocations to whole phraseologisms. Particularly in multilingual applications, it is obvious that the translation equivalents we are looking for are only rarely single words and can sometimes be complete phrases or even sentences.

Any linguistic information added to a text or corpus tends to replicate the bias inherent in the categories used. Therefore, we must be careful that the standardization of linguistic features that were not originally present in a text but have been added by hand or automatically does not obscure the evidence we expect to obtain from corpora. Normative, so-called language-independent tagsets and categories for word class, morphosyntactic, or even semantic information, whether in corpora or in lexicons, are useful only insofar as they establish comparability between resources of different origin. If these tagsets are used for extracting evidence of linguistic phenomena from corpora, the results have to be evaluated with great care.
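To make the notion of comparability concrete, the following is a minimal sketch of how two corpora annotated with different native tagsets might be mapped onto a shared standard tagset before their evidence is compared. All tag names and the mapping itself are invented for illustration; they do not correspond to any particular standardized tagset.

```python
# Hypothetical mappings from two resource-specific tagsets onto a shared,
# standardized tagset, so that evidence drawn from either corpus can be
# compared. All tag names here are illustrative inventions.

# Corpus A uses a fine-grained native tagset.
CORPUS_A_TO_STANDARD = {
    "NN": "NOUN", "NE": "NOUN",        # common and proper nouns collapsed
    "VVFIN": "VERB", "VVINF": "VERB",  # finite and infinitive verbs collapsed
    "ADJA": "ADJ", "ADJD": "ADJ",      # attributive and adverbial adjectives
}

# Corpus B uses a much coarser native tagset.
CORPUS_B_TO_STANDARD = {
    "N": "NOUN",
    "V": "VERB",
    "A": "ADJ",
}

def standardize(tagged_tokens, mapping):
    """Rewrite (token, native_tag) pairs with the shared tagset, marking
    any native tag the mapping does not account for."""
    return [(tok, mapping.get(tag, "UNMAPPED")) for tok, tag in tagged_tokens]
```

Note that the mapping deliberately collapses distinctions (common versus proper nouns, finite versus infinitive verbs); this loss of information is precisely the kind of bias that has to be allowed for when results extracted via such standardized categories are evaluated.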

The creation of language resources is expensive. The more effort that has been spent on the composition, documentation, and encoding of textual features, on the annotation of added linguistic information, and on comparability with related resources, the more valuable a corpus (or a lexicon) will be. To use it for just one specific purpose would be a waste; it should be reused as often as possible, covering a wide range of applications. As we said above, standardization is an absolutely necessary precondition if the resources are not only to be used internally within a single institution but are also to be made available to the NLP community. While in the past the exchange of resources and resource-oriented tools was based on bilateral agreements between provider and user, distribution centers today serve as platforms for the language resources market. They will be accepted only if the material they offer for distribution proves valuable to their clients, and that means it has to be fully documented and standardized. The first such center was the Linguistic Data Consortium (LDC) at the University of Pennsylvania, which was created more than five years ago. Recently, the European Language Resources Association (ELRA) was founded for the (Western) European market. It will be complemented by national language resource centers. The Trans-European Language Resources Infrastructure (TELRI), a Concerted Action funded by the European Commission, serves as a bridge between Central and Eastern Europe (including the Commonwealth of Independent States) on the one side and Western Europe on the other. These distribution centers will pass on submitted resources to academic and industrial users and will collect license fees. It is no longer necessary to contact the institution where the resources were created in the first place.

Such an arrangement works only if the resources are not only documented but validated as well. What does validation mean for written resources? For corpora, it means that the corpus has the size it claims to have, that it is composed and encoded the way it claims to be, that all encoded features can be used for retrieval, that the annotations used conform to a given standard, and, above all, that the error rate for encoding and annotation does not exceed a certain level. Validation guarantees that the client gets what was ordered and can rely on the resources to the extent stated by the validation certificate.
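As a concrete illustration, the sketch below shows how some of the language-independent checks might be automated for an XML-encoded corpus. The file layout, the assumption of one <w> element per token carrying a pos attribute, the tagset, and the one-percent size tolerance are all hypothetical; an actual validation procedure would follow whatever standard the validation centers agree upon.

```python
# Minimal sketch of automated corpus validation checks (hypothetical
# encoding conventions, tagset, and tolerance; not a prescribed standard).

import xml.etree.ElementTree as ET

# Assumed shared tagset against which annotations are checked.
ALLOWED_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP",
                "CONJ", "NUM", "PART", "INTJ", "PUNCT", "X"}

def validate_corpus(path, declared_tokens, tolerance=0.01):
    """Check well-formedness, declared size, and tagset conformance."""
    report = {}

    # 1. The corpus must parse as well-formed XML (encoding check);
    #    ET.parse raises ParseError if it does not.
    root = ET.parse(path).getroot()

    # 2. The corpus must contain (roughly) the number of tokens it claims,
    #    assuming one <w> element per token.
    tokens = root.findall(".//w")
    report["token_count"] = len(tokens)
    report["size_ok"] = (
        abs(len(tokens) - declared_tokens) <= tolerance * declared_tokens
    )

    # 3. All word-class annotations must come from the agreed tagset.
    used_tags = {w.get("pos", "MISSING") for w in tokens}
    unknown = used_tags - ALLOWED_TAGS
    report["tagset_ok"] = not unknown
    report["unknown_tags"] = sorted(unknown)

    return report
```

A report of this kind could supply the factual basis for a validation certificate, alongside the error-rate estimate discussed below.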

Validation has to be carried out on an unbiased and neutral basis, which means not by the institution where the resources were created. Some validation features are language-independent, such as the conformance of the text representation with accepted rules and standards for encoding. Others are language-specific, e.g., the claim that a lexicon really covers the core vocabulary of a given language, that the meanings assigned to a lemma have been gained from corpus evidence, or that the error rate of morphosyntactic tagging does not exceed a certain level. Therefore, it makes sense that the validation of written resources should be carried out by institutions with sufficient competence in the language in question. On the other hand, to make validation reliable, all recognized validation centers should follow a validation procedure prescribed by a standard accepted by the NLP community. Some projects for the development of validation procedures are under preparation; the European Commission has included validation as a work item in the new Telematics Programme.
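The error-rate criterion can only be checked against a hand-verified sample. Below is a minimal sketch of such a check; the five-percent ceiling and the use of a normal-approximation confidence margin are illustrative assumptions, not part of any prescribed validation standard.

```python
# Illustrative sketch: estimating the morphosyntactic tagging error rate
# of a corpus from a hand-checked random sample. The 5% ceiling and the
# 95% margin are assumptions chosen for illustration only.

import math

def estimate_error_rate(system_tags, corrected_tags, max_error=0.05):
    """system_tags and corrected_tags are parallel lists of tags for a
    random sample of tokens that a human validator has re-annotated."""
    n = len(system_tags)
    errors = sum(1 for s, g in zip(system_tags, corrected_tags) if s != g)
    p = errors / n

    # 95% margin of error under the normal approximation to the binomial.
    margin = 1.96 * math.sqrt(p * (1 - p) / n)

    return {
        "sample_size": n,
        "error_rate": p,
        "upper_bound": p + margin,
        "within_limit": p + margin <= max_error,
    }

# Example: 38 tagging errors in a hand-checked sample of 1,000 tokens give
# an estimated error rate of 3.8% with an upper bound of about 5.0%.
```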


