Period: June 2020 to May 2022
CEF-TC-2019-1 – Automated Translation, grant agreement number INEA/CEF/ICT/A2019/1926831
The overall objective of the Curated Multilingual Language Resources for CEF AT Action is to compile curated datasets in the seven languages targeted by the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian), in domains of relevance to the European Digital Service Infrastructures (DSIs), with a view to enhancing Automated Translation.
The primary sources of data are the national corpora of the above-mentioned languages. The data will cover domains relevant to some of the CEF DSIs, such as eHealth, Europeana and eGovernment in general. The Action will deliver at least 14 million sentences (estimated to contain at least 140 million words) from domains including culture, education, health and science. Moreover, the Action will address a gap in machine translation technology, which crucially depends on the provision of domain-specific quality language resources for under-resourced languages.
The Action consists of the following activities: Aggregation and data preparation (collect the relevant parts of the national monolingual corpora); Additional collection and IPR clearance (identify unbalanced domain distribution across the targeted languages and collect additional text data); Anonymisation (remove all personal and sensitive data from the language resources); Terminology enrichment (enrich the monolingual corpora with terminology); Metadata harmonisation (homogenise the individual metadata schemes across the monolingual corpora); Dissemination (promote the Action’s results and important achievements); Management (ensure efficient coordination between the consortium partners).