This work was supported by a grant of the Romanian Ministry of Research and Innovation, CCCDI – UEFISCDI, project number PN-III-P1-1.2-PCCDI-2017-0818 - 73/2018, within PNCDI III.
Resources and Tools
The lexicon developed in ReTeRom, RoLEX, is a resource with 330,866 entries of the following tabular form:
word_form lemma MSD_tag syllabification lexical_stress phonetic_transcription
It has been thoroughly validated and corrected by hand and is, at this point, the most extensive phonologically validated dataset available for Romanian.
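To illustrate the six-column layout above, here is a minimal Python sketch of reading one entry into a named record. The tab separator, the `RolexEntry` and `parse_entry` names, and the sample line are all assumptions made for illustration. The sample values are invented and are not taken from RoLEX, and the notation RoLEX actually uses for syllabification, stress and phonetic transcription may differ.

```python
from typing import NamedTuple


class RolexEntry(NamedTuple):
    """One lexicon entry, with fields in the column order given above."""
    word_form: str
    lemma: str
    msd_tag: str
    syllabification: str
    lexical_stress: str
    phonetic_transcription: str


def parse_entry(line: str, sep: str = "\t") -> RolexEntry:
    """Split one line into its six fields (the separator is an assumption)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != len(RolexEntry._fields):
        raise ValueError(
            f"expected {len(RolexEntry._fields)} fields, got {len(fields)}"
        )
    return RolexEntry(*fields)


# Invented sample line, for illustration only (not an actual RoLEX entry).
sample = "case\tcasă\tNcfp-n\tca-se\tc'a-se\tk a s e"
entry = parse_entry(sample)
print(entry.word_form, entry.lemma, entry.msd_tag)
```

Keeping each entry as a named record lets downstream code refer to `entry.lemma` or `entry.msd_tag` rather than to column positions.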
RoLEX is based on the textual component of an assembled speech corpus containing data drawn from Romanian Wikipedia articles, news, interviews on contemporary subjects, talk shows, spontaneous speech, tales, novels, etc. The morphosyntactic and lemma information was added using TBL, a general, large-scale lexicon of the Romanian language (over 1.1 million entries) under development at RACAI. TBL was also used to include in RoLEX all the morphological variants of the lemmas encountered in the speech corpus. The other lexical information (syllabification, lexical stress, phonetic transcription) was partially based on RoSyllabiDict (Barbu, 2008) and MaRePhoR (Toma et al., 2017), and partially predicted automatically with the front-end tool developed by Stan et al. (2011). The aggregated lexicon underwent a carefully designed automatic and manual correction process.
Bibliography
1. Barbu, Ana-Maria. "Romanian Lexical Data Bases: Inflected and Syllabic Forms Dictionaries." LREC (2008)
2. Stan, Adriana, Junichi Yamagishi, Simon King, and Matthew Aylett. "The Romanian Speech Synthesis (RSS) corpus: building a high quality HMM-based speech synthesis system using a high sampling rate." Speech Communication, vol. 53, pp. 442-450 (2011)
3. Toma, Ştefan-Adrian, et al. "MaRePhoR—An open access machine-readable phonetic dictionary for Romanian." 2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD). IEEE (2017)
Lists of Romanian Named Entities
The lists of NEs contain (simple and compound) proper nouns distributed as follows:
15,944 names of people (first names, surnames),
5,336 names of places and
6,441 names of companies, legal entities and institutions.
Each type of NE is stored in its own file, whose name indicates its contents.
These lists are comprehensive, but not exhaustive.
Diana Popescu made an important contribution to the creation of these lists during her research stay at ICIA in spring 2020.
TEPROLIN: a Dockerized, Romanian text processing platform
TEPROLIN is a Python 3 module that standardizes the interoperability of different Romanian text processing tools. In its current version, available on GitHub, TEPROLIN can perform 15 text processing operations for Romanian, automatically inferring the execution dependencies among operations (e.g. POS tagging requires tokenization beforehand). Besides standard text processing operations such as dependency parsing or NER, TEPROLIN offers operations that are useful for tools dealing with spoken Romanian, such as phonetic transcription or stressed syllable detection.
References:
Radu Ion. (2018). TEPROLIN: An Extensible, Online Text Preprocessing Platform for Romanian. In Proceedings of the 13th International Conference CONSILR 2018, Iași, November 22-23.
Vasile Păiș, Radu Ion and Dan Tufiș. (2020). A Processing Platform Relating Data and Tools for Romanian Language. In Proceedings of the 1st International Workshop on Language Technology Platforms (IWLTP 2020), European Language Resources Association (ELRA), Georg Rehm et al. (eds.), pp. 81-88 - indexed by DBLP and ISI Web of Science.
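The idea of inferring execution dependencies among operations, as described for TEPROLIN above, can be sketched with a topological sort. This is only an illustration and not TEPROLIN's actual code or API: the operation names, the `DEPENDS_ON` table and the `execution_plan` function are hypothetical. Only the rule that POS tagging requires tokenization comes from the description above; the other dependencies are assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency table: operation -> operations it needs first.
# Only "pos-tagging needs tokenization" is stated in the text; the rest
# are assumptions for the sake of the example.
DEPENDS_ON = {
    "tokenization": set(),
    "pos-tagging": {"tokenization"},
    "lemmatization": {"pos-tagging"},
    "dependency-parsing": {"pos-tagging", "lemmatization"},
}


def execution_plan(requested):
    """Return the requested operations plus their prerequisites, in run order."""
    needed = set()
    stack = list(requested)
    while stack:  # collect the transitive closure of prerequisites
        op = stack.pop()
        if op not in DEPENDS_ON:
            raise ValueError(f"unknown operation: {op}")
        if op not in needed:
            needed.add(op)
            stack.extend(DEPENDS_ON[op])
    # static_order() yields every operation after all of its prerequisites.
    sorter = TopologicalSorter({op: DEPENDS_ON[op] for op in needed})
    return list(sorter.static_order())


plan = execution_plan(["pos-tagging"])
print(plan)  # tokenization is scheduled before pos-tagging
```

With a table like this, a caller only names the operation they want, and the prerequisites are scheduled automatically in a valid order.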