
ROMANIAN LANGUAGE RESOURCES CONVERTED TO LINKED DATA SPECIFICATIONS
The development of the Romanian WordNet (RoWN) started in the BalkaNet project.
At the moment, it contains 59,348 synsets, covering 53,092 words with 85,227 senses.
There are 2,216 nonlexicalized synsets. RoWN is aligned to Princeton WordNet 3.0 (PWN).
There are 541 synsets not aligned to PWN: they are considered Balkan-specific concepts, and no corresponding synset was identified in PWN.
There are 138,592 relations in RoWN.
RoWN is available under the license CC BY-SA 4.0.
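The RoWN figures quoted above can be put in proportion with a little arithmetic. The sketch below only reuses the numbers stated in the text; the variable names are chosen here for illustration and do not come from the resource itself.

```python
# Figures quoted in the RoWN description above.
SYNSETS = 59_348
WORDS = 53_092
SENSES = 85_227
NONLEXICALIZED = 2_216
NOT_ALIGNED = 541
RELATIONS = 138_592

# Average number of senses per word (a rough polysemy indicator).
senses_per_word = SENSES / WORDS

# Share of synsets without a lexicalization, and without a PWN counterpart.
nonlexicalized_share = NONLEXICALIZED / SYNSETS
unaligned_share = NOT_ALIGNED / SYNSETS

# Average number of relations per synset.
relations_per_synset = RELATIONS / SYNSETS

print(f"senses per word:      {senses_per_word:.2f}")
print(f"nonlexicalized share: {nonlexicalized_share:.2%}")
print(f"not aligned to PWN:   {unaligned_share:.2%}")
print(f"relations per synset: {relations_per_synset:.2f}")
```

Under these figures, fewer than 1% of the synsets lack a PWN counterpart.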
Sample SPARQL queries:
These resources are available under the license CC BY-NC-ND 4.0.
Our activity is carried out in the context of the Nexus Linguarum COST Action.
The Romanian Reference Treebank (RoRefTrees or RRT) contains 9,523 sentences, with 218,511 tokens, distributed into domains as follows: 19.09% imaginative writing, 16.86% law, 12.70% medical, 11.46% translations from FrameNet, 9.97% academic writing, 9.79% journalistic, 3.80% science, 2.63% Wikipedia, and the rest are randomly selected sentences. The PoS-tagged sentences are syntactically parsed according to the Universal Dependencies (UD) principles.
The treebank has been included in all UD releases since 2016.
The corpus is available under the license CC BY-SA 4.0.
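UD treebanks are distributed in the CoNLL-U format, which has ten tab-separated columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The following is a minimal reading sketch, assuming a simplified view that keeps only six of the columns; the sample sentence is invented for illustration and is not taken from RRT.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    id: str
    form: str
    lemma: str
    upos: str
    head: str
    deprel: str


# Invented example sentence (NOT from RRT), in CoNLL-U layout.
SAMPLE = (
    "# text = Ana citește o carte.\n"
    "1\tAna\tAna\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tcitește\tciti\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\to\tun\tDET\t_\t_\t4\tdet\t_\t_\n"
    "4\tcarte\tcarte\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "5\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of Token tuples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment, skipped here
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(Token(cols[0], cols[1], cols[2], cols[3], cols[6], cols[7]))
    if current:                       # file may lack a trailing blank line
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
for tok in sentences[0]:
    print(tok.id, tok.form, tok.upos, tok.head, tok.deprel)
```

For real work, an established CoNLL-U library is likely a better choice, since this sketch does not treat multiword-token ranges or empty nodes specially.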
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations (ORG), locations (LOC), persons (PER), time expressions (TIME) and legal resources (LEGAL) mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as locations (where a link could be established).
The LegalNERo corpus is available in different formats: span-based, token-based and RDF. The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format.
The corpus is available under the license CC BY-NC-ND 4.0.
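To illustrate how span-based and token-based annotations relate, here is a minimal sketch that converts character-offset spans into per-token BIO labels. The tuple layout, the whitespace tokenization, and the example sentence are all assumptions made for this illustration; the actual LegalNERo file layouts may differ.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset (exclusive), label)


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Assign B-/I-/O labels to whitespace-separated tokens from character spans."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    labels = ["O"] * len(tokens)
    for start, end, label in spans:
        # Tokens that lie fully inside the span, in text order.
        inside = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]
        for k, i in enumerate(inside):
            labels[i] = ("B-" if k == 0 else "I-") + label
    return [(tok, lab) for (tok, _, _), lab in zip(tokens, labels)]


# Invented example sentence (NOT from the corpus).
text = "Ministerul Finanțelor din București a emis ordinul ."
org = "Ministerul Finanțelor"
loc = "București"
spans = [
    (text.index(org), text.index(org) + len(org), "ORG"),
    (text.index(loc), text.index(loc) + len(loc), "LOC"),
]

tagged = spans_to_bio(text, spans)
for token, label in tagged:
    print(f"{token}\t{label}")
```

The span form keeps offsets into the raw text, while the token form is what sequence-labelling models usually consume; both carry the same entity information as long as span boundaries coincide with token boundaries.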
Sample SPARQL queries:
SiMoNERo is a corpus of medical texts, containing 4,681 sentences and 146,020 tokens. The medical texts belong to three domains: cardiology (40.6%), diabetes (43%) and endocrinology (16.4%). The texts are morphologically and syntactically annotated, according to the Universal Dependencies (UD) specifications. Four types of medical entities are annotated in the corpus: disorders (DISO), chemicals (CHEM), body parts (ANAT) and medical procedures (PROC).
The corpus has been included in each biannual UD release since 2019.
The corpus is available under the license CC BY-SA 4.0.
Sample SPARQL queries: Don't forget to select the appropriate SiMoNERo dataset!
PARSEME-Ro is a journalistic corpus, containing 56,703 sentences and 1,015,624 tokens. The texts were annotated with four types of verbal multiword expressions, following the PARSEME guidelines. The texts are morphologically and syntactically annotated, according to the Universal Dependencies (UD) specifications.
The corpus is released together with the other PARSEME corpora.
The corpus is available under the license CC BY-SA 4.0.
Sample SPARQL queries: Don't forget to select the appropriate PARSEME-Ro dataset!
The ROBIN Technical Acquisition Speech Corpus (ROBINTASC) was developed within the ROBIN project. Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. It contains over 6 hours of read speech in Romanian. We provide text files, the associated speech files (WAV, 44.1 kHz, 16-bit, single channel), and annotated text files in CoNLL-U format. The main archive is available here.
The corpus is available under the license CC BY-NC-ND 4.0.
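Before processing the speech files, it can be useful to confirm that a given WAV file matches the stated format (44.1 kHz, 16-bit, single channel). The sketch below uses Python's standard `wave` module; the function names and the file path in the usage comment are hypothetical and not part of the corpus distribution.

```python
import wave
from typing import Dict, Union

# Format stated in the corpus description: 44.1 kHz, 16-bit, mono.
EXPECTED_RATE_HZ = 44_100
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16 bits
EXPECTED_CHANNELS = 1


def describe_wav(path: str) -> Dict[str, Union[int, float]]:
    """Read the header of a PCM WAV file and return its basic parameters."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }


def matches_stated_format(info: Dict[str, Union[int, float]]) -> bool:
    """Check the parameters against the format given in the description."""
    return (
        info["rate_hz"] == EXPECTED_RATE_HZ
        and info["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH_BYTES
        and info["channels"] == EXPECTED_CHANNELS
    )


# Usage (hypothetical file name):
#   info = describe_wav("some_recording.wav")
#   print(info, matches_stated_format(info))
```

Summing `duration_s` over all files would also give a quick check of the total amount of recorded speech.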
Sample SPARQL queries:
RoLEX contains 330,866 entries. For each entry the following information is provided: lemma, morphosyntactic description, syllables, place of lexical stress and phonetic transcription (using the SAMPA alphabet). This lexicon was created in the ReTeRom project.
The lexicon is available under the license CC BY-NC-ND 4.0.
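The information listed for each RoLEX entry maps naturally onto a small record type. The sketch below is purely illustrative: the tab-separated layout, the hyphen as syllable separator, the 1-based stress index, and the sample entry are all assumptions made here, not the actual RoLEX file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LexiconEntry:
    lemma: str
    msd: str                 # morphosyntactic description
    syllables: List[str]
    stress: int              # 1-based index of the stressed syllable (assumed convention)
    sampa: List[str]         # phonetic transcription, one SAMPA symbol per item

    @property
    def stressed_syllable(self) -> str:
        return self.syllables[self.stress - 1]


def parse_entry(line: str) -> LexiconEntry:
    """Parse one line in the ASSUMED layout: lemma, MSD, syllables, stress, SAMPA."""
    lemma, msd, syllables, stress, sampa = line.rstrip("\n").split("\t")
    return LexiconEntry(
        lemma=lemma,
        msd=msd,
        syllables=syllables.split("-"),
        stress=int(stress),
        sampa=sampa.split(" "),
    )


# Invented sample line (NOT copied from RoLEX).
entry = parse_entry("casă\tNcfsrn\tca-să\t1\tk a s @")
print(entry.lemma, entry.msd, entry.stressed_syllable, " ".join(entry.sampa))
```

Keeping syllables and SAMPA symbols as lists rather than raw strings makes it easier to look up the stressed syllable or to align phonemes with syllables later on.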
Sample SPARQL queries: