The Romanian Wordnet

  • the Romanian wordnet in XML/LMF format (download)
  • the Romanian wordnet in JSON format (download)
  • the Romanian wordnet in Turtle RDF format (download)

The Romanian Wordnet (RoWN)

The development of RoWN started in the BalkaNet project. It currently contains 59,348 synsets, covering 53,092 words with 85,227 senses. There are 2,216 nonlexicalized synsets. RoWN is aligned to Princeton WordNet 3.0 (PWN). A total of 541 synsets are not aligned to PWN: they are considered Balkan-specific concepts, and no corresponding synset was identified in PWN. There are 138,592 relations in RoWN.
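A few derived figures follow directly from the counts quoted above. The sketch below only does the arithmetic; the variable names are illustrative, and reading "lexicalized" as "all synsets minus the nonlexicalized ones" is an assumption.

```python
# Counts quoted in the RoWN description.
synsets = 59_348
words = 53_092
senses = 85_227
nonlexicalized = 2_216
unaligned = 541

# Synsets that have a counterpart in Princeton WordNet 3.0.
aligned = synsets - unaligned

# Assumption: every synset not counted as nonlexicalized is lexicalized.
lexicalized = synsets - nonlexicalized

# Average number of senses per word (a rough polysemy indicator).
senses_per_word = senses / words

print(f"aligned synsets:     {aligned}")
print(f"lexicalized synsets: {lexicalized}")
print(f"senses per word:     {senses_per_word:.2f}")
```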
RoWN is available under the license CC BY-SA 4.0.

CoRoLa-based linguistic information

CoRoLa word embeddings converted to LLOD format.
Lemma frequency from CoRoLa corpus converted to LLOD format, using the OntoLex-FRAC module.
Word frequency from CoRoLa corpus converted to LLOD format, using the OntoLex-FRAC module.

These resources are available under the license CC BY-NC-ND 4.0.

Treebanks

RRT, in linked data format

The Romanian Reference Treebank (RoRefTrees or RRT) contains 9,523 sentences, with 218,511 tokens, distributed into domains as follows: 19.09% imaginative writing, 16.86% law, 12.70% medical, 11.46% translations from FrameNet, 9.97% academic writing, 9.79% journalistic, 3.80% science and 2.63% Wikipedia; the rest are randomly selected sentences. The PoS-tagged sentences are syntactically parsed according to the Universal Dependencies (UD) principles.
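The share of randomly selected sentences is not stated explicitly, but it can be derived from the listed percentages. A minimal sketch of that arithmetic, together with the average sentence length (the dictionary keys simply repeat the domain names above):

```python
# Domain shares quoted in the RRT description, in percent.
shares = {
    "imaginative writing": 19.09,
    "law": 16.86,
    "medical": 12.70,
    "translations from FrameNet": 11.46,
    "academic writing": 9.97,
    "journalistic": 9.79,
    "science": 3.80,
    "Wikipedia": 2.63,
}

listed = sum(shares.values())
rest = 100.0 - listed  # share left for the randomly selected sentences

# Average sentence length from the quoted totals.
tokens_per_sentence = 218_511 / 9_523

print(f"listed domains:              {listed:.2f}%")
print(f"randomly selected sentences: {rest:.2f}%")
print(f"tokens per sentence:         {tokens_per_sentence:.1f}")
```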

The treebank has been included in all UD releases since 2016.

The corpus is available under the license CC BY-SA 4.0.


LegalNERo - Corpus with RDF-Turtle format included

LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations (ORG), locations (LOC), persons (PER), time expressions (TIME) and legal resources (LEGAL) mentioned in legal documents. Additionally, it offers GeoNames codes for the named entities annotated as locations (where a link could be established).

The LegalNERo corpus is available in different formats: span-based, token-based and RDF. The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format.
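To illustrate how a span-based and a token-based view of the same annotation relate, here is a minimal sketch that turns character-offset spans into per-token BIO tags. Everything in it is an assumption made for illustration: the `(start, end, label)` tuple layout, the whitespace tokenization, the BIO scheme and the example sentence are not taken from the LegalNERo distribution, whose actual file formats may differ.

```python
import re

def span_of(text, phrase, label):
    """Hypothetical helper: locate `phrase` in `text` and return a span tuple."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def spans_to_bio(text, spans):
    """Convert (start, end, label) spans (end exclusive) to per-token BIO tags.

    Tokens are whitespace-delimited, which is a simplification.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span when it lies entirely inside it.
            if start <= match.start() and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged

# Invented example sentence using three of the five entity types.
text = "Guvernul României adoptă Legea nr. 1/2011 la București"
spans = [
    span_of(text, "Guvernul României", "ORG"),
    span_of(text, "Legea nr. 1/2011", "LEGAL"),
    span_of(text, "București", "LOC"),
]

tokens = spans_to_bio(text, spans)
for token, tag in tokens:
    print(f"{token}\t{tag}")
```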

The corpus is available under the license CC BY-NC-ND 4.0.


SiMoNERo, Medical treebank annotated with domain-specific entities

SiMoNERo is a corpus of medical texts, containing 4,681 sentences and 146,020 tokens. The medical texts belong to three domains: cardiology (40.6%), diabetes (43%) and endocrinology (16.4%). The texts are morphologically and syntactically annotated, according to the Universal Dependencies (UD) specifications. Four types of medical entities are annotated in the corpus: disorders (DISO), chemicals (CHEM), body parts (ANAT) and medical procedures (PROC).

The corpus has been included in each biannual UD release since 2019.
The corpus is available under the license CC BY-SA 4.0.

Sample SPARQL queries: Don't forget to select the appropriate SiMoNERo dataset!

  • Tokens marked as CHEM
  • Tokens marked as DISO

PARSEME-Ro Treebank annotated with verbal multiword expressions

PARSEME-Ro is a journalistic corpus, containing 56,703 sentences and 1,015,624 tokens. The texts were annotated with four types of verbal multiword expressions, following the PARSEME guidelines. The texts are morphologically and syntactically annotated, according to the Universal Dependencies (UD) specifications.

The corpus is released together with the other PARSEME corpora.
The corpus is available under the license CC BY-SA 4.0.

Sample SPARQL queries: Don't forget to select the appropriate PARSEME-Ro dataset!

ROBIN, Technical Acquisition Speech Corpus (ROBINTASC)

The ROBIN Technical Acquisition Speech Corpus (ROBINTASC) was developed within the ROBIN project. Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. It contains over 6 hours of read speech in Romanian. We provide text files, the associated speech files (WAV, 44.1 kHz, 16-bit, single channel) and annotated text files in CoNLL-U format. The main archive is available here.
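The stated audio parameters can be checked with Python's standard `wave` module. The sketch below is only an illustration: the function name `matches_stated_format` is hypothetical, and a one-second silent recording built in memory stands in for a real corpus file. It also shows that, at these settings, uncompressed audio takes about 0.32 GB per hour, so 6 hours of speech come to roughly 1.9 GB of raw samples.

```python
import io
import wave

SAMPLE_RATE = 44_100   # Hz
SAMPLE_WIDTH = 2       # bytes per sample (16-bit)
CHANNELS = 1           # single channel

def matches_stated_format(wav_file):
    """Return True when a WAV file or stream is 44.1 kHz, 16-bit, mono."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == SAMPLE_RATE
            and wav.getsampwidth() == SAMPLE_WIDTH
            and wav.getnchannels() == CHANNELS
        )

# Build one second of silence in memory as a stand-in for a corpus file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(CHANNELS)
    out.setsampwidth(SAMPLE_WIDTH)
    out.setframerate(SAMPLE_RATE)
    out.writeframes(b"\x00" * SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
buffer.seek(0)
ok = matches_stated_format(buffer)

# Raw sample data per hour of audio at these settings, in bytes.
bytes_per_hour = SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS * 3600

print("format matches:", ok)
print(f"raw size for 6 hours: {6 * bytes_per_hour / 1e9:.2f} GB")
```

With a real file, a path string can be passed to `matches_stated_format` in place of the in-memory buffer.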

The corpus is available under the license CC BY-NC-ND 4.0.

RoLEX, lexicon for speech processing

RoLEX contains 330,866 entries. For each entry the following information is provided: lemma, morphosyntactic description, syllables, place of lexical stress and phonetic transcription (using the SAMPA alphabet). This lexicon was created in the ReTeRom project.

The lexicon is available under the license CC BY-NC-ND 4.0.

Team

PhD Verginica Mititelu, senior researcher II (coordinator)
Acad. Dan Tufiș (consultant)
PhD Elena Irimia, senior researcher III
PhD Vasile Florian Păiș, senior researcher III
PhD Maria Carp, senior researcher III
Eric Curea, senior researcher
Andrei Marius Avram, research assistant