The project studies the usability of open data for building speech datasets that cover types of voices usually missing or underrepresented in existing speech datasets. We will conduct a case study on the Romanian language, with the possibility of applying the same methodology to any other language. We will identify existing multimedia open data and characterize it by platform, type of media, percentage of usable voices in a data sample, type of open license, types of underrepresented voices (including children, young people, older people, women, etc.), and percentage of underrepresented voices. To validate our methodology, we will build a pilot dataset of Romanian underrepresented voices aligned with the corresponding textual representation.

Annotation Guide

This is the first version of the Annotation Guide, created before the annotation process started. An updated version of the Annotation Guide is available as an appendix to the final project report.


Research Institute for Artificial Intelligence, Romanian Academy

Principal investigator

dr. Vasile Păiș

Team members

  • dr. Elena Irimia
  • dr. Radu Ion
  • dr. Verginica Barbu Mititelu
  • acad. Dan Tufiș




Calea 13 Septembrie nr 13,
sector 5, București, 050711