USPDATRO

Funding

European Language Equality (ELE) 2 Open Call for SRIA Contribution Projects

About USPDATRO

The project studies the usability of open data for building speech datasets that cover types of voices usually missing or underrepresented in existing speech datasets. We will conduct a case study on the Romanian language, with the possibility of applying the same methodology to any other language. We will survey existing multimedia open data and record the platforms, the types of media, the percentage of usable voices in a data sample, the types of open licenses, the types of underrepresented voices (including children, young people, older people, and women), and the percentage of underrepresented voices. To validate the methodology, we will build a pilot dataset of underrepresented Romanian voices aligned with the corresponding textual representation.

Resources

The GitHub Repository contains scripts used for segmenting the original files, metadata creation, and internal utilities.

The USPDATRO Dataset is available on Zenodo, ELG, and RELATE.

Annotation Guide

This is the first version of the Annotation Guide, created before the annotation process started. An updated version of the Annotation Guide is available as an appendix in the final project report.

USPDATRO 15 February - 15 May 2023
