The project aims to study the usability of open data for building speech datasets that cover types of voices usually missing or underrepresented in existing speech datasets. We will conduct a case study on the Romanian language,
with the possibility of applying the same methodology to any other language. We will identify existing multimedia open data and characterize it by platform, type of media, percentage of usable voices in a data sample, type of open license, types of underrepresented voices (such as children, young people, older people, and women), and percentage of underrepresented voices. To validate our methodology, we will build a pilot dataset of Romanian underrepresented voices aligned with the corresponding textual representation.