ADAMo

Automatic detection of AI-generated texts from Moldova and Romania

Project summary

The goal of the ADAMo project is to create a text classifier that can identify texts produced by artificial intelligence (AI). Although language-specific features will be employed, the resulting solution will be language-independent: given similar resources in other languages, new neural models can be trained to detect AI-generated texts in those languages. We will use the representative corpus of the contemporary Romanian language (CoRoLa), which contains more than 1 billion words (in written and spoken texts), as the original data for training the classifier. To cope with the language varieties in Romania and Moldova, CoRoLa will be enriched with at least 15M tokens of high-quality texts from Moldova, with cleared intellectual property rights, following the original corpus's design principles for data collection, metadata construction, and levels of annotation. The whole corpus will also undergo syntactic parsing, so as to capture similarities and differences at several linguistic levels and thus become a valuable resource for the study of the two varieties.

Partners

I.C.I.A.

The Research Institute for Artificial Intelligence “Mihai Drăgănescu” (ICIA) of the Romanian Academy

U.T.M.

Technical University of Moldova

Results

Automatic Detection of AI-Generated Texts from Moldova and Romania (ADAMo), A Project Presentation

at the 20th International Conference on Linguistic Resources and Tools for Natural Language Processing, Bucharest, 8-10 October 2025

conference presentation: Verginica Barbu Mititelu, Victoria Bobicev, Victoria Alexei, Rodica Braniște, Olesea Caftanatov, Maria Mitrofan, Radu Ion, Elena Irimia, Daniela Istrati, Ludmila Malahov, Sergiu Nisioi and Alexandr Parahonco

in E. Irimia et al. (eds.), Proceedings of the 20th International Conference Linguistic Resources and Tools for Natural Language Processing, Publishing House of the “Alexandru Ioan Cuza” University of Iași, 2025, pp. 249-264.

article published in the conference proceedings: Verginica Barbu Mititelu, Victoria Bobicev, Victoria Alexei, Rodica Braniște, Olesea Caftanatov, Maria Mitrofan, Radu Ion, Elena Irimia, Daniela Istrati, Ludmila Malahov, Sergiu Nisioi, Alexandr Parahonco, Vasile Păiș

Romance Reflexive Constructions Revisited

at the Universal Dependencies Workshop, 16 May 2026

article in Çağrı Çöltekin and Kaja Dobrovoljc (eds.), Proceedings of the Ninth Workshop on Universal Dependencies (UDW 2026), pp. 142–152, May 16, 2026.

authors: Verginica Barbu Mititelu, Elena Irimia, Adriana Pagano, Ioana Buhnila, Roxana Ciolăneanu

CoRoLa version 2.0: corpus enrichment and a new annotation level

at the 12th Workshop on Challenges in the Management of Large Corpora, 11 May 2026

article in Piotr Bański, Dawn Knight, Marc Kupietz, Andreas Witt, Alina Wróblewska (eds.), Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora (CMLC-12), European Language Resources Association (ELRA), 2026, pp. 91-97. ISBN: 978-2-493814-67-8.

authors: Elena Irimia, Verginica Barbu Mititelu, Radu Ion, Vasile Păiș, Maria Mitrofan, Dan Tufiș

Speech Corpus Transcription Experiments

at the 2nd UniDive training school in Yerevan, 20-24 January 2026

workshop presentation: Victoria Bobicev, Victoria Alexei

The Romanian Corpus Annotated with Multiword Expressions. PARSEME-Ro Version 2.0

at the international conference Language Resources and Evaluation Conference (LREC 2026), 13-15 May 2026

article in Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) (pp. 11979–11991). European Language Resources Association (ELRA). https://doi.org/10.63317/3dnsaryv4kdh.

authors: Verginica Barbu Mititelu, Mihaela Cristescu, Elena Irimia, Carmen Vasile

Cognitive impact:

The project has a major cognitive impact in the domains of Artificial Intelligence and Romanian linguistics, as it brings together in a single corpus texts from the two varieties of Romanian, with two major goals: their automatic linguistic comparison (supplemented by manual analysis) and an assessment of how the presence of two language varieties affects the detection of AI-generated texts. The results will contribute original empirical data to the existing literature in both domains, thus offering a sound foundation for further developments. We have already established synergy with the project Defending against deep fake news with large language and image models, in which the data collected in ADAMo will be used for deep fake news detection. Furthermore, the project results (the corpus) will serve as a high-quality source of information for the description of the Romanian language in both its varieties.

The cognitive impact is also supported through the development of the research competencies of the involved team, particularly in areas such as corpus collection, text preprocessing and processing, metadata-based description, and the development of classifiers for detecting linguistic varieties and AI-generated texts. The knowledge generated can be further transferred to the academic and educational environment through the integration of the results into training activities and scientific dissemination.

Socio-economic impact:

From a socio-economic perspective, the project addresses a major and highly topical challenge, namely the ability to distinguish between original texts and those generated by Artificial Intelligence, as well as between true information and false content. By collecting a corpus from the Republic of Moldova and conducting experiments involving prompt design and text generation, we will obtain linguistic material for developing classifiers capable of automatically distinguishing between original and AI-generated texts, as well as between the two varieties of the language. Through the linguistic resources and tools that will be made available to the community, the project will contribute to the development of tools for the automatic detection of AI-generated texts and, ultimately, of potentially false news.