Dr. Radu ION
Senior Researcher II
Education
- 2007 Ph.D. (Magna Cum Laude) in Computer Science (Artificial Intelligence, Natural Language Processing) at the Romanian Academy. Thesis title: “Word Sense Disambiguation Methods Applied to English and Romanian” (in Romanian). An English resume can be found here.
- 2005 M.Sc. in Theoretical Linguistics, Faculty of Letters, University of Bucharest.
- 2002 M.Sc. in Computer Science, Faculty of Automatic Control and Computers, “POLITEHNICA” University of Bucharest, Romania.
- 2001 B.Sc. in Computer Science, Faculty of Automatic Control and Computers, “POLITEHNICA” University of Bucharest, Romania.
Present position
Senior researcher, 3rd grade at the “Mihai Drăgănescu” Research Institute for Artificial Intelligence of the Romanian Academy (work experience: 12 years). Before this, I was a Junior C++ programmer for the COSWIN application of the Romanian software company SIVECO Romania S.A.
Research interests and academic activities
I am interested in Machine Learning for Natural Language Processing and in the statistical analysis and modeling of written natural language. The range of problems that I am interested in includes Word Sense Disambiguation, Dependency Parsing and Word Alignment, with direct application to (Statistical) Machine Translation or Question Answering. I am also interested in Natural Language Understanding via (deductive, inductive, abductive) reasoning with ontology-based axioms.
I have served as a reviewer for the following conferences: LREC 2014, ACL 2013 and 2012, IJCNLP 2013, COLING 2012, ACL-HLT 2011, LREC 2010, RANLP 2009, COLING 2008, SemEval 2007 and 2010.
I have served as a reviewer for the following journals: Language Resources and Evaluation Journal, Journal of Natural Language Engineering, Journal of Speech and Technology and International Journal on Artificial Intelligence Tools. I am currently a member of the editorial board of the Proceedings of the Romanian Academy Series A, the Information Science section.
Google Scholar profile: here. Microsoft Academic profile: here.
- BalkaNet (2001-2004) – Design and Development of a Multilingual Balkan WordNet – European project (FP5, IST-2000-29388). My task was to check the interlingual conceptual alignment between the Princeton WordNet and the Romanian WordNet. To this end, I developed WSDTool (Perl and a Java interface), a WSD algorithm working with aligned wordnets and lexical aligners.
- CLARIN (2008-2010) – Common Language Resources and Technology Infrastructure (INFRA-2007-2.2-01, 212230). My task was to participate in the interoperable linguistic web services standardization group, which worked on how various NLP tools and resources can operate together and be represented in a unified way (e.g. how to use an independently developed English tokenizer with your own POS tagger).
- ACCURAT (2010-2012) – Analysis and evaluation of Comparable Corpora for Under Resourced Areas of machine Translation – European project (FP7-ICT-2009-4, grant agreement no. 248347). I was responsible for textual unit alignment and parallel data extraction from comparable corpora of any parallelism degree.
- MetaNet4U (2011-2013) – Enhancing the European Linguistic Infrastructure (ICT PSP Objective identifier: 6.1 Open linguistic infrastructure, grant agreement no. 270893). I was responsible for the standardization, characterization (with the agreed metadata) and documentation of computational linguistic resources (corpora, lexicons, NLP tools).
Software development
Recently developed NLP applications:
- A Romanian to CQP translator (Perl) which takes simple Romanian sentences describing search constraints (e.g. “Give me all sentences in which the common noun ‘car’ follows the main verb ‘drive’ within at most 5 words”) and translates them into CQP query scripts used to retrieve all matching contexts;
- PEXACC (Perl), a distributed tool for parallel sentence extraction from comparable corpora to be used in Statistical Machine Translation. A C# port of PEXACC is used by LEXACC, a Lucene-based tool for parallel sentence extraction from comparable corpora;
- A query spellchecker (Perl) that was part of the TiradeAI combined spellchecking system, which took 4th place at the Microsoft Research Speller Challenge. This spellchecker uses the 1-, 2- and 3-grams from the Google Web 1T 5-gram corpus, stored in a BerkeleyDB and exposed as a REST web service under Apache;
- A C# interface to the SRILM language modeling toolkit that we used to develop our own Statistical Machine Translation decoders;
- John (Perl), a custom Statistical Machine Translation decoder for translation from any language into English, using Moses phrase tables and SRILM language models. Currently, on DGT-TM data, John has a BLEU score of about 55% compared to Moses’ 58%.
- TTL, a Perl module which does sentence splitting, tokenization, POS tagging, lemmatization and chunking for free-running Romanian, English and French texts. The module is exposed as a SOAP-enabled web service with a custom interface written in Perl for the Apache 2 web server.
- LexPar, a Perl dependency linking program, using machine learning on raw texts, which produces a dependency-like analysis of an input sentence.
- YAWA, a Perl word alignment program for Romanian to English alignments to be used in Statistical Machine Translation.
Awards and achievements
- 4th place at the Microsoft Research Speller Challenge competition with the RACAI team (named TiradeAI). The task was to develop a web query spelling alteration algorithm that would correct the query and/or improve its capacity to produce more hits.
- 1st place at the CLEF 2009 question answering competition on Romanian legal documents (ResPubliQA) with a multi-factored QA system which linearly combines (with weights obtained from previous training) a range of relevance measures of the system’s responses to the user’s natural language query.
- The best unsupervised, knowledge-based, English All-Words, Fine Grained WSD system at the SemEval-2007 competition, featuring our Meaning Attraction Models.