Estimating Conditional Probabilities and Digram Statistical Structure in Printed Romanian

Adriana Vlad, Adrian Mitrea

1. Introduction

It is known that natural written languages (NWL) are well approximated by an m-th order (m < 30) ergodic Markov chain [1-4]. Building on this fact, we present a new method for estimating the conditional probabilities of a letter given the single preceding letter in NWL. These are the transition probabilities p(j/i), where p(j/i) is the probability that letter i is followed by letter j.
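To make the meaning of p(j/i) concrete, the sketch below computes the plain relative-frequency estimate from adjacent letter pairs. This is only an illustration of the quantity being estimated, not the estimation procedure developed in this paper; the function name transition_frequencies and the toy string are assumptions made for the example.

```python
from collections import Counter, defaultdict

def transition_frequencies(text):
    """Relative-frequency estimate of p(j/i): share of occurrences of i followed by j."""
    pair_counts = Counter(zip(text, text[1:]))  # (i, j) -> times i is followed by j
    first_counts = Counter(text[:-1])           # i -> times i has a successor
    table = defaultdict(dict)
    for (i, j), n in pair_counts.items():
        table[i][j] = n / first_counts[i]
    return dict(table)

# Toy input (hypothetical); real use would pass a long normalized text.
p = transition_frequencies("ABABAC")
```

Each row of the resulting table sums to 1, as required for the transition probabilities of a Markov chain.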

The method starts from the second-order approximation of the NWL, as described by Shannon in [1], Part I, and then applies Theorem 2.11 (W. Doeblin), pp. 93-94, [5].

The method is illustrated for printed Romanian, on long texts of about 12 million characters. The estimates are given together with their significance levels and confidence intervals. The required sample size can also be determined in advance, so that the experimental work can be repeated with greater accuracy.

Starting from the estimates of the p(j/i) probabilities, we obtained the digram probabilities p(i,j) for printed Romanian.

Furthermore, the quantitative results for the digram probabilities were compared with those obtained by other, more conventional methods.

The paper also presents estimates of the letter occurrence probabilities in printed Romanian. These letter probability estimates are of interest in their own right, but here they were required to calculate the digram probabilities by the proposed method.
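The role of the letter probabilities follows from the product rule p(i,j) = p(i) p(j/i). The sketch below applies that rule to a toy two-letter alphabet; the function name digram_probabilities and all numerical values are invented for illustration and are not results from this paper.

```python
def digram_probabilities(letter_prob, transition_prob):
    """Combine p(i) and p(j/i) into p(i,j) = p(i) * p(j/i)."""
    return {
        (i, j): letter_prob[i] * p_ji
        for i, row in transition_prob.items()
        for j, p_ji in row.items()
    }

# Made-up values for a two-letter alphabet (illustration only).
letter_prob = {"A": 0.6, "B": 0.4}
transition_prob = {
    "A": {"A": 0.5, "B": 0.5},
    "B": {"A": 0.75, "B": 0.25},
}
digrams = digram_probabilities(letter_prob, transition_prob)
```

The toy letter probabilities were chosen to be stationary for the toy transition table, so the digram probabilities sum to 1 and their marginals reproduce p(i).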

The experimental values given in Tables 1, 2.x - 4.x, where x=1,2, were obtained according to the procedures presented in Sections 2 and 3, on data sampled from the Romanian texts listed in the Appendix. The texts were considered in two situations: with and without blanks. Punctuation marks and figures were eliminated (this does not reduce the application area of the method). The resulting alphabet consists of the 31 letters of the Romanian language (ABC...XYZĂÂÎŞŢ), or 32 symbols if blanks are taken into consideration.
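A minimal sketch of such a text preparation step is given below. It is an assumption about how the reduction to the 31-letter or 32-symbol alphabet could be coded, not the authors' actual tooling: the name normalize, the folding to upper case, the collapsing of consecutive blanks into one, and the mapping of comma-below diacritics to their cedilla forms are all choices made for the example.

```python
# 26 basic Latin letters plus A-breve, A-circumflex, I-circumflex,
# S-cedilla (U+015E) and T-cedilla (U+0162): 31 letters in total.
ROMANIAN_LETTERS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ\u0102\u00C2\u00CE\u015E\u0162")

# Assumption: comma-below S/T (U+0218, U+021A) are treated as the same letters.
COMMA_TO_CEDILLA = str.maketrans("\u0218\u021A", "\u015E\u0162")

def normalize(text, keep_blanks=True):
    """Keep only Romanian letters (upper-cased) and, optionally, single blanks."""
    out = []
    for ch in text.upper().translate(COMMA_TO_CEDILLA):
        if ch in ROMANIAN_LETTERS:
            out.append(ch)
        elif ch.isspace() and keep_blanks and out and out[-1] != " ":
            out.append(" ")  # collapse any run of whitespace into one blank
    return "".join(out).strip()

# Hypothetical sample sentence containing figures and punctuation.
sample = "\u00CEn 1918, \u021Bara s-a unit."
with_blanks = normalize(sample, keep_blanks=True)
without_blanks = normalize(sample, keep_blanks=False)
```

Punctuation marks and figures are simply dropped, so a hyphenated form such as "s-a" becomes "SA"; whether the original experiments treated such cases the same way is not stated here.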

Note that the experimental data come from texts of comparable lengths, namely:

literary and philosophic texts by Romanian authors (Appendix, positions 1-7);
literary and philosophic texts by foreign authors translated into Romanian (Appendix, positions 14-21);
scientific and technical works by Romanian authors (Appendix, positions 8-13).

2. Estimating p(j/i) and p(i,j) probabilities. Method presentation

Let X be the initial information source, namely the multiple Markov chain approximating the respective written language (in our experiments, we considered a sample belonging to X, consisting of an N-letter text obtained from the works listed in the Appendix).