Romanian Language Technology

Adriana Vlad, Adrian Mitrea * Estimating Conditional Probabilities and Digram Statistical Structure in Printed Romanian

In order to determine the signal-to-noise ratio and the relative error from relations (4), as p is unknown, in a first approximation we have replaced p by its estimate $\hat{p}$ (which is usual [6,7]). These are the (S/N) and ε_r values given in Tables 2.x - 3.x, where x=1,2. If, after this substitution, the calculated value of the relative error ε_r is small enough, then the difference introduced by replacing p with $\hat{p}$ is also small, and consequently we can accept the result as valid. Thus, a 100(1-α)% confidence interval for p results:

$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \qquad (5)$$
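As an illustration, the interval of relation (5) can be evaluated numerically. The sketch below assumes that relation (5) is the usual normal-approximation interval, i.e. the estimate plus or minus z_{α/2} times the estimated standard deviation; the function name `wald_interval` and the counts used in the example are hypothetical and are not taken from the tables of this paper.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(k, n, alpha=0.05):
    """Normal-approximation confidence interval for a probability.

    k     -- number of times the event was observed (hypothetical name)
    n     -- number of i.i.d. observations, N in the text
    alpha -- significance level; 0.05 gives a 95% interval
    """
    p_hat = k / n                                    # relative-frequency estimate
    z = NormalDist().inv_cdf(1 - alpha / 2)          # about 1.96 for alpha = 0.05
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)   # z times estimated std. dev.
    return p_hat - half_width, p_hat + half_width


# Hypothetical counts: an event seen 700 times in a sample of 6,716 letters.
low, high = wald_interval(700, 6716)
print(f"{low:.4f} < p < {high:.4f}")
```

The interval is symmetric about the estimate, which is why its limits can be written down directly once the estimate and its relative error are known.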

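A more accurate value for the confidence interval of the estimated probability, [7], is:

p₁ < p < p₂,

where:

$$p_{1,2} = \frac{\hat{p} + \dfrac{z_{\alpha/2}^{2}}{2N} \mp z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{N} + \dfrac{z_{\alpha/2}^{2}}{4N^{2}}}}{1 + \dfrac{z_{\alpha/2}^{2}}{N}} \qquad (6)$$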

In this case too, we can say that the true value p will lie within the interval (p₁, p₂) with a confidence level of (1-α), i.e. in 100(1-α)% of cases.
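A short numerical sketch may help to compare the two kinds of limits. It assumes that relation (6) has the standard score-interval form found in textbooks (estimate shifted by z²/(2N), square root widened by z²/(4N²), everything divided by 1 + z²/N); the function name `score_interval` and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def score_interval(k, n, alpha=0.05):
    """Limits (p1, p2) of the score-type confidence interval (assumed form)."""
    p_hat = k / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    z2 = z * z
    centre = p_hat + z2 / (2 * n)                                 # shifted centre
    spread = z * sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    denom = 1 + z2 / n
    return (centre - spread) / denom, (centre + spread) / denom


# Hypothetical counts: a small sample and a large one with similar estimates.
for k, n in [(5, 50), (700, 6716)]:
    p1, p2 = score_interval(k, n)
    print(f"N={n}: {p1:.4f} < p < {p2:.4f}")
```

For the small sample the limits are visibly asymmetric about the estimate, while for the large one they nearly coincide with the symmetric normal-approximation limits.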

Observation: If N is large, it appears from relation (6) that the three terms involving $z_{\alpha/2}^{2}$ are negligible compared to the other terms; this leads to relation (5). For example, in our illustrations, where $z_{\alpha/2} = 1.96$ and $N\hat{p}(1-\hat{p}) > 20$, relation (5) can be used to obtain the upper and lower confidence limits for p.
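The size of the neglected terms can be checked with a short worked example. It assumes that relation (6) has the standard score-interval form, in which the three correction terms are z²/N, z²/(2N) and z²/(4N²), and it takes N = 6,716 (the Y₁ length for marker-letter E without blanks, Table 1) with z = 1.96; this particular N is chosen only for illustration.

```latex
\begin{align*}
\frac{z_{\alpha/2}^{2}}{N}    &= \frac{3.8416}{6716} \approx 5.7 \times 10^{-4}, \\
\frac{z_{\alpha/2}^{2}}{2N}   &\approx 2.9 \times 10^{-4}, \\
\frac{z_{\alpha/2}^{2}}{4N^{2}} &\approx 2.1 \times 10^{-8}, \\
\frac{z_{\alpha/2}^{2}/(4N^{2})}{\hat{p}(1-\hat{p})/N}
  &= \frac{z_{\alpha/2}^{2}}{4N\hat{p}(1-\hat{p})}
   < \frac{3.8416}{80} \approx 0.048
   \quad \text{when } N\hat{p}(1-\hat{p}) > 20 .
\end{align*}
```

Under this assumption, the last line shows that the condition on the product of N, the estimate and its complement bounds the correction under the square root to less than 5% of the main term, whatever the value of the estimate.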

On the other hand, we can determine in advance the volume N of experimental data needed to obtain the desired accuracy of the confidence interval (N results from relations (4)). The calculations can therefore be repeated over several iterations.
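The choice of N can be sketched as follows. The code assumes that relations (4) define the relative error as ε_r = z_{α/2}·sqrt((1-p)/(N·p)), which is solved for N; this form is an assumption made for the illustration, and the function name `required_sample_size` and the guessed probabilities are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(p_guess, eps_r, alpha=0.05):
    """Smallest N giving relative error eps_r for a guessed probability.

    Assumes eps_r = z * sqrt((1 - p) / (N * p)), solved for N.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z * z * (1 - p_guess) / (eps_r ** 2 * p_guess))


# A first guess for p gives a first N; after estimating p from that sample,
# the computation can be repeated with the new estimate.
for p_guess in (0.10, 0.01):
    n = required_sample_size(p_guess, eps_r=0.1)
    print(f"p about {p_guess}: N >= {n}")
```

Rare events need far more data for the same relative accuracy, which is one reason why the calculation may have to be iterated once a first estimate is available.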

4. Experimental results

The experimental results and illustrations were obtained from the texts already mentioned in the Introduction; these texts were concatenated in the order in which they are cited there. The lengths of the whole printed Romanian text in the two situations considered, with and without blanks, are given in Table 1. The same table also gives the lengths of the corresponding Y and Y₁ texts. Texts generated by the Y₁ source have different lengths depending on the choice of the marker-letter. As an example, the lengths corresponding to the E, P and Ţ markers (i.e. high-, medium- and low-frequency marker-letters) are presented.

Table 1. The lengths (number of letters) of the processed texts corresponding to X, Y, and Y₁ information sources
Text	Without blank	With blank
X - printed Romanian	12,672,756	15,122,431
Y - first-order Markov chain	55,004	65,330
Y₁ for marker-letter E	6,716	6,699
Y₁ for marker-letter P	1,660	1,729
Y₁ for marker-letter Ţ	654	652


For all probabilities (of p(i), p(j/i) and p(i,j) type) the estimated values are given together with their corresponding relative errors, calculated for a significance level α of 0.05. This leads to a 95% confidence interval for the respective probabilities, whose upper and lower limits can be written directly as shown in Section 3, relation (5).

The estimates are listed in Tables 2.x - 4.x, where x=1 or 2 for the cases without or with blanks, respectively. (Blanks are denoted by the sign "-".)

The condition Np(1-p) >> 1 (required by the de Moivre-Laplace approximation to the binomial distribution) was checked each time; concretely, all estimates satisfied $N\hat{p}(1-\hat{p}) > 20$.

We have considered only cases in which the signal-to-noise ratio was greater than 20 (equivalently, relative error ε_r < 0.1).
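The two acceptance conditions can be checked together, as in the sketch below. It assumes that relations (4) define S/N = sqrt(N·p/(1-p)) and ε_r = z_{α/2}/(S/N), so that S/N > 20 gives ε_r < 1.96/20 ≈ 0.098 at α = 0.05; these forms, the function name `accuracy_checks` and the example counts are assumptions made for the illustration.

```python
from math import sqrt
from statistics import NormalDist


def accuracy_checks(k, n, alpha=0.05):
    """Evaluate the conditions under which an estimate is retained."""
    p_hat = k / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    npq = n * p_hat * (1 - p_hat)          # de Moivre-Laplace condition
    snr = sqrt(n * p_hat / (1 - p_hat))    # assumed form of S/N
    eps_r = z / snr                        # assumed form of the relative error
    return {
        "npq": npq,
        "snr": snr,
        "eps_r": eps_r,
        "accepted": npq > 20 and snr > 20,
    }


# Hypothetical counts: a frequent event and a rare one.
print(accuracy_checks(700, 6716))
print(accuracy_checks(20, 654))
```

With these hypothetical counts the frequent event passes both checks, while the rare one fails them and would be left out of the tables.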

We now give some further explanations regarding Tables 1-4.

A) Conditional probabilities, p(j/i)
Results concerning the estimation of conditional probabilities are given in the first four columns of Tables 3.1 and 3.2; these were obtained with our method described in Section 2, with the s parameter chosen as 200 letters (s = 200, Fig. 1). This means that the sample data (the N observations required for probability estimation) consisted of the N-letter text emitted by the Y₁ source, as shown in Fig. 2. These data comply with the i.i.d. statistical model, and consequently we can use relations (4)-(6) to control the error of the probability calculation.
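Once an i.i.d. sample is available, the estimation itself reduces to relative frequencies. The sketch below assumes that the Y₁ text for a given marker-letter i is available as a plain string in which each letter is one observation of the successor j; this representation, the function name `estimate_probabilities` and the toy string are hypothetical and do not reproduce the construction of Fig. 2.

```python
from collections import Counter


def estimate_probabilities(sample):
    """Relative-frequency estimates from an i.i.d. sample of letters.

    sample -- string of N observed letters (assumed layout of a Y1 text)
    Returns a dict mapping each observed letter to count / N.
    """
    n = len(sample)
    counts = Counter(sample)
    return {letter: count / n for letter, count in counts.items()}


# Toy data only; a real Y1 text contains hundreds or thousands of letters.
toy_sample = "RSTRNRSTLR"
for letter, p_hat in sorted(estimate_probabilities(toy_sample).items()):
    print(letter, p_hat)
```

Each estimate obtained this way can then be passed, together with N, to the error formulas of Section 3.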