Romanian Language Technology

Adriana Vlad, Adrian Mitrea * Estimating Conditional Probabilities and Digram Statistical Structure in Printed Romanian

The first column of Tables 3.1 and 3.2 lists the (i, j) digrams for which the estimated relative error of the conditional probability p(j/i) is less than 0.1 (the digrams are sorted alphabetically). The second column gives the estimates of the conditional probabilities p(j/i), while the third and fourth columns give the corresponding calculated values of the signal-to-noise ratio and of the relative error, respectively.

For example, in Table 3.1, for digram RE one finds that the estimate of p(E/R) is 0.2745, the experimental value of the signal-to-noise ratio is 39.47, and the relative error is 0.0497. This leads to the following 95% confidence interval for p(E/R): 0.2745 (1±0.0497).
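The interval written as 0.2745 (1±0.0497) can be expanded into explicit limits. The following is a minimal sketch of that arithmetic only; the function name is hypothetical and not taken from the paper.

```python
def interval_from_relative_error(p_hat, rel_err):
    """Expand p_hat * (1 -/+ rel_err) into explicit (lower, upper) limits."""
    return p_hat * (1.0 - rel_err), p_hat * (1.0 + rel_err)

# Values quoted in the text for digram RE in Table 3.1.
low, high = interval_from_relative_error(0.2745, 0.0497)
print(f"p(E/R) in ({low:.4f}, {high:.4f})")  # prints: p(E/R) in (0.2609, 0.2881)
```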

B) Letter probabilities, p(i)
Letter probabilities are re-evaluated here (Tables 2.1 and 2.2) because our study is based on texts that follow the new orthography (after 1993) and because these probabilities require error control. The results agree with [3,10].

The experimental data (the N observations used for the probability calculation) were obtained by periodically sampling the initial printed Romanian text X with a step of 200 letters. This yields a new information source, named X^*, which is the first-order approximation to Romanian and which complies with the i.i.d. statistical model mentioned in Section 3. The length of the processed X^* texts was 75,613 or 63,364, depending on whether blanks were considered or not.
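The periodic sampling described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sampling is assumed to start at the first symbol, and non-letter characters are assumed to be dropped before the step of 200 is counted.

```python
from collections import Counter

def sample_periodically(text, step=200, keep_blanks=True):
    """Keep every `step`-th symbol of the text (letters, optionally blanks)."""
    # Assumption: punctuation and digits are removed first, and the step is
    # counted over the remaining symbols.
    symbols = [c for c in text.upper() if c.isalpha() or (keep_blanks and c == " ")]
    return symbols[::step]

def letter_probabilities(sample):
    """Estimate p(i) as the relative frequency of each symbol in the sample."""
    counts = Counter(sample)
    n = len(sample)
    return {symbol: k / n for symbol, k in counts.items()}
```

Calling `letter_probabilities(sample_periodically(text, keep_blanks=False))` would correspond to the case without blanks.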

The second column of Tables 2.x, where x=1,2, contains the estimates of the letter probabilities, in decreasing order of their values. The third and fourth columns give the corresponding estimates of the signal-to-noise ratio and of the relative error, respectively. Columns five and six contain the upper and lower confidence limits, p₂ and p₁ respectively, calculated for each letter i according to relation (6). (The confidence limits can also be calculated from the experimental relative error value, as shown in Section 3, relation (5); columns five and six make it possible to compare those results with the ones given by the more accurate relation (6).)

The last column contains, for comparison, the letter probability estimates calculated directly on the whole printed text X as the ratio between the number of occurrences and the length of the text (N total observations). In this case the experimental data are correlated.

For example, in Table 2.1, letter I has a probability estimate of 0.1058, the experimental values for the signal to noise ratio and for the relative error being 86.6 and 0.0226, respectively. The 95% confidence interval calculated with relation (6) is (0.1034, 0.1082). The probability estimate, calculated on the whole X text, is 0.1056.
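The figures quoted for letter I can be cross-checked numerically. Relations (4)-(6) are not reproduced in this section, so the sketch below rests on assumptions: the signal-to-noise ratio is taken as sqrt(N p / (1 - p)), the relative error as z / (S/N) with z = 1.96, and the "more accurate" interval is taken to be a Wilson-type score interval. N = 63,364 is the sample length quoted for the text without blanks; applying it to Table 2.1 is also an assumption.

```python
import math

Z_95 = 1.96  # standard normal quantile for a 95% confidence level

def signal_to_noise(p_hat, n):
    """Assumed form: mean over standard deviation of the relative frequency."""
    return math.sqrt(n * p_hat / (1.0 - p_hat))

def score_interval(p_hat, n, z=Z_95):
    """Wilson score interval, used here as a stand-in for relation (6)."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

p_hat, n = 0.1058, 63364            # values quoted in the text
snr = signal_to_noise(p_hat, n)     # about 86.6
rel_err = Z_95 / snr                # about 0.0226
p1, p2 = score_interval(p_hat, n)   # about (0.1034, 0.1082)
```

Under these assumptions the computed values round to the quoted 86.6, 0.0226 and (0.1034, 0.1082), which suggests, but does not prove, that the assumed forms match the paper's relations.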

C) Digram probabilities, p(i,j)
Columns five and six of Tables 3.x, where x=1,2, contain the numerical results for digrams obtained with the method we proposed (Section 2). The fifth column gives the occurrence probability of the (i,j) digram calculated by means of relation (2), where the p(i) letter probabilities are taken from Table 2.x and the p(j/i) conditional probabilities from the second column of Table 3.x.

The sixth column of Table 3.x gives a cumulated value of the relative error implied by relation (2), taking into account the errors arising in both the p(i) and the p(j/i) probability calculations.
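As described above, relation (2) forms the digram probability from p(i) and p(j/i), read here as the product p(i,j) = p(i) p(j/i). The rule used to cumulate the two relative errors is not given in this section; the sketch below assumes the worst-case product rule (1 + e_i)(1 + e_ji) - 1. The values used for p(R) and its relative error are illustrative placeholders, not entries read from Table 2.1; only p(E/R) = 0.2745 and its relative error 0.0497 are quoted in the text.

```python
def digram_probability(p_i, p_j_given_i):
    """Relation (2) as read from the text: p(i,j) = p(i) * p(j/i)."""
    return p_i * p_j_given_i

def cumulated_relative_error(err_i, err_j_given_i):
    """Assumed combination rule: worst-case relative error of a product."""
    return (1.0 + err_i) * (1.0 + err_j_given_i) - 1.0

# Hypothetical placeholder inputs for letter R (NOT taken from Table 2.1).
p_r, err_r = 0.0761, 0.0271
# Values quoted in the text for digram RE.
p_e_given_r, err_e_given_r = 0.2745, 0.0497

p_re = digram_probability(p_r, p_e_given_r)
err_re = cumulated_relative_error(err_r, err_e_given_r)
print(f"p(R,E) = {p_re:.4f}, cumulated relative error = {err_re:.3f}")
```

With these placeholder inputs the product rounds to 0.0209 and the cumulated error is close to 0.078, which is indicative only.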

For the digram probability estimates, besides the method proposed and illustrated above, we also used two more conventional procedures for comparison.

The first procedure analyzed the digrams directly on the whole printed text X by successively considering pairs of adjacent letters, shifting by a single letter each time. (An initial text of N letters yields N-1 digrams.) The corresponding estimates of the p(i,j) probabilities (calculated as the ratio between the number of occurrences and the total number of digrams) are given in the seventh column of Table 3.x. In this case the sample data are correlated.
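The first procedure amounts to a sliding window of width two over the text. A minimal sketch follows, with a hypothetical function name and assuming the text has already been reduced to the symbols of interest.

```python
from collections import Counter

def digram_probabilities(letters):
    """Estimate p(i,j) from all N-1 overlapping adjacent pairs of an N-letter text."""
    total = len(letters) - 1
    if total < 1:
        return {}
    # zip(letters, letters[1:]) yields each pair of adjacent letters,
    # shifting by one letter at a time.
    counts = Counter(zip(letters, letters[1:]))
    return {pair: k / total for pair, k in counts.items()}
```

For a toy input such as `"ABAB"` there are three digrams (AB, BA, AB), so the estimate for AB is 2/3 and for BA is 1/3.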

The second procedure was to periodically sample the Romanian text with a step of 200 letters, recording the two adjacent letters each time. We can assume that the data sample obtained complies with the i.i.d. model, so that relations (4)-(6) can be used. The corresponding digram probability estimates are given in the eighth column of Tables 3.1 and 3.2.
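The second procedure can be sketched in the same style. The function names are hypothetical, and it is an assumption that each sampled digram consists of the letter at the sampling position and the letter immediately after it.

```python
from collections import Counter

def sampled_digrams(letters, step=200):
    """Record one adjacent pair every `step` letters (non-overlapping sample)."""
    return [(letters[k], letters[k + 1]) for k in range(0, len(letters) - 1, step)]

def sampled_digram_probabilities(letters, step=200):
    """Estimate p(i,j) as the relative frequency among the sampled pairs."""
    pairs = sampled_digrams(letters, step)
    counts = Counter(pairs)
    return {pair: k / len(pairs) for pair, k in counts.items()}
```

Because consecutive sampled pairs lie 200 letters apart, they are far less dependent on each other than the overlapping pairs of the first procedure, which is what motivates the i.i.d. assumption.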

For example, in Table 3.1, digram RE has a probability estimate of 0.0209 and a cumulated relative error of 0.0782, calculated on the basis of our method. If the other two procedures are used instead, the probability estimates are 0.0208 and 0.0206, respectively.

Finally, in Tables 4.x, where x=1,2, we present a hierarchy of digram occurrences for printed Romanian, obtained by sorting the data from the fifth column of Tables 3.x.

Observation: This is not an absolute hierarchy of the digram occurrences for the whole printed Romanian text X. The hierarchy is restricted to the data that satisfy the condition (S/N) > 20 for both the p(i) and the p(j/i) estimates.
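The construction of the constrained hierarchy can be sketched as a filter followed by a sort. The record layout, field names and sample rows below are hypothetical and serve only to illustrate the (S/N) > 20 condition on both estimates.

```python
def digram_hierarchy(rows, threshold=20.0):
    """Rank digrams by p(i,j), keeping only rows whose S/N exceeds the
    threshold for both the p(i) and the p(j/i) estimates."""
    kept = [r for r in rows
            if r["snr_letter"] > threshold and r["snr_cond"] > threshold]
    kept.sort(key=lambda r: r["p_digram"], reverse=True)
    # Ranks start at 1 for the most probable retained digram.
    return {r["digram"]: rank for rank, r in enumerate(kept, start=1)}

# Hypothetical rows, not taken from Tables 3.x.
rows = [
    {"digram": "XA", "p_digram": 0.010, "snr_letter": 50.0, "snr_cond": 25.0},
    {"digram": "XB", "p_digram": 0.030, "snr_letter": 60.0, "snr_cond": 15.0},
    {"digram": "XC", "p_digram": 0.020, "snr_letter": 40.0, "snr_cond": 30.0},
]
ranks = digram_hierarchy(rows)
```

In this toy input "XB" has the largest probability but is excluded because its conditional-probability estimate fails the threshold, which is why the resulting hierarchy is not absolute.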

A digram's rank in this hierarchy is found at the intersection of the corresponding row and column in Tables 4.x. For example, digram ES, with rank 12 in the text without blanks, is found at the intersection of row_1 and column_2 in Table 4.1.