The method developed here starts
from the second-order approximation of the NWL, as
described by Shannon in [1], Part I,
and then applies Theorem 2.11 (W. Doeblin), pp. 93-94,
[5].
The application of the method is illustrated
for printed Romanian, on long texts of about 12 million
characters. The estimates are given together with their significance
levels and confidence intervals. The required sample size
can also be determined in advance, so that the experimental
work can be resumed with greater accuracy.
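As an illustration of how a confidence interval and a sample size fixed in advance relate to each other, the following is a minimal sketch. It assumes independent observations and the usual normal approximation for a proportion, which is not necessarily the procedure of Sections 2 and 3. The names `wald_interval`, `required_sample_size`, `p_guess` and `half_width` are hypothetical and introduced only for this example.

```python
import math
from statistics import NormalDist


def wald_interval(count, n, confidence=0.95):
    """Normal-approximation confidence interval for the proportion count/n.

    Assumption: the n observations are independent.
    """
    p_hat = count / n
    # Two-sided quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def required_sample_size(p_guess, half_width, confidence=0.95):
    """Smallest n whose normal-approximation half-width is <= half_width.

    p_guess is a prior guess of the probability; 0.5 is the worst case.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * p_guess * (1 - p_guess) / half_width ** 2)
```

Under these assumptions, `required_sample_size(0.5, 0.01)` returns 9604, the familiar worst-case sample size for a half-width of 0.01 at the 95% level.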
Starting from the estimates of
the conditional probabilities p(j/i), we obtain the
digram probabilities p(i,j) for printed Romanian.
Furthermore, the quantitative
results for the digram probabilities were compared with those
obtained by other, more conventional methods.
The paper also presents estimates of the letter occurrence probabilities
in printed Romanian. These estimates are of interest in their own
right, but here they are needed to calculate the digram probabilities
by the proposed method.
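The step from p(j/i) to p(i,j) relies on the identity p(i,j) = p(i) p(j/i), which is why the letter probabilities are needed. A minimal sketch follows. It uses plain relative frequencies over a whole text as stand-ins for the estimates, which is an assumption of this example and not the sampling procedure of Sections 2 and 3. All function names are hypothetical.

```python
from collections import Counter


def letter_probabilities(text):
    """Relative frequency of each letter, used here as an estimate of p(i)."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}


def conditional_probabilities(text):
    """Relative frequency of j immediately after i, as an estimate of p(j/i)."""
    pair_counts = Counter(zip(text, text[1:]))
    # Only letters that have a successor can start a digram.
    first_counts = Counter(text[:-1])
    return {(i, j): c / first_counts[i] for (i, j), c in pair_counts.items()}


def digram_probabilities(p_letter, p_cond):
    """Combine the two estimates through p(i,j) = p(i) * p(j/i)."""
    return {(i, j): p_letter[i] * p for (i, j), p in p_cond.items()}
```

For a text string `text`, `digram_probabilities(letter_probabilities(text), conditional_probabilities(text))` returns a dictionary keyed by letter pairs (i, j).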
The experimental values given in Tables 1 and 2.x - 4.x, where x = 1, 2,
were obtained by the procedures presented in Sections
2 and 3, on data sampled from the Romanian texts listed in the Appendix.
The texts were considered in two forms, with and without blanks;
punctuation marks and digits were removed (this does not restrict
the applicability of the method). The resulting alphabet
consists of the 31 letters of the Romanian language
(ABC...XYZĂÂÎȘȚ), or of 32 symbols if the blank is
included.
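The preprocessing described above can be sketched as follows. This is only an illustration: the handling of case, the mapping of the older cedilla forms of Ș and Ț to the comma-below forms, and the collapsing of runs of blanks are assumptions of this example, since the text does not specify them. The names `ROMANIAN_LETTERS` and `normalize` are hypothetical.

```python
# The 26 basic Latin letters plus the five Romanian ones: Ă Â Î Ș Ț.
ROMANIAN_LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u0102\u00c2\u00ce\u0218\u021a"

# Assumption: map the cedilla forms (U+015E, U+0162) to the comma-below forms.
_CEDILLA_TO_COMMA = str.maketrans("\u015e\u0162", "\u0218\u021a")


def normalize(text, keep_blanks=True):
    """Reduce text to the 31-letter alphabet, or 32 symbols with the blank."""
    text = text.upper().translate(_CEDILLA_TO_COMMA)
    letters = set(ROMANIAN_LETTERS)
    # Keep letters, turn any whitespace into a blank, drop everything else
    # (punctuation marks, digits, foreign symbols).
    kept = [ch if ch in letters else " "
            for ch in text if ch in letters or ch.isspace()]
    words = "".join(kept).split()
    # Assumption: runs of blanks are collapsed to a single blank.
    return " ".join(words) if keep_blanks else "".join(words)
```

Calling `normalize(raw, keep_blanks=True)` and `normalize(raw, keep_blanks=False)` on the same raw string gives the two variants of a text, with and without blanks.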
Note that the experimental
data come from texts of comparable lengths, namely:
2. Estimating p(j/i) and p(i,j) probabilities. Method presentation