Mihai Ciocoiu, Štefan Bruda * GULiveR: Generalized Unification Based LR Parser for Natural Languages
For this reason, nonterminals
can be identified by their internal names which do not change
during the parsing process when using a monadic grammar, because
the information contained in any nonterminal does not change anymore.
In a PATR-II grammar (or any unification-based formalism), two
instances of a nonterminal defined by a production will probably
contain two different sets of data (different new attributes or
different values for some attributes). Because of this difference
between two instances, dags may have to be copied during parsing
and, of course, this operation needs to be as efficient as possible.
Moreover, the infinite number
of nonterminals generates a new problem: how can be classified
a dag as nonterminal or as terminal2. In a monadic grammar, the
nonterminals are precisely specified; therefore any symbol which
is not in the nonterminal set is a terminal. Even if in a chart
parser or any similar parser it is not necessary to discern between
terminals and nonterminals, in a LR parser (and, of course, in
a generalized LR one) this classification is vital, because the
table which guides such parsers contains two section, one indexed
by nonterminals and the other by terminals.
Finally, unification itself generates
a problem, because unification must respect two opposite conditions:
it must be correct and efficient. These conditions are opposite
because both the correctness and the efficiency depend on the
volume of data which is copied: when this volume increases the
correctness increases and the efficiency decreases.
4.2. Unification
We have chosen for dag unification
an implementation of Pereira's algorithm [13].
This algorithm assures not only an efficient unification but also an efficient
duplication of a dag.
The main idea of this algorithm
is that the information contained in a processed dag is the same
as the information contained in the original dag and a set of
updates to it. Thus, a dag is represented as a pair which
contains the dag skeleton and a set of updating records
named the dag environment.
During the unification process,
a node of a dag may be either augmented by adding new edges
to it or rerouted to another node. This is made by adding
new updating records to the dag environment.
The representation of the skeleton
is not critical as it never changes. The skeleton nodes are indexed
because they are referred in the update records. Anyway, the indexes
are generated by request: when we need to refer a node, we generate
an index for it. This makes the update records very compact.
An update record may specify either
an edge appended to a node of the skeleton or the rerouting of
a node. The environment is basically an array of such update records.
For the sake of efficiency this array has a special implementation:
it is a virtual copy array.
An ordinary array is an indexed
collection of cells which contain some data. When the information
stored in some cell needs to be modified, the new information
is stored in the same place as the older one, which cannot be
recovered anymore. If we want to keep the older information, then
we need to duplicate the whole array and this is a costly operation
(time and space). In our case, arrays must be duplicated all the
time: each unification produces a new environment (i.e., a new
array) and should not modify the older one.
100
2 Note that the terminal domain is also infinite.