Mihai Ciocoiu, Štefan Bruda * GULiveR: Generalized Unification Based LR Parser for Natural Languages




4. The generalized unification based LR Parser

4.1. The problem

The original Tomita algorithm [10] works with monadic categories. For such a grammar, a finite number of nonterminals is defined. A unification-based grammar defines an infinite set of nonterminals, because the dags defined by grammar productions are not merely nonterminals, but nonterminal patterns.

For this reason, when using a monadic grammar, nonterminals can be identified by their internal names, which do not change during the parsing process, because the information contained in a nonterminal does not change either. In a PATR-II grammar (or any unification-based formalism), two instances of a nonterminal defined by the same production will generally contain two different sets of data (new attributes, or different values for some attributes). Because of this difference between instances, dags may have to be copied during parsing and, of course, this operation needs to be as efficient as possible.
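The point can be illustrated with a minimal sketch (the rule and feature names below are hypothetical, not taken from any particular PATR-II grammar): two instances of the same nonterminal pattern carry different feature values, so neither can be identified by the pattern's name alone, and each instantiation requires a copy.

```python
import copy

# Hypothetical nonterminal pattern for NP, as a feature structure (dag).
NP_PATTERN = {"cat": "NP", "agreement": {"number": None}}

def instantiate(pattern, **features):
    """Create a fresh instance of a nonterminal pattern.

    Each instance must be an independent copy: filling in features
    on one instance must not leak into the pattern or into other
    instances.
    """
    inst = copy.deepcopy(pattern)
    inst["agreement"].update(features)
    return inst

np_singular = instantiate(NP_PATTERN, number="singular")
np_plural = instantiate(NP_PATTERN, number="plural")

# Both instances are "NP", yet they contain different data:
assert np_singular["cat"] == np_plural["cat"] == "NP"
assert np_singular["agreement"] != np_plural["agreement"]
```

Since every instantiation performs such a copy, the cost of dag duplication dominates, which motivates the structure-sharing representation discussed below.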

Moreover, the infinite number of nonterminals generates a new problem: how can a dag be classified as a nonterminal or as a terminal²? In a monadic grammar, the nonterminals are precisely specified; therefore any symbol which is not in the nonterminal set is a terminal. Even if in a chart parser or any similar parser it is not necessary to distinguish between terminals and nonterminals, in an LR parser (and, of course, in a generalized LR one) this classification is vital, because the table which guides such parsers contains two sections, one indexed by nonterminals and the other by terminals.

Finally, unification itself generates a problem, because unification must satisfy two conflicting requirements: it must be both correct and efficient. These requirements conflict because both correctness and efficiency depend on the volume of data that is copied: as this volume increases, correctness increases while efficiency decreases.

4.2. Unification

For dag unification we have chosen an implementation of Pereira's algorithm [13]. This algorithm ensures not only efficient unification but also efficient duplication of a dag.

The main idea of this algorithm is that the information contained in a processed dag is equivalent to the information contained in the original dag together with a set of updates to it. Thus, a dag is represented as a pair consisting of the dag skeleton and a set of update records called the dag environment.

During the unification process, a node of a dag may be either augmented by adding new edges to it or rerouted to another node. This is done by adding new update records to the dag environment.

The representation of the skeleton is not critical, as it never changes. The skeleton nodes are indexed because they are referred to in the update records. However, the indexes are generated on demand: an index is generated for a node only when that node needs to be referred to. This keeps the update records very compact.

An update record specifies either an edge appended to a node of the skeleton or the rerouting of a node. The environment is essentially an array of such update records. For the sake of efficiency, this array has a special implementation: it is a virtual copy array.
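The skeleton-plus-environment representation can be sketched as follows (a simplified illustration of the idea; the class and field names are ours, not the paper's actual implementation):

```python
class Dag:
    """A dag as a pair: an immutable skeleton plus an environment of
    update records, keyed here by skeleton node index for simplicity."""

    def __init__(self, skeleton, env=None):
        self.skeleton = skeleton          # never modified after creation
        self.env = env if env is not None else {}

    def _record(self, node_index):
        # Fetch or create the update record for a skeleton node.
        return self.env.setdefault(node_index,
                                   {"edges": {}, "reroute": None})

    def add_edge(self, node_index, label, target_index):
        """Record an edge appended to a skeleton node, without
        touching the skeleton itself."""
        self._record(node_index)["edges"][label] = target_index

    def reroute(self, node_index, target_index):
        """Record that a node has been rerouted to another node
        (the effect of unifying the two nodes)."""
        self._record(node_index)["reroute"] = target_index

    def copy(self):
        """Duplicating a dag copies only the (small) environment;
        the skeleton is shared between the two dags."""
        env_copy = {i: {"edges": dict(r["edges"]), "reroute": r["reroute"]}
                    for i, r in self.env.items()}
        return Dag(self.skeleton, env_copy)

# A dag and an updated duplicate sharing the same skeleton:
d1 = Dag(skeleton=("np", "det", "n"))    # skeleton details are irrelevant here
d1.add_edge(0, "agreement", 1)
d2 = d1.copy()
d2.reroute(1, 2)                          # d1's environment is unaffected
```

Since the skeleton is shared and only the environment is copied, duplicating a dag is cheap in proportion to the number of updates, not to the size of the dag.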

An ordinary array is an indexed collection of cells containing some data. When the information stored in some cell needs to be modified, the new information is stored in the same place as the old one, which cannot be recovered anymore. If we want to keep the old information, then we need to duplicate the whole array, and this is a costly operation (in both time and space). In our case, arrays must be duplicated all the time: each unification produces a new environment (i.e., a new array) and should not modify the old one.
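The virtual copy idea can be sketched as a persistent array in which a "modification" produces a new version sharing all unchanged cells with the old one (a minimal delta-chain sketch of the concept; a production virtual copy array would also rebalance so that the newest version keeps O(1) access):

```python
class VirtualCopyArray:
    """A persistent array: set() returns a new version without copying
    or modifying the underlying storage of the old one."""

    def __init__(self, cells=None, parent=None, index=None, value=None):
        self._cells = cells        # real storage, present only in the root
        self._parent = parent      # older version this one derives from
        self._index = index        # the one cell changed w.r.t. the parent
        self._value = value

    @classmethod
    def of(cls, values):
        return cls(cells=list(values))

    def get(self, i):
        # Walk back through the deltas until the root storage is reached.
        node = self
        while node._parent is not None:
            if node._index == i:
                return node._value
            node = node._parent
        return node._cells[i]

    def set(self, i, value):
        # O(1): record a delta instead of duplicating the array.
        return VirtualCopyArray(parent=self, index=i, value=value)

a = VirtualCopyArray.of([0, 0, 0])
b = a.set(1, 7)    # a is left intact; b shares a's unchanged cells
```

This matches the requirement above: each unification can produce a new environment in constant time, and the old environment remains valid and unmodified.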



² Note that the terminal domain is also infinite.

