Romanian Language Technology

Mihai Ciocoiu, Ştefan Bruda * GULiveR: Generalized Unification Based LR Parser for Natural Languages

Virtual copy arrays are basically trees. The useful information is stored in their leaves. Each edge leaving a node of such tree is labeled by a number from 0 to k, where k is the arity of the tree (or the degree of the virtual copy array). The piece of information stored at some index can be retrieved by converting that index in base k (the degree of the current array) and then following the path denoted by the digits of the index converted that way.

When updating such an array, a new array is created. This new array shares the whole unchanged part with the older one (in fact only the path to the newly added element is duplicated). As a consequence, the older array remains unchanged and the amount of supplementary memory space used for this operation is minimized. Also the time complexity of every operation over a virtual copy array (reading and writing) is O(log n), where n is the greatest non empty index in that array. For the current parser we used virtual copy arrays of degree 2, because of the easy index base 2 encoding.

A dag represented in that manner is a cons cell (Lisp sense) in which the car points to the skeleton and the cdr points to the environment. Due to the structure of the environment, the complexity of the unification process depends only linearly on the number of nodes in the dags participating to the process (which can be upper-bounded by a constant) and logarithmically on the number of previously applied unifications on the two dags (let this number be d). Therefore, the time complexity of unification is O(log d).

Another important consequence of the environment structure is that the duplication of a dag is a very efficient process: it is sufficient to duplicate the cons cell which represents the dag. Therefore this process is O(1) time bounded and needs only the dimension of a cons cell as supplementary memory space.

Note that this is a very important property for our parser because, as we will see below, when the parser performs a nondeterministic step, the parse forest needs to be duplicated. This duplication is very efficient due to the property of dag representation mentioned above.

4.3. The table builder

This tool creates a table which guides the parser during the analysis. Due to this table, a generalized LR parser becomes almost deterministic and therefore very efficient. The table building does not depend but on grammar, so it can be created at a moment previous to the effective parsing. Moreover, once created, such tables may be stored and loaded later. Therefore we do not need a real-time table builder. Even if we had not tried this, the table builder was written to be as fast as possible.

The first problem we needed to solve was the classification of dags in terminals and nonterminals (see section 4.1). We first thought that we can parse the context-free backbone of the grammar using an usual generalized LR parser and then we can add the unification information to the parse forest created in that way (i.e., to parse first a monadic grammar where only "cat" attributes are kept for each dag). This is a bad solution because it is highly improbable that the following two dags represent the same nonterminal:

	[ cat:VP ; inflection:[ type:intransitive ] ]
	[ cat:VP ; inflection:[ type:transitive ] ]

even if they have the same category.

The solution we found is the following: two "equal" dags are considered the same terminal or nonterminal. Two dags are "equal" if they are able to unify each other. In the phase of table building we do not need to unify two dags but only to to verify if they are unifiable. Therefore we wrote a new predicate which duplicates the two dags, unifies the copies and returns the result (whether they unify or not).

A dag is a nonterminal if a dag "equal" to it appears in the left hand side of some productions of the grammar.

The table builder itself is a classical one [8], with minor changes made in order to keep all variants in multiple action entries [10] and with another few changes due to unification-based grammars. The grammar used by the table builder is augmented by production

	[ cat:S' ] --> S,

where S denotes the start dag of the original grammar. The start dag of the new grammar becomes [ cat:S' ]. A special symbol $ denotes the end of the string.

101