GULiveR: Generalized Unification Based LR Parser for Natural Languages

Mihai Ciocoiu, Štefan Bruda


1. Introduction

GULiveR (the Generalized Unification Based LR Parser) is an environment for parsing natural languages, developed at the Romanian Academy of Sciences, Center for Artificial Intelligence. As its name indicates, GULiveR is based on unification, and its purpose is to support the development of applications involving natural language analysis. The environment consists of a parser, a parse table generator, a lexical analyzer, and a unifier.

Many modern linguistic theories, such as "Lexical-Functional Grammar" [1], "Functional Unification Grammar" [2,3], "Generalized Phrase Structure Grammar" [4], "Unification Categorial Grammar" [5] and "Head-Driven Phrase Structure Grammar" [6], have replaced the atomic categories of context-free grammars with structured categories that encode the syntactic and semantic properties of groups of words. These structured categories are specified indirectly, as constraints that must be satisfied. The lexical entries constrain the categories that may appear as leaves of the parse tree, and the grammar's productions constrain the structured categories corresponding to a parent node and its immediate descendants. As a consequence, a parse tree is well-formed if all the constraints are simultaneously satisfied. Our environment is based on PATR-II [7], one of the simplest and most influential unification-based formalisms.
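To make the constraint-based view concrete, the following is a minimal sketch of feature-structure unification in the spirit of PATR-II, with feature structures written as nested Python dictionaries. The grammar fragment, feature names, and the unify function are our own illustration (and ignore structure sharing, i.e. reentrancy); they are not the GULiveR implementation.

    # Minimal sketch of feature-structure unification in the spirit of PATR-II.
    # Feature structures are nested dictionaries; atomic values are strings.
    # Illustration only; structure sharing (reentrancy) is not handled.

    def unify(a, b):
        """Return the unification of feature structures a and b, or None on failure."""
        if a is None or b is None:
            return None
        if isinstance(a, dict) and isinstance(b, dict):
            result = dict(a)
            for key, value in b.items():
                if key in result:
                    merged = unify(result[key], value)
                    if merged is None:
                        return None           # conflicting values for the same feature
                    result[key] = merged
                else:
                    result[key] = value
            return result
        return a if a == b else None          # atoms unify only if they are equal

    # A PATR-II-style constraint such as  S -> NP VP,  <NP agr> = <VP agr>
    # succeeds for these two structures, since their 'agr' features are compatible:
    np = {"cat": "NP", "agr": {"num": "sg", "pers": "3"}}
    vp = {"cat": "VP", "agr": {"num": "sg"}}
    print(unify(np["agr"], vp["agr"]))        # {'num': 'sg', 'pers': '3'}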

Parsing algorithms were initially developed for programming languages [8], for which both top-down and bottom-up approaches have been tried. While the top-down approach is driven by information from the grammar, being similar to goal-oriented reasoning, the bottom-up approach tries to combine words from the input string into increasingly complex grammatical categories (data-driven reasoning).
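As a minimal illustration of the goal-oriented (top-down) style, the following sketch recognizes the toy grammar S -> a S b | c by expanding the goal S against the input. The grammar and the function are our own example, not part of GULiveR.

    # Tiny top-down (recursive-descent) recognizer for S -> a S b | c,
    # shown only to illustrate goal-oriented parsing.

    def parse_S(tokens, pos=0):
        """Try to recognize an S starting at position pos; return the new position or None."""
        if pos < len(tokens) and tokens[pos] == "c":          # S -> c
            return pos + 1
        if pos < len(tokens) and tokens[pos] == "a":          # S -> a S b
            after_s = parse_S(tokens, pos + 1)
            if after_s is not None and after_s < len(tokens) and tokens[after_s] == "b":
                return after_s + 1
        return None                                           # the goal S fails here

    print(parse_S(["a", "a", "c", "b", "b"]) == 5)            # True: the whole input is an S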

Both methods have the advantage of generality, being applicable to arbitrary context-free grammars, and both suffer from inefficiency, their complexity being exponential. This inefficiency comes from using only partial information (from the grammar or from the input string, respectively), which leads to a high degree of nondeterminism.

The simultaneous use of both kinds of information led to the development of combined chart-parsing algorithms, such as the Earley [9] and left-corner [8] algorithms. These approaches combine generality with speed, their time and space complexities being O(n^3) and O(n^2), respectively.
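For illustration, the following compact Earley recognizer shows the chart-parsing idea over a toy grammar. The grammar, item representation, and function names are our own and are unrelated to the GULiveR code.

    # Compact Earley recognizer over a toy CFG (illustration only).
    GRAMMAR = {
        "S":   [["NP", "VP"]],
        "NP":  [["Det", "N"]],
        "VP":  [["V", "NP"]],
        "Det": [["the"]],
        "N":   [["dog"], ["cat"]],
        "V":   [["sees"]],
    }

    def earley_recognize(words, start="S"):
        # Each chart entry is an item (lhs, rhs, dot, origin).
        chart = [set() for _ in range(len(words) + 1)]
        chart[0].update((start, tuple(rhs), 0, 0) for rhs in GRAMMAR[start])
        for i in range(len(words) + 1):
            changed = True
            while changed:                                          # closure at position i
                changed = False
                for lhs, rhs, dot, origin in list(chart[i]):
                    if dot < len(rhs) and rhs[dot] in GRAMMAR:      # predictor
                        for prod in GRAMMAR[rhs[dot]]:
                            item = (rhs[dot], tuple(prod), 0, i)
                            if item not in chart[i]:
                                chart[i].add(item); changed = True
                    elif dot == len(rhs):                           # completer
                        for l2, r2, d2, o2 in list(chart[origin]):
                            if d2 < len(r2) and r2[d2] == lhs:
                                item = (l2, r2, d2 + 1, o2)
                                if item not in chart[i]:
                                    chart[i].add(item); changed = True
            if i < len(words):                                      # scanner
                for lhs, rhs, dot, origin in chart[i]:
                    if dot < len(rhs) and rhs[dot] == words[i]:
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
        return any(lhs == start and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in chart[len(words)])

    print(earley_recognize(["the", "dog", "sees", "the", "cat"]))   # True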

However, truly efficient (linear) algorithms can be obtained by sacrificing some generality. If we restrict ourselves to the subclass of context-free grammars called LR grammars, the very efficient LR parsing method becomes available. The LR parsing algorithm is entirely deterministic, its actions being guided by an LR parsing table built before the actual parsing takes place.
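The following sketch illustrates this table-driven control on the toy grammar S -> ( S ) | x. The ACTION/GOTO tables below were constructed by hand for this example only; they are not produced by the GULiveR table generator.

    # Sketch of a table-driven LR parser for the toy grammar S -> ( S ) | x.
    PRODS = {1: ("S", 3), 2: ("S", 1)}           # production number -> (lhs, rhs length)

    ACTION = {
        (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
        (1, "$"): ("accept", None),
        (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
        (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
        (4, ")"): ("shift", 5),
        (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
    }
    GOTO = {(0, "S"): 1, (2, "S"): 4}

    def lr_parse(tokens):
        stack, pos = [0], 0
        tokens = tokens + ["$"]                   # end-of-input marker
        while True:
            act = ACTION.get((stack[-1], tokens[pos]))
            if act is None:
                return False                      # no table entry: syntax error
            kind, arg = act
            if kind == "shift":
                stack.append(arg); pos += 1
            elif kind == "reduce":
                lhs, length = PRODS[arg]
                del stack[len(stack) - length:]   # pop the handle
                stack.append(GOTO[(stack[-1], lhs)])
            else:                                 # accept
                return True

    print(lr_parse(["(", "(", "x", ")", ")"]))    # True
    print(lr_parse(["(", "x"]))                   # False

Note that every step is decided by a single table lookup, which is where the deterministic, linear-time behaviour comes from.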

While this approach worked well for programming languages, it cannot be applied directly to natural languages, whose grammars are not LR. In this case, the solution is to generalize the LR algorithm to arbitrary context-free grammars by handling nondeterminism. However, a naive generalization would lead to exponential time and space behavior, caused by maintaining parallel stacks and parse trees.

Tomita's algorithm [10] solves these problems gracefully with two devices: a graph-structured stack and a packed shared parse forest. Its complexity is O(n^(p+1)), where p is the number of symbols in the right-hand side of the longest production in the grammar, but with some changes [11] its worst-case behaviour can be reduced to O(n^3). In practice, however, natural language grammars rarely approach this worst case: for a grammar "close" to LR, the algorithm runs with close to LR efficiency, and most natural language grammars are like that, having only a few conflicts in their parse table (like the one in Figure 1).
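To give a concrete picture of the first device, a graph-structured stack can be seen as a set of nodes, each holding a parser state and links to its possible predecessors, so that alternative stacks share their common suffix instead of being copied. The following minimal sketch is our own illustration, not GULiveR's internal data structure.

    # Minimal sketch of a graph-structured stack (GSS) node: when the parser
    # splits on a conflict, the alternative branches share their common suffix
    # through predecessor links instead of duplicating the whole stack.

    class GSSNode:
        def __init__(self, state, predecessors=()):
            self.state = state                      # LR state held at this node
            self.predecessors = list(predecessors)  # links toward the stack bottom

        def add_predecessor(self, node):
            if node not in self.predecessors:
                self.predecessors.append(node)

    # Two analyses that disagree only on their most recent action share every
    # earlier node instead of duplicating it:
    bottom = GSSNode(0)
    shared = GSSNode(3, [bottom])
    branch_a = GSSNode(5, [shared])   # e.g. the "shift" alternative of a conflict
    branch_b = GSSNode(7, [shared])   # e.g. the "reduce" alternative
    assert branch_a.predecessors[0] is branch_b.predecessors[0]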

