Many modern linguistic theories,
like "Lexical-Functional Grammar"
[1], "Functional
Unification Grammar" [2,3],
"Generalized Phrase Structure
Grammar" [4],
"Unification Categorial Grammar"
[5] and "Head-Driven
Phrase Structure Grammar" [6] have
replaced the atomic categories of context-free grammars with structured
categories which contain the syntactic and semantic properties
of groups of words. These structured categories are indirectly
specified, as constraints that must be satisfied. The lexical
entries constrain the categories that may appear as the leaves of
the parse tree, and the grammar's productions constrain the structured
categories corresponding to a parent node and its immediate
descendants. As a consequence, a parse tree is well-formed if
all the constraints are simultaneously satisfied. Our environment
is based on PATR-II [7], one of the
simplest and most influential
unification-based formalisms.
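To make the notion of constraint satisfaction concrete, the sketch
below (our own toy representation, not PATR-II's actual notation)
encodes structured categories as nested Python dicts and implements
the core unification operation; reentrant (shared) substructures are
deliberately omitted. Two categories are compatible exactly when
unification succeeds.

    # Unification of two feature structures represented as nested dicts;
    # returns the merged structure, or None when the constraints clash.
    def unify(fs1, fs2):
        if not isinstance(fs1, dict) or not isinstance(fs2, dict):
            return fs1 if fs1 == fs2 else None     # atomic values must be equal
        result = dict(fs1)
        for feat, val in fs2.items():
            if feat in result:
                sub = unify(result[feat], val)     # recurse on shared features
                if sub is None:
                    return None                    # feature clash
                result[feat] = sub
            else:
                result[feat] = val                 # one-sided features survive
        return result

    # Agreement as a constraint: compatible structures merge,
    # while a number clash makes unification fail.
    print(unify({"agr": {"num": "sg", "per": "3"}},
                {"agr": {"num": "sg"}}))           # {'agr': {'num': 'sg', 'per': '3'}}
    print(unify({"agr": {"num": "sg"}},
                {"agr": {"num": "pl"}}))           # None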
Parsing algorithms were initially
written for programming languages [8],
for which both top-down and bottom-up approaches have been tried.
The top-down approach is driven by information from the grammar, in
the manner of goal-oriented reasoning, while the bottom-up approach
combines words from the input string into increasingly complex
grammatical categories, in the manner of data-driven reasoning.
Both methods have the advantage
of generality, being applicable to arbitrary context-free grammars,
and both suffer from inefficiency, their complexity being exponential.
This inefficiency comes from using only partial information (from the
grammar or from the input string, respectively), which leads to
a high degree of nondeterminism.
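As an illustration of the top-down strategy and its nondeterminism,
here is a minimal sketch of a backtracking recognizer over a toy
grammar and lexicon of our own: each nonterminal is expanded by trying
its productions in turn, and every failure forces the parser to back
up and try the next alternative, which is the source of the
exponential worst-case behavior.

    GRAMMAR = {
        "S":  [["NP", "VP"]],
        "NP": [["Det", "N"], ["Det", "N", "PP"]],
        "VP": [["V", "NP"], ["V", "NP", "PP"]],
        "PP": [["P", "NP"]],
    }
    LEXICON = {"the": "Det", "dog": "N", "man": "N",
               "telescope": "N", "saw": "V", "with": "P"}

    def parse(symbols, words):
        """True if the symbol sequence can derive exactly the word list."""
        if not symbols:
            return not words                       # success only on a full match
        head, rest = symbols[0], symbols[1:]
        if head in GRAMMAR:                        # nonterminal: try each production
            return any(parse(rhs + rest, words) for rhs in GRAMMAR[head])
        return (bool(words) and LEXICON.get(words[0]) == head
                and parse(rest, words[1:]))        # preterminal: match one word

    print(parse(["S"], "the dog saw the man with the telescope".split()))  # True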
The simultaneous use of both kinds
of information led to the development of combined chart-parsing
algorithms, like Earley [9] and
Left-corner [8]. These approaches combine generality with speed,
their time and space complexity being O(n³) and O(n²), respectively.
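A minimal sketch of an Earley recognizer (reusing the toy GRAMMAR and
LEXICON above, and recognizing rather than building a forest) shows
how the two information sources cooperate: predict consults the
grammar top-down, scan consults the input bottom-up, and complete
combines finished constituents, with the chart caching items so that
no analysis is repeated.

    def earley(words, grammar, lexicon, start="S"):
        n = len(words)
        chart = [set() for _ in range(n + 1)]      # items: (lhs, rhs, dot, origin)
        for rhs in grammar[start]:
            chart[0].add((start, tuple(rhs), 0, 0))            # seed with the goal
        for i in range(n + 1):
            agenda = list(chart[i])
            while agenda:
                lhs, rhs, dot, origin = agenda.pop()
                if dot < len(rhs) and rhs[dot] in grammar:     # predict (grammar)
                    for prod in grammar[rhs[dot]]:
                        item = (rhs[dot], tuple(prod), 0, i)
                        if item not in chart[i]:
                            chart[i].add(item); agenda.append(item)
                elif dot < len(rhs):                           # scan (input)
                    if i < n and lexicon.get(words[i]) == rhs[dot]:
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:                                          # complete (combine)
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item); agenda.append(item)
        return any(lhs == start and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in chart[n])

    print(earley("the dog saw the man with the telescope".split(),
                 GRAMMAR, LEXICON))                            # True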
However, truly efficient (linear)
algorithms can be written by sacrificing some generality. If we
restrict ourselves to a subclass of context-free grammars called
LR Grammars, the very efficient LR parsing method becomes available.
The LR parsing algorithm is entirely deterministic, its actions
being guided by an LR parsing table created before the actual parsing
takes place.
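The sketch below shows the deterministic LR driver for the toy grammar
S -> ( S ) | x, with an SLR table we built by hand (states 0-5, our
own example); because every (state, lookahead) pair selects at most
one action, the loop never backtracks and runs in time linear in the
input length.

    ACTION = {
        (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
        (1, "$"): ("accept",),
        (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
        (3, ")"): ("reduce", "S", 1), (3, "$"): ("reduce", "S", 1),  # S -> x
        (4, ")"): ("shift", 5),
        (5, ")"): ("reduce", "S", 3), (5, "$"): ("reduce", "S", 3),  # S -> ( S )
    }
    GOTO = {(0, "S"): 1, (2, "S"): 4}

    def lr_parse(tokens):
        stack, tokens = [0], list(tokens) + ["$"]  # stack of parser states
        while True:
            act = ACTION.get((stack[-1], tokens[0]))
            if act is None:
                return False                       # no action entry: syntax error
            if act[0] == "accept":
                return True
            if act[0] == "shift":
                stack.append(act[1]); tokens.pop(0)
            else:                                  # reduce: pop |rhs| states, goto
                _, lhs, size = act
                del stack[-size:]
                stack.append(GOTO[(stack[-1], lhs)])

    print(lr_parse("((x))"), lr_parse("(x"))       # True False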
While this approach worked well
for programming languages, it cannot be applied directly to natural
languages, whose grammars are not LR. In this case, the solution is to
generalize the LR algorithm to arbitrary context-free grammars by
handling nondeterminism. However, a naive generalization leads to
exponential time and space behavior, caused by the proliferation of
parallel stacks and parse trees.
Tomita's algorithm [10] solves these problems gracefully with two
devices: a graph-structured stack and a packed shared parse forest.
Its complexity is O(nᵖ⁺¹), where p is the number of symbols in the
right-hand side of the longest production in the grammar, but with
some changes [11] its worst-case behavior may be reduced to O(n³).
In practice, natural language grammars rarely approach this worst
case: for a grammar "close" to LR the algorithm works with close to
LR efficiency, and most natural language grammars are like that,
having few conflicts in their parsing table (like the one in
figure 1).
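The following sketch (a simplification of our own, not Tomita's full
algorithm) illustrates the graph-structured stack idea: when several
nondeterministic branches shift into the same parser state, their
stack tops are merged into a single node with multiple predecessor
edges, so the parallel stacks share structure instead of being copied
wholesale.

    class GSSNode:
        def __init__(self, state):
            self.state = state
            self.preds = set()                 # edges back to earlier stack tops

    def shift_all(tops, state):
        """Shift every current top into 'state', merging them into one node."""
        node = GSSNode(state)
        node.preds.update(tops)                # many predecessors, one new top
        return node

    # Two nondeterministic branches whose tops both shift into state 7:
    a, b = GSSNode(3), GSSNode(5)
    merged = shift_all({a, b}, 7)
    print(merged.state, sorted(p.state for p in merged.preds))   # 7 [3, 5]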