Mihai Ciocoiu, Štefan Bruda * GULiveR: Generalized Unification Based LR Parser for Natural Languages
Note that, in a natural language
parsing, the input data is a string of characters (i.e., phrase)
divided into discernible substrings (the words). The lexical look-up
associates with each word one or more lexical categories which
are structured categories. Therefore the parsing process will
work on two distinct data types: strings (of characters) and structured
categories.
3.1. PATR-II
In the PATR-II formalism structured
categories are represented as direct acyclic graphs (dags).
A structured category is defined here as a set of attribute-value
pairs, where the values may be either atomic (elementary) values
or structured categories. The representation of a structured category
as a dag is immediate: each node is a structured category and
the edges leaving that node are labeled by attribute names; each
edge labeled with an attribute name points to a node which corresponds
to the value of that attribute (which may be complex or atomic
as the value is structured or atomic). This structure may be a
graph because more attributes may share the same value.
Complex dags can be viewed as
partial functions from attribute names (or edge labels) to dag
values. The notation D(l) denotes therefore the value corresponding
to label l in the dag D. We can also refer to the domain of a
dag D (denoted by Dom(D)). A dag D for which Dom(D) = φ
is called an empty dag or a variable.
A path in a dag is a sequence of labels which can
be used to pick out a particular subdag of that dag. By extension,
the notation D(p) denotes the subdag picked out from D using path p.
Note that dags are partial functions
-- it is not necessary to specify values for all labels. Dags
may be augmented during the unification process (i.e., new values
are specified for some attributes).
PATR-II formalism is based on
two fundamental principles:
Moreover, a partial ordering relation
on the set of dags can be defined. This relation is called subsumption.
The set of dags is a lattice. A dag D subsumes a dag D' (notation:
D 98
D')
if and only if D(l)
D'(l) for all l in Dom(D) and D'(p) = D'(q) for all paths p and q such
that D(p) = D(q). Intuitively, D subsumes D' if D contains a subset
of the information contained in D' (i.e., D is more general than
D'). Thus variables subsume all other dags (because they contain
no information) and an atomic dag neither subsumes nor is subsumed
by any different atomic dag.
1 The application subtools which deal with structured categories
(unification, grammar compilation) are precisely delimited. Therefore any change
into these subtools will probably not affect other parts; moreover, the input
grammar is compiled into an internal format before use, so if the input form of
the grammar needs to be changed then only the grammar compiler must be rewritten.