Romanian Language Technology

Mihai Ciocoiu, Ştefan Bruda * GULiveR: Generalized Unification Based LR Parser for Natural Languages

3. Structured categories

In the following we will summarize the notions of structured category and unification-based grammar and their properties relevant to this paper. The present version of our parser works with grammars written using PATR-II formalism but, with minor changes, it can work using similar unification-based grammar formalisms such as definite-clause grammars [12], functional-unification grammars or lexical-functional grammars¹.

Note that, in a natural language parsing, the input data is a string of characters (i.e., phrase) divided into discernible substrings (the words). The lexical look-up associates with each word one or more lexical categories which are structured categories. Therefore the parsing process will work on two distinct data types: strings (of characters) and structured categories.

3.1. PATR-II

In the PATR-II formalism structured categories are represented as direct acyclic graphs (dags). A structured category is defined here as a set of attribute-value pairs, where the values may be either atomic (elementary) values or structured categories. The representation of a structured category as a dag is immediate: each node is a structured category and the edges leaving that node are labeled by attribute names; each edge labeled with an attribute name points to a node which corresponds to the value of that attribute (which may be complex or atomic as the value is structured or atomic). This structure may be a graph because more attributes may share the same value.

Complex dags can be viewed as partial functions from attribute names (or edge labels) to dag values. The notation D(l) denotes therefore the value corresponding to label l in the dag D. We can also refer to the domain of a dag D (denoted by Dom(D)). A dag D for which Dom(D) = φ is called an empty dag or a variable. A path in a dag is a sequence of labels which can be used to pick out a particular subdag of that dag. By extension, the notation D(p) denotes the subdag picked out from D using path p.

Note that dags are partial functions -- it is not necessary to specify values for all labels. Dags may be augmented during the unification process (i.e., new values are specified for some attributes).

PATR-II formalism is based on two fundamental principles:

The concatenation is the only operator which can be applied to combine two strings; conforming to this principle, the formalism is based on context-free grammars;
The unification is the only operator which can be applied to combine information stored in two structured categories (dags); this principle determines the declarativity of the formalism.

Moreover, a partial ordering relation on the set of dags can be defined. This relation is called subsumption. The set of dags is a lattice. A dag D subsumes a dag D' (notation: D D') if and only if D(l) D'(l) for all l in Dom(D) and D'(p) = D'(q) for all paths p and q such that D(p) = D(q). Intuitively, D subsumes D' if D contains a subset of the information contained in D' (i.e., D is more general than D'). Thus variables subsume all other dags (because they contain no information) and an atomic dag neither subsumes nor is subsumed by any different atomic dag.

¹ The application subtools which deal with structured categories (unification, grammar compilation) are precisely delimited. Therefore any change into these subtools will probably not affect other parts; moreover, the input grammar is compiled into an internal format before use, so if the input form of the grammar needs to be changed then only the grammar compiler must be rewritten.