Canonical LR parser


In computer science, a canonical LR parser or LR(1) parser is an LR(k) parser for k=1, i.e. with a single lookahead terminal. The special attribute of this parser is that any LR(k) grammar with k>1 can be transformed into an LR(1) grammar. However, back-substitutions are required to reduce k, and as the number of back-substitutions grows, the grammar can quickly become large, repetitive and hard to understand. LR(k) parsers can handle all deterministic context-free languages. In the past the LR(k) parser was avoided because of its huge memory requirements, in favor of less powerful alternatives such as the LALR and the LL(1) parser. More recently, however, a "minimal LR(1) parser", whose space requirements are close to those of LALR parsers, has been offered by several parser generators.
Like most parsers, the LR(1) parser is automatically generated by compiler compilers such as GNU Bison, MSTA, Menhir and HYACC.

History

In 1965 Donald Knuth invented the LR(k) parser, a type of shift-reduce parser, as a generalization of existing precedence parsers. This parser has the potential of recognizing all deterministic context-free languages and can produce both left and right derivations of statements encountered in the input file. Knuth proved that it reaches its maximum language recognition power for k=1 and provided a method for transforming LR(k) grammars with k > 1 into LR(1) grammars.
Canonical LR parsers have the practical disadvantage of enormous memory requirements for their internal parser-table representation. In 1969, Frank DeRemer suggested two simplified versions of the LR parser, called LALR and SLR. These parsers require much less memory than canonical LR parsers, but have slightly less language-recognition power. LALR parsers have been the most common implementations of the LR parser.
However, a new type of LR parser, which some people call a "minimal LR parser", was introduced in 1977 by David Pager, who showed that LR parsers can be created whose memory requirements rival those of LALR parsers. Some parser generators now offer minimal LR parsers, which solve not only the memory requirement problem but also the mysterious-conflict problem inherent in LALR parser generators. In addition, minimal LR parsers can use shift-reduce actions, which makes them faster than canonical LR parsers.

Overview

The LR parser is a deterministic automaton and as such its operation is based on static state transition tables. These codify the grammar of the language it recognizes and are typically called "parsing tables".
The parsing tables of the LR(1) parser are parameterized with a lookahead terminal. Simple parsing tables, like those used by the LR(0) parser, represent grammar rules of the form
    A1 → A B
which means that if we go from state A to state B then we will go to state A1. After parameterizing such a rule with a lookahead we have:
    A1 → A B, a
which means that the transition will now be performed only if the lookahead terminal is a. This allows for richer languages where a simple rule can have different meanings depending on the lookahead context. For example, in an LR(1) grammar, all of the following rules transition to a different state in spite of being based on the same state sequence.
    A1 → A B, a
    A2 → A B, b
    A3 → A B, c
    A4 → A B, d
The same would not be true if a lookahead terminal were not taken into account. Parsing errors can be identified without the parser having to read the whole input by declaring some rules as errors. For example,
    E1 → B C, d
can be declared an error, causing the parser to stop. This means that the lookahead information can also be used to catch errors, as in the following example:
    A1 → A B, a
    A1 → A B, b
    A1 → A B, c
    E1 → A B, d
In this case A, B will be reduced to A1 when the lookahead is a, b or c, and an error will be reported when the lookahead is d.
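The error-catching rules just described can be pictured as a lookup table keyed by a state sequence and a lookahead terminal. The following is only an illustrative sketch: the names RULES, ERROR and reduce_pair are hypothetical, and a real LR(1) table is organized by parser state rather than by pairs of states.

```python
# Hypothetical encoding of lookahead-parameterized rules.
# Key: (state sequence, lookahead terminal). Value: the resulting state,
# or the marker ERROR for a combination that is declared to be an error.
ERROR = "error"

RULES = {
    (("A", "B"), "a"): "A1",
    (("A", "B"), "b"): "A1",
    (("A", "B"), "c"): "A1",
    (("A", "B"), "d"): ERROR,
}


def reduce_pair(states, lookahead):
    """Return the state the sequence reduces to, or raise on an error rule."""
    target = RULES.get((tuple(states), lookahead))
    if target is None:
        raise KeyError(f"no rule for {states} with lookahead {lookahead!r}")
    if target == ERROR:
        raise SyntaxError(f"unexpected {lookahead!r} after {states}")
    return target


print(reduce_pair(["A", "B"], "a"))
```

Calling reduce_pair(["A", "B"], "d") raises SyntaxError immediately, which mirrors the point that an error can be reported without reading the rest of the input.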
The lookahead can also be helpful in deciding when to reduce a rule. The lookahead can help avoid reducing a specific rule if the lookahead is not valid, which would probably mean that the current state should be combined with the following state instead of the previous one. That means that in the following example
    State sequence: A B C
    Rules:
        A1 → A B
        A2 → B C
the state sequence can be reduced to
    A A2
instead of
    A1 C
if the lookahead after the parser went to state B wasn't acceptable, i.e. no transition rule existed. States can be produced directly from a terminal, as in
    X → y
which allows for state sequences to appear.
LR(1) parsers have the requirement that each rule should be expressed in a complete LR(1) manner, i.e. a sequence of two states with a specific lookahead. As a result, simple rules such as
    X → y
require a great many artificial rules that essentially enumerate the combinations of all the possible states and lookahead terminals that can follow. A similar problem appears when implementing non-lookahead rules such as
    A1 → A B
where all the possible lookaheads must be enumerated. That is the reason why LR(1) parsers cannot be practically implemented without significant memory optimizations.
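To see why enumerating lookaheads inflates the tables, a rule without lookahead can be expanded into one entry per terminal. This is a small sketch under assumptions: the helper expand_rule, the state names and the four-letter terminal alphabet are all hypothetical.

```python
def expand_rule(target, states, lookaheads):
    """Turn a rule without lookahead into one table entry per lookahead."""
    return {(tuple(states), la): target for la in sorted(lookaheads)}


TERMINALS = {"a", "b", "c", "d"}  # hypothetical terminal alphabet

table = expand_rule("A1", ["A", "B"], TERMINALS)
for key, value in table.items():
    print(key, "->", value)
```

One rule becomes as many entries as there are terminals, and the same happens for every rule, so the table size grows with the product of the two.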

Constructing LR(1) parsing tables

LR(1) parsing tables are constructed in the same way as LR(0) parsing tables, with the modification that each item contains a lookahead terminal. This means that, contrary to LR(0) parsers, a different action may be executed if the item to process is followed by a different terminal.

Parser items

Starting from the production rules of a language, the item sets for this language must first be determined. In plain words, an item set is the list of production rules of which the currently processed symbol might be a part. An item set has a one-to-one correspondence to a parser state, while the items within the set, together with the next symbol, are used to decide which state transitions and parser action are to be applied. Each item contains a marker that notes at which point the currently processed symbol appears in the rule the item represents. For LR(1) parsers, each item is specific to a lookahead terminal, so the lookahead terminal is also noted inside each item.
For example, assume a language consisting of the terminal symbols 'n', '+', '(', ')', the nonterminals 'E', 'T', the starting rule 'S' and the following production rules:
    S → E
    E → T
    E → ( E )
    T → n
    T → + T
    T → T + n
Item sets are generated analogously to the procedure for LR(0) parsers. Item set 0, which represents the initial state, is created from the starting rule:
    [S → • E, $]
The dot '•' denotes the marker of the current parsing position within this rule. The lookahead terminal expected when applying this rule is noted after the comma. The '$' sign denotes that end of input is expected, as is the case for the starting rule.
This is not the complete item set 0, though. Each item set must be 'closed', which means all production rules for each nonterminal following a '•' have to be recursively included into the item set until all of those nonterminals are dealt with. The resulting item set is called the closure of the item set we began with.
For LR(1), an item has to be included for each production rule and for each possible lookahead terminal following the rule. For more complex languages this usually results in very large item sets, which is the reason for the large memory requirements of LR(1) parsers.
In our example, the starting symbol requires the nonterminal 'E', which in turn requires 'T', so all production rules appear in item set 0. At first, we ignore the problem of finding the lookaheads and just look at the case of an LR(0) parser, whose items do not contain lookahead terminals. Item set 0 then looks like this:
    [S → • E]
    [E → • T]
    [E → • ( E )]
    [T → • n]
    [T → • + T]
    [T → • T + n]
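The closure step without lookaheads can be sketched in a few lines. This is a minimal illustration, assuming the example grammar is S → E, E → T | ( E ), T → n | + T | T + n; the names GRAMMAR, closure_lr0 and show are hypothetical helpers, not part of any parser generator.

```python
# Assumed example grammar: S → E, E → T | ( E ), T → n | + T | T + n.
GRAMMAR = [
    ("S", ("E",)),
    ("E", ("T",)),
    ("E", ("(", "E", ")")),
    ("T", ("n",)),
    ("T", ("+", "T")),
    ("T", ("T", "+", "n")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def closure_lr0(items):
    """Close a set of LR(0) items; an item is (lhs, rhs, dot_position)."""
    result = set(items)
    work = list(items)
    while work:
        _lhs, rhs, dot = work.pop()
        # Only a nonterminal directly after the dot pulls in new items.
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for new_lhs, new_rhs in GRAMMAR:
                item = (new_lhs, new_rhs, 0)
                if new_lhs == rhs[dot] and item not in result:
                    result.add(item)
                    work.append(item)
    return result


def show(item):
    """Format an item in the bracket notation used in the text."""
    lhs, rhs, dot = item
    symbols = list(rhs[:dot]) + ["•"] + list(rhs[dot:])
    return f"[{lhs} → {' '.join(symbols)}]"


item_set_0 = closure_lr0({("S", ("E",), 0)})
for item in sorted(item_set_0):
    print(show(item))
```

Starting from the single item for the starting rule, the worklist pulls in the rules for E and then for T, so every production of the assumed grammar ends up in the set.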

FIRST and FOLLOW sets

To determine lookahead terminals, so-called FIRST and FOLLOW sets are used.
FIRST(A) is the set of terminals which can appear as the first element of any chain of rules matching nonterminal A. FOLLOW(I) of an item I = [A → α • B β, x] is the set of terminals that can appear immediately after nonterminal B, where α, β are arbitrary symbol strings and x is an arbitrary lookahead terminal. FOLLOW(k, B) of an item set k and a nonterminal B is the union of the follow sets of all items in k where '•' is followed by B. The FIRST sets can be determined directly from the closures of all nonterminals in the language, while the FOLLOW sets are determined from the items using the FIRST sets.
In our example, as one can verify from the full list of item sets below, the first sets are:
    FIRST(S) = { n, '+', '(' }
    FIRST(E) = { n, '+', '(' }
    FIRST(T) = { n, '+' }
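A common way to obtain FIRST sets is a fixed-point iteration over the production rules. The sketch below assumes the example grammar is S → E, E → T | ( E ), T → n | + T | T + n and that the grammar has no empty productions, which keeps the computation short; first_sets is a hypothetical helper name.

```python
# Assumed example grammar: S → E, E → T | ( E ), T → n | + T | T + n.
GRAMMAR = [
    ("S", ("E",)),
    ("E", ("T",)),
    ("E", ("(", "E", ")")),
    ("T", ("n",)),
    ("T", ("+", "T")),
    ("T", ("T", "+", "n")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def first_sets(grammar):
    """Fixed-point computation of FIRST, valid only without empty rules."""
    first = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            head = rhs[0]
            # A terminal contributes itself; a nonterminal its FIRST set.
            new = first[head] if head in NONTERMINALS else {head}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


FIRST = first_sets(GRAMMAR)
for nt in ("S", "E", "T"):
    print(f"FIRST({nt}) =", sorted(FIRST[nt]))
```

The loop repeats until a full pass adds nothing new, so the order of the rules does not affect the result.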

Determining lookahead terminals

Within item set 0 the follow sets can be found to be:
From this the full item set 0 for an LR(1) parser can be created by making, for each item in the LR(0) item set, one copy for each terminal in the follow set of the LHS nonterminal. Each element of the follow set may be a valid lookahead terminal:
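As a rough illustration, the sketch below computes an LR(1) item set 0 with the textbook closure rule: when an item [A → α • B β, x] pulls in the rules for B, their lookaheads are the terminals that can start β, or x itself if β is empty. This is an assumption about the procedure, so the result need not match the follow sets listed in the text item for item. The grammar S → E, E → T | ( E ), T → n | + T | T + n, the hard-coded FIRST table and the name closure are likewise assumptions.

```python
# Assumed example grammar: S → E, E → T | ( E ), T → n | + T | T + n.
GRAMMAR = [
    ("S", ("E",)),
    ("E", ("T",)),
    ("E", ("(", "E", ")")),
    ("T", ("n",)),
    ("T", ("+", "T")),
    ("T", ("T", "+", "n")),
]
# FIRST sets of the assumed grammar (it has no empty productions).
FIRST = {"S": {"n", "+", "("}, "E": {"n", "+", "("}, "T": {"n", "+"}}


def closure(items):
    """Close a set of LR(1) items (lhs, rhs, dot_position, lookahead)."""
    result = set(items)
    work = list(items)
    while work:
        _lhs, rhs, dot, la = work.pop()
        if dot == len(rhs) or rhs[dot] not in FIRST:
            continue  # dot at the end, or in front of a terminal
        beta = rhs[dot + 1:]
        # Terminals that may follow the nonterminal after the dot.
        lookaheads = FIRST.get(beta[0], {beta[0]}) if beta else {la}
        for new_lhs, new_rhs in GRAMMAR:
            if new_lhs != rhs[dot]:
                continue
            for b in lookaheads:
                item = (new_lhs, new_rhs, 0, b)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


item_set_0 = closure({("S", ("E",), 0, "$")})
for lhs, rhs, dot, la in sorted(item_set_0):
    body = " ".join(rhs[:dot] + ("•",) + rhs[dot:])
    print(f"[{lhs} → {body}, {la}]")
```

Each rule can appear several times, once per lookahead, which is exactly the duplication that makes canonical LR(1) item sets large.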

Creating new item sets

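The rest of the item sets can be created by the following algorithm:
1. For each terminal and nonterminal symbol A appearing after a '•' in each already existing item set k, create a new item set m by adding to m all the rules of k where '•' is followed by A, but only if m will not be the same as an already existing item set after step 3.
2. Shift the '•' of each rule in the new item set one symbol to the right.
3. Create the closure of the new item set.
4. Repeat from step 1 for all newly created item sets, until no more new sets appear.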
In the example we get 5 more sets from item set 0: item set 1 for nonterminal E, item set 2 for nonterminal T, item set 3 for terminal n, item set 4 for terminal '+' and item set 5 for terminal '('.
Item set 2 :
Item set 3 :
Item set 4 :
Item set 5 :
From item sets 2, 4 and 5 several more item sets will be produced. The complete list is quite long and is therefore not given here.
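The construction of further item sets can be sketched as a loop that repeatedly moves the '•' over one symbol and closes the result. The code below is a minimal illustration, not a production generator. It assumes the example grammar S → E, E → T | ( E ), T → n | + T | T + n and the textbook closure rule for lookaheads; the names closure, goto and item_sets are hypothetical, and the symbol order in SYMBOLS is chosen only so that the numbering follows the one used in the text.

```python
# Assumed example grammar: S → E, E → T | ( E ), T → n | + T | T + n.
GRAMMAR = [
    ("S", ("E",)), ("E", ("T",)), ("E", ("(", "E", ")")),
    ("T", ("n",)), ("T", ("+", "T")), ("T", ("T", "+", "n")),
]
SYMBOLS = ["E", "T", "n", "+", "(", ")"]  # order fixes the state numbering
FIRST = {"S": {"n", "+", "("}, "E": {"n", "+", "("}, "T": {"n", "+"}}


def closure(items):
    """Close a set of LR(1) items (lhs, rhs, dot_position, lookahead)."""
    result = set(items)
    work = list(items)
    while work:
        _lhs, rhs, dot, la = work.pop()
        if dot == len(rhs) or rhs[dot] not in FIRST:
            continue
        beta = rhs[dot + 1:]
        lookaheads = FIRST.get(beta[0], {beta[0]}) if beta else {la}
        for new_lhs, new_rhs in GRAMMAR:
            if new_lhs != rhs[dot]:
                continue
            for b in lookaheads:
                item = (new_lhs, new_rhs, 0, b)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(l, r, d + 1, la) for l, r, d, la in items
             if d < len(r) and r[d] == symbol}
    return closure(moved)


def item_sets():
    """Return all item sets and the transitions between them."""
    states = [closure({("S", ("E",), 0, "$")})]
    transitions = {}
    for k, state in enumerate(states):  # the list grows while we iterate
        for symbol in SYMBOLS:
            target = goto(state, symbol)
            if not target:
                continue
            if target not in states:
                states.append(target)
            transitions[k, symbol] = states.index(target)
    return states, transitions


states, transitions = item_sets()
for k in range(1, 6):
    print(f"Item set {k}:")
    for lhs, rhs, dot, la in sorted(states[k]):
        body = " ".join(rhs[:dot] + ("•",) + rhs[dot:])
        print(f"    [{lhs} → {body}, {la}]")
```

Item sets are stored as frozensets so that a newly computed set can be compared with the existing ones; a set that already exists only contributes a transition.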

Goto

The lookahead of an LR(1) item is used directly only when considering reduce actions.
The core of an LR(1) item [S → a A • B e, c] is the LR(0) item S → a A • B e. Different LR(1) items may share the same core.
For example, in item set 2
    [E → T •, $]
    [T → T • + n, $]
    [T → T • + n, +]
the parser is required to perform the reduction [E → T] if the next symbol is '$', but to do a shift if the next symbol is '+'. Note that an LR(0) parser would not be able to make this decision, as it only considers the core of the items, and would thus report a shift/reduce conflict.
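The relationship between items, cores and the shift or reduce decision can be illustrated with a few helper functions. This is only a sketch: the concrete items of item set 2 are an assumption derived from the example grammar (E → T and T → T + n), and core, lookaheads_by_core and decide are hypothetical names.

```python
from collections import defaultdict

# Item set 2 as assumed here, written as (lhs, rhs, dot_position, lookahead).
ITEM_SET_2 = {
    ("E", ("T",), 1, "$"),
    ("T", ("T", "+", "n"), 1, "$"),
    ("T", ("T", "+", "n"), 1, "+"),
}


def core(item):
    """Drop the lookahead, leaving the underlying LR(0) item."""
    lhs, rhs, dot, _lookahead = item
    return (lhs, rhs, dot)


def lookaheads_by_core(items):
    """Group the lookaheads of all items that share a core."""
    groups = defaultdict(set)
    for item in items:
        groups[core(item)].add(item[3])
    return dict(groups)


def decide(items, next_symbol):
    """Return 'reduce', 'shift' or None for one item set and next symbol."""
    for _lhs, rhs, dot, la in items:
        if dot == len(rhs) and la == next_symbol:
            return "reduce"  # complete item whose lookahead matches
    for _lhs, rhs, dot, _la in items:
        if dot < len(rhs) and rhs[dot] == next_symbol:
            return "shift"   # the dot stands in front of the next symbol
    return None


print(lookaheads_by_core(ITEM_SET_2))
print(decide(ITEM_SET_2, "$"), decide(ITEM_SET_2, "+"))
```

With the lookaheads stripped, the item set contains one complete core and one core that can still shift, and nothing is left to tell the two cases apart, which is the shift/reduce conflict mentioned above.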
A state containing [A → α • X β, a] will move to a state containing [A → α X • β, a] with label X.
Every state has transitions according to Goto.

Shift actions

If [A → α • b β, a] is in state Ik and Ik moves to state Im with label b, then we add the action
    action[Ik, b] = "shift m"

Reduce actions

If [A → α •, a] is in state Ik, then we add the action
    action[Ik, a] = "reduce A → α"
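The shift and reduce rules can be combined into a small routine that fills the action entries of one state. The sketch below is an illustration under assumptions: actions_for_state is a hypothetical helper, the items of item set 2 are derived from the assumed example grammar (E → T, T → T + n), and the successor state number 6 is made up for the example. A reduction by the starting rule on '$' would normally be recorded as "accept"; that case is left out here.

```python
NONTERMINALS = {"S", "E", "T"}  # nonterminals of the assumed grammar


def actions_for_state(k, items, transitions):
    """Build the action entries of state k.

    items: LR(1) items (lhs, rhs, dot_position, lookahead) of state k.
    transitions: maps (state, symbol) to the successor state number.
    Returns (entries, conflicts), where entries maps a terminal to an action.
    """
    entries, conflicts = {}, []

    def add(symbol, entry):
        old = entries.setdefault(symbol, entry)
        if old != entry:
            conflicts.append((symbol, old, entry))

    for lhs, rhs, dot, la in sorted(items):
        if dot < len(rhs):
            b = rhs[dot]
            # Shift rule: the dot is in front of terminal b, and the state
            # has a transition labelled b.
            if b not in NONTERMINALS and (k, b) in transitions:
                add(b, ("shift", transitions[k, b]))
        else:
            # Reduce rule: complete item, applies only on its lookahead.
            add(la, ("reduce", lhs, rhs))
    return entries, conflicts


ITEM_SET_2 = {
    ("E", ("T",), 1, "$"),
    ("T", ("T", "+", "n"), 1, "$"),
    ("T", ("T", "+", "n"), 1, "+"),
}
TRANSITIONS = {(2, "+"): 6}  # the target number 6 is hypothetical

entries, conflicts = actions_for_state(2, ITEM_SET_2, TRANSITIONS)
for symbol, entry in sorted(entries.items()):
    print(symbol, entry)
print("conflicts:", conflicts)
```

Because the reduce entry is keyed by the item's lookahead and the shift entry by the terminal after the '•', the two land in different table cells for this state and the conflict list stays empty.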