Raincode PEG Grammar


Version 4.2.424.0

1. Introduction

1.1. Parsing

The PegGrammar parser generator is a tool to facilitate the process of developing a parser. Parsing in this context means implementing a mapping between a textual representation of programs in a chosen software language and its intended representation as a data structure in memory [1]. The textual incarnation of a program is occasionally referred to as Concrete. In contrast, the object-oriented representation incarnation is referred to as abstract, simply because one is the way concrete programs are written down, and the other one abstracts from many details that are no longer relevant when manipulating a program as a data structure.

Parser generator tool (peggenerator.exe) gets installed while installing the Raincode legacy compilers. Refer to the Raincode legacy compilers installation guide for more details.

PEG: Parsing Expression Grammars

PEG, short for Parsing Expression Grammars, is a relatively novel approach to develop top-down parsers by relying on recognition and prioritized choice [2]. Technically a PEG represents a straight forward recursive descent parser with backtracking [3], which implies easier maintenance due to its understandable semantics compared to over-engineered bottom-up parsers. On the other hand, PEGs are better than pure regular expressions for parsing due to extended expressiveness and reliable deterministic behaviour [4].

There are known ways to address problems typical for all top-down parsers, such as support for left recursion [5] , [6], and together with techniques like highly configurable memoization [7], they allow efficient implementations that work in linear time [8] and nearly constant space [9].

Raincode PegGrammar implements PEG extended with several additional features to be discussed later in the document.

PegGrammar workflow

The PegGrammar framework helps an engineer to develop a parser by modeling some easily automated yet error-prone parts. Still, it is not meant to be an out of the box fully automated solution. The user of PegGrammar will need to program the abstract syntax (the target representation of the parser) manually in C# and then write a grammar that serves as a specification of how to recognize correct concrete programs and how to map them into this abstract representation. The PegGrammar generator produces a C# program that implements this specification and build a complete solution together with manually written code fragments. The grammar may contain explicit references to some of those code fragments.

2. Tutorial

2.1. Motivation

The parsers generated by Raincode PEG Grammar are designed to parse the small languages that are used as input by mainframe tools. Therefore they behave somewhat differently from parsers generated by other tools designed essentially for modern programming languages. In particular, our grammars are entirely explicit; blanks (or other constructs) are never automatically ignored. They are also based on simple integer positions. They have no concept of line and column, avoiding the question of the definition of what a line is, which can be language-dependent in the presence of constructs such as continuation lines. By being implemented as recursive descent parser, they behave in a very similar way to a hand-written parser, even in corner cases, which makes it easier to parse languages defined before the generalized use of parsers based on formal context-free grammars.

2.2. Introductory example

Below is a small sample input for a fictional tool, FICTOOL, for which we will write a parser.

INPUT  WHEN(17)
OUTPUT WHEN(45,AND,33),BUILD(17,28)

We will start with the pure grammatical aspects and only later introduce the building of a parse tree.

2.2.1. Parsing

Below is a simple parser for this source.

[version 1]
[namespace FicTool.Parsing]
[class Parser]
void Commands ::= "INPUT  WHEN(17)\nOUTPUT WHEN(45,AND,33),BUILD(17,28)\n";

Supposing that this grammar was saved as FicTool.peg, we can generate the C# file FicToolParser_gen.cs by the following command: PegGenerator FicTool.peg FicToolParser_gen.cs

The generated file has the following content.

//Generated file, do not edit.
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text.RegularExpressions;
namespace FicTool.Parsing
{
    internal partial class Parser
    {
        public Parser(string s) { /*...*/ }
        /*...*/
        public void ParseCommands()
        {
            /*...*/
        }
        /*...*/
    }
}

This parser can be used in your program. Of course, this parser only recognizes the single example and not any other potential input. We can already see that literal strings are written with the C# syntax and parse a constant text.

Here is a slightly more reasonable grammar.

[version 1]
[namespace FicTool.Parsing]
[class Parser]
void Commands ::= (Command "\n")+;
void Command ::= "INPUT" " "+ When / "OUTPUT" " "+ When "," Build;
void When ::= "WHEN(" Condition ")";
void Condition ::= Index ",AND," Condition / Index;
void Index ::= rei:"[0-9]+";
void Build ::= "BUILD(" Index+,"," ")";

Since this grammar defines six non-terminals, the generated parser will have six corresponding methods to parse them: ParseCommands, ParseCommand, ParseWhen, ParseCondition, ParseIndex, and ParseBuild. There is no specific root non-terminal, so a single grammar can be used to parse different (but related) kinds of input.

The Commands non-terminal is now defined as a repetition (with a minimum of one) of sequences of a Command and a newline. The repetition is indicated by the +, and the sequence is implicit when parsing expressions are juxtaposed. The parentheses are used for precedence. Without them, we would only parse a single Command followed by any number (greater than one) of newlines.

A Command is in turn defined as either a sequence of INPUT, blanks, and a When clause or OUTPUT, blanks, a When clause, a comma, and a Build clause.

Similarly, the When and Condition non-terminals are defined with known concepts. Note that definitions can be recursive (as is the case for Condition) but be careful that, as with a typical hand-written parser, immediate recursion (without progressing in the input) will result in an infinite loop. Ordering of the cases is meaningful as they are tried in order, and the first matching branch is taken, even if it makes the parsing fail later on and that taking a different branch would have allowed it to succeed.

The Index non-terminal introduces a new concept: regular expressions. It uses the .NET regular expressions defined in System.Text.RegularExpressions and is very useful to parse non-constant terminals.

Finally, Build is defined as a sequence, one element of which is a repetition of one or more indexes separated by commas. Alternatively, we could have defined Build as Build ::= "BUILD(" Index ("," Index)* ")" with a repetition of zero or more sequences of comma and Index, but the option we choose is more specific and will give a better parse tree later on.

2.2.2. Building a parse-tree

The parser generator only generates a parser. It does not create classes for an abstract syntax tree (AST). This gives you a lot of controls on how the classes are implemented and their inheritance hierarchy. The only things the parser needs is that instances of the classes can be constructed without giving arguments to the constructor (see below for more details) and that fields (or properties) are accessible, defined with compatible types, and (in the case of lists) initialized.

For the sake of this example, let us suppose that the following file is part of the project.

namespace FicTool.Ast
{
    internal partial class Commands: List<Command> { }
    internal partial class Command
    {
        public ICondition WhenClause;
    }
    internal partial class Input: Command { }
    internal partial class Output: Command
    {
        public Build BuildClause;
    }
    internal partial interface ICondition { }
    internal partial class SimpleCondition: ICondition
    {
        public int Index;
    }
    internal partial class Conjunction: ICondition
    {
        public ICondition Left, Right;
    }
    internal partial class Build
    {
        public List<int> Indexes = new List<int>();
    }
}

We can now write the following grammar that will generate a more useful parser.

[version 1]
[namespace FicTool.Parsing]
[class Parser]
Commands ::= (Command "\n")+;
Command ::= Input := "INPUT" " "+ WhenClause = When / Output := "OUTPUT" " "+ WhenClause = When "," BuildClause = Build;
[default ICondition $C# {null}]
ICondition When ::= "WHEN(" Condition ")";
ICondition Condition ::= Conjunction := Left = Index ",AND," Right = Condition / SimpleCondition := Index = Index;
int Index ::= re:"[0-9]+" => int $C# {int.Parse};
Build ::= "BUILD(" Indexes = Index+,"," ")";

The only change to the definition of Commands is that we removed void from the front as it now as a semantic type (also known as return type) which, if unspecified, defaults to the same as the non-terminal name. Since it has to receive potentially several Commands, it has to inherit from List<Command>. (More specifically, it has to have an Add method that can take an argument with the semantic type of the Command non-terminal.)

The non-terminal Command had more changes. We now see what class to instantiate (Input := and Output :=) and what field or property to set (WhenClause = and BuildClause =).

The When non-terminal had just a change of semantic type from void to ICondition.

The Condition non-terminal is changed in a way very similar to the changes made to Command.

For Index, the regular expression was changed from a rei: prefix to a re: prefix. This has a semantic type of string and gives access to the matched text. We then transform that string into an int using the .NET static method int.Parse.

Finally, for Build, we added an assignment to the Indexes field. Because of the repetition, the parser will infer that Indexes is a list in which elements have to be added and not just a field or property to be assigned. We didn’t have to explicitly write anything to instantiate the class Build. We could have written more explicitly Build ::= Build := "BUILD(" Indexes = Index,"," ")";+ but the creation of an instance of the semantic type is implicit with the definition of a non-terminal. Since ICondition is an interface, it cannot be instantiated. Therefore we need to specify a way to create a default ICondition. This is done by the [default ICondition $C# {null}] directive. It is reasonable as in all non-terminals with a semantic type of ICondition, there is no direct affectation of property (like we do in the definition of Build).

3. Reference

3.1. Syntax

This section aims to give a comprehensive tour of the features of our parsing expression grammars but not to cover every fine point of their semantic, in particular in combination with each other.

The approach will be bottom-up, starting with the simplest parsing expressions and continuing with the operators by decreasing precedence (one sub-section by precedence level) and finally, the rules and directives.

3.1.1. Simple expressions

Simple expressions are simple in that they are not made by the application of an operator to one or more sub-expressions. Some of the most advanced parsing expressions are in this category and may be skipped on a first reading.

Terminals are formed by a string between double-quotes and potentially a prefix (lit:, re:, or rei:). With no prefix or a lit: prefix, the terminal recognizes the exact string (up to C# escapes). The difference is that without prefix, the semantic type is void, and with the lit: prefix, it is string with a semantic value equal to the parsed string (or the literal itself as they are the same). With a re: or rei: prefix, the terminal is interpreted as a .NET regular expression encoded in a C# string. Be careful about the needed escaping for special characters, in particular backslashes. The difference between the two prefixes is again that rei: has a void semantic and re: a string semantic where the semantic value is the text matched by the regular expression.

Non-terminals are formed of an identifier and potentially a prefix (i: or extent:). With no prefix, the semantic type is the one declared when defining the non-terminal. With an i: prefix ,it is void and with an extent: prefix, it is string with a semantic value equal to the part of the input parsed by the non-terminal.

Values can be used as parsing expressions that match the empty string. They are used only for their semantic value. The possible values are $true and $false of bool type, $left, which has the type of the non-terminal being defined (See later left-recursive rules), integers in decimal notation (with an optional negative sign) of type int and arbitrary C# values written as @= T $C# { code } where the C# code must have balanced braces ({ and }) and must be an expression compatible with type T.

A permutation scope is enclosed in square brackets ([ and ]). Permutations will be described with the permutation operator ^.

Any simple or complex parsing expression can be enclosed in parentheses (( and )) to force the precedence of operators.

The remaining cases are considerably more advanced concepts.

External parsers are written @ T $C# { code } where the C# code must have balanced braces and must be a method or delegate that returns a T and takes two parameters of type ref int, and out bool.

This construction allows manually writing parts of the parser. The boolean output argument indicates if the parsing was successful or not. The integer argument is the current parsing position (which must be updated if the parsing is successful), and the return is the semantic value.

A function call is an identifier followed by a colon and, between parentheses, a list of arguments. Functions will be described in more details later.

An expression variable reference is written $$:xxx where xxx is any identifier. It can only be used in the body of a function and will be described with them.

3.1.2. Postfix operators

As postfixes, these complex expressions are made by writing an operator after a simpler expression.

Repetitions can be done with + or * for, respectively, at least one or any repetitions of the element.

These are actually shortcuts for {1} and {0} respectively. In general, {n} can be used for at least n repetitions, and {n, m} can be used for between n and m repetitions (bounds included).

An expression can be made optional with ?.

A semantic transformation can be applied to an expression by => T $C# { code } where the C# code must have balanced braces and must be a method or delegate that takes the semantic value of the sub-expression and return a value of type T.

A separation can be applied to either a repetition or a permutation scope by following it with a comma and a simple expression. This is, therefore, not strictly a postfix, but it has the same priority as the other postfixes. For example, "a",":" will parse a:a:a:aa:a:aa but not a::a whereas "a",(":") will match the second example but not the first.

The last case is considerably more advanced.

An external transformation is applied to an expression by @=> TR (T1 P1, …​, Tn Pn) $C# { code } where the C# code must have balanced braces and must be a method or delegate that returns a TR and takes n + 2 parameters of types T1 up to Tn, ref int, and out bool. This construction allows to write manually parts of the parser dependent on previously parsed values and/or to transform multiple values. The boolean output argument indicates if the parsing was successful or not. The integer argument is the current parsing position (which must be updated if the parsing is successful), and the return is the semantic value. The input arguments are the semantic values assigned to properties P1 to Pn in the expression. These properties need not be defined in any AST class.

3.1.3. Prefix operators

The prefix operators are written in front of the simpler expression they modify.

An assignment is written as the name of a property followed by an equal sign.

A lookahead is written & (positive) or ! (negative). It saves the current parsing position, parses its sub-expression, and resets the parsing position. If the sub-expression parsed successfully, a positive lookahead succeeds, and a negative lookahead fails, and the opposite if the sub-expression fails to parse.

3.1.4. Sequences

Sequences are written by simple concatenation of simpler expressions without an explicit operator.

3.1.5. Construction

The construction operation is written by prefixing a simpler expression with a typename and a := operator. While this makes it technically a prefix operator, it doesn’t share the precedence of the other prefix operators described above.

3.1.6. Alternatives

Alternatives are written by separating simpler expressions with the / operator.

3.1.7. Permutations

Permutations are written by separating simpler expressions with the ^ operator. Permutations are only meaningful in a permutation scope delimited by [ and ]. Repetitions and optional expressions have a different interpretation when they are (directly) sub-expressions of a permutation. Let us consider the permutation scope [ A ^ B? ^ C{7,42} ^ D+ ]. It is parsed very similarly to ( A / B / C / D )* but with the added constraints that branches are not entered if they have already had their maximum number of occurrences (here 1, 1, 42, and infinite) and, at the end, there is a check that all branches have had at least their minimum number of occurrences (here 1, 0, 7, and 1). If the final check fails, the parsing of the whole permutation fails. Another slight difference is that from a semantic point of view, non-repeated sub-expressions of a permutation don’t have a list semantic.

This concludes the parsing expressions themselves.

3.1.8. Rules

A rule defines a non-terminal. The syntax for a rule is T NT ::= E ; where NT is the non-terminal to define, T its semantic type, and E is a parsing expression. The semantic type can be omitted if it has the same name as the non-terminal itself (which is a common case).

Because the generated parser is close to a hand-written recursive descent parser, left-recursion is problematic. However, left-recursion is the natural way to express left-associative operations, which are fairly common even in mainframe tool input languages. Here is an example of a left-recursive rule.

Expr ::= Plus := Left = Expr "+" Right = Term
       / Minus := Left = Expr "-" Right = Term
       / Term ;

The generic way to remove the left-recursion is to transform it as in the following rule. The idea is to define a base case and a repetition of addends to it.

Expr ::= Term ("+" Term / "-" Term)*;

This solves the parsing problem, but we cannot easily get the same semantic value as before. We can do it with the following definitions.

Expr <<+ Plus := Left = $left "+" Right = Term;
Expr <<+ Minus := Left = $left "-" Right = Term;
Expr ::= Term;

The standard rule for Expr only gives the base, non-recursive, case. The other cases are defined with special left-recursion rules. They define the addends that would have been used in the previous block, but they have access to $left, which is the semantic value of the previously matched base case and addends. Notice how close the structure of the final grammar is to the initial one. Because the standard rule gives the semantic type, left-recursion rules are always of the form NT <<+ E; where NT is a non-terminal and E is a parsing expression.

3.1.9. Directives

The [version 1] directive defines how the grammar will be produced. It can only be the first element of the file (except for comments). If not present, the grammar is produced as with preceding versions (version 0).

The [using NS] directive where NS is a .NET namespace can be used to add a corresponding using directive to the generated parser. This is useful to use members of that namespace in inlined C# code blocks. Note also that as the generated class is marked partial, it is easy to define, in the parser class itself, a small private method in another file to which one can then directly refer in the grammar.

The [namespace NS] directive defines the .NET namespace NS into which the generated parser class will be placed.

The [class Vis C exceptionBuilder = $C# { excCode } textGetter = $C# { tgCode } ] directive defines the name C of the generated parser class and, optionally, its visibility Vis (public or internal) and some extra parameters. If specified, excCode must have balanced braces and must be a method or delegate that returns a throwable type and takes four parameters of type string, bool, object, and int. This will be called (and the result will be thrown) when the text to parse doesn’t fully correspond to the invoked parsing method. The first two arguments are the name of the non-terminal, and whether the non-terminal was successfully parsed (but then not up to the end of the text to parse). The last two arguments only have a meaningful value if the second argument is true and are the semantic value of the non-terminal and the final position. If specified, tgCode must have balanced braces and must be an expression of type string that returns the text to parse. When specified, no constructors are generated for the parser.

The [compatibilityV0 static external] directive permits a smoother transition of old grammars to version1. The optional static parameter restores the generation of static parsing methods. The optional external parameter restores the old signature for external parsers (i.e. @ constructs).

The [default T $C# { code }] directive where the C# code must have balanced braces and must be an expression of type compatible with the type T can be used if T is used in the grammar but code should be used to construct instances of T rather than new T(). By far, the most common use case is to use null for interfaces and abstract classes.

The [root Vis ( NT1, …​, NTn)] directive specifies that the parsing method for the named non-terminals has to have the specified visibility (e.g. private or public override). If that directive is used at least once, only named non-terminals will have parsing methods generated.

The [export Vis ( NT1, …​, NTn)] directive specifies that a parsing method for the named non-terminals with the specified visibility (e.g. private or public override) must be generated. That parsing method has a signature compatible with external parsers.

3.1.10. Macros and functions

A macro looks a lot like a non-terminal. The differences at the definition are that it has no type and uses a === rather than ::= defining operator. It is used as a non-terminal but shouldn’t have any prefix. Before the code is generated, the references to the macro are replaced with its definition. This can be useful to avoid repetition in the grammar in cases that cannot be made into proper non-terminals, such as parts of a permutation scope containing ^ operators or assignment to fields (or properties) found in different types.

Functions are executed at the same time as macros. They differ from macros by taking arguments and having a syntax closer to the prefixes used with terminals and non-terminals. Here is an example of a function definition and usage.

optParent:(Arg) === "(" $$:Arg ")" / $$:Arg;
FooClause ::= "FOO=" Kind = optParent:(re:"A|B|C" => EFoo $C# {StringToEFoo});

This is expanded to the following rule.

FooClause ::= "FOO=" Kind = ("(" re:"A|B|C" => EFoo $C# {StringToEFoo} ")" / re:"A|B|C" => EFoo $C# {StringToEFoo}) ;

There can be multiple arguments separated by ;, both in the definition and usage.

The arguments can be:

Description Example

Variable reference

$$:Var

Function call

Func:()

Identifier (Non-terminal, type, function, or property)

Hello

Type

System.Drawing.Color

balanced-braces block

{MyFunction}

Parsing expression

(A/B)+,";"

Each can be used in the body where the corresponding element can be used.

complexFunction:(Expr; Code; Type; Prop; Func) ===
   $$:Func:($$:Prop = $$:Expr => $$:Type $C# $$:Code);

3.1.11. Comments

Comments can only appear between rules, directives, macro definitions, or function definitions. They start with # and extend to the end of the line.

3.2. Typing

Non-terminals produce a semantic value whose type can be any .NET type. However, for a generic parsing expression, we cannot always assign a simple .NET type. As an example, consider the expression NtX / A = NtA* B = (NtB / NtC) where NtX, NtA, NtB, and NtC are non-terminals with semantic values of types TypeX, TypeA, TypeB, and TypeC respectively and A and B are property names. The type of a parsing expression is therefore more complex. In general, we have a primary value that can be from a certain set of types and named values (to be placed in corresponding fields or properties) each of which can be from its own set of types. And all of these types can be either a simple .NET type or a repetition of it. We will therefore represent the type of an expression as a mapping from names (using the empty name for the primary value) to sets of types which can be either a .NET type or a repetition of it, indicated by prefixing it with a *. The example expression as therefore the following type : ''→ {TypeX}, 'A' → {*TypeA}, 'B' → {TypeB, TypeC}.

In addition, the type of an expression also tracks whether the expression guarantees the presence of a primary value that we will indicate with either true (value guaranteed) or false (no guarantee) at the front of the type.

The complete type of the example is thus: [false, ''→ {TypeX}, 'A' → {*TypeA}, 'B' → {TypeB, TypeC}] because in the second branch of the alternative, there is no primary value.

The type of an expression is defined recursively from the types of its parts using a few constructs.

3.2.1. Operations on PEG expression types

We define a function fromType that takes a .NET type and return a PEG expression type in the following way:

fromType(void) = [false] which indicates that there is no guaranteed primary value and that all names (including the empty name) map to the empty set of types.

fromType(T) = [true, ''→{T}] for any other .NET type T which indicates that the primary value can only be of type T and is guaranteed to be present.

We define a unary function list on PEG expression types in the following way:

list([g, a1 → { t11, t12, *t13, …​}, …​]) = [false, a1 → {*t11, *t12, *t13, …​}, …​] which forces all the types to be repetitions and does not guarantee the presence of a primary value.

We define a function def taking a non-empty name n and a PEG expression type and returning a PEG expression type in this way:

def(n, [g, '' → { t1, t2, *t3, …​}]) = [false, 'n'→ { t1, t2, *t3, …​}]

It is an error for def to be applied on a PEG expression type that maps non-empty names to anything else than the empty set.

We define a binary function disjunct on PEG expression types by:

disjunct([gL, a1s1L, a2s2L, …​], [gR, a1s1R, a2s2R, …​])=[gL and gR, a1s1L union s1R, a2s2L union s2R, …​]

and finally a binary function conjunct on PEG expression types by:

conjunct([gL, a1s1L, a2s2L, …​], [gR, a1s1R, a2s2R, …​])=[gL or gR, a1s1L union s1R, a2s2L union s2R, …​]

It is an error to apply conjunct on types where a name (possibly empty) is mapped by one argument to a set that contains a non-repeated type and to a non-empty set by the other argument.

These last two functions will also be used as n-ary functions as they can easily be shown to be associative.

3.2.2. Type of expressions

With the preceding definitions, we can now define the expression typing function exprType taking a PEG expression and returning its type. The definition is given by cases depending on the structure of the expression.

exprType("some string") = exprType(rei:"some regexp") = exprType(i:_SomeNonTerminal_) = exprType(!expr) = fromType(void)

exprType(lit:"some string") = exprType(re:"some regexp") = exprType(extent:_SomeNonTerminal_) = fromType(string)

exprType($true) = exprType($false) = fromType(bool)

exprType(n) = fromType(int) when n is a number.

exprType(T $C# { code }) = exprType(exprT $C# {code}) = exprType(@ T $C# {code}) = exprType(expr @=> T (…​) $C# {code}) = exprType(T := expr) = fromType(T)

exprType(SomeNonTerminal) = fromType(T) where T is the declared semantic type of non-terminal SomeNonTerminal.

exprType($left) = fromType(T) where T is the declared semantic type of the non-terminal in which the expression is used.

exprType( (expr) ) = exprType(&_expr_) = exprType(expr)

exprType( a = expr) = def(a, exprType(expr))

exprType(expr1 expr2 …​ exprN) = conjunct(exprType(expr1), exprType(expr2), …​, exprType(exprN))

exprType(expr1/expr2/…​/exprN) = disjunct(exprType(expr1), exprType(expr2), …​, exprType(exprN))

exprType([expr1^expr2^…​^exprN]) = conjunct(exprType(expr1), exprType(expr2), …​, exprType(exprN))

exprType(expr+) = exprType(expr*) = exprType(expr{n}) = exprType(expr{n,m}) = list(exprType(expr))

exprType(expr1,expr2) = conjunct(exprType(expr1), list(exprType(expr2))) as separators can normally occur several times.

4. Grammar format

4.1. Grammar rules


Expression SimpleExpression ::=         "(" i:s        Expression ")" i:s
                              / Perm := "[" i:s Term = Expression "]" i:s
                              / Empty := Value = Value
                              / External := "@" i:s Type = Type Code = CSharpValue
                              / NonTerminal := (Prototype = Prototype)? NtName = NonTerminal
                              / Terminal    := (Prototype = Prototype)? Value = Literal ;

The input PEG for the PegGrammar generator consists of grammar rules similar to the example provided above. Each rule defines a nonterminal or a fragment of a nonterminal. The generator will produce one parse function per each nonterminal. Each grammar rule has a left-hand side and a right-hand side, separated by ::=. The right-hand side describes how to process the textual input, and the left-hand side specifies the form in which the result is expected to be provided. If the left-hand side consists of one word, it is used both as a nonterminal (a structured entity, a grammar fragment) and as a return type in the generated code; otherwise, it should have a form Expression SimpleExpression, where the first word is a return type and the second one is a nonterminal. There are two built-in return types: string and int.

The right-hand side is an ordered collection of one or more alternatives separated by slashes (/). Order means that if the first one succeeds, the rest are not attempted; if the first one fails, the second one is tried starting from the position where the first one was attempted; and if the second one succeeds, no other alternatives are attempted and so forth. Each alternative can be marked with a name in the form of SubType := …​, which would mean that its parse result will be an instance of a subtype of the type of the grammar rule. For example, all following one line definitions are equivalent:

Expr Expr ::=         Number "+" Number ;
Expr Expr ::= Expr := Number "+" Number ;
     Expr ::=         Number "+" Number ;
     Expr ::= Expr := Number "+" Number ;

Besides that, each alternative consists of a sequence of recognition symbols, and there are many kinds of them.

4.2. Grammar symbols

A terminal is the simplest form of grammar symbols: it consumes whatever is written between the double quotes. There are four variations of terminal symbols:

  • "x" a simple terminal. It will recognise itself and discard the result.

  • lit: "x" a literal terminal. It will recognise itself and return the result as a string.

  • re:"x" a regular expression terminal. It will match the input according to its value, and return the result as a string.

We use the same style of regular expressions that is used in C#. Some escaping rules might not be as trivial as they could be, so we include them in Terminal escaping rules.

  • rei:"x" an ignored regular expression terminal. It will match the input according to its value and discard the result.

A nonterminal is a reference to another nonterminal defined elsewhere within the same grammar. A reference to a nonterminal can be named, in which case the field containing its parsed value will have a different name.

  • i:X can be used to use the nonterminal X to parse then input, and then disregard its value.

Any symbol can be repeated, yielding a value of List <T> type, where T is the type of the base symbol. There are five kinds of repetitions:

  • X? symbol X is not present or present once

  • X* symbol X is not present or repeated any number of times

  • X+ symbol X is present at least once and can be repeated any number of times

  • X{N} symbol X is repeated at least N times but can be repeated more

  • X{N,M} symbol X is repeated at least N and at most M times.

Any repetition can be enhanced with a separator: a special symbol that occurs between adjacent elements in a sequence but not outside. A separator should yield no value, so it is usually a terminal or an ignored nonterminal. A separated sequence is denoted with a comma:

  • X*,"," a comma-separated, possibly empty list of Xs

  • Y+,";;" a semicolon-separated non-empty list of Ys

  • Z{5},i:s a space-separated list of at least five Zs

  • Z{1,4},"." a dot-separated list of at least one and at most four Zs

A permutation is a set of symbols that can be parsed in any order; symbols marked as optional are not necessary for a successful parse; the rest are:

  • [A ^ B ^ C?] a shorthand way to write AB / BA / ABC / ACB / CAB / BAC / BCA / CBA.

In any context, a group of symbols can be parenthesized to be treated as one.

  • (A / B) (C / D) is the same as AC / AD / BC / BD.

4.3. Macros

Any fragment of grammar can be defined as a named macro and used in several places without duplication. Usually, this is done with sequences, choices, or permutations.

Example 1

ABC === A B? C ;
X ::= ABC "." ;
Y ::= "!" ABC ;

Example 2

ABC === A | B | C ;
X ::= ABC | D ;
Y ::= E | ABC ;

Example 3

ABC === A ^ B ^ C? ;
X ::= [ ABC ^ D ] ;
Y ::= [ ABC ^ E ^ F? ];

The macros are exactly that macros that are expanded before the grammar is treated as a specification to generate a parser, so it is the responsibility of the grammar engineer to use them properly.

Any attempt to use permutation macros without square brackets will fail.

4.4. Left recursion

Example 4

Expression Postfixed ::= SimpleExpression ;
           Postfixed <<+ Repeat := Term = $left "*" ;
           Postfixed <<+ Repeat := Term = $left "+" ;
           Postfixed <<+ Optional := Term = $left "?" i:s ;

Left associative expressions such as one displayed in the example above, can be specified by using a reserved keyword $left inside a <<+ construct.

4.5. Annotations

Annotations provide information that PegGrammar cannot infer by itself:

Target namespace: [namespace MyProject]

The parser class: [class MyParser]

Default value of a type: [default Rule $C# {null}]

5. Best practices

This section is not required to be read before PegGrammar can be used. However, it presents typical solutions to commonly occurring problems, which can be reused and adapted to fit the problem at hand.

5.1. Terminal escaping rules

Some of escaping rules of C# are tricky, but the reference below is reliable:

To get a backslash (\)

type "\\" or re:"\\\\" or re:"[\\\\]".

To get a caret (^)

type "^" or re:"\\^" or re:"[\\^]".

To get a double quote (")

type "\"" or re:"\"" or re:"[\\"]".

To get a terminal with both square brackets ([])

type "[]" or re:"\\[]" or re:"[\\[\\]]".

5.2. External parsers

Sometimes it gets easier to parse a particularly tricky fragment of the input by manually programming its recognition instead of encoding it in the grammar. This is possible with a @ symbol that calls a function in a base language with the following signature:


_SemanticType_ __MyFunction__(ref int pos, out bool valid);

It is useful to define a delegate type as follows to define functions directly in the grammar as lambdas.

public delegate T ExtParser<T>(ref int pos, out bool valid);

In many cases, a simpler symbol would suffice; it is used after any other symbol and can be used to reparse its result. These symbols have many uses, some of them will be shown in the next sections.

5.3. Integers

PegGrammar is aware of the int type but does not provide a default parser for it because integers come in many forms and shapes. A standard minimal implementation would look like this:


int Integer ::= re:"[0-9]+" => int $C# {int.Parse};

5.4. Enumerations

Similarly, the following function can be used to turn a limited set of input strings into an enum value (the fact that the input set is limited is usually ensured with a parser, this is the reason to use here in conjunction with other features).


public static T ParseEnum<T>(string s)
{
   return (T)Enum.Parse(typeof(T), s);
}

It can then be used in different ways:


EMode  ::= re:"SHR|NEW|OLD" => EMode $C# {ParseEnum<EMode>};
Status ::= lit:"OK" | lit:"FAIL" => Status $C# {ParseEnum<Status>};

5.5. Position

External parsers contained the definition of ExtParser delegate. While the @ operator is very versatile, a common pattern is to use it not for recognition, but to get the current parsing position.

It can be done like this:


int Position ::= @ int $C#
	{new ExtParser<int>((ref int pos, out bool valid)
	=> {valid = true; return pos;})} ;

5.6. Layout

Some other parser generation frameworks define their layout or allow grammar engineers to specify them and then make sure it can occur at any place without consequence for the constructed abstract trees. PegGrammar is meant to be used for smaller languages with fine grained control over characters, including spaces. However, it is easy to implement:


ListOfThings ::= Things = Thing+,s ;
void s ::= rei:"[ \\t\\r\\n]*" ;

5.7. Lookahead

A lookahead symbol is an advanced feature that lets the parser to peek ahead at a few symbols and decide whether the current alternative should be pursued further or terminated. In PegGrammar, there are two kinds of lookahead: & for the positive one and ! for the negative one. Due to the algorithmic nature of the PEG technology, this can not only be used for parse views [10] but also to construct one abstract syntactic structure incorporating required details from all the views, including the negative ones, which is impossible in other currently existing implementations of Boolean and conjunctive grammars.

Glossary

PegGrammar solution contents

Project

Description

Grammar

Contains the actual code generator that traverses the input PEG and produces a parser as a textual C# file. Also contains the PEG grammar of PEG.

PEG

Implements Parsing Expression Grammars as a C# library to be used directly without the separate generation phase. Feature-wise approximately equivalent to the Grammar project.

BootstrapGenerator

A part of the build process, that takes the PEG grammar from the Grammar project and uses it to generate a parser for the ParserGenerator project. This project contains a parser for PEGs manually written by using the PEG library; this parser is technically behind the main library, and is only required to be able to process the PEG of PEG.

ParserGenerator

A wrapper that uses the Grammar project to generate a parser. It relies on a C# parser to parse input grammars.

StandaloneGenerator

A wrapper that uses the Grammar project to generate a parser. It relies on a C# parser to parse input grammars this parser is identical to that of ParserGenerator. The wrapper is designed to be run as a separate tool (an .exe file).

TestPeg

Contains 256 test cases that are used to ensure the correct behaviour of the generated parsers. They can also be used as examples of fully functional PEGs.

Convergence

Contains a work in progress on a grammar convergence [11] infrastructure that could synchronise several implementations of the same grammar within the system.


1. V. Zaytsev and A. H. Bagge, "Parsing in a Broad Sense," in MoDELS, 2014.
2. B. Ford, "Parsing Expression Grammars: A Recognition-based Syntactic Foundation," in POPL, 2004.
3. R. R. Redziejowski, "Parsing Expression Grammar as a Primitive Recursive-Descent Parser with Backtracking," Fundamenta Informaticae, vol. 79, no. 3-4, pp. 513-524, 2007.
4. R. Ierusalimschy, "A Text Pattern-Matching Tool based on Parsing Expression Grammars," SPE, vol. 39, no. 3, pp. 221-258, 2009.
5. A. Warth, J. R. Douglass and T. Millstein, "Packrat Parsers Can Support Left Recursion," in PEPM, 2008.
6. S. Medeiros, F. Mascarenhas and R. Ierusalimschy, "Left Recursion in Parsing Expression Grammars," in SBLP, 2012
7. N. Laurent and K. Mens, "Parsing Expression Grammars Made Practical," in SLE, 2015.
8. B. Ford, "Packrat Parsing: Simple, Powerful, Lazy, Linear Time," in ICFP, 2002.
9. K. Mizushima, A. Maeda and Y. Yamaguchi, "Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space," in PASTE, 2010
10. A. Stevenson and J. R. Cordy, "Parse Views with Boolean Grammars," Science of Computer Programming, vol. 97, pp. 59-63, 2015.
11. R. Lämmel and V. Zaytsev, "An Introduction to Grammar Convergence," in iFM, 2009.