How to build an expressions analyzer, the grammar

When we are working with mathematical functions, it is very convenient to have a parser that allows us to write and perform different versions of the equations that must be processed by the program, rather than having to modify and recompile the code to adapt it to these changes each time. In this series of articles I will show how to build a simple but powerful expressions analyzer, which can be extended easily to deal with more sophisticated expressions, starting by explaining in this article the basic theory necessary to successfully face this kind of development.

First, we must make a distinction between compilers and interpreters. An interpreter processes the expression and executes it at the end, returning the result, while a compiler processes and transforms it into something that we can run all the times we want, with the possibility to modify each time the values of the parameters that could have the expression.

Normally a compiler produces executable code, but this is not the only possibility to transform the compiled expression or expressions. The program that I will offer in this series transforms the expression into a series of objects that can then be used to work with the formula in a configurable and executable format.

In any case, the first step needed to build a parser is to define an appropriate grammar for the language that we will use. This grammar must be designed so as to have some basic properties that allows us write the program in the simplest way possible.

Let's start by defining a notation for the grammar rules. We will use the well-known BNF (Backus-Naur Form) notation, which allows us to design a set of rules to move from a generic object (for example, an expression) to a particular well-constructed object (for example, the expression 10 * x + y).

Suppose that all our expressions will always be of the form n * v1 + v2, where n is any number between 0 and 9, and v1 and v2 are two possible different variables, which may or may not appear in the expression. Thus, possible expressions are n * v2 + v1, n * v1, v1 and v2.

The grammar components that must be defined breaking them down into simpler ones are called non-terminals, and are represented writing them between the characters < and >, for example:

<expression>

To form a rule of decomposition for a non-terminal symbol, we use the symbol ::=, and, on the right part, we indicate the elements in which the non-terminal on the left is composed. If there are several possibilities, all are written, separating them with the character |. In our example:

<expression>::=<number>*<variable><restofexpression>|<variable>

The symbol * is called terminal, it is directly defined by itself and is part of the actual expression, so it is not necessary to create a derivation rule to decompose it. A terminal symbol is also called token, and may be formed of more than one character.

For non-terminal symbols on the left side, we must create new rules until all of them end in terminal symbols, for example, for <number>:

<number>::={1,2,3,4,5,6,7,8,9}

Here I have introduced a new element, which is the set of symbols. Instead of separating the numbers with the symbol |, I've written them between the { and } characters, separated by commas. The two notations are equivalent.

With this we have the symbol <number> fully defined, we can now define <variable> as follows:

<variable>::=x|y

We need to define <restofexpression>. To have expressions of the form n * x + y and n * x, we can do as follows:

<restofexpression>::=+<variable>|<void>

The symbol <void> can be considered a special case of terminal symbol, it means that there is the alternative that after n * x is nothing more.

In this way, we have a grammar that allows us to decide whether an expression of our language is properly written, with no more than start reading from the left and see if, as we move forward, we can find the grammar rules that define it as an expression belonging to the language.

One of the main objectives of these grammars is to fully facilitate their implementation in a program avoiding unnecessary complications. For this there are some properties that we must ensure are met.

The first decision to make is whether the interpretation of the rules is done in ascending or descending order, or in other words, using the right or left part of the rules to perform the analysis.

Let's see an example with the above grammar, for the expression 5 * x + y, using a bottom-up, or LR, parser. The point is to search through the rules one whose right part matches any of these items in order to go substituting them with the elements on the left side to get to the root expression of our grammar:

5*x+y <number>*<variable>+<variable> <number>*<variable><restofexpression> <expression>

With a downward, or LL, parser, we do it the other way round, starting by the rule <expression> and trying to reach the input text string, reading from left to right. As we see that the first thing we find is, indeed, a number, and is followed by a token *, we see that it is complying with the syntax of the expression and continue until we check that everything is correct. As such analysers are the simplest to implement, although less powerful than the LR ones, the one which I will build will be of this type.

To decide for a particular rule to continue the analysis, a parser may have to check a number n of tokens from the input string. When this number is 1, that is, only with reading the next token it can decide the rule to apply, the grammar is called LL(1). Obviously, this is the easier case to implement, and is also the option that I will choose, so that the grammar will be designed so that it can build a LL(1) parser with it. The simple grammar of the example above meets this condition.

It is important that we design the grammar so that, once a rule it is selected as the correct, there is no possibility to find that, in fact, it is not the appropriate rule and we have to undo all the work and return to the starting point to try another rule. In the above example we have avoided this by substituting the rule

<expression>::=<number>*<variable>+<variable>|<number>*<variable>|<variable>

That we could cause problems with the expression 5 * x, which is correct, if we choose the first rule of the right side to analyse it, rather than the second. We have avoided this by adding another rule, <restofexpression>, which covers the two possibilities, that after the symbol x there comes a + sign, or that the expression ends here.

Finally, another property to avoid when designing a grammar is what is known as left recursion. This consists simply in not to use rules of the form:

<S>::=<S>+<X>

The workaround is simple, but not trivial, so, to not extend much the article and go straight to the grammar that I will use in the parser, I leave this link about elimination of left recursion.

Design of a grammar for simple expressions

I will design a grammar which will analyse expressions with the following requirements:

You can use numbers with decimals.
You can define constants using a given set of characters.
Constants cannot start with a number or the letters x, y or z.
They can be used three possible variables, X, Y and Z.
The binary operators allowed are difference (-), addition (+), product (*), division (/) and exponentiation (^).
The unary minus operator (-) is allowed.
You can build subexpressions writing them between parentheses.

In addition, the parser must take into account the following conditions:

The expressions enclosed in parentheses and unary minus have the highest precedence (ie, they run before any other operator in case of doubt).
The next level of precedence is for the exponentiation.
Then the product and division.
Finally, the sum and difference are the lower precedence operators.

To begin with, we can see that the highest-level rule is for the expressions, and, below this, we must have rules to define numbers, constants and variables. I will design the grammar following this approach.

Defining the numbers

The grammatical rules that define the numbers are as follows:

<number>::=<digit><rnumber>|<decimalsep><rdecimal> <rnumber>::=<digit><rnumber>|<decimalsep><rdecimal>|<void> <rdecimal>::=<number><fdecimal> <fdecimal>::=<number><fdecimal>|<void> <digit>::={0,1,2,3,4,5,6,7,8,9} <decimalsep>::={“,”,.}

With the first rule we can see that a number must begin with a digit or a decimal separator, allowing us to decide only by examining the first character of the string whether we are or not reading a number. With the rule <rnumber> we define the rest of the number, which can be numbers, a decimal separator followed by more numbers or end with the first digit.

The rule <rdecimal> force that after a decimal point exists at least one number, and <fdecimal> prevents to put an additional decimal separator. In this grammar I allow to use as a decimal separator a point (.) or comma (,) interchangeably.

Defining variables and constants

The definition of the variables is simpler. As they only consist of a letter, we need only one rule:

<var>::={x,y,z}

The constants, however, may consist of more than one letter or number, so we need more than one rule to define them:

<const>::=<letter><rconst> <rconst>::=<allchar><rconst>|<void> <letter>::={a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w} <allchar>::={0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,g,h,i,j, k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}

I have defined two sets of characters, <letter> for the initial letter, and <allchar> for the rest of the name of the constant, which can contain numbers or the letters x, y and z.

The symbol <void> should not be interpreted as that the input string must finish at the end of the constant, it means that if we do not find a valid character, we can consider that we're done with the constant. Some higher-level rule will take care of deciding whether it is a valid character.

Defining the expressions

Finally, it remains to define the higher level entity, the expression. Since there are rules of operator precedence, it is a good idea to reflect them in the grammar, as this will facilitate the implementation of the program, as the expression is compiled naturally following these rules and we won’t have to implement any additional logic.

This is achieved by dividing the rules involving different operators in various levels, the lowest of which will define the elements of higher precedence:

You can see that the higher precedence level is <element>, which defines the expressions between parentheses and the numbers, constants and variables of the expression.

In this way, we descend from the highest level, <expr> until reach the lowest level we can find and then give preference to this rule over the higher levels.

In the next article, I will explain the implementation of an expressions analyser using this grammar.

If you want to explore deeper into the subject of parsing and definition of grammars, I can recommend you these books: