A C++ Expression Parser Tutorial

Table of Contents

Introduction

This tutorial describes how to evaluate a string as a mathematical expression. Specifically, it describes the design and coding of a recursive descent parser. The parser accepts a string having valid syntax, such as 4.0 * atan(1.0), and returns a double.

When I was young, my family had one of those computer terminals you hook up to your TV. To do anything, you had to program it in BASIC. To evaluate an expression, you had to put it on a program line. One day I tried putting an expression in a string variable. I was surprised that there was no language mechanism to evaluate that string. I was curious from that day on about how to solve the problem.

I learned later on that there is a whole theory behind the problem. It involves technical concepts such as context free grammars (CFGs) and using the grammar to parse source text. As information became more and more available, I learned how to solve the problem. I can't help bragging a bit. I did end up writing a rudimentary expression parser in TI99/4A Extended BASIC. Believe me, it's much easier to do in C++.

It is possible to provide expression evaluation as a language feature. However, in that case you're usually stuck with the syntax someone else provided. For instance, you might want to write (2 * 5) ^ 3 to mean 2 times 5 all raised to the power of 3. However, if C++ provided an expression parser, it would probably mimic the use of the ^ operator in code; i.e., ^ would be a bitwise exclusive or operator. If you can write your own parser, you're in control of the syntax (language structure) and semantics (language meaning).

Context Free Grammars: The Heart of the Solution

Context free grammars (CFGs) are a definitional mechanism. They give a formal definition of the structure of a language. We're not going to define an entire programming language here. We're just going to define an expression. But, the same techniques apply.

Consider describing the structure of an addition expression. We want to describe not only 2 + 3, but 2 + 3 + 4, and 2 + 3 + 4 + 5, and in general the addition of any quantity of numbers. We want to devise a minimal set of rules that describes all these expressions. Let's start with three symbols, add_expr, PLUS, and NUMBER. We want to say that an add_expr can be a subexpression PLUS a NUMBER. We will write

add_expr:	add_expr PLUS NUMBER
The above rule is called a production. add_expr: is called the left hand side (lhs), and add_expr PLUS NUMBER is called the right hand side (rhs). Grammar productions are applied when evaluating a language construct. For instance, 2 + 3 + 4 is supposed to be a valid construct. We start by saying 2 + 3 + 4 is an add_expr. This must fit the pattern add_expr PLUS NUMBER. If it does, then 2 + 3 must be an add_expr. It also must fit the pattern add_expr PLUS NUMBER. It does if we can say that 2 is an add_expr. For this to happen, we need another rule. We need to say that an add_expr can also be a NUMBER:

add_expr:	NUMBER
The result of applying our two rules recursively can be represented by the following tree:

				add_expr
				   |
			   -------------------------
			   |		  |	   |
			add_expr	PLUS	NUMBER (4)
			   |
		   -------------------------
		   |		  |	   |
		add_expr	PLUS	NUMBER (3)
		   |
		NUMBER (2)
We constructed this tree from the top down, but it is evaluated from the bottom up: a node's children are evaluated before the node itself. So, the way our grammar is written, this expression will get evaluated from left to right.

So far, we have two grammar rules:

add_expr:	add_expr PLUS NUMBER
add_expr:	NUMBER
These rules have the same left hand side. By convention, we usually write the left hand side once, joining the right hand sides with the | symbol to mean "or". We end with a ; to show that we're done. Using this convention, our grammar looks like:

add_expr:	add_expr PLUS NUMBER
	|	NUMBER
	;
The first rule in our grammar is called a left recursive rule. One property of left recursive rules is that they specify a left to right evaluation of an expression. If we wanted to specify that expressions should be evaluated right to left, we would use a right recursive rule:

add_expr:	NUMBER PLUS add_expr
	|	NUMBER
	;

Nonterminals, Terminals, and Tokens

When a syntax tree is generated from grammar productions, the symbols on the left hand side never appear as leaves in the tree. Hence, they are called nonterminals. The symbols that do appear as leaves of the tree are called terminals. The terminal symbols of a language are also called tokens. One property of a token is that it can appear directly in source text. For instance, you may see a 2 or a + in an expression, but there is no token representing an add_expr. You have to compose an add_expr out of the available tokens.

The Meaning of Context Free

Context free simply means that the lhs of every production in the grammar consists of one and only one nonterminal. Grammars that don't have this property are called context sensitive, because the lhs then determines the context in which an rhs can replace an lhs.

Context sensitive grammars are more powerful, but efficient parsers for these grammars do not exist. On the other hand, efficient parsers for many classes of CFGs do exist.

Describing Tokens

Tokens can be described in a grammar. But, in practice, they are usually described separately. The reason for this is that parsing techniques are generally slower than the techniques used to recognize a token. When tokens are specified with a grammar, the grammar subset specifying a token turns out to belong to a class of grammars called regular grammars. There is an alternate way of describing a token called a regular expression. Each symbol used in such an expression corresponds directly to a regular grammar construct. If you are using parser generation tools, you will need to be familiar with regular expressions. However, describing regular expressions here would detract from our task of describing how to write a parser. It is sufficient to describe our tokens informally.

In describing NUMBER, we must decide what kind of number we want. For instance, we might want an integer, or we might want a floating point number. We will use floating point numbers here.

It's easiest to describe our numbers if we break them up into two kinds: numbers starting with a digit, and numbers starting with a decimal point. Each of these kinds of numbers can optionally be followed by an exponent part.

A number is composed of:

	one or more digits, optionally followed by a decimal point and zero or more digits

OR

	a decimal point followed by one or more digits

An exponent part is composed of:

	an e or E, followed by an optional + or - sign, followed by one or more digits

Token descriptions can get rather verbose in English. The advantage of regular expressions is that they are compact and concise. The disadvantage is that they can be cryptic until you get used to them. The regular expression that describes a number is:

(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))([eE][+-]?[0-9]+)?

That's pretty unwieldy. But it's what we implement in our lexer; you can see the regular expression implemented in code. Here is a key to understanding the above expression:

	[0-9]	any one of the characters in the brackets (here, any digit; [eE] and [+-] work the same way)
	\.	a literal decimal point (an unescaped . has a special meaning)
	x?	zero or one occurrence of x
	x*	zero or more occurrences of x
	x+	one or more occurrences of x
	x|y	either x or y
	(x)	grouping

Hopefully by this point you can recognize the optional exponent part of the number.

Operator Precedence in CFGs

Consider the following grammar:

add_expr:	add_expr PLUS mul_expr
	|	mul_expr
	;

mul_expr:	mul_expr MUL NUMBER
	|	NUMBER
	;
The expression 2 + 3 * 4 would produce the following derivation tree:

				add_expr
				   |
		   ---------------------------------
		   |              |                |
		add_expr	PLUS		mul_expr
		   |                              |
		   |                       --------------
		   |			   |	  |	|
		mul_expr		mul_expr MUL NUMBER
		   |			   |
		NUMBER			NUMBER
Notice that when the tree is evaluated from the bottom up, the multiplication gets evaluated before the addition. This is precisely what we want. The above grammar illustrates how to express operator precedence in a grammar. Our final grammar will contain a full set of operators and will handle things in addition to numbers, such as identifiers and parenthesized expressions.

The Expression Grammar: First Version

Here is a grammar defining the syntax of our expressions. The reason that this is "version 1" is that, for reasons explained below, the type of parser we are going to use can't deal with things such as left recursion. We will end up rewriting some rules in a different, but equivalent, way.

assign_expr:			ID ASSIGN add_expr
	|			add_expr
	;

add_expr:			add_expr PLUS mul_expr
	|			add_expr MINUS mul_expr
	|			mul_expr
	;

mul_expr:			mul_expr MUL pow_expr
	|			mul_expr DIV pow_expr
	|			mul_expr MOD pow_expr
	|			pow_expr
	;

pow_expr:			unary_expr pow_expr_suffix
	|			unary_expr
	;

pow_expr_suffix:		POW unary_expr
	;

unary_expr:			PLUS primary
	|			MINUS primary
	|			primary
	;

primary:			ID
	|			NUMBER
	|			LP add_expr RP
	|			SIN LP add_expr RP
	|			COS LP add_expr RP
	|			TAN LP add_expr RP
	|			ASIN LP add_expr RP
	|			ACOS LP add_expr RP
	|			ATAN LP add_expr RP
	|			LOG LP add_expr RP
	|			EXP LP add_expr RP
	|			LOG10 LP add_expr RP
	|			EXP10 LP add_expr RP
	|			SQRT LP add_expr RP
	|			INT LP add_expr RP
	;

Token Descriptions

An ID is a letter followed by zero or more letters or digits. A NUMBER is the same as previously described. ASSIGN, PLUS, MINUS, MUL, DIV, MOD, POW, LP, and RP represent the symbols = + - * / % ^ ( ). The rest of the tokens are the function names. They represent keywords with the same spelling, but in all lowercase.

The Phases of Language Translation

An expression parser, or more generally, a compiler, usually operates in several phases. First, the characters in the source text must be properly grouped and turned into tokens. Next, the parser checks the sequence of tokens to see whether they satisfy the grammar. Next, the parser handles the semantics of what it recognizes; e.g., when it recognizes an addition expression, it collects the information it needs and prepares to do the addition. In our parser, the addition is done at the point that an addition subexpression is recognized. If we were writing a compiler, we would have our parser generate code.

Grammars and Parsers

Given a language construct, the rules in a grammar can be applied from the top down or from the bottom up. As long as the same derivation tree is generated, the rules can be applied in any fashion. This gives rise to different classes of grammars and different parser types. The bottom up approach isn't very intuitive. However, this approach, along with the parsers that support it, ends up being more powerful than the top down approach. The bottom up approach makes parsing full featured languages easier. Programs that generate parsers from a grammar plus other information usually generate a bottom up parser.

We will be using a top down parser called a recursive descent parser. Recursive descent parsers are best at handling a class of grammars called LL(1). This arcane name refers to how a token stream is parsed. The first L means that the token stream is read left to right. The second L means that the parser traces out a leftmost derivation. The (1) means that the parser can look ahead at most one token. We won't go into the differences between various grammar classes. The property of LL(1) grammars that we are interested in is that, given a token and the left hand side of a production, the token must predict one and only one right hand side. The grammar above doesn't meet this requirement. Here is why. Take the rule set

add_expr:	add_expr PLUS mul_expr
	|	mul_expr
	;
If the current token in our token stream is a number, which right hand side do we choose? We don't know, without looking ahead further, whether we're starting an addition expression or a multiplication expression. In general, no grammar containing left recursion can be LL(1). The reason I have used left recursion is that it's more intuitive than the LL(1) equivalent.

An LL(1) equivalent of the above is

add_expr:	mul_expr add_expr_tail
	;

add_expr_tail:	PLUS mul_expr add_expr_tail
	|	// EMPTY
	;
The comment // EMPTY just refers to "no token". Now, if our current token is NUMBER and we're at a point where we're processing an add_expr, we only have the rule mul_expr add_expr_tail to choose from. If a NUMBER couldn't start this rule, we would have an error. But, a NUMBER can start an add_expr, so we may proceed. Going down the parse tree, mul_expr eventually becomes a primary, which becomes a NUMBER. With that token matched, we advance to the next token. We are now processing the add_expr_tail nonterminal in add_expr: mul_expr add_expr_tail. If our current token is PLUS, the rule add_expr_tail: PLUS mul_expr add_expr_tail is used. Otherwise, the rule add_expr_tail: // EMPTY is used.

The first two rules in our grammar also exclude it from the LL(1) class:

assign_expr:	ID ASSIGN add_expr
	|	add_expr
	;
Can you see why? An add_expr can become a primary, which can become an ID. So an ID can begin either right hand side.

The easiest way to make our grammar LL(1) is to allow assignment to any kind of expression. We will defer the check that the target of an assignment is actually an identifier to static semantic checking. Hence, we only need to transform our first two rules as follows:

assign_expr:		add_expr assign_expr_suffix
	;

assign_expr_suffix:	ASSIGN add_expr
	|		// EMPTY
	;

The Expression Grammar: Second Version

Here is an LL(1) grammar that is equivalent to our original grammar.

assign_expr:		add_expr assign_expr_suffix
	;

assign_expr_suffix:	ASSIGN add_expr
	|		// EMPTY
	;

add_expr:		mul_expr add_expr_tail
	;

add_expr_tail:		PLUS mul_expr add_expr_tail
	|		MINUS mul_expr add_expr_tail
	|		// EMPTY
	;

mul_expr:		pow_expr mul_expr_tail
	;

mul_expr_tail:		MUL pow_expr mul_expr_tail
	|		DIV pow_expr mul_expr_tail
	|		MOD pow_expr mul_expr_tail
	|		// EMPTY
	;

pow_expr:		unary_expr pow_expr_suffix
	;

pow_expr_suffix:	POW unary_expr
	|		// EMPTY
	;

unary_expr:		PLUS primary
	|		MINUS primary
	|		primary
	;

primary:		ID
	|		NUMBER
	|		LP add_expr RP
	|		SIN LP add_expr RP
	|		COS LP add_expr RP
	|		TAN LP add_expr RP
	|		ASIN LP add_expr RP
	|		ACOS LP add_expr RP
	|		ATAN LP add_expr RP
	|		LOG LP add_expr RP
	|		EXP LP add_expr RP
	|		LOG10 LP add_expr RP
	|		EXP10 LP add_expr RP
	|		SQRT LP add_expr RP
	|		INT LP add_expr RP
	;

LL(1) Grammars and Recursive Descent Parser Routines

One nice thing about LL(1) grammars and recursive descent parsers is that each set of productions having the same left hand side can be directly mapped into a parser function. For example, consider the production

add_expr:		mul_expr add_expr_tail
	;
This can be simply coded as

void add_expr()
{
  mul_expr();
  add_expr_tail();
}
The production set

add_expr_tail:		PLUS mul_expr add_expr_tail
	|		MINUS mul_expr add_expr_tail
	|		// EMPTY
	;
can be coded as

void add_expr_tail()
{
  switch (current_token) {
  case PLUS:
    advance();	// skip past PLUS token
    mul_expr();
    add_expr_tail();
    break;
  case MINUS:
    advance();	// skip past MINUS token
    mul_expr();
    add_expr_tail();
    break;
  default:
    return;	// represents EMPTY.
  }
}
Don't get too clever by factoring out the commonality in this code. It will disappear when you add the semantics later.

The recursion in the add_expr_tail production has the potential to be slow and to eat memory. When such X and X_tail productions occur, we will code them together using the following kind of optimization:

void add_expr()
{
  mul_expr();

  for (;;) {
    switch (current_token()) {
    case PLUS:
      advance();
      mul_expr();
      break;
    case MINUS:
      advance();
      mul_expr();
      break;
    default:
      return;
    }
  }
}
We apply a similar optimization to an X and X_suffix pair:

void assign_expr()
{
  add_expr();

  if (current_token == ASSIGN) {
    advance();
    add_expr();
  }
}

The Coding Phases

Ordinarily, it's a good idea to design a program before coding it. However, our program is small enough to allow us to jump right into the coding. It also simplifies the tutorial. We will first code an object called a Lexer. The job of the lexer is to turn the input character stream into expression tokens. Next, we will code the parser proper. Finally, we will add semantic processing to the parser. Once we do that, we will have a parser that evaluates the expressions it recognizes.

Parser Errors

Our parser will support simple error handling. We want to be able to report lexical errors, syntax errors, and runtime errors (errors in semantic processing).

template<int N>
  class Error {
  public:
    explicit Error(const std::string& s) : message{s} { }

    std::string get_message() const { return message; }
    void put(std::ostream& os) const { os << message; }

  private:
    std::string message;
  };

using Lexical_error = Error<1>;
using Syntax_error = Error<2>;
using Runtime_error = Error<3>;

template<int N>
  std::ostream& operator<<(std::ostream& os, const Error<N>& x)
  {
    x.put(os);
    return os;
  }

The Lexer

The Token Type

The first step is to give each of our token symbols a representation in code. We will use an enum class to do this.

enum class Token {
  Id, Number, Sin, Cos, Tan, Asin, Acos, Atan, Log, Exp,
  Log10, Exp10, Sqrt, Int, Assign='=', Plus='+', Minus='-',
  Mul='*', Div='/', Mod='%', Pow='^', Lp='(', Rp=')', Eofsym=-1
};

The Lexer Class

Here is our Lexer class.

class Lexer {
public:
  explicit Lexer(std::istream& is);
  explicit Lexer(std::istream* ps);

  // A lexer belongs to a parser and should not be copied or moved.

  Lexer(const Lexer&) = delete;
  Lexer& operator=(const Lexer&) = delete;

  Lexer(Lexer&&) = delete;
  Lexer& operator=(Lexer&&) = delete;

  ~Lexer() { if (owns_input) delete p_input; }

  Token get_current_token() const { return cur_token; }
  std::string get_token_text() const { return buffer; }

  void advance() { cur_token = get_token(); }

private:
  std::istream* p_input;	// The source stream (a stream of characters).
  bool owns_input;		// True if we can delete p_input, false if we can't.

  Token cur_token;
  std::string buffer;

  Token get_token();		// The workhorse. Assembles characters from p_input into tokens.
};
The constructors simply initialize p_input and owns_input, and call advance() to do a priming read that loads the first token.

Lexer::Lexer(std::istream& is)
  : p_input{&is}, owns_input{false}
{
  advance();
}

Lexer::Lexer(std::istream* ps)
  : p_input{ps}, owns_input{true}
{
  advance();
}
get_token() is the workhorse of our lexer. It assembles the characters from the input stream into tokens. It starts out as follows:

Token Lexer::get_token()
{
  std::istream& input = *p_input;	// shorthand to make the notation convenient.

  buffer.clear();			// clear the buffer for the new token.

  char c = input.get();			// a priming read on the stream.

  // Skip whitespace.
  while (isspace(c)) c = input.get();

  // If there are no characters, we're at the end of the stream.
  if (!input) return Token::Eofsym;

  // ...
}
Next, we look for an identifier or function name. Both have the same syntax. We handle this by looking for an identifier first. Then, we check what we have found against the function names.

Token Lexer::get_token()
{
  // ...

  // Look for an identifier or a function name. Both start with a letter.
  if (isalpha(c)) {
    buffer = c;
    c = input.get();

    // Look for zero or more letters or digits.
    while (isalnum(c)) {
      buffer += c;
      c = input.get();
    }

    // The current character doesn't belong to our identifier, and must be put back into the stream.
    input.putback(c);

    // Check for a function name.
    if (buffer == "sin") return Token::Sin;
    if (buffer == "cos") return Token::Cos;
    if (buffer == "tan") return Token::Tan;
    if (buffer == "asin") return Token::Asin;
    if (buffer == "acos") return Token::Acos;
    if (buffer == "atan") return Token::Atan;
    if (buffer == "log") return Token::Log;
    if (buffer == "exp") return Token::Exp;
    if (buffer == "log10") return Token::Log10;
    if (buffer == "exp10") return Token::Exp10;
    if (buffer == "sqrt") return Token::Sqrt;
    if (buffer == "int") return Token::Int;

    // Whatever is not a function name must be an identifier.
    return Token::Id;
  }

  // ...
}
Here is the code to assemble a number from characters. We want numbers such as 304, 46., .72, 3.14, and any of these patterns followed by an exponent part, e.g., 4.92e-3, 21.34e6, .5e+7. The following code recognizes our number pattern.

Token Lexer::get_token()
{
  // ...

  // Look for a number.
  if (isdigit(c) || c == '.') {
    if (isdigit(c)) {
      buffer += c;
      c = input.get();
      while (isdigit(c)) {
        buffer += c;
        c = input.get();
      }
      if (c == '.') {
        buffer += c;
        c = input.get();
        while (isdigit(c)) {
          buffer += c;
          c = input.get();
        }
      }
    } else {  // c == '.'
      buffer += c;
      c = input.get();
      if (!isdigit(c))
        throw Lexical_error{"no digits after decimal point"};
      while (isdigit(c)) {
        buffer += c;
        c = input.get();
      }
    }
    if (c == 'e' || c == 'E') {
      buffer += c;
      c = input.get();
      if (c == '+' || c == '-') {
        buffer += c;
        c = input.get();
      }
      if (!isdigit(c))
        throw Lexical_error{"no digits in exponent part"};
      while (isdigit(c)) {
        buffer += c;
        c = input.get();
      }
    }
    input.putback(c);
    return Token::Number;
  }

  // ...
}
The rest of our tokens are single character tokens. They are simple to handle. If we get through our single character tokens without returning, we're left with an error.

Token Lexer::get_token()
{
  // ...

  // Check for a single character token.
  buffer = c;
  switch (c) {
  // Note: fallthrough intentional.
  case '=':
  case '+':
  case '-':
  case '*':
  case '/':
  case '%':
  case '^':
  case '(':
  case ')':
    return Token(c);
  }

  // Anything else is an error.
  throw Lexical_error{buffer};
}

The Parser

We will code the parser as a class. There are other ways to do it, but I like having the parser initialize itself, and being able to pass the expression to the parser using the call operator.

Each nonterminal in the grammar is represented by a parser function. The exceptions are the _tail and _suffix nonterminals, for which we provide the optimization mentioned above. For now, it is convenient to declare the parser functions as returning void. Later, when we add our semantics and our functions return values, we'll change that return type to double. The alternative is to have our functions return double from the beginning, and return a dummy value. But if we miss replacing that dummy value later on, we'll have a buggy program.

Here is the Parser class.

class Parser {
public:
  Parser();

  // Change the return type to double when we add semantics.
  void operator()(const std::string& s);

private:
  Lexer* p_lexer;

  // Change these return types to double when we add semantics.
  void assign_expr();
  void add_expr();
  void mul_expr();
  void pow_expr();
  void unary_expr();
  void primary();
  void get_argument();
};
We could let the compiler provide a default constructor at this stage, but when we add our symbol table we will use the constructor to reserve a couple of identifiers as constants.

For now, the constructor is empty. The operator() function simply creates a Lexer, calls assign_expr() to start the parsing process, and then deletes the Lexer when it's done.

Parser::Parser()
{
}

void Parser::operator()(const std::string& s)
{
  std::istringstream ist{s};	// Create an input stream from the string.
  p_lexer = new Lexer{ist};
  assign_expr();		// Our entry point into the parsing process.
  delete p_lexer;
}
Now we will code the parsing routines. Each function is named after a nonterminal. The function bodies conform closely to the right hand sides of all the productions for which the nonterminal is the left hand side. We choose which of the rules we are processing based on the current token in the stream. Whenever we encounter a nonterminal in an rhs, we simply call the function with that name. When we encounter a terminal, we check whether it matches the current token in the stream. If it does, we advance the lexer. If it doesn't, we throw an exception.

void Parser::assign_expr()
{
  // We must save the current token at this point, because if it is followed by an assignment
  // operator, we have to check whether it is an identifier.
  Token t = p_lexer->get_current_token();

  add_expr();

  // Process assign_expr_suffix.
  if (p_lexer->get_current_token() == Token::Assign) {
    // The target of our assignment must be an Id.
    if (t != Token::Id)
      throw Syntax_error{"target of assignment must be an identifier"};
    p_lexer->advance();	// Move past the Assign token.
    add_expr();
  }
}
In our grammar, an assignment to any expression is valid. However, when that target is not an identifier, assignment doesn't make sense. When we specified our grammar, we left that requirement to semantic checking. But, it's a check we can provide during translation. Semantics that can be checked during translation are called static semantics.

Here is add_expr().

void Parser::add_expr()
{
  mul_expr();

  // Process add_expr_tail.
  for (;;) {
    switch (p_lexer->get_current_token()) {
    case Token::Plus:
      p_lexer->advance();
      mul_expr();
      break;
    case Token::Minus:
      p_lexer->advance();
      mul_expr();
      break;
    default:
      return;
    }
  }
}
Note that the two cases in the switch statement above have common code. We don't want to factor it out, because when we add semantics later on, the code segments will no longer be identical.

The mul_expr() is very similar.

void Parser::mul_expr()
{
  pow_expr();

  // Process mul_expr_tail.
  for (;;) {
    switch (p_lexer->get_current_token()) {
    case Token::Mul:
      p_lexer->advance();
      pow_expr();
      break;
    case Token::Div:
      p_lexer->advance();
      pow_expr();
      break;
    case Token::Mod:
      p_lexer->advance();
      pow_expr();
      break;
    default:
      return;
    }
  }
}
The code for pow_expr() and unary_expr() is pretty straightforward. Note that pow_expr() calls itself after the POW token rather than calling unary_expr(); this makes ^ right associative, so 2 ^ 3 ^ 2 is treated as 2 ^ (3 ^ 2).

void Parser::pow_expr()
{
  unary_expr();

  // Process pow_expr_suffix.
  if (p_lexer->get_current_token() == Token::Pow) {
    p_lexer->advance();
    pow_expr();
  }
}

void Parser::unary_expr()
{
  switch (p_lexer->get_current_token()) {
  case Token::Plus:
    p_lexer->advance();
    primary();
    break;
  case Token::Minus:
    p_lexer->advance();
    primary();
    break;
  default:
    primary();
  }
}
The primary() function is just a straightforward processing of the right hand side of each production in every case. Instead of writing the code for function calls over and over, we'll put it in a function called get_argument(). When we add semantic processing, get_argument() will return the value produced by add_expr(), i.e., the argument to the function.

void Parser::primary()
{
  switch (p_lexer->get_current_token()) {
  case Token::Id:
    p_lexer->advance();
    break;
  case Token::Number:
    p_lexer->advance();
    break;
  case Token::Lp:
    p_lexer->advance();
    add_expr();
    if (p_lexer->get_current_token() != Token::Rp)
      throw Syntax_error{"missing ) after subexpression"};
    p_lexer->advance();
    break;
  case Token::Sin:
    get_argument();
    break;
  case Token::Cos:
    get_argument();
    break;
  case Token::Tan:
    get_argument();
    break;
  case Token::Asin:
    get_argument();
    break;
  case Token::Acos:
    get_argument();
    break;
  case Token::Atan:
    get_argument();
    break;
  case Token::Log:
    get_argument();
    break;
  case Token::Exp:
    get_argument();
    break;
  case Token::Log10:
    get_argument();
    break;
  case Token::Exp10:
    get_argument();
    break;
  case Token::Sqrt:
    get_argument();
    break;
  case Token::Int:
    get_argument();
    break;
  default:
    throw Syntax_error{"invalid primary expression"};
  }
}

// Change the return type to double when we add semantics.
void Parser::get_argument()
{
  p_lexer->advance();	// Toss the function name. We already know it.
  if (p_lexer->get_current_token() != Token::Lp)
    throw Syntax_error{"missing ( after function name"};
  p_lexer->advance();
  add_expr();
  if (p_lexer->get_current_token() != Token::Rp)
    throw Syntax_error{"missing ) after function argument"};
  p_lexer->advance();
}
We now write a driver for the parser. At this point, the program will only produce output if something is wrong. Don't let that discourage you.

int main()
{
  Parser parser;

  while (std::cin) {
    std::string s;
    std::getline(std::cin, s);
    if (!std::cin || s == "quit") break;
    try {
      parser(s);
    } catch (const Lexical_error& e) {
      std::cerr << "Lexical error: " << e << '\n';
    } catch (const Syntax_error& e) {
      std::cerr << "Syntax error: " << e << '\n';
    } catch (const Runtime_error& e) {
      std::cerr << "Runtime error: " << e << '\n';
    }
  }
}

Semantics

We will now add the code that makes the parser do its computations. Ideally, the semantics should be specified along with the grammar. Because our parser is small, we can get away with describing them as we write them in code. Also, I think this approach is a good way to learn exactly what semantics are and what issues they present.

Helper Functions

The fundamental element in our expressions is the number. Our lexer already composes input characters into text representing a number. When our primary() function recognizes a number, it will need to convert the string in the token text into a double. We can provide a simple helper function to do that. It works by turning the given string into an input stream, and then reads the number from that stream.

double to_number(const std::string& s)
{
  std::istringstream ist{s};

  // Tell the input stream to throw an exception if it can't read the number.
  ist.exceptions(std::ios_base::failbit);

  double x;
  ist >> x;
  return x;
}
The opposite conversion, from number to string, is helpful when composing error messages.

std::string to_string(double x)
{
  std::ostringstream ost;
  ost << x;
  return ost.str();
}

The Symbol Table

We need to associate every identifier with a value. The mechanism used to do this is called a symbol table. In our expressions, the mention of a variable is sufficient to bring it into existence. We don't need to store any information with it other than its value. Hence, a map is sufficient to use as a symbol table.

std::map<std::string, double> symbol_table;

Parser()

We're going to add two constants to our parser, pi and e. The constructor will put their values into the symbol table. It will be the job of assign_expr() to protect against assignments to these two constants.

Parser::Parser()
{
  symbol_table["pi"] = 4.0 * atan(1.0);
  symbol_table["e"] = exp(1.0);
}

primary()

When we recognize an Id, we look it up in the symbol table and return its associated value. If the Id hasn't been used yet, it's entered into the symbol table with a value of 0, and 0 is returned.

When we recognize a Number, we convert its string representation to a double and return that value.

When we recognize a subexpression, we return the value of the expression in the parentheses. We do this simply by returning what add_expr() returns.

When we process a function, we're trying to achieve the effect of calling the function on the argument. We call our function get_argument() to get the argument value. When we need to, we check whether the argument is valid for the function. We then return the result of calling the function on the argument. Here is primary() with the semantics added.

double Parser::primary()
{
  std::string text = p_lexer->get_token_text();
  double arg;

  switch (p_lexer->get_current_token()) {
  case Token::Id:
    p_lexer->advance();
    return symbol_table[text];
  case Token::Number:
    p_lexer->advance();
    return to_number(text);
  case Token::Lp:
    p_lexer->advance();
    arg = add_expr();
    if (p_lexer->get_current_token() != Token::Rp)
      throw Syntax_error{"missing ) after subexpression"};
    p_lexer->advance();
    return arg;
  case Token::Sin:
    return sin(get_argument());
  case Token::Cos:
    return cos(get_argument());
  case Token::Tan:
    arg = get_argument();
    if (cos(arg) == 0)
      throw Runtime_error{"invalid argument to tan: " + to_string(arg)};
    return tan(arg);
  case Token::Asin:
    return asin(get_argument());
  case Token::Acos:
    return acos(get_argument());
  case Token::Atan:
    return atan(get_argument());
  case Token::Log:
    arg = get_argument();
    if (arg <= 0)
      throw Runtime_error{"invalid argument to log: " + to_string(arg)};
    return log(arg);
  case Token::Exp:
    return exp(get_argument());
  case Token::Log10:
    arg = get_argument();
    if (arg <= 0)
      throw Runtime_error{"invalid argument to log10: " + to_string(arg)};
    return log10(arg);
  case Token::Exp10:
    return pow(10.0, get_argument());	// exp10() is a GNU extension; pow() is portable.
  case Token::Sqrt:
    arg = get_argument();
    if (arg < 0)
      throw Runtime_error{"attempt to take square root of negative number"};
    return sqrt(arg);
  case Token::Int:
    arg = get_argument();
    if (arg < 0)
      return ceil(arg);
    else
      return floor(arg);
  default:
    throw Syntax_error{"invalid primary expression"};
  }
}
We also need to modify get_argument() so it returns the value of the argument.

double Parser::get_argument()
{
  p_lexer->advance();	// Toss the function name. We already know it.
  if (p_lexer->get_current_token() != Token::Lp)
    throw Syntax_error{"missing ( after function name"};
  p_lexer->advance();
  double arg = add_expr();
  if (p_lexer->get_current_token() != Token::Rp)
    throw Syntax_error{"missing ) after function argument"};
  p_lexer->advance();
  return arg;
}

unary_expr()

Table of Contents

Adding semantics to this function is simple. We simply want to apply the sign, if there is one.

double Parser::unary_expr()
{
  switch (p_lexer->get_current_token()) {
  case Token::Plus:
    p_lexer->advance();
    return +primary();
  case Token::Minus:
    p_lexer->advance();
    return -primary();
  default:
    return primary();
  }
}

pow_expr()

Table of Contents

pow_expr() uses the library function pow() to do its computation. We also want to make sure that we're not trying to take the root of a negative number. We provide a function check_domain() that does this. I chose to put check_domain() in the parser class, but made it static because it doesn't depend on any members of the class.

double Parser::pow_expr()
{
  double result = unary_expr();

  if (p_lexer->get_current_token() == Token::Pow) {
    p_lexer->advance();
    double x = unary_expr();
    check_domain(result, x);
    result = pow(result, x);
  }

  return result;
}

void Parser::check_domain(double x, double y)
{
  // There is no error if x is nonnegative.
  if (x >= 0) return;

  // There is no error unless 0 < abs(y) < 1.
  double e = std::abs(y);
  if (e <= 0 || e >= 1) return;

  // We have an error.
  throw Runtime_error{"attempt to take root of a negative number"};
}

mul_expr()

Table of Contents

Providing semantics for multiplication and division is pretty straightforward. In the case of the division and modulo operations, we want to catch division by zero.

double Parser::mul_expr()
{
  double result = pow_expr();
  double x;

  for (;;) {
    switch (p_lexer->get_current_token()) {
    case Token::Mul:
      p_lexer->advance();
      result *= pow_expr();
      break;
    case Token::Div:
      p_lexer->advance();
      x = pow_expr();
      if (x == 0)
        throw Runtime_error{"attempt to divide by zero"};
      result /= x;
      break;
    case Token::Mod:
      p_lexer->advance();
      x = pow_expr();
      if (x == 0)
        throw Runtime_error{"attempt to divide by zero"};
      result = fmod(result, x);
      break;
    default:
      return result;
    }
  }
}

add_expr()

Table of Contents

Addition and subtraction follow the same looping pattern as mul_expr(). There are no extra error conditions to check, so the semantics amount to += and -=.

double Parser::add_expr()
{
  double result = mul_expr();

  for (;;) {
    switch (p_lexer->get_current_token()) {
    case Token::Plus:
      p_lexer->advance();
      result += mul_expr();
      break;
    case Token::Minus:
      p_lexer->advance();
      result -= mul_expr();
      break;
    default:
      return result;
    }
  }
}

assign_expr()

Table of Contents

In assign_expr(), we want to guard against assigning to our constants, and we want to update any value assigned to in our symbol table. We count assignment to a constant as a syntax error because it's something that can be checked for during translation, and we're grouping our syntax errors and static semantic errors together.

double Parser::assign_expr()
{
  // We must save the current token at this point, because if it is followed by an assignment
  // operator, we have to check whether it is an identifier.
  Token t = p_lexer->get_current_token();
  std::string text = p_lexer->get_token_text();

  double result = add_expr();

  // Process assign_expr_suffix.
  if (p_lexer->get_current_token() == Token::Assign) {
    // The target of our assignment must be an Id.
    if (t != Token::Id)
      throw Syntax_error{"target of assignment must be an identifier"};

    // Check whether we're modifying a constant.
    if (text == "pi" || text == "e")
      throw Syntax_error{"attempt to modify the constant " + text};

    p_lexer->advance();	// Move past the Assign token.
    return symbol_table[text] = add_expr();
  }

  return result;
}

The Complete Parser Code

Table of Contents

Below is the parser code in its entirety.

#include <iostream>
#include <sstream>
#include <map>
#include <string>
#include <algorithm>
#include <cctype>
#include <cmath>

double to_number(const std::string& s)
{
  std::istringstream ist{s};
  ist.exceptions(std::ios_base::failbit);
  double x;
  ist >> x;
  return x;
}

std::string to_string(double x)
{
  std::ostringstream ost;
  ost << x;
  return ost.str();
}

template<int N>
  class Error {
  public:
    explicit Error(const std::string& s) : message{s} { }

    std::string get_message() const { return message; }
    void put(std::ostream& os) const { os << message; }

  private:
    std::string message;
  };

using Lexical_error = Error<1>;
using Syntax_error = Error<2>;
using Runtime_error = Error<3>;

template<int N>
  std::ostream& operator<<(std::ostream& os, const Error<N>& e)
  {
    e.put(os);
    return os;
  }

// The basic elements of our expressions.
enum class Token {
  Id, Number, Sin, Cos, Tan, Asin, Acos, Atan, Log, Exp,
  Log10, Exp10, Sqrt, Int, Assign='=', Plus='+', Minus='-',
  Mul='*', Div='/', Mod='%', Pow='^', Lp='(', Rp=')', Eofsym=-1
};

class Lexer {
public:
  explicit Lexer(std::istream& is);
  explicit Lexer(std::istream* ps);

  // A Lexer belongs to a parser and shouldn't be copied or moved.

  Lexer(const Lexer&) = delete;
  Lexer& operator=(const Lexer&) = delete;

  Lexer(Lexer&&) = delete;
  Lexer& operator=(Lexer&&) = delete;

  ~Lexer() { if (owns_input) delete p_input; }

  Token get_current_token() const { return current_token; }
  std::string get_token_text() const { return current_token_text; }

  void advance();		// Read the next token in the stream.

private:
  std::istream* p_input;		// The source stream (a stream of characters).
  bool owns_input;			// True if we can delete p_input, false if we can't.

  Token current_token;
  std::string current_token_text;

  void init();				// Code common to each constructor.

  Token get_token();			// The workhorse. Assembles characters from p_input into tokens.
  std::string token_buffer;		// The text of the token that get_token() just found.

  void exponent_part(char& c);		// A helper function for get_token() when it is looking for a number.
};

Lexer::Lexer(std::istream& is)
  : p_input{&is}, owns_input{false}
{
  init();
}

Lexer::Lexer(std::istream* ps)
  : p_input{ps}, owns_input{true}
{
  init();
}

void Lexer::init()
{
  current_token = get_token();
  current_token_text = token_buffer;
}

void Lexer::advance()
{
  if (current_token != Token::Eofsym) {
    current_token = get_token();
    current_token_text = token_buffer;
  }
}

Token Lexer::get_token()
{
  std::istream& input = *p_input;		// Shorthand to make the notation convenient.

  token_buffer.clear();				// Clear the buffer for the new token.

  char c = input.get();				// A priming read on the stream.

  // Skip whitespace.
  while (isspace(c)) c = input.get();

  // If there are no characters, we're at the end of the stream.
  if (!input) return Token::Eofsym;

  // Look for an identifier or function name.
  if (isalpha(c)) {
    token_buffer = c;
    c = input.get();

    // Look for zero or more letters or digits.
    while (isalnum(c)) {
      token_buffer += c;
      c = input.get();
    }

    // The current character doesn't belong to our identifier.
    input.putback(c);

    // Check for a function name.
    if (token_buffer == "sin") return Token::Sin;
    if (token_buffer == "cos") return Token::Cos;
    if (token_buffer == "tan") return Token::Tan;
    if (token_buffer == "asin") return Token::Asin;
    if (token_buffer == "acos") return Token::Acos;
    if (token_buffer == "atan") return Token::Atan;
    if (token_buffer == "log") return Token::Log;
    if (token_buffer == "exp") return Token::Exp;
    if (token_buffer == "log10") return Token::Log10;
    if (token_buffer == "exp10") return Token::Exp10;
    if (token_buffer == "sqrt") return Token::Sqrt;
    if (token_buffer == "int") return Token::Int;

    // Whatever is not a function name must be an identifier.
    return Token::Id;
  }

  // Look for a number beginning with a digit.
  if (isdigit(c)) {
    token_buffer = c;
    c = input.get();

    // Look for other digits.
    while (isdigit(c)) {
      token_buffer += c;
      c = input.get();
    }
    
    // Look for an optional decimal point.
    // If there is one, it can be followed by zero or more digits.
    if (c == '.') {
      token_buffer += c;
      c = input.get();
  
      while (isdigit(c)) {
        token_buffer += c;
        c = input.get();
      }
    }

    // Look for an optional exponent part.
    exponent_part(c);

    input.putback(c);
    return Token::Number;
  }

  // Look for a number beginning with a decimal point.
  if (c == '.') {
    token_buffer = c;
    c = input.get();

    // A decimal point must be followed by a digit. Otherwise we have an error.
    if (!isdigit(c)) {
      throw Lexical_error{token_buffer += c};
    }
    while (isdigit(c)) {
      token_buffer += c;
      c = input.get();
    }

    // Check for the optional exponent part.
    exponent_part(c);

    input.putback(c);
    return Token::Number;
  }

  // Check for a single character token.
  token_buffer = c;
  switch (c) {
  // Note: fallthrough intentional.
  case '=':
  case '+':
  case '-':
  case '*':
  case '/':
  case '%':
  case '^':
  case '(':
  case ')':
    return Token(c);
  }

  // Anything else is an error.
  throw Lexical_error{token_buffer};
}

void Lexer::exponent_part(char& c)
{
  std::istream& input = *p_input;

  if (c != 'e' && c != 'E')
    return;

  token_buffer += c;
  c = input.get();

  // Check for an optional sign.
  if (c == '+' || c == '-') {
    token_buffer += c;
    c = input.get();
  }

  // We must have a digit. Otherwise, we have an error.
  if (!isdigit(c))
    throw Lexical_error{token_buffer += c};
  while (isdigit(c)) {
    token_buffer += c;
    c = input.get();
  }
}

std::map<std::string, double> symbol_table;

class Parser {
public:
  Parser();
  double operator()(const std::string& s);

private:
  Lexer* p_lexer;

  double assign_expr();
  double add_expr();
  double mul_expr();
  double pow_expr();
  double unary_expr();
  double primary();

  double get_argument();

  // Check for root of a negative number.
  static void check_domain(double x, double y);
};

Parser::Parser()
{
  symbol_table["pi"] = 4.0 * atan(1.0);
  symbol_table["e"] = exp(1.0);
}

double Parser::operator()(const std::string& s)
{
  std::istringstream ist{s};
  p_lexer = new Lexer{ist};
  double result;
  try {
    result = assign_expr();
  } catch (...) {
    delete p_lexer;	// Don't leak the lexer if evaluation throws.
    throw;
  }
  delete p_lexer;
  return result;
}

double Parser::assign_expr()
{
  Token t = p_lexer->get_current_token();
  std::string text = p_lexer->get_token_text();

  double result = add_expr();

  if (p_lexer->get_current_token() == Token::Assign) {
    if (t != Token::Id)
      throw Syntax_error{"target of assignment must be an identifier"};

    if (text == "pi" || text == "e")
      throw Syntax_error{"attempt to modify the constant " + text};

    p_lexer->advance();
    return symbol_table[text] = add_expr();
  }

  return result;
}

double Parser::add_expr()
{
  double result = mul_expr();

  for (;;) {
    switch (p_lexer->get_current_token()) {
    case Token::Plus:
      p_lexer->advance();
      result += mul_expr();
      break;
    case Token::Minus:
      p_lexer->advance();
      result -= mul_expr();
      break;
    default:
      return result;
    }
  }
}

double Parser::mul_expr()
{
  double result = pow_expr();
  double x;

  for (;;) {
    switch (p_lexer->get_current_token()) {
    case Token::Mul:
      p_lexer->advance();
      result *= pow_expr();
      break;
    case Token::Div:
      p_lexer->advance();
      x = pow_expr();
      if (x == 0)
        throw Runtime_error{"attempt to divide by zero"};
      result /= x;
      break;
    case Token::Mod:
      p_lexer->advance();
      x = pow_expr();
      if (x == 0)
        throw Runtime_error{"attempt to divide by zero"};
      result = fmod(result, x);
      break;
    default:
      return result;
    }
  }
}

double Parser::pow_expr()
{
  double result = unary_expr();

  if (p_lexer->get_current_token() == Token::Pow) {
    p_lexer->advance();
    double x = unary_expr();
    check_domain(result, x);
    return pow(result, x);
  }

  return result;
}

double Parser::unary_expr()
{
  switch (p_lexer->get_current_token()) {
  case Token::Plus:
    p_lexer->advance();
    return +primary();
  case Token::Minus:
    p_lexer->advance();
    return -primary();
  default:
    return primary();
  }
}

double Parser::primary()
{
  std::string text = p_lexer->get_token_text();
  double arg;

  switch (p_lexer->get_current_token()) {
  case Token::Id:
    p_lexer->advance();
    return symbol_table[text];
  case Token::Number:
    p_lexer->advance();
    return to_number(text);
  case Token::Lp:
    p_lexer->advance();
    arg = add_expr();
    if (p_lexer->get_current_token() != Token::Rp)
      throw Syntax_error{"missing ) after subexpression"};
    p_lexer->advance();
    return arg;
  case Token::Sin:
    return sin(get_argument());
  case Token::Cos:
    return cos(get_argument());
  case Token::Tan:
    arg = get_argument();
    if (cos(arg) == 0)
      throw Runtime_error{"invalid argument to tan: " + to_string(arg)};
    return tan(arg);
  case Token::Asin:
    return asin(get_argument());
  case Token::Acos:
    return acos(get_argument());
  case Token::Atan:
    return atan(get_argument());
  case Token::Log:
    arg = get_argument();
    if (arg <= 0)
      throw Runtime_error{"invalid argument to log: " + to_string(arg)};
    return log(arg);
  case Token::Exp:
    return exp(get_argument());
  case Token::Log10:
    arg = get_argument();
    if (arg <= 0)
      throw Runtime_error{"invalid argument to log10: " + to_string(arg)};
    return log10(arg);
  case Token::Exp10:
    return pow(10.0, get_argument());	// exp10() is a GNU extension; pow() is portable.
  case Token::Sqrt:
    arg = get_argument();
    if (arg < 0)
      throw Runtime_error{"attempt to take square root of negative number"};
    return sqrt(arg);
  case Token::Int:
    arg = get_argument();
    if (arg < 0)
      return ceil(arg);
    else
      return floor(arg);
  default:
    throw Syntax_error{"invalid primary expression"};
  }
}

void Parser::check_domain(double x, double y)
{
  // There is no error if x is nonnegative.
  if (x >= 0) return;

  // There is no error unless 0 < abs(y) < 1.
  double e = std::abs(y);
  if (e <= 0 || e >= 1) return;

  // We have an error.
  throw Runtime_error{"attempt to take root of a negative number"};
}

double Parser::get_argument()
{
  p_lexer->advance();
  if (p_lexer->get_current_token() != Token::Lp)
    throw Syntax_error{"missing ( after function name"};
  p_lexer->advance();
  double arg = add_expr();
  if (p_lexer->get_current_token() != Token::Rp)
    throw Syntax_error{"missing ) after function argument"};
  p_lexer->advance();
  return arg;
}

int main()
{
  Parser parser;

  std::cout.precision(12);

  while (std::cin) {
    std::string s;
    std::getline(std::cin, s);
    if (!std::cin || s == "quit") break;
    try {
      std::cout << parser(s) << '\n';
    } catch(const Lexical_error& e) {
      std::cerr << "Lexical error: " << e << '\n';
    } catch(const Syntax_error& e) {
      std::cerr << "Syntax error: " << e << '\n';
    } catch(const Runtime_error& e) {
      std::cerr << "Runtime error: " << e << '\n';
    }
  }
}