A brief intro to Pratt parsers with Rust
The problem: expression parsing with operator precedence
Human-readable languages tends to have a lot of infix operators with various precedences. In CFG, this corresponds to an annoying list of different rules, e.g.:
add : add '+' mul | add '-' mul | mul
mul : mul '*' atom | mul '/' atom | mul '%' atom | atom
// etc.
With good error recovery necessary, normal parsing methods would need to repeat the rules. Assuming it's possible to collapse the operators, it's still necessary to name each rules, and repeat the names three more times: for recursion and in the operation of one level lower priority as operand and fallback.
To make things worse, since operations like subtraction are not associative, left-associative operators (i.e. the leftmost subexpression is evaluated first) is inevitable, which also requires special handling in parser generation methods like PEG or LL, and naively implemented parser combinators.
So there's need for a better parsing strategy, which allows us to write less rules, and handle the details in parsing automatically. One solution is Pratt parsing. It not only provides a strategy to easily designate the precedence and associativity of infix operators, but is also capable of processing both prefix and postfix operators, with a little extension.
The idea and an implementation for infix operators
The key idea of Pratt parsing is the binding power of the operators. In an expression of atoms and infix operators, the atoms can be seen to be separated by the operators, e.g. in 1 + 2 * 3 - 4
, each operators have exactly two atoms at both end of it, and the atom must be at the end of the subexpression, which is an operand of the operator.
In other words, the operands of all operators are “bind” to their ends. Thus we can imagine that in case when two operators are “fighting for” the same operand, the operator binding tighter wins, and associate with the operand first, then the second operator. E.g. in 1 + 2 * 3 - 4
, *
wins in both ends, and gets 2 * 3
forms a subexpression first, then +
and -
, and in 1 + 2 * 3 ^ 4
, *
loses to ^
, but wins +
, so 3 ^ 4
gets bound first, then 2 * (3 ^ 4)
, +
comes the last.
That's the idea: designate each operator with a binding power, and when two operators fight, make the one with the higher binding power win.
Surprisingly, this mechanic can also be used to deal with associativity. As a clarification, associativity only applies when two operators of the same precedence chains, e.g. in 1 * 2 + 3 * 4
, the *
wouldn't be burdened with associativity, since they binds with their atoms eagerly, and don't fight with each other, but in 1 + 2 * 3 - 4
, +
and -
would fight with each other, thus their associativity applies.
The solution is the same: since associativity involves determining the operator at which side wins, we can designate the operators' ends with different binding power, thus operators with the right end a stronger binding power would have the operator at the left win, and therefore is left associative, and vice versa.
Now it's time for the implementation. There's a straightforward introduction and recursive implementation in this blog (which may be better than this one x x). Here we give an implementation by stack and a loop, with the chumsky parser combinator library, and we'll explain the algorithm later:
// `Oper` is the type of operators, `Expr` is the type of expressions.
// `atom` is the parser to parse an atom, and `oper` is the parser to parse an operator.
enum State {
Expr,
Oper,
}
custom(move |inp| {
let mut ops: Vec<Oper> = vec![];
let mut exprs = vec![];
let mut state = State::Expr;
loop {
// Read the next token according to the state
let Some(op) = (match state {
State::Expr => {
state = State::Oper;
exprs.push(inp.parse(&atom)?);
continue;
}
State::Oper => {
state = State::Expr;
inp.parse(oper.or_not())?
}
}) else {
break;
};
// Bind the previous operators
let cp = op.bindpow().0;
while let Some(pre) = ops.last().copied() {
let pp = pre.bindpow().1;
if pp < cp {
break;
}
ops.pop();
// there's no better way of popping multiple elements from a Vec in Rust ><
let Some((r, l)) = (try { (exprs.pop()?, exprs.pop()?) }) else {
unreachable!()
};
exprs.push(Expr::Op(pre, Box::new((l, r))));
}
ops.push(op);
}
// Finally: collect the rest items in the stack into an expression.
// Since they didn't get bound before, their precedence must be in ascending order,
// binding reversly is correct.
let Some(mut res) = exprs.pop() else {
unreachable!()
};
while let Some(op) = ops.pop() {
let Some(v) = exprs.pop() else { unreachable!() };
res = Expr::Op(op, Box::new((v, res)));
}
Ok(res)
})
Here we have maintain a state machine, which has the following states: the initial state, the previous item being an operator, the previous item being an expression, and the final state. Since we are parsing a sequence with atoms first, and operator and atom alternating, the item to be expected at each state is certain. Since the states are only used to fetch the items, we merged the states.
Now that we have the item sequence, we need to compose the sequence into the result expression.
Consider the information we want and we have: we want to know when to merge the expressions around an operator; and at each step, when we get an atom, we get nothing, since we already know it would be there; and for each operator, we can only know whether it can bind the expressions around it when the other two operators around it are known. Luckily, when we get an operator, we've already known all operators on its left, so we only need to get the operator on its right to know if the operator can be bound.
At this step we can have a figure about the algorithm: when we get an operator, check if the operator on its left can be bound, and regardless of whether the last operator binds, store the current operator and wait for the next operator. When we should store multiple operators, it's obvious that their binding powers should be ascending, since if not so, for the decreasing operator, all previous operators with stronger binding power should be bound before it.
Thus in the rest of our algorithm, we maintain a stack for the operators and expressions, ensure the ascendance of the operators before pushing, and finally bind all the rest operators in the stack: since they're ascending, binding them from later to earlier is correct.
That's all.
Patches for prefix and postfix operators
Despite being true in most languages, it's false that prefix and postfix operators must have higher precedence than infix operators. A seemingly ridiculous but applicable example is the if expr then expr else expr
structure, where if
can be a prefix operator, and then
and else
are postfix, such that the whole structure is actually parsed as (((if expr) then expr) else expr)
.
Our analysis above still applies here: a prefix operator must fight with the infix operator after it, before it can bind to its operand, due to the precedence rules, that's also true for postfix operators, but the directions are reversed. Thus for prefix operators, we only need to wait for the next operator and compare their binding powers, as what we've done for infix operators; the only difference is that instead of popping two expressions, we only need to pop one element to bind to the prefix operator. For postfix operators, things are better —– they can be bound instantly when read, without entering the stack.
Conveniently, we can note that in an AST with reasonably design, whether a operator is prefix, infix or postfix is inherent, thus we can just ask for themselves and store them in the same stack:
custom(move |inp| {
let mut ops: Vec<Oper> = vec![];
let mut exprs = vec![];
let mut state = State::Expr;
loop {
let Some(op) = (match state {
// Prefixes are processed like exprs, except that they don't change the state.
// Also, they should be pushed to diff stacks.
State::Expr => {
match inp.parse(&atom)? {
Atom::Expr(e) => {
state = State::Expr;
exprs.push(e);
}
Atom::Op(o) => ops.push(o),
}
continue;
}
State::Oper => {
state = State::Expr;
inp.parse(oper.or_not())?
}
}) else {
break;
};
let cp = op.bindpow().0;
while let Some(pre) = ops.last().copied() {
let pp = pre.bindpow().1;
if pp < cp {
break;
}
ops.pop();
// Prefix opers are checked and bound like infix opers,
// only note that their operand numbers diff
if pre.is_bin() {
let Some((r, l)) = (try { (exprs.pop()?, exprs.pop()?) }) else {
unreachable!()
};
exprs.push(Expr::BinOp(pre, Box::new((l, r))));
} else {
let Some(v) = exprs.pop() else {
unreachable!()
};
exprs.push(Expr::UnOp(pre, Box::new(v)));
}
}
// With higher precedence opers bound, postfix opers can safely bind now
if op.is_postfix() {
let Some(v) = exprs.pop() else {
unreachable!()
};
exprs.push(Expr::UnOp(op, Box::new(v)));
} else {
ops.push(op);
}
}
let Some(mut res) = exprs.pop() else {
unreachable!()
};
while let Some(op) = ops.pop() {
// The same as above, process infix and prefix opers diffly
if op.is_bin() {
let Some(v) = exprs.pop() else {
unreachable!()
};
res = Expr::BinOp(op, Box::new((v, res)));
} else {
res = Expr::UnOp(op, Box::new(res));
}
}
Ok(res)
})