senioria

The problem: expression parsing with operator precedence

Human-readable languages tends to have a lot of infix operators with various precedences. In CFG, this corresponds to an annoying list of different rules, e.g.:

add : add '+' mul | add '-' mul | mul
mul : mul '*' atom | mul '/' atom | mul '%' atom | atom
// etc.

With good error recovery necessary, normal parsing methods would need to repeat the rules. Assuming it's possible to collapse the operators, it's still necessary to name each rules, and repeat the names three more times: for recursion and in the operation of one level lower priority as operand and fallback.

To make things worse, since operations like subtraction are not associative, left-associative operators (i.e. the leftmost subexpression is evaluated first) is inevitable, which also requires special handling in parser generation methods like PEG or LL, and naively implemented parser combinators.

So there's need for a better parsing strategy, which allows us to write less rules, and handle the details in parsing automatically. One solution is Pratt parsing. It not only provides a strategy to easily designate the precedence and associativity of infix operators, but is also capable of processing both prefix and postfix operators, with a little extension.

The idea and an implementation for infix operators

The key idea of Pratt parsing is the binding power of the operators. In an expression of atoms and infix operators, the atoms can be seen to be separated by the operators, e.g. in 1 + 2 * 3 - 4, each operators have exactly two atoms at both end of it, and the atom must be at the end of the subexpression, which is an operand of the operator.

In other words, the operands of all operators are “bind” to their ends. Thus we can imagine that in case when two operators are “fighting for” the same operand, the operator binding tighter wins, and associate with the operand first, then the second operator. E.g. in 1 + 2 * 3 - 4, * wins in both ends, and gets 2 * 3 forms a subexpression first, then + and -, and in 1 + 2 * 3 ^ 4, * loses to ^, but wins +, so 3 ^ 4 gets bound first, then 2 * (3 ^ 4), + comes the last.

That's the idea: designate each operator with a binding power, and when two operators fight, make the one with the higher binding power win.

Surprisingly, this mechanic can also be used to deal with associativity. As a clarification, associativity only applies when two operators of the same precedence chains, e.g. in 1 * 2 + 3 * 4, the * wouldn't be burdened with associativity, since they binds with their atoms eagerly, and don't fight with each other, but in 1 + 2 * 3 - 4, + and - would fight with each other, thus their associativity applies.

The solution is the same: since associativity involves determining the operator at which side wins, we can designate the operators' ends with different binding power, thus operators with the right end a stronger binding power would have the operator at the left win, and therefore is left associative, and vice versa.

Now it's time for the implementation. There's a straightforward introduction and recursive implementation in this blog (which may be better than this one x x). Here we give an implementation by stack and a loop, with the chumsky parser combinator library, and we'll explain the algorithm later:

// `Oper` is the type of operators, `Expr` is the type of expressions.
// `atom` is the parser to parse an atom, and `oper` is the parser to parse an operator.
enum State {
    Expr,
    Oper,
}
custom(move |inp| {
    let mut ops: Vec<Oper> = vec![];
    let mut exprs = vec![];
    let mut state = State::Expr;
    loop {
        // Read the next token according to the state
        let Some(op) = (match state {
            State::Expr => {
                state = State::Oper;
                exprs.push(inp.parse(&atom)?);
                continue;
            }
            State::Oper => {
                state = State::Expr;
                inp.parse(oper.or_not())?
            }
        }) else {
            break;
        };
        // Bind the previous operators
        let cp = op.bindpow().0;
        while let Some(pre) = ops.last().copied() {
            let pp = pre.bindpow().1;
            if pp < cp {
                break;
            }
            ops.pop();
            // there's no better way of popping multiple elements from a Vec in Rust ><
            let Some((r, l)) = (try { (exprs.pop()?, exprs.pop()?) }) else {
                unreachable!()
            };
            exprs.push(Expr::Op(pre, Box::new((l, r))));
        }
        ops.push(op);
    }
    // Finally: collect the rest items in the stack into an expression.
    // Since they didn't get bound before, their precedence must be in ascending order,
    // binding reversly is correct.
    let Some(mut res) = exprs.pop() else {
        unreachable!()
    };
    while let Some(op) = ops.pop() {
        let Some(v) = exprs.pop() else { unreachable!() };
        res = Expr::Op(op, Box::new((v, res)));
    }
    Ok(res)
})

Here we have maintain a state machine, which has the following states: the initial state, the previous item being an operator, the previous item being an expression, and the final state. Since we are parsing a sequence with atoms first, and operator and atom alternating, the item to be expected at each state is certain. Since the states are only used to fetch the items, we merged the states.

Now that we have the item sequence, we need to compose the sequence into the result expression.

Consider the information we want and we have: we want to know when to merge the expressions around an operator; and at each step, when we get an atom, we get nothing, since we already know it would be there; and for each operator, we can only know whether it can bind the expressions around it when the other two operators around it are known. Luckily, when we get an operator, we've already known all operators on its left, so we only need to get the operator on its right to know if the operator can be bound.

At this step we can have a figure about the algorithm: when we get an operator, check if the operator on its left can be bound, and regardless of whether the last operator binds, store the current operator and wait for the next operator. When we should store multiple operators, it's obvious that their binding powers should be ascending, since if not so, for the decreasing operator, all previous operators with stronger binding power should be bound before it.

Thus in the rest of our algorithm, we maintain a stack for the operators and expressions, ensure the ascendance of the operators before pushing, and finally bind all the rest operators in the stack: since they're ascending, binding them from later to earlier is correct.

That's all.

Patches for prefix and postfix operators

Despite being true in most languages, it's false that prefix and postfix operators must have higher precedence than infix operators. A seemingly ridiculous but applicable example is the if expr then expr else expr structure, where if can be a prefix operator, and then and else are postfix, such that the whole structure is actually parsed as (((if expr) then expr) else expr).

Our analysis above still applies here: a prefix operator must fight with the infix operator after it, before it can bind to its operand, due to the precedence rules, that's also true for postfix operators, but the directions are reversed. Thus for prefix operators, we only need to wait for the next operator and compare their binding powers, as what we've done for infix operators; the only difference is that instead of popping two expressions, we only need to pop one element to bind to the prefix operator. For postfix operators, things are better —– they can be bound instantly when read, without entering the stack.

Conveniently, we can note that in an AST with reasonably design, whether a operator is prefix, infix or postfix is inherent, thus we can just ask for themselves and store them in the same stack:

custom(move |inp| {
    let mut ops: Vec<Oper> = vec![];
    let mut exprs = vec![];
    let mut state = State::Expr;
    loop {
        let Some(op) = (match state {
            // Prefixes are processed like exprs, except that they don't change the state.
            // Also, they should be pushed to diff stacks.
            State::Expr => {
                match inp.parse(&atom)? {
                    Atom::Expr(e) => {
                        state = State::Expr;
                        exprs.push(e);
                    }
                    Atom::Op(o) => ops.push(o),
                }
                continue;
            }
            State::Oper => {
                state = State::Expr;
                inp.parse(oper.or_not())?
            }
        }) else {
            break;
        };
        let cp = op.bindpow().0;
        while let Some(pre) = ops.last().copied() {
            let pp = pre.bindpow().1;
            if pp < cp {
                break;
            }
            ops.pop();
            // Prefix opers are checked and bound like infix opers,
            // only note that their operand numbers diff
            if pre.is_bin() {
                let Some((r, l)) = (try { (exprs.pop()?, exprs.pop()?) }) else {
                    unreachable!()
                };
                exprs.push(Expr::BinOp(pre, Box::new((l, r))));
            } else {
                let Some(v) = exprs.pop() else {
                    unreachable!()
                };
                exprs.push(Expr::UnOp(pre, Box::new(v)));
            }
        }
        // With higher precedence opers bound, postfix opers can safely bind now
        if op.is_postfix() {
            let Some(v) = exprs.pop() else {
                unreachable!()
            };
            exprs.push(Expr::UnOp(op, Box::new(v)));
        } else {
            ops.push(op);
        }
    }
    let Some(mut res) = exprs.pop() else {
        unreachable!()
    };
    while let Some(op) = ops.pop() {
        // The same as above, process infix and prefix opers diffly
        if op.is_bin() {
            let Some(v) = exprs.pop() else {
                unreachable!()
            };
            res = Expr::BinOp(op, Box::new((v, res)));
        } else {
            res = Expr::UnOp(op, Box::new(res));
        }
    }
    Ok(res)
})

Yep, this is a quite complicated title, so let's illustrate our work in this article with a simple example, in Rust in a different world:

trait DataCap {
    fn serialize(&self) -> String;
}

impl<T: Serialize> DataCap for T {
    fn ser(&self) -> String {
        Serializer::serialize(self, serializer)
    }
}
impl<T: !Serialize> DataCap for T {
    fn ser(&self) -> String {
        unreachable!()
    }
}

Sadly this is not accepted by Rust currently:

  • Specialization is proven unsound, thus would never be stabilized and is not considered;
  • The minimal version of specialization which requires the impls strictly subsetting doesn't apply to this case;
  • Not to mention that negative bounds are far from being added to Rust;
  • Turning to decide the impl during instantiation is against Rust's early error spirit and would never appear in Rust;
  • The lean4 way which just gives up when conflicts and let users to explicit decide which impl should be used is good, but may be too far for industrial langs like Rust :)

Therefore our workaround is to mark whether the trait is implemented, and make a wrapper to unify the impls:

use std::marker::PhantomData;

trait Serialize {
    fn serialize<S>(&self, s: S) -> String;
}

impl Serialize for i32 {
    fn serialize<S>(&self, _s: S) -> String { todo!() }
}

pub trait SerFlag {}
pub struct WithSer;
pub struct NoSer;
impl SerFlag for WithSer {}
impl SerFlag for NoSer {}

// User should impl this
trait DataCap {
    type HasSer: SerFlag;
}
trait DataSer {
    fn serialize(&self) -> String;
}

struct Wrap<T, S: SerFlag>(T, PhantomData<S>);

impl<T: DataCap<HasSer = WithSer> + Serialize> DataSer for Wrap<T, WithSer> {
    fn serialize(&self) -> String { self.0.serialize(()) }
}
impl<T: DataCap<HasSer = NoSer>> DataSer for Wrap<T, NoSer> {
    fn serialize(&self) -> String { "".to_string() }
}

fn use_data<T: DataCap>(data: T) where Wrap<T, T::HasSer>: DataSer {
    let data = Wrap(data, PhantomData::<T::HasSer>);
    data.serialize();
}

impl DataCap for i32 {
    type HasSer = WithSer;
}
impl DataCap for () {
    type HasSer = NoSer;
}

fn main() {
    use_data(0);
    use_data(());
}

This may originate from a quite sophisticated case: we have a list of data, and we want to transform it with a function that may invoke indefinite amount of requests to an web API with rate limit, and is vulnerable of errors. Our goal is to complete as more requests at a time, and invoke as less requests when an error occurs.

Since the transform function is of synchronous logic, i.e. a next request wouldn't be raised until the last request got its response, which prevents us from directly paralleling all requests, which is in fact the default strategy of all request lib, and easy to adjust to fit the rate limit, our parallelism can only take place between different calls to the function, which incurs batching the functions and sync the states like current API usage and errors between the site performing the requests and the site calling the functions.

A natural idea is to bind the two sites, i.e. let the calling site actually call the web API, and the function only returns its arguments. This inverses the common dependency order of the functions. In most languages, such inversion is impossible without coroutines. The resulting architecture is like: the controller polls the coroutines in batch, invoking the requests, providing the results, and fitting other requirements like process the quota or handle the errors.

An implementation in ts may be like:

const data = ...;

const reqlist = [];
let datadone = false;
const poll_co = async ([args, co]) => {
    const { done, value } = co.next(await /* call the API with args */);
    if (!done) reqlist.push([value, co]);
};
do {
    const ps = [];
    for (const _ of Array.from({ length: batchsize })) {
        const req = reqlist.shift();
        if (req) {
            ps.push(poll_co(req));
        } else if (!datadone) {
            const { done, value: co } = data.next();
            if (done) {
                datadone = true;
                break;
            }
            const { done: noreq, value: args } = co.next();
            if (noreq) continue;
            ps.push(poll_co([args, co]));
        } else break;
    }
    try {
        await Promise.all(ps);
    } catch (e) {
        // Process the error
    }
    // Just for simplicity, since we implemented poll_co in this way :)
    await new Promise(res => setTimeout(res, 60000));
} while (reqlist.length != 0);

The idea of the whole article is quite simple... Since with reasonable designs, future is just a special kind of coroutine, and the conversion between coroutine functions and async functions is usually not complex, just incurs some keyword replacement and syntax tweaks. But this is still inevitable if we're refactoring an implement serially execute the transform functions, even for languages like C++ which has a unified coroutine/async system, that the functions alone the whole call chain should be replaced with a new signature, because the type of the function is changed in deed. The actual web API calls are doomed to be replaced one-by-one.

But if the language is effect-based and with a reasonable type system to implicitly propagate the effects (which doesn't exist at all XD), we can omit the chain-replacing step, only have the call-replacing step and the controller part, since the added effect would just implicitly infect the functions alone the chain.

Well... So this is only an article showcasing one of the benefits of algebra effects: for implicitly-propagated effects, only the use site and the handle site needs to be focused, all other layers would automatically adapt to them.

Well... Since the title was in English...

Let's skip all the common preambles illustrating the importance, etc. of partial evaluation, and just say we want to write a partial evaluator for a procedural language, which Senioria didn't find any articles introducing many details. Note that this is only a personal note, and the algorithm here may not be the optimal :)

The language has the following characteristics:

  • Functions don't take closures, or, in other words, closures are passed to functions as an implicit argument;
  • Instructions are grouped in basic blocks, thus all branch instructions may only occur at the end of a basic block, and each block is terminated with a branch instruction.

Concepts and basic partial evaluation

We all know how to write an interpreter, right? And yes, this is how a partial evaluator works: just evaluate the instructions, record all variables (here, the results of each instruction), and generate corresponding instructions when the result of the instruction may only be known at runtime, i.e. the instruction must be generated. Other optimizations may also apply, like pattern match the instruction or a sequence of instructions and replace them with a more efficient version, or postpone the instruction generation, and only write the values used by the control flow and side effect-generating instructions into the result (specialized) function to do dead code elimination.

Here, we call the dictionary recording all variables “poly”, which is an abbreviation of polyvariant, which... means that multiple versions of the function is generated, or more precisely, specialized. Not a good name, frankly.

Let's see an example of our partial evaluator at this point, for the following function:

int f(int n) {
entry:
  int i = 5;
  goto if_cond
if_cond:
  bool _0 = i < n;
  if _0 goto if_then else goto if_else;
if_then:
  println("lt");
  int i1 = i + 1;
  goto if_end;
if_else:
  println("ge");
  int i1 = i + 2;
  goto if_end;
if_end:
  int i2 = i1 + n;
  return i2;
}

With n = 1, our partial evaluator would be working like this:

int f_1() {  // Specializing the function, poly: { n = 1 }
entry:
  int i = 5;  // poly: { n = 1, i = 5 }
  goto if_cond;  // poly unchanged, emit the jump, see our introduction below
if_cond:
  bool _0 = false;  // poly: { n = 1, i = 5, _0 = false }
  goto if_else;  // poly unchanged, since the result of the branch is known, just jump to the desired block
if_else:
  println("ge");  // (Same as above, idem for the followings (x)
  int i1 = 7;
  goto if_end;
if_end:
  int i1 = 8;
  return 8;
}

If this is not good enough, by not emitting the constant assignments and a pass that we would introduce later, the resulting function would be like:

int f_1() {
entry:
  println("ge");
  return 8;
}

And with n unknown, the result would be like:

int f(int n) {  // poly: { n = (arg) }
entry:
  int i = 5;  // poly: { i = 5, n = (arg) }
  goto if_cond
if_cond:
  bool _0 = i < n;
  if _0 goto if_then else goto if_else;  // Unable to determine, generate both branches
if_then:
  println("lt");
  int i1 = i + n;
  goto if_end_1;
if_else:
  println("ge");
  int i1 = i + 2;
  goto if_end_2;
if_end_1:
  int i2 = i1 + n;
  return i2;
if_end_2:
  int i2 = i1 + n;
  return i2;
}

Here we choose to keep the unconditional jumps, since they can be eliminated easily by adding another pass pasting the target block to the position of the jump and eliminating all blocks not referenced by any jumps after pasting —– a pass requiring almost no analysis but translating natural language. And by keeping the blocks, we may eliminate redundant specializations by only emitting the jump and not performing the evaluation when we found ourselves jumping to a same block with the same poly, or we may record the source block in the poly, thus we can directly say “when we found we are jumping to the same poly”.

In the example with unknown n, we generated two different blocks for the same source block if_end. This can simplify some analysis since all variables have only one possible value. Some implements would be in favor of keeping the resulting blocks one-to-one to the source blocks as possible, and only record the variables as dynamic, or record all possible values of the variables. We don't like this because this would reduce the optimization potential by losing information by only record variables as dynamic, or be complex by requiring to order the blocks to evaluate all predecessors of a block before evaluating the block. If we want to reduce the generated code size, there are some more analysis we can add, e.g. collect the values that may be used by each block and their successors, and remove all unused variables from the poly, thus we can avoid emitting the same block twice just because an unrelated value in the control flow, or scan up from the bottom of each block and collect some or all reoccurring instruction sequences into one block, depending on how compact we want the code to be.

Loops

At the point, our strategy works well: constant values are propagated to eliminate unused branches and generate appropriate specialized instructions; what's better, many loops are automatically unrolled completely[^1], some loops have dynamic conditions and can't be unrolled completely, but after one iteration, the body finds itself jumping to an evaluated poly, and the evaluator continues happily. But a simple program would drive our evaluator into infinite loop:

[^1]: Though may be overly aggressive, that we may need to add one more pass to reroll the repeating instructions back into a loop, or add some analysis to process big loops differently

int *gv;
int f(int n) {
  for (int i = 0; i < n; ++i)
    gv[i] = 0;
}

As we can see, i changes every iteration, so our evaluator tries to evaluate all the iterations, but the condition is dynamic, which means on the static perspective, there are infinite possible iteration to be generated, Oops.

We may come up with some simple solution: store both the constant value and the instruction returning the value, and ignore the constant value when deduplicating blocks, or only record the instructions. Great, now we are generating gv[0] = 0 for every iteration, or give up unrolling constant loops.

Well... Other partial evaluators seem not to be bothered by the problem, because one of these applies:

  • We just don't know them;
  • They are partial evaluating some language without loops, so they have other problems to face :);
  • They are evaluating some higher-level, structural IR, where they can directly access the “header” of the loop, and determine ad-hoc whether the loop can be unrolled, and use the correct strategy.

After all, in our IR, we should process non-unrollable loops specially.

First we should know whether we are in an non-unrollable loop. Our solution is performing an analysis to obtain all loops and their containing blocks first, then when we encounter a conditional jump, check if we are at the head block of a loop, and if so, pollute the poly to make every variable that is read in the loop and written in both the loop body and before the head block dynamic, and reevaluate the head block with the polluted poly. This way the loop header is adjusted not specialized with values in the first iteration of variables varying in the loop. And since this only applies to conditional jumps, unrollable loops are not affected because their jumps in the header can be specialized to an unconditional jump.

This is enough for code-generation of the loop, since all variables both changed in and read by the loop is correctly set dynamic, and we would only need to repeat the story on the loop once, because when we encounter the header and the body the second time, we are with a different poly, in which there are variables in the loop body, and then after the second time the body is evaluated, the evaluator would recognize the fact that the incoming header block is evaluated in the second iteration, and end the evaluation to the loop.

Forcefully pollute all read and written variables may be rough, and we can do better. There is an instruction called phi, which yields a value depending on from which block the control flow gets to the phi instruction. With this instruction, we can transform the program to SSA form, i.e. all variables are statically written once, by replacing all reassignments with a new variable, and when there are different possible values of a variable, inserting a phi instruction, since they must be from the different block, one block a different value respectively. With SSA form, we can keep the phi instructions, which normally would be eliminated since in both actual evaluation and partial evaluation, the actual control flow can only go along one path, determining the value of the phi instruction, in the reevaluation of the header block, and mark the value as dynamic; since all variables in the SSA form must be assigned before, this can ensure all variables both changed in the loop and used by the loop are set to the correct source —– the phi instruction. This transformation can be more accurate, and may be more universal.

With this strategy, we must memorize the arms of every phi instruction in the header block of the loop, and fill in the correct value when the control flow reenters the loop. And then we found another problem: the poly would be different after evaluating the loop body, so without special process, we would be attempting to create a new header block, and repeat the iteration once; this is OK when we are free to assign, but the first phi instruction would be left with only one value, which is not allowed in some architectures like LLVM. To reuse the header block we just generated, we can record in the poly a path of dynamic loops, push a loop into it when we found ourselves encounter a dynamic loop at the conditional branch, and pop the loops we are leaving at other jumps; when we are going to jump, check if we are jumping to the header of current loop, and if so, fill in the phi arms and reuse the generated block.

That's all of our strategy for partial evaluating non-unrollable loops. This is Senioriae source code handling branches and phis:

void PeInstVisitor::visitPHINode(llvm::PHINode& inst)
{
  if (poly.inloop) {
    auto val = inst.clone();
    auto phiv = llvm::dyn_cast<llvm::PHINode>(val);
    val->setName(inst.getName());
    val->insertInto(poly.insbb, poly.insbb->end());
    PhiPatch patch{ phiv, {} };
    for (size_t i = 0; i < phiv->getNumIncomingValues(); ++i) {
      patch.arms.emplace(phiv->getIncomingBlock(i), phiv->getIncomingValue(i));
    }
    bbphi[poly.insbb].push_back(std::move(patch));
    while (phiv->getNumIncomingValues())
      phiv->removeIncomingValue(unsigned(0), false);
    auto curv = inst.getIncomingValueForBlock(poly.from);
    phiv->addIncoming(poly.get(curv), poly.genfrom);
    poly.set(&inst, val);
  } else
    poly.set(poly.ip, inst.getIncomingValueForBlock(poly.from));
  state.bfsq.push_back(poly.next());
}


void PeInstVisitor::visitBranchInst(llvm::BranchInst& inst)
{
  auto brto = [&](PePoly& poly, llvm::BasicBlock* dst, bool lastjmp = true) {
    auto path = lastjmp
                  ? poly.path
                  : std::make_shared<std::vector<LoopPathItem>>(*poly.path);
    auto src = inst.getParent();
    PePoly res{
      poly.env, poly.base, &*dst->begin(), src, nullptr, poly.insbb, dst, path,
    };
    // Pop the loops we are leaving (or in other words, not jumping into)
    while (!path->empty() && !loops.getLoopFor(path->back().src)->contains(dst))
      path->pop_back();
    // Br to the cond block of the loop
    auto reloop =
      !path->empty() && dst == path->back().src ? path->back().gen : nullptr;
    if (reloop) {
      for (auto& patch : bbphi[reloop]) {
        patch.node->addIncoming(poly.get(patch.arms.at(poly.srcbase)),
                                poly.insbb);
      }
    } else {
      res.env = std::make_shared<llvm::ValueToValueMap>(*poly.env);
      res.base = std::make_shared<llvm::ValueToValueMap>(*poly.env);
    }
    // If new block: create the block
    auto blkit = polybb.find(res);
    if (!reloop && blkit == polybb.end()) {
      res.insbb = llvm::BasicBlock::Create(ctx, dst->getName());
      res.insbb->insertInto(effect);
      polybb.emplace(res, res.insbb);
      bbpoly.emplace(dst, res);
    }
    return std::make_tuple(res,
                           reloop                  ? reloop
                           : blkit != polybb.end() ? blkit->second
                                                   : res.insbb);
  };
  // Get the target bbs
  llvm::BasicBlock* thenbb = inst.getSuccessor(0);
  llvm::BasicBlock* elsebb = nullptr;
  auto cond = inst.isConditional() ? poly.get(inst.getCondition()) : nullptr;
  if (inst.isConditional()) {
    auto cc = cond ? llvm::dyn_cast<llvm::ConstantInt>(cond) : nullptr;
    if (cc && cc->isZero())
      thenbb = nullptr;
    if (!cc || cc->isZero())
      elsebb = inst.getSuccessor(1);
  }
  // Unconditional
  if (thenbb == nullptr || elsebb == nullptr) {
    auto dst = (thenbb != nullptr ? thenbb : elsebb);
    auto [sub, dstbb] = brto(poly, dst);
    llvm::BranchInst::Create(dstbb)->insertInto(poly.insbb, poly.insbb->end());
    if (sub.insbb)
      state.bfsq.push_back(sub);
    return;
  }
  // In loop: switch to loop mode and regenerate the block
  if (loops.isLoopHeader(inst.getParent()) && !poly.inloop) {
    auto next = poly;
    next.env = std::make_shared<llvm::ValueToValueMap>(*poly.env);
    next.ip = &*inst.getParent()->begin();
    next.inloop = true;
    next.path->push_back({ inst.getParent(), next.insbb });
    // Require an rerun
    polybb.emplace(next, next.insbb);
    next.insbb->erase(next.insbb->begin(), next.insbb->end());
    state.bfsq.push_back(next);
    return;
  }
  // Conditional: copy the inst and insert
  auto val = insinst(&inst);
  auto brval = llvm::dyn_cast<llvm::BranchInst>(val);
  auto ping = [&](int idx) {
    auto [sub, dstbb] = brto(poly, inst.getSuccessor(idx), idx == 1);
    brval->setSuccessor(idx, dstbb);
    return sub;
  };
  auto lsub = ping(0);
  auto rsub = ping(1);
  if (lsub.insbb) {
    state.bfsq.push_back(lsub);
  }
  if (rsub.insbb) {
    state.bfsq.push_back(rsub);
  }
}

There are some problems left, like how to obtain the loops and how to transform the program to SSA. Senioria didn't really implement them, since they are already done by llvm, but the algorithms seem to work well.

For loop detection, we can first make the domination graph, a directed graph of blocks with blocks as nodes and jumps in the blocks as edges. Then we can DFS on the graph, maintaining a stack of visited nodes, and when we visit a visited node N, add all nodes in the stack above N into the loop with header N. We can safely assume that a block an only be the header of one node here, since we can't do any better without better specialization technique.

For SSA transformation, we can iterate the blocks in topological order, ignoring edges from a loop body to its header when entering the header, replacing every reassignment with a new variable, and replace succeeding accesses to the assigned variable with the new variable, and create a phi instruction at the beginning of a block for potentially used variables in the block.

Further steps

We haven't discussed how to process functions yet, which is also crucial, and is at least the same complex as processing loops. Well, Senioria am lazy :)