Inverse request invocation with coroutines for parallelism

December 14, 2024

This may originate from a quite sophisticated case: we have a list of data, and we want to transform it with a function that may invoke indefinite amount of requests to an web API with rate limit, and is vulnerable of errors. Our goal is to complete as more requests at a time, and invoke as less requests when an error occurs.

Since the transform function is of synchronous logic, i.e. a next request wouldn't be raised until the last request got its response, which prevents us from directly paralleling all requests, which is in fact the default strategy of all request lib, and easy to adjust to fit the rate limit, our parallelism can only take place between different calls to the function, which incurs batching the functions and sync the states like current API usage and errors between the site performing the requests and the site calling the functions.

A natural idea is to bind the two sites, i.e. let the calling site actually call the web API, and the function only returns its arguments. This inverses the common dependency order of the functions. In most languages, such inversion is impossible without coroutines. The resulting architecture is like: the controller polls the coroutines in batch, invoking the requests, providing the results, and fitting other requirements like process the quota or handle the errors.

An implementation in ts may be like:

const data = ...;

const reqlist = [];
let datadone = false;
const poll_co = async ([args, co]) => {
    const { done, value } = co.next(await /* call the API with args */);
    if (!done) reqlist.push([value, co]);
};
do {
    const ps = [];
    for (const _ of Array.from({ length: batchsize })) {
        const req = reqlist.shift();
        if (req) {
            ps.push(poll_co(req));
        } else if (!datadone) {
            const { done, value: co } = data.next();
            if (done) {
                datadone = true;
                break;
            }
            const { done: noreq, value: args } = co.next();
            if (noreq) continue;
            ps.push(poll_co([args, co]));
        } else break;
    }
    try {
        await Promise.all(ps);
    } catch (e) {
        // Process the error
    }
    // Just for simplicity, since we implemented poll_co in this way :)
    await new Promise(res => setTimeout(res, 60000));
} while (reqlist.length != 0);

The idea of the whole article is quite simple... Since with reasonable designs, future is just a special kind of coroutine, and the conversion between coroutine functions and async functions is usually not complex, just incurs some keyword replacement and syntax tweaks. But this is still inevitable if we're refactoring an implement serially execute the transform functions, even for languages like C++ which has a unified coroutine/async system, that the functions alone the whole call chain should be replaced with a new signature, because the type of the function is changed in deed. The actual web API calls are doomed to be replaced one-by-one.

But if the language is effect-based and with a reasonable type system to implicitly propagate the effects (which doesn't exist at all XD), we can omit the chain-replacing step, only have the call-replacing step and the controller part, since the added effect would just implicitly infect the functions alone the chain.

Well... So this is only an article showcasing one of the benefits of algebra effects: for implicitly-propagated effects, only the use site and the handle site needs to be focused, all other layers would automatically adapt to them.

Partial evaluation for procedural, asm-like languages

July 2, 2024

~~Well... Since the title was in English...~~

Let's skip all the common preambles illustrating the importance, etc. of partial evaluation, and just say we want to write a partial evaluator for a procedural language, which Senioria didn't find any articles introducing many details. Note that this is only a personal note, and the algorithm here may not be the optimal :)

The language has the following characteristics:

Functions don't take closures, or, in other words, closures are passed to functions as an implicit argument;
Instructions are grouped in basic blocks, thus all branch instructions may only occur at the end of a basic block, and each block is terminated with a branch instruction.

Concepts and basic partial evaluation

We all know how to write an interpreter, right? And yes, this is how a partial evaluator works: just evaluate the instructions, record all variables (here, the results of each instruction), and generate corresponding instructions when the result of the instruction may only be known at runtime, i.e. the instruction must be generated. Other optimizations may also apply, like pattern match the instruction or a sequence of instructions and replace them with a more efficient version, or postpone the instruction generation, and only write the values used by the control flow and side effect-generating instructions into the result (specialized) function to do dead code elimination.

Here, we call the dictionary recording all variables “poly”, which is an abbreviation of polyvariant, which... means that multiple versions of the function is generated, or more precisely, specialized. Not a good name, frankly.

Let's see an example of our partial evaluator at this point, for the following function:

int f(int n) {
entry:
  int i = 5;
  goto if_cond
if_cond:
  bool _0 = i < n;
  if _0 goto if_then else goto if_else;
if_then:
  println("lt");
  int i1 = i + 1;
  goto if_end;
if_else:
  println("ge");
  int i1 = i + 2;
  goto if_end;
if_end:
  int i2 = i1 + n;
  return i2;
}

With n = 1, our partial evaluator would be working like this:

int f_1() {  // Specializing the function, poly: { n = 1 }
entry:
  int i = 5;  // poly: { n = 1, i = 5 }
  goto if_cond;  // poly unchanged, emit the jump, see our introduction below
if_cond:
  bool _0 = false;  // poly: { n = 1, i = 5, _0 = false }
  goto if_else;  // poly unchanged, since the result of the branch is known, just jump to the desired block
if_else:
  println("ge");  // (Same as above, idem for the followings (x)
  int i1 = 7;
  goto if_end;
if_end:
  int i1 = 8;
  return 8;
}

If this is not good enough, by not emitting the constant assignments and a pass that we would introduce later, the resulting function would be like:

int f_1() {
entry:
  println("ge");
  return 8;
}

And with n unknown, the result would be like:

int f(int n) {  // poly: { n = (arg) }
entry:
  int i = 5;  // poly: { i = 5, n = (arg) }
  goto if_cond
if_cond:
  bool _0 = i < n;
  if _0 goto if_then else goto if_else;  // Unable to determine, generate both branches
if_then:
  println("lt");
  int i1 = i + n;
  goto if_end_1;
if_else:
  println("ge");
  int i1 = i + 2;
  goto if_end_2;
if_end_1:
  int i2 = i1 + n;
  return i2;
if_end_2:
  int i2 = i1 + n;
  return i2;
}

Here we choose to keep the unconditional jumps, since they can be eliminated easily by adding another pass pasting the target block to the position of the jump and eliminating all blocks not referenced by any jumps after pasting —– a pass requiring almost no analysis but translating natural language. And by keeping the blocks, we may eliminate redundant specializations by only emitting the jump and not performing the evaluation when we found ourselves jumping to a same block with the same poly, or we may record the source block in the poly, thus we can directly say “when we found we are jumping to the same poly”.

In the example with unknown n, we generated two different blocks for the same source block if_end. This can simplify some analysis since all variables have only one possible value. Some implements would be in favor of keeping the resulting blocks one-to-one to the source blocks as possible, and only record the variables as dynamic, or record all possible values of the variables. We don't like this because this would reduce the optimization potential by losing information by only record variables as dynamic, or be complex by requiring to order the blocks to evaluate all predecessors of a block before evaluating the block. If we want to reduce the generated code size, there are some more analysis we can add, e.g. collect the values that may be used by each block and their successors, and remove all unused variables from the poly, thus we can avoid emitting the same block twice just because an unrelated value in the control flow, or scan up from the bottom of each block and collect some or all reoccurring instruction sequences into one block, depending on how compact we want the code to be.

Loops

At the point, our strategy works well: constant values are propagated to eliminate unused branches and generate appropriate specialized instructions; what's better, many loops are automatically unrolled completely[^1], some loops have dynamic conditions and can't be unrolled completely, but after one iteration, the body finds itself jumping to an evaluated poly, and the evaluator continues happily. But a simple program would drive our evaluator into infinite loop:

[^1]: Though may be overly aggressive, that we may need to add one more pass to reroll the repeating instructions back into a loop, or add some analysis to process big loops differently

int *gv;
int f(int n) {
  for (int i = 0; i < n; ++i)
    gv[i] = 0;
}

As we can see, i changes every iteration, so our evaluator tries to evaluate all the iterations, but the condition is dynamic, which means on the static perspective, there are infinite possible iteration to be generated, Oops.

We may come up with some simple solution: store both the constant value and the instruction returning the value, and ignore the constant value when deduplicating blocks, or only record the instructions. Great, now we are generating gv[0] = 0 for every iteration, or give up unrolling constant loops.

Well... Other partial evaluators seem not to be bothered by the problem, because one of these applies:

We just don't know them;
They are partial evaluating some language without loops, so they have other problems to face :);
They are evaluating some higher-level, structural IR, where they can directly access the “header” of the loop, and determine ad-hoc whether the loop can be unrolled, and use the correct strategy.

After all, in our IR, we should process non-unrollable loops specially.

First we should know whether we are in an non-unrollable loop. Our solution is performing an analysis to obtain all loops and their containing blocks first, then when we encounter a conditional jump, check if we are at the head block of a loop, and if so, pollute the poly to make every variable that is read in the loop and written in both the loop body and before the head block dynamic, and reevaluate the head block with the polluted poly. This way the loop header is adjusted not specialized with values in the first iteration of variables varying in the loop. And since this only applies to conditional jumps, unrollable loops are not affected because their jumps in the header can be specialized to an unconditional jump.

This is enough for code-generation of the loop, since all variables both changed in and read by the loop is correctly set dynamic, and we would only need to repeat the story on the loop once, because when we encounter the header and the body the second time, we are with a different poly, in which there are variables in the loop body, and then after the second time the body is evaluated, the evaluator would recognize the fact that the incoming header block is evaluated in the second iteration, and end the evaluation to the loop.

Forcefully pollute all read and written variables may be rough, and we can do better. There is an instruction called phi, which yields a value depending on from which block the control flow gets to the phi instruction. With this instruction, we can transform the program to SSA form, i.e. all variables are statically written once, by replacing all reassignments with a new variable, and when there are different possible values of a variable, inserting a phi instruction, since they must be from the different block, one block a different value respectively. With SSA form, we can keep the phi instructions, which normally would be eliminated since in both actual evaluation and partial evaluation, the actual control flow can only go along one path, determining the value of the phi instruction, in the reevaluation of the header block, and mark the value as dynamic; since all variables in the SSA form must be assigned before, this can ensure all variables both changed in the loop and used by the loop are set to the correct source —– the phi instruction. This transformation can be more accurate, and may be more universal.

With this strategy, we must memorize the arms of every phi instruction in the header block of the loop, and fill in the correct value when the control flow reenters the loop. And then we found another problem: the poly would be different after evaluating the loop body, so without special process, we would be attempting to create a new header block, and repeat the iteration once; this is OK when we are free to assign, but the first phi instruction would be left with only one value, which is not allowed in some architectures like LLVM. To reuse the header block we just generated, we can record in the poly a path of dynamic loops, push a loop into it when we found ourselves encounter a dynamic loop at the conditional branch, and pop the loops we are leaving at other jumps; when we are going to jump, check if we are jumping to the header of current loop, and if so, fill in the phi arms and reuse the generated block.

That's all of our strategy for partial evaluating non-unrollable loops. This is Senioriae source code handling branches and phis:

void PeInstVisitor::visitPHINode(llvm::PHINode& inst)
{
  if (poly.inloop) {
    auto val = inst.clone();
    auto phiv = llvm::dyn_cast<llvm::PHINode>(val);
    val->setName(inst.getName());
    val->insertInto(poly.insbb, poly.insbb->end());
    PhiPatch patch{ phiv, {} };
    for (size_t i = 0; i < phiv->getNumIncomingValues(); ++i) {
      patch.arms.emplace(phiv->getIncomingBlock(i), phiv->getIncomingValue(i));
    }
    bbphi[poly.insbb].push_back(std::move(patch));
    while (phiv->getNumIncomingValues())
      phiv->removeIncomingValue(unsigned(0), false);
    auto curv = inst.getIncomingValueForBlock(poly.from);
    phiv->addIncoming(poly.get(curv), poly.genfrom);
    poly.set(&inst, val);
  } else
    poly.set(poly.ip, inst.getIncomingValueForBlock(poly.from));
  state.bfsq.push_back(poly.next());
}


void PeInstVisitor::visitBranchInst(llvm::BranchInst& inst)
{
  auto brto = [&](PePoly& poly, llvm::BasicBlock* dst, bool lastjmp = true) {
    auto path = lastjmp
                  ? poly.path
                  : std::make_shared<std::vector<LoopPathItem>>(*poly.path);
    auto src = inst.getParent();
    PePoly res{
      poly.env, poly.base, &*dst->begin(), src, nullptr, poly.insbb, dst, path,
    };
    // Pop the loops we are leaving (or in other words, not jumping into)
    while (!path->empty() && !loops.getLoopFor(path->back().src)->contains(dst))
      path->pop_back();
    // Br to the cond block of the loop
    auto reloop =
      !path->empty() && dst == path->back().src ? path->back().gen : nullptr;
    if (reloop) {
      for (auto& patch : bbphi[reloop]) {
        patch.node->addIncoming(poly.get(patch.arms.at(poly.srcbase)),
                                poly.insbb);
      }
    } else {
      res.env = std::make_shared<llvm::ValueToValueMap>(*poly.env);
      res.base = std::make_shared<llvm::ValueToValueMap>(*poly.env);
    }
    // If new block: create the block
    auto blkit = polybb.find(res);
    if (!reloop && blkit == polybb.end()) {
      res.insbb = llvm::BasicBlock::Create(ctx, dst->getName());
      res.insbb->insertInto(effect);
      polybb.emplace(res, res.insbb);
      bbpoly.emplace(dst, res);
    }
    return std::make_tuple(res,
                           reloop                  ? reloop
                           : blkit != polybb.end() ? blkit->second
                                                   : res.insbb);
  };
  // Get the target bbs
  llvm::BasicBlock* thenbb = inst.getSuccessor(0);
  llvm::BasicBlock* elsebb = nullptr;
  auto cond = inst.isConditional() ? poly.get(inst.getCondition()) : nullptr;
  if (inst.isConditional()) {
    auto cc = cond ? llvm::dyn_cast<llvm::ConstantInt>(cond) : nullptr;
    if (cc && cc->isZero())
      thenbb = nullptr;
    if (!cc || cc->isZero())
      elsebb = inst.getSuccessor(1);
  }
  // Unconditional
  if (thenbb == nullptr || elsebb == nullptr) {
    auto dst = (thenbb != nullptr ? thenbb : elsebb);
    auto [sub, dstbb] = brto(poly, dst);
    llvm::BranchInst::Create(dstbb)->insertInto(poly.insbb, poly.insbb->end());
    if (sub.insbb)
      state.bfsq.push_back(sub);
    return;
  }
  // In loop: switch to loop mode and regenerate the block
  if (loops.isLoopHeader(inst.getParent()) && !poly.inloop) {
    auto next = poly;
    next.env = std::make_shared<llvm::ValueToValueMap>(*poly.env);
    next.ip = &*inst.getParent()->begin();
    next.inloop = true;
    next.path->push_back({ inst.getParent(), next.insbb });
    // Require an rerun
    polybb.emplace(next, next.insbb);
    next.insbb->erase(next.insbb->begin(), next.insbb->end());
    state.bfsq.push_back(next);
    return;
  }
  // Conditional: copy the inst and insert
  auto val = insinst(&inst);
  auto brval = llvm::dyn_cast<llvm::BranchInst>(val);
  auto ping = [&](int idx) {
    auto [sub, dstbb] = brto(poly, inst.getSuccessor(idx), idx == 1);
    brval->setSuccessor(idx, dstbb);
    return sub;
  };
  auto lsub = ping(0);
  auto rsub = ping(1);
  if (lsub.insbb) {
    state.bfsq.push_back(lsub);
  }
  if (rsub.insbb) {
    state.bfsq.push_back(rsub);
  }
}

There are some problems left, like how to obtain the loops and how to transform the program to SSA. Senioria didn't really implement them, since they are already done by llvm, but the algorithms seem to work well.

For loop detection, we can first make the domination graph, a directed graph of blocks with blocks as nodes and jumps in the blocks as edges. Then we can DFS on the graph, maintaining a stack of visited nodes, and when we visit a visited node N, add all nodes in the stack above N into the loop with header N. We can safely assume that a block an only be the header of one node here, since we can't do any better without better specialization technique.

For SSA transformation, we can iterate the blocks in topological order, ignoring edges from a loop body to its header when entering the header, replacing every reassignment with a new variable, and replace succeeding accesses to the assigned variable with the new variable, and create a phi instruction at the beginning of a block for potentially used variables in the block.

Further steps

We haven't discussed how to process functions yet, which is also crucial, and is at least the same complex as processing loops. Well, Senioria am lazy :)