Reworking Lengauer-Tarjan
In compiler circles, Lengauer & Tarjan's 1979 paper "A Fast Algorithm for Finding Dominators in a Flowgraph" is a classic: it describes and proves a viable algorithm for computing the idom of every node in a rooted directed graph. I have just two criticisms of the paper:
- All the sample code is written in Algol. This was a fine choice in 1979, but tastes have changed over the intervening 46 years, and most modern eyes will see the Algol code as somewhat antiquated.
- It starts by presenting an optimised variant of the algorithm, rather than starting with the most basic exposition and then introducing optimisations.
I believe that both problems can be addressed via a little bit of reworking.
The algorithm takes as input a graph G with some root node r. We immediately perform a depth-first search on G; let T be the spanning tree discovered by that search, and replace all node labels with the pre-order index from that search. The root node is now 0, one of its successors is 1, and so forth. For arbitrary nodes v and w, this allows us to write things like v < w and v ≤ w and min(v, w), all of which operate on these indices. We also introduce some notation for talking about these graphs:
Notation | Meaning
---|---
v ->G w | There is an edge from v to w in G, and w != r
v -*->T w | There is a path from v to w in T, or v == w
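For concreteness, here's one possible sketch of that initial search in Python, assuming G arrives as a mapping succ_raw from original node labels to successor lists; that input format, and all the names here, are my own choices rather than anything from the paper:
def dfs_renumber(succ_raw, r):
    index = {r: 0}  # original label -> pre-order index
    order = [r]     # pre-order index -> original label
    parent_of = [0] # parent_of[w]: parent of w in T (entry 0 unused)
    stack = [(r, iter(succ_raw[r]))]
    while stack:
        v, it = stack[-1]
        w = next(it, None)
        if w is None:
            stack.pop()
        elif w not in index:
            index[w] = len(order)
            parent_of.append(index[v])
            order.append(w)
            stack.append((w, iter(succ_raw[w])))
    # Re-express G's edges as pre-order indices, dropping unreachable
    # nodes and (per the notation above) edges into the root.
    N = len(order)
    preds = [[] for _ in range(N)]
    for v in order:
        for w in succ_raw[v]:
            if w in index and index[w] != 0:
                preds[index[w]].append(index[v])
    return N, parent_of, preds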
We jump in at what the paper calls Theorem 4, which enables this definition:
sdom(w) = min({v | v ->G w and v ≤ w} union
              {sdom(u) | u -*->T v and v ->G w and v > w and u > w})
This definition is recursive, but only in the case of u > w, so we can compute it for all nodes by first computing it for w = N-1, then for w = N-2, and so forth until we eventually reach w = 1 (we cannot compute it for w = 0, as both sets are empty in that case).
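As a small worked example (the graph is mine, not the paper's), consider the diamond-shaped graph with edges 0 -> 1, 1 -> 2, 1 -> 4, 2 -> 3, 4 -> 3, where node labels are already pre-order indices, so in T we have parent(1) = 0, parent(2) = 1, parent(3) = 2, and parent(4) = 1. Working downwards: sdom(4) = 1 (its sole predecessor 1 is ≤ 4); sdom(3) = min(2, sdom(4)) = 1 (predecessor 2 is ≤ 3, while predecessor 4 is > 3, and u = 4 is the only node with u -*->T 4 and u > 3); sdom(2) = 1; and sdom(1) = 0.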
Reworking Theorem 3 slightly enables this definition:
idomfrom(w) = argmin_u{sdom(u) | u -*->T w and u > sdom(w)}
Where argmin_u{expr | condition} gives the u which minimises expr (over all the possible u which meet condition), or any such u if there are multiple u achieving that minimum. Note that the set is always non-empty, as u = w meets the condition. As u -*->T w implies u ≤ w, we also have idomfrom(w) ≤ w.
Then reworking Theorem 2 slightly enables this definition:
idom(w) = sdom(w) if idomfrom(w) ≥ w else idom(idomfrom(w))
This definition is recursive, but only in the case of idomfrom(w) < w, so we can compute it for all nodes by first computing it for w = 1, then for w = 2, and so forth until eventually we reach w = N-1.
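Continuing the diamond example: sdom(3) = 1, and the nodes u with u -*->T 3 and u > 1 are 2 and 3, both with sdom(u) = 1, so idomfrom(3) can be either. If idomfrom(3) = 3, then idom(3) = sdom(3) = 1 directly; if idomfrom(3) = 2, then idom(3) = idom(2), and idom(2) = sdom(2) = 1 because idomfrom(2) = 2. Either way, idom(3) = 1, as expected for the join point of the diamond.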
This gives us everything we need to compute idom, via a four-step process:
1. Perform a depth-first search on G, to give T (and pre-order integer nodes).
2. Compute sdom(w) for all w > 0, in decreasing order of w.
3. Compute idomfrom(w) for all w > 0, in any order.
4. Compute idom(w) for all w > 0, in increasing order of w.
In both steps 2 and 3, we're interested in either min or argmin_u over the set {sdom(u) | u -*->T v and u > limit}, for various values of v and limit. There are (at least) three different strategies for computing each of these min / argmin_u:
1. From scratch every time: whenever such a computation is required, start with u = v, and then iterate u = parent(u) (where parent gives the parent in T) until u ≤ limit, taking the min / argmin_u of sdom(u) over all the observed u for which u > limit.
2. Like strategy 1, but introduce a layer of caching to avoid repeated work. The paper calls this "path compression", which is one way of viewing it, but you reach the exact same place if you instead apply aggressive caching to strategy 1. For this to work, all queries need to be made in decreasing order of limit, as otherwise previously cached results wouldn't be valid. This happens naturally for step 2 (because it has limit = w, and is performed in decreasing order of w), but slightly more work is required to extend it to step 3: the usual approach is to chop up step 3 into lots of small pieces, and perform them interleaved with step 2 at appropriate times.
3. Like strategy 2, but doing the union-find equivalent of "union by rank/size" rather than the union-find equivalent of "path compression". Note that the min and argmin_u operations in question aren't quite union-find, but they're sufficiently similar that the ideas can carry over.
These three strategies are in increasing order of implementation complexity, but also decreasing order of worst-case algorithmic complexity. Strategy 3 is best in theory, but the general consensus is that strategy 2 usually beats it in practice. Meanwhile, the Go compiler uses strategy 1 and considers it sufficient.
If using strategy 1, the pseudocode for steps 2 through 4 can be quite concise:
# Step 2
for w in range(N-1, 0, -1):
    sdom[w] = min(strategy1(v, w)[0] if v > w else v for v in pred(w))
# Step 3
for w in range(1, N):
    idomfrom[w] = strategy1(w, sdom[w])[1]
# Step 4
for w in range(1, N):
    idom[w] = sdom[w] # Relevant when idomfrom[w] = w
    idom[w] = idom[idomfrom[w]] # No change when idomfrom[w] = w
def strategy1(v, limit):
    us = [v]
    while parent(us[-1]) > limit: us.append(parent(us[-1]))
    return min((sdom[u], u) for u in us)
Note that pred in step 2 gives the predecessors in G, whereas parent in strategy1 gives the parent in T.
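To check the above end to end, here's the same pseudocode fleshed out into runnable Python and applied to the diamond graph from earlier; the array names and the parent / pred helpers are my own scaffolding, not the paper's:
# The diamond graph from earlier: edges 0->1, 1->2, 1->4, 2->3, 4->3,
# with nodes already renumbered to their pre-order indices.
N = 5
parent_of = [0, 0, 1, 2, 1]          # parent_of[w]: parent of w in T
preds = [[], [0], [1], [2, 4], [1]]  # preds[w]: predecessors of w in G

def parent(u): return parent_of[u]
def pred(w): return preds[w]

sdom = [0] * N
idomfrom = [0] * N
idom = [0] * N

def strategy1(v, limit):
    us = [v]
    while parent(us[-1]) > limit: us.append(parent(us[-1]))
    return min((sdom[u], u) for u in us)

for w in range(N-1, 0, -1):  # Step 2
    sdom[w] = min(strategy1(v, w)[0] if v > w else v for v in pred(w))
for w in range(1, N):        # Step 3
    idomfrom[w] = strategy1(w, sdom[w])[1]
for w in range(1, N):        # Step 4
    idom[w] = sdom[w]
    idom[w] = idom[idomfrom[w]]

assert sdom[1:] == [0, 1, 1, 1]
assert idom[1:] == [0, 1, 1, 1]  # 0 dominates 1; 1 dominates 2, 3, 4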
There are two little tricks to avoid some of the function calls in step 3:
- If sdom(w) == 0, then u = w is an acceptable result from argmin_u, as no sdom value can be smaller than 0.
- If sdom(w) == parent(w), then u = w will be the only possible argmin_u, as the only u with u -*->T w and u > parent(w) is w itself.
We can revise step 3 to detect these two cases, and handle them without a call:
# Step 3
for w in range(1, N):
    if sdom[w] in {0, parent(w)}:
        idomfrom[w] = w
    else:
        idomfrom[w] = strategy1(w, sdom[w])[1]
If we then want to use something better than strategy1, step 3 needs to be chopped up and interleaved with step 2. One way of doing this is to introduce an array called deferred, which conceptually stores various sets: calls to strategy1 in step 3 are changed to instead add a node to a set, and then the actual calls happen later when the set is drained. The pseudocode for steps 2 and 3 thus becomes:
# Step 2 and Step 3
deferred = [0] * N # Initialise deferred (all sets empty)
for w in range(N-1, 0, -1):
    v = deferred[w]
    while v != 0: # Drain set
        idomfrom[v] = strategy1(v, w)[1]
        v = deferred[v]
    sdom[w] = min(strategy1(v, w)[0] if v > w else v for v in pred(w))
    if sdom[w] in {0, parent(w)}:
        idomfrom[w] = w
    else:
        # Add w to set, will drain when step 2 reaches sdom(w)
        deferred[w] = deferred[sdom[w]]
        deferred[sdom[w]] = w
Then we can upgrade steps 2 and 3 from strategy1 to strategy2:
# Step 2 and Step 3
deferred = [0] * N # Initialise deferred (all sets empty)
cache_ancestor = [0] * N # Always written before being read
cache_result = [(0, 0)] * N # Always written before being read
for w in range(N-1, 0, -1):
    v = deferred[w]
    while v != 0: # Drain set
        idomfrom[v] = strategy2(v, w)[1]
        v = deferred[v]
    sdom[w] = min(strategy2(v, w)[0] if v > w else v for v in pred(w))
    if sdom[w] in {0, parent(w)}:
        idomfrom[w] = w
    else:
        # Add w to set, will drain when step 2 reaches sdom(w)
        deferred[w] = deferred[sdom[w]]
        deferred[sdom[w]] = w
    cache_ancestor[w] = parent(w)
    cache_result[w] = (sdom[w], w)
def strategy2(v, limit):
    vs = []
    ancestor = cache_ancestor[v]
    while ancestor > limit:
        vs.append(v)
        v = ancestor
        ancestor = cache_ancestor[v]
    result = cache_result[v]
    while vs:
        v = vs.pop()
        result = min(result, cache_result[v])
        cache_result[v] = result
        cache_ancestor[v] = ancestor
    return result
I consider strategy2 to be sufficient, so I won't present a strategy3. Instead, a few memory optimisations are available to anyone squeezing out performance:
- The idom and sdom arrays can share the same underlying storage, as for any given w, once idom[w] has been assigned, sdom[w] is not accessed. This also removes the need for the idom[w] = sdom[w] assignment in step 4 (see the sketch after this list).
- The cache_ancestor and parent arrays can share the same underlying storage, as for any given w, once cache_ancestor[w] has been assigned for the first time, parent(w) is not used. This also removes the need for the cache_ancestor[w] = parent(w) assignment.
- The deferred and idomfrom arrays can share the same underlying storage, albeit the draining needs a minor tweak to read deferred[v] before writing idomfrom[v].
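For instance, the first of these might turn step 4 into something like the following, with dom (a name of my choosing) holding sdom values on entry and idom values on exit:
# Step 4, with one array `dom` doubling as sdom (before) and idom (after)
for w in range(1, N):
    if idomfrom[w] != w:  # When idomfrom[w] == w, idom[w] == sdom[w] already
        dom[w] = dom[idomfrom[w]]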
Once idom(w) has been computed, it can be used to compute DF(w) in the style of Cytron et al's 1991 paper "Efficiently Computing Static Single Assignment Form and the Control Dependence Graph":
for w in bottom_up_traversal_of_dominator_tree:
    DF[w] = set()
    for v in succ(w): # successors in G
        if idom[v] != w:
            DF[w].add(v)
    for v in children(w): # children in the dominator tree
        for u in DF[v]:
            if idom[u] != w:
                DF[w].add(u)
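Neither bottom_up_traversal_of_dominator_tree nor children is spelled out above, but both fall out of the idom array: idom(w) is a proper ancestor of w in T, so idom(w) < w, and visiting nodes in decreasing pre-order index therefore visits children before parents in the dominator tree. A possible realisation (the array names are mine, and succ is as above):
children = [[] for _ in range(N)] # children[w]: children of w in the dominator tree
for w in range(1, N):
    children[idom[w]].append(w)
DF = [set() for _ in range(N)]
for w in range(N-1, -1, -1): # bottom-up over the dominator tree
    for v in succ(w): # successors in G (never the root, per the notation)
        if idom[v] != w:
            DF[w].add(v)
    for v in children[w]:
        for u in DF[v]:
            if idom[u] != w:
                DF[w].add(u)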
Alternatively, idom(w) can be used to compute DF(w) in the style of Cooper et al's 2001 paper "A Simple, Fast Dominance Algorithm":
for w in G:
    DF[w] = set()
for w in G:
    if len(pred(w)) >= 2: # predecessors in G
        for v in pred(w): # predecessors in G
            u = v
            while u != idom[w]:
                DF[u].add(w)
                u = idom[u]
Note that if len(pred(w)) >= 2: is an optimisation, rather than a requirement for correctness: when len(pred(w)) == 1, the sole predecessor of w will be idom(w), so the innermost loop won't execute.
This formulation is easily amenable to representing DF values as arrays rather than sets, as deduplication needs only to look at the most recently inserted value: for a fixed w, the only value ever appended is w itself, so any duplicates in a given DF(u) are necessarily adjacent.
for w in G:
    DF[w] = []
for w in G:
    if len(pred(w)) >= 2: # predecessors in G
        for v in pred(w): # predecessors in G
            u = v
            while u != idom[w]:
                if len(DF[u]) == 0 or DF[u][-1] != w: DF[u].append(w)
                u = idom[u]
Finally, once DF(w) is available for all w, SSA construction becomes simple: if G denotes a control flow graph of basic blocks, then for every variable assigned-to in w, DF(w) is exactly the set of basic blocks which require a phi function (or basic block argument) inserted at their start to reconverge the various assignments of that variable. A mere two caveats apply:
- The phi function is another assignment to the variable in question, which possibly enlarges the set of variables assigned-to by the basic block into which the phi was inserted. Multiple iterations can be required to reach a fixpoint (see the worklist sketch after this list).
- Some liveness analysis can be applied to avoid inserting phi functions whose result would never be used.
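A minimal sketch of the first caveat's fixpoint, as a worklist over the iterated dominance frontier; the function and its argument names are my own framing. Given the blocks that originally assign to one variable, it returns the blocks needing a phi for that variable (it does not attempt the liveness pruning of the second caveat):
def place_phis(assigning_blocks, DF):
    # Each inserted phi is itself an assignment, so its block's
    # dominance frontier must be processed too; the worklist
    # converges because each block is enqueued at most once.
    needs_phi = set()
    worklist = list(assigning_blocks)
    while worklist:
        w = worklist.pop()
        for v in DF[w]:
            if v not in needs_phi:
                needs_phi.add(v)
                worklist.append(v)
    return needs_phi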