Seeing the raft for the trees

The Raft distributed consensus protocol is normally seen as having an array of log entries. There is however a generalisation of Raft which replaces the array of log entries with a tree of nodes, and then (standard) Raft can be seen as a very simple tree that contains no branches. The resultant protocol can be called Tree Raft, and is very similar in spirit to Chained Raft. This post describes Tree Raft and contrasts it to standard Raft.

In standard Raft, the log is an array, indexed by uint64_t index, with each array element being a log entry:

struct LogEntry {
  uint64_t term;
  string contents;
};

In Tree Raft, the log is a tree. The tree can be stored in a map/dictionary whose key type is NodeRef and value type is Node:

struct NodeRef {
  uint64_t index;
  uint64_t term;
};
struct Node {
  string contents;
  NodeRef parent;
};

There are two restrictions on the parent field: if Node n has key (i, t) then n.parent.index == i - 1 and n.parent.term <= t. This first restriction allows an optimisation: parent.index does not need to be stored; only parent.term needs to be stored.

There are two obvious orderings that can be defined on NodeRef: by ascending index then ascending term (lexicographic), and by ascending term then ascending index (transposed lexicographic). Both orderings turn out to be useful in different places. Note that a node comes after its parent in both orderings.

As an optimisation, a Node and (some number of) its transitive parents can be represented as an array of LogEntry plus the parent term of the oldest Node. This optimised representation is the only representation in standard Raft, and it appears twice: the per-server log takes this form (with the parent term of the oldest Node being lastTerm in the InstallSnapshot RPC), and the AppendEntries RPC takes this form (with the parent term of the oldest Node being prevLogTerm). A Tree Raft implementation could use this optimised form for committed nodes, and only use the more expensive map/dictionary representation for uncommitted nodes.

Note that a NodeRef identifies a single Node, but because every node identifies its parent, a NodeRef implicitly identifies an entire chain of nodes. This is similar to a commit hash in git: a hash identifies a single commit, but a commit references its parent commit(s), and thus a single hash identifies an entire graph of commits leading to that point. Standard Raft describes this as the Log Matching property: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.

A Tree Raft server maintains two cursors into the tree: NodeRef commit and NodeRef head. The NodeRef commit cursor replaces the uint64_t commitIndex from standard Raft. The NodeRef head cursor replaces "last log entry" everywhere that it comes up: the RequestVote RPC contains the candidate's NodeRef head, and other servers are willing to vote for candidates whose NodeRef head is greater than or equal to theirs (using transposed lexicographic order). The leader is able to create a new node by using the key (head.index + 1, currentTerm) and then setting its head to this new node. For every follower, there will be a path in the tree from the follower's NodeRef head to the leader's NodeRef head, and one objective of a follower is to move its head cursor along this path toward the leader's head cursor (note that the path might involve going up the tree before going back down). Similarly, for every follower, there will be a path in the tree from the follower's NodeRef commit to the leader's NodeRef commit, and another objective of a follower is to move its commit cursor along this path toward the leader's commit cursor (crucially, this path is always downwards and never backtracks). The leader is responsible for moving its commit cursor toward its head cursor when it is safe to do so, and the safety constraint here is very similar to standard Raft: a strict majority of servers must have their head.term equal to the leader's currentTerm, and within this majority, the minimum head.index gives the safe commit point. On every server, the commit cursor will either be equal to the head cursor, or be an ancestor of the head cursor. This permits an optimisation: only the index of the commit cursor needs to be stored, as its term can be reconstructed by finding the ancestor of head with the given index. Standard Raft contains this optimisation, as it only stores uint64_t commitIndex. It can be seen that standard Raft is just Tree Raft where a server stores the chain of nodes between the root and its head (a very simple non-branching tree) as its log, and does not store any other nodes.

Tree Raft replaces the AppendEntries RPC with an AddNodes RPC. The prevLogIndex, prevLogTerm, entries[], and leaderCommit arguments are replaced by the leader's NodeRef head, pair<NodeRef, Node> nodes[], and the leader's NodeRef commit. Note that these arguments can be optimised somewhat (only including commit.index and not commit.term, using the array of LogEntry representation for nodes, not including head if the last nodes entry is the head, not including the term of nodes when equal to the leader's currentTerm, and so forth). After processing the RPC, followers respond with their currentTerm and their (possibly moved) NodeRef head. The receiver implementation is:

If term < currentTerm, finish processing.
Add all the nodes to your map.
If there is a path from your NodeRef head to the leader's NodeRef head, set your head to the leader's.
If there is a path from your NodeRef commit to the leader's NodeRef commit, and the leader's commit either equals your head or is an ancestor of your head, set your commit to the leader's (this can be deferred until after the response has been sent, and the movement can be done one step at a time).

Note that if a follower misses an AddNodes RPC (due to packet loss, or leader failure, or late startup), then the nodes added in step 2 might not have a path back to the root; that is, there'll be a node in the map whose parent is not in the map. In this case, steps 3 and 4 will not be able to find a path to the new nodes. In a departure from standard Raft, in Tree Raft, followers are responsible for identifying these missing parent nodes and issuing Replay RPCs (against a randomly chosen server) in an attempt to obtain them. Once obtained, steps 3 and 4 can be retried. This has a number of nice properties: leaders no longer need to track nextIndex for each follower, the replay workload is shared between followers rather than being yet another task that the leader is responsible for, and the AddNodes RPC can be a UDP multicast to all followers. Furthermore, as nodes are not removed from the map as part of processing an AddNodes RPC, there are slightly nicer forward progress guarantees during leader changeover. As yet another nice property, if implementing the PreVote extension (which is necessary for liveness), then PreVote rejects can carry the (server's latest view of) the leader's currentTerm and NodeRef head and NodeRef commit, and then the receiver of the reject can treat this like an AddNodes and then replay the missing nodes (note that currentTerm is monotonic, and within a term, the leader's head and commit are also monotonic, hence stale views of this triplet can be arbitrated). Consequently, followers can keep up with the leader even when direct communication between the leader and the follower is not possible, avoiding the need for a long recovery process once direct communication is restored.

As nodes are not removed from the map as part of the AddNodes RPC, it is worth considering when nodes do get removed. A server cannot remove its head node, nor any ancestor of its head node (except when moving these nodes into a snapshot), but it is safe to remove any other node at any time. Standard Raft takes a very aggressive policy: when head moves backward (as part of moving it toward the common parent of the follower's head and the leader's head, i.e. moving it along the path in the tree from follower's head to leader's head), immediately remove everything that can be removed. A very lax policy is also possible: never remove nodes. A good middle ground is to remove during commitment: when committing a node, remove any other nodes which have the same index but a different term, and when removing a node, also remove its children (recursively). This fits very nicely with using the optimised array representation for committed nodes. If taking this middle ground, then it becomes attractive to store the uncommitted nodes in a sorted map, with nodes in lexicographic order, as then nodes with the same index can be found with a range scan, and the children of a node can also be found with a range scan (find all nodes with index + 1 using range scan, then filter based on their parent.term).

To efficiently answer "is there a path between node X and node Y" queries, it can be useful for each server to store an additional NodeRef furthest_ancestor field on every Node. Initially, furthest_ancestor is set to parent, but when it is discovered that n.furthest_ancestor exists in the map, n.furthest_ancestor is replaced by lookup(n.furthest_ancestor).furthest_ancestor (giving a union-find-like behaviour). If two nodes have the same furthest_ancestor then a path exists between them. As an optimisation, furthest_ancestor does not need to be stored for committed nodes, as it'll always refer to the root for them. Furthermore, furthest_ancestor precisely identifies the node which needs to be the target of a Replay RPC.