# x86 macro-op fusion notes

For very many years now, x86 and x64 processors have been able to take an arithmetic instruction immediately followed by a conditional jump instruction, and execute the two instructions as if they were a single instruction (which is called "macro-op fusion"). When writing very hot loops in assembly code by hand, this can effectively give you an extra instruction for free.

In Intel chips, fusion first appeared in the original Core microarchitecture, launched in 2006. An improved version of fusion made its debut in the Nehalem microarchitecture (Nhlm) a few years later in 2008. The Sandy Bridge microarchitecture (SnBr) released in 2011 contained yet more improvements to fusion. Things remained fairly stable through until Cannon Lake (CnLk), but then Ice Lake in 2019 *removed* support for a few cases of fusion (`INC`

and `DEC`

instructions, and instructions with memory operands).

In AMD chips, fusion first appeared in the Bulldozer microarchitecture (Bdzr), launched in 2011. It supported fusing `TEST`

and `CMP`

instructions with all flag-based conditional jumps. Things remained fairly stable through until Zen 3 in 2020, which added support for a number of other instructions.

The table below summarises which instructions can fuse with which jumps on which microarchitectures. Combinations which are supported, but which are pointless in practice, are parenthesised.

TEST | AND | OR XOR | CMP | ADD SUB | INC DEC | |
---|---|---|---|---|---|---|

J[N]O | (Core+) | (SnBr+) | - | - | - | - |

J[N]S | Core+ | SnBr+ | - | - | - | - |

J[N]P JP[EO] | Core+ | SnBr+ | - | - | - | - |

J[N]B J[N]AE J[N]C | (Core+) | (SnBr+) | - | Core+ | SnBr+ | - |

J[N]BE J[N]A | (Core+) | (SnBr+) | - | Core+ | SnBr+ | - |

J[N]Z J[N]E | Core+ | SnBr+ | - | Core+ | SnBr+ | SnBr-CnLk |

J[N]L J[N]GE | (Core+) | (SnBr+) | - | Nhlm+ | SnBr+ | SnBr-CnLk |

J[N]G J[N]LE | Core+ | SnBr+ | - | Nhlm+ | SnBr+ | SnBr-CnLk |

TEST | AND | OR XOR | CMP | ADD SUB | INC DEC | |

J[N]O | (Bdzr+) | (Zen3+) | (Zen3+) | Bdzr+ | Zen3+ | Zen3+ |

J[N]S | Bdzr+ | Zen3+ | Zen3+ | Bdzr+ | Zen3+ | Zen3+ |

J[N]P JP[EO] | Bdzr+ | Zen3+ | Zen3+ | Bdzr+ | Zen3+ | Zen3+ |

J[N]B J[N]AE J[N]C | (Bdzr+) | (Zen3+) | (Zen3+) | Bdzr+ | Zen3+ | (Zen3+) |

J[N]BE J[N]A | (Bdzr+) | (Zen3+) | (Zen3+) | Bdzr+ | Zen3+ | (Zen3+) |

J[N]Z J[N]E | Bdzr+ | Zen3+ | Zen3+ | Bdzr+ | Zen3+ | Zen3+ |

J[N]L J[N]GE | (Bdzr+) | (Zen3+) | (Zen3+) | Bdzr+ | Zen3+ | Zen3+ |

J[N]G J[N]LE | Bdzr+ | Zen3+ | Zen3+ | Bdzr+ | Zen3+ | Zen3+ |

`ADD`

.
Jumps which depend only upon the result bits (for

`TEST`

, the result is what `AND`

*would*give, and for

`CMP`

, the result is what `SUB`

*would*give):

Description | Flags | Fusion (SnBr) | |
---|---|---|---|

JS | Jump if sign | `SF = 1` | `TEST` `AND` |

JNS | Jump if not sign | `SF = 0` | `TEST` `AND` |

JP JPE | Jump if parity Jump if parity even | `PF = 1` | `TEST` `AND` |

JNP JPO | Jump if not parity Jump if parity odd | `PF = 0` | `TEST` `AND` |

JZ JE | Jump if zero Jump if equal | `ZF = 1` | `TEST` `AND` `INC` `DEC` `CMP` `ADD` `SUB` |

JNZ JNE | Jump if not zero Jump if not equal | `ZF = 0` | `TEST` `AND` `INC` `DEC` `CMP` `ADD` `SUB` |

`SF`

is a copy of the most significant bit of the result (which for conceptually *signed*results, indicates a result less than zero),

`PF`

is set based on the population count of the low 8 bits of the result (`PF = 1`

implies that said population count was even), and `ZF`

is set if all bits of the result are zero (which for `CMP`

and `SUB`

implies that the two operands were equal).
Jumps which conceptually operate on unsigned operands:

Description | Flags | Fusion (SnBr) | |
---|---|---|---|

JB JNAE JC | Jump if below Jump if not above or equal Jump if carry | `CF = 1` | `CMP` `ADD` `SUB` |

JNB JAE JNC | Jump if not below Jump if above or equal Jump if not carry | `CF = 0` | `CMP` `ADD` `SUB` |

JBE JNA | Jump if below or equal Jump if not above | `CF = 1 or ZF = 1` | `CMP` `ADD` `SUB` |

JNBE JA | Jump if not below or equal Jump if above | `CF = 0 and ZF = 0` | `CMP` `ADD` `SUB` |

`CF`

is the (N+1)^{th}bit. Fusion with

`TEST`

and `AND`

is technically possible, but `CF`

is always zero following these instructions, making it pointless. The "above" and "below" descriptions are from the perspective of `CMP`

or `SUB`

with unsigned operands. To extend the description to `ADD`

, consider the instruction to be taking two N-bit unsigned operands, zero-extending both to (N+1) bits, and producing an (N+1)-bit *signed*result. Then "above" and "below" are whether this (N+1)-bit

*signed*result is above or below zero.

Jumps which conceptually operate on signed operands:

Description | Flags | Fusion (SnBr) | |
---|---|---|---|

JO | Jump if overflow | `OF = 1` | - |

JNO | Jump if not overflow | `OF = 0` | - |

JL JNGE | Jump if less Jump if not greater or equal | `SF ≠ OF` | `INC` `DEC` `CMP` `ADD` `SUB` |

JNL JGE | Jump if not less Jump if greater or equal | `SF = OF` | `INC` `DEC` `CMP` `ADD` `SUB` |

JNG JLE | Jump if not greater Jump if less or equal | `SF ≠ OF or ZF = 1` | `TEST` `AND` `INC` `DEC` `CMP` `ADD` `SUB` |

JG JNLE | Jump if greater Jump if not less or equal | `SF = OF and ZF = 0` | `TEST` `AND` `INC` `DEC` `CMP` `ADD` `SUB` |

^{nd}operand being

`1`

for `INC`

and `DEC`

), sign extend both to (N+1)-bits, compute an (N+1)-bit signed result, then `OF`

is set to the exclusive-or of the two most signicant bits of those (N+1). This means that `SF ≠ OF`

corresponds to the most significant bit of this signed (N+1)-bit result (in the same way that `CF`

corresponds to the most significant bit of the (N+1)-bit result when the operands are *zero*-extended). In other words,

`OF`

is set if sign-extending the N-bit result to (N+1)-bits gives a different result to sign-extending the operands to (N+1)-bits and *then*performing (N+1)-bit arithmetic. Fusion with

`TEST`

and `AND`

is technically possible for all these jumps, but `OF`

is always zero following `TEST`

or `AND`

, making many combinations pointless. The "greater" and "less" descriptions are from the perspective of `CMP`

or `SUB`

with signed operands. To extend the description to `ADD`

, consider the instruction to be taking two N-bit signed operands, sign-extending both to (N+1) bits, and producing an (N+1)-bit signed result. Then "greater" and "less" are whether this (N+1)-bit signed result is greater or less than zero. The `INC`

and `DEC`

instructions are equivalent to `ADD`

and `SUB`

with the 2^{nd}operand being

`1`

(except for `INC`

and `DEC`

not touching `CF`

).
In the context of optimising hot loops, it is common to be iterating over the half-open range

`[0, N)`

, by some step `S`

. Given all the references above to comparing to zero, it is often beneficial to transform this range to `[-N, 0)`

, as then the loop-control instructions can be `ADD reg, S`

followed by a conditional jump, rather than `ADD reg, S`

followed by `CMP reg, N`

followed by a conditional jump. In the case of fusing `ADD reg, S`

with a conditional jump, if `N`

is known to be non-zero, then the conditional jump can be `JA`

or `JAE`

(or synonyms thereof), relying on `CF = 1`

occuring when the counter transitions from negative to non-negative. If `N`

might zero, and a single iteration of the loop is desired (or harmless) even when `N`

*is*zero, then the conditional jump cannot be

`JA`

or `JAE`

, but instead can be `JL`

(or a synonym thereof), relying on sign extension of both operands to (N+1) bits and then jumping if the (N+1)-bit result is less than zero.