Comparison after bit reversal

ARM processors have an rbit instruction for reversing bits. On these processors, the following function can be executed as three instructions; rbit followed by rbit followed by cmp, with the result in flags:

bool rbit_lt(uint32_t a, uint32_t b) {
  return __builtin_arm_rbit(a) < __builtin_arm_rbit(b);
}

x86 processors lack an rbit instruction, but following a derivation from the expired patent US 6347318 B1, there is a different three-instruction sequence for x86 processors which achieves the same thing. Start by pulling out the less-than comparison:

bool rbit_lt(uint32_t a, uint32_t b) {
  return lt(__builtin_arm_rbit(a), __builtin_arm_rbit(b));
}

bool lt(uint32_t a, uint32_t b) {
  return a < b;
}

Then replace the less-than comparison with a much more complicated version. This is based on the observation that the less-than comparison only cares about the most significant bit position where the two inputs differ. In turn, most significant bit position is just reversed least significant bit position, and standard bit tricks can be used to get the least significant set bit:

bool lt(uint32_t a, uint32_t b) {
  return (msb(a ^ b) & b) != 0;
}

uint32_t msb(uint32_t x) {
  return __builtin_arm_rbit(lsb(__builtin_arm_rbit(x)));
}

uint32_t lsb(uint32_t x) {
  return x & -x;
}

Then merge (the more complicated) lt back into rbit_lt:

bool rbit_lt(uint32_t a, uint32_t b) {
  return (msb(__builtin_arm_rbit(a ^ b)) & __builtin_arm_rbit(b)) != 0;
}

Then expand msb:

bool rbit_lt(uint32_t a, uint32_t b) {
  return (__builtin_arm_rbit(lsb(a ^ b)) & __builtin_arm_rbit(b)) != 0;
}

Then the reversals can be dropped:

bool rbit_lt(uint32_t a, uint32_t b) {
  return (lsb(a ^ b) & b) != 0;
}

This is three instructions; xor followed by blsi followed by test, with the result in flags. The blsi instruction is part of the BMI1 instruction set, and thus not present on older x86 processors. If we care about older processors, then xor-lsb can be replaced with with equivalent sub-lsb:

bool rbit_lt(uint32_t a, uint32_t b) {
  return (lsb(a - b) & b) != 0;
}

Then lsb expanded:

bool rbit_lt(uint32_t a, uint32_t b) {
  return ((a - b) & (b - a) & b) != 0;
}

This comes out to four instructions on ARM; sub, sub, and, and tst. As x86 lacks relevant three-operand operations, it comes out to five on x86: mov, sub, sub, and, and test (or sub, mov, neg, and, and test), though the mov will be almost free on recent x86 processors.