__VA_OPT__ Minutiae

Let's say we're writing C code, and would like to define a macro for printf-with-newline. We might start with:

#define printfln(fstr, ...) \
  printf(fstr "\n", __VA_ARGS__)

So far so good; we can write printfln("1+1 is %d", 1+1) and it'll print 1 + 1 is 2 followed by a newline. However, simpler cases such as printfln("Hello") result in a syntax error, as this macro expands to printf("Hello" "\n",), which is invalid due to the trailing comma.

The non-standard solution

One conventional solution is to rely on a non-standard language extension whereby inserting ## between , and __VA_ARGS__ has a special effect:

#define printfln(fstr, ...) \
  printf(fstr "\n", ## __VA_ARGS__)

With this, both printfln("1+1 is %d", 1+1) and printfln("Hello") work fine. However, combining the two into something like printfln("Wrote %d chars", printfln("Hello")) results in a mysterious error. To figure out why, we need to look into exactly what special effect this non-standard language extension has. The GNU C Preprocessor documentation on Variadic Macros says:

The ## token paste operator has a special meaning when placed between a comma and a variable argument. If you write [...] and the variable argument is left out [...], then the comma before the ## will be deleted. This does not happen if you pass an empty argument, nor does it happen if the token preceding ## is anything other than a comma.

Meanwhile, the GCC documentation on Variadic Macros says:

If the variable arguments are omitted or empty, the ## operator causes the preprocessor to remove the comma before it. If you do provide some variable arguments in your macro invocation, GNU CPP does not complain about the paste operation and instead places the variable arguments after the comma. Just like any other pasted macro argument, these arguments are not macro expanded.

The last sentence of this 2nd piece of documentation tells us what is going on: macro arguments are usually expanded before being substituted, but this doesn't happen to macro arguments used as operands to # or ##. In either case, after substitution has been performed, the resultant tokens are rescanned for more expansions, albeit with the original macro deactivated during this rescan. The only major observable difference between expansion-before-substitution and expansion-during-rescan is that the original macro is active in the former but deactivated in the latter. Hence printfln("Wrote %d chars", printfln("Hello")) expands to printf("Wrote %d chars" "\n", printfln("Hello")) without expanding the inner printfln, which the compiler then tries to parse as a call to the non-existent function printfln.

The two pieces of documentation are contradictory as to what happens if the variable arguments are present but empty (as in printfln("Hello",)). In practice the 1st piece of documentation is correct; the comma is kept if the variable arguments are present but empty.

The nor does it happen if the token preceding ## is anything other than a comma phrase from the 1st piece of documentation is interesting, as it turns out this is a slightly dynamic property: comma deletion can happen for things like , ## x ## __VA_ARGS__ provided that x expands to nothing. Clang will also delete the comma in , x ## __VA_ARGS__ when x expands to nothing. Things also seem to get funky if more pasting immediately follows , ## __VA_ARGS__. As examples:

#define F(x, ...) asF()F(1)F(,)F(1,)F(1,2)
,##__VA_ARGS__emptyempty,,,2
,##__VA_ARGS__ xempty1,,1,2 1
,##__VA_ARGS__##x,error,errorerror† or ,21
,##x##__VA_ARGS__emptyerror,errorerror
,x##__VA_ARGS__,† or empty ‡,1,,1,12

† According to gcc 13.2.
‡ According to clang 17.0.1.

In the cases where gcc and clang differ, who is to say which is correct? After all, there is no standard documenting the desired behaviour for non-standard language extensions.

Enter __VA_OPT__

Rather than standardise this mess, the C and C++ languages adopted a different solution, namely __VA_OPT__. With __VA_OPT__, our motivating example looks like:

#define printfln(fstr, ...) \
  printf(fstr "\n" __VA_OPT__(,) __VA_ARGS__)

In this, __VA_OPT__(,) expands to nothing if the variable arguments are absent or empty or become empty after macro expansion, whereas it expands to , otherwise. There's no token pasting going on, so printfln("Wrote %d chars", printfln("Hello")) now works. The other behavioural difference is that the odd-looking printfln("Hello",) now expands to the valid printf("Hello" "\n"), as does printfln("Hello", EMPTY) in the context of #define EMPTY /*nothing*/.

The standard didn't stop there though; there can be more than just a , within __VA_OPT__. Any (-)-balanced token sequence is allowed, and it can even access the arguments of the macro invocation, so for example it is valid to have:

#define M(x, ...) (0 __VA_OPT__(-(x)) )

Then M(1) expands to (0) and M(1,2) expands to (0 - (1)).

What about whitespace?

Compilers seem to disagree on the behaviour of whitespace just after __VA_OPT( or just before the matching ). Consider:

#define TILDAS(...) ~__VA_OPT__( ~ )~
#define S2(x) #x
#define S1(x) S2(x)
const char* s = S1(TILDAS());
const char* sa = S1(TILDAS(a));

The observed results are:

gcc 13.2clang 17.0.1
s"~~""~ ~"
sa"~~~""~ ~ ~"

One interpretation is that gcc is correct, based on this paragraph from the standard: (N3096 6.10.4.1¶7)

The preprocessing token sequence for [...] a va-opt-replacement [...] consists of the results of the expansion of the contained pp-tokens as the replacement list of the current function-like macro before removal of placemarker tokens, rescanning, and further replacement.

Combined with an earlier paragraph about replacement lists: (N3096 6.10.4¶7)

[...] Any white-space characters preceding or following the replacement list of preprocessing tokens are not considered part of the replacement list for either form of macro.

What does __VA_OPT__() expand to?

The short answer should be nothing, based on the rules for expanding it:

  1. If the variable arguments are absent or empty or become empty after macro expansion, the expansion of __VA_OPT__() is a single placemarker token.
  2. Otherwise, if used as an operand of ##, the expansion of __VA_OPT__() is a single placemarker token (because an empty expansion becomes a placemarker in this context).
  3. Otherwise, the expansion of __VA_OPT__() is empty (though a single placemarker token would be equally valid, as it'll evaporate away in due course).

#__VA_OPT__() therefore becomes an obfuscated way of writing "". Meanwhile, __VA_OPT__() as an operand of ## can be used to force the other operand to be in a ## context. For example, the undesirable behaviour of , ## __VA_ARGS__ can be reproduced via:

#define printfln(fstr, ...) \
  printf(fstr "\n" __VA_OPT__(,) __VA_OPT__() ## __VA_ARGS__)

If __VA_OPT__() expands to nothing, regardless of whether the variable arguments are absent or empty or become empty after macro expansion, then it might seem rational to optimise it away. There's a corner case though: determining whether the variable arguments become empty after macro expansion requires macro expanding them, and macro expansion of the (non-standard) __COUNTER__ macro has visible side-effects. Consider:

#define EMPTY_VA_OPT(...) __VA_OPT__()
int x = __COUNTER__;
EMPTY_VA_OPT(__COUNTER__)
int y = __COUNTER__;
return y - x;

For the above, gcc 13.2 returns 2, whereas clang 17.0.1 returns 1, indicating that clang optimised away the __VA_OPT__().

Token pasting and __VA_OPT__

The standard is careful to say that the expansion of __VA_OPT__ can contain placemarker tokens: (N3096 6.10.4.1¶7, emphasis mine)

The preprocessing token sequence for [...] a va-opt-replacement [...] consists of the results of the expansion of the contained pp-tokens as the replacement list of the current function-like macro before removal of placemarker tokens, rescanning, and further replacement.

These placemarker tokens were originally a fiction dreamt up by the standard to provide a succinct description of the semantics of ##:

Before __VA_OPT__, it was possible to ignore this fiction, and implement the preprocessing algorithm with careful rules around ## evaluation rather than producing and later removing placemarker tokens. It becomes harder to maintain this fiction with __VA_OPT__. Consider:

#define G(x, y, z, ...) 1 ## __VA_OPT__(x y ## y z) ## 5

Then G(,,) expands to the single token 15, while G(2,3,4,-) expands to the three tokens 12 33 45. If y is empty, then the inner y ## y should produce a placemarker, and then things get interesting:

Expansion ofResultNotes
G(2,,4,-)12 45The __VA_OPT__ expands to 24.
G( ,,4,-)1 45The placemarker inhibits merging to 145.
G(2,, ,-)12 5The placemarker inhibits merging to 125.
G( ,, ,-)1 5† or 15The correct result is the merged 15.

† According to gcc 13.2.
‡ According to clang 17.0.1.

The other fun observation is that macro arguments within __VA_OPT__ get macro expanded, even if the __VA_OPT__ itself is an operand of ##. Consider:

#define H1(...) x ##            __VA_ARGS__
#define H2(...) x ## __VA_OPT__(__VA_ARGS__)

Then H1(__LINE__) expands to x__LINE__, whereas H2(__LINE__) expands to something like x3.

Further reading

Higher quality random floats

Much has been written about how to generate random 64-bit integers; people might argue over which particular (CS)PRNG is best suited for a given task, but there is agreement on the general shape of the solution: keep some state around, and then extract 64 bits of randomness at a time, where each bit is equally likely to be either zero or one. This gives nice uniformly-distributed integers over the range [0, 264).

There is less agreement on how to generate random floating-point values. It is tempting to aim for what looks like a simple transformation of an integer generator: uniformly-distributed floats over the range [0, 1). There are various ways of doing such a transformation; one exposition does it as:

double pcg64_random_double(pcg64_t *rng) {
    return (double)(pcg64_random(rng) >> 11) * 0x1.0p-53;
}

Whereas LuaJIT does it as:

uint64_t lj_prng_u64d(PRNGState *rs) {
    uint64_t z, r = 0;
    TW223_STEP(rs, z, r)
    /* Returns a double bit pattern in the range 1.0 <= d < 2.0. */
    return (r & 0x000fffffffffffffull) | 0x3ff0000000000000ull;
}
/* Then to give d in [0, 1) range: */
U64double u;
double d;
u.u64 = lj_prng_u64d(rs);
d = u.d - 1.0;

And a Go version by Lemire does it as:

// toFloat64 -> [0,1)
func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    x &= 0x1fffffffffffff // %2**53
    return float64(x) / float64(0x1fffffffffffff)
}

The [0, 1) output range is less useful than it might initially appear. It is often stated that [0, 1) can be transformed to [A, B) by multiplying by B-A then adding A, but this is only true for some values of A and B; for other values, a number just less than 1 can be transformed and end up equal to B due to floating-point rounding, i.e. [0, 1) is transformed to [A, B]. The possibility of a zero output is also unhelpful if the result is going to be log-transformed (for example as part of a Box-Muller transform), as log(0) explodes.

An output range of [0, 1] would be better behaved under most additive and multiplicative transforms. After such a transform, a range like [A, B] can be easily turned into [A, B) via rejection sampling, should the half-open range be desired. It will also turn out to be very easy (on either theoretical or practical grounds) to exclude 0 as a possible output, giving the log-friendly output range (0, 1].

Before trying to generate a random float in the range [0, 1], we should instead consider the easier problem of generating a random float in the range [0.5, 1]. There are 252+1 different IEEE-754 doubles in this range. The "rounding basin" of some double d in that range can be defined as the range of infinite-precision real numbers in the range [0.5, 1] that would round to d (assuming round-to-nearest). Taking ε=2-54, most of these doubles have a basin of d-ε through d+ε. The exceptions are the extremes of the range; 0.5 has a basin of 0.5 through 0.5+ε, and 1 has a basin of 1-ε through 1. In other words, there are 252-1 values in the range whose basin is wide, and 2 values in the range whose basin is only ε wide. For a uniform distribution, the probability of seeing d should equal the width of the rounding basin of d. Given all this, there is a fairly simple monotonic transform from the uniform integer range [0, 253) to the uniform float range [0.5, 1]:

Integer(s)Resultant floating-point double (ε=2-54)
00.5
1, 20.5 + 2ε
3, 40.5 + 4ε
5, 60.5 + 6ε
253-7, 253-61 - 6ε
253-5, 253-41 - 4ε
253-3, 253-21 - 2ε
253-11

This transform can be expressed in code like so:

double rand_between_half_and_one() {
    double d;
    uint64_t x = rand_u64() >> 11; /* 53-bit uniform integer */
    x = ((x + 1) >> 1) + (1022ull << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

The range [0.25, 0.5] looks a lot like the range [0.5, 1], except that ε reduces to 2-55. That is, there is a fairly simple monotonic transform from the uniform integer range [0, 253) to the uniform float range [0.25, 0.5]:

Integer(s)Resultant floating-point double (ε=2-55)
00.25
1, 20.25 + 2ε
3, 40.25 + 4ε
5, 60.25 + 6ε
253-7, 253-60.5 - 6ε
253-5, 253-40.5 - 4ε
253-3, 253-20.5 - 2ε
253-10.5

Or in code:

double rand_between_quarter_and_half() {
    double d;
    uint64_t x = rand_u64() >> 11; /* 53-bit uniform integer */
    x = ((x + 1) >> 1) + (1021ull << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

In turn, the range [0.125, 0.25] looks a lot like [0.25, 0.5], and [0.0625, 0.125] looks a lot like [0.125, 0.25], and so on. To generate a float in [0, 1], we need to choose one of these ranges, and then choose a value within that range. For a uniform distribution, the probability of a range should be proportional to the width of the range, so we have:

Output rangeWidthProbability
[0.5, 1]2-150%
[0.25, 0.5]2-225%
[0.125, 0.25]2-312.5%
[0.0625, 0.125]2-46.25%

As the width of the range approaches zero, so does the probability of the range. Once the width is less than 2-75, the probabilities are so small as to be effectively impossible for most practical purposes. By making them actually impossible, we gain a nice guarantee of algorithm termination, and avoid having to think about denormals, and get the log-friendly output range (0, 1]. One way of implementing this in code is:

double rand_between_zero_and_one() {
    double d;
    uint64_t x = rand_u64() >> 11; /* 53-bit uniform integer */
    uint64_t e = 1022;
    do {
      if (rand_u64() & 1) break; /* 1-bit uniform integer */
      e -= 1;
    } while (e > 1022-75);
    x = ((x + 1) >> 1) + (e << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

The above throws away lots of random bits, whereas it would be more economical to remember and use them later:

double rand_between_zero_and_one() {
    double d;
    uint64_t x = rand_u64(); /* 53-bit and 11-bit uniform integers */
    uint64_t bits = x & 0x7ff, nbits = 11;
    uint64_t e = 1022;
    do {
      if (bits & 1) break; /* 1-bit uniform integer */
      bits >>= 1;
      if (--nbits == 0) bits = rand_u64(), nbits = 64;
      e -= 1;
    } while (e > 1022-75);
    x = (((x >> 11) + 1) >> 1) + (e << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

Finally, the loop can be eliminated using bit-counting intrinsics:

double rand_between_zero_and_one() {
    double d;
    uint64_t x = rand_u64();
    uint64_t e = __builtin_ctzll(x) - 11ull;
    if ((int64_t)e >= 0) e = __builtin_ctzll(rand_u64());
    x = (((x >> 11) + 1) >> 1) - ((e - 1011ull) << 52);
    memcpy(&d, &x, sizeof(d));
    return d;
}

Or in Go rather than C:

// toFloat64 -> (0,1]
func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    e := bits.TrailingZeros64(x) - 11
    if e >= 0 {
        e = bits.TrailingZeros64(splitmix64(seed))
    }
    x = (((x >> 11) + 1) >> 1) - ((uint64(int64(e)) - 1011) << 52)
    return math.Float64frombits(x)
}

With that exposition done, I can now present my criteria for assessing the quality of functions that claim to generate uniformly-distributed IEEE-754 double-precision floating-point values over (0, 1] or [0, 1] or [0, 1):

CriteriaMost functionsPresented hereIdeal
Probability of zero2-52 or 2-5302-1075
Smallest non-zero2-52 or 2-532-762-1074
Largest non-seeable1 - 2-53 or 0.5 - 2-542-76 - 2-129-0
Distinct values252 or 253258.2262±1

If you've reached this far, then you might be interested in similar commentary from Taylor R. Campbell or Allen B. Downey or @moon_chilled.

Windows Arm64EC ABI Notes

The basic premise of Microsoft's Arm64EC is that a single virtual address space can contain a mixture of ARM64 code and X64 code; the ARM64 code executes natively, whereas the X64 code is transparently converted to ARM64 code by a combination of JIT and AOT compilation, and ARM64 ⇄ X64 transitions can happen at any function call/return boundary.

There are some good MSDN pages on the topic:

The first function to highlight is RtlIsEcCode; to tell apart ARM64 code and X64 code, the system maintains a bitmap with one bit per 4K page, specifying whether that page contains ARM64 code (bit set) or X64 code (bit clear). This bitmap can be queried using RtlIsEcCode. The loader sets bits in this bitmap when loading DLLs, as does VirtualAlloc2 when allocating executable memory with MEM_EXTENDED_PARAMETER_EC_CODE in MemExtendedParameterAttributeFlags.

After the X64 emulator executes a call or ret or jmp that crosses a page boundary, it needs to determine whether the new rip points to ARM64 code or X64 code. If !RtlIsEcCode(rip), then the X64 emulator can happily continue emulating X64 code. However, if RtlIsEcCode(rip), then a transition out of the emulator needs to be performed. This can be a call-like transition (X64 calling ARM64), or a return-like transition (X64 returning to ARM64). The transition type is determined by looking at the four bytes before rip; if they contain the encoding of blr x16 (0xd63f0200), then a return-like transition is performed by setting pc to rip. Otherwise, a call-like transition is performed by setting x9 to rip and setting pc to rip's entry thunk. To find the entry thunk, the four bytes before rip are used; the low two bits need to be 0b01, and after masking off said bits, the 32 bits are sign extended and then added to rip (the resultant value must be different to rip, i.e. a function cannot be its own entry thunk).

Before transferring control to the entry thunk, the X64 emulator performs a little manipulation:

ldr lr, [sp], #8 // pop return address (skipping sp alignment check)
mov x4, sp

After this, it ensures sp is 16-byte aligned:

if (unlikely(sp & 8)) {
  str lr, [sp, -#8]! // push return address again (again skip check)
  adr lr, x64_ret_stub // an X64 funclet containing just "ret"
}

In other words, the stack will look like whichever of the following diagrams gives rise to an aligned sp:

◀- lower addresses    x4                 higher addresses -▶
                      |
                      ▼
           ... retaddr home0 home1 home2 home3 arg4 arg5 ...
                      ▲
                      |
                      sp                        lr = retaddr
◀- lower addresses    x4                 higher addresses -▶
                      |
                      ▼
           ... retaddr home0 home1 home2 home3 arg4 arg5 ...
              ▲
              |
              sp                           lr = x64_ret_stub

The 32 bytes of X64 home space begin at x4, and any arguments passed on the stack begin at x4+#32. The entry thunk is free to use the X64 home space if it wants, and some of the MSDN documentation suggests using it as a place to save q6 and q7, but this is problematic, as to later restore from this space, we'd need to save x4 (not to mention that no unwind codes can load from x4). In practice, only 24 of the 32 bytes are easily usable; sp+#8 through sp+#32 will always coincide with some 24 bytes of the home space.

The entry thunk can either be a copy of the original function that speaks the X64 ABI, or it can copy the arguments from their X64 ABI locations to their Arm64EC ABI locations and then make a call to the original function (helpfully provided in x9) and then transfer the results to their X64 ABI locations. If the original function is vararg, then the Arm64EC ABI dictates that x4 should contain a pointer to the stack arguments (so add #32) and that x5 should contain the size of the stack arguments (which is not generally known; MSVC-generated thunks populate #0 for x5 in this case).

In either case, the entry thunk needs to ensure that X64 ABI non-volatile registers are preserved (which translates to ARM64 ABI non-volatile registers, plus q6 through q15), and then needs to return to X64 once it is done. That is achieved by means of a tailcall to __os_arm64x_dispatch_ret, which resumes X64 execution at lr.

If the original function modified arguments in-place and then tailcalled something else (an adjustor function), then the entry thunk for it can instead be an adjustor thunk: modify the arguments in-place (in their X64 ABI locations), put the tailcall target in x9, and then tailcall __os_arm64x_x64_jump. If x9 points to ARM64 code, then __os_arm64x_x64_jump will tailcall x9's entry thunk (which will then consume arguments from their X64 locations), whereas if x9 points to X64 code, then it'll act like __os_arm64x_dispatch_call_no_redirect (which will again consume arguments from their X64 locations).

The other side of things is when native ARM64 code wants to make an indirect function call. The whole premise of Arm64EC is that function pointers at ABI boundaries can point to either ARM64 functions or X64 functions, and a-priori the caller doesn't know which it has. The recommendation is to put the call target in x11 and then call __os_arm64x_dispatch_icall, which will leave x11 as-is if it is ARM64 code, or copy x10 to x11 if not.

Note that __os_arm64x_dispatch_icall is not a function per-se, but instead a (per-module) global variable containing a function pointer. The distinction is important, as it affects how to call it; it cannot be a bl target, instead it has to be loaded into a register (e.g. by adrp and ldr) and then be used as a blr target. The linker and loader conspire to put LdrpValidateEcCallTarget in this variable, the pseudocode for which is:

LdrpValidateEcCallTarget:  // aka. __os_arm64x_dispatch_icall
  // Specially written to only mutate x16/x17
  // (and x9/x11 where indicated)
  if (RtlIsEcCode(x11)) {
    // x11 unchanged
    // x9 unspecified, in practice unchanged from input
  } else {
    x9 = ResolveFastForwardSequences(x11);
    if (RtlIsEcCode(x9)) {
      x11 = x9
      // x9 unspecified, in practice same as exit x11
    } else {
      x11 = x10
      // x9 specified as X64 pointer
    }
  }
  // x0-x8,x10,x15 specified as unchanged from input
  // q0-q7 specified as unchanged from input
  // x12 unspecified, in practice unchanged from input
  return;

ResolveFastForwardSequences(uintptr_t p):
  // jmp qword [rip+imm32]
  if (bytes_match(p, "ff 25 ?? ?? ?? ??")) {
    p += 6;
    int32_t imm32 = ((int32_t*)p)[-1];
    p = *(uintptr_t*)(p + imm32);
    return ResolveFastForwardSequences(p);
  }
  if (p & 15) {
    return p;
  }
  // mov rax, rsp; mov [rax+32], rbx; push rbp; pop rbp; jmp rel32
  // mov rdi, rdi; push rbp; mov rbp, rsp; pop rbp; nop; jmp rel32
  if (bytes_match(p, "48 8b c4 48 89 58 20 55 5d e9 ?? ?? ?? ??")
  ||  bytes_match(p, "48 8b ff 55 48 8b ec 5d 90 e9 ?? ?? ?? ??")) {
    p += 14;
    int32_t rel32 = ((int32_t*)p)[-1];
    p += rel32;
    return ResolveFastForwardSequences(p);
  }
  // mov r10, rcx; mov eax, imm32; test byte [0x7ffe0308], 0x01
  // jne rip+3; syscall; ret; jne_target: int 0x2e; ret
  if (bytes_match(p   , "4c 8b d1 b8 ?? ?0 00 00 f6 04 25 08")
  &&  bytes_match(p+12, "03 fe 7f 01 75 03 0f 05 c3 cd 2e c3")) {
    uint32_t imm32 = *(uint32_t*)(p + 4);
    return SyscallFunctionTable[imm32];
  }
  return p;

The related __os_arm64x_dispatch_icall_cfg is also a (per-module) global variable containing a function pointer; courtesy of the linker and loader it'll end up containing either LdrpValidateEcCallTarget or LdrpValidateEcCallTargetCfg or LdrpValidateEcCallTargetCfgES, the latter two of which perform a control-flow-guard (CFG) test against x11 and then tailcall LdrpValidateEcCallTarget.

If LdrpValidateEcCallTarget found X64 code, then x9 will end up containing the X64 pointer, and x11 will end up containing whatever was in x10. It is expected that the caller put the address of an exit thunk in x10, so that blr x11 will either perform the ARM64 call or call the exit thunk for an X64 call.

The exit thunk should set up a call frame, copy arguments from their Arm64EC ABI locations to their X64 ABI locations, perform the X64 call, transfer the result back to its Arm64EC location, then tear down its call frame and return. To perform the X64 call, it should put the X64 pointer in x9 (if not already in x9), and then call __os_arm64x_dispatch_call_no_redirect. If making a vararg call, then this is where x5 comes in to play, as the exit thunk needs to copy the stack-based arguments to a new location at the bottom of the stack.

The call to __os_arm64x_dispatch_call_no_redirect must be done as blr x16; the ARM64 CPU will set lr to the next ARM64 instruction, the X64 emulator will then push lr on to the stack as the return address, and when the emulated X64 code eventually tries to resume execution at this address, the emulator will recognise the blr x16 preceding this address and perform a return-like transition.

The (per-module) global variables __os_arm64x_dispatch_call_no_redirect and __os_arm64x_dispatch_ret and __os_arm64x_x64_jump end up pointing to functions exported from xtajit.dll. Again the linker and loader conspire to populate these variables (this is not the usual GOT / IAT machinery, though the end effect is similar). All three functions are entry points to the X64 emulator, and follow a similar pattern:

__os_arm64x_dispatch_call_no_redirect__os_arm64x_dispatch_ret__os_arm64x_x64_jump
Intended usageCall X64 code (usually from exit thunk)Return to X64 code after entry thunkTailcall ARM64 or X64 code in adjustor thunk
Architecture checkNone (target assumed to be X64)None (target assumed to be X64)If RtlIsEcCode(x9) then tailcall x9's entry thunk
Stack manipulationPush lr (so that X64 can return)None (was done pre-entry-thunk)Push lr (revert pre-entry-thunk manipulation)
Passthrough registers
(extra non-volatiles)
x0 through x3, q0 through q15x8, q0 through q3, q6 through q15x0 through x3, q0 through q15
(plus x4 through x9 if tailcalling ARM64)
X64 execution targetx9lrx9

The observant reader will notice that __os_arm64x_x64_jump isn't quite sufficient for its intended usage. If it was sufficient, then its pseudocode would be something like:

__os_arm64x_x64_jump:     // hoped-for implementation
  if (RtlIsEcCode(x9)) {
    x11 = GetEntryThunk(x9)
    br x11
  } else {
    x9 = ResolveFastForwardSequences(x9)
    if (RtlIsEcCode(x9)) {
      x11 = GetEntryThunk(x9)
      br x11
    } else {
      revert the pre-entry-thunk stack manipulation
      br_to_X64_mode x9
    }
  }

Unfortunately, at least in the version of Windows 11 that I'm using right now, its pseudocode is more like:

__os_arm64x_x64_jump:      // actual implementation
  if (RtlIsEcCode(x9)) {
    x11 = GetEntryThunk(x9)
    br x11
  } else {
    str lr, [sp, #-8]! // push lr
    br_to_X64_mode x9
  }

In other words, it is deficient in two ways:

  1. Not checking for fast-forward sequences. This is unfortunate from a performance perspective, but not otherwise harmful.
  2. Assuming that the pre-entry-thunk stack manipulation was just popping into lr. This is true most of the time, but not all the time, and all sorts of mayhem can arise if this assumption is wrong.

To address point 2, the tail of __os_arm64x_x64_jump wants to be more like the following:

if (likely(sp == x4)) {
  str lr, [sp, #-8]! // push lr
  br_to_X64_mode x9
} else {
  br_to_X64_mode x9
}

A more complex, but ultimately more useful, variant of the above is:

if (likely(sp == x4)) {
  str lr, [sp, #-8]! // push lr
  br_to_X64_mode x9
} else {
  str x9, [sp, #-8]! // push x9
  adr lr, x64_ret_stub
  br_to_X64_mode lr
}

The adr in the above can be dropped, as if sp != x4, it means that the pre-entry-thunk manipulation will already have put x64_ret_stub in lr. At this point, the two branches differ only in swapping x9/lr, and so the swap can be pulled out:

if (unlikely(sp != x4)) {
  swap(x9, lr)
  mov x4, sp
}
str lr, [sp, #-8]! // push lr
br_to_X64_mode x9

Following this line of thinking, we can sidestep this particular deficiency of __os_arm64x_x64_jump by inserting the following immediately before the br to __os_arm64x_x64_jump:

cmp sp, x4          // sp == x4?
mov x4, x9
csel x9, x9, lr, eq // If not, swap
csel lr, lr, x4, eq // x9 and lr.
mov x4, sp

Alternatively, we could provide our own version of __os_arm64x_x64_jump, built using __os_arm64x_check_icall and __os_arm64x_dispatch_call_no_redirect, which fixes both deficiencies:

my_arm64x_x64_jump:
  adrp x17, __os_arm64x_check_icall
  ldr x17, [x17, __os_arm64x_check_icall]
  mov x11, x9 // __os_arm64x_check_icall main input
  adr x10, after_blr // __os_arm64x_check_icall thunk input
  stp fp, lr, [sp, #-16]!
  mov fp, sp
  blr x17 // __os_arm64x_check_icall
after_blr:
  ldrsw x16, [x11, #-4] // load entry thunk offset
  adrp x17, __os_arm64x_dispatch_call_no_redirect
  cmp x10, x11 // is x64 target?
  ldr x17, [x17, __os_arm64x_dispatch_call_no_redirect]
  csel x10, x9, x11, eq // x10 will become x9 later
  sub x11, x11, #1 // bias for 0b01 in entry thunk offset
  sub x9, sp, x4
  add x11, x11, x16 // entry thunk address
  ldp fp, lr, [sp], #16
  csel x11, x17, x11, eq
  ccmn x9, #16, #4, eq
  csel x9, x10, lr, eq
  csel lr, lr, x10, eq
  br x11 // entry thunk or __os_arm64x_dispatch_call_no_redirect

Windows ARM64 Frame Unwind Code Details

This post is intended to supplement MSDN's ARM64 exception handling information. To start, here's a fleshed-out version of a table listing the available unwind codes:

NameEncodingProlog InstructionEpilog InstructionUnwind Effect
alloc_s000iiiiisub sp, sp, i*16add sp, sp, i*16Emulate epilog instruction
alloc_m11000iiiiiiiiiiisub sp, sp, i*16add sp, sp, i*16Emulate epilog instruction
alloc_l11100000i:u24sub sp, sp, i*16add sp, sp, i*16Emulate epilog instruction
add_fp11100010i:u8add fp, sp, i*8 (NB: not 16)sub sp, fp, i*8 (NB: not 16)Emulate epilog instruction
set_fp11100001mov fp, spmov sp, fpEmulate epilog instruction
pac_sign_lr11111100pacibsp (sign lr using sp)autibsp (authenticate lr using sp)Emulate autibsp or xpaclri
save_fplr01iiiiiistp fp, lr, [sp+i*8]ldp fp, lr, [sp+i*8]Emulate epilog instruction
save_lrpair1101011nnniiiiistp x(19+2*n), lr, [sp+i*8]ldp x(19+2*i), lr, [sp+i*8]Emulate epilog instruction
save_fplr_x10iiiiiistp fp, lr, [sp-(i+1)*8]!ldp fp, lr, [sp], (i+1)*8Emulate epilog instruction
save_regp110010nnnniiiiiistp x(19+n), x(20+n), [sp+i*8]ldp x(19+n), x(20+n), [sp+i*8]Emulate epilog instruction
save_regp_x110011nnnniiiiiistp x(19+n), x(20+n), [sp-(i+1)*8]!ldp x(19+n), x(20+n), [sp], (i+1)*8Emulate epilog instruction
save_r19r20_x001iiiiistp x19, x20, [sp-i*8]! (NB: not i+1)ldp x19, x20, [sp], i*8 (NB: not i+1)Emulate epilog instruction
save_next11100110stp similar to previous non-lr stp ldp similar to next non-lr ldp Extend next effect by two
save_reg110100nnnniiiiiistr x(19+n), [sp+i*8]ldr x(19+n), [sp+i*8]Emulate epilog instruction
save_reg_x1101010nnnniiiiistr x(19+n), [sp-(i+1)*8]!ldr x(19+n), [sp], (i+1)*8Emulate epilog instruction
save_fregp1101100nnniiiiiistp d(8+n), d(9+n), [sp+i*8]ldp d(8+n), d(9+n), [sp+i*8]Emulate epilog instruction
save_fregp_x1101101nnniiiiiistp d(8+n), d(9+n), [sp-(i+1)*8]!ldp d(8+n), d(9+n), [sp], (i+1)*8Emulate epilog instruction
save_freg1101110nnniiiiiistr d(8+n), [sp+i*8]ldr d(8+n), [sp+i*8]Emulate epilog instruction
save_freg_x11011110nnniiiiistr d(8+n), [sp-(i+1)*8]!ldr d(8+n), [sp], (i+1)*8Emulate epilog instruction
save_any_reg11100111x:u16 *One str or stp to stack *One ldr or ldp from stack *Varies *
custom11101xxxNo instructionNo instructionBespoke
reserved11110xxxNo instructionNo instructionUnwind fails
reserved11111000x:u8Any one instructionAny one instructionNo effect (yet)
reserved11111001x:u16Any one instructionAny one instructionNo effect (yet)
reserved11111010x:u24Any one instructionAny one instructionNo effect (yet)
reserved11111011x:u32Any one instructionAny one instructionNo effect (yet)
reserved11111101Any one instructionAny one instructionNo effect (yet)
reserved11111110Any one instructionAny one instructionNo effect (yet)
reserved11111111Any one instructionAny one instructionNo effect (yet)
nop11100011Any one instructionAny one instructionNo effect
end_c11100101No instruction, start of prolog Any one instruction, then end of epilog No effect
end11100100No instruction, start of prologret (or tailcall b), then end of epilogUnwind complete

(*) save_any_reg was added for Arm64EC. The 16-bit payload is made from a number of fields, packed as rpxnnnnnmmiiiiii. This gives rise to many different variants:

NameEncodingProlog InstructionEpilog InstructionUnwind Effect
save_any_reg11100111000nnnnn00iiiiiistr x(0+n), [sp+i*8]ldr x(0+n), [sp+i*8]Emulate epilog instruction
save_any_reg11100111001nnnnn00iiiiiistr x(0+n), [sp-(i+1)*16]!ldr x(0+n), [sp], (i+1)*16Emulate epilog instruction
save_any_reg11100111010nnnnn00iiiiiistp x(0+n), x(1+n), [sp+i*16]ldp x(0+n), x(1+n), [sp+i*16]Emulate epilog instruction
save_any_reg11100111011nnnnn00iiiiiistp x(0+n), x(1+n), [sp-(i+1)*16]!ldp x(0+n), x(1+n), [sp], (i+1)*16Emulate epilog instruction
save_any_reg11100111000nnnnn01iiiiiistr d(0+n), [sp+i*8]ldr d(0+n), [sp+i*8]Emulate epilog instruction
save_any_reg11100111001nnnnn01iiiiiistr d(0+n), [sp-(i+1)*16]!ldr d(0+n), [sp], (i+1)*16Emulate epilog instruction
save_any_reg11100111010nnnnn01iiiiiistp d(0+n), d(1+n), [sp+i*16]ldp d(0+n), d(1+n), [sp+i*16]Emulate epilog instruction
save_any_reg11100111011nnnnn01iiiiiistp d(0+n), d(1+n), [sp-(i+1)*16]!ldp d(0+n), d(1+n), [sp], (i+1)*16Emulate epilog instruction
save_any_reg11100111000nnnnn10iiiiiistr q(0+n), [sp+i*16] (NB: not 8)ldr q(0+n), [sp+i*16] (NB: not 8)Emulate epilog instruction
save_any_reg11100111001nnnnn10iiiiiistr q(0+n), [sp-(i+1)*16]!ldr q(0+n), [sp], (i+1)*16Emulate epilog instruction
save_any_reg11100111010nnnnn10iiiiiistp q(0+n), q(1+n), [sp+i*16]ldp q(0+n), q(1+n), [sp+i*16]Emulate epilog instruction
save_any_reg11100111011nnnnn10iiiiiistp q(0+n), q(1+n), [sp-(i+1)*16]!ldp q(0+n), q(1+n), [sp], (i+1)*16Emulate epilog instruction
reserved111001111xxxxxxxxxxxxxxxAny one instructionAny one instructionUnwind fails
reserved11100111xxxxxxxx11xxxxxxAny one instructionAny one instructionUnwind fails

() The save_next unwind code requires further explanation. It causes the next unwind code to load four registers rather than two. It can be stacked; a sequence of N save_next unwind codes causes the next unwind code thereafter to load (N+1)*2 registers. Said next unwind code must be one of save_regp / save_regp_x / save_r19r20_x / save_fregp / save_fregp_x (notably, this excludes pair instructions with lr in their name). As examples:

The MSDN documentation suggests that a sufficiently long sequence of save_next codes can overflow from x registers to d registers, but best not to try this; stop at x31 or d15.

() In a prolog, end_c corresponds to no instruction, and also causes subsequent codes to correspond to no instruction. In an epilog, end_c corresponds to ret (or b, or any other one instruction that has no unwind effect), and causes subsequent codes to correspond to no instruction. In either case, subsequent codes (up to the first end) are still executed during unwind.

Moving on, each .xdata record describes a contiguous region of machine instructions. There is typically a 1:1 correspondence between regions and functions, though this needn't be true. A region contains:

If any of the above limits would be exceeded, then the region needs to be artifically split into multiple smaller regions, such that no limit is exceeded. The MSDN documentation suggests that split boundaries should be chosen as to not be in the middle of an epilog (this suggestion would not be required if end_c was treated as "No instruction, end of epilog", but alas it is treated as "Any one instruction, then end of epilog").

The length of the prolog is not explicitly given; it is calculated by analysing the unwind codes strictly preceding the first end or end_c and counting how many of them correspond to one instruction. Said unwind codes, once reversed, should correspond 1:1 with instructions in the prolog (nop codes can be used to make things line up, if there are instructions in the prolog which do not need an unwind effect). If unwinding from within the prolog, we calculate how many instructions we are away from the end of the prolog, and skip that many unwind codes from the start of the list. To indicate a lack of prolog, the first unwind code should be end_c (or end if no unwind effects are required by the main body of the region). Note that unwind effects after end_c are still executed if unwinding from within the prolog (though there's a Wine bug here).

The length of each epilog is not explicitly given; it is calculated by analysing the unwind codes starting at the epilog-specific offset and continuing up to and including the first end or end_c thereafter. Said unwind codes should correspond 1:1 with instructions in the epilog (nop codes can be used to make things line up, if there are instructions in the epilog which do not need an unwind effect). If unwinding from within an epilog, we calculate how many instructions we are away from the start of the epilog, and skip that many unwind codes from the start of the list. Note that unwind effects after end_c are still executed if unwinding from within an epilog (though again Wine bug).

If the E bit is set in the .xdata header, then the end of the epilog is equal to the end of the region (the start of the epilog is found by calculating the length of the epilog and subtracting this from the end). If the E bit is not set, then the start of each epilog is explicitly specified.

If unwinding from a PC not within a prolog or epilog, unwind codes are executed until the first end is reached. Note that these codes are shared with those describing the prolog; this is not normally a problem, but if it is, the prolog can be split off to a separate region consisting solely of prolog, leaving no prolog in the original region.

If the X bit is set in the .xdata header, and PC is not within a prolog or epilog, then the specified function is called during exception handler search and during exception unwinding. See __C_specific_handler for the prototype of this function, and see LuaJIT for a concrete example.

A whirlwind tour of AArch64 vector instructions (NEON)

32 vector registers, each 128 bits wide. Also a control register (FPCR) and a status register (FPSR), though scalar comparison instructions use PSTATE. Each vector register can contain:

128 bits might seem small compared to AVX-512, but for most vector instructions, the M1's Firestorm cores can issue four instances of the instruction per cycle, which gets you to a similar place. There are also AMX units on the M1.

A lane is 1/8/16/32/64/128 bits wide. The common lane types are:

1-bit8-bit16-bit32-bit64-bit128-bit
uintbituint8 [q]uint16 [q]uint32 [q]uint64 [q]
sintsint8 [q]sint16 [q]sint32 [q]sint64 [q]
fpfp16 [a]fp32fp64
bfbf16 [b]
polybitpoly8poly16poly64 [c]poly128 [c]
[a] Requires fp16 extension, and improved further by fp16fml extension.
[b] Requires bf16 extension (present on Apple M2, but not on M1).
[c] Requires crypto extension.
[q] Often with the choice of truncate on overflow or saturate on overflow.

The syntactic names for the 32 registers vary based on the lane width and also how much of the register is being used:

Lane widthLow scalarLow 64 bitsAll 128 bits
1 bit N/AV0.8BV31.8BV0.16BV31.16B
8 bitsB0B31V0.8BV31.8BV0.16BV31.16B
16 bitsH0H31V0.4HV31.4HV0.8HV31.8H
32 bitsS0S31V0.2SV31.2SV0.4SV31.4S
64 bitsD0D31V0.1DV31.1DV0.2DV31.2D
128 bitsQ0Q31N/AV0.1QV31.1Q
In some contexts, individual lanes can be referenced. The syntax for this appends [0][15] to one of the above, for example V7.4S[3] denotes the most significant 32-bit lane of V7. Writes to a single lane with this syntax generally preserve other lanes, whereas in all other cases, writes to the low bits of a register generally zero the remaining bits (though see FPCR.NEP).

Vector instructions have up to six input registers, and up to one output register (with the exception of loads, which can have up to four output registers). In most cases, all the registers involved in a single instruction have the same lane width, though there are exceptions:

In most cases, operations are lane-wise: lane i of the output register is formed by combining lane i of each input register, though there are exceptions:

GPR to / from vector

InstructionDirectionBehaviour
FMOVGPR to vectorTruncate to lane width, then zero extend
DUPGPR to vectorTruncate then replicate to all lanes
INSGPR to vectorTruncate then insert to arbitrary lane
FMOVVector to GPRTake low lane, zero extend to GPR width
UMOVVector to GPRArbitrary lane, zero extend to GPR width
SMOVVector to GPRArbitrary lane, sign extend to GPR width

A number of data type conversion instrucions can also have GPR as source or destination (see "Data type conversion, integer to float" and "Data type conversion, float to integer").

Load / store

A scalar load moves 8/16/32/64 bits from memory to (part of) a vector register, whereas a vector load moves 64/128 bits. This can be repeated up to four times, reading from consecutive memory locations and writing to distinct vector registers (which must be consecutively numbered, except for LDP / LDNP).

×1×2×3×4
Scalar (low lane,
zero others)
LDR
LDUR
LDP
LDNP
Scalar (any lane,
preserve others)
LD1(SS 1R)LD2(SS 2R)LD3(SS 3R)LD4(SS 4R)
Scalar (replicate
to all lanes)
LD1RLD2RLD3RLD4R
VectorLD1(MS 1R)
LDR
LDUR
LD1(MS 2R)
LDP
LDNP
LD1(MS 3R)LD1(MS 4R)
Vector, transposedLD2(MS 2R)LD3(MS 3R)LD4(MS 4R)

With the exception of scalar replicating to all lanes, all load instructions have a corresponding store instruction performing the inverse - replace LD with ST.

The SS or MS suffix denotes what the ARM reference manual calls "single structure" or "multiple structures", and is followed by the number of destination registers (1, 2, 3, or 4). The operand syntax relates to this suffix.

The vector transposed loads perform one of the following tranposes, where M denotes the number of destination registers (2, 3, or 4):

8-bit lanes16-bit lanes32-bit lanes64-bit lanes
64-bit vectors8×M ↦ M×84×M ↦ M×42×M ↦ M×2
128-bit vectors16×M ↦ M×168×M ↦ M×84×M ↦ M×42×M ↦ M×2

The addressing modes are all over the place:

BaseOffsetWriteback Mode
LDRXn or SPsigned imm9Pre- or post-index
LDURXn or SPsigned imm9No writeback
LDRXn or SPunsigned imm12, scaledNo writeback
LDRXn or SPXm or Wm, optional extend/scaleNo writeback
LDRPCsigned imm19, times 4No writeback
LDPXn or SPsigned imm7, scaledPre- or post-index
LDPXn or SPsigned imm7, scaledNo writeback
LDNPXn or SPsigned imm7, scaledPre- or post-index
LDNPXn or SPsigned imm7, scaledNo writeback
OthersXn or SPunsigned imm1, scaled (or Xm)Post-index

Data movement

Moving lanes around:

Source laneDestination laneOther lanes
FMOVLow laneLow laneZeroed
MOV (DUP)ArbitraryLow laneZeroed
MOV (INS)ArbitraryArbitraryPreserved
DUPArbitraryAll lanes (replicated)N/A
MOV (ORR)All lanesAll lanes (1:1)N/A

Various reversals:

in bytesin u16sin u32sin u64sin u128s
Reverse bitsRBIT
Reverse bytesno-opREV16REV32REV64TBL
Reverse u16sno-opREV32REV64TBL
Reverse u32sno-opREV64TBL
Reverse u64sno-opEXT

Various ways of creating one vector from two:

First stepSecond step
TRN1, TRN2Discard odd or even lanesInterleave lanes
ZIP1, ZIP2Discard high half or low halfInterleave lanes
UZP1, UZP2Concatenate vectorsDiscard odd or even lanes
EXTConcatenate vectorsTake a contiguous 16-byte span
Note that EXT with the same source vector for both operands gives a rotation by a whole number of bytes. Note that some math instructions come with a free UZP1 and UZP2 (annotated as [p]).

The TBL, TBX family concatenate 1-4 vectors from consecutively numbered registers to form a table T of 16/32/48/64 bytes, then another byte vector serves as indices into said table:

Per-byte behaviour
TBLDi = (Yi < len(T)) ? T[Yi] : 0
TBXDi = (Yi < len(T)) ? T[Yi] : Di
Note that x86 pshufb does Di = (Yi < 128) ? X[Yi & 15] : 0, which is similar.

Immediates

One flavour of FMOV loads constants of the form ±(1.0 + m/16) × 2e (where 0 ≤ m ≤ 15 and −3 ≤ e ≤ 4) into the low fp16/fp32/fp64 lane, and either zeros the other lanes or replicates the constant to the other lanes.

One flavour of MOVI loads a constant into the low 64-bit lane, each byte of which is independently either 0 or 255, and either zeros the high u64 lane or replicates the constant there.

The remaining flavours of MOVI and MVNI load a constant into the low 8/16/32-bit lane, one byte of which is an arbitrary value, bytes to the left either all 0 or all 255, and bytes to right either all 0 or all 255, then replicates this constant to all lanes.

The bitwise BIC and ORR instructions support constants of the form: 16/32-bit lane, one byte of which is an arbitrary value, other bytes all 0, replicated to all lanes.

Various comparison instructions support comparison against constant zero:

SignedUnsignedFloating
X == 0CMEQ X, #0CMEQ X, #0FCMEQ X, #0.0
X <= 0CMLE X, #0CMEQ X, #0FCMLE X, #0.0
X < 0CMLT X, #0always falseFCMLT X, #0.0
X > 0CMGT X, #0CMTST X, XFCMGT X, #0.0
X >= 0CMGE X, #0always trueFCMGE X, #0.0
X <=> 0N/AN/AFCMP X, #0.0

Shifts

Note >>R is used to denote a rounding right shift: if the most significant bit shifted out was a 1, then 1 is added to the result. If shifting right by N, this is equivalent to adding 1 << (N - 1) to the input before shifting.

In-lane shifts by immediate:

Per-lane behaviour
SHLD = X << N
SQSHL, UQSHLD = sat(X << N)
SQSHLUD = sat(X << N) (X signed, D unsigned)
SLID = (X << N) | bzhi(D, N) (bzhi clears all but the low N bits)
SSHR, USHRD = X >> N
SRSHR, URSHRD = X >>R N
SSRA, USRAD += X >> N
SRSRA, URSRAD += X >>R N
SRID = (X >> N) | bzlo(D, N) (bzlo clears all but the high N bits)

In-lane variable shifts, where the shift amount is a signed value in the low 8 bits of each lane in the 2nd operand:

Per-lane behaviour (Y > 0)Per-lane behaviour (Y < 0)
SSHL, USHLD = X << YD = X >> -Y
SRSHL, URSHLD = X << YD = X >>R -Y
SQSHL, UQSHLD = sat(X << Y)D = X >> -Y
SQRSHL, UQRSHLD = sat(X << Y)D = X >>R -Y

Widening shifts by immediate (from 8-bit lanes to 16-bit, 16-bit lanes to 32-bit, or 32-bit lanes to 64-bit), where X is sign-extended or zero-extended before use, and D lanes are twice the width of X lanes:

Per-lane behaviour (twice-width D)
SSHLL, SSHLL2, USHLL, USHLL2D = X << N (where 0 ≤ N < bitwidth(X))
SHLL, SHLL2D = X << bitwidth(X)

Narrowing shifts by immediate (from 64-bit lanes to 32-bit, 32-bit lanes to 16-bit, or 16-bit lanes to 8-bit), where D lanes are half the width of X lanes. In all cases, 1 ≤ N ≤ bitwidth(D):

Per-lane behaviour (half-width D)
XTN, XTN2D = truncate(X)
SHRN, SHRN2D = truncate(X >> N)
RSHRN, RSHRN2D = truncate(X >>R N)
SQXTN, SQXTN2D = sat(X) (signed)
UQXTN, UQXTN2D = sat(X) (unsigned)
SQXTUN, SQXTUN2D = sat(X) (X signed, D unsigned)
SQSHRN, SQSHRN2D = sat(X >> N) (signed)
UQSHRN, UQSHRN2D = sat(X >> N) (unsigned)
SQSHRUN, SQSHRUN2D = sat(X >> N) (X signed, D unsigned)
SQRSHRN, SQRSHRN2D = sat(X >>R N) (signed)
UQRSHRN, UQRSHRN2D = sat(X >>R N) (unsigned)
SQRSHRUN, SQRSHRUN2D = sat(X >>R N) (X signed, D unsigned)

There is no narrowing shift from 8-bit lanes to something narrower. This is a notable difference to x86, where pmovmskb can pull 1 bit out of every 8-bit lane. That said, SHRN from 16-bit lanes to 8-bit lanes with N set to 4 does something interesting: it pulls 4 bits out of every 8-bit lane, taking the high 4 bits of even lanes and the low 4 bits of odd lanes. Alternating between high/low halves is weird, but innocuous if every 8-bit lane starts out containing either 0 or 255.

Subject to the sha3 extension, two instructions provide rotates in 64-bit lanes:

Per-lane behaviour (64-bit lanes only)
RAX1D = X xor rotate_left(Y, 1)
XARD = rotate_right(X xor Y, N)

Note that rotate by immediate (without xor, for any lane width, and without needing sha3) can be constructed from SHL followed by USRA.

Bitwise

Assorted instructions operating bitwise:

DescriptionPer-bit behaviour
ANDAndD = X and Y
BCAX [s]Clear and xorD = X xor (Y and not Z)
BICClearD = X and not Y
BICClear (immedate)D = D and not imm [i]
BIFInsert if falseD = Y ? D : X
BITInsert if trueD = Y ? X : D
BSLSelectD = D ? X : Y
EORXorD = X xor Y
EOR3 [s]Xor (three-way)D = X xor Y xor Z
NOTNotD = not X
ORNOr notD = X or not Y
ORROrD = X or Y
ORROr (immediate)D = D or imm [i]
[s] Requires sha3 extension.
[i] Immediate is a 16-bit or 32-bit constant where one byte is an arbitrary imm8 and other bytes are zero, broadcast to all 16-bit or 32-bit lanes.

Bit counting instructions:

CountsPossible lane widths
CLSLeading sign bits8-bit, 16-bit, 32-bit
CLZLeading zero bits (i.e. lzcnt)8-bit, 16-bit, 32-bit
CNTNon-zero bits (i.e. popcnt)8-bit [a]
[a] Other lane widths can be achieved by a follow-up UADDLP or ADDV.

There are no horizontal bitwise instructions in the traditional sense, though various horizontal reductions can be constructed from other instructions:

Integer math

Assorted instructions operating lanewise, with 8/16/32/64-bit lanes:

Per-lane behaviour
ABSD = abs(X)
SQABSD = sat(abs(X))
NEGD = -X
SQNEGD = sat(-X)
ADD, SUBD = X ± Y
ADDPD = A + B [p]
ADDVD0 = X0 + X1 + ⋯ + Xn-1 [v]
SQADD, UQADD, SUQADD, USQADDD = sat(X + Y)
SQSUB, UQSUBD = sat(X - Y)
SABD, UABD [d]D = abs(X - Y)
SABA, UABA [d]D += abs(X - Y)
SHADD, SHSUB, UHADD, UHSUB [d]D = (X ± Y) >> 1
SRHADD, URHADD [d]D = (X + Y) >>R 1
MUL [b] [d]D = X * Y
MLA, MLS [b] [d]D ±= X * Y
SQDMULH [a] [b] [d]D = sat((2 * X * Y) >> bitwidth(D))
SQRDMULH [a] [b] [d]D = sat((2 * X * Y) >>R bitwidth(D))
SQRDMLAH, SQRDMLSH [a] [b] [d]D = sat(D ± ((2 * X * Y) >>R bitwidth(D)))
SMIN, UMIN [d]D = min(X, Y)
SMINP, UMINP [d]D = min(A, B) [p]
SMINV, UMINV [d]D0 = min(X0, X1, …, Xn-1) [v]
SMAX, UMAX [d]D = max(X, Y)
SMAXP, UMAXP [d]D = max(A, B) [p]
SMAXV, UMAXV [d]D0 = max(X0, X1, …, Xn-1) [v]
CMEQ [z]D = (X == Y) ? ones_mask : 0
CMGE, CMHS [z]D = (X >= Y) ? ones_mask : 0
CMGT, CMHI [z]D = (X > Y) ? ones_mask : 0
CMTSTD = (X & Y) ? ones_mask : 0
[a] Not available for 8-bit lanes. Not available as unsigned.
[b] When using 16/32-bit lanes, can broadcast a single lane of Y to all lanes of Y.
[d] Not available for 64-bit lanes.
[p] Ai is concat(X, Y)2*i+0, Bi is concat(X, Y)2*i+1, i.e. adjacent pairs.
[v] Low lane of D gets sum/min/max of all lanes of X, rest of D cleared.
[z] Operands can be registers or constant zero (at least logically).

Assorted instructions operating lanewise, with 8/16/32-bit lanes for X and Y, and D lanes twice as wide (X/Y sign-extended or zero-extended before use):

Per-lane behaviour (twice-width D)
SXTL, SXTL2, UXTL, UXTL2D = X (i.e. just sign/zero extend)
SABDL, SABDL2, UABDL, UABDL2D = abs(X - Y)
SABAL, SABAL2, UABAL, UABAL2D += abs(X - Y)
SADDL, SADDL2, UADDL, UADDL2D = X + Y
SSUBL, SSUBL2, USUBL, USUBL2D = X - Y
SADDLP, UADDLPD = A + B [p]
SADALP, UADALPD += A + B [p]
SADDLV, UADDLVD0 = X0 + X1 + ⋯ + Xn-1 [v]
SMULL, SMULL2, UMULL, UMULL2 [b]D = X * Y
SMLAL, SMLAL2, UMLAL, UMLAL2 [b]D += X * Y
SMLSL, SMLSL2, UMLSL, UMLSL2 [b]D -= X * Y
SQDMULL, SQDMULL2 [a] [b]D = sat(2 * X * Y)
SQDMLAL, SQDMLAL2 [a] [b]D = sat(D + sat(2 * X * Y))
SQDMLSL, SQDMLSL2 [a] [b]D = sat(D - sat(2 * X * Y))
[a] Not available for 8-bit lanes of X/Y. Not available as unsigned.
[b] When using 16/32-bit lanes of Y, can broadcast a single lane of Y to all lanes.
[p] Ai is X2*i+0, Bi is X2*i+1, i.e. adjacent pairs.
[v] Low lane of D gets sum of all lanes of X, rest of D cleared.

A few instructions operating lanewise, with 16/32/64-bit lanes for D and X, and Y lanes half as wide (Y sign-extended or zero-extended before use):

Per-lane behaviour (half-width Y)
SADDW, SADDW2, UADDW, UADDW2D = X + Y
SSUBW, SSUBW2, USUBW, USUBW2D = X - Y

A few instructions operating lanewise, with 16/32/64-bit lanes for X and Y, and D lanes half as wide:

Per-lane behaviour (half-width D)
ADDHN, ADDHN2, SUBHN, SUBHN2D = (X ± Y) >> bitwidth(D)
RADDHN, RADDHN2, RSUBHN, RSUBHN2D = (X ± Y) >>R bitwidth(D)

Dense linear algebra instructions:

BehaviourD typeX typeY type
SDOT [b]Di += dot(Xi, Yi)s32[4] or u32[4]s8[4][4]s8[4][4]
UDOT [b]Di += dot(Xi, Yi)s32[4] or u32[4]u8[4][4]u8[4][4]
USDOT [b]Di += dot(Xi, Yi)s32[4] or u32[4]u8[4][4]s8[4][4]
SUDOT [b]Di += dot(Xi, Yi)s32[4] or u32[4]s8[4][4]u8[4][4]
SMMLAD += X @ YTs32[2][2] or u32[2][2]s8[2][8]s8[2][8]
UMMLAD += X @ YTs32[2][2] or u32[2][2]u8[2][8]u8[2][8]
USMMLAD += X @ YTs32[2][2] or u32[2][2]u8[2][8]s8[2][8]
[b] Can broadcast a 32-bit lane of Y to all 32-bit lanes.

Two oddball instructions operate on 32-bit unsigned lanes containing fixed-precision numbers with 32 fractional bits (i.e. range is 0 through 1-ε):

Per-lane behaviour
URECPED = sat(0.5 * X-1) (approximate, using just top 9 bits)
URSQRTED = sat(0.5 * X-0.5) (approximate, using just top 9 bits)

Float math

A broad range of floating-point math instructions are available, operating on fp32 or fp64 lanes (or fp16 subject to the fp16 extension), in either vector form or scalar form:

Per-lane behaviour
FABSD = abs(X)
FNEGD = -X
FADD, FSUBD = X ± Y
FADDPD = A + B [p]
FABDD = abs(X - Y)
FMUL [b]D = X * Y
FMULX [b]D = X * Y (except that ±0 times ±infinity is ±2.0)
FNMUL [s]D = -(X * Y)
FMADD, FMSUB [s]D = Z ± X * Y
FNMADD, FNMSUB [s]D = -(Z ± X * Y)
FMLA, FMLS [b]D ±= X * Y
FDIVD = X / Y
FSQRTD = X0.5
FRECPX [s]D = X-1 (crude approximate [a], using no fractional bits)
FRECPED = X-1 (approximate, using just 8 fractional bits)
FRECPSD = 2.0 - X * Y [c]
FRSQRTED = X-0.5 (approximate, using just 8 fractional bits)
FRSQRTSD = 1.5 - 0.5 * X * Y [d]
FMIN, FMINNMD = min(X, Y) [m]
FMINP, FMINNMPD = min(A, B) [m] [p]
FMINV, FMINNMVD0 = min(X0, X1, …, Xn-1) [m] [v]
FMAX, FMAXNMD = max(X, Y) [m]
FMAXP, FMAXNMPD = max(A, B) [m] [p]
FMAXV, FMAXNMVD0 = max(X0, X1, …, Xn-1) [m] [v]
FCMEQ [z]D = (X == Y) ? ones_mask : 0
FCMGE [z]D = (X >= Y) ? ones_mask : 0
FCMGT [z]D = (X > Y) ? ones_mask : 0
FACGED = (abs(X) >= abs(Y)) ? ones_mask : 0
FACGTD = (abs(X) > abs(Y)) ? ones_mask : 0
[a] Clears fraction bits, then adds one to exponent if zero, then bitwise inverse of exponent bits. Can be used with FMULX as part of vector normalisation.
[b] Can broadcast a single lane of Y to all lanes of Y.
[c] Useful as part of Newton-Raphson step where successive approximations to a-1 are computed as xn+1 = xn * (2.0 - a * xn). See FRECPE.
[d] Useful as part of Newton-Raphson step where successive approximations to a-0.5 are computed as xn+1 = xn * (1.5 - 0.5 * a * xn * xn). See FRSQRTE.
[m] Note that min/max are not quite equivalent to comparison followed by selection, due to signed zeros and NaNs. The NM variants return the non-NaN operand if exactly one operand is NaN.
[p] Ai is concat(X, Y)2*i+0, Bi is concat(X, Y)2*i+1, i.e. adjacent pairs.
[s] Scalar form only, no vector form.
[v] Low lane of D gets min/max of all lanes of X, rest of D cleared.
[z] Operands can be registers or constant zero (at least logically).

Various per-lane rounding instructions with floating-point inputs and outputs (see "Data type conversion, float to integer" for integer outputs):

RoundingRange [r]Exceptions
FRINT32XMode from FPCR-231 ⋯ 231-1Inexact, InvalidOp
FRINT32ZToward zero (truncate)-231 ⋯ 231-1Inexact, InvalidOp
FRINT64XMode from FPCR-263 ⋯ 263-1Inexact, InvalidOp
FRINT64ZToward zero (truncate)-263 ⋯ 263-1Inexact, InvalidOp
FRINTATo nearest, ties away from zeroUnbounded
FRINTIMode from FPCRUnbounded
FRINTMToward minus infinity (floor)Unbounded
FRINTNTo nearest, ties toward evenUnbounded
FRINTPToward positive infinity (ceil)Unbounded
FRINTXMode from FPCRUnboundedInexact
FRINTZToward zero (truncate)Unbounded
[r] Out of range results (in either direction) replaced by -231 or -263.

Mixed-width operations and dense linear algebra:

BehaviourD typeX/Y type
FMLAL, FMLSL [b]Di ±= Xi+0 * Yi+0fp32[4]fp16[8]
FMLAL2, FMLSL2 [b]Di ±= Xi+4 * Yi+4fp32[4]fp16[8]
BFMLALB [b]Di += X2*i+0 * Y2*i+0fp32[4]bf16[8]
BFMLALT [b]Di += X2*i+1 * Y2*i+1fp32[4]bf16[8]
BFDOT [b]Di += dot(Xi, Yi)fp32[4]bf16[4][2]
BFMMLAD += X @ YTfp32[2][2]bf16[2][4]
[b] Can broadcast a single lane of Y to all lanes of Y (for BFDOT, a lane is 32 bits).

Float comparisons involving PSTATE

The FCMP, FCMPE family perform a three-way comparison of the low fp16/fp32/fp64 lane of two operands, writing the result to PSTATE:

Following FCMP X, Y, the meaning of condition codes is:

EQX0 == Y0NE!(X0 == Y0)
LSX0 <= Y0HI!(X0 <= Y0)
LOX0 < Y0HS!(X0 < Y0)
MIX0 < Y0PL!(X0 < Y0)
CCX0 < Y0CS!(X0 < Y0)
GTX0 > Y0LE!(X0 > Y0)
GEX0 >= Y0LT!(X0 >= Y0)
VSis_nan(X0) or is_nan(Y0)VC!is_nan(X0) and !is_nan(Y0)

The FCCMP, FCCMPE family perform a conditional three-way comparison of the low fp16/fp32/fp64 lane of two operands: some condition is evaluated against the contents of PSTATE; if true, the instruction behaves like FCMP/FCMPE; if false, a four-bit immediate is written to the relevant PSTATE bits.

The FCSEL instruction uses PSTATE to conditionally select between two scalar fp16/fp32/fp64 operands: D0 = cond ? X0 : Y0 (other lanes of D cleared).

Data type conversion, float to float

Vector form, FPCR rounding:

to bf16to fp16to fp32to fp64
From bf16no changevia fp32SSHLL, SSHLL2via fp32
From fp16via fp32no changeFCVTL, FCVTL2via fp32
From fp32BFCVTN, BFCVTNFCVTN, FCVTN2no changeFCVTL, FCVTL2
From fp64via fp32 [x]via fp32 [x]FCVTN, FCVTN2no change
[x] Using FCVTXN or FCVTXN2, which employ round-to-odd rounding mode.

For scalar conversions, FCVT can convert from any of fp16/fp32/fp64 to any other of fp16/fp32/fp64.

Data type conversion, integer to float

Vector form, FPCR rounding, free division by a power of two afterwards:

to fp16to fp32to fp64
From s16 or u16SCVTF or UCVTFvia s32 or u32via s64 or u64
From s32 or u32via fp32SCVTF or UCVTFvia s64 or u64
From s64 or u64via fp64via fp64SCVTF or UCVTF

SCVTF and UCVTF can also take a GPR as input (32-bit or 64-bit, signed or unsigned), and convert that to any of fp16/fp32/fp64, again with a free division by a power of two afterwards.

Data type conversion, float to integer

A family of conversion instructions exist, available in two forms:

RoundingOverflow
FCVTAS, FCVTAUTo nearest, ties away from zeroSaturate
FCVTMS, FCVTMUToward minus infinity (floor)Saturate
FCVTNS, FCVTNUTo nearest, ties toward evenSaturate
FCVTPS, FCVTPUToward positive infinity (ceil)Saturate
FCVTZS, FCVTZU [f]Toward zero (truncate)Saturate
FJCVTZS [j]Toward zero (truncate)Modulo 232
[f] Free multiplication by a power of two possible before the conversion.
[j] Only exists in fp64 to s32 GPR form. Also sets PSTATE.

Complex float math

A pair of floating point lanes can represent a complex floating point number, where the low scalar lane contains the real part of the complex number and the high scalar lane contains the imaginary part of the complex number. A 128-bit register can then contain 4 fp16 complex lanes, or 2 fp32 complex lanes, or a single fp64 complex lane. A few instructions exist for manipulating these:

Real part of resultImaginary part of result
FCADD #90Re(D) = Re(X) - Im(Y)Im(D) = Im(X) + Re(Y)
FCADD #270Re(D) = Re(X) + Im(Y)Im(D) = Im(X) - Re(Y)
FCMLA #0 [b]Re(D) += Re(X) * Re(Y)Im(D) += Re(X) * Im(Y)
FCMLA #90 [b]Re(D) -= Im(X) * Im(Y)Im(D) += Im(X) * Re(Y)
FCMLA #180 [b]Re(D) -= Re(X) * Re(Y)Im(D) -= Re(X) * Im(Y)
FCMLA #270 [b]Re(D) += Im(X) * Im(Y)Im(D) -= Im(X) * Re(Y)
[b] Can broadcast a complex lane (i.e. 2 scalars) of Y to all complex lanes of Y.

Polynomial math

The PMUL, PMULL, and PMULL2 instructions all perform D = X * Y, where all lanes of D/X/Y contain ℤ2 polynomials. This is alternatively known as carryless multiplication (pclmulqdq on x86).

D lanesX/Y lanes
PMUL8-bit poly (high 7 bits of result discarded)8-bit poly
PMULL, PMULL216-bit poly (top bit always clear)8-bit poly
PMULL, PMULL2 [c]128-bit poly (top bit always clear)64-bit poly
[c] Requires crypto extension.

For ℤ2 polynomial addition/subtraction, see EOR or EOR3. Polynomial division and remainder against a constant Y can be performed via multiplication.

Cryptography

Some instructions are provided to accelerate AES encryption. A single round of AES encryption consists of AddRoundKey (just xor), then SubBytes and ShiftRows (in either order), then optionally MixColumns (performed for every round except the last). The provided instructions are:

Steps
AESEAddRoundKey then ShiftRows and SubBytes
AESMCMixColumns
AESDInverse AddRoundKey then inverse ShiftRows and inverse SubBytes
AESIMCInverse MixColumns

Note that x86 AES instructions are slightly different, for example aesenc there does ShiftRows and SubBytes, then MixColumns, then AddRoundKey.

Some instructions are provided to accelerate SHA-1 hashes: SHA1C, SHA1H, SHA1M, SHA1P, SHA1SU0, SHA1SU1.

Some instructions are provided to accelerate SHA-2 hashes: SHA256H, SHA256H2, SHA256SU0, SHA256SU1 for SHA-256, and SHA512H, SHA512H2, SHA512SU0, SHA512SU1 for SHA-512.

Some instructions are provided to accelerate SHA-3 hashes: EOR3, RAX1, XAR, BCAX. See "Shifts" or "Bitwise" for descriptions.

Some instructions are provided to accelerate SM3 hashes: SM3SS1, SM3TT1A, SM3TT1B, SM3TT2A, SM3TT2B, SM3PARTW1, SM3PARTW2.

Some instructions are provided to accelerate SM4 encryption: SM4E, SM4EKEY.

page: 1 2 3 4 5