Tenstorrent Wormhole Series Part 1: Physicalities

A company called Tenstorrent design and sell PCIe cards for AI acceleration. At the time of writing, they've recently started shipping their Wormhole n150s and Wormhole n300s cards. After taking the shell and the heatsinks off, the underlying circuit board looks like:

(Circuit board photos of the Wormhole n150s and Wormhole n300s, from Tenstorrent's own sales pages)

At the top of the board, there are two QSFP-DD cages for ethernet, which are intended for direct-attach cables between cards, as shown here. The left edge has a PCIe 4.0 x16 connector, which is connected to the immediately adjacent Wormhole ASIC. The Wormhole ASIC is surrounded by six GDDR6 chips (two on the left, four on the right), with each GDDR6 chip containing two 1GB banks, for a total of 12GB GDDR6 connected to the Wormhole ASIC. The right edge has two of what Tenstorrent call a Warp 100 Bridge connector, which seems to be a proprietary 100Gb ethernet connector. After this, the two boards diverge: the n300s card has another Wormhole ASIC with another 12GB of GDDR6 attached, whereas the n150s card has a bunch of empty space. The 2nd Wormhole ASIC on the n300s card is not connected to PCIe. Instead, on the n300s card, there are internal ethernet links between the two ASICs. Accordingly, for the host to communicate with the 2nd ASIC, it needs to use PCIe to communicate with the 1st ASIC, and ask the 1st ASIC to use its internal ethernet to communicate with the 2nd ASIC's internal ethernet.

The Wormhole ASIC itself is described as a 10 by 12 grid of tiles, arranged as:

There are a few different types of tile:

Kind   Count                  Contents (per tile)
ARC    1                      1x Argonaut RISC Core (also connected to PCIe)
D      6 x 3                  Bridge to 2x 1GB GDDR6 (shared by three tiles)
E      16                     1x Baby RISC-V CPU (E variant)
                              256K SRAM
                              Bridge to 100Gb ethernet (if connected)
PCIe   1                      Bridge to host over PCIe
T      80 (not all usable)    5x Baby RISC-V CPU (B/T/T/T/NC variants)
                              1.5MB SRAM
                              1x Matrix unit (2048 multipliers, each 5b x 7b)
                              1x Vector/SIMD unit (32 lanes, each 32b wide)

Each E tile can manage 100Gb ethernet (that is, simultaneous transmit at 100Gb/s and receive at 100Gb/s). On the n300s card, E8 of the 1st ASIC is connected to E0 of the 2nd ASIC, and E9 of the 1st ASIC is connected to E1 of the 2nd ASIC. Other tiles are connected to the QSFP-DD cages and to the Warp 100 Bridge connectors (possibly two E tiles on the 2nd ASIC route to the Warp, and four E tiles on the 1st ASIC route to the QSFP-DD cages). There's not enough I/O on the n150s/n300s boards to allow all the E tiles of a single ASIC to be connected, let alone all the E tiles of two ASICs to be connected; you'd need to buy a Galaxy Server to find a form factor with enough I/O to saturate all the E tiles.

Each tile has four outbound connections to its (north/east/south/west) neighbours, along with four corresponding inbound connections from its neighbours, with each connection being 32 bytes wide (apparently upgraded to 64 bytes wide in the next-generation Blackhole ASIC). The combination of the east-bound and south-bound channels is called NoC (Network on Chip) #0, whilst the combination of the west-bound and north-bound channels is called NoC #1. Going eastward from the eastmost edge lands you on the westmost edge (e.g. going east from the T at (9, 3) gets you to PCIe at (0, 3)). A similar thing is true on the other three edges: they are each connected to the edge on the opposite side. Taking the eastbound connections in row 3 as an example, we have:

It looks like the path between most neighbours is short, but then the wraparound path is extremely long. This would make both software and hardware designers unhappy, so the physical reality interleaves tiles to equalize the distances:

A similar interleaving is done to the columns, meaning that the actual physical locations of the various tiles within the grid are closer to:

The various tile-to-tile connections aren't shown on the above, as it would be a horrible mess of overlapping and dodging lines. It does however match the circuit board photos: the DRAM tiles are on the left and right edges, and tiles routing to the same DRAM chip are adjacent (e.g. the three D5 tiles are physically adjacent, even though their logical Y coordinates are 3, 8, 4). The ethernet tiles also end up on the physical edges, which makes sense. Thankfully, software mostly doesn't need to worry about this interleaving (except for HARVESTING_NOC_LOCATIONS, which you'll note are the Y coordinates of an interleaved column of T tiles), and can instead stick to the original logical view:

That said, there's one physical reality that software can't escape: manufacturing defects. You could imagine TSMC having a master template Wormhole ASIC and then photocopying it thousands of times to create the ASICs that actually ship to customers. As with actual photocopying, the photocopies are slightly lower quality than the original. Sometimes the defects are benign, but other times the defects completely ruin the piece of logic or storage that they occur within. Throwing away an entire ASIC because of a single defect would not be viable, so there are various strategies to deal with the defects. One strategy employed here is to disable (or "harvest") an entire row of T tiles if there is a defect in any of the T tiles in that row, effectively pretending that all the T tiles in that row are empty tiles (in the same way that e.g. (0, 2) is empty but still participates in the connection grid).

An ASIC with one row of disabled T tiles can be used in an n150s board, which is why the n150s is reported as having 72 T tiles despite the ASIC having 80. Similarly, an ASIC with two rows of disabled T tiles can be used in an n300s board, which is why the n300s is reported as having 64 T tiles per ASIC despite there originally being 80. If the photocopying process had fewer defects than planned for, then one (n150s) or two (n300s) rows of T tiles are disabled regardless, for consistency (of what customers receive, of power usage, etc.).

The non-T tiles within a disabled row are kept, though they become slightly harder to address. In particular, some workloads want to really minimise latency to DRAM, and so want to run on a T tile immediately adjacent to a D tile. Some rows have 2 D tiles and thus four T tiles immediately adjacent to a D tile, whereas other rows have only 1 D tile and thus only two T tiles immediately adjacent to a D tile. Whilst each PCIe card will have the same number of usable T tiles (72 for n150s, 64 x 2 for n300s), the number of usable T tiles immediately adjacent to a D tile will vary randomly from card to card depending on which rows were disabled.

That wraps up part 1. If you're reading along, part 2 is next.

Arm CSSC quick reference

The AArch64 FEAT_CSSC (Common Short Sequence Compression) extension adds a few instructions operating on general purpose registers. The extension is optional since Armv8.7, and mandatory since Armv8.9 if FEAT_AdvSIMD is implemented. The most interesting addition is a popcnt instruction operating on general purpose registers, but the full list of new instructions is:

CSSC instruction      Equivalent pre-CSSC sequence (†)
ABS  Rd, Rn           TST  Rn, Rn
                      CNEG Rd, Rn, MI
CNT  Rd, Rn           FMOV d0, Rn
                      CNT  v0.8B, v0.8B
                      ADDV b0, v0.8B
                      FMOV Rd, s0
CTZ  Rd, Rn           RBIT Rd, Rn
                      CLZ  Rd, Rd
SMAX Rd, Rn, Rm       CMP  Rn, Rm
                      CSEL Rd, Rn, Rm, GT
SMAX Rd, Rn, #i8      MOV  Rtmp, #i8
                      CMP  Rn, Rtmp
                      CSEL Rd, Rn, Rtmp, GT
SMIN Rd, Rn, Rm       CMP  Rn, Rm
                      CSEL Rd, Rn, Rm, LT
SMIN Rd, Rn, #i8      MOV  Rtmp, #i8
                      CMP  Rn, Rtmp
                      CSEL Rd, Rn, Rtmp, LT
UMAX Rd, Rn, Rm       CMP  Rn, Rm
                      CSEL Rd, Rn, Rm, HI
UMAX Rd, Rn, #u8      MOV  Rtmp, #u8
                      CMP  Rn, Rtmp
                      CSEL Rd, Rn, Rtmp, HI
UMIN Rd, Rn, Rm       CMP  Rn, Rm
                      CSEL Rd, Rn, Rm, LO
UMIN Rd, Rn, #u8      MOV  Rtmp, #u8
                      CMP  Rn, Rtmp
                      CSEL Rd, Rn, Rtmp, LO

(†) Except that the CSSC instructions do not mutate flags, and CSSC CNT does not mutate any SIMD registers.
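
For a sense of how this surfaces from C, here's a minimal sketch of my own (not from any reference material): with a compiler targeting a CPU that implements FEAT_CSSC (e.g. something like -march=armv8.9-a or a +cssc modifier), the usual builtins can typically lower to the single-instruction forms above, and otherwise fall back to the pre-CSSC sequences.

// Hedged sketch: popcount and count-trailing-zeros on 64-bit GPR values.
// On a CSSC-capable target these typically become the single CNT / CTZ
// instructions from the table; otherwise the compiler emits the longer
// pre-CSSC sequences.
#include <stdint.h>

uint64_t popcount64(uint64_t x) {
  return (uint64_t)__builtin_popcountll(x);  // candidate for CSSC cnt xd, xn
}

uint64_t ctz64(uint64_t x) {
  // __builtin_ctzll(0) is undefined, so guard it; the non-zero path is a
  // candidate for CSSC ctz xd, xn.
  return x ? (uint64_t)__builtin_ctzll(x) : 64;
}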

My new favourite AArch64 CPU instruction: rotate then merge into flags (RMIF)

I find myself writing some CPU emulators at the moment, which has caused the AArch64 (aka. ARM64) RMIF instruction to become my new favourite instruction. It takes a 64-bit general purpose register, rotates it right by a specified number of bits, then selectively merges the low four bits into the flags register. A 6-bit field in the instruction gives the rotate amount, and a 4-bit field in the instruction gives a mask of which flag bits to overwrite versus which to leave unchanged.
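
To make that concrete, here's a minimal C model of the instruction (my own sketch, consistent with the examples that follow): the flags nibble holds N, Z, C, V in bits 3 down to 0, the rotated value's low four bits are the candidate flags, and each mask bit selects whether the corresponding flag is overwritten.

// Hedged sketch: a minimal C model of RMIF. Flags are modelled as a
// 4-bit nibble with N in bit 3, Z in bit 2, C in bit 1 and V in bit 0.
#include <stdint.h>

typedef uint8_t nzcv_t;  // only the low 4 bits are meaningful

static nzcv_t rmif(uint64_t xn, unsigned rot6, unsigned mask4, nzcv_t flags) {
  rot6 &= 63;
  mask4 &= 15;
  // Rotate right by rot6, then take the low four bits as candidate NZCV.
  uint64_t rotated = (xn >> rot6) | (xn << ((64 - rot6) & 63));
  nzcv_t candidate = (nzcv_t)(rotated & 15);
  // Overwrite only the flag bits selected by the mask.
  return (nzcv_t)((flags & ~mask4) | (candidate & mask4));
}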

One use of rmif is to emulate the x86 bt reg, imm instruction, which extracts one bit from a general purpose register, writes that bit to the C flag, and leaves other flags unchanged. Thus bt reg, imm in x86 becomes rmif reg, #((imm - 1) & 63), #2 in AArch64.
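
As a usage sketch against the model above (emulate_bt is a hypothetical helper name, not real emulator code):

// Usage sketch: emulate x86 `bt reg, imm` with the rmif model above.
// Bit `imm` of reg should end up in the C flag (bit 1 of the nibble),
// with all other flags unchanged.
static nzcv_t emulate_bt(uint64_t reg, unsigned imm, nzcv_t flags) {
  return rmif(reg, (imm - 1) & 63, /*mask=*/2, flags);
}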

At the other end of the spectrum is the x86 inc instruction, which adds 1 to a general purpose register, and then sets most flags based on this addition, but leaves the C flag unchanged. To emulate inc reg, we can first save off the old value of the C flag (via csinc tmp, wzr, wzr, cc or adc tmp, wzr, wzr), then do adds reg, reg, #1 to perform the addition and set all the flags, then rmif tmp, #63, #2 to restore the old value of the C flag.

As another example, the AArch32 muls instruction sets the N and Z flags based on the result of the multiplication, but leaves the C and V flags unchanged. To emulate this on AArch64, we can save off all the flags (mrs tmp, NZCV), then do the multiplication, then set N and Z based on the result but also clobber C and V (ands wzr, dst, dst or adds wzr, dst, #0), then restore the old values of C and V (rmif tmp, #28, #3).
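
And a matching usage sketch with the same model (again a hypothetical helper, mirroring the mrs / ands / rmif sequence just described):

// Usage sketch: emulate AArch32 `muls` flag behaviour with the rmif model.
// N and Z come from the 32-bit result; C and V must be preserved.
static nzcv_t emulate_muls_flags(uint32_t result, nzcv_t flags) {
  // mrs tmp, NZCV: the saved register holds N,Z,C,V in bits 31..28.
  uint64_t tmp = (uint64_t)flags << 28;
  // ands wzr, dst, dst: set N and Z from the result, clobber C and V.
  nzcv_t clobbered = (nzcv_t)(((result >> 31) ? 8 : 0) | (result == 0 ? 4 : 0));
  // rmif tmp, #28, #3: restore the old C and V.
  return rmif(tmp, 28, 3, clobbered);
}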

For want of a relative path

Distributing dynamically-linked ELF executables on Linux can be arduous. Some downstream effects of this include:

At first, the problem doesn't look arduous: an ELF executable can contain an rpath or runpath attribute telling the dynamic linker where to find its shared object dependencies, and if that attribute starts with the magic placeholder $ORIGIN/, then the dynamic linker will look in the directory containing the executable (or a directory nearby) for its shared object dependencies. For example, if my_executable depended upon libz.so.1, and my_executable had an rpath or runpath of $ORIGIN/libs, then the executable and the library could be distributed using the following directory structure:

my_executable
libs/
  libz.so.1

This is great, but it has one limitation: an ELF executable also contains an attribute telling the kernel where to find the dynamic linker, and that attribute has to be an absolute path (or a path relative to the current working directory); it cannot be a path relative to the executable. On contemporary x86-64 systems, that absolute path tends to be /lib64/ld-linux-x86-64.so.2. This forces ELF executables to use whatever the system provides at /lib64/ld-linux-x86-64.so.2, which is typically version N of glibc's dynamic linker, for some N. In turn, this forces the ELF executable to use version N of the rest of glibc (libc.so.6, libm.so.6, libpthread.so.0, etc).

Continuing the example, it is likely that my_executable and libz.so.1 were built against some version M of glibc. If M ≤ N, then everything will work fine, but problems often crop up when M > N. One commonly touted solution is to set up a build environment with a very old version M of glibc, build my_executable and libz.so.1 in that environment, and then distribute them and hope for M ≤ N.

The polyfill-glibc project presents another possible solution: build my_executable and libz.so.1 against whatever version of glibc is convenient, and then run polyfill-glibc --target-glibc=N my_executable libz.so.1 to make them compatible with version N of glibc.

Sometimes we don't want either of these solutions, and what we want is to distribute the required version of glibc along with the executable, as in:

my_executable
libs/
  ld-linux-x86-64.so.2
  libc.so.6
  libz.so.1

We can get close to this by adding a launcher script:

launch_my_executable
my_executable
libs/
  ld-linux-x86-64.so.2
  libc.so.6
  libz.so.1

Where launch_my_executable is something like:

#!/usr/bin/env bash

ORIGIN="$(dirname "$(readlink -f "$0")")"
exec "$ORIGIN/libs/ld-linux-x86-64.so.2" --library-path "$ORIGIN/libs:$LD_LIBRARY_PATH" "$ORIGIN/my_executable" "$@"

This will work most of the time, though it comes with caveats:

As an alternative without these caveats, there's an experimental tool in the polyfill-glibc repository called set_relative_interp. For our running example, the tool would be invoked as:

$ set_relative_interp my_executable libs/ld-linux-x86-64.so.2

After running the tool as above, my_executable will use $ORIGIN/libs/ld-linux-x86-64.so.2 as its dynamic linker.

(Ab)using gf2p8affineqb to turn indices into bits

@geofflangdale posed the question on Twitter of how to vectorise this:

__mmask64 reference_impl(__m512i indices, __mmask64 valids) {
  __mmask64 result = 0;
  for (int i = 0; i < 64; ++i) {
    if (valids.bit[i]) {
      result ^= 1ull << indices.byte[i];
    }
  }
  return result;
}

After a week of code golf also involving @HaroldAptroot, we ended up with:

__mmask64 simd_impl(__m512i indices, __mmask64 valids) {
  // Convert indices to bits within each qword lane.
  __m512i khi = _mm512_setr_epi8(
    0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01,
    0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
    0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04,
    0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08,
    0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10,
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x40, 0x40, 0x40, 0x40, 0x40, 0x40, 0x40, 0x40,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
  );
  __m512i hi0 = _mm512_permutexvar_epi8(indices, khi);
  __m512i klo = _mm512_set1_epi64(0x0102040810204080);
  __m512i lo0 = _mm512_maskz_shuffle_epi8(valids, klo, indices);
  __m512i kid = _mm512_set1_epi64(0x8040201008040201);
  __m512i hi1 = _mm512_gf2p8affine_epi64_epi8(kid, hi0, 0);
  __m512i lo1 = _mm512_gf2p8affine_epi64_epi8(kid, lo0, 0);
  __m512i x0  = _mm512_gf2p8affine_epi64_epi8(hi1, lo1, 0);
  // Combine results from various qword lanes.
  __m512i ktr = _mm512_setr_epi8(
    0,  8, 16, 24, 32, 40, 48, 56,
    1,  9, 17, 25, 33, 41, 49, 57,
    2, 10, 18, 26, 34, 42, 50, 58,
    3, 11, 19, 27, 35, 43, 51, 59,
    4, 12, 20, 28, 36, 44, 52, 60,
    5, 13, 21, 29, 37, 45, 53, 61,
    6, 14, 22, 30, 38, 46, 54, 62,
    7, 15, 23, 31, 39, 47, 55, 63);
  __m512i x1  = _mm512_permutexvar_epi8(ktr, x0);
  __m512i x2  = _mm512_gf2p8affine_epi64_epi8(kid, x1, 0);
  // Reduce 64 bytes down to 64 bits.
  __m512i kff = _mm512_set1_epi8(0xff);
  __m512i x3  = _mm512_gf2p8affine_epi64_epi8(x2, kff, 0);
  return _mm512_movepi8_mask(x3);
}

NB: If the valid indices can be assumed to be distinct, then the final reduction from 64 bytes to 64 bits can instead be:

  return _mm512_cmpneq_epi8_mask(x2, _mm512_setzero_si512());

As is often the case, simd_impl looks nothing like reference_impl, despite doing the same thing. In particular, simd_impl contains no shifts, and instead contains alternating shuffles and invocations of the mysterious _mm512_gf2p8affine_epi64_epi8, which is the intrinsic function corresponding to the gf2p8affineqb assembly instruction. To understand how simd_impl works, we're going to have to first understand what gf2p8affineqb does.

There are various ways of understanding what gf2p8affineqb does, but for the purposes of this blog post, I think the following Python pseudo-code is most useful:

def gf2p8affineqb(src1 : vector, src2 : vector, imm8 : u8) -> vector:
  assert len(src1.byte) == len(src2.byte)
  dst = vector()
  for i in range(len(src1.byte)):
    munged_src2 = munge(src2.qword[i // 8])
    dst.byte[i] = xor_selected(src1.byte[i], munged_src2, imm8)
  return dst

def xor_selected(src1 : u8, munged_src2 : u64, imm8 : u8) -> u8:
  result = imm8
  for i in range(8):
    if src1.bit[i]:
      result ^= munged_src2.byte[i]
  return result

def munge(x : u64) -> u64:
  return transpose8x8(byte_swap(x))
  # Or equivalently:
  return bitrev_in_each_byte(transpose8x8(x))

def transpose8x8(x : u64) -> u64:
  result = 0
  for i in range(8):
    for j in range(8):
      result.byte[i].bit[j] = x.byte[j].bit[i]
  return result

def byte_swap(x : u64) -> u64:
  result = 0
  for i in range(8):
    result.byte[i] = x.byte[7 - i]
  return result

def bitrev_in_each_byte(x : u64) -> u64:
  result = 0
  for i in range(8):
    result.byte[i] = bitrev(x.byte[i])
  return result

def bitrev(x : u8) -> u8:
  result = 0
  for i in range(8):
    result.bit[i] = x.bit[7 - i]
  return result

The mathematically inclined might notice that the above is in fact doing matrix multiplication of two 8x8 matrices of bits:

def gf2p8affineqb(src1 : vector, src2 : vector, imm8 : u8) -> vector:
  assert len(src1.byte) == len(src2.byte)
  dst = vector()
  for i in range(len(src1.qword)):
    dst.qword[i] = matmul(src1.qword[i], munge(src2.qword[i]))
  for i in range(len(src1.byte)):
    dst.byte[i] ^= imm8
  return dst

def matmul(lhs : u64, rhs : u64) -> u64:
  result = 0
  for i in range(8):
    for j in range(8):
      for k in range(8):
        b = lhs.byte[i].bit[j] * rhs.byte[j].bit[k] # * or &
        result.byte[i].bit[k] += b                  # + or ^
  return result

def munge(x : u64) -> u64:
  # Same as previously

The xor_selected view of gf2p8affineqb and the matmul view of gf2p8affineqb are complementary: I think that the xor_selected view makes it clearer what is going on, but the matmul view is useful for higher level transformations and optimisations. As a middle ground between the two views, matmul can be re-expressed as byte-level operations by unrolling the k loop:

def matmul(lhs : u64, rhs : u64) -> u64:
  result = 0
  for i in range(8):
    for j in range(8):
      if lhs.byte[i].bit[j]:
        result.byte[i] ^= rhs.byte[j]
  return result

One observation from the matmul view is that when src1.qword[i] is the identity matrix, we end up with dst.qword[i] being munge(src2.qword[i]). As a 64-bit integer, said identity matrix is 0x8040201008040201 (i.e. in byte i, just bit i is set). This explains __m512i kid = _mm512_set1_epi64(0x8040201008040201) in simd_impl (kid is just an identity matrix) and also explains __m512i hi1 = _mm512_gf2p8affine_epi64_epi8(kid, hi0, 0) and __m512i lo1 = _mm512_gf2p8affine_epi64_epi8(kid, lo0, 0) - these are just applying munge to every qword (as for what said munges are achieving, we'll get to later).
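
As a quick sanity check of that claim, here's a hedged sketch of my own (requiring GFNI and AVX-512, e.g. -mavx512f -mgfni; the scalar helpers just restate the pseudo-code above) which munges a qword both ways and compares:

// Hedged sketch: check that gf2p8affineqb with the identity matrix as its
// first operand applies munge() to each qword of the second operand.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t transpose8x8(uint64_t x) {
  // result.byte[i].bit[j] = x.byte[j].bit[i]
  uint64_t r = 0;
  for (int i = 0; i < 8; ++i)
    for (int j = 0; j < 8; ++j)
      r |= ((x >> (j * 8 + i)) & 1) << (i * 8 + j);
  return r;
}

static uint64_t munge(uint64_t x) {
  return transpose8x8(__builtin_bswap64(x));  // transpose8x8(byte_swap(x))
}

int main(void) {
  uint64_t x = 0x0123456789abcdefull;  // arbitrary test value
  __m512i kid = _mm512_set1_epi64((long long)0x8040201008040201ull);
  __m512i v = _mm512_gf2p8affine_epi64_epi8(kid, _mm512_set1_epi64((long long)x), 0);
  uint64_t via_simd = (uint64_t)_mm_cvtsi128_si64(_mm512_castsi512_si128(v));
  printf("scalar munge: %016llx, via gf2p8affineqb: %016llx\n",
         (unsigned long long)munge(x), (unsigned long long)via_simd);
  return munge(x) != via_simd;  // exit code 0 on success
}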

Changing tack somewhat, it is time to gradually transform reference_impl to make it look more like matmul. For this, we'll start with a simplified version of reference_impl that takes 8 indices rather than 64:

__mmask64 reference_impl_1(__m64i indices, __mmask8 valids) {
  __mmask64 result = 0;
  for (int i = 0; i < 8; ++i) {
    if (valids.bit[i]) {
      result ^= 1ull << indices.byte[i];
    }
  }
  return result;
}

The first transformation is to split each 6-bit index into its low 3 bits and high 3 bits, so that we can address bytes of result:

__mmask64 reference_impl_2(__m64i indices, __mmask8 valids) {
  __mmask64 result = 0;
  for (int i = 0; i < 8; ++i) {
    if (valids.bit[i]) {
      uint8_t b = indices.byte[i];
      uint8_t hi = b >> 3;
      uint8_t lo = b  & 7;
      result.byte[hi] ^= 1 << lo;
    }
  }
  return result;
}

Next up we perform loop fission; doing the exact same work, but using two loops rather than one (so that we can focus on the loops separately):

__mmask64 reference_impl_3(__m64i indices, __mmask8 valids) {
  __m64i hi;
  __m64i lo;
  for (int i = 0; i < 8; ++i) {
    uint8_t b = indices.byte[i];
    hi.byte[i] = b >> 3;
    lo.byte[i] = b  & 7;
  }
  __mmask64 result = 0;
  for (int i = 0; i < 8; ++i) {
    if (valids.bit[i]) {
      result.byte[hi.byte[i]] ^= 1 << lo.byte[i];
    }
  }
  return result;
}

Then the if and the 1 << can also be moved from the 2nd loop to the 1st loop:

__mmask64 reference_impl_4(__m64i indices, __mmask8 valids) {
  __m64i hi;
  __m64i lo;
  for (int i = 0; i < 8; ++i) {
    uint8_t b = indices.byte[i];
    hi.byte[i] = b >> 3;
    lo.byte[i] = valids.bit[i] ? 1 << (b & 7) : 0;
  }
  __mmask64 result = 0;
  for (int i = 0; i < 8; ++i) {
    result.byte[hi.byte[i]] ^= lo.byte[i];
  }
  return result;
}

Then a transformation that looks utterly deranged, but is key to the SIMD transformation; rather than directly indexing using hi.byte[i], we'll loop over the 8 possible values of hi.byte[i] and act when we find the right value:

__mmask64 reference_impl_5(__m64i indices, __mmask8 valids) {
  __m64i hi;
  __m64i lo;
  for (int i = 0; i < 8; ++i) {
    uint8_t b = indices.byte[i];
    hi.byte[i] = b >> 3;
    lo.byte[i] = valids.bit[i] ? 1 << (b & 7) : 0;
  }
  __mmask64 result = 0;
  for (int i = 0; i < 8; ++i) {
    for (int j = 0; j < 8; ++j) {
      if (hi.byte[i] == j) {
        result.byte[j] ^= lo.byte[i];
      }
    }
  }
  return result;
}

Next up we perform loop interchange of the two nested loops:

__mmask64 reference_impl_6(__m64i indices, __mmask8 valids) {
  __m64i hi;
  __m64i lo;
  for (int i = 0; i < 8; ++i) {
    uint8_t b = indices.byte[i];
    hi.byte[i] = b >> 3;
    lo.byte[i] = valids.bit[i] ? 1 << (b & 7) : 0;
  }
  __mmask64 result = 0;
  for (int i = 0; i < 8; ++i) {
    for (int j = 0; j < 8; ++j) {
      if (hi.byte[j] == i) {
        result.byte[i] ^= lo.byte[j];
      }
    }
  }
  return result;
}

Then another transformation that initially looks deranged; the == in hi.byte[j] == i is annoying, and can be replaced by a bit test if we one-hot encode hi:

__mmask64 reference_impl_7(__m64i indices, __mmask8 valids) {
  __m64i hi;
  __m64i lo;
  for (int i = 0; i < 8; ++i) {
    uint8_t b = indices.byte[i];
    hi.byte[i] = 1 << (b >> 3);
    lo.byte[i] = valids.bit[i] ? 1 << (b & 7) : 0;
  }
  __mmask64 result = 0;
  for (int i = 0; i < 8; ++i) {
    for (int j = 0; j < 8; ++j) {
      if (hi.byte[j].bit[i]) {
        result.byte[i] ^= lo.byte[j];
      }
    }
  }
  return result;
}

Then one final transformation to get where we want to be; apply transpose8x8 to hi, and undo it by changing .byte[j].bit[i] to .byte[i].bit[j]:

__mmask64 reference_impl_8(__m64i indices, __mmask8 valids) {
  __m64i hi;
  __m64i lo;
  for (int i = 0; i < 8; ++i) {
    uint8_t b = indices.byte[i];
    hi.byte[i] = 1 << (b >> 3);
    lo.byte[i] = valids.bit[i] ? 1 << (b & 7) : 0;
  }
  __mmask64 result = 0;
  for (int i = 0; i < 8; ++i) {
    for (int j = 0; j < 8; ++j) {
      if (transpose8x8(hi).byte[i].bit[j]) {
        result.byte[i] ^= lo.byte[j];
      }
    }
  }
  return result;
}

A number of these transformations seemed pointless or even unhelpful, but having done them all, the latter half of reference_impl_8 is exactly result = matmul(transpose8x8(hi), lo).

The expression matmul(transpose8x8(A), B) looks deceptively similar to the matmul(A, munge(B)) done by gf2p8affineqb(A, B, 0), and if munge was just transpose8x8, then gf2p8affineqb(munge(A), munge(B), 0) would be exactly matmul(transpose8x8(A), B). Unfortunately, munge also does a bit or byte reversal, causing gf2p8affineqb(munge(A), munge(B), 0) to actually be matmul(transpose8x8(A), bitrev_in_each_byte(B)) (if deriving this, note that munge(A) is bitrev_in_each_byte(transpose8x8(A)), munge(munge(B)) is byte_swap(bitrev_in_each_byte(B)), and then the bitrev_in_each_byte on A cancels out with the byte_swap on B).

The expression matmul(transpose8x8(A), bitrev_in_each_byte(B)) is very close to what we want, and the errant bitrev_in_each_byte can be cancelled out by doing another bitrev_in_each_byte on B:

__mmask64 reference_impl_9(__m64i indices, __mmask8 valids) {
  __m64i hi;
  __m64i lo;
  for (int i = 0; i < 8; ++i) {
    uint8_t b = indices.byte[i];
    hi.byte[i] = 1 << (b >> 3);
    lo.byte[i] = bitrev(valids.bit[i] ? 1 << (b & 7) : 0);
  }
  __mmask64 result = gf2p8affineqb(munge(hi), munge(lo), 0);
  return result;
}

The 1st loop is easy to express in a SIMD manner via a pair of table lookups, thereby giving us the first chunk of simd_impl:

__mmask64 simd_impl(__m512i indices, __mmask64 valids) {
  // Convert indices to bits within each qword lane.
  __m512i khi = _mm512_setr_epi8(
    0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01,
    0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
    0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04,
    0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08,
    0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10,
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x40, 0x40, 0x40, 0x40, 0x40, 0x40, 0x40, 0x40,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
  );
  __m512i hi0 = _mm512_permutexvar_epi8(indices, khi);
  __m512i klo = _mm512_set1_epi64(0x0102040810204080);
  __m512i lo0 = _mm512_maskz_shuffle_epi8(valids, klo, indices);
  __m512i kid = _mm512_set1_epi64(0x8040201008040201);
  __m512i hi1 = _mm512_gf2p8affine_epi64_epi8(kid, hi0, 0); // munge
  __m512i lo1 = _mm512_gf2p8affine_epi64_epi8(kid, lo0, 0); // munge
  __m512i x0  = _mm512_gf2p8affine_epi64_epi8(hi1, lo1, 0);

At this point, x0.qword[i] contains reference_impl_9(indices.qword[i], valids.byte[i]). To finish up, "all" we need to do is xor together the eight qwords of x0. The traditional way of doing this would be a shuffle followed by a xor to reduce eight to four, another shuffle followed by a xor to reduce four to two, and yet another shuffle followed by a xor to reduce two to one. We can do better than the traditional approach though. The first step is to do one big shuffle rather than three sequential shuffles, where the result of the big shuffle moves the eight bytes qword[i].byte[0] to be contiguous, then the eight bytes qword[i].byte[1] to be contiguous, and so on. Seen differently, the big shuffle is a transpose on an 8x8 matrix of bytes. After this big shuffle, the remaining problem is to take each contiguous group of eight bytes and xor them together. If we wanted to add together each contiguous group of eight bytes, then _mm512_sad_epu8 against zero would be one option, but we want xor rather than add. There are a few different ways of approaching the problem, but one cute way is to apply transpose8x8 to each contiguous group of eight bytes, after which we just need to xor together each contiguous group of eight bits. Applying transpose8x8 on its own is hard, but we can apply munge fairly easily, which does transpose8x8 followed by bitrev_in_each_byte, and the bitrev_in_each_byte is harmless given that we're about to xor together the bits in each byte. This gives us the next chunk of simd_impl:

  // Combine results from various qword lanes.
  __m512i ktr = _mm512_setr_epi8(
    0,  8, 16, 24, 32, 40, 48, 56,
    1,  9, 17, 25, 33, 41, 49, 57,
    2, 10, 18, 26, 34, 42, 50, 58,
    3, 11, 19, 27, 35, 43, 51, 59,
    4, 12, 20, 28, 36, 44, 52, 60,
    5, 13, 21, 29, 37, 45, 53, 61,
    6, 14, 22, 30, 38, 46, 54, 62,
    7, 15, 23, 31, 39, 47, 55, 63);
  __m512i x1  = _mm512_permutexvar_epi8(ktr, x0); // transpose bytes
  __m512i x2  = _mm512_gf2p8affine_epi64_epi8(kid, x1, 0); // munge

If the valid indices can be assumed to be distinct, then we can or (rather than xor) together the bits in each byte, which is just _mm512_cmpneq_epi8_mask against zero.

If we really do need to xor the bits together, then what we want is this function applied to every byte:

def xor_together_bits(x : u8) -> u8:
  result = 0
  for i in range(8):
    if x.bit[i]:
      result ^= 0xff
  return result

If you're thinking that xor_together_bits looks very similar to xor_selected, then you'd be right: xor_together_bits is just xor_selected where every byte of munged_src2 is 0xff, and it so happens that if every byte of src2 is 0xff, then the same is true for munged_src2. This gives the final chunk of simd_impl:

  // Reduce 64 bytes down to 64 bits.
  __m512i kff = _mm512_set1_epi8(0xff);
  __m512i x3  = _mm512_gf2p8affine_epi64_epi8(x2, kff, 0);
  return _mm512_movepi8_mask(x3);
}
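
To round things off, here's a hedged test sketch of my own (not part of the original write-up): it compares simd_impl against a plain scalar re-statement of reference_impl on random inputs, assuming simd_impl as defined above is in scope and a compiler/CPU with AVX512BW, AVX512VBMI and GFNI (e.g. -mavx512bw -mavx512vbmi -mgfni).

// Hedged test sketch: random comparison of simd_impl (above) against a
// scalar re-statement of reference_impl. Indices are kept in 0..63, as the
// reference implicitly assumes.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t scalar_impl(const uint8_t indices[64], uint64_t valids) {
  uint64_t result = 0;
  for (int i = 0; i < 64; ++i) {
    if ((valids >> i) & 1) {
      result ^= 1ull << (indices[i] & 63);
    }
  }
  return result;
}

int main(void) {
  srand(42);
  for (int iter = 0; iter < 100000; ++iter) {
    uint8_t indices[64];
    uint64_t valids = 0;
    for (int i = 0; i < 64; ++i) {
      indices[i] = (uint8_t)(rand() & 63);
      valids |= ((uint64_t)(rand() & 1)) << i;
    }
    __m512i v = _mm512_loadu_si512((const void *)indices);
    uint64_t expected = scalar_impl(indices, valids);
    uint64_t actual = (uint64_t)simd_impl(v, (__mmask64)valids);
    if (expected != actual) {
      printf("mismatch at iteration %d\n", iter);
      return 1;
    }
  }
  printf("all ok\n");
  return 0;
}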