Directly continuing from the first DynASM example, one obvious optimisation would be to write the remaining loop of run_job in assembly, thereby avoiding a function call on every iteration. This idea leads to the following version of transcode.dasm:

|.arch x64
|.actionlist transcode_actionlist
|.section code
|.globals GLOB_

static void emit_transcoder(Dst_DECL, transcode_job_t* job)
{
| jmp ->loop_test
|->loop_body:
| dec r8
  for(int f = 0; f < job->num_fields; ++f)
  {
    field_info_t* field = job->fields + f;
    switch(field->byte_width)
    {
    case 4:
|     mov eax, [rcx + field->input_offset]
      if(field->input_endianness != field->output_endianness) {
|       bswap eax
      }
|     mov [rdx + field->output_offset], eax
      break;
    case 8:
|     mov rax, [rcx + field->input_offset]
      if(field->input_endianness != field->output_endianness) {
|       bswap rax
      }
|     mov [rdx + field->output_offset], rax
      break;
    default:
      throw std::exception("TODO: Other byte widths");
    }
  }
| add rcx, job->input_record_size
| add rdx, job->output_record_size
|->loop_test:
| test r8, r8
| jnz ->loop_body
| ret
}

In order, the changes to note are:

  1. The addition of the following directive:
    |.globals GLOB_
  2. The addition of the following loop head:
    | jmp ->loop_test
    |->loop_body:
    | dec r8
  3. The addition of the following loop tail:
    | add rcx, job->input_record_size
    | add rdx, job->output_record_size
    |->loop_test:
    | test r8, r8
    | jnz ->loop_body
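Read as plain C, the loop head and tail wrap the per-field code in a jump-to-test loop. The following is only a reference sketch of what the generated code does at run time. It assumes layouts for field_info_t and transcode_job_t (the member names come from the listing, the types are guesses), and transcode_n_records_ref, swap32 and swap64 are hypothetical names. Note one difference: in the DynASM version the for loop over fields runs once at emit time and is unrolled into the machine code, whereas here it runs for every record.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layouts: member names are from the listing, types are guesses. */
typedef struct {
  int byte_width;
  int input_offset;
  int output_offset;
  int input_endianness;
  int output_endianness;
} field_info_t;

typedef struct {
  field_info_t* fields;
  int num_fields;
  int input_record_size;
  int output_record_size;
} transcode_job_t;

/* Portable stand-ins for the bswap instruction. */
static uint32_t swap32(uint32_t x)
{
  return (x >> 24) | ((x >> 8) & 0x0000FF00u) | ((x << 8) & 0x00FF0000u) | (x << 24);
}

static uint64_t swap64(uint64_t x)
{
  return ((uint64_t)swap32((uint32_t)x) << 32) | swap32((uint32_t)(x >> 32));
}

static void transcode_n_records_ref(const transcode_job_t* job,
                                    const void* input, void* output, int n)
{
  const unsigned char* in = (const unsigned char*)input; /* plays the role of rcx */
  unsigned char* out = (unsigned char*)output;           /* plays the role of rdx */
  int f;

  goto loop_test;                      /* jmp ->loop_test */
loop_body:
  --n;                                 /* dec r8 */
  for(f = 0; f < job->num_fields; ++f)
  {
    const field_info_t* field = job->fields + f;
    int swap = field->input_endianness != field->output_endianness;
    if(field->byte_width == 4) {
      uint32_t v;
      memcpy(&v, in + field->input_offset, 4);
      if(swap) v = swap32(v);
      memcpy(out + field->output_offset, &v, 4);
    } else if(field->byte_width == 8) {
      uint64_t v;
      memcpy(&v, in + field->input_offset, 8);
      if(swap) v = swap64(v);
      memcpy(out + field->output_offset, &v, 8);
    }
    /* Other byte widths are silently skipped in this sketch. */
  }
  in += job->input_record_size;        /* add rcx, job->input_record_size */
  out += job->output_record_size;      /* add rdx, job->output_record_size */
loop_test:
  if(n != 0) goto loop_body;           /* test r8, r8 / jnz ->loop_body */
}
```

Entering the loop at loop_test means that a record count of zero falls straight through without touching either buffer.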
The interesting components of these changes are the jumps and the labels. Once you know that the -> prefix is DynASM's notation for so-called global labels, the syntax is the same as in any other assembler: a label is introduced by suffixing it with a colon, and is jumped to by using it as the operand of a jump instruction.
As well as global labels, DynASM also supports so-called local labels. The defining difference between the two is that an assembly fragment containing a global label can only be emitted once, whereas a fragment containing a local label can be emitted an unlimited number of times. As a consequence, when jumping to a local label, you need to specify whether to jump backwards to the nearest previous emission of that label, or forwards to the next subsequent emission of that label. As a global label can only be emitted once, no such specification is needed when jumping to one.
Label type  Syntax   Usage                    Available names           Maximum emissions  Retrievable in C
Global      ->name:  jmp ->name               Any C identifier          1                  Yes
Local       name:    jmp >name (forward) or   Integers between 1 and 9  Unlimited          No
                     jmp <name (backward)
PC          =>expr:  jmp =>expr               Any C expression          N/A                No
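To make the difference concrete, here is a sketch of how the same loop skeleton might look with local labels instead of global ones. This is not part of the example being built here, and the choice of the numbers 1 and 2 is arbitrary. It is a DynASM fragment rather than standalone code, so it only makes sense inside a .dasm file such as the one above:

```
| jmp >2        // forwards, to the next emission of label 2
|1:
| dec r8
| // ... per-field code as before ...
|2:
| test r8, r8
| jnz <1        // backwards, to the nearest previous emission of label 1
```

Written this way, the fragment could be emitted more than once into the same buffer, at the cost of the label addresses no longer being retrievable from C.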
With labels explained, the remaining curiosity is the .globals directive: its effect is to emit a C enumeration with the names of all global labels. For this example, it causes the following to be written in transcode.h:
//|.globals GLOB_
enum {
  GLOB_loop_test,
  GLOB_loop_body,
  GLOB__MAX
};

Now that we're using labels, we need to do slightly more initialisation work. In particular, between calling dasm_init and dasm_setup, we need to do the following:

void* global_labels[GLOB__MAX];
dasm_setupglobal(&state, global_labels, GLOB__MAX);

After calling dasm_encode, the absolute address of ->loop_test: will be stored in global_labels[GLOB_loop_test], and likewise the absolute address of ->loop_body: will be stored in global_labels[GLOB_loop_body].
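One simple use of these addresses is to work out where a label landed relative to the start of the encoded buffer, for example when inspecting the generated code in a disassembler. The following is a minimal sketch under that assumption. label_offset is a hypothetical helper and not part of DynASM, and it only gives a meaningful answer once dasm_encode has filled in the array:

```c
#include <stddef.h>

/* Hypothetical helper, not part of DynASM: byte offset of the label with the
   given enumeration index, measured from the start of the encoded buffer. */
static ptrdiff_t label_offset(const void* code, void* const* labels, int index)
{
  return (const char*)labels[index] - (const char*)code;
}
```

With the names used above, the call would look like label_offset(code, global_labels, GLOB_loop_body).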

For completeness, the final C code is as follows:

void (*make_transcoder(transcode_job_t* job))(const void*, void*, int)
{
  dasm_State* state;
  int status;
  void* code;
  size_t code_size;
  void* global_labels[GLOB__MAX];

  dasm_init(&state, DASM_MAXSECTION);
  dasm_setupglobal(&state, global_labels, GLOB__MAX);
  dasm_setup(&state, transcode_actionlist);

  emit_transcoder(&state, job);
  
  status = dasm_link(&state, &code_size);
  assert(status == DASM_S_OK);

  code = VirtualAlloc(nullptr, code_size, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
  status = dasm_encode(&state, code);
  assert(status == DASM_S_OK);

  dasm_free(&state);
  return (void(*)(const void*, void*, int))code;
}

void run_job(transcode_job_t* job)
{
  void (*transcode_n_records)(const void*, void*, int) = make_transcoder(job);
  transcode_n_records(job->input, job->output, job->num_input_records);
}