What is DynASM?

DynASM advertises itself as a dynamic assembler for code generation engines. I can think of several interpretations of what a dynamic assembler might be, not all of which are compatible with each other. As such, it is worth beginning my series about DynASM with a description of what it is and what it isn't.

The envisioned usage pattern is to have fragments of assembly code which are syntactically complete, except possibly for the values of some constants (i.e. the instructions, addressing modes, and registers are all fixed). A decision is made at runtime as to how many copies to make of each fragment, what the value of each constant should be in each fragment, and in what order to emit the fragments.

The DynASM site states that DynASM takes mixed C/Assembler source as input, and produces plain C code output. While this is true, it is also easy to misinterpret: the input to DynASM is a C file whose intended purpose is to emit machine code - the assembly portion of the input is the code to emit rather than the code to run in-line with the C portion. As an example, the intent of the following code is that when write_return_n is called, the machine code for return n; is emitted:

void write_return_n(dasm_state ds, int n)
{
| mov rax, n
| ret
}

If DynASM was implemented by building up a string of assembly code and then passing the result to a stand-alone assembler, then the result of passing the above code through DynASM might be:

void write_return_n(dasm_state ds, int n)
{
  ds += "mov rax, " + n + "\n";
  ds += "ret\n";
}

In reality, DynASM builds up a string of machine code rather than a string of assembly code, meaning that the actual output is somewhat closer to the following:

void write_return_n(dasm_state ds, int n)
{
  dasm_append_bytes(ds, 0x48, 0xC7, 0xC0); dasm_append_dword(ds, n);
  dasm_append_bytes(ds, 0xC3);
}

With this example in mind, DynASM can be described as a text-processing tool which takes lines starting with a vertical bar, interprets them as assembly code, and replaces them with C code which writes out the corresponding machine code. This description fails to mention a bunch of really nice features, but it gives the general idea.

A note on D3D10_RESOURCE_MISC_GDI_COMPATIBLE

Direct3D 10.1 is interoperable with the Windows GDI, but there is very little explicit documentation on the matter. MSDN has a diagram with Direct2D at the center of the interoperability web, but we have to rely on a DirectX blog post for the complete interoperability diagram. In particular, note that the latter diagram has a path from Direct3D10.1 to GDI via DXGI 1.1.

MSDN gives the impression that this interoperability is simple: when calling CreateTexture2D, there is a nice flag called D3D10_RESOURCE_MISC_GDI_COMPATIBLE, the documentation for which says that after enabling the flag, the resulting texture can be cast to an IDXGISurface1 and have GetDC called on it. A similar flag called DXGI_SWAP_CHAIN_FLAG_GDI_COMPATIBLE exists when creating a swap chains rather than textures. Unfortunately, if you try to naively use one of these flags, resource creation might well fail with E_INVALIDARG. The reason for this failure is stated on GetDC, but really needs to be more prominent:

The format for the surface or swap chain must be DXGI_FORMAT_B8G8R8A8_UNORM_SRGB or DXGI_FORMAT_B8G8R8A8_UNORM.

If this constraint isn't satisfied, then it isn't GetDC which will fail, but the resource creation itself.

Making COM nice to use

I write a lot of C++ code for Windows. There are a lot of Windows APIs which expose cool functionality and are implemented using COM; for me this currently means Direct2D, DirectWrite, Direct3D, DXGI, and Windows Imaging Component. In terms of presenting a compatible ABI, COM is a good thing (and is certainly nicer than straight flattening a C++ API to a C API). That said, COM looks extremely clunky in modern C++, for the following reasons:

Some of those reasons are fairly minor, others are a major nuisance, but in aggregate, their overall effect makes COM programming rather unpleasant. Some people take small measures to attack one of the reasons individually, such as CHECK_HRESULT macro for throwing an exception upon an unsuccessful HRESULT (but you still need to wrap each call in this macro), or a com_ptr<T> templated smart pointer which at least does AddRef and Release automatically (but you then need to unwrap the smart pointer when passing it as a parameter to a COM method). I think that a much better approach is to attack all of the problems simultaneously. It isn't as easy as writing a single macro or a single smart pointer template, but the outcome is nice-to-use COM rather than COM-with-one-less-problem.

As with switching on Lua strings, my answer is code generation. I have a tool which:

  1. Takes as input a set of COM headers (for example d3d10.h, d3d10_1.h, d2d1.h, dwrite.h, dxgi.h, and wincodec.h).
  2. Identifies every interface which is defined across this set of headers.
  3. For each interface, it writes out a new class (in an appropriate namespace) which is like a smart pointer on steroids:
    1. If the interface inherits from a base interface, then the smart class inherits from the smart class corresponding to the base interface.
    2. Just like a smart pointer, AddRef and Release are handled automatically by copy construction, move construction, copy assignment, and move assignment.
    3. For each method of the interface, a new wrapper method (or set of overloaded methods) is written for the class:
      1. If the return type was HRESULT, then the wrapper checks for failure and throws an exception accordingly.
      2. Out-parameters become return values (using a std::tuple if there was more than one out-parameter, or an out-parameter on a method which already had a non-void and non-HRESULT return type).
      3. Coupled parameter pairs (such as pointer and length, or pointer and IID) become a single templated parameter.
      4. Pointers to POD structures get replaced with references to POD structures, with optional pointers becoming an overload which omits the parameter entirely.
      5. Pointers to COM objects get replaced with (references to) their corresponding smart class (in the case of in-parameters and also out-parameters).

As an example, consider the following code which uses raw Direct2D to create a linear gradient brush:

// rt has type ID2D1RenderTarget*
// brush has type ID2D1LinearGradientBrush*
D2D1_GRADIENT_STOP stops[] = {
  {0.f, colour_top},
  {1.f, colour_bottom}};
ID2D1GradientStopCollection* stops_collection = nullptr;
HRESULT hr = rt->CreateGradientStopCollection(
  stops,
  sizeof(stops) / sizeof(D2D1_GRADIENT_STOP),
  D2D1_GAMMA_2_2,
  D2D1_EXTEND_MODE_CLAMP,
  &stops_collection);
if(FAILED(hr))
  throw Exception(hr, "ID2D1RenderTarget::CreateGradientStopCollection");
hr = rt->CreateLinearGradientBrush(
  LinearGradientBrushProperties(Point2F(), Point2F())
  BrushProperties(),
  stops_collection,
  &brush);
stops_collection->Release();
stops_collection = nullptr;
if(FAILED(hr))
  throw Exception(hr, "ID2D1RenderTarget::CreateLinearGradientBrush");

With the nice-COM headers, the exact same behaviour is expressed in a much more concise manner:

// rt has type C6::D2::RenderTarget
// brush has type C6::D2::LinearGradientBrush
D2D1_GRADIENT_STOP stops[] = {
  {0.f, colour_top},
  {1.f, colour_bottom}};
brush = rt.createLinearGradientBrush(
  LinearGradientBrushProperties(Point2F(), Point2F()),
  BrushProperties(),
  rt.createGradientStopCollection(
    stops,
    D2D1_GAMMA_2_2,
    D2D1_EXTEND_MODE_CLAMP));

A switch on Lua strings

In one of the projects I'm currently working on, Lua is used a lot as a data description language. As a result, there are many occasions where I'd like to run a switch statement work over a Lua string. One example of this is the following function:

Art* MakeArt(lua_State* L)
{
  lua_getfield(L, -1, "type");
  switch(lua_tostring(L, -1))
  {
  case "Graphic":   return new GraphicArt(L);
  case "Rectangle": return new RectangleArt(L);
  case "Text":      return new TextArt(L);
  case "Line":      return new LineArt(L);
  case "Swf":       return new SwfArt(L);
  }
  return new UnknownArt(L);
}

To my mind, this code perfectly expresses the idea of performing different actions based on different strings from Lua. The only problem is that this code won't work. A common translation into code which does actually work is the following:

Art* MakeArt(lua_State* L)
{
  lua_getfield(L, -1, "type");
  if(const char* s = lua_tostring(L, -1))
  {
    if(!strcmp(s, "Graphic"))   return new GraphicArt(L);
    if(!strcmp(s, "Rectangle")) return new RectangleArt(L);
    if(!strcmp(s, "Text"))      return new TextArt(L);
    if(!strcmp(s, "Line"))      return new LineArt(L);
    if(!strcmp(s, "Swf"))       return new SwfArt(L);
  }
  return new UnknownArt(L);
}

This translation isn't perfect, as a string like "Line\0Stuff" will result in a LineArt rather than an UnknownArt, but usually this is acceptable. A further refinement might also make use of the string length:

Art* MakeArt(lua_State* L)
{
  lua_getfield(L, -1, "type");
  size_t len;
  if(const char* s = lua_tolstring(L, -1, &len))
  {
    switch(len)
    {
    case 7:
      if(!memcmp(s, "Graphic", 7)) return new GraphicArt(L);
      break;
    case 9:
      if(!memcmp(s, "Rectangle", 9)) return new RectangleArt(L);
      break;
    case 4:
      if(!memcmp(s, "Text", 4)) return new TextArt(L);
      if(!memcmp(s, "Line", 4)) return new LineArt(L);
      break;
    case 3:
      if(!memcmp(s, "Swf"), 3) return new SwfArt(L);
      break;
    }
  }
  return new UnknownArt(L);
}

This refinement is technically correct, and probably faster due to dispatching based on length and the use of memcmp rather than strcmp. However, compared to the very first (non-functional) code fragment, this is harder to read and harder to maintain. This point is important to bear in mind, but things will get worse still before they get better. It just so happens that in the project I'm working on, the Lua library is statically linked. This means I can take advantage of Lua's implementation details safe in the knowledge that the implementation isn't going to change. The detail which I'd like to take advantage of is that Lua calculates a hash value for each string, and stores this in a header prior to the string contents. It also stores the length in this header, which we can also take advantage of. This train of thought leads to the following code fragment:

Art* MakeArt(lua_State* L)
{
  lua_getfield(L, -1, "type");
  if(auto ts = reinterpret_cast<const TString*>(lua_tostring(L, -1)))
  {
    switch(ts[-1].tsv.len)
    {
    case 7:
      if(ts[-1].tsv.hash == 1408413413ULL) return new GraphicArt(L);
      break;
    case 9:
      if(ts[-1].tsv.hash == 3769334599ULL) return new RectangleArt(L);
      break;
    case 4:
      switch(ts[-1].tsv.hash)
      {
      case 7903477ULL: return new TextArt(scene, L);
      case 7864041ULL: return new LineArt(scene, L);
      }
      break;
    case 3:
      if(ts[-1].tsv.hash == 205275ULL) return new SwfArt(L);
      break;
    }
  }
  return new UnknownArt(L);
}

As with the previous version based on strcmp, this version isn't perfect. It is highly likely (at least for strings longer than 4 characters) that there will be some other strings of the same length and hash value, but this is acceptable to me. On the positive side, this version should be extremely quick, as switching over a hash should be faster than sequential memcmps. Unfortunately, not only is this version unreadable and unmaintainable, it is also unwritable.

At this point, enter code generation. It is an entirely mechanical process to translate the original (non-functional) code into hash-based dispatch code, and furthermore the hash calculations are best done by a machine. Therefore I've written a preprocessing script for C++ source files which performs this translation. The input looks almost like the original (non-functional) code, with the addition of some markers:

Art* MakeArt(lua_State* L)
{
  lua_getfield(L, -1, "type");
#ifdef LUA_STRING_SWITCH
  switch(lua_tostring(L, -1))
  {
  case "Graphic":   return new GraphicArt(L);
  case "Rectangle": return new RectangleArt(L);
  case "Text":      return new TextArt(L);
  case "Line":      return new LineArt(L);
  case "Swf":       return new SwfArt(L);
  }
#endif
  return new UnknownArt(L);
}

The script goes through the file, finds each #ifdef LUA_STRING_SWITCH, throws away any existing #else block, and then writes an #else block based on the contents of the switch, resulting in something like:

Art* MakeArt(lua_State* L)
{
  lua_getfield(L, -1, "type");
#ifdef LUA_STRING_SWITCH
  switch(lua_tostring(L, -1))
  {
  case "Graphic":   return new GraphicArt(L);
  case "Rectangle": return new RectangleArt(L);
  case "Text":      return new TextArt(L);
  case "Line":      return new LineArt(L);
  case "Swf":       return new SwfArt(L);
  }
#else
  if(auto ts = reinterpret_cast<const TString*>(lua_tostring(L, -1)))
  {
    switch(ts[-1].tsv.len)
    {
    case 7:
      if(ts[-1].tsv.hash == 1408413413ULL) return new GraphicArt(L);
      break;
    case 9:
      if(ts[-1].tsv.hash == 3769334599ULL) return new RectangleArt(L);
      break;
    case 4:
      switch(ts[-1].tsv.hash)
      {
      case 7903477ULL: return new TextArt(scene, L);
      case 7864041ULL: return new LineArt(scene, L);
      }
      break;
    case 3:
      if(ts[-1].tsv.hash == 205275ULL) return new SwfArt(L);
      break;
    }
  }
#endif
  return new UnknownArt(L);
}

At compile time, LUA_STRING_SWITCH isn't defined, and so the hash-based dispatch gets used. At edit time, I use the folding features of an IDE to hide the nasty hash-based dispatch, so all I see is:

Art* MakeArt(lua_State* L)
{
  lua_getfield(L, -1, "type");
#ifdef LUA_STRING_SWITCH
  switch(lua_tostring(L, -1))
  {
  case "Graphic":   return new GraphicArt(L);
  case "Rectangle": return new RectangleArt(L);
  case "Text":      return new TextArt(L);
  case "Line":      return new LineArt(L);
  case "Swf":       return new SwfArt(L);
  }
#else /* Hidden Preprocessor Block */
#endif
  return new UnknownArt(L);
}

This gives the best of both worlds. I can write pseudo-C++ which expresses my ideas very clearly, and then it can be mechanically translated and compiled down to something very efficient.

Mapping and Lua Iterators

Introduction to simple iterators in Lua

Let us begin by considering what a simple iterator is. In Lua, a simple iterator is a function. Each time you call the function, you get back the next value in the sequence. In the following example, f is (assumed to be) a simple iterator:

local f = ("cat"):gmatch(".")
print(f()) --> c
print(f()) --> a
print(f()) --> t

[Technical note: In the reference implementation of Lua, string.gmatch returns a simple iterator, but it is questionable whether or not this is mandated by the language. In particular, LuaJIT returns a full iterator rather than a simple iterator; see later]

As Lua functions can return multiple values, it seems logical to allow iterators to produce multiple values per call, and this is indeed allowed. In the following example, f is (assumed to be) a simple iterator which produces two values per call:

local f = ("The Cute Cat"):gmatch("(%u)(%l*)")
print(f()) --> T, he
print(f()) --> C, ute
print(f()) --> C, at

To signal that it has finished producing values, an iterator returns nil. Well, nearly; an iterator can signal completion by returning no values, by returning a single nil, or by returning multiple values with the first value being nil. With this in mind, the previous example can be rewritten to use a loop rather than an explicit number of calls to f:

local f = ("The Cute Cat"):gmatch("(%u)(%l*)")
while true do
  local first, rest = f()
  if first == nil then
    break
  end
  print(first, rest)
end

This is rather verbose syntax, and thankfully the generic for loop can be used to achieve the same effect:

local f = ("The Cute Cat"):gmatch("(%u)(%l*)")
for first, rest in f do
  print(first, rest)
end

In the previous examples, string.gmatch is termed a simple iterator factory, as it returns a simple iterator (f). Equipped with this knowledge of how simple iterators work, we can write a simple iterator factory which emulates the numeric for loop:

function range(init, limit, step)
  step = step or 1
  return function()
    local value = init
    init = init + step
    if limit * step >= value * step then
      return value
    end
  end
end

This can then be used like so:

for n in range(3, 0, -1) do print(n) end --> 3; 2; 1; 0

Introduction to full iterators in Lua

Looking at range in slightly more detail, we see that every time it is called, it returns a function, and that the function has three upvalues (init, limit, and step). From an esoteric efficiency point of view, it would be nice to have less upvalues (or ideally no upvalues at all). Full iterators are one way of achieving this. Whereas a simple iterator is just a function (f), a full iterator is a three-tuple (f,s,v). As with a simple iterator, f is still a function (or something with a __call metamethod). As for the two new bits, s is an opaque value which is passed as the first parameter to each call of f, and v is an opaque value which is passed as the second parameter to the first call of f. Subsequent calls of f pass the first return value from the previous call of f as the the second parameter.

Written out formally like that, full iterators sound a little bit odd, but once you've wrapped your head around it, they are really quite reasonable. As an example, let us rewrite range to be a full iterator factory rather than a simple iterator factory. For this, we'll take s to be limit, and v to be init - step, like so:

function range(init, limit, step)
  step = step or 1
  return function(lim, value)
    value = value + step
    if lim * step >= value * step then
      return value
    end
  end, limit, init - step
end

As the generic for loop is designed to work with full iterators, this version of range can be used just like the previous version:

for n in range(3, 0, -1) do print(n) end --> 3; 2; 1; 0

Alternatively, to appreciate what is going on, we could use a verbose loop instead:

local f, s, v = range(3, 0, -1)
while true do
  v = f(s, v)
  if v == nil then
    break
  end
  print(v)
end

Looking at this version of range, the returned function only has one upvalue (step). In real terms, this means that 32 less bytes of space are allocated on the heap as compared to the three upvalue version. Admittedly this isn't much, and it hardly justifies the language support for full iterators. Rather more convincing are the pairs and ipairs full iterator factories from the Lua standard library. The full iterator machinery is precisely what these two functions need in order for their iterator functions to have no upvalues at all, meaning that you can iterate through a table without having to allocate anything on the heap, as in the following example:

local t = {
  Lemon = "sour",
  Cake = "nice",
}
for ingredient, taste in pairs(t) do
  print(ingredient .." is ".. taste)
end
--> Lemon is sour
--> Cake is nice

Mapping Lua iterators

Switching from Lua to Haskell for just a minute, recall the type signature of the map function:

map :: (a -> b) -> [a] -> [b]

This takes a function as its first parameter, and applies this function to every element of a list in order to generate a new list. Due to Haskell's lazy evaluation behaviour, it isn't unreasonable to equate Haskell lists with iterators in other languages. With that in mind, a first attempt at a map function for Lua iterators might be:

function map(mapf, f, ...)
  return function(...)
    return mapf(f(...))
  end, ...
end

For all intents and purposes, this is just dressed up function composition. For some cases, like the following, it even works:

local f = map(string.upper, ("cat"):gmatch("."))
print(f()) --> C
print(f()) --> A
print(f()) --> T

Unfortunately, the following example doesn't work quite as well:

local t = {
  Lemon = "sour",
  Cake = "nice",
}
for ingredient, taste in map(function(a, b)
  return a:lower(), b:upper()
end, pairs(t)) do
  print(ingredient .." is ".. taste)
end
--> lemon is SOUR
--> invalid key to 'next'

If you've understood how full iterators work, then you should be able to appreciate why this example fails after the first iteration: the map function is changing the results from the iterator function, and the first of these changed results is being passed back to the iterator function, which confuses it. To solve this issue, we need a slightly more complicated construction:

function map(mapf, f, s, v)
  local function domap(...)
    v = ...
    if v ~= nil then
      return mapf(...)
    end
  end
  return function()
    return domap(f(s, v))
  end
end

With this construction, we get the intended behaviour:

local t = {
  Lemon = "sour",
  Cake = "nice",
}
for ingredient, taste in map(function(a, b)
  return a:lower(), b:upper()
end, pairs(t)) do
  print(ingredient .." is ".. taste)
end
--> lemon is SOUR
--> cake is NICE

ConcatMapping Lua iterators

Jumping back to Haskell again, consider the slightly more complex concatMap function:

concatMap :: (a -> [b]) -> [a] -> [b]

This allows the iterator function to produce zero or more results for each value of input, for example:

module Main where
import Data.Char
main = putStr $
  concatMap (\c -> if isUpper c then [c] else [c,c]) "Cat" -- > "Caatt"

We could perform a direct conversion of this to Lua, and have the map function return an iterator, but unlike Haskell's syntactic support for constructing lists, Lua has none for constructing iterators, and so the result would feel rather clunky. The central issue is deciding how to return multiple values, or infact multiple tuples, from the map function. In Haskell it is natural to do this with lists, whereas in Lua we can achieve it with coroutines. Consider the following even more complex form of the map function in Lua:

function map(mapf, f, s, v)
  local done
  local function maybeyield(...)
    if ... ~= nil then
      coroutine.yield(...)
    end
  end
  local function domap(...)
    v = ...
    if v ~= nil then
      return maybeyield(mapf(...))
    else
      done = true
    end
  end
  return coroutine.wrap(function()
    repeat
      domap(f(s, v))
    until done
  end)
end

This version of map still permits the previous example:

local t = {
  Lemon = "sour",
  Cake = "nice",
}
for ingredient, taste in map(function(a, b)
  return a:lower(), b:upper()
end, pairs(t)) do
  print(ingredient .." is ".. taste)
end
--> lemon is SOUR
--> cake is NICE

Furthermore, it supports the generation of multiple tuples for each input by calling coroutine.yield for each additional tuple:

local t = {
  Lemon = "sour",
  Cake = "nice",
}
for ingredient, taste in map(function(a, b)
  coroutine.yield(a:lower(), b:upper())
  return a, "very ".. b
end, pairs(t)) do
  print(ingredient .." is ".. taste)
end
--> lemon is SOUR
--> Lemon is very sour
--> cake is NICE
--> Cake is very nice

Alternatively, if the map function doesn't return, then it doesn't generate anything at all:

for n in map(function(n)
  if n ~= 0 then
    return 1 / n
  end
end, range(-2, 2)) do
  print(n)
end
--> -0.5; -1; 1; 0.5
page: 3 4 5 6 7