Control flow guard

Microsoft recently introduced a new security feature called Control Flow Guard. At a basic level, this feature consists of a massive bit vector, and before any indirect function call is performed, the bit vector is consulted to determine whether the target of the call is valid or not. The end-goal is that the bit vector should specify all function entry addresses as valid, and all other addresses as invalid - thereby preventing malicious calls into the middle of functions. The structure of the bit vector is interesting, but most literature seems to get it wrong. For example, the Trend Micro report on Control Flow Guard states:

The status of every 8 bytes in the process space corresponds to a bit in CFGBitmap. If there is a function starting address in each group of 8 bytes, the corresponding bit in CFGBitmap is set to 1; otherwise it is set to 0.

Every bit in the CFGBitmap represents eight bytes in the process space. So if an invalid target call address has less than eight bytes from the valid function address, the CFG will think the target call address is "valid."

Meanwhile, a POC2014 conference talk states:

One bit indicates 8bytes address and actually in most cases 16bytes
Every guard function address needs to be aligned to 0x10
If function address is not aligned to 0x10, it will use the odd bit only

In the bit vector (which Trend Micro calls CFGBitmap), every two bits correspond to sixteen bytes of address space. Therefore, on average, every bit corresponds to eight bytes. The average is misleading though: in reality, one of the two bits covers just one byte (the sixteen-byte-aligned address), and the other bit covers the remaining fifteen bytes. This arrangement has several benefits:

Exporting an infinite number of symbols

Dynamically loadable shared libraries typically come in one of a few formats: PE on Windows, ELF on Linux and most other Unix-like systems, and Mach-O on OS X.

The whole point of dynamically loadable shared libraries is to export symbols, and these formats typically store exported symbol information as a list of exported symbols or a hash table of exported symbols. One nice property of lists and hash tables is that they're finite by default: unless you deliberately try to make them infinite, they'll be finite.

One oddity of the Mach-O format is that exported symbol information can be represented as a trie. The term trie is meant to allude to tree, and trees are also finite by default. However, a trie can also be thought of as a rooted directed graph, and if that graph were to contain a cycle, then the number of distinct paths from the root - and hence the number of exported names - would be infinite.

Let us begin with a file called finite.c:

void corsix() {}
void corsix_() {}

#define C2(x) void corsix_##x() {}
#define C1(x) C2(x##a) C2(x##b) C2(x##c) C2(x##d) C2(x##e)
#define C0(x) C1(x##a) C1(x##b) C1(x##c) C1(x##d) C1(x##e)
C0(a) C0(b) C0(c) C0(d) C0(e)

We can compile this to a shared library like so:

$ clang finite.c -shared -o finite.dylib

This gives us a shared library called finite.dylib which exports 127 symbols: corsix, corsix_, and the 125 symbols matching the regex corsix_[a-e][a-e][a-e]. These symbols aren't overly interesting, and the sheer number of symbols is merely to ensure that the exported symbol trie in finite.dylib occupies sufficiently many bytes.

The exported symbol trie in finite.dylib looks something like the following diagram:

                                         +-"a"-> corsix_a ...
                                         |
                                         +-"b"-> corsix_b ...
                                         |
root -"_corsix"-> corsix -"_"-> corsix_ -+-"c"-> corsix_c ...
                                         |
                                         +-"d"-> corsix_d ...
                                         |
                                         +-"e"-> corsix_e ...

Our aim is to replace the exported symbol trie with something like the following diagram:

                +- <---"_"---+
                |            |
root -"_corsix"-+-> corsix --+
                |            |
                +- <---"a"---+
                |            |
                +- <---"b"---+
                |            |
                +- <---"c"---+
                |            |
                +- <---"d"---+
                |            |
                +- <---"e"---+

With such a trie, the symbol originally called corsix should now be exported under all the names matching the regex corsix[_a-e]*. We could also go slightly further, adding more looping edges to the trie, in order to reach corsix[_a-z0-9]*.
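
To make sense of the byte-level surgery that follows, it helps to know roughly how a node in the export trie is laid out: a terminal-info blob describing the exported symbol (if the node terminates one), a one-byte child count, and then a null-terminated edge label plus a child-node offset for each child. The Python sketch below is purely illustrative - encode_node is a hypothetical helper of mine, and the uleb128 fields are abbreviated to single bytes, which suffices for tries as small as ours:

def encode_node(terminal_info, children):
    # terminal_info: bytes describing the exported symbol (empty if this node
    #                does not terminate a symbol).
    # children:      list of (edge_label, child_node_offset) pairs.
    out = bytes([len(terminal_info)]) + terminal_info
    out += bytes([len(children)])
    for label, offset in children:
        out += label + b"\0" + bytes([offset])
    return out

# The looping node we are about to construct is, in effect:
#   encode_node(<corsix's original terminal info>,
#               [(c.encode(), <offset of this very node>)
#                for c in "_abcdefghijklmnopqrstuvwxyz0123456789"])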

We'll use the following transform.lua program to do the dirty work of trie replacement:

dylib = io.read"*a"  -- read the entire dylib into a string
-- Find the "_corsix\0" edge in the root trie node: nof is its child-node offset
-- byte, pos is where the corsix node itself begins, tsz is that node's terminal size.
nof, pos, tsz = dylib:match"_corsix%z(.)()(.)"
-- Rebuild the node: keep its terminal info, then append 37 ("\37") child edges,
-- one per character of [_a-z0-9], each looping back to offset nof.
node = dylib:sub(pos, pos + tsz:byte()) .. "\37" ..
  ("_abcdefghijklmnopqrstuvwxyz0123456789"):gsub(".", "%0\0" .. nof)
-- Splice the new node over the old one, leaving the file size unchanged.
io.write(dylib:sub(1, pos-1) .. node .. dylib:sub(pos + #node))

Running the program like so will generate a file called infinite.dylib:

$ lua transform.lua <finite.dylib >infinite.dylib

We'll then use the following client.cpp program to query the exported symbols of the two .dylib files:

#include <dlfcn.h>
#include <stdio.h>

void check_dylib(const char* path) {
  void* dylib = dlopen(path, RTLD_LOCAL);
  printf("\nName lookup results in %s:\n", path);
  const char* names[] = {
    "foobar23", "corsix", "corsix_aaa", "corsix_abc",
    "corsix_xyz", "corsix_foobar23", "corsix_dot_org"
  };
  for (const char* name : names) {
    printf("%-15s -> %p\n", name, dlsym(dylib, name));
  }
}

int main() {
  check_dylib("./finite.dylib");
  check_dylib("./infinite.dylib");
  return 0;
}

Compiling and running gives the following output:

$ clang -std=c++11 client.cpp && ./a.out

Name lookup results in ./finite.dylib:
foobar23        -> 0x0
corsix          -> 0x1076347b0
corsix_aaa      -> 0x1076347d0
corsix_abc      -> 0x107634840
corsix_xyz      -> 0x0
corsix_foobar23 -> 0x0
corsix_dot_org  -> 0x0

Name lookup results in ./infinite.dylib:
foobar23        -> 0x0
corsix          -> 0x1076377b0
corsix_aaa      -> 0x1076377b0
corsix_abc      -> 0x1076377b0
corsix_xyz      -> 0x1076377b0
corsix_foobar23 -> 0x1076377b0
corsix_dot_org  -> 0x1076377b0

I don't know of any particularly useful reason for exporting an infinite number of symbols, but it does trip up Apple's dyldinfo tool, and it might also trip up other tools of a similar nature:

$ dyldinfo -export infinite.dylib 
export information (from trie):
Segmentation fault: 11

$ dyldinfo -export_dot infinite.dylib
digraph {
  node000;
  node000 -> node011 [ label=_corsix ] ;
  node011 [ label=_corsix,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix_,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix__,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix___,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix____,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix_____,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix______,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix_______,addr0x000007B0 ];
  node011 -> node011 [ label=_ ] ;
  node011 [ label=_corsix________,addr0x000007B0 ];
... 15000 lines of output omitted ...
Segmentation fault: 11

Why are slots so slow?

One of the points in Armin Ronacher's The Python I Would Like To See is that slots are slow. That is, A() + A() is slower than A().__add__(A()) in the context of the following:

class A(object):
    def __add__(self, other):
        return 42

I'd like to investigate this claim for myself. To begin, let us repeat the experiment and see whether we get the same result:

$ cat x.py
class A(object):
    def __add__(self, other):
        return 42
$ ./python.exe
Python 3.5.0a4+ (default, Apr 25 2015, 21:57:28) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.49)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from x import A
>>> a = A()
>>> b = A()
>>> a + b
42
>>> quit()
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
1000000 loops, best of 3: 0.215 usec per loop
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a.__add__(b)'
10000000 loops, best of 3: 0.113 usec per loop

It would seem that Armin's claim stands up; a + b is indeed considerably slower than a.__add__(b).

First of all, an implicit assumption of Armin's claim is that a + b should be equivalent to a.__add__(b). Let us check this assumption by asking what a + b actually means in Python. The documentation for + is probably a good place to start:

The + (addition) operator yields the sum of its arguments. The arguments must either both be numbers or both be sequences of the same type. In the former case, the numbers are converted to a common type and then added together. In the latter case, the sequences are concatenated.

Well, uh, that doesn't explain the observed behaviour of a + b giving 42. Perhaps the documentation for __add__ will shed some light on the situation:

These methods are called to implement the binary arithmetic operations (+, [...]). For instance, to evaluate the expression x + y, where x is an instance of a class that has an __add__() method, x.__add__(y) is called. [...] If one of those methods does not support the operation with the supplied arguments, it should return NotImplemented.

Well, that explains the observed behaviour, and seems to pretty much straight up say that a + b means a.__add__(b). However, let's not get ahead of ourselves. On the off chance that it is relevant, let's consider the documentation for __radd__:

These methods are called to implement the binary arithmetic operations (+, [...]) with reflected (swapped) operands. These functions are only called if the left operand does not support the corresponding operation and the operands are of different types. For instance, to evaluate the expression x - y, where y is an instance of a class that has an __rsub__() method, y.__rsub__(x) is called if x.__sub__(y) returns NotImplemented.

Well, whad'ya know, it was relevant. With this extra bit of information, it seems like a + b is equivalent to something like:

if __add__ in a:
  tmp = a.__add__(b)
else:
  tmp = NotImplemented
if tmp is NotImplemented and type(a) != type(b):
  return b.__radd__(a)
else:
  return tmp

Of course, the story doesn't end there; immediately after the piece of documentation quoted above is the following gem:

Note: If the right operand’s type is a subclass of the left operand’s type and that subclass provides the reflected method for the operation, this method will be called before the left operand’s non-reflected method. This behavior allows subclasses to override their ancestors’ operations.

Bearing this in mind, maybe a + b is equivalent to something like:

if issubclass(type(b), type(a)) and __radd__ in b:
  tmp = b.__radd__(a)
  if tmp is not NotImplemented:
    return tmp
if __add__ in a:
  tmp = a.__add__(b)
else:
  tmp = NotImplemented
if tmp is NotImplemented and type(a) != type(b):
  return b.__radd__(a)
else:
  return tmp
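
A quick experiment confirms this ordering; the classes below are mine, purely for illustration:

class Base(object):
    def __add__(self, other):
        return "Base.__add__"

class Derived(Base):
    def __radd__(self, other):
        return "Derived.__radd__"

print(Base() + Base())     # same type on both sides: "Base.__add__"
print(Base() + Derived())  # right operand is a subclass providing __radd__,
                           # so its reflected method wins: "Derived.__radd__"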

I wish that the above were the full story, but alas it is not. Let us pluck another link out of thin air, this time to the documentation on special method lookup:

For custom classes, implicit invocations of special methods are only guaranteed to work correctly if defined on an object’s type, not in the object’s instance dictionary. [...] In addition to bypassing any instance attributes in the interest of correctness, implicit special method lookup generally also bypasses the __getattribute__() method even of the object’s metaclass.

I like to interpret this paragraph as saying bugger it, a + b means whatever the CPython interpreter does for a + b. Having studied the interpreter, I'd say the meaning of a + b is equivalent to something along the lines of the following:

def get(x, field):
  try:
    return getattr(type(x), field) # Doesn't call __getattribute__
  except AttributeError:
    return None

def has(x, field):
  return get(x, field) is not None

# From now on, `x.__yzw__` means `get(x, '__yzw__')`
# and `__abc__ in d` means `has(d, '__abc__')`

def tp_add_slot(x):
  if x is a builtin type or a type from a C extension:
    return ?
  elif __add__ in x or __radd__ in x:
    return slot_nb_add
  else:
    return None

def sq_concat_slot(x):
  return ?

def slot_nb_add(x, y):
  do_other = type(x) != type(y) and tp_add_slot(y) == slot_nb_add and __radd__ in y
  if tp_add_slot(x) == slot_nb_add:
    if do_other and issubclass(type(y), type(x)) and (__radd__ not in x or x.__radd__ != y.__radd__):
      tmp = y.__radd__(x)
      if tmp is NotImplemented:
        do_other = False
      else:
        return tmp
    if __add__ in x:
      tmp = x.__add__(y)
      if tmp is not NotImplemented:
        return tmp
  if do_other:
    return y.__radd__(x)
  return NotImplemented

slota = tp_add_slot(a)
slotb = tp_add_slot(b)
slotc = sq_concat_slot(a)
if slota == slotb:
  return slota(a, b)
if slota is not None and slotb is not None and issubclass(type(b), type(a)):
  tmp = slotb(a, b)
  if tmp is NotImplemented:
    slotb = None
  else:
    return tmp
if slota is not None:
  tmp = slota(a, b)
  if tmp is not NotImplemented:
    return tmp
if slotb is not None:
  tmp = slotb(a, b)
  if tmp is not NotImplemented:
    return tmp
if slotc is None:
  raise error
else:
  return slotc(a, b)

The conclusion of the above exploration is that a + b has a rather more nuanced meaning than just a.__add__(b). If we accept this conclusion, then perhaps it shouldn't be surprising that a + b is slower than a.__add__(b). However, a and b are of the same type in our case, so the above pseudo-code should pretty quickly conclude that a + b means just a.__add__(b).

Let us consider an alternative conclusion: the people behind the CPython interpreter have spent more time optimising a.__add__(b) than they have spent optimising a + b. To test this hypothesis, we need to dig into the bytecode of these two expressions. If we ignore the bytecode which is common to both expressions, then a.__add__(b) consists of two bytecode instructions (LOAD_ATTR and CALL_FUNCTION), while a + b consists of just a single bytecode instruction (BINARY_ADD).
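
The dis module makes this easy to verify; the snippet below is illustrative, and the exact offsets and instruction operands will vary between CPython versions:

import dis

# `a + b`: two loads followed by a single BINARY_ADD.
dis.dis(compile('a + b', '<expr>', 'eval'))

# `a.__add__(b)`: a LOAD_ATTR to fetch the bound method, then CALL_FUNCTION.
dis.dis(compile('a.__add__(b)', '<expr>', 'eval'))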

Let's begin with a.__add__(b) and look at what happens when the bytecode is executed:

  1. Begin LOAD_ATTR instruction.
  2. Call PyObject_GetAttr.
  3. Call PyObject_GenericGetAttr (via the tp_getattro slot in type(a)).
  4. Call _PyObject_GenericGetAttrWithDict.
  5. Call _PyType_Lookup.
  6. Successfully find __add__ in the method cache.
  7. Return a function object to _PyObject_GenericGetAttrWithDict.
  8. Call func_descr_get (via the tp_descr_get slot in the type of the function object).
  9. Call PyMethod_New (to bind a to the first argument of the function).
  10. Return a method object from PyObject_GetAttr.
  11. Push the method object onto the stack.
  12. End LOAD_ATTR instruction.
  13. Begin CALL_FUNCTION instruction.
  14. Call call_function.
  15. Realise that we have a method object.
  16. Replace the stack entry underneath b with the bound argument from the method object.
  17. Call fast_function with the function from the method object.
  18. Call PyFrame_New to create a new stack frame.
  19. Call PyEval_EvalFrameEx to actually evaluate our __add__ code.
  20. Return 42 from call_function.
  21. Free the method object.
  22. End CALL_FUNCTION instruction.

On the other hand, for a + b, we have:

  1. Begin BINARY_ADD instruction.
  2. Call PyNumber_Add.
  3. Call binary_op1.
  4. Call slot_nb_add (via the tp_add slot in type(a)) [slot_nb_add is defined via a SLOT1BIN macro].
  5. Call call_maybe(a, &Id__add__, "(O)", b) [call_maybe is variadic after the string parameter].
  6. Call lookup_maybe [lookup_maybe is like PyObject_GetAttr, but only looks in the type, and doesn't invoke __getattribute__].
  7. Call _PyType_LookupId.
  8. Call _PyUnicode_FromId [this converts Id__add__ into a PyObject representing "__add__", but this conversion is cached, and therefore effectively free].
  9. Call _PyType_Lookup [as in step 5 above].
  10. Successfully find __add__ in the method cache [as in step 6 above].
  11. Return a function object from _PyType_LookupId.
  12. Call func_descr_get (via the tp_descr_get slot in the type of the function object) [as in step 8 above].
  13. Call PyMethod_New (to bind a to the first argument of the function) [as in step 9 above].
  14. Return a method object from lookup_maybe.
  15. Call Py_VaBuildValue, passing the string literal "(O)" and a reference to call_maybe's variadic arguments.
  16. Do a tonne of string literal parsing and variadic argument fetching and tuple construction, resulting in Py_VaBuildValue eventually returning a singleton tuple containing b.
  17. Call method_call (via the tp_call slot in the type of the method object).
  18. Construct a new two-element tuple, filling it with the bound argument from the method object, and the contents of the previously constructed singleton tuple. In other words, we now have the tuple (a, b).
  19. Call function_call (via the tp_call slot in the type of the function from the method object).
  20. Call PyEval_EvalCodeEx.
  21. Call _PyEval_EvalCodeWithName.
  22. Call PyFrame_New to create a new stack frame [as in step 18 above].
  23. Call PyEval_EvalFrameEx to actually evaluate our __add__ code [as in step 19 above].
  24. Return 42 from function_call.
  25. Free the two-element tuple.
  26. Return 42 from method_call.
  27. Free the singleton tuple.
  28. Free the method object.
  29. Return 42 from PyNumber_Add.
  30. End BINARY_ADD instruction.

One obvious difference is that a + b does far more manipulation of tuples and of variadic arguments. Given that call_maybe is always called with a format of "(O)", let's acknowledge this by changing its signature to be fixed-arg rather than vararg, and also construct an argument tuple via PyTuple_New / PyTuple_SET_ITEM rather than Py_VaBuildValue:

diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index 4b99287..d27cc07 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -1465,29 +1465,22 @@ call_method(PyObject *o, _Py_Identifier *nameid, char *format, ...)
 /* Clone of call_method() that returns NotImplemented when the lookup fails. */
 
 static PyObject *
-call_maybe(PyObject *o, _Py_Identifier *nameid, char *format, ...)
+call_maybe(PyObject *o, _Py_Identifier *nameid, PyObject* p)
 {
-    va_list va;
     PyObject *args, *func = 0, *retval;
-    va_start(va, format);
 
     func = lookup_maybe(o, nameid);
     if (func == NULL) {
-        va_end(va);
         if (!PyErr_Occurred())
             Py_RETURN_NOTIMPLEMENTED;
         return NULL;
     }
 
-    if (format && *format)
-        args = Py_VaBuildValue(format, va);
-    else
-        args = PyTuple_New(0);
-
-    va_end(va);
-
+    args = PyTuple_New(1);
     if (args == NULL)
         return NULL;
+    PyTuple_SET_ITEM(args, 0, p);
+    Py_XINCREF(p);
 
     assert(PyTuple_Check(args));
     retval = PyObject_Call(func, args, NULL);
@@ -5624,20 +5617,20 @@ FUNCNAME(PyObject *self, PyObject *other) \
         if (do_other && \
             PyType_IsSubtype(Py_TYPE(other), Py_TYPE(self)) && \
             method_is_overloaded(self, other, &rop_id)) { \
-            r = call_maybe(other, &rop_id, "(O)", self); \
+            r = call_maybe(other, &rop_id, self); \
             if (r != Py_NotImplemented) \
                 return r; \
             Py_DECREF(r); \
             do_other = 0; \
         } \
-        r = call_maybe(self, &op_id, "(O)", other); \
+        r = call_maybe(self, &op_id, other); \
         if (r != Py_NotImplemented || \
             Py_TYPE(other) == Py_TYPE(self)) \
             return r; \
         Py_DECREF(r); \
     } \
     if (do_other) { \
-        return call_maybe(other, &rop_id, "(O)", self); \
+        return call_maybe(other, &rop_id, self); \
     } \
     Py_RETURN_NOTIMPLEMENTED; \
 }

This gives a nice little speedup; we're down from 0.215 usec to 0.176 usec:

$ make python.exe
...
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
10000000 loops, best of 3: 0.176 usec per loop

We're still falling somewhat short of the 0.113 usec time set by a.__add__(b), so let's copy step 15 of a.__add__(b) and special-case method objects:

diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index 4b99287..2cd8e23 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -1465,36 +1465,43 @@ call_method(PyObject *o, _Py_Identifier *nameid, char *format, ...)
 /* Clone of call_method() that returns NotImplemented when the lookup fails. */
 
 static PyObject *
-call_maybe(PyObject *o, _Py_Identifier *nameid, char *format, ...)
+call_maybe(PyObject *o, _Py_Identifier *nameid, PyObject* p)
 {
-    va_list va;
-    PyObject *args, *func = 0, *retval;
-    va_start(va, format);
+    PyObject *args[2], *func = 0, *retval, *tuple;
+    int na = 1;
 
     func = lookup_maybe(o, nameid);
     if (func == NULL) {
-        va_end(va);
         if (!PyErr_Occurred())
             Py_RETURN_NOTIMPLEMENTED;
         return NULL;
     }
 
-    if (format && *format)
-        args = Py_VaBuildValue(format, va);
-    else
-        args = PyTuple_New(0);
-
-    va_end(va);
-
-    if (args == NULL)
-        return NULL;
+    args[1] = p;
+    if (PyMethod_Check(func) && PyMethod_GET_SELF(func) != NULL) {
+        PyObject *mself = PyMethod_GET_SELF(func);
+        PyObject *mfunc = PyMethod_GET_FUNCTION(func);
+        args[0] = mself;
+        na = 2;
+        Py_INCREF(mfunc);
+        Py_DECREF(func);
+        func = mfunc;
+    } else {
+        args[0] = NULL;
+    }
 
-    assert(PyTuple_Check(args));
-    retval = PyObject_Call(func, args, NULL);
+    tuple = PyTuple_New(na);
+    if (tuple == NULL) {
+        retval = NULL;
+    } else {
+        memcpy(((PyTupleObject *)tuple)->ob_item, args, sizeof(PyObject*) * na);
+        Py_XINCREF(args[0]);
+        Py_XINCREF(args[1]);
+        retval = PyObject_Call(func, tuple, NULL);
+        Py_DECREF(tuple);
+    }
 
-    Py_DECREF(args);
     Py_DECREF(func);
-
     return retval;
 }
 
@@ -5624,20 +5631,20 @@ FUNCNAME(PyObject *self, PyObject *other) \
         if (do_other && \
             PyType_IsSubtype(Py_TYPE(other), Py_TYPE(self)) && \
             method_is_overloaded(self, other, &rop_id)) { \
-            r = call_maybe(other, &rop_id, "(O)", self); \
+            r = call_maybe(other, &rop_id, self); \
             if (r != Py_NotImplemented) \
                 return r; \
             Py_DECREF(r); \
             do_other = 0; \
         } \
-        r = call_maybe(self, &op_id, "(O)", other); \
+        r = call_maybe(self, &op_id, other); \
         if (r != Py_NotImplemented || \
             Py_TYPE(other) == Py_TYPE(self)) \
             return r; \
         Py_DECREF(r); \
     } \
     if (do_other) { \
-        return call_maybe(other, &rop_id, "(O)", self); \
+        return call_maybe(other, &rop_id, self); \
     } \
     Py_RETURN_NOTIMPLEMENTED; \
 }

This gives another nice little speedup; we're down from 0.176 usec to 0.155 usec:

$ make python.exe
...
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
10000000 loops, best of 3: 0.155 usec per loop

Even better would be to also pull the fast_function trick that the interpreter does at step 17 in order to call a function without creating any argument tuples at all:

diff --git a/Include/ceval.h b/Include/ceval.h
index 6811367..f0997ac 100644
--- a/Include/ceval.h
+++ b/Include/ceval.h
@@ -10,6 +10,9 @@ extern "C" {
 PyAPI_FUNC(PyObject *) PyEval_CallObjectWithKeywords(
     PyObject *, PyObject *, PyObject *);
 
+PyAPI_FUNC(PyObject *)
+PyEval_FastFunction(PyObject *func, PyObject **stack, int n);
+
 /* Inline this */
 #define PyEval_CallObject(func,arg) \
     PyEval_CallObjectWithKeywords(func, arg, (PyObject *)NULL)
diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index 4b99287..6419ea2 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -1465,36 +1465,47 @@ call_method(PyObject *o, _Py_Identifier *nameid, char *format, ...)
 /* Clone of call_method() that returns NotImplemented when the lookup fails. */
 
 static PyObject *
-call_maybe(PyObject *o, _Py_Identifier *nameid, char *format, ...)
+call_maybe(PyObject *o, _Py_Identifier *nameid, PyObject* p)
 {
-    va_list va;
-    PyObject *args, *func = 0, *retval;
-    va_start(va, format);
+    PyObject *args[2], *func = 0, *retval;
+    int na = 1;
 
     func = lookup_maybe(o, nameid);
     if (func == NULL) {
-        va_end(va);
         if (!PyErr_Occurred())
             Py_RETURN_NOTIMPLEMENTED;
         return NULL;
     }
 
-    if (format && *format)
-        args = Py_VaBuildValue(format, va);
-    else
-        args = PyTuple_New(0);
-
-    va_end(va);
-
-    if (args == NULL)
-        return NULL;
+    args[1] = p;
+    if (PyMethod_Check(func) && PyMethod_GET_SELF(func) != NULL) {
+        PyObject *mself = PyMethod_GET_SELF(func);
+        PyObject *mfunc = PyMethod_GET_FUNCTION(func);
+        args[0] = mself;
+        na = 2;
+        Py_INCREF(mfunc);
+        Py_DECREF(func);
+        func = mfunc;
+    } else {
+        args[0] = NULL;
+    }
 
-    assert(PyTuple_Check(args));
-    retval = PyObject_Call(func, args, NULL);
+    if (PyFunction_Check(func)) {
+        retval = PyEval_FastFunction(func, &args[2], na);
+    } else {
+        PyObject* tuple = PyTuple_New(na);
+        if (tuple == NULL) {
+            retval = NULL;
+        } else {
+            memcpy(((PyTupleObject *)tuple)->ob_item, args, sizeof(PyObject*) * na);
+            Py_XINCREF(args[0]);
+            Py_XINCREF(args[1]);
+            retval = PyObject_Call(func, tuple, NULL);
+            Py_DECREF(tuple);
+        }
+    }
 
-    Py_DECREF(args);
     Py_DECREF(func);
-
     return retval;
 }
 
@@ -5624,20 +5635,20 @@ FUNCNAME(PyObject *self, PyObject *other) \
         if (do_other && \
             PyType_IsSubtype(Py_TYPE(other), Py_TYPE(self)) && \
             method_is_overloaded(self, other, &rop_id)) { \
-            r = call_maybe(other, &rop_id, "(O)", self); \
+            r = call_maybe(other, &rop_id, self); \
             if (r != Py_NotImplemented) \
                 return r; \
             Py_DECREF(r); \
             do_other = 0; \
         } \
-        r = call_maybe(self, &op_id, "(O)", other); \
+        r = call_maybe(self, &op_id, other); \
         if (r != Py_NotImplemented || \
             Py_TYPE(other) == Py_TYPE(self)) \
             return r; \
         Py_DECREF(r); \
     } \
     if (do_other) { \
-        return call_maybe(other, &rop_id, "(O)", self); \
+        return call_maybe(other, &rop_id, self); \
     } \
     Py_RETURN_NOTIMPLEMENTED; \
 }
diff --git a/Python/ceval.c b/Python/ceval.c
index 2f3d3ad..bf6aedc 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -4329,6 +4329,12 @@ call_function(PyObject ***pp_stack, int oparg
     return x;
 }
 
+PyAPI_FUNC(PyObject *)
+PyEval_FastFunction(PyObject *func, PyObject **stack, int n)
+{
+  return fast_function(func, &stack, n, n, 0);
+}
+
 /* The fast_function() function optimize calls for which no argument
    tuple is necessary; the objects are passed directly from the stack.
    For the simplest case -- a function that takes only positional

And with that, we're down from 0.155 usec to 0.113 usec:

$ make python.exe
...
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
10000000 loops, best of 3: 0.113 usec per loop

So, it seems that slots aren't intrinsically slow: provided that the implementation of slots in typeobject.c is taught the exact same tricks that the interpreter uses, they run at the exact same speed as non-slots. We could even go further and elide construction of the method object entirely:

diff --git a/Include/ceval.h b/Include/ceval.h
index 6811367..f0997ac 100644
--- a/Include/ceval.h
+++ b/Include/ceval.h
@@ -10,6 +10,9 @@ extern "C" {
 PyAPI_FUNC(PyObject *) PyEval_CallObjectWithKeywords(
     PyObject *, PyObject *, PyObject *);
 
+PyAPI_FUNC(PyObject *)
+PyEval_FastFunction(PyObject *func, PyObject **stack, int n);
+
 /* Inline this */
 #define PyEval_CallObject(func,arg) \
     PyEval_CallObjectWithKeywords(func, arg, (PyObject *)NULL)
diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index 4b99287..c4ffa70 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -1465,36 +1465,64 @@ call_method(PyObject *o, _Py_Identifier *nameid, char *format, ...)
 /* Clone of call_method() that returns NotImplemented when the lookup fails. */
 
 static PyObject *
-call_maybe(PyObject *o, _Py_Identifier *nameid, char *format, ...)
+call_maybe(PyObject *o, _Py_Identifier *nameid, PyObject* p)
 {
-    va_list va;
-    PyObject *args, *func = 0, *retval;
-    va_start(va, format);
+    PyObject *args[2], *func = 0, *retval;
+    int na = 2;
 
-    func = lookup_maybe(o, nameid);
+    args[1] = p;
+    func = _PyType_LookupId(Py_TYPE(o), nameid);
     if (func == NULL) {
-        va_end(va);
         if (!PyErr_Occurred())
             Py_RETURN_NOTIMPLEMENTED;
         return NULL;
     }
+    if (PyFunction_Check(func)) {
+        Py_INCREF(func);
+        args[0] = o;
+        retval = PyEval_FastFunction(func, &args[2], na);
+    } else {
+        descrgetfunc f = Py_TYPE(func)->tp_descr_get;
+        if (f == NULL) {
+            Py_INCREF(func);
+        } else {
+            func = f(func, o, (PyObject *)(Py_TYPE(o)));
+            if (func == NULL) {
+                if (!PyErr_Occurred())
+                    Py_RETURN_NOTIMPLEMENTED;
+                return NULL;
+            }
+        }
 
-    if (format && *format)
-        args = Py_VaBuildValue(format, va);
-    else
-        args = PyTuple_New(0);
-
-    va_end(va);
-
-    if (args == NULL)
-        return NULL;
-
-    assert(PyTuple_Check(args));
-    retval = PyObject_Call(func, args, NULL);
+        if (PyMethod_Check(func) && PyMethod_GET_SELF(func) != NULL) {
+            PyObject *mself = PyMethod_GET_SELF(func);
+            PyObject *mfunc = PyMethod_GET_FUNCTION(func);
+            args[0] = mself;
+            Py_INCREF(mfunc);
+            Py_DECREF(func);
+            func = mfunc;
+        } else {
+            args[0] = NULL;
+            na = 1;
+        }
+    
+        if (PyFunction_Check(func)) {
+            retval = PyEval_FastFunction(func, &args[2], na);
+        } else {
+            PyObject* tuple = PyTuple_New(na);
+            if (tuple == NULL) {
+                retval = NULL;
+            } else {
+                memcpy(((PyTupleObject *)tuple)->ob_item, args, sizeof(PyObject*) * na);
+                Py_XINCREF(args[0]);
+                Py_XINCREF(args[1]);
+                retval = PyObject_Call(func, tuple, NULL);
+                Py_DECREF(tuple);
+            }
+        }
+    }
 
-    Py_DECREF(args);
     Py_DECREF(func);
-
     return retval;
 }
 
@@ -5624,20 +5652,20 @@ FUNCNAME(PyObject *self, PyObject *other) \
         if (do_other && \
             PyType_IsSubtype(Py_TYPE(other), Py_TYPE(self)) && \
             method_is_overloaded(self, other, &rop_id)) { \
-            r = call_maybe(other, &rop_id, "(O)", self); \
+            r = call_maybe(other, &rop_id, self); \
             if (r != Py_NotImplemented) \
                 return r; \
             Py_DECREF(r); \
             do_other = 0; \
         } \
-        r = call_maybe(self, &op_id, "(O)", other); \
+        r = call_maybe(self, &op_id, other); \
         if (r != Py_NotImplemented || \
             Py_TYPE(other) == Py_TYPE(self)) \
             return r; \
         Py_DECREF(r); \
     } \
     if (do_other) { \
-        return call_maybe(other, &rop_id, "(O)", self); \
+        return call_maybe(other, &rop_id, self); \
     } \
     Py_RETURN_NOTIMPLEMENTED; \
 }
diff --git a/Python/ceval.c b/Python/ceval.c
index 2f3d3ad..bf6aedc 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -4329,6 +4329,12 @@ call_function(PyObject ***pp_stack, int oparg
     return x;
 }
 
+PyAPI_FUNC(PyObject *)
+PyEval_FastFunction(PyObject *func, PyObject **stack, int n)
+{
+  return fast_function(func, &stack, n, n, 0);
+}
+
 /* The fast_function() function optimize calls for which no argument
    tuple is necessary; the objects are passed directly from the stack.
    For the simplest case -- a function that takes only positional

With this extra optimisation, we're down from 0.113 usec to 0.0972 usec:

$ make python.exe
...
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
10000000 loops, best of 3: 0.0972 usec per loop

In conclusion, slots don't need to be slow - the above diff makes them fast (at least for some binary operators; applying similar transformations to other slots is left as an exercise to the reader).

GCC target confusion

Not blogging for two years has left me with quite a lot of things that I'd like to get off my chest. I don't plan on making a habit of negative-outlook posts, but blogging is partially for me to let off steam, and partially for writing things that other people might find interesting, and today's post probably leans toward the former.

I've tried submitting a two-character diff to clang, and instructions for reproducing a socket-timeout bug in OpenJDK, both to no avail. I've not personally submitted any bugs to gcc, but one of my friends reported a pretty nasty g++ code-generation bug three years ago, again to no avail. I'm not trying to argue that reporting bugs to open source projects is a waste of time, but I am trying to argue that public shaming is perhaps occasionally a viable alternative approach, and today's unlucky victim is gcc.

The first thing I'd like to point out is that asking gcc to create a precompiled header from stdin will result in a segfault:

peter@euphraios:~$ echo | gcc -x c-header - ; echo $?
<stdin>:1:0: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.8/README.Bugs> for instructions.
1

This is somewhat embarrassing, but maybe I'm the only person in the world who wants to create a precompiled header from stdin. Then again, on OSX, where gcc is an alias for clang, this works just fine:

[littlegate /Users/corsix]$ echo | gcc -x c-header - ; echo $?
0

The second thing I'd like to point out is the g++ code-generation bug linked above. Using gcc, the following small program exits with a return code of zero:

peter@euphraios:~$ cat test.cpp
struct Counter {
  Counter operator++ () { ++value; return *this; }
  int value;
};

int main (int argc, char *[]) {
  int a[] = {0}, b[] = {0};
  Counter c = {0};
  (void)((argc > 1) ? a : b)[++c, 0];  
  return c.value;
}

peter@euphraios:~$ g++ test.cpp ; ./a.out ; echo $?
0

This is clearly a code-generation bug; ++c should always happen, therefore ++value should always happen, therefore c.value should evaluate to one rather than zero. Again, on a platform where g++ is an alias for clang, the same program correctly exits with a return code of one:

[littlegate /Users/corsix]$ g++ test.cpp ; ./a.out ; echo $?
1

I'm really rather surprised that this bug has been sitting around, known to the g++ developers, for the last three years; I'd have thought that if you're a compiler developer, then code-generation bugs are both the most serious and the most interesting kind of bug to fix (at least the V8 developers seem to share this point of view).

The third and final thing I'd like to point out relates to the title of this blog post. Let us begin with a piece of code which tries to call __builtin_ia32_vzeroall. Obviously, as this function isn't defined, calling it results in a compile-time error:

peter@euphraios:~$ cat test.c
void f() { __builtin_ia32_vzeroall(); }
peter@euphraios:~$ g++ -c test.c ; echo $?
test.c: In function ‘void f()’:
test.c:1:36: error: ‘__builtin_ia32_vzeroall’ was not declared in this scope
 void f() { __builtin_ia32_vzeroall(); }
                                    ^
1

My previous statement was a slight lie. As you might have guessed, __builtin_ia32_vzeroall isn't a random name which I pulled out of thin air. In fact, if we pass -mavx, then it becomes a predeclared function, and we can call it without issue:

peter@euphraios:~$ cat test.c
void f() { __builtin_ia32_vzeroall(); }
peter@euphraios:~$ g++ -mavx -c test.c ; echo $?
0

Maybe we don't want to apply -mavx to our entire source file, and instead we just want to apply it to certain functions. One way of doing this is via __attribute__((target)). In the following example, we apply this attribute to f but not to h, and as such, the compiler is fine with the call within f, but raises an error about the call within h:

peter@euphraios:~$ cat test.c
void f() __attribute__((__target__("avx")));
void f() { __builtin_ia32_vzeroall(); }
void h() { __builtin_ia32_vzeroall(); }
peter@euphraios:~$ g++ -c test.c ; echo $?
test.c: In function ‘void h()’:
test.c:3:37: error: ‘__builtin_ia32_vzeroall’ needs isa option -m32
 void h() { __builtin_ia32_vzeroall(); }
                                     ^
1

I'm somewhat unsettled by function attributes affecting which built-in functions are in scope, but hey, let's run with it for now.

If __attribute__((target)) isn't your cup of tea, then #pragma GCC target can be used to the same effect. Again, in the following example, the call within g is perfectly fine, but the call within h is not:

peter@euphraios:~$ cat test.c
#pragma GCC push_options
#pragma GCC target("avx")
void g() { __builtin_ia32_vzeroall(); }
#pragma GCC pop_options
void h() { __builtin_ia32_vzeroall(); }
peter@euphraios:~$ g++ -c test.c ; echo $?
test.c: In function ‘void h()’:
test.c:5:37: error: ‘__builtin_ia32_vzeroall’ needs isa option -m32
 void h() { __builtin_ia32_vzeroall(); }
                                     ^
1

What perplexes me is that if you combine both of these approaches to locally enabling -mavx, then it seems to become enabled globally. In the following example, __attribute__((target)) is applied to f, #pragma GCC target is applied to g, and then by the time we reach h, gcc seems to have suffered target confusion and accepts __builtin_ia32_vzeroall even though it shouldn't be valid:

peter@euphraios:~$ cat test.c
void f() __attribute__((__target__("avx")));
void f() { __builtin_ia32_vzeroall(); }
#pragma GCC push_options
#pragma GCC target("avx")
void g() { __builtin_ia32_vzeroall(); }
#pragma GCC pop_options
void h() { __builtin_ia32_vzeroall(); }
peter@euphraios:~$ g++ -c test.c ; echo $?
0

Everything is broken

I'd like to tell you a story. Like all good stories, it starts with a stack trace obtained from a heap dump:

"jenkins.util.Timer [#6]" daemon prio=5 tid=34 RUNNABLE
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.read(SocketInputStream.java:129)
  at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293)
...
  at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
...
  at java.io.FilterInputStream.read(FilterInputStream.java:116)
  at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2676)
  at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
...
  at java.io.InputStreamReader.read(InputStreamReader.java:167)
  at java.io.Reader.read(Reader.java:123)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2001)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1980)
  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1957)
  at org.apache.commons.io.IOUtils.toString(IOUtils.java:819)
  at org.kohsuke.github.Requester.parse(Requester.java:384)
...
  at org.kohsuke.github.Requester$1.hasNext(Requester.java:295)
  at org.kohsuke.github.PagedIterator.fetch(PagedIterator.java:44)
  at org.kohsuke.github.PagedIterator.hasNext(PagedIterator.java:32)
  at org.kohsuke.github.PagedIterable.asList(PagedIterable.java:19)
  at org.kohsuke.github.GHRepository.getPullRequests(GHRepository.java:504)
  at org.jenkinsci.plugins.ghprb.GhprbRepository.check(GhprbRepository.java:82)
  at org.jenkinsci.plugins.ghprb.Ghprb.run(Ghprb.java:101)
  at org.jenkinsci.plugins.ghprb.GhprbTrigger.run(GhprbTrigger.java:149)
  at hudson.triggers.Trigger.checkTriggers(Trigger.java:265)
  at hudson.triggers.Trigger$Cron.doRun(Trigger.java:214)
  at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:51)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
...
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)

Obviously it is a Java stack trace, with java.net.SocketInputStream.socketRead0 at the top, then some I/O layers, then some GitHub layers, then some ghprb (GitHub pull-request builder) plugin layers, then hudson.triggers.Trigger$Cron.doRun, and finally some Java threading layers. What isn't shown is the C part of the stack above socketRead0, but all that's there is a call to recv, and after that it is Linux kernel stuff.

Let's start with Jenkins, the continuous integration server formerly known as Hudson. I don't have much love for it, but it is something that we use and live with. One of its responsibilities is to run jobs periodically, which it does by waking up once per minute, checking if there are any jobs which need doing, and if so, scheduling them. In other words, it reimplements cron. I'm telling you this because it is the background for the story: we observed that our Jenkins instance stopped scheduling periodic jobs, which is a pretty serious problem. The stack trace was taken from our Jenkins instance, and from it, we can deduce why Jenkins broke.

The hudson.triggers.Trigger$Cron.doRun frame on the stack means that we're looking at the code which runs every minute to check whether any jobs need doing. The frames immediately above it are GitHub-related, so clearly our Jenkins instance is talking to our GitHub Enterprise instance in order to establish whether any jobs need doing (because someone somewhere decided that a particular job should be triggered by events on GitHub, and decided to configure it via polling rather than webhooks). The frames above those are what you'd expect to see when Java code reads from a secure socket. Finally, in the piece of stack trace you can't see, the call to recv is blocking.

The brokenness starts with hudson.triggers.Trigger$Cron.doRun. If for some reason doRun doesn't terminate, then Jenkins' cron breaks. Given that doRun can end up calling almost anything, and does so in the same thread without any explicit timeout, this seems like a pretty big problem.

OK, so, why isn't doRun terminating? We can see that the ghprb plugin is trying to communicate with GitHub over a socket. Let's give the plugin the benefit of the doubt and assume that it configures its sockets with a read-timeout so that reading won't block forever. After working our way through various layers, we end up at socketRead0 - a function which accepts a socket and a timeout value and tries to read from the socket. At this point, we hit our next piece of brokenness: JDK-8049846, summarised as the timeout argument to socketRead0 is sometimes ignored, or alternatively summarised as Java's socket I/O is broken.

OK, so, why is socketRead0 blocking forever even though it was passed a timeout? As noted at JDK-8049846, the implementation of socketRead0 is basically:

if (poll(socket, POLLIN, timeout))
  recv(socket, buffer);

In other words, ask poll to wait (with a timeout) until the socket is readable, and then read from it. No need to set O_NONBLOCK on the socket, because poll is used to check that a read won't block. What could possibly go wrong? Well, let's check the man page for poll:

Bugs

See the discussion of spurious readiness notifications under the BUGS section of select(2).

Let's check the man page for select:

Bugs

...

Under Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.

In other words, Linux's poll is broken: it can report that a socket is ready for reading when, in fact, it isn't.
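
For what it's worth, the defensive pattern that the man page hints at looks something like the following sketch (illustrative only - the JDK code in question is C rather than Python, and a real implementation would also shrink the timeout on each retry):

import select
import socket

def read_with_timeout(sock, nbytes, timeout_ms):
    # Make the socket non-blocking, so a spurious POLLIN turns into an
    # exception rather than an indefinitely blocking recv().
    sock.setblocking(False)
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    while True:
        if not poller.poll(timeout_ms):
            raise socket.timeout("no data within timeout")
        try:
            return sock.recv(nbytes)
        except BlockingIOError:
            # poll() lied: the socket was not actually readable after all.
            continue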

OK, so, why did poll give us a spurious readiness notification on a socket? The socket in question was talking to our GitHub Enterprise instance, and, uh, GHE seems kind of flakey on the network side of things, so I wouldn't be surprised if it sent bad packets. In other words, GitHub Enterprise is broken (if anyone from GitHub is reading, please advise on how best to diagnose GitHub Enterprise being flaky). Also, while I'm wishing for things which won't happen: if any JDK maintainers are reading, please note that the current classification of JDK-8049846 as P4 (Minor loss of function, or other problem where easy workaround is present) is rather inaccurate: there is no workaround, and when it hits, the entire calling thread ceases to function. Of course, I'd be happier if you fixed the bug rather than fiddling around with priorities, but I'll take what I can get.

To conclude, everything is broken:

Jenkins is broken, because its cron thread runs arbitrary trigger code with no timeout.
Java's socket I/O is broken, because socketRead0 sometimes ignores its timeout (JDK-8049846).
Linux's poll is broken, because it can report a socket as readable when a subsequent read will block.
GitHub Enterprise is broken, or at least flaky enough to provoke all of the above.

When all of these things break at the same time, your continuous integration server stops continuously integrating.
