Everything is broken

I'd like to tell you a story. Like all good stories, it starts with a stack trace, obtained from a thread dump:

"jenkins.util.Timer [#6]" daemon prio=5 tid=34 RUNNABLE
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.read(SocketInputStream.java:129)
  at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293)
...
  at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
...
  at java.io.FilterInputStream.read(FilterInputStream.java:116)
  at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2676)
  at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
...
  at java.io.InputStreamReader.read(InputStreamReader.java:167)
  at java.io.Reader.read(Reader.java:123)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2001)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1980)
  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1957)
  at org.apache.commons.io.IOUtils.toString(IOUtils.java:819)
  at org.kohsuke.github.Requester.parse(Requester.java:384)
...
  at org.kohsuke.github.Requester$1.hasNext(Requester.java:295)
  at org.kohsuke.github.PagedIterator.fetch(PagedIterator.java:44)
  at org.kohsuke.github.PagedIterator.hasNext(PagedIterator.java:32)
  at org.kohsuke.github.PagedIterable.asList(PagedIterable.java:19)
  at org.kohsuke.github.GHRepository.getPullRequests(GHRepository.java:504)
  at org.jenkinsci.plugins.ghprb.GhprbRepository.check(GhprbRepository.java:82)
  at org.jenkinsci.plugins.ghprb.Ghprb.run(Ghprb.java:101)
  at org.jenkinsci.plugins.ghprb.GhprbTrigger.run(GhprbTrigger.java:149)
  at hudson.triggers.Trigger.checkTriggers(Trigger.java:265)
  at hudson.triggers.Trigger$Cron.doRun(Trigger.java:214)
  at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:51)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
...
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)

Obviously it is a Java stack trace, with java.net.SocketInputStream.socketRead0 at the top, then some I/O layers, then some GitHub layers, then some ghprb (GitHub pull-request builder) plugin layers, then hudson.triggers.Trigger$Cron.doRun, and finally some Java threading layers. What isn't shown is the C part of the stack above socketRead0, but all that's there is a call to recv, and after that it is Linux kernel stuff.

Let's start with Jenkins, the continuous integration server formerly known as Hudson. I don't have much love for it, but it is something that we use and live with. One of its responsibilities is to run jobs periodically, which it does by waking up once per minute, checking if there are any jobs which need doing, and if so, scheduling them. In other words, it reimplements cron. I'm telling you this because it is the background for the story: we observed that our Jenkins instance stopped scheduling periodic jobs, which is a pretty serious problem. The stack trace was taken from our Jenkins instance, and from it, we can deduce why Jenkins broke.

The hudson.triggers.Trigger$Cron.doRun frame on the stack means that we're looking at the code which runs every minute to check whether any jobs need doing. The frames immediately above it are GitHub-related, so clearly our Jenkins instance is talking to our GitHub Enterprise instance in order to establish whether any jobs need doing (because someone somewhere decided that a particular job should be triggered by events on GitHub, and configured it via polling rather than webhooks). The frames above those are what you'd expect to see when Java code reads from a secure socket. Finally, in the piece of stack trace you can't see, the call to recv is blocking.

The brokenness starts with hudson.triggers.Trigger$Cron.doRun. If for some reason doRun doesn't terminate, then Jenkins' cron breaks. Given that doRun can end up calling almost anything, and does so in the same thread without any explicit timeout, this seems like a pretty big problem.

OK, so, why isn't doRun terminating? We can see that the ghprb plugin is trying to communicate with GitHub over a socket. Let's give the plugin the benefit of the doubt and assume that it configures its sockets with a read timeout so that reading won't block forever. After working our way through various layers, we end up at socketRead0 - a function which accepts a socket and a timeout value and tries to read from the socket. At this point, we hit our next piece of brokenness: JDK-8049846, summarised as "the timeout argument to socketRead0 is sometimes ignored", or alternatively summarised as "Java's socket I/O is broken".

OK, so, why is socketRead0 blocking forever even though it was passed a timeout? As noted at JDK-8049846, the implementation of socketRead0 is basically:

if (poll(socket, POLLIN, timeout))
  recv(socket, buffer);

In other words, ask poll to wait (with a timeout) until the socket is readable, and then read from it. No need to set O_NONBLOCK on the socket, because poll is used to check that a read won't block. What could possibly go wrong? Well, let's check the man page for poll:

Bugs

See the discussion of spurious readiness notifications under the BUGS section of select(2).

Let's check the man page for select:

Bugs

...

Under Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.

In other words, Linux's poll is broken: it can report that a socket is ready for reading when, in fact, it isn't.
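
For contrast, here is a minimal C sketch (my illustration, not the JDK's actual code; read_with_timeout is a hypothetical helper) of the defensive pattern the man page is advocating. With O_NONBLOCK set, a spurious readiness notification costs a wasted read returning EAGAIN, rather than a thread stuck in recv forever:

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

ssize_t read_with_timeout(int fd, void* buf, size_t len, int timeout_ms)
{
  /* Make the descriptor non-blocking, so that read can never hang. */
  fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
  for(;;)
  {
    struct pollfd pfd = {fd, POLLIN, 0};
    int rc = poll(&pfd, 1, timeout_ms);
    if(rc <= 0)
      return rc; /* 0 on timeout, -1 on error */
    ssize_t n = read(fd, buf, len);
    if(n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
      return n; /* data, EOF, or a genuine error */
    /* Spurious readiness: poll lied to us, so wait again. A production
       version would also need to shrink timeout_ms on each retry. */
  }
}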

OK, so, why did poll give us a spurious readiness notification on a socket? The socket in question was talking to our GitHub Enterprise instance, and, uh, GHE seems kind of flaky on the network side of things, so I wouldn't be surprised if it sent bad packets. In other words, GitHub Enterprise is broken (if anyone from GitHub is reading, please advise on how best to diagnose GitHub Enterprise being flaky). Also, while I'm wishing for things which won't happen: if any JDK maintainers are reading, please note that the current classification of JDK-8049846 as P4 ("Minor loss of function, or other problem where easy workaround is present") is rather inaccurate: there is no workaround, and when it hits, the entire calling thread ceases to function. Of course, I'd be happier if you fixed the bug rather than fiddling around with priorities, but I'll take what I can get.

To conclude, everything is broken:

  * Jenkins is broken: Trigger$Cron.doRun can end up running arbitrary plugin code, in its own thread, with no timeout.
  * Java is broken: socketRead0 sometimes ignores the timeout it was given (JDK-8049846).
  * Linux is broken: poll can claim that a socket is ready for reading when it isn't.
  * GitHub Enterprise is broken: something on its network side is, seemingly, producing the bad packets which set all of the above in motion.

When all of these things break at the same time, your continuous integration server stops continuously integrating.

It's been a while

It has been twenty-five months, almost to the day, since I last wrote anything here. Regular readers, if there are any left, will notice a major visual overhaul and an absence of comments. The comments may well come back, but in the meantime, you can reach me via email at anything at corsix.org.

I've been quiet, but not dead, for the last two years. Mostly I've been busy earning a wage, but I've also published a few things around the web:

Calling external functions with x64 DynASM

One non-obvious corner of using DynASM to generate x64 machine code is how to make calls to existing C functions, such as printf. x64 makes this difficult because the address space is 64 bits wide, but most instructions can only encode 32-bit immediate values or 32-bit offsets.

One of the few instructions which accepts a 64-bit immediate is the one for moving a constant into a register - in the Intel and AMD manuals, this is the mov r64, imm64 instruction. DynASM can emit this instruction, but only when given the mnemonic mov64 rather than mov. This leads to the following pattern for calling functions:

| mov64 rax, (uint64_t)printf
| call rax

The above pattern works, but it costs you a register, which is less than ideal.

An alternative pattern is to use the same approach as shared libraries: import tables. The basic idea is that the import table contains a list of function pointers, and calls are done using the call [rip + imm32] instruction, where imm32 is the distance between the instruction pointer and the function pointer. DynASM will emit this instruction in x64 mode when given assembler of the form call qword [label], so the only tricky bit left is to stick the appropriate function pointer at the appropriate label. One pattern for doing this is the following:

|.section code, imports

|.macro call_extern, target
|  .imports
|  ->__imp__..target:
|  .dword  (uint32_t)target
|  .dword ((uint64_t)target >> 32)
|  .code
|  call qword [->__imp__..target]
|.endmacro

| call_extern printf

The .section code, imports directive tells DynASM that we want to append instructions to two buffers rather than one, and that the buffers should be called code and imports. For each buffer, it registers a directive of the same name (.code and .imports) which sets the current output buffer to that one.

The .macro call_extern, target directive registers a one-argument macro called call_extern. Internally, it uses token-pasting (the syntax for which is .., with the same effect as the C preprocessor's ##) to create a label, uses .dword directives to write the two halves of the function pointer, and uses .imports and .code to switch between output buffers.

Once the macro is registered, calls are done using call_extern.
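
To make usage concrete, here is a sketch of a complete call, assuming the Windows x64 calling convention used by the other examples on this blog (first two integer arguments in rcx and rdx, caller-allocated shadow space on the stack); the format string and value are just for illustration:

| sub rsp, 40                        // 32 bytes of shadow space, plus 8 to realign rsp
| mov64 rcx, (uint64_t)"x is %d\n"   // first argument: the format string
| mov edx, 42                        // second argument
| call_extern printf
| add rsp, 40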

Jumping with DynASM

Directly continuing from the first DynASM example, one obvious optimisation would be to write the remaining loop of run_job in assembly, thereby avoiding a function call on every iteration. This idea leads to the following version of transcode.dasc:

|.arch x64
|.actionlist transcode_actionlist
|.section code
|.globals GLOB_

static void emit_transcoder(Dst_DECL, transcode_job_t* job)
{
| jmp ->loop_test
|->loop_body:
| dec r8
  for(int f = 0; f < job->num_fields; ++f)
  {
    field_info_t* field = job->fields + f;
    switch(field->byte_width)
    {
    case 4:
|     mov eax, [rcx + field->input_offset]
      if(field->input_endianness != field->output_endianness) {
|       bswap eax
      }
|     mov [rdx + field->output_offset], eax
      break;
    case 8:
|     mov rax, [rcx + field->input_offset]
      if(field->input_endianness != field->output_endianness) {
|       bswap rax
      }
|     mov [rdx + field->output_offset], rax
      break;
    default:
      throw std::exception("TODO: Other byte widths");
    }
  }
| add rcx, job->input_record_size
| add rdx, job->output_record_size
|->loop_test:
| test r8, r8
| jnz ->loop_body
| ret
}

In order, the changes to note are:

  1. The addition of the following:
    |.globals GLOB_
    
  2. The addition of the following loop head:
    | jmp ->loop_test
    |->loop_body:
    | dec r8
  3. The addition of the following loop tail:
    | add rcx, job->input_record_size
    | add rdx, job->output_record_size
    |->loop_test:
    | test r8, r8
    | jnz ->loop_body

The interesting components of these changes are the jumps and the labels. Once you know that the -> prefix is DynASM's notation for so-called global labels, the syntax becomes the same as in any other assembler: labels are introduced by suffixing them with a colon, and are jumped to by being used as an operand to a jump instruction. As well as global labels, DynASM also supports so-called local labels. The defining difference between the two is that an assembly fragment containing a global label can only be emitted once, whereas local labels can be emitted an unlimited number of times. As a consequence, when jumping to a local label, you need to specify whether to jump backwards to the nearest previous emission of that label, or forwards to the next subsequent emission of that label. As global labels can only be emitted once, no such specification is needed for them.

Label type | Syntax  | Usage                                        | Available names          | Maximum emissions | Retrievable in C
Global     | ->name: | jmp ->name                                   | Any C identifier         | 1                 | Yes
Local      | name:   | jmp >name (forward) or jmp <name (backward)  | Integers between 1 and 9 | Unlimited         | No
PC         | =>expr: | jmp =>expr                                   | Any C expression         | N/A               | No
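
As an aside, here is a minimal sketch (not part of the transcoder) showing local labels in action; the digit is the label, and the < or > on the jump operand selects the direction:

| xor eax, eax    // rax = 0
|1:               // a local label; may be emitted any number of times
| test r8, r8
| jz >1           // forwards, to the *next* emission of 1:
| add rax, r8
| dec r8
| jmp <1          // backwards, to the *previous* emission of 1:
|1: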
With labels explained, the remaining curiosity is the .globals directive: its effect is to emit a C enumeration with the names of all global labels. For this example, it causes the following to be written in transcode.h:

//|.globals GLOB_
enum {
  GLOB_loop_test,
  GLOB_loop_body,
  GLOB__MAX
};

Now that we're using labels, we need to do slightly more initialisation work. In particular, between calling dasm_init and dasm_setup, we need to do the following:

void* global_labels[GLOB__MAX];
dasm_setupglobal(&state, global_labels, GLOB__MAX);

After calling dasm_encode, the absolute address of ->loop_test: will be stored in global_labels[GLOB_loop_test], and likewise the absolute address of ->loop_body: will be stored in global_labels[GLOB_loop_body].
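
Neither address is actually needed in this example, but they come in handy when debugging the generated code; a trivial (hypothetical) use:

printf("loop_body encoded at %p\n", global_labels[GLOB_loop_body]);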

For completeness, the final C code is as follows:

void (*make_transcoder(transcode_job_t* job))(const void*, void*, int)
{
  dasm_State* state;
  int status;
  void* code;
  size_t code_size;
  void* global_labels[GLOB__MAX];

  dasm_init(&state, DASM_MAXSECTION);
  dasm_setupglobal(&state, global_labels, GLOB__MAX);
  dasm_setup(&state, transcode_actionlist);

  emit_transcoder(&state, job);
  
  status = dasm_link(&state, &code_size);
  assert(status == DASM_S_OK);

  code = VirtualAlloc(nullptr, code_size, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
  status = dasm_encode(&state, code);
  assert(status == DASM_S_OK);

  dasm_free(&state);
  return (void(*)(const void*, void*, int))code;
}

void run_job(transcode_job_t* job)
{
  void (*transcode_n_records)(const void*, void*, int) = make_transcoder(job);
  transcode_n_records(job->input, job->output, job->num_input_records);
}

A first DynASM example

As an example of where you might want to use DynASM, let us consider the problem of transforming an array of binary structures into an array of slightly different binary structures. For the sake of concreteness, let us assume that such transformation jobs are described by the following C structures:

struct field_info_t
{
  int byte_width;

  int input_offset;
  int input_endianness;

  int output_offset;
  int output_endianness;
};

struct transcode_job_t
{
  const void* input;
  int input_record_size;
  int num_input_records;

  void* output;
  int output_record_size;

  int num_fields;
  field_info_t* fields;
};

Naïve code for performing these jobs might look something like the following:

void swap_endianness(char* first, char* last); // declared here; defined just below

void run_job(transcode_job_t* job)
{
  const char* input = (const char*)job->input;
  char* output = (char*)job->output;
  for(int r = 0; r < job->num_input_records; ++r)
  {
    for(int f = 0; f < job->num_fields; ++f)
    {
      field_info_t* field = job->fields + f;
      memcpy(output + field->output_offset, input + field->input_offset, field->byte_width);
      if(field->input_endianness != field->output_endianness)
        swap_endianness(output + field->output_offset, output + field->output_offset + field->byte_width - 1); 
    }

    input += job->input_record_size;
    output += job->output_record_size;
  }
}

void swap_endianness(char* first, char* last)
{
  for(; first < last; ++first, --last)
  {
    char tmp = *first;
    *first = *last;
    *last = tmp;
  }
}

If num_input_records is really large and the transcoding needs to be done as fast as mechanically possible, then one idea might be to unroll the inner loop of run_job at runtime using DynASM. The idea is that the resulting code will look something like the following:

void run_job(transcode_job_t* job)
{
  void (*transcode_one_record)(const char*, char*) = make_transcoder(job);

  const char* input = (const char*)job->input;
  char* output = (char*)job->output;
  for(int r = 0; r < job->num_input_records; ++r)
  {
    transcode_one_record(input, output);

    input += job->input_record_size;
    output += job->output_record_size;
  }
}

As the first step toward implementing make_transcoder, we need something to feed into DynASM. The following code is such an input, which we'll assume is in a file called transcode.dasc:

|.arch x64
|.actionlist transcode_actionlist
|.section code

static void emit_transcoder(Dst_DECL, transcode_job_t* job)
{
  for(int f = 0; f < job->num_fields; ++f)
  {
    field_info_t* field = job->fields + f;
    switch(field->byte_width)
    {
    case 4:
|     mov eax, [rcx + field->input_offset]
      if(field->input_endianness != field->output_endianness) {
|       bswap eax
      }
|     mov [rdx + field->output_offset], eax
      break;
    case 8:
|     mov rax, [rcx + field->input_offset]
      if(field->input_endianness != field->output_endianness) {
|       bswap rax
      }
|     mov [rdx + field->output_offset], rax
      break;
    default:
      throw std::exception("TODO: Other byte widths");
    }
  }
| ret
}

With this written, we can use DynASM to transform it into a file called transcode.h using the following command line:

luajit dynasm.lua --nolineno -o transcode.h transcode.dasc

The resulting file, transcode.h, should look something like the following:

//This file has been pre-processed with DynASM.

//|.arch x64
//|.actionlist transcode_actionlist
static const unsigned char transcode_actionlist[27] = {
  139,129,233,255,15,200,255,137,130,233,255,72,139,129,
  233,255,72,15,200,255,72,137,130,233,255,195,255
};

//|.section code
#define DASM_SECTION_CODE   0
#define DASM_MAXSECTION 	1

static void emit_transcoder(Dst_DECL, transcode_job_t* job)
{
  for(int f = 0; f < job->num_fields; ++f)
  {
    field_info_t* field = job->fields + f;
    switch(field->byte_width)
    {
    case 4:
//|     mov eax, [rcx + field->input_offset]
dasm_put(Dst, 0, field->input_offset);
      if(field->input_endianness != field->output_endianness) {
//|       bswap eax
dasm_put(Dst, 4);
      }
//|     mov [rdx + field->output_offset], eax
dasm_put(Dst, 7, field->output_offset);
      break;
    case 8:
//|     mov rax, [rcx + field->input_offset]
dasm_put(Dst, 11, field->input_offset);
      if(field->input_endianness != field->output_endianness) {
//|       bswap rax
dasm_put(Dst, 16);
      }
//|     mov [rdx + field->output_offset], rax
dasm_put(Dst, 20, field->output_offset);
      break;
    default:
      throw std::exception("TODO: Other byte widths");
    }
  }
//| ret
dasm_put(Dst, 25);
}

With this, we're now able to implement make_transcoder:

#define DASM_FDEF static
#include "dynasm/dasm_proto.h" // For declarations of the dasm_ functions
#include "dynasm/dasm_x86.h"   // For x64 implementations of the dasm_ functions
#include "transcode.h"         // For emit_transcoder
#include <assert.h>            // For assert
#include <Windows.h>           // For VirtualAlloc

void (*make_transcoder(transcode_job_t* job))(const char*, char*)
{
  dasm_State* state;
  int status;
  void* code;
  size_t code_size;

  dasm_init(&state, DASM_MAXSECTION);
  dasm_setup(&state, transcode_actionlist);

  emit_transcoder(&state, job);
  
  status = dasm_link(&state, &code_size);
  assert(status == DASM_S_OK);

  code = VirtualAlloc(NULL, code_size, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
  status = dasm_encode(&state, code);
  assert(status == DASM_S_OK);

  dasm_free(&state);
  return (void(*)(const char*, char*))code;
}
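
One design note on the VirtualAlloc call: requesting PAGE_EXECUTE_READWRITE keeps the example short, but pages which are simultaneously writable and executable are generally frowned upon. A more cautious variant (not from the original example, though it uses only stock Win32 calls) encodes into plain read-write memory and then flips the protection before the code is run:

code = VirtualAlloc(NULL, code_size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
status = dasm_encode(&state, code);
assert(status == DASM_S_OK);

DWORD old_protect;
VirtualProtect(code, code_size, PAGE_EXECUTE_READ, &old_protect);
FlushInstructionCache(GetCurrentProcess(), code, code_size);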