One of the things which I've found myself doing many times over the past years has been x86 (32-bit) DLL injection (forcing a foreign process to load a DLL it otherwise wouldn't) and function hooking (rewriting machine code to intercept calls to functions and insert new code before or after or instead-of the call). For a new project I'm working on, I again find that I need to do DLL injection and function hooking, but for x64 (64-bit) as well as x86. This means that there are potentially four cases instead of one:

  1. x86 injector injecting an x86 DLL into a foreign x86 process
  2. x86 injector injecting an x64 DLL into a foreign x64 process
  3. x64 injector injecting an x86 DLL into a foreign x86 process
  4. x64 injector injecting an x64 DLL into a foreign x64 process

Obviously I have control over the injector and the DLL being injected, but I cannot know a-priori what architecture the foreign process will be. The simplest solution would be to implement (1) and (4), but this would mean distributing two different versions of the injector, and forcing the end-user to use the correct version depending on what they want to inject into. From the end-user's perspective, it would be best to do (1) and (2), or (3) and (4), or all of them. Before deciding on what to implement, we should step back and review how x64 relates to x86.

For those unaware, x86 is a family of instruction set architectures, originally introduced with the Intel 8086 processor in 1978. Though originally designed for 16-bit systems, it became 32-bit with the Intel 386 processor. As the 386 came out several years before I was born, x86 is and always has been a 32-bit instruction set as far as I'm concerned. Obviously 32 bits limits you to 4GB of address space (at least within any single user-mode process not using PAE), and as RAM gets larger and cheaper, that 4GB limit becomes a problem. Hence the predictable happened: in 2001, Intel released the 64-bit IA64 architecture. This was a completely new architecture, which you could argue is much cleaner and better designed than the old x86 architecture. As most software developers know, compatibility is king, and so the fact that an IA64 processor couldn't run x86 code was a problem. Hence in 2003, AMD released (the first implementation of) the x64 architecture - an incremental upgrade to x86 which added 64-bit support. Because x64 processors could run all the existing x86 code with no problems, computer manufacturers could sell computers with these new flashy x64 processors, and users could still use their old x86 programs and operating systems. Fast forward to today, and you'll find that x86 and x64 are the dominant architectures found in consumer desktop and laptop computers.

As an example of the similarity of x86 and x64 instructions, here is an x86 instruction:

0x8B, 0x52, 0x0C, // mov edx, dword ptr [edx+0Ch]

The corresponding x64 instruction adds an extra prefix byte to indicate that 64-bit registers are being used, and has a different displacement value (as the size of the underlying structure has doubled due to pointers being twice the size):

0x48, 0x8B, 0x52, 0x18, // mov rdx, qword ptr [rdx+18h]

Some instructions, like short relative jumps, are encoded in exactly the same way in x86 and in x64. More common though is to see instructions with an extra so-called REX prefix byte to indicate usage of 64-bit registers. The REX prefix bytes overwrite a range of pre-existing x86 instructions, so for this reason (and many others), the x64 processor has to be told whether to interpret the bytes it is seeing as x86 or as x64.
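
To make the REX prefix a little more concrete, here is a minimal sketch of decoding one. The names RexPrefix, is_rex_prefix and decode_rex are made up for this illustration; the bit layout (0100WRXB) is the one given in the processor manuals.

```cpp
#include <cassert>
#include <cstdint>

// Fields of an x64 REX prefix byte, which has the bit pattern 0100WRXB.
struct RexPrefix {
    bool w;  // 1 = 64-bit operand size
    bool r;  // extends the ModRM reg field
    bool x;  // extends the SIB index field
    bool b;  // extends the ModRM rm, SIB base, or opcode register field
};

// True if the byte lies in 0x40..0x4F. In 32-bit mode these same bytes
// are the one-byte INC/DEC register instructions.
inline bool is_rex_prefix(std::uint8_t byte) {
    return (byte & 0xF0) == 0x40;
}

// Splits a REX prefix byte into its four flag bits.
inline RexPrefix decode_rex(std::uint8_t byte) {
    assert(is_rex_prefix(byte));
    return RexPrefix{(byte & 0x08) != 0, (byte & 0x04) != 0,
                     (byte & 0x02) != 0, (byte & 0x01) != 0};
}
```

Fed the 0x48 from the x64 instruction above, decode_rex reports W set and the other three bits clear, which is the plain "use 64-bit operands" case.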

Due to their nature, operating systems need to be tailored to the architecture which they run on. Due to x64's backward compatibility, you can run x86 Windows on an x64 chip, and things will behave exactly the same as if you were running on an x86 chip. The more interesting situation is running x64 Windows on an x64 chip, but then running x86 applications within the x64 operating system. This brings us to WoW64 - the component of x64 Windows which allows x86 applications to run, as although the x64 chip can run the x86 code, there is more that has to be done to allow applications to run properly. WoW64 handles the transitioning between running x64 code and running x86 code, and presents a 32-bit view of the world to the x86 process. Due to how x64 is an extension of x86, transitioning from 32-bit code to 64-bit code isn't that conceptually difficult - the attributes of a code segment tell the processor whether to treat the code as x86 code or x64 code, the 32-bit registers are in fact 64-bit registers with the top half ignored, and the 4GB of addressable RAM in 32-bit mode is the same as the bottom 4GB in 64-bit mode. Hence to jump (or technically call) from x86 code to x64 code, all that you need to do is a far (inter-segment) call to an x64 segment, and then do a far return when you're done. The tricky part is finding an x64 code segment, as WoW64 makes everything look 32-bit, and messing with segment descriptors requires the use of undocumented Windows API calls (though this doesn't stop Google's Native Client, NaCl, from calling said APIs). Clearly the WoW64 DLLs must have some way of finding an x64 code segment in order to transition to 64-bit mode, so someone disassembled these DLLs, found how it was done, and called the mechanism "Heaven's Gate". Heaven's Gate is very simple: segment 33h. Do a far call to segment 33h, and suddenly you're executing x64 code within an x86 process. Do a far return, and you're back in the x86 code which you left.

With this portion of WoW64 dealt with, we can return to its other main purpose: making the 64-bit world look and behave like a 32-bit world. WoW64 does some clever things with the registry and the filesystem, but these are not relevant to this discussion (though developers might find it useful to read how to launch the x86 registry editor under Windows x64).
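
Going back to the far call to segment 33h for a moment, the following is a rough sketch of what that instruction looks like as bytes. It only builds the encoding and does not execute anything. The names kHeavensGateSelector, encode_far_call and kFarReturn are invented for this example, and the sketch assumes the 64-bit code being called sits in the low 4GB, as the far call only carries a 32-bit offset.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Segment selector described above for 64-bit code in a WoW64 process.
constexpr std::uint16_t kHeavensGateSelector = 0x33;

// Encodes `call far 0033:target` as seen by 32-bit code:
// opcode 0x9A, then a 32-bit offset, then a 16-bit selector (little-endian).
inline std::array<std::uint8_t, 7> encode_far_call(std::uint64_t target) {
    assert(target <= 0xFFFFFFFFull);  // the offset field is only 32 bits wide
    return {{0x9A,
             static_cast<std::uint8_t>(target),
             static_cast<std::uint8_t>(target >> 8),
             static_cast<std::uint8_t>(target >> 16),
             static_cast<std::uint8_t>(target >> 24),
             static_cast<std::uint8_t>(kHeavensGateSelector & 0xFF),
             static_cast<std::uint8_t>(kHeavensGateSelector >> 8)}};
}

// The 64-bit code at `target` gets back to its 32-bit caller with a far return.
constexpr std::uint8_t kFarReturn = 0xCB;  // retf
```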

An area which is relevant is what WoW64 does regarding DLLs, processes, and threads. For an x86 process interacting with itself and/or other x86 processes, things work as they would on Windows x86, and WoW64 is almost invisible. For an x64 process interacting with itself and/or other x64 processes, WoW64 is irrelevant. Things get interesting when x86 processes want to interact with x64 processes and vice versa. The CreateProcess family of functions can be used to launch new processes which are of a different architecture to the calling process, and a handle to the created process is returned to the caller, which the caller can use to interact with the created process. For DLL injection, a process handle is the first requirement, and the next is the ability to execute code within the foreign process. In the context of injecting a DLL, there are two paths to go down when it comes to executing code within a foreign process: find some existing code within the foreign process which loads a DLL, or put some new bootstrap code into the foreign process which goes on to load a DLL. The first option is generally considered easiest, as Kernel32.dll's LoadLibrary[A|W] is present in (almost) every process. The tricky part is figuring out where LoadLibrary is in the foreign process - most of the time (at least on x86) you can assume that it is at the same place in the foreign process as it is in the calling process. This assumption completely falls apart when the foreign process is of a different architecture, though it can also fall apart if the caller is being executed with compatibility mode enabled, due to the hooking and shims done by the compatibility mode. Hence to do robust injection, the second method is used: inject some bootstrap code into the foreign process which locates and then calls LoadLibrary.

This raises the interesting question of how to achieve GetProcAddress(GetModuleHandle("Kernel32"), "LoadLibraryW") without calling any Windows API functions (as to call them you need their address, and if you can get the address of any API function in a foreign process, then you wouldn't need to be doing this in the first place). As long as you're happy to rely on undocumented and architecture-specific things like the segment which the thread environment block lives in, the offset of the process environment block pointer within the thread environment block, the offset of the module list within the process environment block, and the format of a portable executable and its export table, then this is a solved problem. Whichever method you choose, you need to allocate some data in the foreign process (to hold the DLL name and, if there is any, the bootstrap code), and then call some code in the foreign process. Allocating data in a foreign process just involves calling VirtualAllocEx and [Read|Write]ProcessMemory. For an x86 process interacting with an x64 process, these functions work fine, though they can only see the low 4GB of the 64-bit address space (as for an x86 process, their arguments and return values are 32 bits wide). I haven't tested it, but I suspect that they'll also work fine for an x64 process interacting with an x86 process.
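
Of the pieces listed above, the export table lookup is the one which is easiest to show in isolation. What follows is a minimal sketch of it, assuming you already have the base address of a module laid out as it is in memory (finding that base via the thread and process environment blocks is not shown). The helper names read_u16, read_u32 and find_export_rva are invented for this example; the offsets are those of the documented PE format. There is no bounds checking and no handling of forwarded exports, so treat it as an illustration rather than as real bootstrap code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Little-endian reads from an image laid out as it is in memory, so that
// relative virtual addresses (RVAs) are plain offsets from `base`.
inline std::uint16_t read_u16(const std::uint8_t* base, std::size_t off) {
    return static_cast<std::uint16_t>(base[off] | (base[off + 1] << 8));
}
inline std::uint32_t read_u32(const std::uint8_t* base, std::size_t off) {
    return read_u16(base, off) |
           (static_cast<std::uint32_t>(read_u16(base, off + 2)) << 16);
}

// Returns the RVA of the named export, or 0 if it is not found.
inline std::uint32_t find_export_rva(const std::uint8_t* base, const char* name) {
    const std::uint32_t pe = read_u32(base, 0x3C);          // e_lfanew
    assert(std::memcmp(base + pe, "PE\0\0", 4) == 0);       // PE signature
    const std::size_t opt = pe + 4 + 20;                    // optional header
    const std::uint16_t magic = read_u16(base, opt);
    assert(magic == 0x10B || magic == 0x20B);               // PE32 or PE32+
    // The data directories start 96 bytes into a PE32 optional header and
    // 112 bytes into a PE32+ one; entry 0 is the export directory.
    const std::uint32_t exp = read_u32(base, opt + (magic == 0x20B ? 112 : 96));
    if (exp == 0) return 0;                                 // no exports
    const std::uint32_t count     = read_u32(base, exp + 0x18);  // NumberOfNames
    const std::uint32_t functions = read_u32(base, exp + 0x1C);  // AddressOfFunctions
    const std::uint32_t names     = read_u32(base, exp + 0x20);  // AddressOfNames
    const std::uint32_t ordinals  = read_u32(base, exp + 0x24);  // AddressOfNameOrdinals
    for (std::uint32_t i = 0; i < count; ++i) {
        const char* candidate =
            reinterpret_cast<const char*>(base + read_u32(base, names + 4 * i));
        if (std::strcmp(candidate, name) == 0) {
            // The ordinal table maps a name index to a function table index.
            const std::uint16_t index = read_u16(base, ordinals + 2 * i);
            return read_u32(base, functions + 4 * index);
        }
    }
    return 0;
}
```

Adding the returned RVA to the module base gives the address of the function. The same walk works for x86 and x64 modules, as only the position of the data directories differs between the two.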

Actually getting code to run in a foreign process of different architecture is difficult. For an x86 process interacting with an x64 process, CreateRemoteThread always fails, even if the given start address points to x64 code and the given parameter can be safely zero extended (by which I mean that the top bit isn't set - if it was then you'd have to choose between zero extension and sign extension). Getting around this is an interesting problem. One way would be to use the aforementioned Heaven's Gate to jump from x86 code to x64 code, then create the thread, then jump back to x86 code, but there is a major problem with this approach: an x86 process doesn't have a 64-bit version of kernel32.dll loaded, and hence doesn't have a 64-bit version of CreateRemoteThread available to be called - the only available 64-bit functions are the undocumented ones of ntdll.dll, and NtCreateThread requires that you set up a stack and processor context yourself, which is difficult, especially when the helper functions for setting these up (BaseCreateStack and BaseInitializeContext) live in kernel32, which we don't have. The solution I've settled on is to create a 3rd process, which I call the proxy process. The proxy process is an x64 process which is connected to the x86 process via stdin/stdout pipes, and whose job is to call CreateRemoteThread. The x86 process duplicates the handle of the foreign x64 process, passes it down the stdin pipe along with other parameters for CreateRemoteThread, and then waits for the proxy process to feed it back the result via the stdout pipe. It is an ugly solution, but it doesn't rely on any undocumented behaviours. For an x64 process interacting with an x86 process, things are often a little complex. For example, for an x64 process to get the context of an x86 thread, it needs to call Wow64GetThreadContext, which is only present on Windows Vista and later - for Windows XP, you have to implement Wow64GetThreadContext yourself, and it isn't trivial.
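
As for what travels down the pipe to the proxy process, here is one possible shape for the request, as a sketch only. The struct RemoteThreadRequest, its field names, and the encode/decode helpers are all hypothetical and are not the actual format used by the project. The idea is simply that every field is 64 bits wide, so that the x86 and x64 sides agree on the layout, with 32-bit values zero extended on the way in.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical request sent from the x86 injector to the x64 proxy.
struct RemoteThreadRequest {
    std::uint64_t process_handle;  // handle value duplicated for the proxy
    std::uint64_t start_address;   // address of code in the foreign process
    std::uint64_t parameter;       // argument passed to the new thread
};

constexpr std::size_t kRequestSize = 24;  // three 64-bit fields

// Zero extension: the top 32 bits are always clear, even if bit 31 is set.
inline std::uint64_t zero_extend(std::uint32_t v) { return v; }

inline void put_u64(std::uint8_t* out, std::uint64_t v) {
    for (int i = 0; i < 8; ++i) out[i] = static_cast<std::uint8_t>(v >> (8 * i));
}
inline std::uint64_t get_u64(const std::uint8_t* in) {
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v |= static_cast<std::uint64_t>(in[i]) << (8 * i);
    return v;
}

// Serialises the request as 24 little-endian bytes, ready to write to a pipe.
inline std::array<std::uint8_t, kRequestSize> encode_request(const RemoteThreadRequest& r) {
    std::array<std::uint8_t, kRequestSize> buf{};
    put_u64(buf.data(), r.process_handle);
    put_u64(buf.data() + 8, r.start_address);
    put_u64(buf.data() + 16, r.parameter);
    return buf;
}

// The inverse, as the proxy would run it on the bytes read from its stdin.
inline RemoteThreadRequest decode_request(const std::array<std::uint8_t, kRequestSize>& buf) {
    return RemoteThreadRequest{get_u64(buf.data()), get_u64(buf.data() + 8),
                               get_u64(buf.data() + 16)};
}
```

Writing the bytes out one at a time, rather than dumping the struct directly, avoids depending on either compiler's padding rules.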

By using a proxy process, an x86 injector can inject into both x86 processes and x64 processes. As an x86 injector can also run on an x64 operating system, the solution which makes the most sense to me is distributing the single x86 injector which is capable of injecting into either architecture (cases (1) and (2)).

The next problem is hooking functions. Doing this on x86 is relatively simple - disassemble the target function to work out the minimum number of instructions that are required until you've got at least 5 bytes worth, copy these over to a trampoline, and replace the first five bytes with a jump to the trampoline (the relative near jump instruction with 32-bit offset - opcode 0xE9 - takes 5 bytes and can jump to anywhere in the 32-bit address space). The only other major problem is knowing the calling convention of the function you're hooking. At least on Windows, you see three calling conventions commonly (or four if you're hooking C++, at which point __thiscall can occur): __cdecl, __stdcall, and __(ms)fastcall. As an aside, provided that the __stdcall, __cdecl, or __fastcall function you want to call takes two or fewer arguments, you can write a thunker which calls the function correctly regardless of calling convention:

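// T0 cannot be deduced, so name it at the call site: anycall<int>(f, a, b).
// Both arguments are assumed to be 4 bytes wide.
template <typename T0, typename T1, typename T2>
__declspec(naked) static T0 __cdecl
anycall(FARPROC f, T1 arg1, T2 arg2) { __asm {
    push ebp;            // save the caller's frame pointer
    mov eax, [esp+8];    // eax = f
    mov ecx, [esp+12];   // ecx = arg1 (first __fastcall register argument)
    mov edx, [esp+16];   // edx = arg2 (second __fastcall register argument)
    mov ebp, esp;        // remember esp, as the callee may or may not pop its arguments
    push edx;            // arg2 on the stack for __cdecl and __stdcall
    push ecx;            // arg1 on the stack for __cdecl and __stdcall
    call eax;
    mov esp, ebp;        // restore esp whether or not the callee cleaned up
    pop ebp;
    ret;                 // __cdecl: our own caller removes f, arg1 and arg2
}}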

In Windows x64, calling convention isn't an issue as everything is standardised to a single calling convention, which is well described on MSDN. For x64 hooks, the big problems are adjusting for the much more common instruction-pointer-relative instructions, and doing x64 jumps, as doing a jump to anywhere in the 64-bit address space without dirtying any registers can take 14 bytes (as opposed to the 5 it takes on x86). Another minor annoyance is that the Microsoft (Visual Studio) C++ compiler doesn't let you use inline assembly on x64, so you have to put any assembly code in an external file, add an ml64 /c step to your build, and add the resulting object file to the linker command line.
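
To put the 5 bytes versus 14 bytes comparison into something concrete, here is a sketch of building both jumps as byte sequences. The function names encode_jmp_rel32 and encode_jmp_abs64 are invented for this example, and the x64 form shown (an indirect jump through the 8 bytes which follow it) is one common way of reaching 14 bytes, not necessarily the only one.

```cpp
#include <array>
#include <cstdint>

// x86: E9 rel32. The displacement is relative to the end of the 5-byte
// instruction, and 32-bit wrap-around means any address can be reached.
inline std::array<std::uint8_t, 5> encode_jmp_rel32(std::uint32_t from, std::uint32_t to) {
    const std::uint32_t rel = to - (from + 5);
    return {{0xE9,
             static_cast<std::uint8_t>(rel),
             static_cast<std::uint8_t>(rel >> 8),
             static_cast<std::uint8_t>(rel >> 16),
             static_cast<std::uint8_t>(rel >> 24)}};
}

// x64: FF 25 00000000 is `jmp qword ptr [rip+0]`, which loads its target
// from the 8 bytes immediately after the instruction. That is 6 + 8 = 14
// bytes, and no general purpose register is modified.
inline std::array<std::uint8_t, 14> encode_jmp_abs64(std::uint64_t to) {
    std::array<std::uint8_t, 14> code = {{0xFF, 0x25, 0x00, 0x00, 0x00, 0x00}};
    for (int i = 0; i < 8; ++i)
        code[6 + i] = static_cast<std::uint8_t>(to >> (8 * i));
    return code;
}
```

The extra 9 bytes matter for hooking because more of the target function's prologue has to be disassembled and moved to the trampoline, which makes it more likely that an instruction-pointer-relative instruction gets caught up in the move.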

That concludes this post on DLL injection and WoW64 - I'll likely write more on the subject as I work out the intricacies of x64 function hooking, and as the project which this is for nears fruition. Bonus points if you can guess the project (which is related to CorsixTH, but obviously not purely related to CorsixTH, as the code is open source, so I wouldn't need a DLL injector). If this subject interests you, then I recommend reading the archives of the blogs which I've linked to in this post: