Thoughts on No Graphics API
Sebastian Aaltonen recently wrote an excellent piece titled No Graphics API, which you should read if you're interested in the mechanics of GPUs and APIs for talking to them. You should especially read it if you're an ambitious young engineer at Microsoft who would like to make your mark on the world by designing and championing DirectX 13. If you're instead a hardware engineer designing a new GPU (or something GPU-adjacent), you should again read it and ensure that your hardware design is amenable to what is described.
Is there anything further which can be stripped away?
The outlined API is already quite thin, but perhaps it could be thinner still.
Compute-only
If you only care about GPGPU and not at all about graphics, there's a subset of the outlined API which drops the graphics-specific bits (a minimal usage sketch follows the list). The surviving functions are:
- Memory: gpuMalloc, gpuFree, gpuHostToDevicePointer.
- Pipelines: gpuCreateComputePipeline, gpuFreePipeline.
- Queue: gpuCreateQueue, gpuStartCommandRecording, gpuSubmit.
- Semaphores: gpuCreateSemaphore, gpuWaitSemaphore, gpuDestroySemaphore.
- Commands: gpuMemCpy, gpuBarrier, gpuSignalAfter, gpuWaitBefore, gpuSetPipeline, gpuDispatch, gpuDispatchIndirect.
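For concreteness, here's a sketch of how that subset might be driven end to end. The blog post only names these functions, so every type and signature below is an assumption made for illustration, not the actual API:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed types and signatures; the real API (if it existed) would define these. */
typedef struct GpuQueue GpuQueue;
typedef struct GpuCommandBuffer GpuCommandBuffer;
typedef struct GpuPipeline GpuPipeline;
typedef struct GpuSemaphore GpuSemaphore;

void* gpuMalloc(size_t size, uint32_t flags);
void gpuFree(void* ptr);
GpuPipeline* gpuCreateComputePipeline(const void* code, size_t code_size);
void gpuFreePipeline(GpuPipeline* pipeline);
GpuQueue* gpuCreateQueue(void);
GpuCommandBuffer* gpuStartCommandRecording(GpuQueue* queue);
void gpuSubmit(GpuQueue* queue, GpuCommandBuffer* cmd);
GpuSemaphore* gpuCreateSemaphore(void);
void gpuWaitSemaphore(GpuSemaphore* sem, uint64_t value);
void gpuDestroySemaphore(GpuSemaphore* sem);
void gpuSetPipeline(GpuCommandBuffer* cmd, GpuPipeline* p, const void* args, size_t args_size);
void gpuDispatch(GpuCommandBuffer* cmd, uint32_t gx, uint32_t gy, uint32_t gz);
void gpuSignalAfter(GpuCommandBuffer* cmd, GpuSemaphore* sem, uint64_t value);

/* Allocate a buffer, run one compute dispatch over it, wait for completion. */
void run_compute_job(const void* shader_code, size_t shader_size) {
  void* buf = gpuMalloc(1 << 20, 0);
  GpuPipeline* pipe = gpuCreateComputePipeline(shader_code, shader_size);
  GpuQueue* queue = gpuCreateQueue();
  GpuSemaphore* done = gpuCreateSemaphore();

  GpuCommandBuffer* cmd = gpuStartCommandRecording(queue);
  gpuSetPipeline(cmd, pipe, &buf, sizeof(buf)); /* hand the buffer address to the shader */
  gpuDispatch(cmd, 256, 1, 1);                  /* 256 x 1 x 1 thread groups */
  gpuSignalAfter(cmd, done, 1);                 /* signal 1 once the dispatch has finished */
  gpuSubmit(queue, cmd);

  gpuWaitSemaphore(done, 1);                    /* CPU waits for the GPU to reach the signal */
  gpuDestroySemaphore(done);
  gpuFreePipeline(pipe);
  gpuFree(buf);
}
```

Nothing graphics-shaped is missed: no textures, no render targets, no swap chain, just memory, code, a queue, and a way to know when the work is done.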
If you do care about graphics then the rest should obviously be kept, but it is an interesting little thought experiment to consider a useful compute-only subset.
gpuHostToDevicePointer
If a GPU has a sufficiently good MMU, then in most cases, firmware on the GPU and drivers on the host can conspire to set up the GPU-side MMU mapping to make gpuHostToDevicePointer a no-op. If "most" could be extended to "all", then gpuHostToDevicePointer could be removed entirely, though perhaps this is a scenario in which covering 99% of cases is easy but the final 1% is very hard.
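In that happy case, the whole function collapses to a cast. A sketch, with the signature being my assumption of how the function would look:

```c
#include <stdint.h>

/* Assumed signature: translate a CPU pointer into gpuMalloc'd memory to the
   address the GPU should use for the same bytes. */
uint64_t gpuHostToDevicePointer(void* host_ptr) {
  /* If firmware and driver have arranged for the GPU's page tables to mirror
     the CPU's virtual address space for these allocations, the translation is
     the identity mapping and the call is a no-op beyond the cast. */
  return (uint64_t)(uintptr_t)host_ptr;
}
```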
gpuSubmit
If command buffers are always one-shot, then gpuSubmit looks potentially unnecessary: all commands enqueued to a command buffer will eventually execute, so they could become eligible for execution immediately upon being enqueued, with no gpuSubmit call required. CUDA works this way: kernels are enqueued with one call, and there's no separate enqueue and submit step (a sketch contrasting the two styles follows the list below). That said, there are a few possible arguments for the two-stage dance:
- Throughput can be improved (at a slight latency cost) by coalescing multiple commands together. For example, perhaps enqueueing can be done entirely in a userspace driver, whereas submission requires doing a system call.
- Though the API synopsis at the end of the blog post doesn't show semaphores interacting with gpuSubmit, an earlier example does. If gpuSubmit were removed, there would need to be a different way of doing the semaphore part of submit.
- If doing graphics (rather than just compute), there needs to be some way to tell the GPU that all commands for the current frame are complete, and that the associated render target should be displayed to the human.
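To make the contrast concrete, here's a sketch of the two styles side by side; as before, all types and signatures are assumptions for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed types and signatures, as in the earlier sketch. */
typedef struct GpuQueue GpuQueue;
typedef struct GpuCommandBuffer GpuCommandBuffer;
typedef struct GpuPipeline GpuPipeline;
typedef struct GpuSemaphore GpuSemaphore;

GpuCommandBuffer* gpuStartCommandRecording(GpuQueue* queue);
void gpuSetPipeline(GpuCommandBuffer* cmd, GpuPipeline* p, const void* args, size_t n);
void gpuDispatch(GpuCommandBuffer* cmd, uint32_t gx, uint32_t gy, uint32_t gz);
void gpuSignalAfter(GpuCommandBuffer* cmd, GpuSemaphore* sem, uint64_t value);
void gpuSubmit(GpuQueue* queue, GpuCommandBuffer* cmd);

/* Two-stage: recording is cheap userspace work; nothing is eligible to run
   until the single gpuSubmit call (which is also where semaphores can attach). */
void two_stage(GpuQueue* q, GpuPipeline* p, GpuSemaphore* done) {
  GpuCommandBuffer* cmd = gpuStartCommandRecording(q);
  gpuSetPipeline(cmd, p, NULL, 0);
  gpuDispatch(cmd, 64, 1, 1);
  gpuSignalAfter(cmd, done, 1);
  gpuSubmit(q, cmd); /* without this, none of the above is guaranteed to execute */
}

/* One-shot, CUDA-stream style: each command becomes eligible to run as soon as
   it is enqueued, so gpuSubmit disappears and the "semaphore part of submit"
   is just an ordinary trailing command. */
void one_shot(GpuQueue* q, GpuPipeline* p, GpuSemaphore* done) {
  GpuCommandBuffer* cmd = gpuStartCommandRecording(q); /* now merely names the stream */
  gpuSetPipeline(cmd, p, NULL, 0);
  gpuDispatch(cmd, 64, 1, 1);
  gpuSignalAfter(cmd, done, 1); /* replaces the semaphore-at-submit behaviour */
}
```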
uint32x3 for SV_ThreadID and SV_GroupID and SV_GroupThreadID
Do we really need these to be 3D, or does it suffice for them to be 1D? Software can always unravel a 1D index to 3D if it needs to, and the driver might be inserting such an unravel already if the hardware is really only 1D under the hood.
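The unravel is cheap: given a flat index and the grid's extents in x and y, the 3D coordinates fall out of two divisions and two remainders. A plain C sketch (the shader-side version would be the same arithmetic):

```c
#include <stdint.h>

typedef struct { uint32_t x, y, z; } uint32x3;

/* Recover a 3D coordinate from a flat index, given the grid's extent in the
   x and y dimensions (z needs no extent; it is whatever remains). */
static uint32x3 unravel_3d(uint32_t flat, uint32_t size_x, uint32_t size_y) {
  uint32x3 id;
  id.x = flat % size_x;
  id.y = (flat / size_x) % size_y;
  id.z = flat / (size_x * size_y);
  return id;
}
```

If the hardware really is 1D under the hood, a uint32x3 SV_ThreadID is essentially this function applied to a hardware counter by the driver-inserted prologue; making the ID 1D in the API would just push the divisions into the software which actually wants them.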
Is there anything else which needs adding?
Though I'm a fan of minimalism, it is possible to be too minimalistic.
Multi-device support
In a system with multiple GPUs, gpuMalloc needs to know which GPU to allocate on, so either an extra argument or a sideband function call to set the active GPU is required. The same is true for all of gpuTextureSizeAlign, gpuCreateSemaphore, gpuCreate*Pipeline, and gpuCreate*State (some of these cases could instead be lazy and defer the actual GPU-specific resource creation until the first time the resultant object is used with a GpuCommandBuffer, but laziness causes other problems).
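The two options might look like the following; both signatures are invented for illustration:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct GpuDevice GpuDevice;

/* Option 1: every creation-style call grows an explicit device argument. */
void* gpuMallocOnDevice(GpuDevice* device, size_t size, uint32_t flags); /* hypothetical */

/* Option 2: a sideband call binds a device to the current thread, and the
   existing single-GPU signatures (gpuMalloc, gpuCreateSemaphore, ...) are
   interpreted against whichever device is currently bound. CUDA's
   cudaSetDevice works in this style. */
void gpuSetActiveDevice(GpuDevice* device); /* hypothetical */
void* gpuMalloc(size_t size, uint32_t flags);
```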
A related can of worms is peer-to-peer support between multiple GPUs: can one GPU write to another's memory in the same way it can write to CPU MEMORY_READBACK memory? Is there some form of barrier or signal or semaphore allowing one GPU to wait for work on another to complete? Many further questions are possible.
Multi-process support
Some GPGPU workloads benefit from having a singular GPU memory allocation visible to multiple distinct CPU processes. They might also benefit from being able to create a pipeline and then share the GPU-side state associated with that pipeline between multiple distinct CPU processes, though this is more of a minor optimisation to avoid the same state being created multiple times.
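Neither capability appears in the outlined API; if it were added, it would probably take the shape of an export/import pair built on OS-shareable handles, roughly in the spirit of CUDA's IPC memory handles or Vulkan's external memory. A purely hypothetical sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extension, not part of the outlined API. */
typedef struct { uint8_t opaque[64]; } GpuMemoryHandle;

/* Process A: turn a gpuMalloc'd allocation into an OS-shareable handle. */
int gpuExportAllocation(void* gpu_memory, GpuMemoryHandle* out_handle);

/* Process B: map the same physical allocation into this process, receiving a
   pointer usable exactly as if it had come from a local gpuMalloc. */
void* gpuImportAllocation(const GpuMemoryHandle* handle);
void gpuCloseImportedAllocation(void* imported_memory);
```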
Memory pinning
It is often very convenient to be able to take an arbitrary memory allocation performed by the application (i.e. not through gpuMalloc), and make that memory visible to the GPU as if gpuMalloc had been used with the MEMORY_READBACK flag. In general, making this work requires at least one of:
- UMA.
- A sufficiently good MMU on the GPU.
- An IOMMU on the host.
- The OS to find a suitable range of contiguous physical memory and change which physical memory backs the allocation (only possible for pageable memory).
If relying on at least one of the above isn't viable, an alternative is adding a variant of gpuMemCpy which accepts CPU pointers. It is always possible to implement such an async memcpy API: the driver can do some combination of temporary pinning / splitting one non-contiguous copy into several contiguous ones / bounce buffers / DMA controller scatter-gather lists.
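A sketch of what such a variant might look like, with the signatures invented for illustration; the interesting part is the contract, not the implementation:

```c
#include <stddef.h>

typedef struct GpuCommandBuffer GpuCommandBuffer;

/* Hypothetical commands: copy between a CPU pointer (any allocation, pinned
   or not) and a GPU pointer, as part of a command buffer. The driver is free
   to temporarily pin pages, split the copy at physical-page discontinuities,
   stage through bounce buffers, or build a DMA scatter-gather list, so long
   as the bytes have arrived by the time later commands observe them. */
void gpuMemCpyFromHost(GpuCommandBuffer* cmd, void* gpu_dst, const void* cpu_src, size_t size);
void gpuMemCpyToHost(GpuCommandBuffer* cmd, void* cpu_dst, const void* gpu_src, size_t size);
```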
Instruction cache fences
Some hardware contains non-coherent instruction caches which need to be explicitly cleared after loading (or modifying) code and before executing said code. This is an obvious candidate for a gpuBarrier flag / mode. Alternatively, hardware which requires it could have the driver transparently perform the appropriate fence as part of every gpuCreate*Pipeline call, or transparently perform it just before the first gpuSubmit call after a gpuCreate*Pipeline call.
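As a flag, it might look like the following; the flag name and the shape of gpuBarrier are both assumptions:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct GpuCommandBuffer GpuCommandBuffer;
typedef struct GpuPipeline GpuPipeline;

/* Hypothetical barrier flag: make newly written shader code visible to the
   GPU's (non-coherent) instruction caches before any subsequent dispatch. */
#define GPU_BARRIER_INVALIDATE_ICACHE (1u << 7)

void gpuBarrier(GpuCommandBuffer* cmd, uint32_t flags);
void gpuSetPipeline(GpuCommandBuffer* cmd, GpuPipeline* pipeline, const void* args, size_t n);
void gpuDispatch(GpuCommandBuffer* cmd, uint32_t gx, uint32_t gy, uint32_t gz);

/* Usage: after (re)loading code, fence the instruction cache before running it. */
void run_fresh_pipeline(GpuCommandBuffer* cmd, GpuPipeline* freshly_created) {
  gpuBarrier(cmd, GPU_BARRIER_INVALIDATE_ICACHE);
  gpuSetPipeline(cmd, freshly_created, NULL, 0);
  gpuDispatch(cmd, 1, 1, 1);
}
```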
Does anything give me cause for concern?
Most of the outlined API has me thinking "yep, this all seems sensible", but a few areas cause me to think a little bit harder.
Write-combining memory
The approach to memory management relies on either UMA or PCIe ReBAR. ReBAR doesn't strictly require write-combining memory, but you really want something like WC memory to give CPU → GPU writes acceptable performance. This is fine on x86 / x86-64, but potentially an area of concern on any other CPU architecture which lacks the concept of WC. Even where it exists, write combining is not your friend: handing out pointers to write-combining memory to user code comes with lots of potential footguns. Some of these footguns can be mitigated with education and documentation, but not entirely.
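The classic footgun is that WC memory is uncached on the read side, so anything which reads back from it, even incidentally, is painfully slow. A sketch of the pattern to avoid and the pattern to prefer, with the upload pointer assumed to point into a ReBAR/WC mapping:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 'upload' is assumed to be a pointer into write-combining (ReBAR) memory. */

/* Footgun: the read-modify-write reads from WC memory. Each read bypasses the
   caches and goes across the bus, so this is extremely slow. */
void accumulate_bad(uint32_t* upload, const uint32_t* src, size_t n) {
  for (size_t i = 0; i < n; i++)
    upload[i] += src[i]; /* the read of upload[i] is the expensive part */
}

/* Preferred: do all reads and accumulation in ordinary cacheable memory, then
   write the result out once, sequentially, so the write-combining buffers can
   do their job. */
void accumulate_good(uint32_t* upload, uint32_t* scratch, const uint32_t* src, size_t n) {
  for (size_t i = 0; i < n; i++)
    scratch[i] += src[i];                        /* cacheable memory only */
  memcpy(upload, scratch, n * sizeof(uint32_t)); /* one forward sequential write pass */
}
```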
Deadlock avoidance
If one command buffer can do a gpuWaitBefore for a gpuSignalAfter from a different command buffer, then commands from the two buffers need to be run in the right order, lest the GPU commit to blocking on the wait command before running the commands which would unblock it. This might look like an easy problem to solve: if a GPU would be blocking on a wait, it should instead actively go looking for other work (from other submitted command buffers) to perform. Actual reality is slightly more annoying: perhaps there are a finite number of hardware command queues, so the GPU driver multiplexes multiple software command buffers onto the same hardware command queue, and if it does that multiplexing in the wrong order, the resultant queue ends up with the wait before the signal. There are many ways to make this problem go away; one such way is to put the onus on the developer, and require that it is valid (even if not optimal) for the GPU to execute submitted command buffers one after another, and commands within each of those buffers in the order they were enqueued, with no reordering anywhere. CUDA happens to design the API to ensure this: streams don't need any explicit submission (so it is valid for the GPU to run commands in the exact order they were enqueued), and cudaEventRecord must be enqueued before cudaStreamWaitEvent is enqueued, as that's just how events work.
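Under that rule, correctness reduces to a simple property of the enqueue/submit order: across all command buffers, every gpuSignalAfter must be submitted before the gpuWaitBefore which waits on it. A sketch, with signatures assumed as before:

```c
#include <stdint.h>

typedef struct GpuQueue GpuQueue;
typedef struct GpuCommandBuffer GpuCommandBuffer;
typedef struct GpuSemaphore GpuSemaphore;

GpuCommandBuffer* gpuStartCommandRecording(GpuQueue* queue);
void gpuSignalAfter(GpuCommandBuffer* cmd, GpuSemaphore* sem, uint64_t value);
void gpuWaitBefore(GpuCommandBuffer* cmd, GpuSemaphore* sem, uint64_t value);
void gpuSubmit(GpuQueue* queue, GpuCommandBuffer* cmd);

void cross_buffer_dependency(GpuQueue* q, GpuSemaphore* sem) {
  /* Producer buffer first: its signal is enqueued and submitted... */
  GpuCommandBuffer* producer = gpuStartCommandRecording(q);
  /* ... producer dispatches recorded here ... */
  gpuSignalAfter(producer, sem, 1);
  gpuSubmit(q, producer);

  /* ... and only then the consumer buffer. Even a GPU which runs submitted
     buffers strictly one after another, with no reordering, can satisfy the
     wait, because the signal is already ahead of it. */
  GpuCommandBuffer* consumer = gpuStartCommandRecording(q);
  gpuWaitBefore(consumer, sem, 1);
  /* ... consumer dispatches recorded here ... */
  gpuSubmit(q, consumer);

  /* Submitting in the opposite order would be the developer's bug under this
     rule: a strictly in-order GPU would block on the wait while the signal
     sat unexecuted behind it. */
}
```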
Leaving 32 bits behind
In practice, PCIe ReBAR means having a 64-bit operating system. I'm fine with excluding 32-bit operating systems, but perhaps not everybody is.
I have slightly more sympathy for 32-bit programs on 64-bit operating systems. To make them work, gpuHostToDevicePointer would need to return a 64-bit value rather than a pointer. Even then, structure definitions containing pointers could not be shared between CPU and GPU, and the amount of memory allocatable with gpuMalloc would be limited to a few gigabytes. It might be easier to just say that 32-bit programs are a legacy which we're prepared to leave behind.
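The structure-sharing problem is simply that a pointer is 4 bytes on one side and 8 on the other; making the device address an explicit 64-bit field restores a common layout, at the cost of going through a translation call on the CPU side. A sketch (gpuHostToDevicePointer returning uint64_t is my assumption of how that variant would look):

```c
#include <stdint.h>

/* Shared between CPU (possibly 32-bit) and GPU (64-bit) code. A raw pointer
   field here would be 4 bytes when the CPU side is compiled as 32-bit and
   8 bytes on the GPU, so the two sides would disagree about the layout. */
typedef struct {
  uint64_t input_addr;   /* GPU virtual address of the input buffer */
  uint64_t output_addr;  /* GPU virtual address of the output buffer */
  uint32_t element_count;
  uint32_t padding;      /* keep size and alignment identical everywhere */
} KernelArgs;

/* Assumed variant: returns a 64-bit device address rather than a pointer. */
uint64_t gpuHostToDevicePointer(void* host_ptr);

void fill_args(KernelArgs* args, void* input, void* output, uint32_t count) {
  args->input_addr = gpuHostToDevicePointer(input);
  args->output_addr = gpuHostToDevicePointer(output);
  args->element_count = count;
  args->padding = 0;
}
```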
Conclusion
You'll note that my collection of thoughts takes up far fewer words than the referenced piece. As I said in opening, it is an excellent piece: most of it doesn't require any further commentary.