Lattice Gauge Theory and GPU Shader Pipelines: An Unexpected Analogy
Non-Abelian gauge symmetry and GPU rasterisation pipelines look nothing alike. But their computational structure is strikingly similar: both decompose a global problem into a lattice of local operations, each reading only a bounded neighbourhood, all executable in parallel with no synchronisation between them.
Lattice QCD in brief
Quantum chromodynamics describes the strong nuclear force through SU(3) gauge symmetry — a non-Abelian gauge group governing how quarks interact via gluons. Lattice QCD discretises spacetime onto a 4-dimensional Euclidean lattice, replacing continuous field variables with link variables: SU(3) matrices living on the edges between lattice sites.
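As a concrete picture, here is a minimal sketch (not taken from any particular production code) of how link variables are commonly stored for GPU work: one 3x3 complex matrix per edge, indexed by site and direction. The names Su3 and link_index and the 8^4 lattice extents are illustrative assumptions.

```cuda
#include <cuComplex.h>

#define ND 4                                      // spacetime dimensions
typedef struct { cuFloatComplex e[3][3]; } Su3;   // one SU(3) link variable

// Lattice extents (an assumption for this sketch).
#define LX 8
#define LY 8
#define LZ 8
#define LT 8
#define VOLUME (LX * LY * LZ * LT)

// Flatten (site, direction) to a linear index: ND links per site,
// one for each forward direction mu = 0..3.
__host__ __device__ int link_index(int x, int y, int z, int t, int mu) {
    int site = ((t * LZ + z) * LY + y) * LX + x;
    return site * ND + mu;
}
```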
The Wilson loop observable traces a closed path through the lattice by multiplying link matrices along its edges. Its expectation value under the path integral encodes confinement, the reason quarks are never observed in isolation. Computing it requires sampling gauge field configurations via Monte Carlo integration over the configuration space, a product of one copy of the SU(3) manifold per link.
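A hedged sketch of the smallest Wilson loop, the 1x1 plaquette: multiply the four link matrices around an elementary square and take the real trace. It reuses the Su3 type from the sketch above; su3_mul, su3_dagger, re_trace, and plaquette are illustrative helper names.

```cuda
__device__ Su3 su3_mul(const Su3 a, const Su3 b) {
    Su3 c;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            cuFloatComplex s = make_cuFloatComplex(0.f, 0.f);
            for (int k = 0; k < 3; ++k)
                s = cuCaddf(s, cuCmulf(a.e[i][k], b.e[k][j]));
            c.e[i][j] = s;
        }
    return c;
}

__device__ Su3 su3_dagger(const Su3 a) {          // Hermitian conjugate
    Su3 c;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            c.e[i][j] = cuConjf(a.e[j][i]);
    return c;
}

__device__ float re_trace(const Su3 a) {
    return cuCrealf(a.e[0][0]) + cuCrealf(a.e[1][1]) + cuCrealf(a.e[2][2]);
}

// Re Tr [ U_mu(x) U_nu(x+mu) U_mu(x+nu)^dag U_nu(x)^dag ]
__device__ float plaquette(Su3 u_mu_x, Su3 u_nu_xmu,
                           Su3 u_mu_xnu, Su3 u_nu_x) {
    Su3 p = su3_mul(su3_mul(u_mu_x, u_nu_xmu),
                    su3_mul(su3_dagger(u_mu_xnu), su3_dagger(u_nu_x)));
    return re_trace(p);
}
```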
The embarrassingly parallel structure
Monte Carlo importance sampling of gauge field configurations is embarrassingly parallel in a precise sense: the action contribution of each link depends only on the links in its immediate neighbourhood. Given a fixed background configuration, updating one link requires no global communication, only local reads of the neighbouring links forming its six staples.
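To make the locality concrete, here is a sketch of a kernel in which each thread evaluates the plaquette action contribution of one site, reading only links attached to that site and its forward neighbours. It reuses the helpers from the sketches above; shift and local_action are illustrative names.

```cuda
__device__ int shift(int c, int extent, int d) {  // periodic wrap
    return (c + d + extent) % extent;
}

__global__ void local_action(const Su3 *links, float *act) {
    int site = blockIdx.x * blockDim.x + threadIdx.x;
    if (site >= VOLUME) return;

    // Decode coordinates; the inverse of link_index's site ordering.
    int c[4]   = { site % LX, (site / LX) % LY,
                   (site / (LX * LY)) % LZ, site / (LX * LY * LZ) };
    int ext[4] = { LX, LY, LZ, LT };

    // Sum the six forward plaquettes attached to this site.
    float s = 0.f;
    for (int mu = 0; mu < ND; ++mu)
        for (int nu = mu + 1; nu < ND; ++nu) {
            int cm[4] = { c[0], c[1], c[2], c[3] };   // x + mu
            cm[mu] = shift(cm[mu], ext[mu], 1);
            int cn[4] = { c[0], c[1], c[2], c[3] };   // x + nu
            cn[nu] = shift(cn[nu], ext[nu], 1);
            s += plaquette(
                links[link_index(c[0],  c[1],  c[2],  c[3],  mu)],
                links[link_index(cm[0], cm[1], cm[2], cm[3], nu)],
                links[link_index(cn[0], cn[1], cn[2], cn[3], mu)],
                links[link_index(c[0],  c[1],  c[2],  c[3],  nu)]);
        }
    // Per-site sum of (1 - Re Tr U_p / 3); multiply by beta for the
    // Wilson action. Every read above is a nearest-neighbour read.
    act[site] = 6.0f - s / 3.0f;
}
```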
This is exactly the same locality structure as fragment shader execution in a GPU rasterisation pipeline. A fragment shader computing the colour of a pixel reads from a bounded neighbourhood of the input texture and produces a single output value. Adjacent fragments are independent. The hardware exploits this by executing thousands of shader invocations simultaneously across SIMD lanes, with no synchronisation between them.
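The same access pattern written as GPU compute rather than as a fragment shader (a sketch, not drawn from any shader above): one thread per pixel, a bounded 3x3 neighbourhood read, one independent write, and no inter-thread synchronisation.

```cuda
__global__ void box_blur_3x3(const float *in, float *out, int w, int h) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= w || py >= h) return;

    float sum = 0.f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int sx = min(max(px + dx, 0), w - 1);   // clamp at the border
            int sy = min(max(py + dy, 0), h - 1);
            sum += in[sy * w + sx];                 // bounded local reads
        }
    out[py * w + px] = sum / 9.f;                   // one independent write
}
```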
SPIR-V and path integrals
SPIR-V, the intermediate representation consumed by Vulkan, OpenCL, and modern OpenGL (Metal uses its own IR, though SPIR-V is routinely cross-compiled to it), defines a computation graph over typed SSA values. The path integral in Euclidean lattice QCD has a directly analogous structure: it is a sum over all possible configurations of the gauge field, weighted by exp(-S[U]) where S[U] is the Wilson action.
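For reference, the standard Euclidean forms being invoked here, with β the inverse coupling and U_p the ordered product of the four link matrices around an elementary plaquette p:

```latex
Z = \int \mathcal{D}U \; e^{-S[U]},
\qquad
S[U] = \beta \sum_{p} \left( 1 - \tfrac{1}{3}\,\operatorname{Re}\operatorname{Tr} U_p \right),
\qquad
\langle O \rangle = \frac{1}{Z} \int \mathcal{D}U \; O[U] \, e^{-S[U]}.
```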
Both formalisms express a global quantity (the integral, the rendered frame) as the aggregate of independent local computations (per-site Boltzmann weights, per-fragment shader outputs). The difference is domain: one lives on a 4D spacetime lattice, the other on a 2D screen-space grid.
Practical consequences
This structural analogy is not merely aesthetic. GPU hardware designed to execute fragment shaders efficiently has been repurposed for lattice QCD computations since the mid-2000s, long before general-purpose GPU programming was widespread. The SU(3) matrix multiplications required for gauge link updates map cleanly onto the SIMD floating-point units that fragment shaders use for matrix transforms.
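One way to see why the mapping is clean (a sketch; cmad is an illustrative name): the complex multiply-accumulate at the heart of the 3x3 SU(3) product expands into exactly the fused multiply-adds that GPU SIMD floating-point units are built around. cuFloatComplex stores the real and imaginary parts in .x and .y.

```cuda
#include <cuComplex.h>

// acc + a * b for complex floats, as four FMAs: the same primitive a
// fragment shader issues for a 4x4 matrix-vector transform.
__device__ cuFloatComplex cmad(cuFloatComplex a, cuFloatComplex b,
                               cuFloatComplex acc) {
    float re = fmaf(a.x, b.x, fmaf(-a.y, b.y, acc.x));
    float im = fmaf(a.x, b.y, fmaf( a.y, b.x, acc.y));
    return make_cuFloatComplex(re, im);
}
```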
Modern lattice QCD codes targeting CUDA and HIP, the GPU compute frameworks from NVIDIA and AMD respectively, achieve a significant fraction of peak hardware throughput by structuring their kernels around the access patterns GPU memory systems are tuned for, a legacy of fragment shading: coalesced reads from spatially adjacent sites and no inter-thread dependencies within a thread block.
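A sketch of the coalescing point, assuming a structure-of-arrays layout (Su3FieldSoA and load_link_soa are illustrative names): each of the nine matrix entries lives in its own contiguous array, so consecutive threads loading entry (i,j) of their own link read consecutive addresses, exactly the pattern a warp coalesces into one memory transaction.

```cuda
typedef struct {
    // elem[i][j] points to VOLUME * ND complex numbers: the (i,j)
    // entry of every link on the lattice, stored contiguously.
    cuFloatComplex *elem[3][3];
} Su3FieldSoA;

__device__ Su3 load_link_soa(const Su3FieldSoA f, int link) {
    Su3 u;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            u.e[i][j] = f.elem[i][j][link];   // coalesced across a warp
    return u;
}
```

An array-of-structures layout (Su3 links[...]) would instead stride each thread's reads by the full 72-byte matrix, wasting most of each transaction.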
Where the analogy breaks
Fragment shaders are stateless within a draw call. Lattice QCD gauge updates are not: the Metropolis acceptance criterion for a proposed configuration change depends on the total change in the action, a sum over every affected site, so accepting or rejecting is a single global decision. This global reduction step has no equivalent in rasterisation, and it is the main reason lattice QCD cannot be implemented as a pure rasterisation workload. Compute shaders and general-purpose GPU programming are necessary for the acceptance step.
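A sketch of the global step that rasterisation lacks, assuming per-site action changes have already been written into delta_s_site: reduce them to one number, then make a single host-side accept/reject decision. A production code would use a tree reduction (e.g. CUB) and a proper RNG; atomicAdd and rand() keep the sketch short.

```cuda
#include <cmath>
#include <cstdlib>

__global__ void sum_delta_s(const float *delta_s_site, int n, float *total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, delta_s_site[i]);   // global reduction
}

// Host side: one decision for the whole configuration.
bool metropolis_accept(const float *d_delta_s_site, int n) {
    float *d_total, h_total = 0.f;
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_total, &h_total, sizeof(float), cudaMemcpyHostToDevice);
    sum_delta_s<<<(n + 255) / 256, 256>>>(d_delta_s_site, n, d_total);
    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_total);
    // Accept with probability min(1, exp(-dS)).
    return h_total <= 0.f ||
           (float)rand() / (float)RAND_MAX < expf(-h_total);
}
```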