Wednesday, May 18, 2011

Packed Stream Output

There is an inconsistency in D3D 10/11 hardware: the input assembler can fetch data in many different formats, but stream output can write only 32-bit values to its target buffers (there is a note about this in the Direct3D 10 programming guide, at the very end of the Getting Started with the Stream-Output Stage section). Yet full float precision is usually excessive for local models, and it would be preferable to output data in half precision.

SM 5.0 has two functions for converting float to half and back: f32tof16() and f16tof32(). DX11 hardware has dedicated silicon for these operations, so they are first-class API citizens. These functions are also available under the SM 4.0 profile - in that case they are emulated by a series of bit shifts, integer multiplications etc. I stumbled upon an implementation of these functions in the OpenGL RedBook: Floating-Point Formats Used in OpenGL (the algorithm is probably implemented according to the IEEE 754-2008 specification of the half precision floating-point format). Some time ago I wrote my own conversion functions that work through a lookup table, but with the advent of SM 5.0 they can be thrown away :)

I came up with the idea that with SM 5.0 we can pack two halves into one float, and stream out from the geometry shader (and fetch later with the input assembler) half as much data as normal. Besides, the important [maxvertexcount] declaration can be reduced: for example, if the GS previously output two vertices, now it will output only one. The main idea is: SO outputs two halves packed into a single float, and IA interprets the vertex buffer as R16G16B16A16_FLOAT, so we can easily read each packed vertex.

Here is the code that packs two three-component vectors into one four-component vector: pck. The fourth component is required because the R16G16B16A16_FLOAT format has four components (6-byte three-component formats were never supported by hardware), but we can ignore it both when packing and when reading from the vertex buffer.
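For illustration, a minimal sketch of what such a packing function could look like. This is not the original code; the name pack_two_vertices and the choice to write zero into the unused fourth half are assumptions. It relies only on f32tof16(), which returns the half bits in the low 16 bits of a uint, and asfloat(), which reinterprets the bits without converting them.

```hlsl
// Hypothetical helper (name and layout are assumptions, not the original code).
// Packs two float3 positions into one float4 whose 16 bytes read back as
// two consecutive R16G16B16A16_FLOAT elements.
float4 pack_two_vertices(float3 a, float3 b)
{
    // f32tof16 stores each converted half in the low 16 bits of a uint.
    uint3 ha = f32tof16(a);
    uint3 hb = f32tof16(b);

    uint4 bits;
    bits.x = ha.x | (ha.y << 16); // a.x in the low half, a.y in the high half
    bits.y = ha.z;                // a.z, unused fourth half left as zero
    bits.z = hb.x | (hb.y << 16); // b.x, b.y
    bits.w = hb.z;                // b.z, unused fourth half left as zero

    // The float4 is only a container for the bit pattern.
    return asfloat(bits);
}
```

The low 16 bits of each 32-bit value hold the first of the two halves because the buffer is read in little-endian order, so the R channel comes from the lower address.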

Packing is easy as long as we stream out an even number of vertices: 2, 4 and so on. But what if we need to output three vertices (say, a triangle)? We could pack the first two vertices into one float4, put the third vertex into the .xy components of a second float4 and leave .zw uninitialized. But on the subsequent fetch the IA will read the three valid half4 values and then treat the fourth, uninitialized one as the first vertex of the next primitive - an error! And we can't define a stride between primitives in the buffer - besides, no one wants to leave gaps in memory.

The solution is simple. Where before we packed, for instance, four vertices into two output vertices, now we have to pack three vertices into a single one:

struct gs_out
{
    vec4 pos1_pos2 : Data0; // first and second vertex, one half4 each
    vec2 pos3 : Data1;      // third vertex as one half4
};

[maxvertexcount(1)]
void gs_main(..., inout PointStream< gs_out > stream)
{
    ...
}

The IA will interpret the vertex buffer as a series of half4 values - that's all.
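To make the three-vertex case concrete, here is a hedged sketch of how the body of such a geometry shader might be filled in. The input struct gs_in, its Position semantic, the triangle input primitive and the helper pack_half3 are all assumptions made for illustration; float4 and float2 stand in for the vec4 and vec2 types used above.

```hlsl
// Hypothetical input vertex; the real layout depends on the vertex shader.
struct gs_in
{
    float3 pos : Position;
};

struct gs_out
{
    float4 pos1_pos2 : Data0; // 16 bytes: two half4 values
    float2 pos3      : Data1; //  8 bytes: one half4 value
};

// Hypothetical helper: one float3 becomes the bits of one half4 (w = 0).
uint2 pack_half3(float3 v)
{
    uint3 h = f32tof16(v);
    return uint2(h.x | (h.y << 16), h.z);
}

[maxvertexcount(1)]
void gs_main(triangle gs_in input[3], inout PointStream<gs_out> stream)
{
    gs_out o;
    // Two packed vertices fill the float4, the third fills the float2.
    o.pos1_pos2 = asfloat(uint4(pack_half3(input[0].pos),
                                pack_half3(input[1].pos)));
    o.pos3      = asfloat(pack_half3(input[2].pos));
    stream.Append(o);
}
```

Each output vertex is 24 bytes, which is exactly three 8-byte R16G16B16A16_FLOAT elements, so consecutive primitives sit back to back in the buffer with no padding between them.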
