16-bit – The Instruction Limit

While working on some tangent project you’ll know about pretty soon, I’ve been trying to pack color data that had little visual importance from 24-bit “Truecolor” R8G8B8 to 16-bit “Highcolor” R5G6B5. Intuitively the solution is to take the most significant bits of each component and fit it inside two 8-bit containers by using bitwise operations.

But the problem is, bitshifting and just any bitwise operator are not supported in shaders before SM4.0, and I am still lagging behind with my videocard and OS so I can’t run those yet. And anyway, I assume 95% of the world can’t either.
So the only way to make this is to resort to integer arithmetic (division, multiplication and modulus). And since it took me most of the day to have it working, I thought I’d share my little HLSL snippet with the world.

Update : Now with 232.3% less arithmetic instructions!
Update #2 : Added in netics’s optimization in the encoding, 3 less instructions!

float2 EncodeR5G6B5(float3 rgb24)
{
	// scale up to 8-bit
	rgb24 *= 255.0f;

	// remove the 3 LSB of red and blue, and the 2 LSB of green
	int3 rgb16 = rgb24 / int3(8, 4, 8);

	// split the green at bit 3 (we'll keep the 6 bits around the split)
	float greenSplit = rgb16.g / 8.0f;

	// pack it up (capital G's are MSB, the rest are LSB)
	float2 packed;
	packed.x = rgb16.r * 8 + floor(greenSplit);		// rrrrrGGG
	packed.y = frac(greenSplit) * 256 + rgb16.b;		// gggbbbbb

	// scale down and return
	packed /= 255.0f;
	return packed;
}

float3 DecodeR5G6B5(float2 packed) {
	// scale up to 8-bit
	packed *= 255.0f;

	// round and split the packed bits
	float2 split = round(packed) / 8;	// first component at bit 3
	split.y /= 4;				// second component at bit 5

	// unpack (obfuscated yet optimized crap follows)
	float3 rgb16 = 0.0f.rrr;
	rgb16.gb = frac(split) * 256;
	rgb16.rg += floor(split) * 4;
	rgb16.r *= 2;

	// scale down and return
	rgb16 /= 255.0f;
	return rgb16;
}

Update Notes : Now, the first version I had posted here was much more high-level, and used functions like rightShift(x, a) that emulated bitwise operators. The idea was good, and it allowed me to experiment until I got it working, but it was way too complicated and the HLSL compiler just couldn’t optimize it well enough. So I rewrote it.

The new version consumes 28 vs_3_0 instructions to encode, and 11 ps_3_0 instructions to decode including the texture sampling. The old one was respectively 69 and 24 instructions for the exact same result. It’s crazy how optimizable some tasks are.
The big changes were the caching of divisions in a variable and the use of floor() or frac() instead of integer arithmetic, packing of similarly used data in vectors to group operations, removal of all pow() function calls, and overall code tidying. It gives a pretty hard to understand decoding function, but >200% speed-up totally justifies it.

An additional thing that I found out while optimizing, it’s just impossible to remove most-significant-bits by left-shifting and right-shifting back into place with integer arithmetic. The reason is that there is no native integer math on GPUs before SM4.0 and even if you can push a number by 30-something bits, you can’t bring it back down because the inverse has too many decimals and the floats run out of them. So the natural way to work around that is right-shifting (divide by 2^x), then use of the frac() intrinsic, and left-shifting if necessary to bring it back up.

EncodeR5G6B5() and DecodeR5G6B5() take respectively one float3 and compresses it to float2, or the inverse. Most of the color information is kept because only 3 bits at most are stripped, and they’re in the 1-4 range.

The encoding logic is the following :

Take the float3 (24-bit) color and expand it to 256-base range.
Remove the least significant bits (3-2-3) of each components by using 2^x integer division.
Shift the 5-bit red component leftmost and place it in the first 8-bit field.
Split the 6-bit green component in the two fields; the three least significant bits (LSB) of the first field will have the component’s three most significant bits (MSB), and the three MSB of the second field will have the rest (the component’s three LSB). This might have sounded confusing, but basically we’re filling the holes in sequence.
Append the 5-bit blue component to the remaining space, no need to shift, just bitwise-OR it up.
Take back the range to 1-base by floating-point-dividing on 255.

The decoding logic is, as one would expect, the inverse.
One important mention though is the presence of the round intrinsic function. Without it, for reasons unknown to my sleep-deprived brain, I keep losing random bits. I assume that integer casting (explicit or implicit) in HLSL just drops all decimals, like a floor operation would, and to be consequent we need to round it off to the nearest integer.
And of course since we’re dealing with encoded data, any bit could make a dramatic change!

And as a closing note, it doesn’t work very well with FSAA or probably any sort of blending, because those change the intensities by arbitrary factors and will screw up the encoding. I’ve had problems with FSAA, haven’t tried blending yet but it would be expected behaviour.

Enjoy!

Tag: 16-bit

16-Bit Color Encoding on the GPU