HLSL – Page 2 – The Instruction Limit

Effect Compiler & Disassembler

Updated! Now supports nVidia’s ShaderPerf tool.

Downloads

EffectCompiler.zip [51.3kb] – XNA Game Studio Express 2.0 (Visual C# 2005 Express), Source + Binaries

Description

Yesterday I took apart my Effect Compiling Tool which took a HLSL shader and converted it to Windows/Xbox360 bytecode, and made it into something more useful outside of XNA.

It’s always been somewhat of a hassle for me to compile and disassemble HLSL shaders. I can edit them pretty well in Visual Studio with code coloring and tabulations/undo’s/whatnot, but to compile them I always had to go with something else. I had read in the book Programming Vertex and Pixel Shaders by W.Engel how to compile them in VC++ 2005 using a Custom Build Step and fxc.exe, but when working in C# I had to have a parallel C++ project just for shaders, which is dumb. Also, fxc.exe has become less and less stable for some reason… So I finally made my own compiler and disassembler using XNA 2.0.

Continue reading Effect Compiler & Disassembler

16-Bit Color Encoding on the GPU

While working on some tangent project you’ll know about pretty soon, I’ve been trying to pack color data that had little visual importance from 24-bit “Truecolor” R8G8B8 to 16-bit “Highcolor” R5G6B5. Intuitively the solution is to take the most significant bits of each component and fit it inside two 8-bit containers by using bitwise operations.

But the problem is, bitshifting and just any bitwise operator are not supported in shaders before SM4.0, and I am still lagging behind with my videocard and OS so I can’t run those yet. And anyway, I assume 95% of the world can’t either.
So the only way to make this is to resort to integer arithmetic (division, multiplication and modulus). And since it took me most of the day to have it working, I thought I’d share my little HLSL snippet with the world.

Update : Now with 232.3% less arithmetic instructions!
Update #2 : Added in netics’s optimization in the encoding, 3 less instructions!

float2 EncodeR5G6B5(float3 rgb24)
{
	// scale up to 8-bit
	rgb24 *= 255.0f;

	// remove the 3 LSB of red and blue, and the 2 LSB of green
	int3 rgb16 = rgb24 / int3(8, 4, 8);

	// split the green at bit 3 (we'll keep the 6 bits around the split)
	float greenSplit = rgb16.g / 8.0f;

	// pack it up (capital G's are MSB, the rest are LSB)
	float2 packed;
	packed.x = rgb16.r * 8 + floor(greenSplit);		// rrrrrGGG
	packed.y = frac(greenSplit) * 256 + rgb16.b;		// gggbbbbb

	// scale down and return
	packed /= 255.0f;
	return packed;
}

float3 DecodeR5G6B5(float2 packed) {
	// scale up to 8-bit
	packed *= 255.0f;

	// round and split the packed bits
	float2 split = round(packed) / 8;	// first component at bit 3
	split.y /= 4;				// second component at bit 5

	// unpack (obfuscated yet optimized crap follows)
	float3 rgb16 = 0.0f.rrr;
	rgb16.gb = frac(split) * 256;
	rgb16.rg += floor(split) * 4;
	rgb16.r *= 2;

	// scale down and return
	rgb16 /= 255.0f;
	return rgb16;
}

Update Notes : Now, the first version I had posted here was much more high-level, and used functions like rightShift(x, a) that emulated bitwise operators. The idea was good, and it allowed me to experiment until I got it working, but it was way too complicated and the HLSL compiler just couldn’t optimize it well enough. So I rewrote it.

The new version consumes 28 vs_3_0 instructions to encode, and 11 ps_3_0 instructions to decode including the texture sampling. The old one was respectively 69 and 24 instructions for the exact same result. It’s crazy how optimizable some tasks are.
The big changes were the caching of divisions in a variable and the use of floor() or frac() instead of integer arithmetic, packing of similarly used data in vectors to group operations, removal of all pow() function calls, and overall code tidying. It gives a pretty hard to understand decoding function, but >200% speed-up totally justifies it.

An additional thing that I found out while optimizing, it’s just impossible to remove most-significant-bits by left-shifting and right-shifting back into place with integer arithmetic. The reason is that there is no native integer math on GPUs before SM4.0 and even if you can push a number by 30-something bits, you can’t bring it back down because the inverse has too many decimals and the floats run out of them. So the natural way to work around that is right-shifting (divide by 2^x), then use of the frac() intrinsic, and left-shifting if necessary to bring it back up.

EncodeR5G6B5() and DecodeR5G6B5() take respectively one float3 and compresses it to float2, or the inverse. Most of the color information is kept because only 3 bits at most are stripped, and they’re in the 1-4 range.

The encoding logic is the following :

Take the float3 (24-bit) color and expand it to 256-base range.
Remove the least significant bits (3-2-3) of each components by using 2^x integer division.
Shift the 5-bit red component leftmost and place it in the first 8-bit field.
Split the 6-bit green component in the two fields; the three least significant bits (LSB) of the first field will have the component’s three most significant bits (MSB), and the three MSB of the second field will have the rest (the component’s three LSB). This might have sounded confusing, but basically we’re filling the holes in sequence.
Append the 5-bit blue component to the remaining space, no need to shift, just bitwise-OR it up.
Take back the range to 1-base by floating-point-dividing on 255.

The decoding logic is, as one would expect, the inverse.
One important mention though is the presence of the round intrinsic function. Without it, for reasons unknown to my sleep-deprived brain, I keep losing random bits. I assume that integer casting (explicit or implicit) in HLSL just drops all decimals, like a floor operation would, and to be consequent we need to round it off to the nearest integer.
And of course since we’re dealing with encoded data, any bit could make a dramatic change!

And as a closing note, it doesn’t work very well with FSAA or probably any sort of blending, because those change the intensities by arbitrary factors and will screw up the encoding. I’ve had problems with FSAA, haven’t tried blending yet but it would be expected behaviour.

Enjoy!

XNA Effect Compiling Tool

Downloads

EffectCompilingTool.rar [41.5kb] – XNA GSE 1.0 Refresh Project w/ Binaries

Description

I love the content pipeline in XNA, but for effects, like the author of the XNA Effect Generator understood and mentioned already, it’s just not practical.

I liked the XNA Effect Generator too, but I wanted to implement my own effect framework so the generated classes were not what I was looking for either. Besides I had some problems with it (localization issues, dreaded french accents…). So I made a little tool that gets the Xbox 360 and Windows byte-code out of a shader in a convenient format and presentation.

Continue reading XNA Effect Compiling Tool

Static Ambient Occlusion

Traditional DirectX lighting models define ambient lighting as coming from all directions, and is added as a constant on all surfaces regardless of the geometry. Ambient occlusion acts as a factor to ambient lighting to take into account the cavities and concave areas in a model, or how much a surface is hidden from its environment.

Downloads

StaticAmbientOcclusion.rar [6.8 Mb] – VB 2005 (VS.NET 2005)

Continue reading Static Ambient Occlusion

Realtime Gradient Sky

Downloads

SkyGradient.rar [818kb] – VB 2005 (VS.NET 2005)

Description

I had a request from the same MMORPG developer which asked me for Non-Reflective Water to make a simpler version of my old “HLSL Sky Demo”, which I haven’t put on my blog yet because I’m not all that proud of the code.

Basically, I was asked to copy Worlds Of Warcraft’s skies; so make a tweakable gradient-based day sky solution, that renders fast and looks good, and most importantly that behaves well in huge worlds with big height variations.

Continue reading Realtime Gradient Sky

Bloom

Downloads

Bloom.rar [10mb] – C# 2.0 (VS.NET 2005)

Description

After seeing a couple of people using nVidia’s Bloom shader and be dissatisfied with its looks and perfomance, and more importantly because my employer asked me to do it, I made a Bloom shader from scratch.
It is highly customizable, supports FSAA and supports Pixel Shaders 2.0 up to 3.0.

Continue reading Bloom

Gaussian Blur Experiments

A follow-up to this article with clarifications and corrections to the “real-world considerations” can be found here.

I researched gaussian blur while trying to smooth my Variance Shadow Maps (for the Shadow Mapping sample) and made a pretty handy reference that some might like… I figure I’d post it for my first “Tips” blog post. :)

The full article contains a TV3D 6.5 sample with optimized Gaussian Blur and Downsampling shaders, and shows how to use them properly in TV3D. The article also contains an Excel reference sheet on how to calculate gaussian weights.

Update : I added a section about tap weight maximization (which gives an equal luminance to all blur modes) and optimal standard deviation calculation.

Continue reading Gaussian Blur Experiments

Shadow Mapping Experiments

I cleaned up this entry because it’s still getting traffic and didn’t really make sense as it was. I am not working on this anymore, haven’t been for months.

Description

My goal with this sample is to make a fast, simple, easy-to-use and good-looking (in other words, utopic :P) shadow mapping implementation in TV3D 6.5 with a directional light. It should work on landscape and meshes, perhaps on actors too but I don’t know how to rewrite an actor shader (animation blending = ouch).

I have already released a VB.Net implementation of Landscape Shadow Mapping on the TV3D forums. I might rewrite it in C# and my current programming standards and release it on my blog, but for now I’m not satisfied with the code enough to give it publicity…

Implementation Details

There are currently two modes :

The PCF (Percentage-Closer Filtering) implementation uses a R32F texture for the depth map, which is the fastest format for what’s needed. I hope this is renderable on ATi and ps_2_0 hardware, I’ll have to check on this.
I read on a couple of sites that on nVidia hardware, using D24X8/D24S8 textures gave free 4×4 PCF with bilinear filtering… but damnit, I can’t create a rendersurface with this mode! Sylvain said he’d look into this.
The VSM (Variance Shadow Maps) implementation uses a A16B16G16R16F (aka HDR_FLOAT_16) texture for the depth map, because it allows for filtering on my GeForce 7000 series hardware :)
I considered using G16R16F, which is faster and has a smaller memory footprint, but looks like TV3D won’t allow me to use it as a render target. G32R32F and A32B32G32R32F work and are more precise, but they don’t allow filtering; so it looks like unfiltered PCF, unless it’s post-bilinear-filtered (which I do support).

For the depth map rendering, I use two cameras : one for the actual viewport, and an orthogonal (aka isometric) one for the light. Its look-at vector is the same as light direction, and its zoom and position are calculated using bounding boxes vector projection. Then when rendering the depth map in the shader, I use the WORLDVIEWPROJECTION semantic, which gives me exactly what I want. I found that to be the cleanest way.

Comparison Screenshots

Now here is the promised ton of screenshots.

No filtering
No filtering (downsampled 4x)
3×3 PCF (Percentage-Closer Filtering)
3×3 PCF with Jitter Sampling
3×3 PCF with Bilinear Filtering
4×4 PCF
4×4 PCF with Jitter Sampling
4×4 PCF with Bilinear Filtering
4×4 PCF with Bilinear Filtering (Downsampled 4x)
7×7 Gaussian Blurred 16-bit VSM (Variance Shadow Maps)
5×5 Gaussian Blurred 16-bit VSM (Downsampled 4x)
5×5 Gaussian Blurred 32-bit VSM (Downsampled 4x)

Notice the framerate, it’s a fairly good indicator of the performance of each technique for a very small frame buffer resolution. See the table below for a “real-world” performance evaluation.

About Downsampling, it looks like it has a pretty hard effect on the framerate, but the quality is comparable to a 4x resolution shadowmap. And, something that can’t be seen in the above screenshots, the framerate is much more constant when the viewport is larger. I’ll make a proper benchmark to show how those techniques compare in the real world.

A word on VSM, the last shot. It’s very fast, it’s smooth, but… it looks odd.
Notice the gradient, as if light was bleeding from underneath the wall. And the separation between the armchair shadow and the wall shadow! Those are some of the artifacts I was talking about. The main problem is precision; I’m using a 16-bit floating-point format because it gives free hardware filtering (up to anisotropic), but it also causes precision issues.
32-bit versus 16-bit might look identical, but there are other artifacts in the scene like on straight walls or on the floor that are present in the 16-bit version but not in the 32-bit version. The shadow also looks somewhat sharper.

Now about jitter sampling. After seeing that, one would say “Jitter sampling is useless! Bilinear filtering looks tons better.”
But jitter sampling has an advantage that bilinear does not have : it can be scaled. Here are some more shots.

4×4 PCF with Scaled Jitter Sampling, 4x resolution shadowmap
4×4 PCF with Bilinear Filtering, 4x resolution shadowmap

Here, bilinear filtering just looks like antialiasing. It looks kinda good, but the shadows are very hard. Scaled jitter sampling on the other hand, has very soft shadows! And they’re relatively faster, for the same shadowmap size.

Mesh Self-Shadowmapping Performance

I’ve decided to make a 1024×768 fullscreen benchmark of all good-looking techniques. Here are the results :

Depth map size	Technique	Extensions	Framerate
512×512	PCF 4×4	Bilinear Filtering	111 fps
1024×1024	PCF 4×4	Bilinear Filtering	106 fps
1024×1024	PCF 4×4	Downsampled 4x, Bilinear Filtering	97 fps
2048×2048	PCF 4×4	Jittered	122 fps
512×512	VSM 5×5	16-bit, Hardware Filtered	210 fps
512×512	VSM 5×5	32-bit, Bilinear Filtered	117 fps
1024×1024	VSM 7×7	16-bit, Hardware Filtered	75 fps
1024×1024	VSM 5×5	Downsampled 4x, 16-bit, Hardware Filtered	169 fps
1024×1024	VSM 5×5	Downsampled 4x, 32-bit, Bilinear Filtered	95 fps

All in all, those are comparable techniques in terms of visual quality. What I take from these results :

PCF is viewport size-dependant, not depth map size-dependant; so getting better visual quality by pumping up the depth map size is possible. On the other hand, downsampling to gain performance is futile, and doesn’t help the visual quality much either.
Jittered sampling is very fast and works well with very high-resolution depth maps.
VSM is very much depth map size-dependant, which means that it benifits alot from downsampling.
16-bit VSM on nVidia hardware is VERY fast, but when you go earlier than the 6000 series or on ATi hardware, you have to bilinear-filter it yourself… and it’s slow. Also, 16-bit has a lot of precision issues, I have to work on those…
32-bit VSM is slow, but usable when the depth map is kept pretty small (512×512). It also looks very good.

Landscape Shadowmapping Notes

Some remarks though about the choices I made for filtering in a multi-pass context :

VSM is surprisingly slow in a multi-pass algorithm, so barely applicable to landscape shadowmapping. The precision artifacts just aren’t worth it.
Unfiltered modes look surprisingly good because the shadowmap itself is hardware-filtered so it doesn’t need to be constructed smooth. That renders useless all the work I’ve put in making bilinear filtering possible at every level! I find 512×512 unfiltered to look great for “semi-hard shadows”.
Jitter sampling looks great, has the advantage of being costless at high-resolution shadowmaps, and gets filtered — something that was impossible to do without multi-pass. It’s the clear winner here in the “soft shadows” category.

Bonus (jittered) teapot. :)

Soft Stencil Shadows

Downloads

SoftStencils.rar [776Kb] – C# 2.0 (VS.NET 2005)

Description

This demo demonstrates a technique that fellow forum user kTech proposed in the Beta Discussion forum to smooth otherwise hard stencil shadows. It uses screen-space blur of the shadow pass to smoothen things out.

Continue reading Soft Stencil Shadows

Double-Sided Bumped Refraction

Downloads

Both downloads are Visual Basic.NET 2005 projects.

Refraction.rar [408Kb] – Simpler first version
Refraction_v2.rar [665Kb] – Second version, now with backface rendering (double-sided refraction) and more test models

Description

A specular-capable, normal-mapped, double-sided refraction shader that can be applied to pretty much any TVMesh, originally asked by forum user WEst.

Continue reading Double-Sided Bumped Refraction