Samples – The Instruction Limit

Anaglyph Stereoscopic Rendering in First Person

I decided to finally finish up my analgyph stereoscopy sample, and cut the depth-of-field component that was adding too much complexity for my limited spare time right now.

Download

Binaries : stereoscopy_bin.zip (1.9 Mb – Binaries)
Code : stereoscopy_src.zip (524 Kb – Source Only, get DLLs for TV3D and SlimDX from the Binaries)

Description

This sample demonstrates a couple of things :

Optimized anaglyph filters for red/cyan stereoscopic rendering, to reduce eye-strain by minimizing retinal revialry but still keep some color information. As my source suggests, I’ve also implemented red channel gamma correction in the shader.
Auto-focus of both eyes on a focal plane. The camera is in the first person and the two “virtual cameras” act like human eyes, in the manner that they are connected to a single brain that wants to look at a single point. So the center of the screen is assumed to be that focal point, and both eyes will look at it. I used a depth rendering pass to achieve this, and a weighted sum of a certain portion of the screen near the center (this is all tweakable in realtime)

The distance between the eyes is also tweakable, if you want to give yourself a headache.

I wanted to do a stereoscopy sample to show how I did it in Super HYPERCUBE, but I can’t/don’t want to release its source, so I just re-did it properly. Auto-focus is a bonus feature that I wanted to play with; sHC didn’t need that since the focal plane was always the backing wall.

This sample, like all my recent ones, uses the latest version of my components framework. There are some differences between this release and the Stencil Rendering one, but it’s mostly the same. The biggest thing is that base components (Keyboard, Sound, etc.) are not auto-loaded anymore, and must be added in the Core.Initialize() implementation. This way, if I don’t need the sound engine, I just don’t load it… makes more sense.

Hope you like it!

Stencil Rendering

Here’s a little demo to show off a technique that Farbs posted about earlier this week.

Download

Binaries : stencilrendering_bin.zip (3.6 Mb – Binaries)
Code : stencilrendering_src.zip (1.8 Mb – Source Only, get DLLs for TV3D and SlimDX from the Binaries)

Description

Every frame, a random color from the target image is sampled. This color will be used as a stencil, such that every pixel whole target color is close enough to that stencil’s color will be painted. It’s a constructive painting process; every frame paints a single color, but if you wait long enough in a single spot you’ll end up with the target image.

Actually, the post processing draw code is so concise that I can post it here :

public override void PostDraw()
{
    targetBuffer.BltFromMainBuffer();
    targetBuffer.SetSystemMemCopy(true, true); // Lolworkaround

    // 1) Pick a colour from target image
    var pickedColor = Globals.DecodeRGBA(TextureFactory.GetPixel(targetBuffer.GetTexture(), RandomHelper.Random.Next(0, targetBuffer.GetWidth()), RandomHelper.Random.Next(0, targetBuffer.GetHeight())));
    stencilShader.SetEffectParamVector3("stencilColor", new TV_3DVECTOR(pickedColor.r, pickedColor.g, pickedColor.b));
    Screen2DImmediate.Draw_FullscreenQuadWithShader(stencilShader, 0, 0, 1, 1, mainBuffer.GetTexture(), targetBuffer.GetTexture());
    mainBuffer.BltFromMainBuffer();
}

“targetBuffer” and “mainBuffer” are just two TVRenderSurfaces as big as the viewport. Since I sample from targetBuffer, it needs to be flagged with “system memory copy”. I thought this would slow things down, but it runs at very interactive framerates (60 and more).

And uh… The “lolworkaround” is a bug in the current build of TV3D. Usually you only need to set this once at initialization time, but a BltFromMainBuffer does not flag the surface as dirty and it prevents updates to the pixels that I sample. Resetting the memory copy mode makes the changes effective. Sylvain tells me it’s fixed in the current development build. :)

A pixel shader does the rest :

float4 PS(PS_INPUT IN) : COLOR 
{
	float3 bufferSample = tex2D(MainBufferSampler, IN.TexCoord).rgb;
	float3 targetSample = tex2D(TargetSampler, IN.TexCoord).rgb;

	// 2) Calc difference between current screen and target (per channel subtraction, abs, then accumulate all three into one)
	float sampleDiff = distance(bufferSample, targetSample);

	// 3) As above, but between colour picked in step 1 and target image
	float stencilDiff = distance(stencilColor, targetSample);

	// 4) Where result of step 2 > result of step 3, draw colour picked in step 1
	float4 color;
	if (stencilDiff < sampleDiff)
		color = float4(stencilColor, 1);
	else
		color = float4(bufferSample, 1);

	return color;
}

I think it’s a lovely effect. It’s very dependent on how colorful and contrasted the scene is, and it works differently for sharply-defined shapes or gradients… And of course camera movement is a big factor. In the video it gets confusing, the effect is more “painterly” if you just rotate the camera in small circles and wait for the effect to accumulate.

Isotropic Specular Reflection Models Comparison

I’ve decided to repost all my remaining TV3D 6.5 samples to this blog (until I get bored). These are not new, but they were only downloadable from the TV3D forums until now!

This demo (originally released as VB.Net 2005 on Feburary 13th, 2007 here) is a visual and performance comparison (and reference implementation) of five different per-pixel lighting models for isotropic specular reflections :

Phong reflection model
Blinn-Phong (Blinn D1, Phong) specular distribution
Lyon halfway method 1 (for k=2 and D = H* – L)
Trowbridge-Reitz (Blinn D3) specular distribution
Torrance-Sparrow (Blinn D2, Gaussian) specular distribution

My main goals were to :

Make an optimized HLSL implemention of each model that fits in a single Shader Model 2.0 pass and supports 3 lights
Evaluate the performance of each model in a multiple light, per-pixel rendering context
Determine which model keeps the most numerical precision and does not produce artifacts when used with normal-mapping

Download

IsotropicModels.zip [2.7 Mb] – C#3 (VS.NET 2008, TV3D 6.5 Prerelease .NET DLL Required)

Screenshots

Details

The HLSL shader supplied with this sample was made to mimic the built-in TV3D offset-bumpmapping shader as closely as possible. As a result, almost all of its effect parameters are mapped to standard semantics. It supports :

One colored directional light
Two colored point lights in SM2.0, and four in more recent models (SM2a/b and SM3)
Support for all types of vertex fog
Parallax mapping of texture coordinates using a grayscale heightmap
Diffuse mapping with alpha support (a.k.a. texturing)
Normal mapping
Specular mapping (using the alpha channel of the normalmap)
Emissive mapping (colored!)
Usage of all material terms (diffuse, ambient, specular, emissive, power and opacity)

There is no support for point light attenuation as this would’ve gone over the 64 instructions limit of the ps_2_0 profile. (also TV3D doesn’t provide semantics for these)
There is no support for spot lights for the same reasons, but I believe spots will be processed as regular point lights, ignoring the specific parameters.

Techniques

With the realtime controls, you can choose from three different techniques : FiveLightsBranching, FiveLights and ThreeLights. On SM2.0 hardware, only the third option will be valid.

The FiveLightsBranching mode uses loops and “if” statements to produce dynamic branching on SM3.0 compatible hardware. This can (but may not) be benificial because only the calculations for enabled lights are performed.
The FiveLights and ThreeLights modes respectively do five and three lights (WHAT YOU SAY !!) but all in a static manner. It does not just unroll the loop! Most of the calculations are done with matrices, which makes it more efficient on most hardware.

To keep the shader “simple” (or to prevent from becoming even more complex…) I decided not to implement a multipass 5 lights technique for SM2.0… sorry!

Blinn vs. Phong

There’s two major categories in the models I tested : the ones that use the halfway vector, and the Phong model that works with the reflected vector. (see the Wiki entry on Blinn-Phong for details on these vectors).

A directional light reflecting on a surface with a power value of 64

According to a paper from Siggraph 2004 called Experimental Validation of Analytical BRDF Models, the halfway methods generate specular highlights with more realistic shapes than the Phong model. I realized that myself when working on an ocean rendering shader that had a Phong specular reflection, and it was impossible to get a long grazing highlight when the sun was setting.

Normal Mapping Artifacts

One of the things that made me do this whole analysis is that I was dissatisfied with the image quality of Phong and Blinn-Phong when used with a normal map, so per-pixel lighting. I had huge block artifacts on my water surface, and everything with a high enough specular power value and a bumpy surface. So I found about a reformulation of the Blinn-Phong model by Richard F. Lyon, written in 1993 (!) for Apple. (trivia : Mr. Lyon also invented the optical mouse… how awesome is that!)

This reformulation is interesting because it does not use the specular power literally as an exponentiation, it uses a distance metric and a much lower power value to produce very similar results to the Blinn-Phong model. Using a high specular power (32 or more) hurts floating-point accuracy, even in full-precision mode.

That said, I have seen hardware that do not have this problem. I am starting to think that it may be a driver issue, or something about mobile GPUs… In any case, the safe thing to do is choose the model that never produces artifacts, right?

The Other Blinn Distributions

The two other models I implemented (Trowbridge-Reitz and Torrance-Sparrow) were “ported” from the MATLAB code in Lyon’s reformulation paper. I wanted to test them out to see if they had the same artifact problems, and how different they looked from the classic models.

Trowbridge-Reitz is an interesting model because of how it looks. It’s slower than Blinn-Phong, but it has a distinct smoothness to it. The falloff of its specular highlights is softer than the other models… I’m not sure if it’s more accurate, but it looks pretty. Sadly, it has the same problems with normal mapping.

Torrance-Sparrow is a visual identity to the Blinn-Phong model. It’s the same thing, but slower and more instruction-heavy. It does not even fit in SM2.0 with 3 lights,… So I suggest you disregard it for realtime graphics.

Performance

I found that performance varies a lot depending on which technique you use, which shader model you support, how much your GPU is fillrate-limited instead of arithmetic-limited… So I’ll just say this : the Lyon model looks great, and it’s simple and fast enough to be worth considering. If you don’t experience the artifacts I describe, then the Blinn-Phong model is your best shot, but test Trowbridge-Reitz to see if it’s fast on your hardware.

It’s also worth mentioning that many things could be optimized by factorizing equations into small 1D or 2D textures (or perhaps a normalization cubemap), if your GPU loves pixels and hates instructions. But I don’t believe that the shader can be optimized that much by reorganizing code or removing useless statements. At least not without hurting visual quality.

Component System Updates

This sample contains a major breaking change to my component framework : The Service baseclass is gone. This makes Components able to “be” services (and publish many service interfaces), and allows this sample to have a much simpler class structure… no more state classes! The components just publish whatever data they want via their service interfaces. And with the new Eventful<T> class, it’s really easy to propagate changes from a controller to a view.

Post Processing/Fullscreen Texture Sampling

I’ve decided to repost all my remaining TV3D 6.5 samples to this blog (until I get bored). These are not new, but they were only downloadable from the TV3D forums until now!

This demo (originally released July 1st 2008 here) is a mixed bag of many techniques I wanted to demonstrate at once. It contains (all of which are described below) :

Pixel-perfect sampling of mainbuffer rendersurfaces for post-processing
A component/service system very similar to XNA’s and with support for IoC service injection
A minimally invasive, memory-conserving workflow for post-processing shaders
Many realtime surface downsampling methods (blit, blit 2x, tent, sinc+kaiser and bicubic)
An optimized Gaussian blur shader with up to 17 effective taps in a single pass

Download

FullscreenShaders_(final).zip [8.6 Mb] – C#3 (VS.NET 2008, TV3D 6.5 Prerelease .NET DLL Required)

Screenshots

Dissection

Pixel-Perfect Sampling

There is a well-known oddity in all DirectX versions (I think I’ve read somewhere that DX11 fixes it… amazing!) that when drawing a fullscreen textured quad, the texture coordinates need to be shifted by a half-texel. So that’s why if you use hardware filtering (which is typically enabled by default), all your fullscreen quads are slightly blurred.
But the half texel offset is related to the texel size of the texture you are sampling, as well as the viewport to which you are rendering! And it gets seriously wierd when you’re downsampling (sampling a texture at a lower frequency than its resolution), or when you’re sampling different-sized textures in a single render pass…

So this demo adresses the simple cases. I very recently found a more proper way to address this problem in a 2003 post from Simon Brown. I tested it and it works, but I don’t feel like updating the demo. ^_^

Component System

The component system was re-used in Trouble In Euclidea and Super HYPERCUBE, and I’m currently using it to prototype a culling system that uses hardware occlusion queries efficiently. The version bundled with this demo is slightly out of date, but very functional.
I blogged about the service injection idea a long time ago, if you want to read up on it.

Post-processing that just sits on top

This is how I consider post-processing should be done : using the main buffer directly without writing to a rendersurface to begin with. This way, the natural rendering flow is not disturbed, and post-processing effects are just plugged after the rendering is done.

If you can work directly with the main buffer (no downscaling beforehand), you can grab the mainbuffer onto a temporary rendersurface using BltFromMainBuffer() after all draw calls are performed, and call Draw_FullscreenQuadWithShader() using the temporary rendersurface as a texture. The post-processed result is output right on the main-buffer.
Any number of effects can be chained this way… blit, draw, blit, draw. Since the RS usage only lasts until the fullscreen draw call, you can re-use the same RS over and over again.

If you want to scale the main buffer down before applying your effect (e.g. for performance reasons, or to widen the effect of a blur), then you’ll need to work “one frame late”. I described how this works in this TV3D forum post.

Downsampling Techniques

The techniques I implemented for downsampling are the following :

Blit : Simply uses BltFromRenderSurface onto a half- or quarter-sized rendersurface. A single-shot 4x downscale causes sampling issues because it ignores half of the source texels… but it’s fast!
Double blit : 4x downscaling via two successive blits, each recieving surface being half as big as the source. It has less artifacts and is still reasonably fast.
Bicubic : A more detail-conserving two-pass filter for downsampling that uses 4 linear taps for 2x downsampling, or 8 linear taps for 4x. Works great in 2x, but I’m not sure if the result is accurate in 4x. (reference document, parts of it like the bits about interpolation/upsampling are BS but it worked to an extent)
Tent/Triangular/Bilinear 4x : I wasn’t sure of its exact name because it’s shaped in 2D like a tent, in 1D like a triangle, and it’s exactly the same as bilinear filtering… It’s accurate but detail-murdering 4x downsampling. Theoretically, it should produce the same result as a “double blit”, but the tent is a lot more stable, which shows that BlitFromRenderSurface has sampling problems.
Sinc with Kaiser window : A silly and time-consuming experiment that pretty much failed. I found out about this filter in a technical column by the Jonathan Blow (of Braid fame), which mentioned that it is the perfect low-pass filter and should be the most detail-conserving downsampling filter. There are very convincing experimental results in part 2 as well, so I gave it a shot. I get a lot of rippling artifacts, it’s way too slow for realtime, but it’s been fun to try out. (reference document)

Fast Gaussian Blur + Metrics

The gaussian blur shader (and its accompanying classes) in this demo an implementation of the stuff I blogged about some months ago : the link between “lost light” in the weights calculation and how similar to a box-filter it becomes. You can change the kernel size dynamically and it’ll tell you how box-similar it is. The calculation for this box-similarity factor is still very arbitrary and you should take it with a grain of salt… but it’s a metric, an indicator.

But there’s something else : hardware filtering! I read about this technique in a GLSL Bloom tutorial by Philip Rideout, and it allows up to 17 effective horizontal and vertical samples in a ps_2_0 (SM2 compatible) pixel shader… resulting in 289 effective samples and a very wide blur! It speeds up 7-tap and 9-tap filters nicely too, by reducing the number of actual samples and instructions.
Phillip’s tutorial contains all the details, but the idea is to sample in-between taps using interpolated weights and achieve the same visual effect even if the sample count is halved. Very ingenious!

If you have more questions, I’ll be glad to answer them in comments.
Otherwise I’ll direct you to the original TV3D forum post for info on its development… there’s a link to a simpler earlier version of the demo there, too.

Loop Parallelism Revisited

Here’s a follow-up on a previous post I had made on “Loop Parallelism”.

A faster PersistentThread

In my last post, I wrote a class that keeps a single thread alive in order to re-use it without having to create threads over and over, an operation that sounded like a lot of overhead when you have over 30 updates per second (i.e. in a game loop). However, the benchmarks showed that it was actually slower to use synchronization primitives than to just recreate the threads every time!

It turns out that using a Monitor for the wait/signal operations wasn’t so good. There are lighter primitives in .NET that can be used for that simple operation, where all you need is a boolean semaphore. After researching a little bit, the best object for the job is the ManualResetEvent. This allows you to put a thread in sleep mode with the WaitOne() method, and wake it from another thread using the Set() method.

I ended up using two ManualResetEvents : one for a thread waiting to be started with a new work unit, another to simulate a thread “Join” operation (without actually killing the thread). And the cost of using them is MUCH smaller than Monitors! Here are the new performance numbers :

Test ‘Multi-Threaded, PersistentThread (single kept-alive thread w/ generic context & delegate)’ Started… Completed.
Time Elapsed : 00:00:11.3376303 s
Test ‘Multi-Threaded, ParameterizedThreadStart’ Started… Completed.
Time Elapsed : 00:00:11.4678206 s
Test ‘Single-Threaded’ Started… Completed.
Time Elapsed : 00:00:22.3374592 s

Seeing as the iterative version takes ~22.33 seconds to execute, the theoretical minimum time it could take on my Core 2 Duo is ~11.17s. And the new PersistentThread takes only ~0.16s less than that!
Also notice that the former speed champion, the base-line ParameterizedThreadStart method, is now slightly slower than the new PersistentThread. Honestly, I don’t think one can do any better with just two threads.

Task Parallel Library

The fancy lads at Microsoft have been working hard lately on bringing new ways of working with concurrent code. It was one of the big subjects on Channel9 this autumn.

So they released a CTP back in June of the Parallel Extensions for .NET, which features the Task Parallel Library and a front-end for using it with the Parallel static class and Parallel LINQ. I love it, and I can’t wait to use the official, final release in .NET 4.0. But for now, the CTP is an excellent reference point to see what kind of performance and ease of integration the TPL brings to your project.

The simplicity of my test codepath for the TPL boggles the mind, compared to all the other methods I’ve tested.

for (int i = 0; i < OuterLoops; i++)
    Parallel.ForEach(testData, sample => { sample.Number = SlowFunctions.Bessel(sample.Number / rnd.Next()); });

The number of threads used, how the data is partitioned, and the actual thread/context handling is all abstracted to its simplest form. It really makes C# 3.0 lambda expressions shine.

So, it’s cute alright. But how does it perform? Actually, the TPL will allocate as many threads as it thinks would be beneficial, like an automatic thread pool, based on how many processors you have and the workload you give it. I’m not sure of the details, but that’s how I understood it. So I suspect that it uses more than two threads if doing so will give more processor time to your code; this is relevant when there are many other threads waiting for the processor, and more of these threads being your code = more chance of it being executed.

So here are the numbers :

Test ‘Task Parallel Library’ Started… Completed.
Time Elapsed : 00:00:11.4088743 s

It’s definitely fast, but not as fast as my new PersistentThread! But it’s so much more flexible, easy to use and scalable that it’s the obvious choice for concurrency in .NET,… when it’ll be officially released. :)

A final note on the Parallel Extensions, there’s a new class called ManualResetEventSlim in this CTP which suggests that it’s more optimized or leaner. I didn’t use it because it depends on libraries that cannot be shipped, but quick test showed that the performance was more or less the same as ManualResetEvent.

Vista > XP

Since the last post, I upgraded (though some might say downgraded…) to Windows Vista. The experience isn’t totally smooth since my laptop is being dumb, but one of the upsides of upgrading is having access to new kernel code that appears to be more efficient.

My last benchmarks had quite a margin between the different tests (generic context & delegate, class-local context, etc.). In Vista, it’s pretty much the same :

Test ‘Multi-Threaded, ParameterizedThreadStart’ Started… Completed.
Time Elapsed : 00:00:11.5178019 s
Test ‘Multi-Threaded, Class-Local Context’ Started… Completed.
Time Elapsed : 00:00:11.4912111 s
Test ‘Multi-Threaded, Generic Context & Delegate’ Started… Completed.
Time Elapsed : 00:00:11.4255807 s

Those are really insignificant differences. I suspect that either calling delegates, creating objects, or just casting objects has become faster in Vista. It’s really hard to tell, but the good news is that you can really choose whatever approach you like best : they’re all equally as fast.

Two threads per core?

A final test I wanted to share is to double the workload on an Idle processor and see what happens. So instead of creating a single PersistentThread, I created 3, which means 4 active threads including the current one. The data is split in four equal parts, and…

Test ‘Multi-Threaded, 3x PersistentThread’ Started… Completed.
Time Elapsed : 00:00:12.3893090
Time Elapsed : 00:00:11.3681673
Time Elapsed : 00:00:15.0850932
Time Elapsed : 00:00:10.2940759
Time Elapsed : 00:00:16.0883727

… and I don’t know what to say. The results are ridiculously unstable, and the code is just confusing and ugly. I wouldn’t recommend it.

Also, I did consider using the .NET 2.0 ThreadPool static class, but it seemed a little confusing to use, especially if I want to join the threads after everyone’s done his work. I think I can wait for the TPL for a proper solution. ;)

Updated test project

Here’s the updated project, now you’ll need the Parallel Extensions CTP for it to compile!
LoopParallelismUpdate.zip (150kb), Visual Studio 2008 Solution for C# 3.5.

Fast .NET Reflection and Serialization

(sorry if you got this twice in your RSS, I hit the “publish” button too early…)

A while ago I decided to make an automatic serializer that works just like the XmlSerializer but for the SDL file format, since I like the simplicity and elegance of this data language. The XmlSerializer also doesn’t work natively with Dictionary objects, and crashes when used with certain visibility combinations and C# 3.0 auto-implemented properties.

Making a serializer for any language implies heavy use of reflection to determine the structure of what you’re reading or writing to or from a data file, but also to invoke the getter/setter of the fields you’re serializing.

Performance considerations

Some reflection operations come at a heavy performance cost. Not all of them though! This 2005 article in MSDN Magazine explains that fetching custom attributes, FieldInfo/PropertyInfo objects, invoking functions/properties and members and creating new instances are the costliest operations. Well that’s a problem, because all of those will be handy when writing our serializer.

The same article continues by showing which are the slowest method invocation techniques. The speediest technique are direct delegate use, virtual method calls or direct calls, but those are impossible to use if all you’ve got is a Type and an Object. The next best thing is using a DynamicMethod object, IL emission and a delegate. Having never used IL before, I didn’t grasp all of that, but thankfully there are many other resources concerning the use of DynamicMethod out there.

A post on Haibo Luo’s blog from 2005 makes a performance comparison between Activator.CreateInstance() (by the way, doing “new T()” with a generic type parameter that’s constrained as “new()” is the exact same as calling this method) and various other techniques including DynamicMethod and using it as a delegate. This last technique blows the rest out of the water in terms of speed.

This GPL library on CodeProject written by Alessandro Febretti provides an excellent dynamic method factory. And this other article on CodeProject goes a bit further and shows how to set/get values on fields, and isolates the boxing in helper functions.

What I ended up doing is taking from all of these examples, correcting the problems outlined in the comments of both CodeProject samples, and I built a IReflectionProvider interface that publishes all these costly operations and which can be implemented three different ways :

DirectReflector : Simply via reflection
EmitReflector : With IL emission but no caching performed (the DynamicMethods and delegates are rebuilt on each call)
CachedReflector : With IL emission and caching (the resulting delegates are created only once, then accessed with a dictionary lookup)

I’m aware that the 2nd test case is ridiculous, you should never emit IL and generate methods at runtime and repeatedly, but I wanted to outline the importance of caching.

The serializer

When making this sample, I wanted to both provide a fast .NET reflection library as well as a proper generic implementation of a reflective serializer. But I didn’t want to spend time on string parsing/formatting, since serializers usually output a text file or a certain data format. So the tradeoff I chose is somewhat unusable in the real world…

It outputs objects which are a generalization tentative of all .NET objects. There are three main categories :

SerializedAtoms are indivisible, single-valued and immutable. All primitive types will serialize to atoms, in addition to strings, enums and nullable types.
SerializedCollections are multi-valued object bags that don’t give a specific meaning to keys or indices other than natural ordering. All classes that implement ICollection<T> will serialize into this.
SerializedAggregates are multi-valued object maps that use the key or index for indentification. All of which doesn’t fall in the two other categories will serialize to aggregates, so Dictionaries and just any other class.

Only atoms contain actual values, but it contains them as an object. There is no string conversion done in the end, it all remains in memory. Serialized objects also retain the name of their host field or dictionary entry if any, and the runtime type if different from the declared one.

To customize the serialization output to an extent, I made a custom attribute called [Serialization] which allows to force an alternate name to a serialized member, mark a member as ignored by the serializer, or mark it as required. I could’ve used “optional” instead, but I find it more logical to skip serialization of all null or default-valued fields.

Just like the XmlSerializer, it only serializes the public instance fields or properties. So unlike the BinaryFormatter (which is deep serialization), my serializer does shallow serialization.

I have tested the implementation with many (if not all) combinations of value-type/class, serialized object category and visibility, so I can say it’s pretty robust and tolerant on what you feed it.

Results

This is the whole point… how fast does “Fast .NET Reflection” go? Here are the timings for 10 outer loops (so 10 serializer creations) and 100 inner loops (100 serializations per outer loop), which means 1000 serializations or the same complex aggregate object.

Test ‘Standard Reflection’ Started… Completed.
Time Elapsed : 00:00:08.2473666
Test ‘Reflection.Emit + Delegate (No Caching)’ Started… Completed.
Time Elapsed : 00:01:52.4517968
Test ‘DynamicMethod + Delegate, Cached’ Started… Completed.
Time Elapsed : 00:00:00.9970487

Well, I did say that no caching was a very bad idea.

Still, the highlight here is that by running the same serialization code with two different reflection function providers, using dynamic IL methods and a healthy dose of caching is eight (8!) times faster than using standard reflection.

Sample code

The code for this sample (C# 3.5, VS.NET 2008) can be found here : FastReflection.zip (46 Kb)

Even if you’re not interested in serialization, I suggest you take a look at the EmitHelper class and how it’s used in CachedReflector. All tasks that need Reflection in a time-critical context should use dynamic methods!

Loop Parallelism in a Game Context

An update to this post is available here.

Recently I was coding some physics-enabled particle systems and ended up with some fairly CPU-intensive stuff that looped as times as there are active particles in a system, and as many as there are active particle systems, each update cycle.

And it got choppy.

I don’t like choppy.

So I had three choices :

Offload to the GPU using GPGPU or Vertex Textures, but doing that in XNA with Shader Model 2 support is… time-consuming. Especially when you have no background work on the subject.
Profile and optimize my physics code, or switch to a physics package like Havok or Farseer, but I expect that would also take too much time.
Multithread using loop parallelism to use 100% of my CPU power instead of 50%! (I have a Core 2 Duo processor, and there’s more and more mainstream PCs with dual-, quad- and even octo-cores)

I ended up doing the latter because it was the simplest, and multithreading is rarely ever simple. But in some specific cases, it’s not much of a headache, really.

If your looped operation does not change the context of further loops, that is if every loop can be done in any order and individually without any write locking, it takes like 30 minutes to get it to work in C#, and with no concurrency problems. It’s like offloading static Web content to a different server, just a matter of routing to the right hardware.

Annoyances and interrogations

ParameterizedThreadStart is not generic. If you have to pass a context to your threaded operation, you have to cast it back to what it really is, at each iteration. It’s unnecessary and annoying… is it slow?
Once a thread dies, you can’t resurrect it. Correct me if I’m wrong, but FSM knows I’ve tried, and a thread that finished its execution cannot be re-run, even if its thread-start method is the same. You have to instantiate another. Does that affect performance…?

The problem with thread creation is that I create a lot of temporary threads. They live a single update cycle, and I instantiate one per particle system per update cycle, at about 60 cycles per second. It sounds like it could slow things down.

So I made a test!

Benchmark

The (fictional) situation is the following : in the middle of a game loop, you have to evaluate the 0th Order Modified Bessel Function Of The First Kind for some reason. And you need to do that for a large dataset, say 500 times per update cycle.

Actually, this kinda makes sense if you’re calculating weights for a very wide Kaiser filter, which I will probably cover later. I had code laying around, which is why I used the elegantly-named Bessel function.

So, this is slow, because it’s an integral. And so multi-threading would be beneficial. The different strategies are :

Single-Threading; it’s always good to know how slow it went before all optimization… if only for programmer ego.
Multi-Threading, using the ParameterizedThreadStart delegate, which means we’ll have to cast the Object to our real context type.
Multi-Threading, using a class-local, strongly-typed context and the (parameterless) ThreadStart delegate. This way we avoid the cast, and sacrifice a weird variable in the host class’ header.
Multi-Threading, using a generic Thread wrapper that does the strongly-typed context caching instead of laying it around. It’s basically a proxy for Thread with the context variable.
Multi-Threading, using a single “kept-alive” Thread that is started once for all update cycles, and that is forced into an artificial idle state between update cycles.

I hoped, even half-expected that these solutions would go from slowest to fastest. The last one was particularly interesting because it skipped all the thread instantiations and instead, relies on a Monitor and a sync-lock object to “wait” between two update cycles. It also produces much less garbage.

Results

Here’s the result for 500 update cycles, in which I update a 500-sized data-set, in Release mode, out of the IDE so without any debugging, and as little stuff as possible running in the background. I used a Stopwatch object to calculate these timings.

Test ‘Single-Threaded’ Started… Completed.
Time Elapsed : 27.8970399 s
Test ‘Multi-Threaded, ParameterizedThreadStart’ Started… Completed.
Time Elapsed : 18.1804179 s
Test ‘Multi-Threaded, Class-Local Context’ Started… Completed.
Time Elapsed : 19.5161083 s
Test ‘Multi-Threaded, Generic Context’ Started… Completed.
Time Elapsed : 20.2257278 s
Test ‘Multi-Threaded, Kept-Alive Thread’ Started… Completed.
Time Elapsed : 23.7845815 s

Of course, everything goes differently from what I expected. :P

The good news is, thread creation in .NET is very fast. No need to worry about that anymore. In fact, trying to circumvent that using a single kept-alive thread and monitor use makes it go almost 30% slower!

Also, casting the object context every iteration is faster than any other strongly-typed alternative. Unless the word “Object” in your code makes you want to shoot someone, it’s the best way to go. Kind of sad, that.

Here’s the C#3.0 (VS.NET 2008) test project : Loop Threading Performance (24 Kb)

Sorry again for the absolute absence of comments, it’s late and I don’t feel like it. D:

A Note On Debugging

One thing that’s slightly alarming and certainly worth mentioning, is that the performance in the IDE (with Debugging, but not necessarily in “Debug” mode) is hugely different.

In debugging, the “kept-alive” thread is almost always faster than all other strategies, by as much as 15%. This never replicates in the real-world. So kids, always test out of the IDE, or add the empty green arrow (Start Without Debugging, Ctrl+F5) in your Visual Studio toolbar!

Making an installer for an XNA GS 2.0 game

There is an ongoing thread on the XNA Creators Club forums about how to create an installer using Inno Setup, a free and very capable installer program.

The script described by Pelle in the first posts of this thread works great for XNA GSE 1.0 and Refresh, but is problematic for XNA GS 2.0 games because of the dependency on .NET 2.0 Service Pack 1. I won’t bring much new information in this post, but will try to summarize everything I read and tried it into a tested & true solution that I’m currently using for the Fez installer.

Please note that this solution does not work if your game requires use of the GamerServicesComponent of live networking using XNA, since you need the whole Game Studio for it to work. (see this post)
Also, it wasn’t made with 64-bit operating systems in mind, so if you’re going to support them you’ll have to modify it. It has been known to work on Vista though.

Continue reading Making an installer for an XNA GS 2.0 game

XNA Service Injection Sample

Download

ServicesDemo.rar [13 kb] – XNA GSE 1.0 Refresh Project (source only, the executable is pointless)

Description

Finally, here it is!

In my last blog post I said I’d be doing a sample of what’s explained in it, so service dependencies and service injection in components and other services, in an XNA context. It actually worked in my own game for many weeks, but I just found the time and motivation to finish up my sample.

Details about the sample structure after the jump.

Continue reading XNA Service Injection Sample

Windows Forms w/ XNA Revisited

Download

WindowsFormsDemo.rar [47.5kb] – XNA GSE 1.0 Refresh Project w/ Binaries

Description

Only works with XNA GSE 1.0 Refresh, not in 2.0! See this post.

Here’s a sample demonstrating what I explained in an earlier blog post, but with quite a few updates and precisions.

Continue reading Windows Forms w/ XNA Revisited