February 25, 2011 / racoonacoon


Hello folks! In my previous post I mentioned that I was either going to be working on improving the performance of our Claudius graphics engine or working on a level exporter within Blender.  Of the two, I favored working on the internals of Claudius a bit.

Why would I spend the little time we have to work on our game improving performance? I mean, premature optimization is the root of all evil, isn’t it? Yes, but the key word in Donald Knuth’s famous quote is ‘premature.’ You see, our game is only running at about 47 frames per second on a small level. Obviously, we are going to be very limited in the size, scope, and graphical prowess of our levels if some optimization isn’t done.

So I set out to optimize. The first step to optimization is figuring out what is running slow and, well, making it not run so slow! Finding one of the immediate causes of the problem was pretty easy. Based on my experience working on Xbox Live Indie Games such as Didgery, I have learned that draw calls are rather expensive operations. With each draw call there is a certain amount of overhead as the CPU pushes all the necessary information over to the GPU.

Our game had a TON of draw calls, 372 per frame to be exact. 372! That is just insane. Devices like the iPhone only give you about 8 or 16 draw calls before a game runs sluggishly. This huge number of draw calls is caused by the Dynamic Environment Map effect applied to the Sphere. This effect requires the scene to be drawn six additional times to create an accurate reflective map. It looks nice, but it is very expensive.

The beautiful but oh-so-expensive Dynamic Environment Mapping Effect

With this many draw calls, our frame rate looked something like this:

Not so good 😦

The first obvious thing to do was to reduce the number of draw calls Claudius was making. At that time, Claudius was rendering everything that was handed to it, even objects that were behind the camera and not remotely visible. To fix this, we created a camera frustum class and added accept/reject logic to the core rendering pipeline. As each object goes through, it is compared against the camera frustum. If it is not visible to the camera, it is never drawn. This helped us out a bit.
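To make the accept/reject idea concrete, here is a minimal sketch in Python. It is not the Claudius code: the names `Plane`, `BoundingSphere`, `Frustum`, and `visible_objects` are made up for illustration, each object is assumed to carry a bounding sphere, and a simple box-shaped volume stands in for a real camera frustum (whose six planes would normally come from the view and projection matrices).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plane:
    """Plane n.p + d = 0 whose normal points toward the inside of the volume."""
    nx: float
    ny: float
    nz: float
    d: float

    def signed_distance(self, x, y, z):
        # Positive when the point is on the inside of the plane.
        return self.nx * x + self.ny * y + self.nz * z + self.d


@dataclass(frozen=True)
class BoundingSphere:
    x: float
    y: float
    z: float
    radius: float


class Frustum:
    """Hypothetical frustum: just a list of inward-facing planes."""

    def __init__(self, planes):
        self.planes = list(planes)

    def intersects(self, sphere):
        # Reject as soon as the sphere lies completely outside any one plane.
        for plane in self.planes:
            distance = plane.signed_distance(sphere.x, sphere.y, sphere.z)
            if distance < -sphere.radius:
                return False
        return True  # Conservative: may accept a few spheres near corners.


def visible_objects(frustum, objects):
    """Keep only the objects that pass the accept/reject test."""
    return [obj for obj in objects if frustum.intersects(obj)]


# Box-shaped stand-in for a camera looking down -z (an assumption for the demo).
camera_volume = Frustum([
    Plane(0.0, 0.0, -1.0, -1.0),   # near:   z <= -1
    Plane(0.0, 0.0, 1.0, 100.0),   # far:    z >= -100
    Plane(1.0, 0.0, 0.0, 10.0),    # left:   x >= -10
    Plane(-1.0, 0.0, 0.0, 10.0),   # right:  x <= 10
    Plane(0.0, 1.0, 0.0, 10.0),    # bottom: y >= -10
    Plane(0.0, -1.0, 0.0, 10.0),   # top:    y <= 10
])

scene = [
    BoundingSphere(0.0, 0.0, -5.0, 1.0),   # in front of the camera
    BoundingSphere(0.0, 0.0, 5.0, 1.0),    # behind the camera
    BoundingSphere(0.0, 0.0, -0.5, 1.0),   # straddles the near plane
    BoundingSphere(50.0, 0.0, -5.0, 1.0),  # far off to the right
]

to_draw = visible_objects(camera_volume, scene)
print(len(to_draw), "of", len(scene), "objects submitted for drawing")
```

The test costs one dot product per plane per object, which is why a plain loop over every object can stay competitive with fancier structures while the scene is small.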

I actually implemented an octree, but early tests suggest that the overhead of getting items out of the octree results in a slower frame rate than testing each object against the camera frustum. As our scenes get larger, this will most certainly change, but for now per-object culling tests are preferred.

With per-object culling implemented we are getting around 139 draw calls per frame, which is certainly better, but it is still a few too many for me to be comfortable with. I wanted to reduce the draw call count, and increase the frame rate, as much as possible. That way we wouldn’t have to worry so much about performance in the future.

As I thought about how I could improve performance, I noticed that most of the objects in our scene are similar. What if all the similar objects could be batched together into a single large vertex buffer and submitted to the graphics card in one call? This is the technique XNA uses for its 2D SpriteBatch. Could the same concept be applied to 3D and still be advantageous? So off I went for a couple of days to test out a basic implementation of this idea. The result: disappointing.

While the maximum frame rate is higher, both the average and minimum are lower. Furthermore, the large variance between average, min, and max suggests that the frame rate is less stable with dynamic batching applied. I have not attempted to optimize this optimization, but I have to wonder at this point if it would be worth it. The overhead of gathering together all the mesh vertices and appending them into a Dynamic Vertex Buffer actually costs more than just drawing the objects separately.
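The gathering step described above can be sketched roughly like this in Python. This is only an illustration of the idea, not the Claudius implementation: `build_batches` and the batch keys are hypothetical, objects are assumed to be `(batch_key, local_vertices, offset)` tuples, and a plain translation stands in for a full world transform.

```python
from collections import defaultdict


def build_batches(objects):
    """Merge the vertices of similar objects into one list per batch key.

    Each resulting list stands in for one dynamic vertex buffer, which would
    be submitted to the graphics card with a single draw call.
    """
    batches = defaultdict(list)
    for key, vertices, (ox, oy, oz) in objects:
        # This is the per-frame CPU cost: every vertex of every object is
        # moved into world space and copied into the shared buffer.
        batches[key].extend((x + ox, y + oy, z + oz) for x, y, z in vertices)
    return dict(batches)


# A single triangle stands in for a real mesh to keep the demo short.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

objects = [
    ("crate", triangle, (0.0, 0.0, 0.0)),
    ("crate", triangle, (5.0, 0.0, 0.0)),
    ("pillar", triangle, (0.0, 0.0, 3.0)),
]

batches = build_batches(objects)
print(len(objects), "objects merged into", len(batches), "draw calls")
```

The trade-off shows up right in the loop: the number of draw calls drops, but the CPU now touches every vertex every frame, so the saving only pays off if that copy is cheaper than the draw calls it replaces.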

But not all is lost. During testing, I noticed that just looping over a large number of objects is slower than it needs to be. After a bit of measuring using timers, I discovered that the core of Claudius, the ObjectRegister, is far less efficient than it should be. I plan to spend a few hours working on the efficiency of this core component. Hopefully this will provide us with the performance boost that we need.
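For anyone curious what "measuring using timers" can look like, here is a small Python sketch of the general approach. It is not the code I used in Claudius: `SectionTimer` and the section names are invented for this example, and the loops are placeholder workloads.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulates wall-clock time per named section of a frame."""

    def __init__(self):
        self.totals = defaultdict(float)  # seconds spent per section
        self.counts = defaultdict(int)    # how many times each section ran

    @contextmanager
    def measure(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def slowest(self):
        """Sections sorted from most to least total time."""
        return sorted(self.totals.items(), key=lambda item: item[1], reverse=True)


timer = SectionTimer()

for _frame in range(3):
    with timer.measure("object_loop"):   # placeholder for iterating the scene
        total = sum(i * i for i in range(50_000))
    with timer.measure("culling"):       # placeholder for the visibility tests
        kept = [i for i in range(1_000) if i % 2 == 0]

for name, seconds in timer.slowest():
    print(f"{name}: {seconds * 1000:.2f} ms over {timer.counts[name]} runs")
```

Wrapping suspect sections this way and sorting by total time is usually enough to show which part of the frame deserves attention first.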

See ya!

