The Mysteriously Spinning Turret

One of the “benefits” of using new technology such as OpenCL is that you sometimes find other people’s bugs. Finding these can be very time-consuming because, as a programmer, you’re used to thinking that if something does not work, it’s because you did something wrong. Sometimes, however, that assumption is not true.

One example we encountered was the following: a Panzer III Ausf. F model consisting of 3 separate pieces of geometry (the hull, the turret and the main gun) is supposed to look like this:

All parts accounted for, Sir!

But when running on AMD GPUs, the tank suddenly looks like this:

Is it a Bren Carrier? Is it a Panzer III? No, it’s a bug!

Not that big of a problem, the turret is just missing, should be easy to fix, right? Probably something went wrong while loading or something wasn’t initialized properly.

Until we let the simulation run, at which point this starts happening:

The Mysteriously Spinning Turret

Whereas on nVidia, things are working as they are supposed to:

The Not-So-Mysteriously Static Turret

After exhaustively checking every line of code that touches this data, I was at a loss as to where this behavior was coming from. After spending considerable time on it, we decided to park the issue. Last month I had to dig into it again, since it has to be fixed if we want to release anything. So after a time-consuming comparison of the state of the data on 2 systems, one with an nVidia GPU and the other with an AMD GPU, I suddenly noticed something strange. Tracking it down led me to one piece of code that does something like this:

if (object is scene child)
{
    get position from scenegraph node
}
else
{
    get position from global transform
}
// the code then updates the position
if (object is scene child)
{
    write updated position to scenegraph node
}
else
{
    write position to global transform
}
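
For readers who prefer real code over pseudocode, here is a minimal sketch of the same read/update/write pattern in plain C. The names (Vec3, Object, update_position) and the data layout are made up for illustration and are not our actual engine code. It only shows the intended behavior: the read and the write should always hit the same storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types: the real data layout is not shown in this post. */
typedef struct { float x, y, z; } Vec3;

typedef struct {
    bool is_scene_child;   /* true for parts such as the turret */
    Vec3 node_position;    /* position stored in the scenegraph node */
    Vec3 global_position;  /* position stored in the global transform */
} Object;

/* Mirrors the pseudocode: read, update, then write back.
   Both "if" statements test the same condition, so they must
   take the same branch for any given object. */
static void update_position(Object *obj, Vec3 delta)
{
    assert(obj != NULL);

    Vec3 p;
    if (obj->is_scene_child) {
        p = obj->node_position;      /* get position from scenegraph node */
    } else {
        p = obj->global_position;    /* get position from global transform */
    }

    /* the code then updates the position */
    p.x += delta.x;
    p.y += delta.y;
    p.z += delta.z;

    if (obj->is_scene_child) {
        obj->node_position = p;      /* write updated position to scenegraph node */
    } else {
        obj->global_position = p;    /* write position to global transform */
    }
}
```

For a scene child, only node_position should change after a call, and global_position should stay untouched.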

Now, for a turret, both the first and the last “if” should evaluate to true, but on AMD GPUs the code was executing the “else” part of the first “if” (but not of the second). Every time I added something to the code to print out information in order to dig into the issue a little deeper, the problem went away(!?). I started looking through the AMD OpenCL developer forums, where I found one post from years ago in which the developer received the advice to “turn off optimizations” in case of certain bugs. So, “why not?”, I thought to myself and tried it on our code, and behold! The bug disappeared! So it started to look very much like AMD’s OpenCL compiler contains a bug that produces faulty code.
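
For reference, turning off optimizations in OpenCL is done through the build-option string handed to clBuildProgram; the standard option is -cl-opt-disable. Below is a minimal sketch in plain C of a helper that assembles such a string. The helper name make_build_options and its parameters are made up for this example and are not taken from our code base.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Standard OpenCL build option that disables all compiler optimizations. */
#define CL_OPT_DISABLE "-cl-opt-disable"

/* Hypothetical helper: writes "<base> -cl-opt-disable" (or just <base>)
   into buf. Returns 0 on success, -1 if the result did not fit. */
static int make_build_options(char *buf, size_t size,
                              const char *base, int disable_opt)
{
    assert(buf != NULL && size > 0);
    if (base == NULL) {
        base = "";
    }

    int n;
    if (disable_opt) {
        /* Only add a separating space when there are other options. */
        n = snprintf(buf, size, "%s%s%s",
                     base, base[0] != '\0' ? " " : "", CL_OPT_DISABLE);
    } else {
        n = snprintf(buf, size, "%s", base);
    }
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* With a real OpenCL program object, the resulting string would be passed
   as the "options" argument, for example:
       clBuildProgram(program, 1, &device, options, NULL, NULL);
*/
```

Build options apply to a whole program object, so switching optimizations off for only some kernels presumably means building those kernels as a separate program with this option set.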

I posted the bug to AMD’s OpenCL forum and, after creating a small program that contains just the bug, handed things over to them. Hopefully a future driver version will contain a fix for it. Until then we have to disable optimizations for parts of the code, which is not ideal, but slow correct code is still better than code that does the wrong thing. To be continued…

Comments and reactions to this blog entry can be made on our forum.
