Let me ask you a question. Suppose Intel came out with a CPU that was 10x faster, but there was a catch: every once in a while it would just give you the wrong number. Say you calculate x = x + 7, and once in a while it hands you x - 4 instead. Or sometimes when you write to memory, it puts the data somewhere else. Would you rather use a CPU that actually works, or one that is 10x faster but unreliable? Obviously you would want the CPU that gives the right results, because even one calculation error can easily crash your program.
For the last 10 years, GPUs have gotten away with cutting corners here. We’ve all dealt with various driver problems on PCs. And we’ve all seen how, when a card starts to die, you sometimes get the “white dots” issue, with sporadic white dots appearing across the screen.
This is why OpenCL is such a pain. Here is a test case that I ran into today, while I was trying to figure out what was wrong with my code on this image. I’m using an AMD 5750, and my program processes average-sized raw files (3900×2616).
At a glance it looks ok. But if you zoom in, here is a comparison between the GPU version of my code and the CPU version.
Also, here is a diff between the two.
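The diff itself is nothing fancy. Here is a minimal sketch of the kind of comparison involved, assuming 8-bit single-channel buffers of width × height pixels; the buffer layout and tolerance are placeholders for illustration, not my actual pipeline:

```c
#include <stddef.h>

/* Compare a CPU-rendered buffer against the GPU result, write a
 * white-on-black mask of mismatching pixels, and return how many
 * pixels differed. Layout and tolerance are assumptions. */
static size_t diff_images(const unsigned char *cpu, const unsigned char *gpu,
                          unsigned char *mask, size_t width, size_t height,
                          int tolerance)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < width * height; i++) {
        int d = (int)cpu[i] - (int)gpu[i];
        if (d < 0)
            d = -d;
        if (d > tolerance) {
            mask[i] = 255;   /* differs: white */
            mismatches++;
        } else {
            mask[i] = 0;     /* matches: black */
        }
    }
    return mismatches;
}
```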
Yikes! This looks like some kind of tiling problem, but I could be wrong. In a tile-ish pattern, the output sometimes just gets a copy of the pixel to the left. In the past few months, I’ve run into every problem you can possibly run into with GPU coding. Drivers crash. Things just don’t work. There are cases where I can write data to the GPU, but trying to read it back gives me an internal error. If your kernel takes too long, the driver crashes. You name it, I’ve hit it.
But this kind of bug is much trickier. With all the other bugs, the API at least returns an error, so I know there is a problem and can deal with it (see the sketch below). But if certain driver/GPU combinations just happen to give the wrong results silently, that’s a real issue. So far, the primary GPU computing applications have been enterprise-level software, where there is an IT guy who monitors things, updates drivers, etc. But if you want to reach mass-market consumers, you have to write code that still works even if the user never updates their drivers.
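To be clear about what I mean by the loud failures: every OpenCL call hands back a status code you can check. Here is a rough sketch of that kind of defensive checking, including the write-then-read-back round trip that sometimes fails for me; the bail-out error policy is simplified for illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Check every OpenCL status code. This catches the "loud" failures --
 * out-of-resources, internal errors on read-back, and so on -- but it
 * can do nothing about a kernel that silently computes wrong pixels. */
static void cl_check(cl_int err, const char *what)
{
    if (err != CL_SUCCESS) {
        fprintf(stderr, "OpenCL error %d in %s\n", (int)err, what);
        exit(EXIT_FAILURE);  /* a real app would fall back to the CPU path */
    }
}

/* Write a buffer to the device and immediately read it back. */
static void round_trip(cl_context ctx, cl_command_queue q, size_t n,
                       const float *src, float *dst)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float),
                                NULL, &err);
    cl_check(err, "clCreateBuffer");

    err = clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                               src, 0, NULL, NULL);
    cl_check(err, "clEnqueueWriteBuffer");

    err = clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                              dst, 0, NULL, NULL);
    cl_check(err, "clEnqueueReadBuffer");  /* where I see internal errors */

    clReleaseMemObject(buf);
}
```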
Of course, if they really wanted to, NVIDIA and AMD could make cards and drivers that were 100% reliable. But if they did, they would have to put out slower cards that cost more. In hardware design, there is an inherent tradeoff between accuracy and cost. Back to our hypothetical: a CPU that was 10x faster but gave the wrong answer once in a billion calculations would be completely useless. However, a GPU with the same failure rate would be a great tradeoff, because a mangled pixel is gone by the next frame, while a mangled calculation silently corrupts your data. The driver teams at AMD and NVIDIA are very talented and work their tails off. They have a herculean problem and they do a terrific job. But it’s simply not their goal to fix every problem. Rather, they are tasked with fixing as many bugs as they can, prioritized by the most common use cases.
But that still leaves the problem: how are GPUs going to start replacing typical CPU tasks if you can’t rely on them to actually work all the time? As an application developer, I’m designing my codebase under the assumption that I can never rely on OpenCL to work. If it does, it’s a bonus, but I can’t trust it; the GPU path has to prove itself, something like the sketch below.
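Concretely, that means running a known test image through both paths at startup, and if the GPU result doesn’t match the CPU reference bit for bit (or any call errors out), disabling OpenCL for the session. A minimal sketch, with hypothetical function-pointer hooks standing in for the real processing paths:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hooks -- placeholders for the real CPU and GPU paths.
 * The GPU path returns false on any OpenCL API error. */
typedef bool (*gpu_fn)(const unsigned char *in, unsigned char *out,
                       size_t w, size_t h);
typedef void (*cpu_fn)(const unsigned char *in, unsigned char *out,
                       size_t w, size_t h);

/* Self-test at startup: only trust the GPU if it reproduces the CPU
 * result exactly on a known test image; otherwise run everything on
 * the CPU and treat OpenCL as a bonus that happens to be unavailable. */
static bool gpu_is_trustworthy(cpu_fn cpu, gpu_fn gpu,
                               const unsigned char *test, size_t w, size_t h,
                               unsigned char *cpu_buf, unsigned char *gpu_buf)
{
    cpu(test, cpu_buf, w, h);
    if (!gpu(test, gpu_buf, w, h))
        return false;                            /* loud failure: API error   */
    return memcmp(cpu_buf, gpu_buf, w * h) == 0; /* quiet failure: bad pixels */
}
```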