The Problem With Graphics Research: Error Metrics
March 6, 2011
Graphics researchers and games programmers don’t talk to each other. That’s just how it is. I can’t count how many Siggraph papers claim to be applicable to games. Naty has a post over at realtimerendering.com where he summarizes some of the discussion at I3D about this and adds some insights as well. I give everything in that post a solid +1. But actually, my main complaint has to do with error metrics.
Here is a hypothetical for you. There is some existing technique out there that everyone uses, and I have an approximation for it. It could be a comparison of my lighting model vs some established lighting model. My HDR tonemapping algorithm vs your HDR tonemapping program. Anything really. My technique has 2% average error. At any point, the error is no more than 4% of the maximum absolute value. So is my lighting model good enough that it “solves” the problem? Of course not. If you are saying that your technique is “visually indistinguishable” from another technique, absolute error metrics are basically worthless. And unfortunately, absolute error is the value that almost everyone uses in the graphics research community.
The classic example that I use for how error metrics can be deceiving is the Xbox 360 PWL Gamma Curve. Here is a graph of the gamma 2.2 curve, the sRGB curve, and the Xbox 360 curve. I’ve mentioned this elsewhere on this site.
So, is the Xenon (Xbox 360) curve a “good” approximation of the sRGB curve? Well, let’s look at them side-by-side. Here is a color checker with sRGB on the top and Xenon on the bottom.


You know, those look kind of close. I’ll bet you that most people would tell you that those two images look the same, unless you give them a hint. What about the error?
First, we’ll start with the Normalized Average Absolute Error. For each x value, we take the absolute difference between the two curves, add them all up, and divide by the total number of x values. Then we divide by the maximum value of the curve, which in this case is 1.
What value do we get? 1.62%. HOORAY!!! Since the average error is actually only 1.62%, the 360 Gamma curve is clearly good enough. Ok, the average error is fine, but what’s the worst error? If we look at the curves, the absolute error between the curves peaks right around codes 135 and 136 at 3.71%.
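As a rough sketch, the two absolute metrics described above could be computed along these lines. The function name and the four-sample toy curves are made up for illustration; they are not the sRGB or Xenon data.

```python
def absolute_error_stats(reference, approx):
    """Return (normalized mean abs error, normalized max abs error, worst index)."""
    if len(reference) != len(approx) or not reference:
        raise ValueError("curves must be non-empty and the same length")
    peak = max(abs(r) for r in reference)  # maximum value of the reference curve
    if peak == 0:
        raise ValueError("reference curve is all zeros, cannot normalize")
    diffs = [abs(a - r) for r, a in zip(reference, approx)]
    worst = max(range(len(diffs)), key=diffs.__getitem__)
    mean_err = sum(diffs) / len(diffs) / peak
    max_err = diffs[worst] / peak
    return mean_err, max_err, worst


# Hypothetical toy curves, normalized so the reference peaks at 1.0.
reference = [0.0, 0.01, 0.5, 1.0]
approx = [0.0, 0.03, 0.52, 1.0]
mean_err, max_err, worst = absolute_error_stats(reference, approx)
print(f"mean {mean_err:.2%}, max {max_err:.2%} at sample {worst}")
```

Note that the dark sample (0.01 vs 0.03) and the mid-grey sample (0.5 vs 0.52) contribute the same amount to both numbers, even though the dark one is off by a factor of three.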
Based on these observations, you would assume that the Xenon PWL Curve is pretty good when compared to the sRGB curve. If these results were in a paper, they would probably be good enough to get accepted.
And for the record, this process is how 99% of Siggraph papers present their error metrics.
1. Show a graph of the error. They generally look pretty close.
2. List the average absolute error. In this case, 1.62%.
3. List the worst-case absolute error. In this case, 3.71%.
4. Show the before and after results side-by-side.
The catch? Unfortunately, this approach is very misleading. These are all exactly the wrong metrics to use for evaluating this color profile. It definitely irks me that this is the accepted process. Here’s why.
Let’s do one more comparison. Now I’ll put both images interleaved with each other so that I can see the real difference.

WHOA!!!!! Is this the same comparison? That looks terrible. There is no way that you can say that the Xenon curve is a valid approximation of the sRGB curve.
Just 30 seconds ago, I showed you, using all the conventional comparison methods (graphs, avg absolute error, max absolute error, and side-by-side comparisons), why these two profiles are “visually indistinguishable”. And with one more comparison I showed you that that conclusion was completely wrong. It’s almost as if the error metrics that the academic community uses need some work…
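The post does not say how the interleaved image was assembled, so the following is only one assumed way to do it: alternate vertical strips taken from each image. Images are represented here as plain lists of rows, and `interleave_columns` and the placeholder pixel labels are hypothetical names used for illustration.

```python
def interleave_columns(image_a, image_b, strip_width=1):
    """Alternate vertical strips: strip 0 from image_a, strip 1 from image_b, and so on."""
    if strip_width < 1:
        raise ValueError("strip_width must be at least 1")
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have identical dimensions")
    return [
        [
            (row_a if (x // strip_width) % 2 == 0 else row_b)[x]
            for x in range(len(row_a))
        ]
        for row_a, row_b in zip(image_a, image_b)
    ]


# Placeholder "images": "s" marks a pixel from the sRGB render, "x" one from the Xenon render.
srgb_image = [["s"] * 6 for _ in range(2)]
xenon_image = [["x"] * 6 for _ in range(2)]
for row in interleave_columns(srgb_image, xenon_image, strip_width=2):
    print("".join(row))
```

Putting the strips directly against each other removes the gap that a side-by-side layout leaves between the two versions of the same patch.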
As you really should know by now, the human eye perceives intensity logarithmically. If I turn on a 60W light bulb in a dark basement, that makes a huge difference. If I turn on that same light outdoors on a bright sunny day, you’ll barely notice it. There is absolutely no excuse for a researcher in computer graphics not understanding this concept.
When you talk about the difference between how the human eye perceives two different values (such as a color profile on a monitor) you have to look at the relative error. In other words, for each x value, take the absolute difference between the two values, and then divide by the reference value. Do this for all x values, add the results up, divide by the number of samples, and you have your average relative error.
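A minimal sketch of that calculation follows. The function name and toy data are invented for illustration, and skipping samples whose reference value is zero is an assumption made here to avoid dividing by zero, not something the recipe above specifies.

```python
def mean_relative_error(reference, approx):
    """Average of |approx - reference| / |reference| over samples with a non-zero reference."""
    # ASSUMPTION: samples with reference == 0 are skipped rather than counted.
    ratios = [abs(a - r) / abs(r) for r, a in zip(reference, approx) if r != 0]
    if not ratios:
        raise ValueError("no samples with a non-zero reference value")
    return sum(ratios) / len(ratios)


# Hypothetical toy curves: the same absolute offset of 0.02 in a dark and a mid-grey sample.
reference = [0.0, 0.01, 0.5, 1.0]
approx = [0.0, 0.03, 0.52, 1.0]
print(f"mean relative error: {mean_relative_error(reference, approx):.0%}")
```

For this toy data the dark sample alone is off by about 200% while the mid-grey sample is off by about 4%, which is the asymmetry an absolute metric hides.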
So I did this and made a text file of the absolute and relative error between the two curves. In that text file, the first column is the sRGB value, the second column is the 360 PWL Gamma value, the third column is the absolute error, and the fourth column is the relative error.
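A table like that could be regenerated with something like the sketch below. The sRGB decode is the standard formula. The Xenon decode is an assumption: a four-segment piecewise-linear curve with slopes 0.25, 0.5, 1 and 2, written from memory rather than taken from the XDK, so treat it as a stand-in. The function names are illustrative.

```python
def srgb_to_linear(v):
    """Standard sRGB decoding of an encoded value v in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4


def xenon_pwl_to_linear(v):
    """ASSUMED four-segment piecewise-linear decode (slopes 0.25, 0.5, 1, 2)."""
    if v < 0.25:
        return 0.25 * v
    if v < 0.375:
        return 0.0625 + 0.5 * (v - 0.25)
    if v < 0.75:
        return 0.125 + (v - 0.375)
    return 0.5 + 2.0 * (v - 0.75)


def error_table():
    """One row per 8-bit code: (code, sRGB, PWL, absolute error, relative error)."""
    rows = []
    for code in range(256):
        ref = srgb_to_linear(code / 255.0)
        pwl = xenon_pwl_to_linear(code / 255.0)
        abs_err = abs(pwl - ref)
        rel_err = abs_err / ref if ref > 0 else 0.0  # code 0: both curves are 0
        rows.append((code, ref, pwl, abs_err, rel_err))
    return rows


table = error_table()
for code, ref, pwl, abs_err, rel_err in table[30:36]:
    print(f"{code:3d}  {ref:.5f}  {pwl:.5f}  {abs_err:.5f}  {rel_err:.4f}")
```

With these assumed curves, code 33 comes out at roughly 1.7% absolute error and roughly 113% relative error, in line with the figures quoted in this post.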
In the “interleaved” shot, the black value at the lower-right is by far the worst offender. It looks completely off. So let’s zoom in on the lower end of this curve.

There’s your problem. The lower-right square is code 33. The absolute error is quite small, only 1.7%. But the relative error is 112.7%. Also keep in mind that this image doesn’t show the worst offenders, which are actually around code 15. Does 1.7% seem like the correct number to use for your perceived error at that square?
Let’s also look at the error across the entire domain of the curve. Total relative error? 37.37%. Huh? That’s a lot higher than 1.62%. But that value makes sense. At the bottom end, your values often have 223% relative error. By using absolute error, you are very, very, very heavily under-representing the error in the bottom end of the curve.
In my opinion, error representation is the main cause of trouble in computer graphics. Suppose you were writing a paper about how awesome your fast approximation of the sRGB gamma function is. Would you rather claim that your technique has 1.62% error or 37.37% error? Honestly?
I cannot tell you how many times someone has told me something like: “You should just use this approximation, it is proven to only have 2% error.” And my reaction is “Great, it’s only slightly worse than the 360 PWL Gamma curve”. Of course, the technique could still be very good research with many great use cases. But to my eye (and to most artists’ eyes), an approximation with 2% absolute error is usually not good enough to be useful, depending on the case of course.
So what’s the solution?
- The only way to evaluate “perceptual difference” between two images is to interleave them. Putting two images side-by-side is not enough. If you are a reviewer for a paper with side-by-side images, the first thing you should do is crop and combine them in Photoshop.
- If you want to say that your technique is indistinguishable from the correct solution, absolute error metrics are useless. Absolute error metrics are still valid for many uses, just not the perceptual difference between two things.
- Try to use relative error, or some variation of it. Sure, there are issues. For example, what do you do when the control value is zero and your value is non-zero? In that case, I would say it’s fine to use a metric that divides by the max of your value and the control value, to avoid dividing by zero. But, if the control value is 0 and your value is 0.001, I think it’s fair to say that your relative error at that value is 100%.
- Relative error will always be much, much larger than the absolute error. If you are an author, and you use relative error, then you have a much higher chance of being rejected. Getting funding is hard. So I actually don’t blame authors for using the error metrics that put their results in their best light. I put the onus on reviewers to start demanding relative error.
- No matter what you are studying, if your values differ by more than an order of magnitude, you really should be using some kind of relative error metric. Otherwise your lower values will be woefully under-represented. If you were creating an economic model for how tech companies like Facebook grow, would absolute error make sense? If you are off by $1 billion at a $75 billion valuation, that would be quite different from being off by $1 billion at a $100 million valuation.
- This should go without saying, but in order to say that two things look the same, you have to actually look at them. (-:
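The zero-control case from the list above can be sketched as follows. This is one reading of "divide by the max of the two values"; the function name is hypothetical, and taking absolute values so that negative inputs behave sensibly is an added assumption.

```python
def relative_error(control, experimental):
    """|experimental - control| divided by the larger magnitude of the two values."""
    divisor = max(abs(control), abs(experimental))
    if divisor == 0.0:
        return 0.0  # both values are zero, so there is nothing to be wrong about
    return abs(experimental - control) / divisor


print(relative_error(0.0, 0.001))  # control is zero, value is not
print(relative_error(0.5, 0.52))   # a small miss in the mid-tones
```

One consequence of this choice is that, for values of the same sign, the result can never exceed 100%, so it reports overshoots more gently than dividing by the control value alone.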






How did you get the gamma curve? I see that you have a linearly interpolated curve with few values.
If you read the gamma field in the ICC profile, you will normally find this kind of curve. My Dell monitor has only 5 values to represent gamma in the official ICC profile.
But this is because the ICC profile contains the complete LUT to convert from RGB to XYZ as well. In that LUT, the gamma is included. And software like Photoshop will use the LUT, not the gamma field.
Read this small article I wrote about ICC : http://prestigetracer.wordpress.com/2011/02/24/using-icc-color-profiles/
The curves are actually cleanly defined. The 360 gamma curve is defined in the XDK, and the sRGB curve is listed online. I have a post with the formulas at http://filmicgames.com/archives/14.
You have to interleave the images to see the striking difference. The end user doesn’t even get to see them side by side though, so does the difference affect them? I’m not sure if relative error is really any more meaningful than absolute? They’re both just statistics trying to quantify how alike things are. Comparing them visually is probably the best way in some cases.
The problem with zero is that any non-zero value will have 100% error, which means that non-perfect encoding for 0 = ultra bad results, which is often false.
Mat: The catch is that the difference is actually pretty significant. I can clearly see a difference between the two images, but some people can’t. Interleaving is the “strictest” way to verify that two models look the same or different. If you care about showing the real difference between two techniques, interleaving is the way to go. If you want to hide the deficiencies, you should go side-by-side and place them as far away from each other as possible.
Also, you may hear complaints that the PS3 and 360 versions of a game look different. The usual one is that one version looks more “washed out”. That’s usually because of this curve.
Arseny: I’ve been thinking about this. There are times when a small non-zero value against a zero control should be considered 100% error, and there are times when it should be something lower. The most sensible thing to me is to divide by Max(maxValue*0.00030,Max(experimental,control)). In the case of sRGB color profiles, I’d consider it fair to say that your minimum divisor is code 1 of the sRGB curve, which is 0.00030. Keep in mind that if your value for zero is actually your max value times 0.001 (or .1% of your max value), calling that 100% error is actually generous IMO. In that case, you really deserve 230% error, but I’d be willing to compromise and call that 100% error.
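That divisor could be written out roughly like this. The function and parameter names are made up for illustration, and the absolute values are an added assumption so that negative inputs do not produce a negative divisor.

```python
def floored_relative_error(control, experimental, max_value=1.0, floor_fraction=0.00030):
    """Relative error with the divisor Max(maxValue * 0.00030, Max(experimental, control))."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    divisor = max(max_value * floor_fraction, max(abs(experimental), abs(control)))
    return abs(experimental - control) / divisor


print(floored_relative_error(0.0, 0.001))    # 0.1% of max against a zero control
print(floored_relative_error(0.0, 0.00015))  # below the floor, so scored more leniently
print(floored_relative_error(0.0, 0.0))      # exact match
```

The floor only matters when both values sit below code 1 of the sRGB curve; anywhere above that, the divisor is simply the larger of the two values.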
John
Black in the framebuffer is not black on screen (unless you sit in a completely dark room with an OLED screen showing a black image…). So when calculating the relative error close to black you have to take practical limitations of screens into account.
In this case it’s the “true” contrast ratio of the monitor, including both backlight bleeding and reflection of ambient light. If that ratio is, say, 400:1, then 0 is really 0.0025 on the screen – which is brighter than 8/255 in sRGB space is supposed to be.
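The arithmetic in that example can be checked with a short sketch. It assumes the standard sRGB decode and treats the screen's black level as simply 1 divided by the contrast ratio; the helper name is hypothetical.

```python
def srgb_to_linear(v):
    """Standard sRGB decoding of an encoded value v in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4


def highest_code_at_or_below_black(contrast_ratio):
    """Largest 8-bit sRGB code whose intended linear value does not exceed the black level."""
    black_level = 1.0 / contrast_ratio  # ASSUMPTION: black level = white level / contrast ratio
    code = 0
    while code < 255 and srgb_to_linear((code + 1) / 255.0) <= black_level:
        code += 1
    return code


black = 1.0 / 400.0
print(f"black level:  {black:.6f}")
print(f"sRGB code 8:  {srgb_to_linear(8 / 255.0):.6f}")
print(f"sRGB code 9:  {srgb_to_linear(9 / 255.0):.6f}")
print("highest code at or below black:", highest_code_at_or_below_black(400.0))
```

Under these assumptions the 400:1 black level lands between codes 8 and 9, so the lowest handful of codes are all darker on paper than the screen can actually show.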