Graphics researchers and games programmers don’t talk to each other. That’s just how it is. I can’t count the number of Siggraph papers that claim to be applicable to games. Naty has a post over at realtimerendering.com where he summarizes some of the discussion at I3D on this topic and adds some insights of his own. I give everything in that post a solid +1. But my main complaint has to do with error metrics.
Here is a hypothetical for you. There is some existing technique out there that everyone uses, and I have an approximation for it. It could be a comparison of my lighting model vs. some established lighting model, or my HDR tonemapping algorithm vs. yours. Anything, really. My technique has 2% average error, and at any point the error is no more than 4% of the maximum absolute value. So is my technique good enough that it “solves” the problem? Of course not. If you are claiming that your technique is “visually indistinguishable” from another technique, absolute error metrics are basically worthless. And unfortunately, absolute error is the metric that almost everyone in the graphics research community uses.
The classic example that I use for how error metrics can be deceiving is the Xbox 360 PWL Gamma Curve. Here is a graph of the gamma 2.2 curve, the sRGB curve, and the Xbox 360 curve. I’ve mentioned this elsewhere on this site. You can click on it for the full-res version.
So, is the Xenon (Xbox 360) curve a “good” approximation of the sRGB curve? Well, let’s look at them side-by-side. Here is a color checker with sRGB on the top and Xenon on the bottom.
You know, those look kind of close. I’ll bet that most people would tell you that those two images look the same, unless you gave them a hint. What about the error?
First, we’ll start with the normalized average absolute error. For each x value, take the difference between the two curves, add all the differences up, and divide by the total number of x values. Then divide by the maximum value of the curve, which in this case is 1.
What value do we get? 1.62%. HOORAY!!! Since the average error is actually only 1.62%, the 360 Gamma curve is clearly good enough. Ok, the average error is fine, but what’s the worst error? If we look at the curves, the absolute error between the curves peaks right around codes 135 and 136 at 3.71%.
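As a concrete sketch of that computation: the snippet below compares the real sRGB transfer function against a plain 1/2.2 power curve. Note the power curve is only a stand-in here, since the exact Xenon PWL breakpoints aren’t reproduced in this post — the point is the shape of the metric, not the exact numbers.

```python
def srgb_encode(x):
    # Standard sRGB transfer function (linear -> encoded).
    if x <= 0.0031308:
        return 12.92 * x
    return 1.055 * x ** (1 / 2.4) - 0.055

def approx_encode(x):
    # Stand-in approximation: a plain 1/2.2 power curve.
    # (The real Xenon PWL breakpoints aren't listed in this post.)
    return x ** (1 / 2.2)

n = 256
xs = [i / (n - 1) for i in range(n)]
diffs = [abs(approx_encode(x) - srgb_encode(x)) for x in xs]

avg_abs_error = sum(diffs) / n      # already normalized: the curve max is 1.0
max_abs_error = max(diffs)
print(f"avg abs error: {avg_abs_error:.4f}, max abs error: {max_abs_error:.4f}")
```

Both numbers come out tiny, which is exactly why this style of reporting looks so flattering.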
Based on these observations, you would assume that the Xenon PWL Curve is pretty good when compared to the sRGB curve. If these results were in a paper, they would probably be good enough to get accepted.
And for the record, this process is how 99% of Siggraph papers present their error metrics.
1. Show a graph of the error. They generally look pretty close.
2. List the average absolute error. In this case, 1.62%.
3. List the worst-case absolute error. In this case, 3.71%.
4. Show the before and after results side-by-side.
The catch? Unfortunately, this approach is very misleading. These are all the exact wrong metrics to use for evaluating this color profile. It definitely irks me that this is the accepted process. Here’s why.
Let’s do one more comparison. Now I’ll put both images interleaved with each other so that I can see the real difference.
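Interleaving is trivial to script. Here is a minimal pure-Python sketch (a hypothetical helper, assuming both images are the same size and represented as lists of rows of pixels) that alternates horizontal stripes between the two images:

```python
def interleave(img_a, img_b, stripe=8):
    # Alternate horizontal stripes between the two images so each
    # stripe of one image sits directly against the other image,
    # with no gap in between for the eye to "reset" across.
    out = []
    for y, (row_a, row_b) in enumerate(zip(img_a, img_b)):
        out.append(row_b if (y // stripe) % 2 else row_a)
    return out
```

With no gap between the stripes, the eye compares the two images directly instead of relying on (poor) visual memory across a gutter.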
WHOA!!!!! Is this the same comparison? That looks terrible. There is no way that you can say that the Xenon curve is a valid approximation of the sRGB curve.
Just 30 seconds ago, I used all the conventional comparison methods (graphs, average absolute error, maximum absolute error, and side-by-side comparisons) to show that these two profiles are “visually indistinguishable”. And with one more comparison, I showed you that that conclusion was completely wrong. It’s almost as if the error metrics that the academic community uses need some work…
As you really should know by now, the human eye perceives intensity logarithmically. If I turn on a 60W light bulb in a dark basement, that makes a huge difference. If I turn on that same light outdoors on a bright sunny day, you’ll barely notice it. There is absolutely no excuse for a researcher in computer graphics not understanding this concept.
When you talk about the difference between how the human eye perceives two different values (such as a color profile on a monitor), you have to look at the relative error. In other words, for each x value, take the difference between the two values, and then divide by the reference value. Do this for all x values, divide the sum by the number of samples, and you have your average relative error.
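In code, that recipe is a one-liner. This is a hypothetical helper (it assumes every reference value is non-zero; the zero case comes up in the solutions below):

```python
def avg_relative_error(reference, approx):
    # Per-sample relative error |a - r| / |r|, averaged over samples.
    # Assumes no reference value is zero.
    errors = [abs(a - r) / abs(r) for r, a in zip(reference, approx)]
    return sum(errors) / len(errors)
```

The point of going relative: missing a large reference by 10% and missing a tiny reference by 10% count equally, which matches how the eye weighs them.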
So I did this and made a text file of the absolute and relative error between the two curves. In that text file, the first column is the sRGB value, the second column is the 360 PWL Gamma value, the third column is the absolute error, and the fourth column is the relative error.
In the “interleaved” shot, the black value at the lower-right is by far the worst offender. It looks completely off. So let’s zoom in on the lower end of this curve.
There’s your problem. The lower-right square is code 33. The absolute error is quite small, only 1.7%. But the relative error is 112.7%. Also keep in mind that this image doesn’t show the worst offenders, which are actually around code 15. Does 1.7% seem like the correct number to use for your perceived error at that square?
Let’s also look at the error across the entire domain of the curve. The average relative error? 37.37%. Huh? That’s a lot higher than 1.62%. But that value makes sense: at the bottom end, your values often have 223% relative error. By using absolute error, you are very, very, very heavily under-representing the error in the bottom end of the curve.
In my opinion, error representation is the main cause of trouble in computer graphics. Suppose you were writing a paper about how awesome your fast approximation of the sRGB gamma function is. Would you rather claim that your technique has 1.62% error or 37.37% error? Honestly?
I cannot tell you how many times someone has told me something like: “You should just use this approximation, it is proven to only have 2% error.” And my reaction is: “Great, it’s only slightly worse than the 360 PWL gamma curve.” Of course, the technique could still be very good research with many great use cases. But to my eye (and to most artists), an approximation with 2% absolute error is usually not good enough to be useful, depending on the case of course.
So what’s the solution?
- The only way to evaluate “perceptual difference” between two images is to interleave them. Putting two images side-by-side is not enough. If you are a reviewer for a paper with side-by-side images, the first thing you should do is crop and combine them in Photoshop.
- If you want to say that your technique is indistinguishable from the correct solution, absolute error metrics are useless. Absolute error metrics are still valid for many uses, just not the perceptual difference between two things.
- Try to use relative error, or some variation of it. Sure, there are issues, such as what to do when the control value is zero and your value is non-zero. In that case, I would say it’s fine to use a metric that divides by the maximum of the absolute values of the control and your value, to avoid dividing by zero. But if the control value is 0 and your value is 0.001, I think it’s fair to say that your relative error at that point is 100%.
- Relative error will always be much, much larger than the absolute error. If you are an author, and you use relative error, then you have a much higher chance of being rejected. Getting funding is hard. So I actually don’t blame authors for using the error metrics that put their results in their best light. I put the onus on reviewers to start demanding relative error.
- No matter what you are studying, if your values differ by more than an order of magnitude, you really should be using some kind of relative error metric. Otherwise your lower values will be woefully under-represented. If you were creating an economic model for how tech companies like Facebook grow, would absolute error make sense? Being off by $1 billion at a $75 billion valuation is quite different from being off by $1 billion at a $100 million valuation.
- This should go without saying, but in order to say that two things look the same, you have to actually look at them. (-:
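The zero-denominator workaround suggested in the list above can be written as a small helper. This is just a sketch of that idea: divide by the larger magnitude of the two values instead of the control alone.

```python
def robust_relative_error(control, value):
    # Divide by the larger magnitude of the two values so a zero
    # control doesn't blow up. If control is 0 and value is 0.001,
    # this reports 100% error, matching the intuition above.
    if control == 0.0 and value == 0.0:
        return 0.0
    return abs(value - control) / max(abs(control), abs(value))
```

Note this metric is symmetric in its two arguments and is capped at 100%, which also makes worst-case numbers easier to interpret.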