Where are my subpixel triangles???

Here’s a question that comes up a lot. According to the spec sheet, the RSX in the PS3 can push something like 250 million triangles per second. Then you can do the math: if we are hitting 30 fps, then in theory we get 8.3 million triangles per frame. The resolution of a 720p display is 1280×720, or 921,600 pixels. That means we get about 9 triangles per pixel. At the top is a screenshot from Uncharted 2, and clearly we aren’t getting 9 triangles per pixel. So what gives?
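If you want to check the arithmetic yourself, here is the same back-of-the-envelope calculation, using only the numbers quoted above:

```python
# Back-of-the-envelope check of the spec-sheet math.
tris_per_second = 250_000_000                  # RSX spec-sheet peak
fps = 30
tris_per_frame = tris_per_second / fps         # ~8.33 million per frame
pixels_720p = 1280 * 720                       # 921,600 pixels
tris_per_pixel = tris_per_frame / pixels_720p  # ~9 triangles per pixel
```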

1. Spec Sheets are Meaningless.
The first thing you should know is that when a GPU spec sheet says “x triangles per second,” it is talking about the simplest possible triangle you can draw. When they calculate those numbers, they assume the vertex shader is doing nothing. In real games, though, the vertex shader has to do actual work. For a typical normal-mapping shader, the vertex shader needs to calculate positions, normals, binormals, and tangents. If you are doing skinning, that’s even more work on top. In other words, if you want to draw any kind of “real” triangle, you won’t get anywhere close to the maximum number the spec sheet promises.

2. Pass Times.
In a real frame, we aren’t spending all of our time drawing triangles in the scene. We have many passes where we don’t draw any triangles.

That image is from my GDC talk and shows the timeline for the scene. If you want to know what all the labels mean, check out my presentation in the GDC Vault. You can also click it for higher res. Anyway, the first 6 lines show what the SPUs are doing, and the next line, labeled GPU, shows what the GPU is doing.

On the GPU line, the purple block that takes about 7ms at the start renders depth and normals. The yellow block that takes about 8ms is the standard pass, where we do most of the drawing. Those two passes run over every object we draw in the scene. The rest of the frame goes to other passes such as shadows, particles, fog, and fullscreen copies. So we don’t actually have 33ms to draw all these triangles; in the standard pass, we only have 8ms to draw them.
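Scaling the earlier spec-sheet number by the pass time makes the gap obvious. A quick calculation, using the 8ms and 33ms figures above:

```python
# If only ~8ms of a 33ms frame goes to the standard pass, the triangle
# budget shrinks accordingly (numbers from the post).
peak_tris_per_frame = 250_000_000 / 30   # spec-sheet peak at 30 fps
standard_pass_ms = 8
frame_ms = 33
budget = peak_tris_per_frame * (standard_pass_ms / frame_ms)
# ~2 million triangles, even before counting any vertex-shader cost.
```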

3. Pixel Shading
Next up, there is a cost to drawing really small triangles. You might think that drawing lots of triangles is bad because of the cost of processing all those vertices, but the real cost is pixel shading. You would expect 1 triangle covering 25 pixels to cost about as much pixel shading as 25 triangles covering one pixel each. In reality, GPUs are designed to shade many pixels from a single triangle: they shade pixels in 2×2 quads, so if your triangle is really tiny, the GPU spends time shading pixels that eventually get discarded. That’s the real cost. If you have complicated pixel shaders and you draw lots of tiny triangles, pixel shading will slow you down.

That’s one thing that everyone should know. We have a fundamental tradeoff between pixel shading and how many triangles we draw. If we wanted to decrease the complexity of our pixel shaders, we could draw more triangles.

That’s about it. Even though you can theoretically draw many triangles per pixel, if your vertex shader actually does work, you won’t get anywhere close to that number. Our main rendering pass in Uncharted 2 usually takes about 8ms, so we get 8ms worth of triangles, not 33ms worth. And we have a fundamental tradeoff between triangles and pixel shading: we could increase the number of triangles in the game, but only by really decreasing the quality of the shading.

Now the big question: Will all of this get solved on PS4/Xbox 720???

8 Responses to “Where are my subpixel triangles???”

  1. Interesting post as usual John

    I think it is relatively safe to assume that the next consoles will have more unified architectures than the PS3. That is, the GPU will be able to better load balance between pixel shading and vertex processing. That is not to say that shading small triangles will be more efficient.

    Fundamentally, rasterization is more efficient the larger the triangles, but this efficiency curve is non-linear, with larger and larger triangles producing diminishing gains. I don’t see that changing without a significant shift in hardware architecture, and that just doesn’t seem to be on the horizon right now. I guess what you’re asking is: are we going to exceed some sweet spot next generation and descend into a horrible world where pixel shading just does not operate efficiently?

    Another question would be what proportion of your 33ms you will be able to spend on triangles next generation. Will we spend more time on post-processing? Hard to say: compute shaders can perform some post-processing tasks more efficiently; on the other hand, we will likely actually support 1080p, and artists may want higher-quality or new post-processing effects.

    Will deferred shading (as opposed to deferred lighting) be more attractive next generation? Deferred shading only needs to draw all those inefficient small triangles once. (I can’t remember what Uncharted used? Deferred lighting?) I suspect deferred shading may take over as the more common of the deferred techniques.

    Another factor might be tessellation: with displacement mapping, will we finally really start using tessellation? Will adaptive LOD help? It should. I can see this being viable, particularly to reduce art content costs.

    If we’re actually outputting at 1080p, will the triangles really be that small next generation? Yes probably.

    Best case we could be looking at:
    1. A single pass on all our triangles instead of two
    2. Proportionally less time on post processing
    3. Tessellation helping auto-LOD to keep triangles from getting too small

    I think it probable that we will have more than 8ms for scene shading

  2. When it comes to deferred rendering on the next generation of consoles, my big issue is lighting models. To make objects look right, you truly need custom shading models. When I look at current games, my biggest complaints have to do with shading, not triangle count.

    Also, it’s worth keeping in mind that ALU throughput is going to increase faster than bandwidth. To get true deferred shading, you need to store every parameter the shader needs in a buffer. It’s not too terrible for Blinn-Phong, since each shader just needs a custom exponent, which you can pack into a byte (in log space). But for, say, Kelemen-Szirmay-Kalos (from the NVIDIA human head), you need 4 weight params and rolloff.
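    As a sketch of that log-space byte-packing idea (my own guess at a scheme, not necessarily the one used in any shipping engine, and the 2048 upper bound is an assumption): store log2 of the exponent, scaled into 0–255.

```python
import math

# Hypothetical log-space packing of a Blinn-Phong specular exponent
# into one byte. MAX_EXP is an assumed upper bound on the exponent.
MAX_EXP = 2048.0

def pack_exponent(n):
    # Map exponents in [1, MAX_EXP] to bytes 0..255 in log2 space.
    return round(255 * math.log2(n) / math.log2(MAX_EXP))

def unpack_exponent(b):
    return 2.0 ** (b / 255 * math.log2(MAX_EXP))
```

    The log mapping spends the byte’s precision where it matters: relative error stays roughly constant across the whole exponent range instead of being terrible for small exponents.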

    For the anisotropic hair model, a deferred approach quickly gets out of control. We used Kajiya-Kay, where the shader computed two tangents, and each tangent had a separate specular power, scale, and color. So if we wanted to calculate that in a deferred pass, each tangent would be 3 bytes, and specular power, scale, and the color r/g/b channels are a byte each. For two tangents, we need 16 bytes of data per-pixel for just the material params on that model.
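    The byte count adds up like this:

```python
# Per-pixel G-buffer bytes for the two-tangent Kajiya-Kay setup
# described above.
tangent_dir = 3                 # packed tangent direction
spec_power = 1
spec_scale = 1
spec_color = 3                  # r/g/b, one byte each
per_tangent = tangent_dir + spec_power + spec_scale + spec_color
total = 2 * per_tangent         # two tangents -> 16 bytes per pixel
```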

    There are certainly lots of ideas for using custom material models with true deferred shading. But I still haven’t seen a convincing way to support Cook-Torrance, Kajiya-Kay, and other anisotropic models in a true deferred renderer. Looking to the future, I’m more worried about getting better lighting models than about maximizing triangle count.

    In U2, we had a prepass where we stored depth/normal/specular exponent, which would take about 6ms. And the standard pass was 8ms. We would use Blinn-Phong on all the deferred lights but could use custom models on the sun and character lights. Looking to the future, if using custom lighting models (for the main lights) means that I have to do 2 passes and thus halve my triangle count, so be it. That’s my choice, although I’m sure others would disagree.

  3. @admin:
    You linked an article, which claims that the RSX is actually clocked @ 500 MHz (instead of 550 MHz).
    I know it’s an old discussion, but I’m still interested.
    Do you know which number is right?

  4. Don’t know off-hand, but I think it’s 550.

  5. PS3 GPU RSX:

    Core: 500MHz / Memory: 650MHz

    (according to Unity3D & the ProDG Target Manager for PS3)

  6. Conventional deferred shading is obviously not scalable… it burns memory bandwidth which is “not good” going forward.

    That said, tile-based deferred shading is completely viable (where you batch up most/all lighting work for a given tile of the screen and do it all at once), and doesn’t have any real issues with materials. Dozens of BRDF parameters are just fine as you read them a very small number of times (even once).
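    A minimal sketch of the batching idea (the 16-pixel tile size, screen size, and circle-vs-tile bounds test are all my own assumptions): bin each light into the screen tiles it can touch, so shading later walks each tile’s short light list exactly once while the BRDF parameters stay resident.

```python
# Sketch of tile-based light binning. Each light is tested against a
# conservative screen-space AABB of tiles; shading would then loop over
# bins[tile] per tile instead of over all lights per pixel.
TILE = 16
WIDTH, HEIGHT = 1280, 720

def bin_lights(lights):
    """lights: list of (x, y, radius) circles in screen space."""
    tiles_x = (WIDTH + TILE - 1) // TILE
    tiles_y = (HEIGHT + TILE - 1) // TILE
    bins = [[] for _ in range(tiles_x * tiles_y)]
    for i, (lx, ly, r) in enumerate(lights):
        # Conservative bounding box of the light's screen extent.
        x0 = max(0, int((lx - r) // TILE))
        x1 = min(tiles_x - 1, int((lx + r) // TILE))
        y0 = max(0, int((ly - r) // TILE))
        y1 = min(tiles_y - 1, int((ly + r) // TILE))
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[ty * tiles_x + tx].append(i)
    return bins
```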

    Of course this benefits from somewhat more sophisticated hardware than current consoles provide, but I think it’s fair to assume at least a DX11 baseline for next gen consoles. Alternatively, you just move the majority of work to the CPU/SPUs.

  7. Not sure how that would scale, but it’s an interesting idea. You would end up with lots and lots of if statements, but that could work. I guess it depends on how far your lighting model changes. If it’s just several more parameters, you’re fine. But if you start doing lots of crazy things with lookup textures, it gets tricky. If you want to precalculate skin specular parameters like Eugene d’Eon and have lots of 4D tables for anisotropic BRDFs, it doesn’t scale. I guess it depends on what people actually want to do with the extra horsepower.

  8. Well, there’s nothing magical about what happens with forward rendering. Doing lots of different materials and mapping them onto small triangles is no more efficient there either. Incoherent regions get masked out either way, whether via the coverage mask or the control-flow mask. Conventional hardware has handled the former slightly better, but there’s not really a big difference on modern GPUs. “Real” function calls (and switches via jump tables) are going to be reasonable too… they’re already supported on Fermi.

    You’re fundamentally just rescheduling your computation to be more coherent over lights. For any amount of texture lookups or inputs to your BRDF, they can be stored and read. Storing them and reading them back once or a small number of times is not very expensive, even for dozens of them. Thus I don’t think there’s any practical limitations on suitably flexible hardware (say DX11+).

    It’ll be interesting to see what happens, but I don’t think it’s realistic to assume that the generic hardware-scheduled path (pixel shaders + forward rendering) will be suitably efficient. You’re going to want to schedule based on knowledge of your particular renderer, not based on how the input and rasterizer chooses to split/output primitives.