Stencil MLAA

category: code [glöplog]
Smash talked about doing some sort of ultra-fast stencil MLAA on his blog (http://directtovideo.wordpress.com/). My raymarching renders are in desperate need of some anti-aliasing, so I ported the mlaa.com shader over to OpenGL. Or tried to; I couldn't get the weights working. It also didn't seem to make that much of a difference, as most of my scenes had diagonal lines that MLAA really didn't like. Looks like Smash's version solves that one too. I like how Smash manages to improve the speed of everything he uses, but still makes demos that kill my GPU :). So any hints on how to do this?
added on the 2011-05-27 11:31:55 by Mewler Mewler
it's really simple. you just target the compomachine spec, not that of the average user, and max it out for that - aim for best possible in realtime on best available spec.

oh sorry, you meant hints on mlaa. well - only edge pixels are affected by mlaa, so write a stencil mask pass after edge detection and use it to mask the rest of the mlaa passes. the stencil test runs before the pixel shader on modern gpus and is very fast, so the cost of the masked passes scales linearly with the masked pixel count - more effective than a branch.
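
A rough CPU-side sketch of that idea, in Python rather than real GPU code. The function names (`detect_edges`, `run_masked_pass`) and the simple neighbour-difference edge test are made up for illustration and are not taken from any actual MLAA implementation. It only shows that once the mask exists, each later pass touches the masked pixels and nothing else.

```python
# Toy CPU model of a stencil-masked post-process pass (not real GPU code).
# All names and the edge test are illustrative assumptions.

def detect_edges(image, threshold=0.1):
    """Mark a pixel as an edge if it differs from its left or top neighbour."""
    h, w = len(image), len(image[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            here = image[y][x]
            left = image[y][x - 1] if x > 0 else here
            top = image[y - 1][x] if y > 0 else here
            if abs(here - left) > threshold or abs(here - top) > threshold:
                mask[y][x] = True
    return mask

def run_masked_pass(mask, shader):
    """Run `shader` only where the mask is set, the way a stencil test would.
    Returns how many times the shader was invoked."""
    invocations = 0
    for y, row in enumerate(mask):
        for x, masked in enumerate(row):
            if masked:
                shader(x, y)
                invocations += 1
    return invocations

# 8x8 image: left half black, right half white, so one vertical edge.
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
mask = detect_edges(image)
edge_pixels = sum(sum(row) for row in mask)

# Two follow-up passes reuse the same mask without re-running edge detection.
pass1 = run_masked_pass(mask, lambda x, y: None)
pass2 = run_masked_pass(mask, lambda x, y: None)
print(edge_pixels, pass1, pass2, len(image) * len(image[0]))  # 8 8 8 64
```

Out of 64 pixels only the 8 on the edge ever reach the "shader", and the count stays the same for every pass that reuses the mask.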
added on the 2011-05-27 11:55:47 by smash smash
Hey thanks for the fast reply, love Blunderbuss BTW. :)

So yea, I'll look into stencil masks, seems like a good solution to other problems too, though I don't totally get why they would be much faster than a -

if(color == vec4(0.0))

- at the beginning of the shader.

One more question though, how do you calculate the weights? The other implementation I looked into used a texture lookup to calculate the weights, but I could never work out how that worked and always got broken renders.
added on the 2011-05-27 12:14:56 by Mewler Mewler
Mewler: The point is that it happens in fixed function logic, before a fragment shader is even started. So there's no shader-cycles spent on the compare, and you don't get any "pipeline bubbles".

But it's not a big deal, the point is that MLAA is slow to perform for every pixel, so earlying out is a good idea. Having a compare per pixel isn't that big of a deal, and it scales only with your resolution (i.e. not scene complexity). Smash's PS3 background probably makes him more worried about these things than the average demo-coder.
added on the 2011-05-27 12:21:13 by kusma kusma
OK, re-reading what I just said, I think I underplayed the importance of the stencil approach by just looking at the difference in ALU usage; it's a good optimization especially because hierarchical stencil testing will work together with the rasterizer in modern GPUs, so your GPU will be able to early out of large areas of the screen in one go. But it's not going to make your engine super-fast, it's just not going to cripple the performance much either; expect the actual rendering to be much more significant for your overall performance.
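
To make the "large areas in one go" part concrete, here is a toy Python model of hierarchical rejection. It assumes one "is any pixel in this tile set?" check per 4x4 tile; the tile size and the `count_tests` helper are illustrative assumptions, not a description of how any particular GPU implements it.

```python
# Toy model of hierarchical rejection: one coarse check per tile, and only
# tiles that contain masked pixels get per-pixel tests. Tile size and names
# are illustrative assumptions.

def count_tests(mask, tile=4):
    h, w = len(mask), len(mask[0])
    tile_checks = 0
    pixel_checks = 0
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            tile_checks += 1
            rows = mask[ty:ty + tile]
            if not any(any(row[tx:tx + tile]) for row in rows):
                continue  # whole tile rejected with a single check
            pixel_checks += sum(len(row[tx:tx + tile]) for row in rows)
    return tile_checks, pixel_checks

# 16x16 mask with a single vertical edge at column 5.
mask = [[x == 5 for x in range(16)] for _ in range(16)]
tile_checks, pixel_checks = count_tests(mask)
print(tile_checks, pixel_checks, 16 * 16)  # 16 64 256
```

In this model 12 of the 16 tiles are thrown away with one check each, so the work is 16 tile checks plus 64 pixel checks instead of 256 pixel checks.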
added on the 2011-05-27 12:28:18 by kusma kusma
I guess it depends on whether it goes before pixel/quad thread generation. I wouldn't think so?
added on the 2011-05-27 12:44:59 by Psycho Psycho
Psycho: Yeah, but that's the whole point of hierarchical stencil testing. If your hardware doesn't do hierarchical stencil, then you won't get that part of the benefit. But who has such hardware these days?
added on the 2011-05-27 13:44:22 by kusma kusma
Should that be done before or after post processing (DOF, motion blur, etc)?
added on the 2011-05-27 16:45:04 by xernobyl xernobyl
it's really simple. you just target the compomachine spec, not that of the average user, and max it out for that - aim for best possible in realtime on best available spec.

Ok. Almost choked on pizza :D
added on the 2011-05-27 16:47:38 by leGend leGend
kusma's reply pretty much covered it, but think of it like the difference between branching around a function call in C vs earlying out of the function: you still pay the cost of the function call etc.
it's more complicated tho and it depends on the hw, but in many hw implementations shader threads execute in groups; all threads in the group block until the longest one has finished. so if any pixels in that group pass the branch, all of them have to wait for it to finish. (simplified explanation :) ). in general branches make it harder to schedule. also the stencil works for all subsequent passes; for a branch it needs retesting each time.
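
A toy cost model of that group behaviour, as a hedged Python sketch. It assumes groups of 8 threads, a made-up cost of 1 for a thread that skips the work and 20 for one that does it, and that pixels surviving the stencil test pack perfectly into groups (real hardware will not pack that neatly). `branch_cost` and `stencil_cost` are hypothetical names.

```python
import math

# Toy cost model: threads run in lockstep groups, and a group is busy for as
# long as its slowest thread. Group size and costs are made-up numbers.

def branch_cost(flags, group=8, cheap=1, expensive=20):
    """Every pixel launches a thread and branches inside the shader."""
    total = 0
    for i in range(0, len(flags), group):
        # One pixel taking the expensive path stalls the whole group.
        total += expensive if any(flags[i:i + group]) else cheap
    return total

def stencil_cost(flags, group=8, expensive=20):
    """Only pixels that pass the stencil test launch a thread at all.
    Assumes the survivors pack perfectly into groups."""
    survivors = sum(flags)
    return math.ceil(survivors / group) * expensive

# 64 pixels, every 8th one is an edge pixel: the worst case for the branch.
flags = [i % 8 == 0 for i in range(64)]
print(branch_cost(flags), stencil_cost(flags))  # 160 20
```

With one edge pixel per group the branch version pays the expensive cost in all 8 groups, while the masked version runs a single full group.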
also don't assume discard really earlies out. it could just kill the raster of the pixel: there's no guarantee execution of the shader ends there, or that the discard even gets placed by the compiler where you expect. always check the asm output.. (possible when using hlsl, of course. glsl is for lamers)
added on the 2011-05-28 18:08:07 by smash smash