pouët.net

Optimizing Closure

category: code [glöplog]
The wonderful thing about there being no standard binary intermediate format for shaders (as far as I know) is that the shaders for commercial games can be opened up in any text editor. I'm not sure if it's appropriate to dump code from a copyrighted game here, but if it isn't, I imagine deleting this thread would not be too difficult for the admins here.

This is an updated version of a Flash game I really liked, but the dev seems to have met his art/style objectives by brute force, IMO.

{Copyright Eyebrow, of course.
Any /* */ comments are my additional commentary.}

This is "hinnerglow.cg"
Code:
const float px = 1.0/1920.0*2.0*1.3;
const float py = 1.0/1080.0*2.0*1.3; /* Designed specifically for 1080p!? */

void main(float2 texCoord : TEXCOORD0,
          sampler2D tex : TEXUNIT0,
          uniform float2 scale,
          out float4 oColor : COLOR)
{
    float glow = 0.0;
    glow += tex2D(tex, float2(texCoord) + float2( 0.0*px/scale.x, 0.0)).r * (1.0/( 0.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 1.0*px/scale.x, 0.0)).r * (1.0/( 1.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 2.0*px/scale.x, 0.0)).r * (1.0/( 4.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 3.0*px/scale.x, 0.0)).r * (1.0/( 9.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 4.0*px/scale.x, 0.0)).r * (1.0/(16.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 5.0*px/scale.x, 0.0)).r * (1.0/(25.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 6.0*px/scale.x, 0.0)).r * (1.0/(36.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2( 7.0*px/scale.x, 0.0)).r * (1.0/(49.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-1.0*px/scale.x, 0.0)).r * (1.0/( 1.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-2.0*px/scale.x, 0.0)).r * (1.0/( 4.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-3.0*px/scale.x, 0.0)).r * (1.0/( 9.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-4.0*px/scale.x, 0.0)).r * (1.0/(16.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-5.0*px/scale.x, 0.0)).r * (1.0/(25.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-6.0*px/scale.x, 0.0)).r * (1.0/(36.0+1.0));
    glow += tex2D(tex, float2(texCoord) + float2(-7.0*px/scale.x, 0.0)).r * (1.0/(49.0+1.0));
    float b = (glow/2.888);
    oColor = float4(b, b, b, b);
}

I take it this means 15 texture samples per pixel;
no wonder it won't run reasonably on the Radeon 4200 IGP in my laptop.

Just wow... I have zero shading experience, so is this a normal way for a dev to do what he is trying to do? Would a typical shader compiler turn all those constant divisions into multiplies?
I can already imagine 1.0/scale.x could be done once instead of 15 times.
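Something like this, I imagine - an untested sketch along the lines of the original shader, with the reciprocal hoisted out and the 1/(i*i+1) weights baked into constants (I have no idea whether the real compiler already does this on its own):

Code:
const float px = 1.0/1920.0*2.0*1.3;

void main(float2 texCoord : TEXCOORD0,
          sampler2D tex : TEXUNIT0,
          uniform float2 scale,
          out float4 oColor : COLOR)
{
    /* one division instead of 15 */
    float step = px / scale.x;
    /* precomputed 1.0/(i*i + 1.0) for i = 0..7 */
    const float w[8] = { 1.0, 0.5, 0.2, 0.1, 1.0/17.0, 1.0/26.0, 1.0/37.0, 1.0/50.0 };
    float glow = tex2D(tex, texCoord).r * w[0];
    for (int i = 1; i <= 7; i++)
    {
        float o = float(i) * step;
        glow += tex2D(tex, texCoord + float2( o, 0.0)).r * w[i];
        glow += tex2D(tex, texCoord + float2(-o, 0.0)).r * w[i];
    }
    float b = glow / 2.888;  /* same normalization as the original */
    oColor = float4(b, b, b, b);
}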

Any other ideas?
added on the 2012-09-17 06:54:48 by QUINTIX
if it's a simple Gaussian blur (from what I can see), you can just do it with fewer samples into a smaller rendertarget, and then your texture filtering will add an additional step of blur for free.
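for example, something like this (rough, untested sketch - the half-resolution rendertarget setup happens on the application side; 'halfRes' and 'texelSize' are made-up names):

Code:
/* 5-tap horizontal blur meant to run at half resolution; the bilinear
   upsample back to full resolution adds a bit more blur for free */
void main(float2 texCoord : TEXCOORD0,
          sampler2D halfRes : TEXUNIT0,     /* hypothetical half-res glow buffer */
          uniform float2 texelSize,         /* 1.0 / half-res dimensions */
          out float4 oColor : COLOR)
{
    float glow = tex2D(halfRes, texCoord).r * 0.4;
    glow += tex2D(halfRes, texCoord + float2(     texelSize.x, 0.0)).r * 0.2;
    glow += tex2D(halfRes, texCoord + float2(    -texelSize.x, 0.0)).r * 0.2;
    glow += tex2D(halfRes, texCoord + float2( 2.0*texelSize.x, 0.0)).r * 0.1;
    glow += tex2D(halfRes, texCoord + float2(-2.0*texelSize.x, 0.0)).r * 0.1;
    oColor = float4(glow, glow, glow, glow);
}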
added on the 2012-09-17 10:35:49 by Gargaj
At least they are using a separable version of it, which complexity-wise is a pretty good idea - they are not going totally brute force there ;)

This might be another good idea to speed things up:
http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/
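Applied to the shader above, the core trick from that article looks roughly like this (untested sketch; the offsets and weights are just the original 1/(i*i+1) weights merged pairwise via (t1*w1 + t2*w2)/(w1+w2), and it requires linear filtering on the sampler):

Code:
/* 9 bilinear taps instead of 15 point taps, same normalization by 2.888 */
void main(float2 texCoord : TEXCOORD0,
          sampler2D tex : TEXUNIT0,
          uniform float2 scale,
          out float4 oColor : COLOR)
{
    const float px = 1.0/1920.0*2.0*1.3;
    float step = px / scale.x;
    float glow = tex2D(tex, texCoord).r * 1.0;
    /* taps 1 and 2 merged */
    glow += tex2D(tex, texCoord + float2( 1.2857*step, 0.0)).r * 0.7;
    glow += tex2D(tex, texCoord + float2(-1.2857*step, 0.0)).r * 0.7;
    /* taps 3 and 4 merged */
    glow += tex2D(tex, texCoord + float2( 3.3704*step, 0.0)).r * 0.15882;
    glow += tex2D(tex, texCoord + float2(-3.3704*step, 0.0)).r * 0.15882;
    /* taps 5 and 6 merged */
    glow += tex2D(tex, texCoord + float2( 5.4127*step, 0.0)).r * 0.06549;
    glow += tex2D(tex, texCoord + float2(-5.4127*step, 0.0)).r * 0.06549;
    /* tap 7 left alone */
    glow += tex2D(tex, texCoord + float2( 7.0*step,    0.0)).r * 0.02;
    glow += tex2D(tex, texCoord + float2(-7.0*step,    0.0)).r * 0.02;
    float b = glow / 2.888;
    oColor = float4(b, b, b, b);
}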
added on the 2012-09-17 11:48:44 by las
I always end up downscaling to get blur. I take the average of 4 pixels, and do this a few times until image_height < threshold. Then I apply one blur to the final downscaled image. Is there any reason not to use this? Isn't it faster than anything else? (never tested it against anything else)
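Roughly like this per downscale step, I guess (untested sketch, assuming linear filtering on the source; 'src' is a made-up name, and you re-run it with the previous result as input until the target is small enough):

Code:
/* Rendering into a target with half the width and height, the interpolated
   texCoord of each destination pixel lands on the corner shared by a 2x2
   block of source pixels, so one bilinear fetch returns their average. */
void main(float2 texCoord : TEXCOORD0,
          sampler2D src : TEXUNIT0,
          out float4 oColor : COLOR)
{
    oColor = tex2D(src, texCoord);
}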
added on the 2012-09-17 12:06:32 by musk
do not modify the lookup coords in the fragment shader. it should be faster if you calculate the sample points in the VS and pass them to the FS. this should allow the GPU to do the texture lookups 'before the FS', as the lookup positions then do not depend on your FS code.
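something along these lines (untested sketch with just 3 taps to show the shape of it; the names are made up, and for all 15 taps you would burn quite a few interpolators):

Code:
/* fullscreen-quad VS writes the tap coordinates into varyings,
   so the FS does no math on the lookup coords at all */
void vs_main(float4 pos : POSITION,
             float2 texCoord : TEXCOORD0,
             uniform float2 offset,        /* e.g. (px/scale.x, 0) */
             out float4 oPos : POSITION,
             out float2 tap0  : TEXCOORD0,
             out float2 tapP1 : TEXCOORD1,
             out float2 tapM1 : TEXCOORD2)
{
    oPos  = pos;
    tap0  = texCoord;
    tapP1 = texCoord + offset;
    tapM1 = texCoord - offset;
}

void fs_main(float2 tap0  : TEXCOORD0,
             float2 tapP1 : TEXCOORD1,
             float2 tapM1 : TEXCOORD2,
             sampler2D tex : TEXUNIT0,
             out float4 oColor : COLOR)
{
    /* the coords arrive ready-made from the interpolators,
       so these are non-dependent texture reads */
    float glow = tex2D(tex, tap0).r
               + tex2D(tex, tapP1).r * 0.5
               + tex2D(tex, tapM1).r * 0.5;
    oColor = float4(glow, glow, glow, glow);
}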
added on the 2012-09-17 12:14:08 by rale
Pretty good answers here. Question: are some of you involved in game development, or have you just made too many demos?
added on the 2012-09-17 12:26:17 by Tigrou
Rale: You mean to compute the sample positions in the VS of a fullscreen quad(/triangle), store them into varyings and then let the interpolation deliver all your sample positions on the fly? Not a bad idea.

But I don't think that the sample position computation is the main bottleneck - it might be a good additional optimization - especially if you target low end hardware/mobile.

The main bottleneck is still the texture lookups, and basically all major optimizations for that have been mentioned here already (separation of the kernel, downsampling, abusing linear sampling to get 2 samples for the price of 1 - without separation along X/Y you could even go for 4 samples, but for larger kernel sizes the computational complexity will show its ugly face).

Besides that, there are nice approaches for box filters with almost arbitrary kernel size using a summed area table.
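Very roughly (untested sketch, glossing over the off-by-one and precision details - the summed area table itself has to be built in a separate pass into a float texture; 'sat', 'texel' and 'radius' are made-up names):

Code:
/* sat(x,y) holds the sum of all pixels with coordinates <= (x,y), so the
   average over any box is four fetches regardless of kernel size */
void main(float2 uv : TEXCOORD0,
          sampler2D sat : TEXUNIT0,
          uniform float2 texel,        /* 1.0 / SAT dimensions */
          uniform float2 radius,       /* half box size in texels */
          out float4 oColor : COLOR)
{
    float2 r  = radius * texel;
    float sum = tex2D(sat, uv + r).r
              - tex2D(sat, uv + float2(-r.x,  r.y)).r
              - tex2D(sat, uv + float2( r.x, -r.y)).r
              + tex2D(sat, uv - r).r;
    float area = 4.0 * radius.x * radius.y;   /* approximate box area */
    float b = sum / area;
    oColor = float4(b, b, b, b);
}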

Quote:

Question: are some of you involved in game development, or have you just made too many demos?

No / Maybe. And other reasons. ;)
added on the 2012-09-17 12:53:01 by las
rale: true for certain (mobile) GPUs only. :)
added on the 2012-09-17 12:58:24 by smash
:)
added on the 2012-09-17 13:01:19 by dv$
@rale, las, smash: Yeah, that sounds totally like black magic to me.
graga: no, it's very sound advice on a certain kind of GPU.
added on the 2012-09-17 13:20:58 by smash
So to sum it up: you don't want to do dependent texture lookups in the fragment shader on deferred rendering GPUs.
added on the 2012-09-17 13:32:38 by pommak
It is in a way funny to see you all talking about "a certain mobile GPU" and "deferred rendering GPU" - strong NDAs anyone?

I'm not under NDA and I guess I could name the thing we are talking about - but let's leave it as an exercise for the interested reader to figure out which certain mobile GPU is bad at doing these things.

Back to it - the Closure guys are not doing it horribly wrong - so I recommend you either pick the hardware solution or just fix those shaders to do almost nothing. You won't have the nice glow stuff then, but at least you could play the game.
added on the 2012-09-17 14:31:54 by las
another optimization you might want to try is to sort the samples left to right, for cache purposes. it helped me with SSAO in the past.

it would be cool if GPUs had some sort of prefetch, like _mm_prefetch in SSE, so that when doing your blur you'd touch sample number 7 first before proceeding to accumulate samples 0, 1, 2, 3, 4, 5, 6; then when you continued with samples 8, 9, 10, 11, 12, 13 that data would already be in cache, ready for sampling thanks to the initial touch of sample 7. or perhaps GPU caches don't work this way anyway, i've no idea what i'm talking about really.

but yeah, sorting left to right helped me a bit in the past.
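for the innerglow shader above that would mean something like this (untested sketch - same taps, weights and normalization as the original, just issued strictly left to right):

Code:
void main(float2 texCoord : TEXCOORD0,
          sampler2D tex : TEXUNIT0,
          uniform float2 scale,
          out float4 oColor : COLOR)
{
    const float px = 1.0/1920.0*2.0*1.3;
    float step = px / scale.x;
    float glow = 0.0;
    /* taps go -7, -6, ..., +6, +7 so consecutive fetches are memory neighbours */
    for (int i = -7; i <= 7; i++)
        glow += tex2D(tex, texCoord + float2(float(i)*step, 0.0)).r
              * (1.0 / (float(i)*float(i) + 1.0));
    float b = glow / 2.888;
    oColor = float4(b, b, b, b);
}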
added on the 2012-09-17 19:27:20 by iq
iq: I kind of lost you somewhere... what are you trying to sort?
added on the 2012-09-17 22:39:53 by TLM
that's a good one @iq. like sampling the scanline with a linear memory precache. i'm sure, but i dunno if the compiler optimizes that anyway. ;)
added on the 2012-09-17 22:52:23 by yumeji
Can someone please explain?
added on the 2012-09-18 09:21:41 by TLM
Read up on how caches work.

Basically it works like this: when reading stuff from memory, depending on the hardware, extra data following the address you accessed is transferred into the processor cache for faster access (so it's stored locally and you don't need to go out over the bus to get it). So when you sample the texture in order from left to right, the GPU might (I don't know enough of the hardware specifics to be more exact) already have the texels in its internal cache and thus save some cycles that would otherwise be spent on fetching.
added on the 2012-09-18 09:41:12 by Preacher
GPUs used to do texture swizzling, right?
added on the 2012-09-18 09:52:06 by the_Ye-Ti
Nice! I want to believe that the shader compiler is smart enough to do it internally for the above code...
added on the 2012-09-18 09:52:26 by TLM
iq: that's difficult to assume these days because there's so much running on the gpu at any point in time. when that first texture read occurs in the shader what actually happens is there's like 63 other shader threads all doing the same instruction but on different parts of the texture.
you gain some benefit from the cache here because some of those different threads hit the same cache lines, but something like prefetch would be pointless on a modern desktop architecture.

modern desktop gpu architectures assume very low cache hit rates. the way gpus reduce the impact of memory reads on performance is through latency hiding: having lots and lots of jobs in flight to choose from at any time, and being able to swap jobs as soon as one stalls on a memory access.

added on the 2012-09-18 10:42:34 by smash
mu6k: the downsampling isn't necessary anymore nowadays!
i just use the full res and apply like 128 steps of HypnoGlow to it, for example.
all of this is almost for free, doesn't hit the framerate at all! all on the GPU via HLSL of course!
Quote:
all of this is almost for free, doesn't hit the framerate at all!

I tend to disagree. Just because it works for your 4k intros with a trivial glow post-process at real-time framerates doesn't mean it is free, or even "almost for free".
Blurs are (or can be - you can always screw things up) fast nowadays, agreed, but not free at all. You can get away with brute force (on fast enough hardware and depending on what you want to do) - but maybe you should have read this thread - it was about speed optimizations.
added on the 2012-09-18 15:31:43 by las
yes, i didn't read the thread to the end when i posted, but read my post again carefully: i answered mu6k on his question, nothing more!
