
Fast approximations

category: code [glöplog]
@xernobyl What's your platform?
If your exponent is an integer, it might be faster to do repeated multiplication, particularly if the exponent is fixed. For maximum speed, repeatedly square your parameter to get the powers x^(2^n), then multiply the needed ones together according to the bit pattern of the exponent.
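A minimal sketch of that square-and-multiply idea (the name powi and the non-negative integer exponent are my assumptions):

Code:
float powi(float x, unsigned n)
{
    float result = 1.0f;
    while (n) {
        if (n & 1)       //this bit of the exponent is set...
            result *= x; //...so this repeated square contributes
        x *= x;          //x, x^2, x^4, x^8, ...
        n >>= 1;
    }
    return result;
}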

For any exponent, you need to dig into the floating point format.
pow(a,b)=exp2(log2(a)*b)
log2(a) = log2(mantissa_a) + exponent_a
exp2(c): the integer part of c becomes the exponent (top bits), the fractional part maps to the mantissa (bottom bits)
You can speed up log2 and exp2 by using tables or less accurate polynomial approximations for the mantissa<->logarithm conversions.

But this depends on how easy it is to access the floating point representation directly. You might lose a lot of speed there.
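As a rough sketch of the whole pipeline (assuming IEEE-754 single precision, bit access via memcpy, a > 0, results in the normal float range, and the crudest possible linear mantissa approximation; a table or polynomial refines it):

Code:
#include <stdint.h>
#include <string.h>

static float fast_pow(float a, float b)
{
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);  //raw float representation
    //bits/2^23 - 127 approximates log2(a) (exact at powers of two)
    float log2a = (float)bits * (1.0f / 8388608.0f) - 127.0f;
    float e = b * log2a;             //log2 of the result
    //invert the same mapping to approximate exp2(e)
    bits = (uint32_t)((e + 127.0f) * 8388608.0f);
    float r;
    memcpy(&r, &bits, sizeof r);
    return r;
}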
sin(x) ~ x near 0 (the error is o(x²), the next term being -x³/6)
added on the 2012-11-25 18:09:02 by Bartoshe Bartoshe
I implemented an approximated sin when I made a demo with SSE.

My idea for how to make an approximated sin/cos function:

Approximate it with polynomial:
f(x) = a0*x^0 + a1*x^1 + a2*x^2 + ... + an*x^n

Approximating the whole sin function with a single polynomial is impractical.
So approximate a part of sin (e.g. x in [-PI, PI])
and [mirror] copy it for other x.

There are many choices when you make an approximated function.
The first choice is n (a natural number).
The next choice is how to get (a0, a1, ..., an).
Once you have n and (a0, a1, ..., an), you have a function.

A larger n makes the function more precise but requires more calculations.
n = 2~3 would be enough for applications which don't require high precision.

You get (a0, a1, ..., an) by making restrictions and solving the resulting system of equations.
Example of restrictions:
f(0) = 0, f(PI/2) = 1
f'(0) = 1, f'(PI/2) = 0
f''(0) = 0, f''(PI/2) = -1
Make the choice by considering how sin is used in your code.
sin can be used for sound synthesis, drawing circles or lines, animation, setting up rotation matrices, plasma, random number generation, etc.

After you make enough restrictions to get a solvable system of equations,
solve it with pen and paper.
You get (a0, a1, ..., an) and your own approximated sin! A worked example follows below.
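With n = 2 and the restrictions f(0) = 0, f(PI/2) = 1, f'(PI/2) = 0, pen and paper give a1 = 4/PI and a2 = -4/PI², the well-known parabola approximation. A sketch (function name is mine), valid on [-PI, PI]:

Code:
#include <math.h>

static float approx_sin(float x) //x in [-PI, PI]
{
    const float a1 = 4.0f / (float)M_PI;
    const float a2 = -4.0f / (float)(M_PI * M_PI);
    return a1 * x + a2 * x * fabsf(x); //x*|x| instead of x^2 keeps the function odd
}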
added on the 2012-11-26 08:29:52 by tomohiro tomohiro
Quote:
hfr: that's, like, soooooo outdated info, man :)

Kusma, probably true. So what do you think is different now? (except that you don't use lookups anymore).
added on the 2012-11-26 09:52:37 by hfr hfr
Dunno if all of these are faster, what the error is etc... Go try them.
Code:
//from: http://www.devmaster.net/forums/showthread.php?t=5784
inline float cosf(float x)
{
    //bring into range [-1,1]
    x = (x + (float)M_PI_2) * (float)M_1_PI;
    //wrap around
    volatile float z = (x + 25165824.0f); //must be volatile else it will be optimized out
    x -= (z - 25165824.0f);
    //low precision approximation (remove this return to use the higher precision path below):
    return 4.0f * (x - x * fabs(x));
    //higher precision path: one more iteration
    float y = x - x * fabs(x);
    const float Q = 3.1f;
    const float P = 3.6f;
    return y * (Q + P * fabs(y));
}

Code:
//from: http://www.devmaster.net/forums/showthread.php?t=5784
inline float sinf(float x)
{
    //bring into range [-1,1]
    x *= (float)M_1_PI;
    //wrap around
    volatile float z = (x + 25165824.0f); //must be volatile else it will be optimized out
    x -= (z - 25165824.0f);
    //low precision approximation (remove this return to use the higher precision path below):
    return 4.0f * (x - x * fabs(x));
    //higher precision path: one more iteration
    float y = x - x * fabs(x);
    const float Q = 3.1f;
    const float P = 3.6f;
    return y * (Q + P * fabs(y));
}

Code:
inline float acosf(float x)
{
    return sqrtf(1.0f-x) * (1.5707963267f + x*(-0.213300989f + x*(0.077980478f + x*-0.02164095f)));
}

Code:
inline float asinf(float x)
{
    //asin(x) = PI/2 - acos(x)
    return (float)M_PI_2 - sqrtf(1.0f-x) * (1.5707963267f + x*(-0.213300989f + x*(0.077980478f + x*-0.02164095f)));
}

Code:
inline float invsqrtf(float x)
{
    float xhalf = 0.5f * x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i >> 1);
    x = *(float*)&i;
    x *= (1.5f - xhalf * x * x);
    //x *= (1.5f - xhalf * x * x); //repeat Newton-Raphson step for higher precision
    return x;
}

Code:
inline float fabs(float x)
{
    int f = ((*(int *)(&x)) & 0x7FFFFFFF);
    return *(float *)(&f);
}

Code:
//from: http://ldesoras.free.fr/doc/articles/rounding_en.pdf
inline float ceilf(float x)
{
    static float roundTowardsPI = -0.5f; //use 0.5f to round to nearest integer
    int i;
    __asm
    {
        fld   x
        fadd  st, st
        fsubr roundTowardsPI
        fistp i
        sar   i, 1
        neg   i
    }
    return (float)i;
}

Code:
//from: http://ldesoras.free.fr/doc/articles/rounding_en.pdf
inline float floorf(float x)
{
    static float roundTowardsMI = -0.5f; //use 0.5f to round to nearest integer
    int i;
    __asm
    {
        fld   x
        fadd  st, st
        fadd  roundTowardsMI
        fistp i
        sar   i, 1
    }
    return (float)i;
}

Code:
//from: http://ldesoras.free.fr/doc/articles/rounding_en.pdf
inline float truncf(float x)
{
#if defined(USE_SSE3) || defined(USE_SSE4)
    int i;
    __asm
    {
        fld    x
        fisttp i //truncating integer store, needs SSE3
    }
    return (float)i;
#else
    static float roundTowardsMI = -0.5f; //use 0.5f to round to nearest integer
    int i;
    __asm
    {
        fld   x
        fadd  st, st
        fabs
        fadd  roundTowardsMI
        fistp i
        sar   i, 1
    }
    return (float)(x < 0 ? -i : i);
#endif
}

Code:
//From: http://www.iquilezles.org/www/articles/sfrand/sfrand.htm
float randf(int * seed)
{
    float res;
    seed[0] *= 16807;
    *((unsigned int *) &res) = ( ((unsigned int)seed[0])>>9 ) | 0x40000000;
    return (res - 3.0f);
}


plus some more stuff here
added on the 2012-11-26 10:40:20 by raer raer
@xernobyl/GoingDigital: On nowadays GPUs there is native exp2 and log2 support internally, so pow is basically for "free" if you have enough stuff happening around that code..
added on the 2012-11-26 13:50:22 by toxie toxie
@T21: in your cbrt, why do you use an approximation for dividing by 3? just divide by 3 and then look at the generated asm.. pooof.. it vanished.. :)
added on the 2012-11-26 13:56:57 by toxie toxie
@toxie Division is much more expensive than multiplication on some platforms. Not sure what platform T21 is on, but for the Raspberry Pi GPU here are rough timings from my experiments so far:
multiply/add/subtract : 0.5/1 cycle
exp/log: 10 cycles (definitely not free)
division/sqrt: 15 cycles ouch!
trig: 15-30 cycles

Big speedups to be had by multiplying reciprocals rather than dividing.
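For instance, hoisting a division out of a loop (a trivial sketch with hypothetical in/out buffers):

Code:
void scale_by(float *out, const float *in, int n, float d)
{
    const float inv_d = 1.0f / d; //one division, hoisted out of the loop
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * inv_d;   //a multiply per element instead of a divide
}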
To me "nowadays GPUs" has an implicit "high-end" somewhere in there :)
added on the 2012-11-26 15:13:18 by msqrt msqrt
"nowadays GPUs" includes 4-year-old cards (and I think even older than that), so.. ;)

@GoingDigital: (constant) divides can be transformed into multiplies (and should be by every smart compiler nowadays), see http://www.hackersdelight.org/magic.htm
added on the 2012-11-26 15:23:34 by toxie toxie
Quote:
divides can be transformed into multiplies (and should be by every smart compiler nowadays)

Sure, but it's numerically different and must be enabled in the compiler settings.
The code in question approximates a 32bit integer division, though (which requires a 64bit multiply).
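For the curious, a sketch of what that transform looks like for a constant divide by 3 (magic constant per Hacker's Delight; the 64-bit multiply is explicit here):

Code:
#include <stdint.h>

static uint32_t div3(uint32_t x)
{
    //x/3 == (x * 0xAAAAAAAB) >> 33 for all 32-bit x, since 0xAAAAAAAB == (2^33 + 1)/3
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}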
added on the 2012-11-26 16:36:27 by hfr hfr
hfr: fragment shaders do more than just multiply-add these days, so no series expansions any more. For instance, all modern NV shader units do trig as two-instruction pairs (presin/sin etc).
added on the 2012-11-26 16:41:09 by kusma kusma
Nowadays platforms are a mix of desktop and mobile, so the Raspberry Pi GPU actually IS pretty much a current-gen GPU. If you're working on such platforms (I am, so I'm very interested in this.. although I've already optimised out all the trig in all my current stuff as it happens) it's definitely relevant :)

I'll have to benchmark the current ios devices for this stuff at some point, it'd definitely help with optimising.
added on the 2012-11-26 18:57:19 by psonice psonice
toxie: it doesn't just vanish, but it does look better :)

But I don't think the same mul-shift trick can be done with SSE?
We would need a _mm_mulhi_epi32(), and that doesn't exist.
(Ultimately, all of those would become SSE versions working on vec4f)
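The closest workaround I can see is emulating the unsigned variant from SSE2's _mm_mul_epu32 (a signed version needs extra corrections); a sketch, helper name mine:

Code:
#include <emmintrin.h> //SSE2

//high 32 bits of the four unsigned 32x32->64 products
static __m128i mulhi_epu32(__m128i a, __m128i b)
{
    __m128i lo = _mm_mul_epu32(a, b);                  //products of lanes 0 and 2
    __m128i hi = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                               _mm_srli_epi64(b, 32)); //products of lanes 1 and 3
    lo = _mm_srli_epi64(lo, 32);                       //keep high halves in lanes 0 and 2
    hi = _mm_slli_epi64(_mm_srli_epi64(hi, 32), 32);   //...and in lanes 1 and 3
    return _mm_or_si128(lo, hi);                       //merge back into lane order
}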

I'm looking at collecting a set of math functions that gives me GPU-level precision, as fast as possible on the CPU.
added on the 2012-11-26 19:00:48 by T21 T21
You are looking for this or this or this or this or this. Man. Try google...
added on the 2012-11-26 22:20:36 by raer raer
The first link doesn't seem to even tackle sin/cos, and does this for rsqrt() : div(sqrt(x))
This breaks my goal of "GPU level precision, as fast as possible on the CPU"

There is no shortage of crazy stupid approximations, as well as well-written high precision approximations.
The trick is collecting the SOTA functions scattered all over the place.
Look at the thread on sin() approximation.. the conclusion is "don't use a Newton-Raphson series" - yet what does 99% of the code on 'google' do?

I think I found two so far: a sin() and a cbrt().
My next one will be pow() (via exp2/log2).

Plenty of references, no shortage of that. Well, thanks for even more links :)

My point again: it's almost 2013, and I wish this had been settled by now.

added on the 2012-11-27 01:56:57 by T21 T21
psonice: all of Tegra, Adreno, Mali and SGX have native log2/exp2, so there's no need to benchmark; we already know that the VideoCore is amazingly bad compared to the others in the mobile space.
added on the 2012-11-27 09:53:26 by kusma kusma
Code:The first link doesn't seem to even tackle sin/cos, and does this for rsqrt() : div(sqrt(x)) This breaks my goal of "GPU level precision, as fast as possible on the CPU"

It is actually ~50% of your goal.

Code:I think I found two so far a sin() and cbrt() My next one will be pow() (via exp log2)

You've found cos(), pow(), and invsqrt() too...

Please do post a link to your stuff when you're done, for a definitive answer for everybody...
added on the 2012-11-27 10:10:04 by raer raer
or "quote"... :/
added on the 2012-11-27 10:10:59 by raer raer
Also take a look at this PDF.
added on the 2012-11-27 12:15:38 by raer raer
kusma: thanks, that's a useful doc (the SGX one, no need for the others yet) :)

Actually, I should have thought of this before: imgtec's shader editor gives cycle counts per line. With a little test program (compiler set to SGX543):

Low precision:
Code:
sin: 14 cycles
cos: 15 cycles
tan: 18
asin: ? (probably 22 - the line count is missing, but 22 cycles get added at the end)
acos: (probably 22)
atan: (probably 32)
pow: 3
exp: 3
MADD: 1
/: 2
min, max: 1


Medium or high precision:
Code:
sin: 10 cycles
cos: 11 cycles
tan: 13
asin: ? (probably 22)
acos: (probably 21)
atan: (probably 33)
pow: 3
exp: 3
MADD: 1
/: 2
min, max: 1


Medium / high precision looks faster, but I suspect it's because the GPU is combining processing units - so you get higher shader throughput, but fewer shader units doing the work, so it's slower overall. However, I'll test that to make sure - faster sin/cos could be useful at some point.

On the SGX 535 the results were pretty similar, sin/cos a bit slower (~17), asin/acos slower at ~30.

So really it's only the trig stuff that's slow, and it's probably worth optimising out where possible on these GPUs.
added on the 2012-11-27 13:52:54 by psonice psonice
raer, so you see my point :) The info is out there, but all over the place. Hence the thread.

From your link (I do read them when I have the time), this was proposed to be more accurate than what you posted.

Code:
#include <stdint.h>

float inv_sqrt(float x)
{
    union { float f; uint32_t u; } y = { x };
    y.u = 0x5F1FFFF9ul - (y.u >> 1);
    return 0.703952253f * y.f * (2.38924456f - x * y.f * y.f);
}


max rel_error
0.001752 original 0x5F3759DF
0.000650 Řrřola 0x5F1FFFF9
0.000366 rsqrtss

added on the 2012-11-27 16:04:40 by T21 T21
abs alternative (only one mul !)
Code: float abs(float x) { return sqrt(x*x); }
added on the 2012-11-27 22:27:29 by Tigrou Tigrou
FAST!11!!!!ONE1!
added on the 2012-11-28 09:30:30 by raer raer
Somehow related http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0363d/CJAJIDFD.html

the table at the bottom might be useful for some.
added on the 2012-12-01 02:13:18 by xernobyl xernobyl
