pouët.net

Fast aproxximations

category: code [glöplog]
I'm looking for the ultimate collection of trig and math function to run on the CPU that are at least of GPU accuracy (medium quality profile, gaming card).

nvidia posted a bunch of reference function, but I'm not sure they are the best
http://http.developer.nvidia.com/Cg/cos.html

Looking around I haven't found a single shrine to make sacrifices to the math hacker gods.
I have to check again, but I think Sony might have the most complete set.

Here are 3 I have so far... I would be interested to see better or more useful approximation, and get references source.
I think the sin() one is from "Nick" , not sure about the other two.

Code: const float pi_f = 3.14159274f; __inline float _cbrt(float x) { float y; int i = *(int *)&x; // View x as an int. i = i/4 + i/16; // Approximate divide by 3. i = i + i/16; i = i + i/256; i = 0x2a5137a0 + i; // Initial guess. y = *(float *)&i; // View i as float. y = 0.33333333f*(2.0f*y + x/(y*y)); // Newton step. y = 0.33333333f*(2.0f*y + x/(y*y)); // Newton step again. return y; } __inline float _sin(float x) { const float B = 4/pi_f; const float C = -4/(pi_f*pi_f); const float P = 0.225; float y = B * x + C * x * abs(x); y = P * (y * abs(y) - y) + y; return y; } __inline float _acos(float n) { const float h = 1.0f / 6.0f; const float j = 3.0f / 40.0f; const float k = 5.0f / 112.0f; const float l = 35.0f / 1152.0f; float u2 = n * n; float u3 = u2 * n; float u5 = u3 * u2; float u7 = u5 * u2; float u9 = u7 * u2; return pi_f/2 - n - h * u3 - j * u5 - k * u7 - l * u9; }
added on the 2012-11-23 11:37:04 by T21 T21
Yes, the sin is from Nick, there is even an improved quality version. Here's the original thread : http://www.devmaster.net/forums/showthread.php?t=5784
Also nicely works in fixed point :)
added on the 2012-11-23 12:14:34 by MsK` MsK`
for my shaders I've used degree 5 equation of sinus and it gives power to my shaders to make sinus and cosinus.
added on the 2012-11-23 22:17:26 by Bartoshe Bartoshe
Why? I was under the impression the GPU can do sin() and cos() ...
added on the 2012-11-23 22:28:50 by trc_wm trc_wm
If you have proper GLSL code replacements for sin / cos / acos / atan etc. with appropriate accuracy + faster than using the intrinsics - please share those functions :)
added on the 2012-11-23 23:16:21 by las las
it was for vs-1-1 coding with my own HLL but it may not be faster and needs another cst for modulo -pi,pi
added on the 2012-11-24 00:07:22 by Bartoshe Bartoshe
here is a fast "approximation" of cos() :D

Code:cos(float x) { return abs(abs(x)/PI*2 % 4 - 2) - 1; }
added on the 2012-11-24 00:50:34 by Tigrou Tigrou
ouch.
added on the 2012-11-24 00:56:37 by Preacher Preacher
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
cos(x) = 1 - x²/2! + x^4/4! - x^6/6! + ... - ...
added on the 2012-11-24 08:48:39 by Bartoshe Bartoshe
Tigrou: you might wanna &3 instead of %4, yes?
added on the 2012-11-24 08:51:52 by maytz maytz
every compiler nows that trick, no need to hardcore it.
added on the 2012-11-24 11:39:06 by MsK` MsK`
MsK`: I guess my msvc compiler settings are all wrong then, since I have to hard code this in every 4k/1k intro :)
added on the 2012-11-24 11:58:24 by maytz maytz
Code: unsigned int D = GetTickCount(); if ( D % 8 == 0 ) printf("lulz"); if ( (D & 3) == 0 ) printf("lulz");

Code: unsigned int D = GetTickCount(); 00F1A993 call dword ptr [__imp__GetTickCount@0 (0F2105Ch)] if ( D % 8 == 0 ) 00F1A999 mov edi,dword ptr [__imp__printf (0F21270h)] 00F1A99F mov dword ptr [ebp-58h],eax 00F1A9A2 test al,7 00F1A9A4 jne HierarchicalMarchingCubes::Recurse+63h (0F1A9B3h) printf("lulz"); 00F1A9A6 push offset string "lulz" (0F65930h) 00F1A9AB call edi 00F1A9AD mov eax,dword ptr [ebp-58h] 00F1A9B0 add esp,4 if ( (D & 3) == 0 ) 00F1A9B3 test al,3 00F1A9B5 jne HierarchicalMarchingCubes::Recurse+71h (0F1A9C1h) printf("lulz"); 00F1A9B7 push offset string "lulz" (0F65930h) 00F1A9BC call edi


I guess so :/
added on the 2012-11-24 12:17:28 by MsK` MsK`
Around page 73 in this:
http://www.agner.org/optimize/optimizing_cpp.pdf
document there is a nice table of optimizations in different compilers.
added on the 2012-11-24 12:56:06 by kt kt
Out of interest, how is sin/cos/etc implemented on the GPU? Does the compiler break it down into something else? And what's the performance cost (in general that is).
added on the 2012-11-24 17:11:02 by psonice psonice
psonice: since the alu is only able to do multiply-add, all trigonometric functions are replaced by taylor approximations at compile time . In the times of shader model 2 this let the shader code hit the maximum instruction count significantly earlier and table lookups where an order of magnitude faster.
If you have a rough idea of how much precision your operations actually require, it still makes perfectly sense to care about this manually.
added on the 2012-11-24 19:20:20 by hfr hfr
I was always told that modern gpus do sin and cos in only one cycle, well, since the xbox360/ps3 era. Can they execute several madds in one cycle ?
added on the 2012-11-24 20:05:45 by MsK` MsK`
Nick latest posted version.
A version can be made without the Wrap if the input is known to be -PI to PI
A version can be made without the normalizing if the input is already in the -1 to 1

From what I see this might be the uncontested SIN COS function for the CPU that require 'GPU class' precision?
If we all agree, lets move to the next one... pow() ?

I'm specially interested in pow(x, CONSTANT); mainly for gamma processing.

Code: float sin(float x) { const float Q = 3.1f; const float P = 3.6f; // Convert the input value to a range of -1 to 1 x = x * (1.0f / PI); // Wrap around volatile float z = (x + 25165824.0f); x = x - (z - 25165824.0f); float y = x - x * abs(x); return y * (Q + P * abs(y)); }
added on the 2012-11-24 22:08:04 by T21 T21
hfr: that's, like, soooooo outdated info, man :)
added on the 2012-11-25 01:35:05 by kusma kusma
Here is the SSE2 version of cuberoot.
I think this match GPU shader accuracy ?

Code: const __m128 sign = _mm_castsi128_ps(_mm_set1_epi32(0x80000000)); const __m128 oneThird = _mm_set1_ps(0.33333333f); const __m128i magic = _mm_set1_epi32(0x2a5137a0); __inline __m128 _cbrt(__m128 x) { __m128 y, s; __m128i i; s = _mm_and_ps(x, sign); // sign(a) x = _mm_xor_ps(x, s); // abs(a) // Initial guess i = _mm_castps_si128(x); i = _mm_add_epi32(_mm_srai_epi32(i,2), _mm_srai_epi32(i,4)); i = _mm_add_epi32(i, _mm_srai_epi32(i,4)); i = _mm_add_epi32(i, _mm_srai_epi32(i,8)); i = _mm_add_epi32(i, magic); y = _mm_castsi128_ps(i); // Second & third iteration y = _mm_mul_ps(oneThird, _mm_add_ps(_mm_add_ps(y,y), _mm_div_ps(x, _mm_mul_ps(y,y)))); y = _mm_mul_ps(oneThird, _mm_add_ps(_mm_add_ps(y,y), _mm_div_ps(x, _mm_mul_ps(y,y)))); // Restore sign y = _mm_or_ps(y, s); return y; }
added on the 2012-11-25 05:44:18 by T21 T21
BTW, I almost fee lucky that Intel is not making us do some bit manipulation to do a subtract...
"You dont need a sub instruction, you can negate the sign and do an add"

"We gave have you sub... you want a HW version of sincos ??? who do you think we are GPU designers! "

me < no happy wasting time re-inventing the wheel.
added on the 2012-11-25 05:54:32 by T21 T21
This is much faster than the intrinsic on some lowly GPUs (especially Videocore IV)
Code: float sinf(float x){ x*=0.159155; x-=floor(x); float xx=x*x; y=-6.87897; y=y*xx+33.7755; y=y*xx-72.5257; y=y*xx+80.5874; y=y*xx-41.2408; y=y*xx+6.28077; return x*y; } float cosf(float x){ return sinf(x+1.5708); }

And heres the Python to calculate the coefficients, so you can try higher/lower orders:
Code: from numpy import * order=5 x=arange(0.00015,1,0.0001) y=sin(2*pi*sqrt(x))/sqrt(x) w=sqrt(x)/sin(sqrt(x)) coeffs=polyfit(x,y,deg=order,w=w) poly=poly1d(coeffs) print """ float sinf(float x) { x*="""+'%.6g'%(.5/pi)+"""; x-=floor(x); float xx=x*x;""" print " float y="+'%.6g'%poly[order]+";" for i in range(order): print " y=y*xx+"+'%.6g'%poly[order-i-1]+";" print """ return x*y; } float cosf(float x) { return sinf(x+"""+'%.6g'%(pi/2)+"""); } """

This is based on the Sony paper
@psionice To view some of the internal intrinsic implementations on Videocore IV, do a 'strings' on https://github.com/raspberrypi/firmware/blob/master/boot/start.elf and look near the end. Trig functions are all implemented in plain GLSL.
sin/cos/tan use a minimax polynomial followed squaring the sin/cos vector twice to multiply the angle by 4 and correct the vector length back 1.
asin/acos/atan use a CORDIC method - keep squaring the vector, and watch which quadrant it moves to.
CAN I HAS FAST POW?
added on the 2012-11-25 17:22:56 by xernobyl xernobyl

login