does sse suck!!??? :: pouët.net

category: general [glöplog]

some time ago i wrote a nice realtime raytracer with
single precision fpu operations. but some arithmetic operations
were really slow, like fsqrt or fdiv.
so i decided 2 buy a nice new computer with some more power and simd 2 support.
i thought the acceleration would be great, because of 4 operations at once and approximations, but my results were real crap.
is it not possible to write a raytracer with sse because of the absolute error of the approximations?
i'm very frustrated, because i wanted to present a nice demo at breakpoint 2003.

anyone know a solution?

added on the 2003-03-28 16:28:02 by lunatic

...problem solved...
there is a possibility to raise precision with some additional operations.

see:
http://www.agner.org

added on the 2003-03-28 21:05:22 by lunatic

i'm not aware of any precision limitations with SSE or SSE2. The operands are supposed to be the same precision as IEEE floats or doubles.

added on the 2003-03-29 07:35:48 by legalize

the problem was the reciprocal approximation.
the precision is just 12 bit... too little for primary ray intersection calculations.
but with the Newton-Raphson formula u can improve
precision to 23 bit.
newton-raphson with reciprocal approximation is
still faster overall than using divps or sqrtps.
here are the formulas:

square root reciprocal:
x0 = RSQRTSS(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0) (raised precision)

reciprocal:
x0 = RCPSS(d)
x1 = x0 * (2 - d * x0) = 2*x0 - d * x0 * x0

simple square root:
sqrt(x)= x*rsqrt(s)

added on the 2003-03-29 20:59:04 by lunatic
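
The Newton-Raphson formulas in the post above can be sketched in C as follows. This is a minimal illustration, not lunatic's actual code: `coarse_12bit`, `refine_rcp`, `refine_rsqrt` and `sqrt_via_rsqrt` are hypothetical names, and `coarse_12bit` only imitates a roughly 12-bit seed by masking mantissa bits, so the sketch does not depend on SSE hardware. In real SSE code the seed would come from RCPSS or RSQRTSS (for example through the `_mm_rcp_ss` and `_mm_rsqrt_ss` intrinsics in `<xmmintrin.h>`).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the accuracy of RCPSS/RSQRTSS: clears the low
   12 mantissa bits of an accurate value, leaving about 12 significant bits. */
static float coarse_12bit(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    bits &= 0xFFFFF000u;        /* keep sign, exponent, top 11 mantissa bits */
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* One Newton-Raphson step for 1/d: x1 = x0 * (2 - d * x0).
   If x0 = (1 - e) / d, then x1 = (1 - e*e) / d: the relative error is squared. */
static float refine_rcp(float d, float x0)
{
    return x0 * (2.0f - d * x0);
}

/* One Newton-Raphson step for 1/sqrt(a): x1 = 0.5 * x0 * (3 - a * x0 * x0). */
static float refine_rsqrt(float a, float x0)
{
    return 0.5f * x0 * (3.0f - a * x0 * x0);
}

/* sqrt(x) = x * (1/sqrt(x)). If the reciprocal square root of 0 comes back
   as infinity, 0 * infinity is NaN, so x = 0 needs separate handling. */
static float sqrt_via_rsqrt(float x, float rsqrt_x)
{
    return x * rsqrt_x;
}
```

After one step the error is on the order of the square of the seed error, which is consistent with the jump from 12 bit to about 23 bit described above.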

:-)

s=x

added on the 2003-03-29 21:00:09 by lunatic

:-)

s=x

added on the 2003-03-29 21:00:16 by lunatic

Use geometrical raytracing and not conventional raytracing. It is much faster, and needs less precision.

added on the 2003-03-30 11:43:14 by texel

Does SSE suck? Compared to what? AltiVec? yes ;)

added on the 2003-03-30 13:41:31 by Scali

i use geometrical raytracing:
on my mobile p4 2.8 i get a framerate of ~25
at a resolution of 640*200

on my unoptimized version:
www.lunatic-site.de/rayasm.exe

there is no lighting and there are no shadows visible because
of reflection (depth 10)

sse does not suck anymore :-) (compared to fpu precision)

added on the 2003-03-31 23:12:34 by lunatic

Hi again LuNAtiC. Sorry, but your raytracer looks very slow (though it does look optimized for size). Do you know what I mean by geometrical raytracing? I got 15 fps fullscreen at 320x240 on a PII 233, with 16 spheres, some of them reflecting. I used integer calculations throughout. Geometrical raytracing makes that possible, because 32-bit integers do not seem to be enough for conventional raytracing. By geometrical raytracing I mean using rotations to simplify the calculation, reducing the equations by one degree, so the sphere intersection is just a first-degree equation. Doing the rotations and the first-degree intersection is faster to do the second-degree equations, and you also save one sqrt since you don't need to normalize the ray vectors.

As for the depth of 10 in reflection, your scene looks as if it would be the same with a depth of 4-5, and I'm sure it would be about as fast with only 2-3, since the deeper reflections only cover a very small part of the screen. Anyway it looks beautiful. Why don't you try to do some better water? Perlin noise works very well for water distortion, or at least use more harmonics in the water; right now it looks too "sinusoidal".

added on the 2003-04-01 01:55:34 by texel

I wrote "is faster to do the second-degree equations" and I meant to write "is faster than to do..."

added on the 2003-04-01 01:56:50 by texel

i don't know how u can reduce the equation by one degree. can u give an example?
my rays aren't normalized. they are just direction vectors. and i solve for the scalar i need 2 multiply them with to get the entry point.
the sqrt is needed to get the entry/leaving point of the sphere. how can u resolve this duality any other way than with sqrt?
does it work with other objects as well? for example the ellipsoid is very important for me to build some complex objects with booleans.
the raytracer is not optimized at all. not in speed and not in size, but it's fully written with nasm.
i'm trying to create a 4k for breakpoint 2003.
using apack it is just 2.8kb.
but refraction, booleans, octree and sound are not implemented yet.
the water is bumped by one animated texture (256*256). it must be tileable, that's why it looks a bit linear.

added on the 2003-04-01 10:21:52 by lunatic

That raytracer example was very beautiful. I am really eager to watch both lunatic's and texel's upcoming demos when they are finally released!

to texel: I got your email, thanks for that. I like big emails and this one is the nicest and most interesting I have got in ages. I will reply to you a bit later, whenever I am free again, cause my PC fucked up again, I have to finish that CPC demo and I am pretty busy these days..

added on the 2003-04-01 11:32:04 by Optimus

uhhh.. i think once upon a time people actually used fixed point and table lookups for vector rotations etc.

i'm not sure why my mind brought that up. ignore.

added on the 2003-04-02 01:00:27 by 216
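
For readers who have not seen the technique mentioned in the post above, here is a rough C sketch of a fixed-point vector rotation driven by a lookup table. Everything in it is an assumption for illustration: the 16.16 format, the names `fixed`, `fixed_mul`, `SIN_TABLE` and `rotate2d`, and the deliberately tiny 8-entry table in 45-degree steps (a table in a real demo would normally have many more entries).

```c
#include <stdint.h>

typedef int32_t fixed;          /* 16.16 fixed point: 65536 represents 1.0 */
#define FIXED_ONE 65536

/* Multiply two 16.16 values using a 64-bit intermediate to avoid overflow. */
static fixed fixed_mul(fixed a, fixed b)
{
    return (fixed)(((int64_t)a * b) / FIXED_ONE);
}

/* sin(k * 45 degrees) in 16.16; 46341 is 0.70710678 * 65536, rounded. */
static const fixed SIN_TABLE[8] = {
    0, 46341, 65536, 46341, 0, -46341, -65536, -46341
};

static fixed fixed_sin(unsigned step) { return SIN_TABLE[step & 7u]; }

/* cos(a) = sin(a + 90 degrees), i.e. two table steps further on. */
static fixed fixed_cos(unsigned step) { return SIN_TABLE[(step + 2u) & 7u]; }

/* Rotate (x, y) counter-clockwise by step * 45 degrees, in place. */
static void rotate2d(fixed *x, fixed *y, unsigned step)
{
    fixed c = fixed_cos(step);
    fixed s = fixed_sin(step);
    fixed nx = fixed_mul(*x, c) - fixed_mul(*y, s);
    fixed ny = fixed_mul(*x, s) + fixed_mul(*y, c);
    *x = nx;
    *y = ny;
}
```

Rotating (1.0, 0.0) by two steps gives (0.0, 1.0) with integer arithmetic only; no floating point is involved at run time.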

i remember those times, too...
...but those times are past since pentium fpu performance

added on the 2003-04-02 01:22:49 by lunatic

Hi again LuNAtiC. Geometrical raytracing uses the advantages of parallel raytracing. For example, in parallel raytracing your ray vector is always (0,0,1), so it is very easy to check for intersections. So what you do is rotate the whole world so that, in the rotated world, the ray vector is (0,0,1). That works very well if you have a fast way to rotate, as SSE should be. It works very well with integers.

As for fpu performance, it is nowhere near as fast as pure integer programming. I mean fixed point math with variable precision and the best optimization possible. That way, and without mmx or that shit, you can get about 3 to 4 times the performance of using only the fpu, even on new pentiums or athlons. It is much harder to code this way... but it is very good. About sse, I suppose that if it accelerates fpu calculations, it will reach about the same power as good integer code, but with more precision, obviously. But if you use mmx, then I think pure integer code with mmx will be the fastest.

As for why people don't use fixed point and so on nowadays, the reasons are obvious: accelerators that save you from converting floats to integers (a very slow task), the high power of new computers (if you need to rotate and translate 20,000 vertices, for example, it is not a problem at all), and things like that. But in any case, integer calculation is the fastest if you are doing software rendering.

added on the 2003-04-02 01:39:47 by texel
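
One way to read the rotation idea in the post above, sketched in C: once the world has been moved so that the ray starts at the origin and points along (0,0,1), a sphere can be tested using little more than the x and y of its center. This is an interpretation, not texel's code; `vec3` and `hits_sphere_on_z_ray` are hypothetical names, and the rotation itself is assumed to have been applied already.

```c
typedef struct { float x, y, z; } vec3;

/* Ray: starts at the origin and points along +z, i.e. direction (0,0,1).
   c: sphere center, already transformed into this ray-aligned space.
   Returns 1 if the ray hits the sphere, 0 otherwise. No sqrt, no division. */
static int hits_sphere_on_z_ray(vec3 c, float r)
{
    /* Squared distance from the z axis to the sphere center. */
    float d2 = c.x * c.x + c.y * c.y;

    /* Squared half-length of the chord the z axis cuts through the sphere. */
    float h2 = r * r - d2;
    if (h2 < 0.0f)
        return 0;               /* the axis passes outside the sphere */

    /* The exit point lies at z = c.z + sqrt(h2) and must be in front of the
       origin. c.z + sqrt(h2) > 0 holds when c.z > 0 or c.z * c.z < h2. */
    return c.z > 0.0f || c.z * c.z < h2;
}
```

The entry depth itself is c.z - sqrt(h2), so in this sketch a sqrt is still needed for rays that actually hit; the saving is that misses are rejected with a few multiplications.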

ok... i c it is a nice method 2 render spheres, because the shape is the same from every side u look at it. but as i said, this seems 2 b fast just 4 spheres. calculating a simple orthogonal plane is much easier in a non-rotated space, because it is a simple division u have 2 calculate just once per ray. anyway it is possible 2 calc planes this way as well, but ellipsoids.... forget it.
it was nice work 2 develop the formula for the ellipsoid as it is, but rotated, i think it is beyond my imagination.
i think it is more complex then.
one vehicle i constructed from those simple objects (most of them just cut something out of others)
consists of:
9 planes
7 cylinders
4 ellipsoids
and just 3 spheres

it will not b worth it if this method just profits from spheres. maybe combining the 2 methods is optimal.
i will check this out when i've got some time 2 take pencil & paper and think about it.
...hmmm juggler maybe works this way... yeah it's really fast

added on the 2003-04-02 08:01:54 by lunatic

this is probably a little basic for what you want to do, but there is some info on using SSE for raytracing here

added on the 2003-04-02 13:40:14 by keiichi

lunatic, it is very fast for cylinders too: just calculate the point-to-line distance (the orthogonal distance) if the cylinder's axis vector is not (0,0,1); if it is, treat it just like a sphere. And it is better for metaballs too. Well, it is good for ellipsoids too... just use your geometry knowledge, it is just a projection! Well, I'm not sure whether it is faster for ellipsoids or not... but in any case, if we suppose that the rotations are nearly free (I mean, accelerated by sse or getting the high speed of integers or some other way), then the geometrical method is always better. But when you are rendering a plane, the rotation may cost too much, so the geometrical process is not a good choice there.
I'm doing something for a demo, with about 500 textured cubes and some spheres and reflection... you will get a look at it soon... maybe before the 20th of this month if I have enough time to finish the demo.

added on the 2003-04-02 19:52:20 by texel
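
The cylinder case in the post above can be sketched in C along the same lines, again as an interpretation rather than texel's code. With the ray along the z axis, an infinite cylinder is hit when the distance between the z axis and the cylinder's axis is at most the radius. `vec3` and `z_line_hits_cylinder` are hypothetical names; the sketch treats the ray as a full line and ignores end caps.

```c
typedef struct { float x, y, z; } vec3;

/* p: any point on the cylinder axis, d: axis direction (need not be
   normalized), r: radius. All given in ray-aligned space, where the ray
   runs along the z axis through the origin.
   Returns 1 if the z axis passes through the infinite cylinder. */
static int z_line_hits_cylinder(vec3 p, vec3 d, float r)
{
    /* n = d x (0,0,1) = (d.y, -d.x, 0); its squared length is below. */
    float n2 = d.x * d.x + d.y * d.y;

    if (n2 == 0.0f) {
        /* Axis parallel to the ray: same 2D circle test as for a sphere. */
        return p.x * p.x + p.y * p.y <= r * r;
    }

    /* The distance between the two lines is |p . n| / |n|. Comparing the
       squared values avoids both the sqrt and the division. */
    float pn = p.x * d.y - p.y * d.x;
    return pn * pn <= r * r * n2;
}
```

Because only squared quantities are compared, the axis direction does not have to be normalized, which matches the point made earlier in the thread about avoiding normalization.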

i am looking forward to it

added on the 2003-04-02 21:38:52 by lunatic

isn't the size of sse opcodes a disadvantage for writing a 4k intro?

just asking ;-)

added on the 2003-04-03 12:56:18 by nystep

Well nystep, if it takes one SSE instruction to perform something that would require 8 x86 instructions, I guess it's beneficial.

added on the 2003-04-03 13:04:04 by 33

unlike with mmx, u can use fpu and sse at once.
so it's not a problem 2 take the shorter opcodes if there are just single scalar multiplications.
i don't know, but maybe using both at once will be parallelized in the pipeline, so there is also an advantage.. i have 2 read more about that and check it out.

added on the 2003-04-03 13:46:38 by lunatic

I worked out the Newton-Raphson iterations for 1/x and 1/sqrt(x) for someone:

http://board.win32asmcommunity.net/showthread.php?s=80adfb5722b8039539ad2b43f5132f4e&threadid=12094

They might be useful here, as SSE/3dnow! do the same... This is how you can refine the outcome of the approximations.

added on the 2003-04-03 19:33:13 by Scali

We want it to be clickable of course, because we're lazy

added on the 2003-04-03 19:34:12 by Scali
