pouët.net

Help for vec3/4 Library Speedtest C/Linux

category: code [glöplog]
tl;dr
Test this.
https://github.com/Kabelmaulwurf/vec3

Hi guys,
after getting bored of implementing my vec3/4 stuff again and again, I wanted to make a "library" out of it.
So I made one and added SSE support with inline asm.
But I noticed major performance differences between my machines and wondered why.
So I wanted to run a broad field test on as many machines as possible.

And this is where you come in.
I need you to test my library :)

Testing is as easy as
1. clone the git repo
2. make
3. run ./test.sh
4. pastebin results

Analysis of the results will follow, like this:
[image]

Would be nice to have some results and suggestions.
Also, every tester receives a beer the next time we meet :)

https://github.com/Kabelmaulwurf/vec3

P.S. Architecture/testing info will follow soon, but you've got the ugly source anyway :P

Tried compiling under OSX; however, it doesn't support -masm, so I just substituted that. Doesn't compile anyway:

Code:
gcc -c -Wall -Wextra -O3 -funroll-loops -nasm=intel -c -o SpeedTest.o SpeedTest.c
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `addps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `subps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `mulps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `divps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `movups'
make: *** [SpeedTest.o] Error 1


added on the 2011-12-12 23:26:58 by Movi
Also, I guess I'm too stupid to figure out how to fix your makefile for my Gentoo:

Code:
gcc -o SpeedTest SpeedTest.o
SpeedTest.o: In function `v4_Length':
SpeedTest.c:(.text+0x228): undefined reference to `sqrtf'
SpeedTest.o: In function `v4_Normalize':
SpeedTest.c:(.text+0x28f): undefined reference to `sqrtf'
collect2: ld returned 1 exit status
make: *** [SpeedTest] Error 1


I know I'm supposed to add -lm to the link step for SpeedTest.o somewhere, but it doesn't seem to take effect..
added on the 2011-12-12 23:45:53 by Movi
Figured it out. Here are the results
added on the 2011-12-12 23:59:17 by Movi
Oh, sorry about the missing -lm, going to fix that now.
And many thx for the results!
I also fixed it under OSX: you need to use %% for the xmm registers. It's a little bit different for the test script though, since there's no cpuinfo on Darwin. I'm hacking something together, so I should have the results soon.
added on the 2011-12-13 00:35:07 by Movi
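A minimal sketch of the %% point, assuming GCC or Clang with the default AT&T syntax. The "too many memory references" errors further up are typical of Intel-style operands being assembled in AT&T mode, where a bare xmm0 is read as a memory symbol. In extended asm with operands, a literal register has to be written as %%xmm0. The name v4_add_asm is hypothetical and not taken from the library, and the scalar branch is only there so the sketch also builds on non-SSE targets.

```c
#include <assert.h>

/* Hypothetical helper (not the library's code): dst = a + b for four floats. */
static inline void v4_add_asm(float *dst, const float *a, const float *b)
{
    assert(dst && a && b);
#if defined(__GNUC__) && defined(__SSE__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile(
        "movups (%1), %%xmm0\n\t"   /* %% becomes a single % in the emitted asm */
        "movups (%2), %%xmm1\n\t"
        "addps  %%xmm1, %%xmm0\n\t" /* AT&T order: source first, destination last */
        "movups %%xmm0, (%0)\n\t"
        : /* no register outputs, the result is written through memory */
        : "r"(dst), "r"(a), "r"(b)
        : "xmm0", "xmm1", "memory");
#else
    /* Scalar fallback for targets without SSE. */
    for (int i = 0; i < 4; ++i)
        dst[i] = a[i] + b[i];
#endif
}
```

The clobber list tells the compiler which registers the block overwrites, so it does not keep live values in xmm0 or xmm1 across the asm statement.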
Just did the graphs; it seems that it's faster under Linux in a VM than native on OSX. Apple quality..
added on the 2011-12-13 00:44:01 by Movi
thx to bartman.
thx again to Movi

@movi: an extra thx for testing it in a VM :)
So plotting works, at least? :)
Yeah. Couldn't neatly compile matplotlib under OSX (without installing yet a 3rd copy of Python), so I did it under the VM. Oh, and I read the graphs in the wrong order. It is slower under the VM, so everything checks out :)
added on the 2011-12-13 00:58:34 by Movi
To make good use of SSE, it's a bit more complicated than that. Check this out for example:
http://bullet.svn.sourceforge.net/viewvc/bullet/trunk/Extras/vectormathlibrary/include/vectormath/

You also have xnamath.h in the DirectX SDK.

What you want to look for is how they deal with return types and with byte alignment (16-byte boundaries) using padding.

IMHO the usefulness of SSE when it comes to 3D vector math is somewhat limited. If you really need to crunch, you're better off using Eigen or GPGPU stuff. Using SSE compiler switches in general seems to be a good idea though.

added on the 2011-12-13 01:03:02 by Yomat
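A small sketch of the alignment-with-padding idea, assuming a C11 compiler for _Alignas. The type vec3a and the helpers are hypothetical and not taken from Bullet, xnamath.h or the library under test. A vec3 is padded with a fourth float so that it fills exactly one 16-byte SSE register and can be returned by value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical padded vec3: three floats plus one unused lane, 16-byte aligned. */
typedef struct {
    _Alignas(16) float x;
    float y, z;
    float pad; /* unused fourth lane, makes the struct exactly 16 bytes */
} vec3a;

/* True if p sits on a 16-byte boundary, which aligned SSE loads (movaps) need. */
static inline int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Returns the result by value instead of writing through an output pointer. */
static inline vec3a v3a_add(vec3a a, vec3a b)
{
    assert(is_aligned16(&a) && is_aligned16(&b));
    vec3a r = { a.x + b.x, a.y + b.y, a.z + b.z, 0.0f };
    return r;
}
```

With this layout a plain x,y,z struct no longer straddles 16-byte boundaries in arrays, at the cost of four wasted bytes per vector.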
@Yomat: Thx for the link.
I searched for some good source code but found none.

Only the id-lib math code from Doom 3, which is quite nice, but it uses C++ __asm, which has better compiler integration than GCC's inline asm does.

Yeah, alignment was a pity when I experimented with vec3 and my x,y,z struct.

I don't really need to crunch; this should just be the first step in getting my stuff together ;)
So why not use intrinsics instead of raw assembler code? Such as those defined in e.g. xmmintrin.h and emmintrin.h?
added on the 2011-12-13 06:48:54 by nystep
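For comparison, a minimal sketch of a four-float add written with the xmmintrin.h intrinsics instead of inline asm. The function name v4_add_intrin is hypothetical; the unaligned load/store variants are used so that no assumption about the caller's alignment is needed, and a scalar branch is kept for targets where __SSE__ is not defined.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: dst = a + b for four floats. */
static inline void v4_add_intrin(float *dst, const float *a, const float *b)
{
    assert(dst && a && b);
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);            /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(dst, _mm_add_ps(va, vb)); /* lane-wise add, unaligned store */
#else
    /* Scalar fallback for targets without SSE. */
    for (int i = 0; i < 4; ++i)
        dst[i] = a[i] + b[i];
#endif
}
```

No register names appear anywhere, so the compiler is free to pick registers and to schedule the loads around neighbouring code.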
My prod uses SSE intrinsics and includes source code.
I hope it will help you.
(But it will only work in VC++)
http://www.pouet.net/prod.php?which=56553

I also recommend using intrinsics instead of asm.
If your inline function has code like this:
addps xmm0, xmm1
mulps xmm0, xmm1
it will always use the xmm0 and xmm1 registers.
So your compiler might have to insert instructions to copy/back up registers.

When your code does "x = a+b+c*d;",
the compiler will generate code like this:
'xmm0 = a
'xmm1 = b
addps xmm0, xmm1
'backup xmm0
'xmm0 = c
'xmm1 = d
mulps xmm0, xmm1
'restore a+b to xmm1
addps xmm0, xmm1

SSE has xmm0~xmm7 (xmm0~xmm15 in 64-bit mode).
If you use intrinsics, the compiler will assign better registers or memory to your code.
And most modern compilers (VC++, GCC, ICC) support them.
added on the 2011-12-13 09:25:01 by tomohiro
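The "x = a+b+c*d" case can be sketched with intrinsics roughly as follows. All names here (v4, vadd, vmul, v4_sum_madd) are hypothetical, and the non-SSE branch is only a portability fallback. The point is that the intermediate results a+b and c*d are ordinary values, so the compiler chooses the registers and no manual backup or restore is needed.

```c
#include <assert.h>

#if defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 v4;
static inline v4 v4_load(const float *p) { return _mm_loadu_ps(p); }
static inline void v4_store(float *p, v4 v) { _mm_storeu_ps(p, v); }
static inline v4 vadd(v4 a, v4 b) { return _mm_add_ps(a, b); }
static inline v4 vmul(v4 a, v4 b) { return _mm_mul_ps(a, b); }
#else
/* Scalar stand-in so the sketch builds without SSE. */
typedef struct { float f[4]; } v4;
static inline v4 v4_load(const float *p)
{
    v4 r;
    for (int i = 0; i < 4; ++i) r.f[i] = p[i];
    return r;
}
static inline void v4_store(float *p, v4 v)
{
    for (int i = 0; i < 4; ++i) p[i] = v.f[i];
}
static inline v4 vadd(v4 a, v4 b)
{
    for (int i = 0; i < 4; ++i) a.f[i] += b.f[i];
    return a;
}
static inline v4 vmul(v4 a, v4 b)
{
    for (int i = 0; i < 4; ++i) a.f[i] *= b.f[i];
    return a;
}
#endif

/* x = a + b + c*d, lane by lane; register allocation is left to the compiler. */
static inline void v4_sum_madd(float *x, const float *a, const float *b,
                               const float *c, const float *d)
{
    assert(x && a && b && c && d);
    v4 ab = vadd(v4_load(a), v4_load(b));
    v4 cd = vmul(v4_load(c), v4_load(d));
    v4_store(x, vadd(ab, cd));
}
```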
Plus ICC and GCC generate better SSE code than the VC compiler.
added on the 2011-12-13 10:49:04 by raer
tomohiro: Thanks for the source.
I used GCC intrinsics before and had the problem that GCC created a huuuge amount of instructions for backing up registers and moving stuff around.
So my idea was to just inline it myself and shorten this.
Are you sure v4_Compare(..) works?
Isn't comparing floats for equality evil?
added on the 2011-12-13 12:14:02 by flure
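One common alternative to == is a tolerance-based comparison. The sketch below is an assumption about what such a check could look like, not the library's v4_Compare; the names float_nearly_equal and v4_nearly_equal and the choice of a relative tolerance with a floor of 1.0 are illustrative only.

```c
#include <assert.h>

/* Hypothetical helper: true if a and b differ by at most rel_eps,
 * scaled by the larger magnitude (but never by less than 1.0). */
static inline int float_nearly_equal(float a, float b, float rel_eps)
{
    assert(rel_eps >= 0.0f);
    float diff  = a > b ? a - b : b - a;
    float abs_a = a < 0.0f ? -a : a;
    float abs_b = b < 0.0f ? -b : b;
    float scale = abs_a > abs_b ? abs_a : abs_b;
    if (scale < 1.0f)
        scale = 1.0f;
    return diff <= rel_eps * scale;
}

/* Hypothetical helper: lane-wise tolerance check for two 4-float vectors. */
static inline int v4_nearly_equal(const float *a, const float *b, float rel_eps)
{
    for (int i = 0; i < 4; ++i)
        if (!float_nearly_equal(a[i], b[i], rel_eps))
            return 0;
    return 1;
}
```

A suitable tolerance depends on the magnitudes and the number of operations involved, so the value passed as rel_eps is a per-use decision.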
flure: whoops, I see that the == version is still in there.

Just needed a function to return whether one is greater than the other.
It was only for some sorting stuff...
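A sketch of what a "greater than" function for sorting might look like as a qsort comparator. The thread does not say what the vectors are ordered by, so ordering by squared length is purely an assumption here, as are the names vec4, v4_len_sq and v4_cmp_by_length. Comparing squared lengths avoids the sqrtf call entirely.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical 4-float vector type. */
typedef struct { float x, y, z, w; } vec4;

/* Squared length; enough for ordering, no sqrtf needed. */
static inline float v4_len_sq(const vec4 *v)
{
    assert(v);
    return v->x * v->x + v->y * v->y + v->z * v->z + v->w * v->w;
}

/* Three-way comparator for qsort: negative, zero or positive. */
static int v4_cmp_by_length(const void *pa, const void *pb)
{
    float la = v4_len_sq((const vec4 *)pa);
    float lb = v4_len_sq((const vec4 *)pb);
    return (la > lb) - (la < lb);
}
```

It would be used as qsort(array, count, sizeof(vec4), v4_cmp_by_length); no equality test on floats is involved, only less-than and greater-than.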
Quote:

I used GCC intrinsics before and had the problem that GCC created a huuuge amount of instructions for backing up registers and moving stuff around.
So my idea was to just inline it myself and shorten this.

You cannot take instruction pairing, instruction and cache latency and whatnot into account by hand. That's what compilers are there for. Sadly, only the Intel compiler seems to create good SSE code...
added on the 2011-12-13 14:20:04 by raer
@raer: that's why my next idea is to write the stuff directly in asm and call it from C,
just passing the addresses of the operands.

Wasn't there a release of an alternative compiler made by Intel, or is my brain making something up?

Need to look through my bookmarks when I'm back home.
Sorry for the repost; giving LLVM a try now.
That is, if I get the time to fire up the VM.
Btw, thanks for all the information.
Normally compilers should generate proper asm code from intrinsics, and that should be the way to go, but... see what I wrote above. Hand-coded asm is almost always slower than code created by a GOOD compiler.
Anyway. Try LLVM. Might be worth a shot.
added on the 2011-12-13 15:05:55 by raer
Results:
http://pastebin.com/cknhBcQ4
cpuinfo:
http://pastebin.com/2yB1d6Un
added on the 2011-12-13 23:13:05 by joooo
