pouët.net

How are shaders divided between cards in a generic application with an SLI / crossfire setup?

category: code [glöplog]
 
I've done a bit of searching, but it doesn't look like a question that's asked that often. Is there an increase in general shader performance with SLI and Crossfire (i.e. does the driver do all the grunt work), or do you have to explicitly enable it in your program? Is it possible to fine-tune performance with such a setup, or would I need to use something like CUDA or DirectCompute for that? Just wondering as I'm doing some quite intense general computation with shaders and could really use the extra performance.

Also, is there any widely used library for shader-based double / arbitrary precision emulation? The implementations I've come across lack really useful functions like sine, cosine, non-integer powers, and so on. I'm not really adept enough at understanding how numbers of non-native precision are stored to be able to do that myself in a realistic amount of time. But, heck, I might try if there's no other way and someone points me in the right direction.
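For context, the usual storage trick in those libraries is "double-single" / float-float arithmetic: each extended-precision value is kept as the unevaluated sum of a high and a low float. Below is a minimal sketch of the addition step in CUDA-style C, since that ports almost directly to GLSL/HLSL; the struct and function names are invented, and it only holds as long as the compiler isn't allowed to re-associate the float operations.

[code]
// "Double-single" sketch: a value is stored as hi + lo, two floats,
// with |lo| no bigger than half an ulp of hi.
struct ds { float hi, lo; };              // hypothetical type, not a standard one

// Knuth's two-sum: exact sum of two floats as rounded result + error term.
__device__ ds ds_two_sum(float a, float b)
{
    float s = a + b;
    float v = s - a;
    float e = (a - (s - v)) + (b - v);    // rounding error of s = a + b
    ds r; r.hi = s; r.lo = e;
    return r;
}

// Add two double-single numbers.
__device__ ds ds_add(ds a, ds b)
{
    ds s = ds_two_sum(a.hi, b.hi);
    float lo = s.lo + a.lo + b.lo;        // fold in the low-order parts
    return ds_two_sum(s.hi, lo);          // renormalise so lo stays small
}

__device__ ds    ds_from_float(float x) { ds r; r.hi = x; r.lo = 0.0f; return r; }
__device__ float ds_to_float(ds a)      { return a.hi + a.lo; }
[/code]

Multiplication additionally needs an exact product (an fma, or a Dekker-style split), and sine/cosine are then built on top via argument reduction plus a polynomial evaluated on the hi/lo pair - which is probably why the public libraries tend to stop before them.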
Had a 295 for quite some time now.
From what I understand the driver tries to balance the load on the two GPUs, and there's a good chance some intelligence and heuristics are involved in that.
There used to be two modes: AFR (alternate frame rendering), and splitting each frame into two halves across the two cards. I think you have no real control over that nowadays, but "somewhere" in the nvidia drivers there's a profile for each game; which suggests that the mess is fine-tuned for each app and not trivial.
For a homemade or unknown app, I dunno.
For gpgpu, I dunno.
In extreme cases, I suppose, depending on the weirdness of the engine, you cannot benefit from SLI at all.
Note that AFR induces a lag of one frame; that is, you get more fps but not less time between input and effect. This is not good.
One frame split in half and done on two units, on the other hand, causes more sync issues.
I heard that for Crysis the gun and objects are rendered on one unit and the background on the other. If true, this says a lot about the difficulty of automatic parallelization.
In my modest experience I would say that a bi-GPU card sucks in comparison to a single one.
But heh, it can spew out more frames. Maybe not twice as many, but it kind of works.
But at equal power, a single GPU is always preferable.
The thing is, I needed a 295.
I can't be sure I can do SLI, and I needed three outputs for my triple head.
And this thing delivers a real 5040-pixel-wide framebuffer (surround), and that kicks some real serious ass.
However, it's still a bi-gpu. God only knows how the load is balanced in this case.
And I have to say, especially in surround:
More often than not, fraps displays a decent fps number, and your eyes tell you that it's nothing but a lie.
fps is not all if it's not regular.
A google search for "microstutter" reveals some painful truth.
Maybe it's past, but maybe not.
We would need a ms-granularity graph of every frame over dozens of seconds to be sure.
Since then, things have changed.
The 680 has 4 outputs and can do surround.
And it's a single GPU, quite powerful at that, and maybe even less expensive than an SLI setup with the same power.
But I digress :)
AFAIR modern GPUs can do double in hardware.
Maybe not IEEE754-perfect, and surely not 80-bit internal (but neither does SSE), but still, they do.
If you want to master the loss of precision completely, you can do the computations in integer all the way, but it's a tad more work; you need something like add-with-carry for this.
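A hedged sketch of what that add-with-carry looks like with 32-bit unsigned limbs, again in CUDA-style C (the function name and limb count are made up; the per-limb trick is the same in a shader, and GLSL 4.x even has a uaddCarry() builtin for it):

[code]
// Multi-limb integer addition built from 32-bit adds plus a manual carry.
// Little-endian limb order; 4 limbs = 128 bits here, purely for illustration.
#define NLIMBS 4

__device__ void bigint_add(const unsigned int a[NLIMBS],
                           const unsigned int b[NLIMBS],
                           unsigned int       r[NLIMBS])
{
    unsigned int carry = 0u;
    for (int i = 0; i < NLIMBS; ++i) {
        unsigned int s  = a[i] + b[i];
        unsigned int c1 = (s < a[i]) ? 1u : 0u;   // did a + b wrap around?
        unsigned int t  = s + carry;
        unsigned int c2 = (t < s) ? 1u : 0u;      // did adding the carry wrap?
        r[i]  = t;
        carry = c1 + c2;                           // c1 and c2 can't both be 1
    }
    // a non-zero carry left over here means the full result overflowed NLIMBS
}
[/code]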
afaik sli/crossfire for direct rendering is a mess. the afr is hard to predict - there you go with microstutters and... so better stay off that. what could be of help is the tiled mode, if you arrange your computation wisely. dunno if this counts for offscreen stuff too. that might help work it out. nonetheless you'd better talk with a gpu driver coder about how they manage that shit.
added on the 2012-03-26 23:13:30 by yumeji
