[code] Particle Systems

category: general [glöplog]
OK, so I know that it is wise to use a SOA instead of an AOS. But has anybody an advice if its better to use normal arrays, std::list, or a selfmade allocator to store particle data (pos, vel, etc.)? Especially considering 1) the killing of random particles and 2) cache optimized (SIMD) routines for VectorArrays...
Any thoughts would be welcome!
added on the 2006-05-28 21:59:44 by arm1n arm1n
What is SOA or AOS?
added on the 2006-05-28 22:10:50 by xernobyl xernobyl
arrays should be good enough. dunno what you mean by killing random particles since every particle should be somehow organized :) "kill" it when its life goes to zero?

+ i recommend SOA, although no idea what it means.
added on the 2006-05-28 22:41:10 by shadez shadez
Using std::list for particle systems is the worst decision you could make. Well, not the most, you could be using std::map.
added on the 2006-05-28 23:43:53 by imbusy imbusy
array or std::vector.

and removing particles in the middle of an array is trivial, because particles don't have to be stored in any particular order (even when you zsort them or anything, you need to do that per frame anyway):

Code:for(i=0;i<count;) if(!particle[i].expired) { // process particle i++; } else particle[i] = particle[--count];
added on the 2006-05-28 23:49:33 by ryg ryg
Actually you can have two pointers (indices), one for reading and one for writing. e.g.:

for(i = 0, j = 0; i < activeparticles; i++) {
if (!particle[i].expired) {
// read data for particle i and write to particle j
activeparticles = j;

it's more or less the same as above.
added on the 2006-05-28 23:55:38 by pan pan
This may help.
added on the 2006-05-28 23:59:54 by bdk bdk
i think soa can be poison for your cache... at least if you avoid dynamic allocation. same applies if you use some kind of own memory management like illustrated in ryg's lightspeed seminar. personally i prefere a normal well aligned aos since its easier to operate on single items (delte, reorder,...).
added on the 2006-05-29 00:24:19 by ttl ttl
first, for the "uninitiated": aos=array of structures, soa=structure of arrays.

Code:struct particle { float x,y,z; } particles[1000];

Code:struct particles { float x[1000],y[1000],z[1000]; };

in practice, you usually do something like "aosoa": (or atleast that's what i tend to end up with)
Code:struct particle4 { float x[4],y[4],z[4]; } particles[250];

the idea behind soa is to make data more SIMD-friendly - vector adds etc are no problem either way, but things like dot or cross products are not particularly fast when you want to do them between individual components of a single SIMD register (lots of data movement/swizzling involved). the "aosoa" variant groups data into "SIMD-friendly" chunks but has nicer locality in case you also do a bit of random access now and then.

i think soa can be poison for your cache...

don't think. measure.

as long as you perform sequential operation (...like in a particle system...), working in multiple streams is not a problem (in my experience anyway) - it doesn't make much of a difference whether you have 3 cachelines' worth of xyz or 1 of x,y,z each. nonsequential access is different - there you definitely want tight packing so you don't have to fetch more cachelines than necessary.

pan: matter of taste, mostly. i like my variant a bit more (obviously :)) because it never has to re-read cachelines that have already been processed, even when most items get deleted and the distance between the write and read pointers gets big (even when you write a whole cacheline, it first gets read from memory, unless the memory region is marked as write combined). having said that, i seriously doubt there's any significant difference outside of specially constructed scenarios :)
added on the 2006-05-29 00:44:16 by ryg ryg
BB Image
added on the 2006-05-29 00:54:36 by bdk bdk
the problem im refering to is that u may need more than 4 cachelines for a particle. think about the acceleration for each particle... your final soa may have something like 6 or more arrays with the same size. if you now choose a bad size they will share the same line set resulting in cache fighting. this is probably a quite uncommen scenario but not impossible at all.
its not about having 3 cache lines, its about possible fighting ones
added on the 2006-05-29 01:24:47 by ttl ttl
what sort of thing do you want to do with particles, because there are many cases where you don't have to store or update anything at all.
added on the 2006-05-29 01:36:30 by Navis Navis
ttl - the solution is very simple (and already visible in my short example code :): use counts which don't have high powers of two as divisors. easiest way to make sure you get a usable spread in cache usage.
added on the 2006-05-29 01:47:34 by ryg ryg
sure... if sizes are well chosen theres no problem at all.
i just wanted to point out that soas have a really great potential to fuck up the loop ;)
but if you keep that in mind everything should be fine.
added on the 2006-05-29 01:57:21 by ttl ttl
whoa, thanks for all the helpful replies! That cleared up a lot of questions.
I had the same idea in mind like the "swapRemove" in ryg's first post, when the particle has expired.
Will try stuff out when getting home..
added on the 2006-05-29 11:45:20 by arm1n arm1n
ah, just another question: Is there any advantage in rendering particle billboards in screen-space (RHW) instead of using camera-aligned 3d-quads?
added on the 2006-05-29 11:54:15 by arm1n arm1n
regarding SOA/AOS... ryg's idea....

Code: struct particle4 { float x[4],y[4],z[4]; } particles[250];

is the way to go, but I've prefered to add a twist:

Code: struct particle4 { float x[32],y[32],z[32]; } particles[1024/32];

this way, if you are coding SIMD via automatic paralelisation (such as gcc-4 provides), then you are not tied to todays 4-way SIMD... if there is next processors that use 8-way, 16-way or 32-way SIMD, they will enable to use this automatically. I recall pondering this one when I first started to code a k6 machine with amd 2-way simd and just a bit later I was starting to make powerpc 4-way simd... I noticed that adjusting too much to the machine in this case was not good ;)
added on the 2006-05-31 21:28:00 by winden winden
i just make it one particle per structure and make use of the 4 simd float ops intelligently (e.g. put the rotation in the position.w, put the angular velocity in the velocity.w).
although that would be on ps2/psp, cant remember the last time i did a particle effect on pc :)
added on the 2006-06-01 11:34:35 by smash smash
for "normal" particle motions, that's just fine (and also what we use internally in our particle systems - which aren't even SSE optimized, because they're fast enough).

and in fact, what i said previously wasn't correct: you can do dot products pretty efficiently even in an AOS layout, by always computing N (where N is the vector size of whatever instruction set you're using) dot products at once. dot products with a constant vector are easy, dot products between variable vectors are a bit more involved because you have to perform a matrix transpose operation between SIMD registers, but it's still relatively fast.
added on the 2006-06-01 17:55:51 by ryg ryg
i'd also advise implementing the math primitives you use to generate simd (using for ex. the sse intrinsics on pc) instead of optimizing the separate case(s), costs a little more time, yields speed anywhere you do heavy vectorized floating point processing
added on the 2006-06-01 18:58:49 by superplek superplek
And nowadays I might not agree with myself that much ;)

sorry for the kick!
added on the 2009-07-18 09:30:45 by superplek superplek
BB Image
added on the 2009-07-18 11:13:55 by alienus alienus
superplek, it's no problem - no-one listens to you anyway! hohoho!
sitting with something you'd like to share, Conan? :)
added on the 2009-07-18 13:14:31 by superplek superplek
i agree with Ryg, measure... from my experience there is no "final" solution for particle systems. i tend to measure and try as i go.
added on the 2009-07-18 20:20:20 by pantaloon pantaloon