pouët.net

Threads (CPU not BBS)!

category: code [glöplog]
being afraid of threads is so 90s...

added on the 2010-03-12 03:54:28 by torus torus
you could always go for an event-based engine, that'll make it possible to do parallel tasks
Chaos+Ryg: thanks for mentioning my work.

Design towards having only two kinds of threads:
- those that work a lot (tasks)
- those that wait a lot (IOs)

On a machine with n cores, you want n threads of the first kind, including the main (as Chaos said).

You can actually have quite a few threads of the second kind, those that make sure disk, network and speakers are kept busy. The key is to have them do very little when they're active (or spawn work for main+workers threads to execute).
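A minimal C++11 sketch of that layout (the pool and its names are illustrative, not nulstein's actual sample): one shared queue of tasks, with the calling thread counted as one of the n workers.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One worker per core, counting the caller: on n cores, spawn n-1 helpers
// and let the calling (main) thread drain the queue alongside them.
struct WorkerPool {
    std::queue<std::function<void()>> tasks;
    std::mutex m;

    void push(std::function<void()> t) {
        std::lock_guard<std::mutex> lock(m);
        tasks.push(std::move(t));
    }

    void run_all() {
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> helpers;
        for (unsigned i = 1; i < n; ++i)           // n-1 helpers...
            helpers.emplace_back([this] { drain(); });
        drain();                                   // ...main thread is worker 0
        for (auto& t : helpers) t.join();
    }

    void drain() {
        for (;;) {
            std::function<void()> t;
            {
                std::lock_guard<std::mutex> lock(m);
                if (tasks.empty()) return;
                t = std::move(tasks.front());
                tasks.pop();
            }
            t();                                   // run outside the lock
        }
    }
};
```

A real task scheduler would use per-thread deques and work stealing rather than one lock, but the thread count policy is the point here.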

All in all, dividing big chunks of work into smaller chunks is the easy part.

The fun starts when you need to synchronize things together. Race conditions are not only hard to fix, they can be hard to spot and reproduce. When faced with a race, I see people jumping to the easy solution: add a mutex.

<hypnosis tone>
Stop. Pause. Think.
More mutexes = more race conditions
painting self in corner - dig self into hole - sawing off branch sat on...
A good mutex is a mutex not used
</hypnosis tone>

When you see a race condition, you want to make it not happen _by design_

This is the big software design challenge: how to create code that doesn't require tasks to be aware of what the others do. It's a slightly different way of thinking, a lot still needs to be invented...

For more explanations on how my sample achieves minimal use of mutexes, read the end of this article:
http://www.gamasutra.com/view/feature/4287/sponsored_feature_doityourself_.php
added on the 2010-03-12 10:40:34 by nulstein nulstein
on such a machine i think threading your app may have benefits...
BB Image
question: on such a system, what would you do?
added on the 2010-03-12 10:52:35 by Tigrou Tigrou
Quote:
The fun starts when you need to synchronize things together. Race conditions are not only hard to fix, they can be hard to spot and reproduce. When faced with a race, I see people jumping to the easy solution: add a mutex.


The very core of this problem lies in the fact that a lot of programmers still don't seem to get that, certainly in these situations (and many others), less is more, and that (added) complexity is not to be favored over simplicity; especially since the latter can be harder to attain.
added on the 2010-03-12 10:58:11 by superplek superplek
tigrou: that is exactly the reason why the thread-per-task approach is doomed to fail.

the real question is: with so many threads trying to steal work from each other, will it stall all the time? The critical section at the heart of the algorithm might become a bit crowded.
added on the 2010-03-12 11:44:58 by chaos chaos
There's also the case of false sharing (sometimes called bad aliasing), where several threads try to access the same cache line at the same time, resulting in so much coherence traffic that the result is way slower than just using one thread.. =)
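A sketch of the usual cure, assuming C++17 and a 64-byte cache line (common, but not guaranteed on every CPU): pad each thread's counter onto its own line so the line never ping-pongs.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Two longs packed next to each other share a 64-byte cache line, so two
// cores incrementing them keep stealing the line from each other.
// alignas(64) gives every counter its own line.
struct PaddedCounter {
    alignas(64) long value = 0;
};

void bump_counters(std::vector<PaddedCounter>& counters, long iterations) {
    std::vector<std::thread> threads;
    for (auto& c : counters)
        threads.emplace_back([&c, iterations] {
            for (long i = 0; i < iterations; ++i)
                ++c.value;                   // each thread owns one cache line
        });
    for (auto& t : threads) t.join();
}
```

C++17 also offers std::hardware_destructive_interference_size instead of the hard-coded 64, where the compiler supports it.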

Anyway, in my build example, I think the reason why 21 tasks was better than 4 is the required I/O waits.
added on the 2010-03-12 11:55:02 by sol_hsa sol_hsa
Any thoughts on apple's grand central dispatch thing? I've yet to use it in anger, but it seems to solve some of the issues - it manages the thread pool, so the number of threads available matches the number of available cores and threads aren't assigned to busy cores so stalls are avoided. The way it works seems pretty clean too. It's open sourced, no idea if a windows port is available though.
added on the 2010-03-12 12:09:37 by psonice psonice
psonice, doesn't GCD come with a syntax extension? I've seen the block declarations here and there and they look nice and very usable. At least a lot better than firing off your own threads via pthreads or whatever.

However, as with new language features, it will take at least a decade until we see support on all relevant platforms.
added on the 2010-03-12 12:28:58 by torus torus
What would iq do with that machine?
added on the 2010-03-12 12:41:57 by xernobyl xernobyl
Yeah, it uses 'blocks' which is a syntax extension. I seem to remember you can use functions instead of blocks which might get around that issue.

Taking a quick look though, it seems it's being ported to bsd and not much else as yet, guess it needs access to some lower level stuff that makes porting it outside of bsd hard work.
added on the 2010-03-12 12:43:35 by psonice psonice
psonice: GCD seems like a sensible linguistic approach to deferring the problem of sensibly-grained parallel processing to the OS -- it's quite high on my list of 'things to try' - if you get to take it for a spin don't hesitate to post any findings (too busy myself right now)
added on the 2010-03-12 13:34:58 by superplek superplek
i have got a use for it actually - i want to offload a bunch of work from the gpu (which is slowly being crushed by the amount of work i'm dumping on it :) and it's highly suitable for threading. I was hoping to keep 10.5 support though. Laziness is kind of luring me towards 10.6/gcd (plus opencl is very appealing for this app) so we'll see. I'll post up what happens if i do it. (this is a video processing kind of app btw, not some uber demo unfortunately)
added on the 2010-03-12 14:21:17 by psonice psonice
I couldn't care less what kind of app it is actually - interested if you get some results / get to see some bottlenecks of the whole thing et cetera.
added on the 2010-03-12 14:33:58 by superplek superplek
But... Why not OpenMP?
It works, it's easy, doesn't mess up the code too much...
added on the 2010-03-12 14:50:27 by pan pan
OpenMP is nice and good as long as you're just after pure data decomposition. When you start wanting to parallelize a whole game engine type of app, you encounter all sorts of situations that get more and more awkward to deal with in OpenMP only (oh, and _don't_ start mixing and matching threading libraries!)
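For reference, the pure-data-decomposition case really is a one-liner in OpenMP. This sketch compiles even without OpenMP enabled, in which case the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>

// One pragma parallelizes the loop; reduction(+:sum) gives each thread its
// own partial sum and combines them at the end. Built without -fopenmp the
// pragma is ignored and this runs serially, with the same result.
double dot(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```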

My latest bit of fun has been about having the main thread submit draw calls while the rest of the processor would work on the next frame. There is something in latest TBB that makes it really easy: TaskGroups. nulstein also has this ability to spawn tasks while you do something else in the calling thread. I reckon this is a key feature when pipelining things and I really don't think you can even hope to see this in OpenMP.
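The pipelining pattern looks roughly like this, with std::async standing in for TBB's task_group (a sketch, not nulstein's actual code, and Frame is a made-up type): spawn the build of frame n+1, then submit frame n from the calling thread while that build runs.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical frame data; the point is the pipelining, not the contents.
struct Frame { std::vector<int> draw_calls; };

Frame build_frame(int frame_number) {
    Frame f;
    for (int i = 0; i < 8; ++i)
        f.draw_calls.push_back(frame_number * 100 + i);
    return f;
}

// Spawn the build of the next frame, then "submit" the current one from the
// calling thread while that build runs on another thread.
int submit_while_building(const Frame& current, int next_number) {
    auto next = std::async(std::launch::async, build_frame, next_number);
    int submitted = 0;
    for (int call : current.draw_calls) {
        (void)call;                          // pretend to issue a draw call
        ++submitted;
    }
    Frame n = next.get();                    // next frame is ready when needed
    return submitted + (int)n.draw_calls.size();
}
```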

But, generally, the main reason why I'm not using OpenMP is control. Having all this code hidden behind #pragmas makes me uncomfortable ;)
added on the 2010-03-13 18:51:10 by nulstein nulstein
Quote:

My latest bit of fun has been about having the main thread submit draw calls while the rest of the processor would work on the next frame.


Do we know, does each core typically have its own cache? I would assume so but you never know. Also, I wonder how cores negotiate what is dirty between them.
added on the 2010-03-13 20:14:35 by sigflup sigflup
Quote:
Do we know, does typically each core have it's own cache?

L1 yes, L2 depends (sometimes yes, sometimes no), L3 no (always just one L3 per package in all the architectures I've seen so far)
Quote:
I would assume so but you never know.

This is mentioned in virtually every in-depth processor review, it's in the architecture block diagrams, and it's also on Wikipedia.
Quote:
Also, I wonder how cores negotiate what is dirty between them.

http://en.wikipedia.org/wiki/Cache_coherence.
added on the 2010-03-13 21:58:17 by ryg ryg
I'd add that with hyperthreading on, you have two "logical cores" running off the same caches, L1 all the way down to L3... Add to this that the OS can reschedule your thread at any time and you have a nice rubiky toy to play with.

If you're trying to take advantage of how the cache is shared, you pretty much have to set up your thread affinities so the OS has got no room for anything "automagic" - but then, you have to take care of anything automagic that might have helped you before...

What about the last-level cache always being shared? It's not: in processors like the Q6600 (which has proved very popular), we really have two dual-core chips in one package, and you can pick cores that don't share any cache at all...

Thankfully, caches share information efficiently and are designed to work transparently. But if you really have to mess with this, you may find these articles useful:
http://www3.intel.com/cd/ids/developer/asmo-na/eng/276613.htm
http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/
added on the 2010-03-14 11:26:05 by nulstein nulstein
ahh hyperthreading, that's what I was after. I knew there was a particular distinction between threads and cores that we weren't discussing...
added on the 2010-03-14 17:44:30 by sigflup sigflup
Wasn't there a PC GTA game that "fully utilized" 2 cores successfully?
added on the 2010-03-14 19:20:40 by moredhel moredhel
main.c
Code:
void TestFunction(void* a, void* b)
{
    printf("Function %i sleeping %i\n", (int)b, (int)a);
    Timer timer;
    while (timer.dGet() < (double)((int)a) / 1000.0) {}  // busy-wait
    //Sleep((int)a);
}

int main(int argc, char* argv[])
{
    Timer timer;
    ThreadManager ThreadMan;
    ThreadMan.InitTM();

    printf("Starting the pushing...\n");
    timer.Reset();
    ThreadMan.Push(TestFunction, (void*)1000, (void*)1);
    ThreadMan.Push(TestFunction, (void*)1000, (void*)2);
    ThreadMan.Push(TestFunction, (void*)1000, (void*)3);
    ThreadMan.Push(TestFunction, (void*)1000, (void*)4);
    ThreadMan.Push(TestFunction, (void*)1000, (void*)5);
    ThreadMan.Push(TestFunction, (void*)1000, (void*)6);
    ThreadMan.Push(TestFunction, (void*)1000, (void*)7);
    ThreadMan.Push(TestFunction, (void*)1000, (void*)8);
    ThreadMan.Push(TestFunction, (void*)1000, (void*)9);
    ThreadMan.WorkAndWait();
    printf("\n%f seconds!\n", timer.dGet());

    ThreadMan.KillTM();
    system("PAUSE");
    return 0;
}

output
Code:
Thread 1 initialized;
Starting the pushing...
Pushing 1...
Pushing 2...
Thread 2 initialized;
Thread 2 working: 2 running, 0 queued
Function 2 sleeping 1000
Pushing 3...
Pushing 4...
Pushing 5...
Pushing 6...
Pushing 7...
Pushing 8...
Pushing 9...
Thread 0 working: 3 running, 6 queued
Function 3 sleeping 1000
Thread 3 initialized;
Thread 3 working: 4 running, 5 queued
Function 4 sleeping 1000
Thread 1 working: 1 running, 0 queued
Function 1 sleeping 1000
Thread 2 working: 4 running, 4 queued
Function 5 sleeping 1000
Thread 0 working: 4 running, 3 queued
Function 6 sleeping 1000
Thread 3 working: 4 running, 2 queued
Function 7 sleeping 1000
Thread 1 working: 4 running, 1 queued
Function 8 sleeping 1000
Thread 2 working: 4 running, 0 queued
Function 9 sleeping 1000

3.002525 seconds!
Thread 2 says good bye;
Thread 1 says good bye;
Thread 3 says good bye;
Press any key to continue . . .

Does it look good enough? =)
added on the 2010-03-26 23:35:28 by xernobyl xernobyl
It takes ~2 seconds with 8 cores. So at least for stupid sleep functions it works fine.
added on the 2010-03-27 18:11:48 by xernobyl xernobyl
When doing multi-thread aware code, don't forget also to take into consideration all the APIs you call that may be thread-safe in a very crappy way. A typical thing in game development is to use a custom memory allocator, reuse it year after year, and finally realise that there is a global mutex added by somebody to make it thread safe, resulting in a game that runs slower on a dual core than it would on a single core...
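A sketch of one way out (illustrative, not a production allocator): make the free list thread_local, so the common alloc/free path takes no lock at all. A block freed by a different thread than the one that allocated it would need cross-thread handling, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Fixed-size-block allocator with a thread_local free list: no global mutex
// on the common path, because no state is shared between threads.
struct Block { Block* next; char payload[56]; };

thread_local Block* g_free_list = nullptr;   // per-thread, never shared
thread_local std::deque<Block> g_pool;       // deque: stable element addresses

void* fast_alloc() {
    if (g_free_list) {                       // lock-free fast path
        Block* b = g_free_list;
        g_free_list = b->next;
        return b->payload;
    }
    g_pool.push_back(Block());               // grow this thread's pool
    return g_pool.back().payload;
}

void fast_free(void* p) {                    // assumes p came from this thread
    Block* b = reinterpret_cast<Block*>(
        static_cast<char*>(p) - offsetof(Block, payload));
    b->next = g_free_list;
    g_free_list = b;
}
```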

added on the 2010-03-27 22:58:11 by Dbug Dbug
My worker threads... what should they do when the queue is empty? Currently I have a Sleep(0), not sure if that's a good idea.
added on the 2010-03-28 01:08:23 by xernobyl xernobyl