pouët.net

How to control multicore in C

category: general [glöplog]
smash: maybe it depends on the cpu? maybe it depends on the xp version?

I will try with the code I wrote here and post how many threads per second I could get with my computer and Vista...
added on the 2008-12-08 17:18:55 by texel texel
* you should not have more threads than cores - with the possible exception of a sound or network thread.

* you should not wait for events or locks more than once per frame. sync-points are very expensive and easily dominate your performance. without careful measurement you will never know if you are truly multithreaded.

* until the arrival of DX11 you will have to do all resource management in your main thread. beware of hidden locks in new/delete etc.
added on the 2008-12-08 17:44:27 by chaos chaos
Quote:
sync-points are very expensive and easily dominate your performance.


And this doesn't necessarily mean raw computing power - e.g. the XP scheduler is known to wait up to 16 milliseconds to wake up a thread if others in the same priority class are running (my personal measurements revealed something between 4 and 16 milliseconds depending on phase of the moon, number of letters of the surname of the girlfriend of your second to last visitor, etc. YMMV). Yes, this effectively means that waking up a thread can take a whole frame until the darn thing actually DOES something.
added on the 2008-12-08 18:06:32 by kb_ kb_
I admire your ability to measure these things. Any tips how to measure bottlenecks/delays/lags?


In interactive programs there are sometimes more threads than cores, but most of them are sleeping. They wake up when there is incoming MIDI or when a slow I/O operation is going on. Mostly I/O happens as events in the main thread.
added on the 2008-12-08 20:01:52 by neoneye neoneye
Quote:
I admire your ability to measure these things


It's actually not hard at all when you have wonderful tools like PIX on XBOX360 :D
added on the 2008-12-08 20:19:01 by keops keops
So, I haven't followed Intel's evolution for a long time, but a few years ago I used to hear that XP always assigns all threads of an application to the same core, due to cache synchronization issues.
added on the 2008-12-08 20:22:46 by krabob krabob
i tried pthreads and now openmp (nice, but you still need to know all the possible types of locks).
added on the 2008-12-08 21:37:41 by earx earx
I pthreads too.
added on the 2008-12-08 21:42:47 by trc_wm trc_wm
Creating a pool of worker threads on application startup to reduce overhead isn't a bad idea.

Also, within a few years we will probably have more cores available than we need threads (Tilera already ships 64 core CPUs), so I wouldn't worry too much about limiting threads to one per core, unless the app needs to run as fast as possible NOW.
added on the 2008-12-08 22:19:18 by Radiant Radiant
and 64k should be enough for everyone.
added on the 2008-12-08 23:37:07 by trc_wm trc_wm
the idea of having a thread per task is wrong

the idea is to have a thread per core and many many many tiny tasks. then every core-thread will pick a minitask from the task-queue.

the obvious tricky part is to pick the mini-tasks in a way that does not globally lock all other tasks from picking one. the whole problem of task-picking gets a bit more complicated since there will be dependencies between the tasks. it has to be really quick and absolutely lock-free.

the other tricky part, the one that i have not yet found any interesting literature about, is how to pipeline different phases of processing with minimum latency - like input -> ai -> camera matrix calculation -> cpu-side graphics processing -> command buffer building -> gpu-side graphics processing. this is quite a problem at least for games. basically you want a multithreaded phase that lasts the whole frame, and a little bit of pre-work and a little bit of post-work, and you want to build a pipeline so that during the pre-work and the post-work all cores are busy.

and then it gets complicated...
added on the 2008-12-08 23:44:55 by chaos chaos
OpenMP (http://openmp.org/wp/) is amazing too.
added on the 2008-12-09 00:01:52 by LiraNuna LiraNuna
How to do it in assembly?
added on the 2008-12-09 03:21:10 by xernobyl xernobyl
Quote:
the other tricky part, the one that i have not yet found any interesting literature about, is how to pipeline different phases of processing with minimum latency


yes, it's very difficult, especially if you're moving from a single-core to a multicore architecture.
if you can rearrange your tasks so they don't rely on each other's results so heavily, it helps. i.e. cpu-side gfx processing, command buffer building and gpu-side graphics processing run async from each other.
there are some things that aren't really workable that way - input handling, camera calculation and so on. and you have large but typically interdependent processes like ai, physics and animation, which you have to rewrite to make them as asynchronous as possible. things like batching up all the ai raycasts in advance and getting the results much later. you can try building the job list a frame in advance.

it's little wonder that so many games studios don't achieve very good parallelism and simply throw a few large graphics/fx processes onto their extra cores. it's very tempting to go like that, especially if your main cpu / one of your cores alone is "fast enough" to do a lot of the work. lucky for me i'm mainly concerned with the graphics side, which is much easier to thread. :)
added on the 2008-12-09 10:42:54 by smash smash
Quote:
yes, it's very difficult, especially if you're moving from a single-core to a multicore architecture.

Oh that's funny on a few levels :)
added on the 2008-12-09 10:59:22 by gloom gloom
Quote:
Oh that's funny on a few levels :)

It is? :|

Anyway, this year's Breakpoint workshop about multithreading and such by hikey was interesting. Too bad the video ain't available. Or the presentation.
added on the 2008-12-09 11:18:27 by xernobyl xernobyl
I've just remembered a way to check whether your multicore version is working fine:

Try running two (or four if you have a quad-core) of your single-core executables at the same time. If they all run on different cores and the framerate doesn't decrease, it means that there is no memory bottleneck. In that case it is just a matter of how well the parallelization can be done. Dividing the work usually takes some time by itself, even for the most parallelizable tasks.
added on the 2008-12-09 23:57:06 by texel texel
Texel: Write a mensah article about it.
added on the 2008-12-10 01:21:44 by Hatikvah Hatikvah
You can also use OpenMP...
added on the 2008-12-10 11:46:26 by zmurf zmurf
OpenMP is pretty much useless for more complex stuff other than simple loop constructs.

As chaos said a queue-based job/task scheduling system using worker threads is a really flexible and efficient system. It even gives you automatic load-balancing for free and other nice properties. Could insert shameless plug here, but won't :-)
added on the 2008-12-10 14:58:10 by arm1n arm1n
jar: no, openmp can even do functional parallelism, not only data parallelism. see the openmp 'sections' / 'section' directives
added on the 2008-12-10 15:03:35 by earx earx
Actually Intel's Threading Building Blocks stuff (that the guy talked about at Evoke) sounds pretty neat if you want to distribute workload in a nice way. It's only meant for actual computing work tho, so don't expect it to solve all of your sync and latency problems :)
added on the 2008-12-10 15:16:21 by kb_ kb_
Quote:

it's little wonder that so many games studios don't achieve very good parallelism


it's partially because they're scared (and perhaps rightfully so, chaos' advice is probably the best for most developers) and for another very big part because many game studios are still using tech that was put together in the ps2/xbox or even psx era, which hinders certain radical changes needed to take the best out of a ps3 (i know you're talking about that ;-)).

when it comes to threading i've always been a conservative man myself, but if you do it right and have the resources to justify a certain degree of lock complexity/overhead (e.g. spus), it's the only way to 'max power'.
added on the 2008-12-10 18:01:58 by superplek superplek
BB Image
added on the 2008-12-10 18:03:10 by superplek superplek
