the multithread thread

category: code [glöplog]

So here is a coder question.
How do I multithread?

I've already managed to start some several threads and wait until they end.

What I want is that I have a main thread. It calls an init procedure which starts x threads and will queue tasks which should unpause these threads and do their job. The main thread sometimes has to wait for these tasks are finished.

These small tasks are extremely time critical. So I want to do this the proper way.

This might be related to http://www.pouet.net/prod.php?which=58419

added on the 2012-02-05 16:11:35 by mu6k

Any details on the platform, operating system you code for?

added on the 2012-02-05 16:17:39 by kbi

Oh right, PC win32. So I'm guessing it should be in windows.h.
Forgot to set category :)

added on the 2012-02-05 16:19:28 by mu6k

Some time ago I did a multithreaded Mandelbrot benchmark. The main thread starts as many threads as logical CPU core's are there.

Each thread calculates a specific amount of lines of the Mandelbrot. The line counter is stored in a locked way, so the threads are not messing up where's the next line to be calculated. It works pretty well. Due to the iterating concept of Mandelbrot's it was a bit of try and error to see how many threads/lines are beneficial...and of course I guess there's also different multithread approaches for Mandelbrots.

You can find the app/source (Win32, assembler in FASM format) here:
http://www.mikusite.de/x86/KMB_V0.53I-32b-MT.zip

added on the 2012-02-05 16:49:58 by Kuemmel

You probably looking for a "thread pool" implementation.

The idea is to block threads (using waitforsingleobject) on a some event and signal the event after you assign the job.

Check: http://code.google.com/p/cppthreadpool/
this is a linux implementation, but its very straight forward.

added on the 2012-02-05 16:56:05 by TLM

IQ has code similar to what Kuemmel describes in one of his 4k frameworks, I think it's the g4k_Software one.
http://www.iquilezles.org/www/material/isystem1k4k/isystem1k4k.htm

Anyway, that's the Spawn-threads-and-wait-for-them-bit only. The code waits until the threads terminate. If you intend to do animations I guess you don't want the threads to terminate, that could hurt performance quite a bit.

added on the 2012-02-05 16:57:19 by Moerder

Yup, all you need is a good ol' WaitForMultipleObjects() and a couple of events to wait on, in the simplest implementation. Each thread should wait for job arrival notification, then do its job, and then signal its "job completed" event. The main thread would carry on as soon as all worker threads finish working on their pieces.

added on the 2012-02-05 17:11:33 by kbi

Or use globalalloc and use mutexes to communicate in shared memory

added on the 2012-02-05 17:16:04 by rasmus is danish

no you want to do "work stealing" iirc

added on the 2012-02-05 18:49:33 by the_Ye-Ti

rasmus, I think mutexes are two orders of magnitude slower than signal events.

added on the 2012-02-05 19:24:26 by TLM

Depends on the use case

added on the 2012-02-05 19:33:16 by rasmus is danish

And available thread facilities. AFAIK, you don't get them out of the box in posix environments.

added on the 2012-02-05 20:00:40 by kbi

I meant events, of course :)

added on the 2012-02-05 20:15:47 by kbi

Yes thread pools seem to be exactly what I want.

added on the 2012-02-05 21:27:12 by mu6k

As long as they are fast enough. I never need more than amount of logical CPU. The only thing I actually want it is to have my code running on every logical CPU. I don't think that starting a new thread for each task and killing it when over is good for performance. The tasks would be relatively small, their time would be counted at most in milliseconds.

added on the 2012-02-05 21:35:36 by mu6k

Never kill worker threads, let them idle, waiting for signals. This will cost you nothing in terms of performance, especially when compared to the cost of killing and spawning threads on demand.

If you need to make sure each threads uses separate CPU core, configure thread affinity accordingly.

added on the 2012-02-05 21:44:56 by kbi

as some others stated, threadpool is the solution :
it avoids creating destroying threads uselessly (since it has a cost) + it also avoid to create too much threads according the number of logical cpu.

basically adding new tasks to a threadpool works like this :

Code:

//add some jobs
threadpool::queuesomework(pointertosomefunction1, args);
threadpool::queuesomework(pointertosomefunction2, args);
...
threadpool::queuesomework(pointertosomefunction3, args);

threadpool::waitforalltocomplete();

and voila you dont need to take care about thread creation, or checking the number of cpu on machine to see if its usefull to create threads (ex : on a single cpu machine)

also : there is some cases where creating much more threads than the number of logical cpu's can be benefical : in case there is some I/O or anything inside a thread that would cause cpu to yield (system call or interrupt, a sleep, ...)

added on the 2012-02-05 23:07:27 by Tigrou

nullstein?

added on the 2012-02-05 23:33:54 by T$

what?

added on the 2012-02-05 23:37:05 by Tigrou

what everybody said.

as Moerder pointed out, you have a multithreaded tile rendering example at http://www.iquilezles.org/www/material/isystem1k4k/isystem1k4k.htm, which was used for oraganix, slisesix, ixaleno, leizex and eleixane (lol, gratuitous self propaganda).

Basically all threads are created, and they start consuming screen tiles (16x16 pixels, or something). When they are done with their first tile, they pick the next. The amount of tiles consumed at any point is being tracked by a counter incremented atomic by the threads when they pick a new tile to render. In Win32, it looks something like this (made for 4k intros, not for beauty)

Code:


typedef struct
{
    void *mutex;
    long *tileID;
}ThreadData;


void main( void )
{
    ThreadData idata[128];
    HANDLE     th[128];
    HANDLE     barrier;

    __declspec(align(32)) long tileID = 0;

    SYSTEM_INFO si;
    GetSystemInfo( &si );

    barrier = CreateSemaphore( 0, numCpus, si.dwNumberOfProcessors, 0 );

    for( int i=0; i<numCpus; i++ )
    {
        unsigned long thid;
        idata[i].mutex = barrier;
        idata[i].tileID = &tileID;
        th[i] = CreateThread( 0, 0, doWord, idata+i, 0, &thid );
    }

    // synch point
    WaitForMultipleObjects(numCpus, th, true, INFINITE);

    for( int i=0; i<numCpus; i++ ) CloseHandle(th[i]);
    CloseHandle(barrier);
}


static unsigned long __stdcall doWord( void *vdata )
{
    ThreadData *data = (ThreadData*)vdata;

    WaitForSingleObject( (HANDLE*)data->mutex, 0 );

    for(;;)
    {
        const int tile = InterlockedIncrement( data->tileID ) - 1;
        if( tile >= NUMTILES ) break;
        
        renderTile( tile );
    }

    ReleaseSemaphore( (HANDLE*)data->mutex, 1, NULL );

    return 1;
}

added on the 2012-02-06 00:38:16 by iq

http://www.lovethedot.net/2009/02/parallelfor-deeper-dive-parallel.html

added on the 2012-02-06 00:57:13 by Gargaj

http://scene.org/file.php?file=%2Fparties%2F2011%2Frevision11%2Fmisc%2Fseminars%2Frev11_chaos_multithreading_for_graphics_sun_h264.mp4&fileinfo

added on the 2012-02-06 03:52:42 by Yomat

Couple of things:

- as compact as IQ example is, it's still not thread pool, it create many threads and wait for them to die. There is huge overhead in doing so if you intend re-create a new job.

- Creating too many threads isn't a good idea on modern platform as it creates stress on task switching logic (even if threads are waiting on objects). I say golden number of active threads should be around x2 the number of cores.

- For high performance: Use events to block, set mm timer to 1 (gives about 10% boost), don't use sleep() (block instead).

An example of the mm timer thingy:

Code:

#include <windows.h>
#include <stdio.h>
#include <Mmsystem.h>

#define CYCLES 100000

HANDLE hStartJob;
HANDLE hJobComplete;

INT64 TS1;
INT64 TS2;
INT64 Freq;

DWORD WINAPI ThreadExecute(LPVOID) 
{
	QueryPerformanceFrequency((PLARGE_INTEGER)&Freq);
	QueryPerformanceCounter((PLARGE_INTEGER)&TS1);
	for (int i = CYCLES-1; i >= 0; i--)
	{
		WaitForSingleObject(hStartJob, INFINITE);

		// Do something...

		SetEvent(hJobComplete);
	}
	QueryPerformanceCounter((PLARGE_INTEGER)&TS2);

	return 0;
}

void PingPongTest()
{
	// Create thread
	TS2 = 0;
	CreateThread(0, 0, &ThreadExecute, 0, 0, 0);

	// Start and complete many jobs
	while (TS2 == 0)
	{
		SetEvent(hStartJob);
		WaitForSingleObject(hJobComplete, 10);
	}

	// Print timings
	printf("%fuSec\n", ((TS2-TS1) * (1000000. / Freq)) / CYCLES);
}

void main()
{
	hStartJob    = CreateEvent(0, 0, 0, 0);
	hJobComplete = CreateEvent(0, 0, 0, 0);

	printf("Default mm setup => ");
	PingPongTest();

	timeBeginPeriod(1);
	printf("Highest mm setup => ");
	PingPongTest();
}

added on the 2012-02-06 09:59:06 by TLM

+1 to Chaos' talk (if it's the same as I saw at GDCeu'10)

If you're going parallel, there is one rule: don't lock. The art is in making a machine that never (ie rarely) waits. Critical sections aren't your friends...

Once you have that, there is the matter of how big your tasks should be (there be ye old can of worms)... The trick is to realize that an individual task really is part of an iteration (a for loop, descending a tree, something like that). A task isn't the code inside the inner loop, it is a subset of the whole iteration. It is not a "render a pixel", it is "render a tile" (to take IQ's example).

Say the amount of work in renderTile is totally regular. Then you can split the work in numCpus, all threads will end up waiting for the last one to finish (because there has to be a last one, right?), but because each thread is doing a big chunk of work, the wasted cycles are relatively few.

Now, say the amount of work is totally unpredictable (classic example = Mandelbrot set), then big tiles will lead to a lot of waiting. So you're going to want to split into really small tiles, so the threads getting the easy ones can render more tiles than the thread getting the dense ones. But that poses two problems: first of all, you need to post all these tiny tasks in a job queue, which takes serial time (ie not parallel). The other problem is that because the tiles are tiny, your setup time grows in comparison to your work time. Start the engine, stop the engine, start the engine, stop the engine...

Sooner or later, you need two things:
- to have tasks that evolve dynamically in size
- to get rid of the central work queue (because of contention)

This is why it is key to think of task as a subset of an iteration, that you can split in smaller parts as you go along, tasks spawning tasks (which is the real reason why central work queue needs to go).

Ok, where do you go from there?
- start from IQ's example above
- take it as far as you can (because it's easy and can take you a long way)
- move on to a task stealing system

shameless plug:
- Tiny task stealing engine: http://software.intel.com/en-us/articles/do-it-yourself-game-task-scheduling/
- For a more detailed discussion: http://software.intel.com/en-us/blogs/2010/08/20/nulstein-v2-plog-presenting-through-a-blog/

(btw, I have been told that my approach to deferred rendering doesn't give good results now that drivers do support it, I expect to revisit this soonish, unfortunately I haven't got much time at the mo')

added on the 2012-02-06 10:27:54 by nulstein

TLM: thanks for the timeBeginPeriod() hint - never heard of it! :)

added on the 2012-02-06 10:37:27 by kbi

pouët.net

the multithread thread

login