## Thursday, November 11, 2010

### Partitioning work over multiple threads

A few weeks back I was looking at some code that divided work across multiple threads. The code looked something like the following:

```c
void * dowork(void * param)
{
    int chunksize = totalwork / nthreads;
    int start     = chunksize * threadid;
    int end       = start + chunksize;
    for (int iteration = start; iteration < end; iteration++ )
    {
        ...
```

So there was a small error in the code. If the total work was not a multiple of the number of threads, then some of the work didn't get done. For example, if you had 7 iterations (0..6) to do, and two threads, then the chunksize would be 7/2 = 3. The first thread would do 0, 1, 2. The second thread would do 3, 4, 5. And neither thread would do iteration 6 - which is probably not the desired behaviour.
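To make the lost work concrete, here is a minimal sketch that replays the original arithmetic for every thread and counts how many iterations actually get assigned. The function names `iterations_covered` and `report_dropped` are my own inventions for illustration; they are not part of the code I was looking at.

```c
#include <assert.h>
#include <stdio.h>

/* Count how many iterations the original scheme assigns in total.
   Hypothetical helper: it replays the start/end arithmetic per thread. */
int iterations_covered(int totalwork, int nthreads)
{
    assert(nthreads > 0);
    int chunksize = totalwork / nthreads;   /* integer division truncates */
    int covered   = 0;
    for (int threadid = 0; threadid < nthreads; threadid++)
    {
        int start = chunksize * threadid;
        int end   = start + chunksize;
        covered  += end - start;
    }
    return covered;
}

/* Print how many iterations are never assigned to any thread. */
void report_dropped(int totalwork, int nthreads)
{
    int covered = iterations_covered(totalwork, nthreads);
    printf("%d of %d iterations assigned, %d dropped\n",
           covered, totalwork, totalwork - covered);
}
```

Calling `report_dropped(7, 2)` reports 6 of 7 iterations assigned, which matches the example above: iteration 6 is the one that goes missing.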

However, the fix is pretty easy. The final thread does whatever is left over:

```c
void * dowork(void * param)
{
    int chunksize = totalwork / nthreads;
    int start     = chunksize * threadid;
    int end       = start + chunksize;
    if ( threadid + 1 == nthreads) { end = totalwork; }
    for (int iteration = start; iteration < end; iteration++ )
    {
        ...
```

Redoing our previous example, the second thread would get to do 3, 4, 5, and 6. This works pretty well for small numbers of threads and large iteration counts. The final thread does at most nthreads - 1 additional iterations. So long as there's a bundle of iterations to go around, the additional work is close to noise.

But if you look at something like a SPARC T3 system, you have 128 threads. Suppose I have 11,000 iterations to complete, and I divide these between all the threads. Each thread gets 11,000 / 128 = 85 iterations (rounded down), which accounts for 85 * 128 = 10,880 of them. That leaves 120 for the final thread, which ends up with 85 + 120 = 205 iterations. So the final thread gets more than twice as much work as each of the other threads.
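The imbalance is easy to check with a couple of lines of arithmetic. The following is a small sketch, with helper names (`regular_chunk`, `final_chunk`) that I have made up for the purpose, which computes the share of an ordinary thread and the share of the final thread under the "last thread takes the leftovers" scheme.

```c
#include <assert.h>

/* Iterations given to every thread except the last (hypothetical helper). */
int regular_chunk(int totalwork, int nthreads)
{
    assert(nthreads > 0);
    return totalwork / nthreads;            /* integer division truncates */
}

/* Iterations given to the final thread: its own chunk plus the leftovers. */
int final_chunk(int totalwork, int nthreads)
{
    int chunk = regular_chunk(totalwork, nthreads);
    return totalwork - chunk * (nthreads - 1);
}
```

For 11,000 iterations and 128 threads, `regular_chunk` gives 85 and `final_chunk` gives 205. Since every other thread finishes early and then waits, the whole loop takes roughly as long as those 205 iterations.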

So we need a better approach for distributing work across threads. We want each thread to do a portion of the remaining work rather than having the final thread do all of it. There are various ways of doing this; one approach is as follows:

```c
void * dowork(void * param)
{
    int chunksize = totalwork / nthreads;
    int remainder = totalwork - (chunksize * nthreads); // What's left over

    int start     = chunksize * threadid;

    if ( threadid < remainder ) // Check whether this thread needs to do extra work
    {
        chunksize++;              // Yes. Lengthen chunk
        start += threadid;        // Start from corrected position
    }
    else
    {
        start += remainder;       // No. Just start from corrected position
    }

    int end       = start + chunksize; // End after completing chunk

    for (int iteration = start; iteration < end; iteration++ )
    {
        ...
```
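It is worth convincing yourself that this scheme leaves no gaps or overlaps. Here is a minimal sketch that pulls the same arithmetic out into a standalone function and then checks the result for every thread. The `struct range` type and the names `balanced_range` and `partition_is_balanced` are assumptions of mine, introduced only so the logic can be tested in isolation.

```c
#include <assert.h>

/* Half-open iteration range [start, end) for one thread (hypothetical type). */
struct range { int start; int end; };

/* Same arithmetic as the loop setup above, returned as a range. */
struct range balanced_range(int totalwork, int nthreads, int threadid)
{
    assert(nthreads > 0 && threadid >= 0 && threadid < nthreads);
    int chunksize = totalwork / nthreads;
    int remainder = totalwork % nthreads;   /* what's left over */
    int start     = chunksize * threadid;

    if (threadid < remainder)
    {
        chunksize++;                        /* this thread takes one extra */
        start += threadid;                  /* one extra per earlier thread */
    }
    else
    {
        start += remainder;                 /* skip all the extras */
    }

    struct range r = { start, start + chunksize };
    return r;
}

/* Returns 1 if the ranges are contiguous, cover [0, totalwork),
   and differ in size by at most one iteration; otherwise 0. */
int partition_is_balanced(int totalwork, int nthreads)
{
    int expected_start = 0;
    int smallest = totalwork;
    int largest  = 0;
    for (int threadid = 0; threadid < nthreads; threadid++)
    {
        struct range r = balanced_range(totalwork, nthreads, threadid);
        if (r.start != expected_start) { return 0; }
        int size = r.end - r.start;
        if (size < smallest) { smallest = size; }
        if (size > largest)  { largest  = size; }
        expected_start = r.end;
    }
    return expected_start == totalwork && largest - smallest <= 1;
}
```

With 11,000 iterations and 128 threads, the first 120 threads each get 86 iterations and the last 8 each get 85, so no thread does more than one extra iteration.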

If, like me, you feel that all this hacking around with the distribution of work is a bit of a pain, then you really should look at using OpenMP. The OpenMP library takes care of the work distribution. It even allows dynamic distribution to deal with the situation where the time it takes to complete each iteration is non-uniform. The equivalent OpenMP code would look like:

```c
void * dowork(void *param)
{
    #pragma omp parallel for
    for (int iteration = 0; iteration < totalwork; iteration++ )
    {
        ...
```
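For the non-uniform case mentioned above, a `schedule(dynamic)` clause asks the runtime to hand out iterations to threads as they become free. The following is a small self-contained sketch of that idea. The `work` function is a made-up stand-in for the real loop body, and `sum_of_work` is a hypothetical name. The `reduction` clause is there only so that the per-thread partial sums are combined safely.

```c
#include <assert.h>

/* Made-up stand-in for the real per-iteration work. */
static long work(int iteration)
{
    return (long)iteration * iteration;
}

/* Sum work() over all iterations, letting OpenMP distribute them. */
long sum_of_work(int totalwork)
{
    assert(totalwork >= 0);
    long total = 0;

    /* schedule(dynamic): idle threads grab the next available iteration.
       reduction(+:total): each thread sums privately, then results merge. */
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int iteration = 0; iteration < totalwork; iteration++)
    {
        total += work(iteration);
    }
    return total;
}
```

With GCC this needs `-fopenmp` to run in parallel; other compilers use their own flag. Without OpenMP support the pragma is ignored and the loop simply runs serially, giving the same answer.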