Monday, November 22, 2010

Who am I?

Over the summer I was asked if I would do an interview to illustrate what University of Southampton graduates can end up doing. The discussion I had with Karen was great fun, and the resulting profile seems to have turned out ok. I always find these things hard. It's one thing to write a technical document, but quite another to collaborate on something more personal. The family joke is that the acknowledgments were the hardest section of my books for me to write.

Whilst I'm talking about university life, I was surprised to find my PhD thesis listed (but unavailable) at Amazon.co.uk.

Saturday, November 13, 2010

Multicore Application Programming arrived!

It was an exciting morning - my copy of Multicore Application Programming was delivered. After reading the text countless times, it's great to actually see it as a finished article. It's starting to become generally available. Amazon lists it as being available on Wednesday, although the Kindle version seems to be available already. It's also available on Safari Books Online. It's even turned up at Tesco!

Friday, November 12, 2010

Stopping whichs

I was using a tool the other day, and I started it in the background. It didn't come up and, when I looked, it had stopped. When this has happened in the past, foregrounding it has let it continue working. I've only noticed this on rare occasions, and I'd previously put it down to some misconfiguration of the system. However, one of my colleagues had also noticed it, so this was the ideal opportunity to figure out what was really going on.

The first step was to identify which process was stopped using jobs -l:

$ jobs -l
[1]- 25195 Running                 process1 &
[2]+ 25223 Stopped (tty output)    process2

Having done that, the next step was to find out where the process had actually stopped. This information can be obtained using ptree, which prints out the process call tree:

$ ptree 25223
511   /usr/lib/ssh/sshd
   25160 /usr/lib/ssh/sshd
     25161 /usr/lib/ssh/sshd
       25166 -bash
         25223 /bin/sh process2
           25232 sed -n $p
             25233 /usr/bin/csh -f /usr/bin/which java java
               25234 /usr/bin/stty erase ^H 

So the process has stalled in stty, setting the erase character to ^H. The call stack, printed by pstack, was not very enlightening.

$ pstack 25234
25234:  /usr/bin/stty erase ^H
  feef14d7 ioctl    (0, 540f, 8067988)
  080516f8 main     (3, 8047b24, 8047b34, 80511ff) + 40c
  0805125d _start   (3, 8047c08, 8047c16, 8047c1c, 0, 8047c1f) + 7d 

However, the interesting step is from which to stty. It turns out that which is a C-shell script, and the relevant bit is the following:

#! /usr/bin/csh -f
#...
if ( -r ~/.cshrc && -f ~/.cshrc ) source ~/.cshrc

So which sources the .cshrc file, and my .cshrc file happened to contain stty erase ^H. So why does this cause the process to stop?

Well, stty controls the characteristics of the terminal, but a script running in the background isn't allowed to change the terminal's settings. When it tries, it's stopped until it's brought back to the foreground!
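
The mechanism is standard job control: a background process that tries to modify the terminal settings is sent SIGTTOU, which stops it - exactly the "Stopped (tty output)" state that jobs reported. Here's a minimal C sketch reproducing the effect (an illustration of what stty does, not its actual source):

#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
  struct termios t;

  if (tcgetattr(STDIN_FILENO, &t) != 0)
  {
    perror("tcgetattr");
    return 1;
  }
  t.c_cc[VERASE] = '\b';  /* Set the erase character to ^H */

  /* From a background job, this call stops the process with SIGTTOU;
     it only completes once the job is foregrounded. */
  if (tcsetattr(STDIN_FILENO, TCSANOW, &t) != 0)
  {
    perror("tcsetattr");
    return 1;
  }
  puts("erase character set");
  return 0;
}

Running this with ./a.out & should leave it showing as stopped on tty output in jobs, and bringing it to the foreground with fg lets it finish.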

The easiest fix is to move the call to stty into my .login file. The .login file is only parsed at login, and not every time a shell is started. Alternatively, it's possible to check for the existence of a prompt:

if ($?prompt) then
  if ("$prompt" =~ ?*) then
    /usr/bin/stty erase ^H
  endif
endif

This works because csh only sets $prompt in interactive shells; in a non-interactive shell, like the one which runs in, the guard is false and stty is never called.

Thursday, November 11, 2010

Partitioning work over multiple threads

A few weeks back I was looking at some code that divided work across multiple threads. The code looked something like the following:

void * dowork(void * param)
{
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int start     = chunksize * threadid;
  int end       = start + chunksize;
  for (int iteration = start; iteration < end; iteration++ )
  {
...
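
To make the issue concrete, here's a hedged, self-contained sketch of how a thread function like this is typically driven with POSIX threads. The values of totalwork and nthreads are hard-coded to a small example (7 iterations, 2 threads), and the loop body is replaced by a printout of each thread's range:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

int totalwork = 7;
int nthreads  = 2;

/* Same partitioning as above, but just reporting the range.
   The intptr_t cast keeps the thread index cast clean on 64-bit. */
void * dowork(void * param)
{
  int threadid  = (int)(intptr_t) param;
  int chunksize = totalwork / nthreads;
  int start     = chunksize * threadid;
  int end       = start + chunksize;
  printf("thread %d: iterations %d..%d\n", threadid, start, end - 1);
  return NULL;
}

int main(void)
{
  pthread_t threads[2];  /* Sized to match nthreads */

  /* The thread index is passed through the void * parameter */
  for (int i = 0; i < nthreads; i++)
  {
    pthread_create(&threads[i], NULL, dowork, (void *)(intptr_t) i);
  }
  for (int i = 0; i < nthreads; i++)
  {
    pthread_join(threads[i], NULL);
  }
  return 0;
}

Thread 0 reports iterations 0..2 and thread 1 reports 3..5 - iteration 6 is never assigned.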

So there was a small error in the code: if the total work is not a multiple of the number of threads, some of the work doesn't get done. For example, if you had 7 iterations (0..6) to do, and two threads, then the chunksize would be 7/2 = 3. The first thread would do 0, 1, 2. The second thread would do 3, 4, 5. And neither thread would do iteration 6 - which is probably not the desired behaviour.

However, the fix is pretty easy: the final thread does whatever is left over:

void * dowork(void * param)
{
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int start     = chunksize * threadid;
  int end       = start + chunksize;
  if ( threadid + 1 == nthreads) { end = totalwork; }  // Final thread also does the leftover iterations
  for (int iteration = start; iteration < end; iteration++ )
  {
...

Redoing our previous example, the second thread would get to do 3, 4, 5, and 6. This works pretty well for small numbers of threads, and large iteration counts. The final thread at most does nthreads - 1 additional iterations. So long as there's a bundle of iterations to go around, the additional work is close to noise.

But... if you look at something like a SPARC T3 system, you have 128 threads. Suppose I have 11,000 iterations to complete and I divide them between all the threads. Each thread gets 11,000 / 128 = 85 iterations, except for the final thread, which gets 85 plus the 11,000 - (128 * 85) = 120 left-over iterations. So the final thread does more than twice as much work as any of the other threads.

So we need a better approach for distributing work across threads. We want each thread to do a portion of the remaining work, rather than having the final thread do all of it. There are various ways of doing this; one approach is as follows:

void * dowork(void * param)
{
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int remainder = totalwork - (chunksize * nthreads); // What's left over

  int start     = chunksize * threadid;
  
  if ( threadid < remainder ) // Check whether this thread needs to do extra work
  { 
    chunksize++;              // Yes. Lengthen chunk
    start += threadid;        // Start from corrected position
  }
  else
  {
    start += remainder;       // No. Just start from corrected position
  }
    
  int end       = start + chunksize; // End after completing chunk

  for (int iteration = start; iteration < end; iteration++ )
  {
...
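
As a quick sanity check on the arithmetic, here's a small standalone sketch (with totalwork and nthreads hard-coded to the T3 example above) that replays the per-thread calculation and confirms the chunks tile the iteration space exactly, with no gaps and no overlaps:

#include <stdio.h>

int main(void)
{
  int totalwork = 11000;
  int nthreads  = 128;
  int next      = 0;  /* The iteration the next chunk should start at */

  for (int threadid = 0; threadid < nthreads; threadid++)
  {
    int chunksize = totalwork / nthreads;
    int remainder = totalwork - (chunksize * nthreads);
    int start     = chunksize * threadid;
    if ( threadid < remainder )
    {
      chunksize++;
      start += threadid;
    }
    else
    {
      start += remainder;
    }
    if (start != next) { printf("gap or overlap at thread %d\n", threadid); }
    next = start + chunksize;
  }
  /* Expect 11000: every iteration assigned exactly once */
  printf("last chunk ends at %d of %d\n", next, totalwork);
  return 0;
}

With these numbers, threads 0 to 119 do 86 iterations each and threads 120 to 127 do 85, so the load is balanced to within a single iteration.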

If, like me, you feel that all this hacking around with the distribution of work is a bit of a pain, then you really should look at using OpenMP. The OpenMP library takes care of the work distribution. It even allows dynamic distribution to deal with the situation where the time it takes to complete each iteration is non-uniform. The equivalent OpenMP code would look like:

void * dowork(void *param)
{
  #pragma omp parallel for
  for (int iteration = 0; iteration < totalwork; iteration++ )
  {
...
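
And as a sketch of that dynamic distribution (the schedule clause is standard OpenMP; the chunk size of 16 here is an illustrative choice, not a value from the original code):

#include <stdio.h>

#define TOTALWORK 11000

int main(void)
{
  /* schedule(dynamic, 16) hands out chunks of 16 iterations to whichever
     thread is idle, so threads that hit slow iterations simply end up
     taking fewer chunks overall. */
  #pragma omp parallel for schedule(dynamic, 16)
  for (int iteration = 0; iteration < TOTALWORK; iteration++ )
  {
    /* ... work for this iteration ... */
  }
  printf("all %d iterations complete\n", TOTALWORK);
  return 0;
}

Compiled with OpenMP support (e.g. -xopenmp with Solaris Studio, or -fopenmp with gcc), the runtime takes care of both the partitioning and the thread management.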

Wednesday, November 10, 2010

An introduction to parallel programming

My colleague, Ruud van der Pas, recorded a number of lectures on parallel programming.

Slides from Solaris Summit available

I found out about the Solaris Summit too late to be able to attend, but the slides are now online.

Multicore application programming: sample chapter

No sign of the actual books yet - I expect to see them any day now - but there's a sample chapter up on the InformIT site. There's also a PDF version, which includes the preface and table of contents.

This is Chapter 3, "Identifying Opportunities for Parallelism". The opportunities range from the various OS-level approaches, through virtualisation, and into multithreaded/multiprocess designs. It's this flexibility that makes multicore processors so appealing. You have the choice of whether you take advantage of them through some consolidation of existing applications, or whether you take advantage of them, as a developer, by scaling a single application.

Tuesday, November 9, 2010

mtmalloc performance

A while back I discussed how the performance of mtmalloc could be improved. Well, Rick Weisner was working on this, so I provided him with a fix for my hot issue. So I'm very pleased to see, from the bug status, that this code was integrated last month!