Wednesday, July 28, 2010

Multicore application programming on Safari books

A roughcut of Multicore Application Programming has been uploaded to Safari books. If you have access you can read it, and provide feedback or comments. If you don't have access to Safari, you can still see the table of contents, read the preface, and view the start of each chapter.

Monday, July 26, 2010

What does take_deferred_signal() mean in my profile?

Every so often you'll see take_deferred_signal() appear in the profile of an application. Sometimes as quite a time consuming function. So, what does it mean?

It actually comes from signal handling code in libc. If a signal comes in while the application is in a critical section, the signal gets deferred until the critical section is complete. When the application exits the critical section, all the deferred signals get taken.

Typically, this function becomes hot due to mutex locks in malloc and free, but other library calls can also cause it. The way to diagnose what is happening is to examine the call stack. So let's run through an example. Here is some multithreaded malloc/free heavy code.

#include <stdlib.h>
#include <pthread.h>

void *work( void* param )
  while ( 1 ) { free( malloc(100) ); }

int main()
  pthread_t thread;
  pthread_create( &thread, 0, work, 0 );
  for ( int i=0; i<10000000; i++ )
    free ( malloc (100) );

Profiling, we can see that take_deferred_signal() is the hottest function. The other hot functions would probably give us a clue as to the problem, but that is an artifact of the rather simple demonstration code.

Excl.     Incl.      Name
User CPU  User CPU
  sec.      sec.
36.456    36.456     <Total>
14.210    14.210     take_deferred_signal
 4.203    21.265     mutex_lock_impl
 3.082     3.082     clear_lockbyte
 2.872    17.062     mutex_trylock_adaptive

The next thing to look at is the call stack for take_deferred_signal() as this will tell us who is calling the function.

Attr.      Name
User CPU
14.210     do_exit_critical
14.210    *take_deferred_signal

do_exit_critical() doesn't tell us anything, we already know that it is called when the code exits a critical section. Continuing up the call stack we find:

Attr.      Name
User CPU
14.190     mutex_trylock_adaptive
 0.020     mutex_unlock
 0.       *do_exit_critical
14.210     take_deferred_signal

Which is more useful, we now know that the time is spent in mutex locks, but we don't know the user of those mutex locks. In this case the bulk of the time comes from mutex_trylock_adaptive(), so that is the routine to investigate:

Attr.      Name
User CPU
17.062     mutex_lock_impl
 2.872    *mutex_trylock_adaptive
14.190     do_exit_critical

So we're still in the mutex lock code, we need to find who is calling the mutex locks:

Attr.      Name
User CPU
11.938     free
 9.327     malloc
 4.203    *mutex_lock_impl
17.062     mutex_trylock_adaptive

So we finally discover that the time is due to calls to mutex locks in malloc() and free().

Friday, July 9, 2010

White paper on using Oracle Solaris Studio

I contributed a fair amount of material to a recent white paper about Oracle Solaris Studio. The paper is available for download from the developer portal.

Optimizing Applications with Oracle Solaris Studio Compilers and Tools

Oracle Solaris Studio delivers a fully integrated development platform for generating robust high-performance applications for the latest Oracle Sun systems (SPARC and x86). In order to take full advantage of the latest multicore systems, applications must be compiled for optimal performance and tuned to exploit the capabilities of the hardware. Learn how Oracle Solaris Studio helps you generate the highest performance applications for your target platform, from selecting the right compiler flags and optimization techniques to simplifying development with advanced multicore tools.

Presenting at Oracle Develop

As part of Oracle Develop, I'll be presenting at the Hotel Nikko, in San Francisco on 20th September at 4pm. The session is S317573 titled "Multicore Application Programming with Oracle Solaris Studio". The abstract reads as follows:

Writing correct and fast parallel applications is often considered a hard problem. However, it doesn't need to be that way. This session will describe how Oracle Solaris Studio can be used to produce applications that are both fast and correct. The talk will cover parallelization strategies, implementation details, and common pitfalls, as well as describing how the tools provided by Oracle Solaris Studio can identify coding errors and performance opportunities in the application.

Thursday, July 8, 2010

Multicore application programming: update

It's 2am and I've just handed over the final manuscript for Multicore Application Programming. Those who know publishing will realise that this is not the final step. The publishers will layout my text and send it back to me for a final review before it goes to press. It will probably take a few weeks to complete the process.

I've also uploaded the final version of the table of contents. I've written the book using It's almost certain not to be a one-to-one mapping of pages in my draft to pages in the finished book. But I expect the page count to be roughly the same - somewhere around 370 pages of text. It will be interesting to see what happens when it is properly typeset.