Monday, November 22, 2010

Who am I?

Over the summer I was asked if I would do an interview to illustrate what University of Southampton graduates can end up doing. The discussion I had with Karen was great fun, and the resulting profile seems to have turned out ok. I always find these things hard. It's one thing writing a technical document, but quite another to collaborate on something more personal. The family joke is that the acknowledgments were the hardest section of my books for me to write.

Whilst I'm talking about university life, I was surprised to find my PhD thesis listed (but unavailable) at Amazon.co.uk.

Saturday, November 13, 2010

Multicore Application Programming arrived!

It was an exciting morning - my copy of Multicore Application Programming was delivered. After reading the text countless times, it's great to actually see it as a finished article. It's starting to become generally available. Amazon lists it as being available on Wednesday, although the Kindle version seems to be available already. It's also available on Safari Books Online. It's even turned up at Tesco!

Friday, November 12, 2010

Stopping whichs

I was using a tool the other day, and I started it in the background. It didn't come up and, when I looked, it had stopped. When this has happened in the past, I've foregrounded it and it has continued working. I've only noticed this on rare occasions, and I'd previously put it down to some misconfiguration of the system. However, one of my colleagues had also noticed it, so this was the ideal opportunity to figure out what was really going on.

The first step was to identify which process was stopped using jobs -l:

$ jobs -l
[1]- 25195 Running                 process1 &
[2]+ 25223 Stopped (tty output)    process2

Having done that, the next step was to find out where the process had actually stopped. This information can be obtained using ptree which prints out the process call tree:

$ ptree 25223
511   /usr/lib/ssh/sshd
   25160 /usr/lib/ssh/sshd
     25161 /usr/lib/ssh/sshd
       25166 -bash
         25223 /bin/sh process2
           25232 sed -n $p
             25233 /usr/bin/csh -f /usr/bin/which java java
               25234 /usr/bin/stty erase ^H 

So the process has stalled in stty setting the erase character to be ^H. The callstack, printed by pstack, was not very enlightening.

$ pstack 25234
25234:  /usr/bin/stty erase ^H
  feef14d7 ioctl    (0, 540f, 8067988)
  080516f8 main     (3, 8047b24, 8047b34, 80511ff) + 40c
  0805125d _start   (3, 8047c08, 8047c16, 8047c1c, 0, 8047c1f) + 7d 

However, the interesting step is the one from which to stty. It turns out that which is a C-shell script, and the relevant part is the following:

#! /usr/bin/csh -f
#...
if ( -r ~/.cshrc && -f ~/.cshrc ) source ~/.cshrc

So which sources the .cshrc file, and my .cshrc file happened to contain stty erase ^H. So why does this cause the process to stop?

Well, stty controls the characteristics of the controlling terminal, but when the script is executing in the background it isn't allowed to touch the terminal. So stty gets stopped - the "Stopped (tty output)" state that jobs reported - and waits until the job is brought back to the foreground.

The easiest fix is to move the call to stty into my .login file. The .login file is only parsed at login, and not every time a shell is started. Alternatively, it's possible to check for the existence of a prompt:

if ($?prompt) then
  if ("$prompt" =~ ?*) then
    /usr/bin/stty erase ^H
  endif
endif

Thursday, November 11, 2010

Partitioning work over multiple threads

A few weeks back I was looking at some code that divided work across multiple threads. The code looked something like the following:

void * dowork(void * param)
{
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int start     = chunksize * threadid;
  int end       = start + chunksize;
  for (int iteration = start; iteration < end; iteration++ )
  {
...

So there was a small error in the code. If the total work was not a multiple of the number of threads, then some of the work didn't get done. For example, if you had 7 iterations (0..6) to do, and two threads, then the chunksize would be 7/2 = 3. The first thread would do 0, 1, 2. The second thread would do 3, 4, 5. And neither thread would do iteration 6 - which is probably not the desired behaviour.

However, the fix is pretty easy. The final thread does whatever is left over:

void * dowork(void * param)
{
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int start     = chunksize * threadid;
  int end       = start + chunksize;
  if ( threadid + 1 == nthreads) { end = totalwork; }
  for (int iteration = start; iteration < end; iteration++ )
  {
...

Redoing our previous example, the second thread would get to do 3, 4, 5, and 6. This works pretty well for small numbers of threads, and large iteration counts. The final thread at most does nthreads - 1 additional iterations. So long as there's a bundle of iterations to go around, the additional work is close to noise.

But... if you look at something like a SPARC T3 system, you have 128 threads. Suppose I have 11,000 iterations to complete and I divide these between all the threads. Each thread gets 11,000 / 128 = 85 iterations, except for the final thread, which gets 85 + 120 iterations. So the final thread gets more than twice as much work as any of the other threads.

So we need a better approach for distributing work across threads. We want each thread to do a portion of the remaining work rather than having the final thread do all of it. There are various ways of doing this; one approach is as follows:

void * dowork(void * param)
{
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int remainder = totalwork - (chunksize * nthreads); // What's left over

  int start     = chunksize * threadid;
  
  if ( threadid < remainder ) // Check whether this thread needs to do extra work
  { 
    chunksize++;              // Yes. Lengthen chunk
    start += threadid;        // Start from corrected position
  }
  else
  {
    start += remainder;       // No. Just start from corrected position
  }
    
  int end       = start + chunksize; // End after completing chunk

  for (int iteration = start; iteration < end; iteration++ )
  {
...
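
To convince yourself that this arithmetic assigns every iteration exactly once, here's a quick standalone sketch that just prints each thread's range for the 11,000 iteration example:

#include <stdio.h>

int main()
{
  int totalwork = 11000;
  int nthreads  = 128;
  for ( int threadid = 0; threadid < nthreads; threadid++ )
  {
    int chunksize = totalwork / nthreads;
    int remainder = totalwork - (chunksize * nthreads);
    int start     = chunksize * threadid;
    if ( threadid < remainder ) { chunksize++; start += threadid; }
    else                        { start += remainder; }
    int end       = start + chunksize;
    printf( "Thread %3d does iterations [%5d, %5d)\n", threadid, start, end );
  }
  return 0;
}

The first 120 threads get 86 iterations each, and the remaining 8 threads get 85, so no thread does more than one extra iteration.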

If, like me, you feel that all this hacking around with the distribution of work is a bit of a pain, then you really should look at using OpenMP. The OpenMP library takes care of the work distribution. It even allows dynamic distribution to deal with the situation where the time it takes to complete each iteration is non-uniform. The equivalent OpenMP code would look like:

void * dowork(void *param)
{
  #pragma omp parallel for
  for (int iteration = 0; iteration < totalwork; iteration++ )
  {
...
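
For the non-uniform case, a schedule clause can be added to the directive so that idle threads pick up the remaining chunks of work. Here's a minimal sketch - the totalwork value and the loop body are just placeholders:

#include <stdio.h>

#define totalwork 11000

int main()
{
  double total = 0.0;
  // schedule(dynamic) hands out chunks of iterations on demand,
  // so threads that finish early simply take more work.
  #pragma omp parallel for schedule(dynamic) reduction(+:total)
  for ( int iteration = 0; iteration < totalwork; iteration++ )
  {
    total += iteration;  // stand-in for the real work
  }
  printf( "total = %f\n", total );
  return 0;
}

Compile with -xopenmp (or -fopenmp with gcc) and the runtime takes care of the rest.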

Wednesday, November 10, 2010

An introduction to parallel programming

My colleague, Ruud van der Pas, recorded a number of lectures on parallel programming.

Slides from Solaris Summit available

I found out about the Solaris Summit too late to be able to attend, but the slides are now on line.

Multicore application programming: sample chapter

No sign of the actual books yet - I expect to see them any day now - but there's a sample chapter up on the informit site. There's also a pdf version which includes preface and table of contents.

This is chapter 3, "Identifying opportunities for parallelism". The opportunities range from various OS-level approaches, through virtualisation, and into multithreading and multiprocessing. It's this flexibility that makes multicore processors so appealing. You have the choice of whether you take advantage of them through some consolidation of existing applications, or whether you take advantage of them, as a developer, through scaling a single application.

Tuesday, November 9, 2010

mtmalloc performance

A while back I discussed how the performance of mtmalloc could be improved. Well Rick Weisner was working on this, so I provided him with a fix for my hot issue. So I'm very pleased to see, from the bug status, that this code was integrated last month!

Monday, October 4, 2010

Memory ordering

Just had a couple of white papers published on memory ordering. This is a topic which is quite hard to find documentation on, and also quite complex. Fortunately, it's also rarely encountered.

In Oracle Solaris Studio 12.2 we introduced the file mbarrier.h. This defines some intrinsics which allow the developer to enforce memory ordering.

The first paper covers avoiding the reordering of memory operations that the compiler may perform when compiling an application. The second paper covers the more complex issue of avoiding the reordering of memory operations that the processor may do at runtime.
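
As a rough illustration of the kind of code mbarrier.h enables, here's a sketch of a producer/consumer hand-off. The intrinsic names here are from memory, so treat them as placeholders and check mbarrier.h itself:

#include <stdio.h>
#include <pthread.h>
#include <mbarrier.h>

volatile int data = 0;
volatile int flag = 0;

void * consumer( void * param )
{
  while ( flag == 0 ) {}    // spin until the producer raises the flag
  __machine_r_barrier();    // don't let the load of data move before the load of flag
  printf( "data = %d\n", data );
  return 0;
}

int main()
{
  pthread_t thread;
  pthread_create( &thread, 0, consumer, 0 );
  data = 42;
  __machine_w_barrier();    // make the store to data visible before the store to flag
  flag = 1;
  pthread_join( thread, 0 );
  return 0;
}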

Thursday, September 16, 2010

Updated location

Just heard that my talk is moving to the Nikko Ballroom I. Still at 4pm Monday.

Details for Oracle Open World Presentation

I'm presenting at the Develop conference in San Francisco next week. I'll be in the Nikko Ballroom I at the Hotel Nikko, at 4pm on Monday. The title of the talk is "Multicore Application Programming with Oracle Solaris Studio 12.2". The abstract is:

Writing correct and fast parallel applications is often considered a hard problem. However, it doesn't need to be that way. This session will describe how Oracle Solaris Studio can be used to produce applications that are both fast and correct. The talk will cover parallelization strategies, implementation details, and common pitfalls, as well as describing how the tools provided by Oracle Solaris Studio can identify coding errors and performance opportunities in the application.

There are three talks from the Studio team; the details are:

  • Monday 20th, 4:00pm - S317573: Multicore Application Programming with Oracle Solaris Studio (Darryl Gove) - Hotel Nikko, Nikko Ballroom I
  • Tuesday 21st, 11:30am - S317590: Performance Measurement with Oracle Solaris Studio Performance Tools (Marty Itzkowitz) - Hotel Nikko, Peninsula
  • Wednesday 22nd, 1:00pm - S317585: Building High-Quality C/C++ Applications (Don Kretsch) - Hotel Nikko, Nikko Ballroom II

There appears to be no way to link directly to the talk details, but they are available if you search the entire programme.

Friday, September 10, 2010

Book update

I've just handed over the final set of edits to the manuscript. These are edits to the laid-out pages. Some are those last few grammatical errors you only catch after reading a sentence twenty times. Others are tweaks to the figures. There's still a fair amount of production work to do, but my final input will be a review of the indexing - probably next week.

So it's probably a good time to talk about the cover. This is a picture that my wife took last year. It's a picture of the globe at the cliff tops at Durlston Head near Swanage in England. It's 40 tonnes and over 100 years old. It's also surrounded by stone tablets, some containing contemporary educational information, and a couple of blank ones that are there just so people can draw on them.

Wednesday, September 8, 2010

Oracle Solaris Studio 12.2 released

It's been just over a year since the release of Studio 12 Update 1, and today we're releasing the first Oracle-branded Studio release - Oracle Solaris Studio 12.2. For the previous release I wrote a post for the AMD site looking at the growth in multicore processors. It seemed appropriate to take another look at this.

The chart below shows the cumulative number of SPECint2006 results broken down by the number of cores for each processor. This data does not represent the number of different types of processor that are available, since the same processor can be used in many different results. It is closer to a snapshot of how the market for multicore processors is growing. Each data point represents a system, so the curve approximates the number of different systems that are being released.

It's perhaps more dramatic to demonstrate the change using a stacked area chart. The chart perhaps overplays the number of single core results, but this is probably fair as "single core" represents pretty much all the results prior to the launch of CPU2006. So what is readily apparent is the rapid decline in the number of single core results, the spread of dual, and then quad core. It's also interesting to note the beginning of a spread of more than quad core chips.

If we look at what is happening with multicore processors in the context of what we are releasing with Solaris Studio, there's a very nice fit of features. We continue to refine our support for OpenMP and automatic parallelisation. We've been providing data race (and deadlock) detection through the Thread Analyzer for a couple of releases. The debugger and the performance analyzer have been fine with threads for a long time. The performance analyzer has the time line view which is wonderful for examining multithreaded (or multiprocess) applications.

In addition to these fundamentals Studio 12.2 introduces a bunch of new features. I discussed some of these when the express release came out:

  • For those who use the IDE, integration of support for the analysis of the runtime behaviour of applications has been very useful. It both provides more information directly back to the developer, and raises awareness of the available tools.
  • Understanding call trees is often an important part of interpreting the performance of the application. Being able to drill down the call tree has been a very useful extension to the Performance Analyzer.
  • Memory error checking is critical for all applications. The trouble with memory access errors is that, like data races, the "problem" is visible arbitrarily far from the point where the error occurred.

The release of a new version of a product is always an exciting time. It's a culmination of a huge amount of analysis, development, and testing, and it's wonderful to finally see it available for others to use. So download it and let us know what you think!

Footnote: SPEC and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results from www.spec.org as of 6 September 2010 and this report.

Parallelisation white paper

An interesting white paper on various approaches to writing parallel programs has just been released. It covers OpenMP, Threading Building Blocks, MPI, and a bunch of others.

Sunday, August 8, 2010

I want my CMT

One problem with many parallel applications is that they don't scale to large numbers of threads. There are plenty of reasons why this might be the case. Perhaps the amount of work that needs to be done is insufficient for the number of threads being used.

On the other hand there are plenty of examples where codes scale to huge numbers of threads. Often they are called 'embarrassingly parallel codes', as if writing scaling codes is something to be ashamed of. The other term for this is 'delightfully parallel' which I don't really find any better!

So we have some codes that scale, and some that don't. Why don't all codes scale? There's a whole bunch of reasons:

  • Hitting some hardware constraint, like bandwidth. Adding more cores doesn't remove the constraint - although adding more processors or systems might.
  • Insufficient work. If the problem is too small, there just is not enough work to justify multiple threads.
  • Algorithmic constraints or dependencies. If the code needs to calculate A and then B, there is no way that A and B can be calculated simultaneously.

These are all good reasons for poor scaling. But I think there's also another one that is, perhaps, less obvious. And that is access to machines with large numbers of cores.

Perhaps five years ago it was pretty hard to get time on a multicore system. The situation has completely reversed now. Obviously if a code is developed on a system with a single CPU, then it will run best on that kind of system. Over time applications are being "tuned" for multicore, but we're still looking at the 4-8 thread range in general. I would expect that to continue to change as access to large systems becomes more commonplace.

I'm convinced that as access to systems with large numbers of threads becomes easier, the ability of applications to utilise those threads will also increase. So all those applications that currently max out at eight threads, will be made to scale to sixteen, and beyond.

This is at the root of my optimism about multicore in general. Like many things, applications "evolve" to exploit the resources that are provided. You can see this in other domains like video games, where something new comes out of nowhere, and you are left wondering "How did they make the hardware do that?".

I also think that this will change attitudes to parallel programming. It has a reputation of being difficult. Whilst I agree that it's not a walk in the park, not all parallel programming is equally difficult. As developers become more familiar with it, coding style will evolve to avoid the common problems. Hence as it enters the mainstream, its complexity will be more realistically evaluated.

Wednesday, July 28, 2010

Multicore application programming on Safari books

A roughcut of Multicore Application Programming has been uploaded to Safari books. If you have access you can read it, and provide feedback or comments. If you don't have access to Safari, you can still see the table of contents, read the preface, and view the start of each chapter.

Monday, July 26, 2010

What does take_deferred_signal() mean in my profile?

Every so often you'll see take_deferred_signal() appear in the profile of an application, sometimes as quite a time-consuming function. So, what does it mean?

It actually comes from signal handling code in libc. If a signal comes in while the application is in a critical section, the signal gets deferred until the critical section is complete. When the application exits the critical section, all the deferred signals get taken.

Typically, this function becomes hot due to mutex locks in malloc and free, but other library calls can also cause it. The way to diagnose what is happening is to examine the call stack. So let's run through an example. Here is some multithreaded malloc/free heavy code.


#include <stdlib.h>
#include <pthread.h>

void *work( void* param )
{
  while ( 1 ) { free( malloc(100) ); }
}

int main()
{
  pthread_t thread;
  pthread_create( &thread, 0, work, 0 );
  for ( int i=0; i<10000000; i++ )
  {
    free ( malloc (100) );
  }
}
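
The profile below was gathered with the Performance Analyzer. A typical sequence looks something like the following - code.c is a placeholder name, and the experiment name may differ on your system:

$ cc -g -O code.c -o code -lpthread
$ collect ./code
$ er_print -functions test.1.er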

Profiling, we can see that take_deferred_signal() is the hottest function. The other hot functions would probably give us a clue as to the problem, but that is an artifact of the rather simple demonstration code.

Excl.     Incl.      Name
User CPU  User CPU
  sec.      sec.
36.456    36.456     <Total>
14.210    14.210     take_deferred_signal
 4.203    21.265     mutex_lock_impl
 3.082     3.082     clear_lockbyte
 2.872    17.062     mutex_trylock_adaptive

The next thing to look at is the call stack for take_deferred_signal() as this will tell us who is calling the function.

Attr.      Name
User CPU
  sec.
14.210     do_exit_critical
14.210    *take_deferred_signal

do_exit_critical() doesn't tell us anything new; we already know that it is called when the code exits a critical section. Continuing up the call stack we find:

Attr.      Name
User CPU
  sec.
14.190     mutex_trylock_adaptive
 0.020     mutex_unlock
 0.       *do_exit_critical
14.210     take_deferred_signal

This is more useful: we now know that the time is spent in mutex locks, but we don't yet know the user of those mutex locks. In this case the bulk of the time comes from mutex_trylock_adaptive(), so that is the routine to investigate:

Attr.      Name
User CPU
  sec.
17.062     mutex_lock_impl
 2.872    *mutex_trylock_adaptive
14.190     do_exit_critical

So we're still in the mutex lock code; we need to find out who is calling the mutex locks:

Attr.      Name
User CPU
  sec.
11.938     free
 9.327     malloc
 4.203    *mutex_lock_impl
17.062     mutex_trylock_adaptive

So we finally discover that the time is due to calls to mutex locks in malloc() and free().

Friday, July 9, 2010

White paper on using Oracle Solaris Studio

I contributed a fair amount of material to a recent white paper about Oracle Solaris Studio. The paper is available for download from the developer portal.

Optimizing Applications with Oracle Solaris Studio Compilers and Tools

Oracle Solaris Studio delivers a fully integrated development platform for generating robust high-performance applications for the latest Oracle Sun systems (SPARC and x86). In order to take full advantage of the latest multicore systems, applications must be compiled for optimal performance and tuned to exploit the capabilities of the hardware. Learn how Oracle Solaris Studio helps you generate the highest performance applications for your target platform, from selecting the right compiler flags and optimization techniques to simplifying development with advanced multicore tools.

Presenting at Oracle Develop

As part of Oracle Develop, I'll be presenting at the Hotel Nikko, in San Francisco on 20th September at 4pm. The session is S317573 titled "Multicore Application Programming with Oracle Solaris Studio". The abstract reads as follows:

Writing correct and fast parallel applications is often considered a hard problem. However, it doesn't need to be that way. This session will describe how Oracle Solaris Studio can be used to produce applications that are both fast and correct. The talk will cover parallelization strategies, implementation details, and common pitfalls, as well as describing how the tools provided by Oracle Solaris Studio can identify coding errors and performance opportunities in the application.

Thursday, July 8, 2010

Multicore application programming: update

It's 2am and I've just handed over the final manuscript for Multicore Application Programming. Those who know publishing will realise that this is not the final step. The publishers will layout my text and send it back to me for a final review before it goes to press. It will probably take a few weeks to complete the process.

I've also uploaded the final version of the table of contents. I've written the book using OpenOffice.org. It's almost certain not to be a one-to-one mapping of pages in my draft to pages in the finished book. But I expect the page count to be roughly the same - somewhere around 370 pages of text. It will be interesting to see what happens when it is properly typeset.

Wednesday, June 30, 2010

The solution is multicore

Professor David Patterson wrote an interesting article in IEEE Spectrum, 'The trouble with multicore'. The tag line is "Chipmakers are busy designing microprocessors that most programmers can't handle". The thrust of the article is that multicore processors are a hardware development that software is poorly equipped to utilise.

There are two main arguments made in the article. The first is that programming languages are very poor at describing parallelism. There has been a long list of languages that were either designed to tackle parallelism or have had parallelism imposed upon them. To be fair, parallel programming is littered with the ill-conceived corpses of languages that were meant to solve the problem. So his view is correct, but perhaps this is not relevant.

The second point he makes is that not all tasks break down to independent work. His example is that of ten reporters writing the same story, and not being able to write the story ten times faster because each section of text has to build on the previous sections. Again, this is true. There are some tasks that have implicit (or explicit) dependencies, but perhaps this is not relevant.

The example in his paper that best illustrates how multicore is the solution, and not the problem, is that of cloud computing. As he says "Expert programmers can take advantage of the task-level parallelism inherent in cloud computing.". Are you an expert programmer when you type a search term into Google? A lot of computation goes into finding the results for you, but they appear nearly instantly. It could be argued that Google put a considerable amount of effort into designing a system that produced results so quickly. Of course they did. However, they did it once, and it's used for millions of search queries every day.

Observation 1: Many problems just need parallelising once. Or conversely, not every developer needs to worry about the parallelism – in the same way as not every developer on a project needs to worry about the GUI.

But this only addresses part of the argument. It is all very well using an anecdotal example to demonstrate that it is possible to utilise multiple cores, but that does not disprove Professor Patterson's argument.

Let's return to the example of the reporters. The way the reporters are working is perhaps not the best use of their resources. Much of the work of reporting is fact checking, talking to people, and gathering data. The writing part of this is only the final step in a long pipeline. Perhaps a better way of utilising the ten reporters would be during the data gathering stages, multiple people could be interviewed simultaneously, multiple sources consulted at the same time. On the other hand, a newspaper would rarely allocate more than a single reporter to a single story. More progress would be made if each reporter was working on a different story. So perhaps the critical observation is that dependencies within a task are an indication that parallelism needs to be discovered outside that task.

Observation 2: It is rare that there are no other ways of productively utilising compute resources. Meaning that given a number of cores, it is almost always possible to find work to keep them busy. For example, rendering a movie could have cores working on separate frames, or separate segments of the same frame. Sequencing genes could have multiple genes being examined simultaneously. Simulation models of different scenarios could be completed in parallel.

But, it can be argued that there are times when you need to do a single task, and you care how long that task takes to complete. So, let's consider exactly what problems we encounter during our day where we would benefit from a faster processor.

  • "I waited for my PC to boot.". Well booting a PC is pretty much a serial process, however, the boot time is largely dominated by disk access time rather than processor speed.
  • "I waited for my e-mail to download". Any downloading activity, be it e-mail or webpages is going to be dominated by network latency or bandwidth issues. There is undoubtedly some processor activity in the mix, but it is unlikely that a fast processor would make a noticeable difference to performance.
  • "I was watching a video when my virus scanner kicked in and caused the movie to stutter." Assuming it wasn't a disc activity, this is a great example of where having multiple cores will help rather than hinder. Two cores would allow the video to continue playing while the virus scanner did its work. This was, of course, the frequently given example of why multicore processors were a good thing – as if virus scanner were a desirable use of processor time!
  • "I was compiling an application and it took all afternoon." Some stages of compilation, like linking or crossfile optimisation, are inherently serial. But, unless the entire source code was placed into a single file, most projects have multiple source files, so these could be compiled in parallel. Again, the performance can be dominated by disk or network performance, so it is not entirely a processor performance issue.

These are a few situations where you might possibly feel frustration at the length of time a task takes. You may have plenty more. The point is that it is rare that there is no parallelism available, and no opportunity to make parallel progress on some other task.

Observation 3: There are very few day-to-day tasks that are actually limited by processor performance. Most tasks have substantial bottlenecks in other parts of the system (disk, network, speed of devices). If anything, having multiple cores enables a system to remain useful while other compute tasks are completed.

All this discussion has not truly refuted Professor Patterson's observation that there exist problems which are inherently serial, or fiendishly difficult to parallelise. But that's ok. Most commonly encountered computational activities are either easy to parallelise, or there are ways of extracting parallelism at other levels.

But what of software? There is great allure to using threads on a multicore processor to deliver many times the performance of a single core processor. And this is the crux of the matter. Advances in computer languages haven't 'solved' this problem for us. It can still be hard, for some problems, to write parallel programs that are both functionally correct and scale well.

However, we don't all need to solve the hard problems. There are plenty of opportunities for exploiting parallelism in a large number of common problems, and in other situations there are opportunities for task level parallelism. This combination should cover 90+% of the problem space.

Perhaps there are 10% of problems that don't map well to multicore processors, but why focus on those when the other 90% do?

Wednesday, June 9, 2010

Runtime analysis in the Solaris Studio IDE

I was pleasantly surprised to find support for runtime analysis embedded in the Solaris Studio IDE. This analysis uses the Performance Analyzer to gather data as the code is running and then presents this data both as timeline views over the runtime of the application, and also source code annotations. Here's the view as the data is gathered.


The tool gathers profile data which is shown as an aggregation of time spent in each routine, and also annotated against each line of source.


The other thing the tool is able to track is memory leaks, again reporting the amount leaked, and attributing the leaks to the lines of source where the data was allocated.


Tuesday, June 8, 2010

Setting thread/process affinity

In some instances you can get better performance, or reproducibility, by restricting the processors that a thread runs on. Linux has pthread_setaffinity_np() (the 'np' suffix means non-portable). On Solaris you have a number of nearly equivalent options:

  • Processor sets, where you create a set of processors and allow only particular processes to run on this set.
  • processor_bind(), where you bind a particular process or thread to a particular virtual CPU. The thread cannot migrate off this CPU, but other threads can run on it. This means that you need to coordinate between different processes to ensure that they are not all bound to the same CPU. A minimal sketch of the call is shown after this list.
  • Locality groups. On a NUMA system, a locality group is a set of CPUs which share the same local memory. Processes that remain executing on processors within their locality group will continue to get low memory access times; if they get scheduled to processors outside the group, their memory access times may increase.
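
As mentioned above, here's a minimal sketch of binding the calling thread with processor_bind(); binding to virtual processor 0 is just an example:

#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

int main()
{
  processorid_t obind;
  // Bind the calling LWP to virtual processor 0; obind returns the previous binding.
  if ( processor_bind( P_LWPID, P_MYID, 0, &obind ) != 0 )
  {
    perror( "processor_bind" );
    return 1;
  }
  printf( "Previous binding: %d\n", (int)obind );
  return 0;
}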

Call trees in the Performance Analyzer

The Performance Analyzer has also had a number of new features and improvements. The most obvious one of these is the new call tree tab. This allows you to drill down into the call tree for an application and see exactly how the time is divided between the various call stacks.

Monday, June 7, 2010

Checking for memory access errors with discover

The latest Solaris Studio Express release contains the tool discover, which tests for memory access errors. These are errors like reading past the end of an array or freeing a pointer twice. The best part of the tool is that it does not require a special build of the application. The sequence is:

$ discover a.out
$ a.out

The discover command adds instrumentation to the executable, and you then run the resulting binary in the same way that you would normally run your program. The output from discover is an html file containing details of any memory access errors that the tool discovered.


Sunday, June 6, 2010

Solaris Studio Express

The latest Solaris Studio Express release is out, and there's also a feedback programme for submitting bugs and posting questions.

One of the first things I did with it was to launch the solstudio IDE. It has the expected functionality: code completion, and hints on the parameters that are expected by a function:




There's also integrated debugging:


I'll add a couple more posts over the next few days showing some other features.

Monday, May 31, 2010

Memory ordering resources

Quick links to memory ordering resources. AMD covers this in chapter 7 of their System Programming Guide. Intel covers this in chapter 8 of their System Programming Guide. For SPARC this is covered in section 9.4 of the UltraSPARC Architecture Manual.

Monday, May 17, 2010

I've uploaded the current table of contents for Multicore Application Programming. You can find all the detail in there, but I think it's appropriate to talk about how the book is structured.

Chapter 1. The design of any processor has a massive impact on its performance. This is particularly true for multicore processors since multiple software threads will be sharing hardware resources. Hence the first chapter provides a whistle-stop tour of the critical features of hardware. It is important to do this up front as the terminology will be used later in the book when discussing how hardware and software interact.

Chapter 2. Serial performance remains important, even for multicore processors. There are two main reasons for this. The first is that a parallel program is really a bunch of serial threads working together, so improving the performance of the serial code will improve the performance of the parallel program. The second reason is that even a parallel program will have serial sections of code. The performance of the serial code will limit the maximum performance that the parallel program can attain.

Chapter 3. One of the important aspects of using multicore processors is identifying where the parallelism is going to come from. If you look at any system today, there are likely to be many active processes. So at one level no change is necessary: systems will automatically use multiple cores. However, we want to get beyond that, and so the chapter discusses approaches like virtualisation as well as the more obvious approach of multi-thread or multi-process programming. One message that needs to be broadcast is that multicore processors do not need a rewrite of existing applications. However, getting the most from a multicore processor may well require that.

Chapter 4. The book discusses Windows native threading, OpenMP, and automatic parallelisation, as well as the POSIX threads that are available on OS-X, Linux, and Solaris. Although the details do sometimes change across platforms, the concepts do not. This chapter discusses synchronisation primitives, like mutex locks, in general terms, which avoids having to repeat the information in the implementation chapters.

Chapter 5. This chapter covers POSIX threads (pthreads), which are available on Linux, OS-X, and Solaris, as well as other platforms not covered in the book. The chapter covers multithreaded as well as multiprocess programming, together with methods of communicating between threads and processes.

Chapter 6. This chapter covers Windows native threading. The function names and the parameters that need to be passed to them are different to the POSIX API, but the functionality is the same. This chapter provides the same coverage for Windows native threads that chapter 5 provides for pthreads.

Chapter 7. The previous two chapters provide a low-level API for threading. This gives a great deal of control, but provides more opportunities for errors, and requires a considerable amount of code to be written for even the most basic parallel constructs. Automatic parallelisation and OpenMP place more of the burden of parallelisation on the compiler, less on the developer. Automatic parallelisation is the ideal situation, where the compiler does all the work. However, there are limitations to this approach, and this chapter discusses the current limitations and how to make changes to the code that will enable the compiler to do a better job. OpenMP is a very flexible technology for writing parallel applications. It is widely supported and provides support for a number of different approaches to parallelism.

Chapter 8. Synchronisation primitives provided by the operating system or compiler can have high overheads. So it is tempting to write replacements. This chapter covers some of the potential problems that need to be avoided. Most applications will be adequately served by the synchronisation primitives already provided; the discussion in the chapter provides insight into how hardware, compilers, and software can cause bugs in parallel applications.

Chapter 9. The difference between a multicore system and a single core system is in its ability to simultaneously handle multiple active threads. The difference between a multicore system and a multiprocessor system is in the sharing of processor resources between threads. Fundamentally, the key attribute of a multicore system is how it scales to multiple threads, and how the characteristics of the application affect that scaling. This chapter discusses what factors impact scaling on multicore processors, and also what the benefits multicore processors bring to parallel applications.

Chapter 10. Writing parallel programs is a growing and challenging field. The challenges come from producing correct code and getting the code to scale to large numbers of cores. Some approaches target scaling to large numbers of cores; other approaches address the issues of producing correct code. This chapter discusses a large number of other approaches to programming parallelism.

Chapter 11. The concluding chapter of the book reprises some of the key points of the previous chapters, and tackles the question of how to write correct, scalable, parallel applications.

Tuesday, May 11, 2010

New Book: Multicore application programming

I'm very pleased to be able to talk about my next book Multicore Application Programming. I've been working on this for some time, and it's a great relief to be able to finally point to a webpage indicating that it really exists!

The release date is sometime around September/October. Amazon has it as the 11th October, which is probably about right. It takes a chunk of time for the text to go through editing, typesetting, and printing, before it's finally out in the shops. The current status is that it's a set of documents with a fair number of virtual sticky tags attached indicating points which need to be refined.

One thing that should immediately jump out from the subtitle is that the book (currently) covers Windows, Linux, and Solaris. In writing the book I felt it was critical to try and bridge the gaps between operating systems, and avoid writing it about only one.

Obviously the difference between Solaris and Linux is pretty minimal. The differences with Windows are much greater, but, when writing to the Windows native threading API, the actual differences are more syntactic than functional.

By this I mean that the name of the function changes and the parameters change a bit, but the meaning of the function call does not change. For example, on POSIX platforms you might call pthread_create(), while on Windows you might call _beginthreadex(); the name of the function changes and there are a few different parameters, but both calls create a new thread.
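
To illustrate the point, here's a sketch (mine, not an excerpt from the book) of creating and waiting for a thread on the two platforms:

// POSIX threads (Linux, OS X, Solaris)
#include <pthread.h>

void * work( void * param ) { return 0; }

int main()
{
  pthread_t thread;
  pthread_create( &thread, 0, work, 0 );
  pthread_join( thread, 0 );
  return 0;
}

// Windows native threading - roughly the equivalent
#include <windows.h>
#include <process.h>

unsigned __stdcall winwork( void * param ) { return 0; }

int main()
{
  uintptr_t thread = _beginthreadex( 0, 0, winwork, 0, 0, 0 );
  WaitForSingleObject( (HANDLE)thread, INFINITE );
  CloseHandle( (HANDLE)thread );
  return 0;
}

The shape of the code is identical; only the names and a few extra parameters differ.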

I'll write a follow up post containing more details about the contents of the book.

Friday, April 16, 2010

Kernel and user profiling with dtrace

Just put together a short dtrace script for profiling both userland and kernel activity.

#!/usr/sbin/dtrace -s
#pragma D option quiet


profile-97
/arg1/
{
  @[pid,execname,ufunc(arg1)]=count();
}

profile-98
/arg0/
{
  @k[pid,execname,func(arg0)]=count();
}

tick-1s
{
  trunc(@,25);
  trunc(@k,25);
  printf("%5s %20s %20s %10s\n","PID","EXECNAME","FUNC","COUNT");
  printa("%5d %20s %20A %10@d\n",@);
  printa("%5d %20s %#20a %10@d\n",@k);
  trunc(@);
  trunc(@k);
}

The script samples the current pc for both userland and kernel about 100 times per second. There's some risk of over-counting since there's one probe for userland and one probe for the kernel. Every second the script prints out the top 25 user and kernel routines, broken down by pid and executable name. The output looks like:

  
  PID             EXECNAME                 FUNC                 COUNT
  556                 Xorg libpixman-1.so.0`pixman_image_unref  1
  556                 Xorg libpixman-1.so.0`pixman_fill         1
  556                 Xorg libc.so.1`memcpy                     1
    0                sched unix`dispatch_softint                1
    0                sched unix`dispatch_hardint                2
    0                sched unix`mach_cpu_idle                   91
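
For reference, the script can be run directly with dtrace; it normally needs root privileges, and profile.d is just a placeholder name:

# dtrace -s profile.d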

Saturday, April 10, 2010

New Redpoint albums

I was thrilled to see that there are two new full releases from Redpoint, "Nostalgia for Now" and "Sense of Summer". They have also set up a youtube site with a bunch of videos.

Tuesday, March 2, 2010

Compiler memory barriers

I've written in the past about memory barriers. Basically a membar instruction ensures that other processors see memory operations in the order that they appear in the source code. The obvious example is a mutex lock, where you want the memory operations that occurred while the lock was held to be visible to other processors before the memory operation that releases the lock.

There's actually another kind of memory ordering and that is the ordering used by the compiler. If you write:

  *a=1;
  *b=2;

If the compiler can determine that a and b do not alias, then there's no reason for it not to swap the stores to a and b if it thinks that will be a more optimal code pattern.

The most cross-platform way of enforcing this ordering is to put a function call between the two stores:

  *a=1;
  reorder_barrier();
  *b=2;

Memory needs to be in the program-defined state at the call, so the compiler cannot defer the store to a, and cannot hoist the store to b.

This is a great solution, but it incurs the overhead of a function call, and function calls can have significant costs. There are some compiler intrinsics that enforce the desired memory ordering without that overhead. Sun Studio 12 Update 1 supports the GCC flavour:

  *a=1;
  asm volatile ("":::"memory");
  *b=2;

You can test the performance overhead using the following code:

void barrier(){}

void main()
{
  for (int i=0; i<1000000000;i++)
  {
    barrier();
  }
} 

On the test system this code took about 5 seconds to run. The alternative code is:

void main()
{
  for (int i=0; i<1000000000;i++)
  {
    asm volatile ("":::"memory");
  }
}

This code took under a second.

Tuesday, February 23, 2010

I'm presenting at the Silicon Valley OpenSolaris Users Group on Thursday evening. I was only asked today, so I'm putting together some slides this evening on "Multicore Application Programming". The talk is going to be a relatively high level presentation on writing parallel applications, and how the advent of multicore or CMT processors changes the dynamics.

Monday, February 15, 2010

x86 performance tuning documents

Interesting set of x86 performance tuning documents.

Bitten by inlining (again)

So this relatively straightforward-looking code fails to link when built without optimisation:

#include <stdio.h>

inline void f1()
{
  printf("In f1\n");
}

inline void f2()
{
  printf("In f2\n");
  f1();
}

void main()
{
  printf("In main\n");
  f2();
}

Here's the linker error when compiled without optimisation:

% cc inline.c
Undefined                       first referenced
 symbol                             in file
f2                                  inline.o
ld: fatal: Symbol referencing errors. No output written to a.out

At low optimisation levels the compiler does not inline these functions, but because they are declared as inline functions the compiler does not generate function bodies for them - hence the linker error. To make the compiler generate the function bodies it is necessary to also declare them to be extern (this places them in every compilation unit, but the linker drops the duplicates). This can either be done by declaring them to be extern inline or by adding a second prototype. Both approaches are shown below:

#include <stdio.h>

extern inline void f1()
{
  printf("In f1\n");
}

inline void f2()
{
  printf("In f2\n");
  f1();
}
extern void f2();

void main()
{
  printf("In main\n");
  f2();
}

It might be tempting to copy the entire function body into a support file:

#include <stdio.h>

void f1()
{
  printf("In duplicate f1\n");
}

void f2()
{
  printf("In duplicate f2\n");
  f1();
}

This is a bad idea, as you might gather from the deliberate difference I've made to the source code. Now you get different code depending on whether the compiler chooses to inline the functions or not. You can demonstrate this by compiling with and without optimisation, but this only forces the issue to appear. The compiler is free to choose whether to honour the inline directive or not, so the functions selected for inlining could vary from build to build. Here's a demonstration of the issue:

% cc -O inline.c inline2.c
inline.c:
inline2.c:
% ./a.out
In main
In f2
In f1

% cc inline.c inline2.c
inline.c:
inline2.c:
% ./a.out
In main
In duplicate f2
In duplicate f1

Douglas Walls goes into plenty of detail on the situation with inlining on his blog.

Thursday, February 4, 2010

A very British computer

A couple more articles from the BBC: the Manchester Baby, and EDSAC at Cambridge.

I find it appropriate that one of the first commercial computers was used for ordering supplies for tea shops.

Tuesday, February 2, 2010

BBC articles on British computer pioneers

The BBC are running a week-long series of articles on the history of British computing. The first is on GCHQ, the second is on Colossus at Bletchley.

Although the articles are interesting, I'm slightly dubious about some of the claims: "It was not until the 1990s that general purpose machines could match Oedipus for searching through a database and finding a particular term, said Prof Lavington."

The article contains an interesting link map of the people involved. The link map includes the NPL Ace Pilot. The NPL is the National Physical Laboratory, and I was driven past it on my way to school every day. Back then it looked like a science laboratory covered in odd-shaped objects and brick buildings. Unfortunately the photos on the history page fail to capture this ramshackle nature.

Friday, January 29, 2010

Welcome to http://www.darrylgove.com

It's been rather a long time since I've posted anything to my blog. That's mainly because I've been spending my time on some other projects (more on that soon).

As everyone knows, Sun has been taken over by Oracle, and with that there's a new set of rules that govern blogging.

As folks who've been reading this blog for a while will know, it's mainly technical stuff, with some trips off into things that interest me. So I've decided to set up a blog where I can continue to post a mix of personal and technical content. I fully anticipate continuing to use my old blog to talk about technical stuff, and I'll figure out the details of how having two blogs works as I go along - I imagine that I'll be duplicating content most of the time.

So you now have a choice of where to read my material, here, or http://www.darrylgove.com/