Wednesday, December 14, 2011

Oracle Solaris Studio 12.3

Oracle Solaris Studio 12.3 was released today. You can download it here.

There's a bundle of exciting stuff that goes into every new release. The headlines are probably the introduction of the Code Analyzer tool which does dynamic and static error reporting on an application, and the ability of the IDE to be run on a remote system while the builds are done on the host.

I have a couple of other favourite areas of change. First of all we've got spot running on a bunch of recent processors - in particular the SPARC T4 (I'll write more about this later). Secondly, the filtering in the Performance Analyzer has been pushed to the foreground. Let's discuss filtering now.

Filtering is one of those technologies that is very powerful, but has been quite hard to use in previous releases. The change in this release has been that the filters have been placed on the right-click menu. Here's an example:

Adding and removing filters is now just a matter of right clicking. This allows you to rapidly drill down on the profile data, for example by filtering out activity by processor, call stack, and so on.

Wednesday, November 2, 2011

Welcome to the (System) Developer's Edge

The Developer's Edge went out of print a while back. This was obviously frustrating, not just for me, but for the folks who contacted me asking what happened. Well, I'm thrilled to be able to announce that it's available as a pdf download.

This is essentially the same book as was previously available. I've not updated the links back to the original articles; it would have been problematic, as in some instances the original articles no longer exist. There are only two significant changes. The first is that the branding has been changed (there's no cover art, which keeps the download small). The second is that the title of the book has been modified to include the word "system" to indicate that it's focused towards the hardware end of the stack.

I hope you enjoy the System Developer's Edge.

Friday, October 21, 2011


SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.

Think of big endian as writing numbers as we would normally do. For example one thousand, one hundred and twenty would be written as 1120 using a big-endian format. However, writing as little endian it would be 0211 - the least significant digits would be recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.
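
As a rough illustration of what "which bytes are stored first" means, here is a minimal C++ sketch. The function names are made up for this example and do not come from any library. The first function inspects how the host lays out a 32-bit value in memory; the second writes a value in big-endian (network) order regardless of the host.

```cpp
#include <cstdint>
#include <cstring>

// Returns true when the least significant byte of a 32-bit value is stored
// at the lowest address, i.e. the host is little-endian.
bool host_is_little_endian()
{
  const std::uint32_t value = 0x01020304u;
  unsigned char bytes[sizeof(value)];
  std::memcpy(bytes, &value, sizeof(value));   // copy out the in-memory layout
  return bytes[0] == 0x04;
}

// Writes a 32-bit value into buf in big-endian (network) order, whatever
// the endianness of the host.
void store_big_endian(std::uint32_t value, unsigned char buf[4])
{
  buf[0] = static_cast<unsigned char>(value >> 24);  // most significant byte first
  buf[1] = static_cast<unsigned char>(value >> 16);
  buf[2] = static_cast<unsigned char>(value >> 8);
  buf[3] = static_cast<unsigned char>(value);        // least significant byte last
}
```

On SPARC host_is_little_endian() would be expected to return false, and on x86 true.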

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
{
  T out;
  char * pcin = (char*)&in;
  char * pcout = (char*)&out;

  for (int i=0; i<sizeof(T); i++)
    pcout[i] = pcin[sizeof(T)-1-i];
  return out;
}

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

First thing to look at is instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
{
  T out=0;
  for (int i=0; i<sizeof(T); i++)
  {
    out<<=8; out|=(in&255); in>>=8;
  }
  return out;
}

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. This is higher cost than the load and store per byte. However, we can usually do more logical operations per cycle and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.

However, you will usually see a greater performance gain from avoiding the RAW hazards. Obviously RAW hazards are hardware dependent - some processors may be engineered to avoid them. In which case you will only see a problem on some particular hardware. Which means that your application will run well on one machine, but poorly on another.
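
For a fixed 32-bit width the same register-only idea can be written without a loop. This is only a sketch: the name swap32 is made up for this example, and whether it beats the templated version depends on the compiler and the processor.

```cpp
#include <cstdint>

// Reverses the byte order of a 32-bit value using only shifts, masks and ORs,
// so the value never has to be stored and reloaded at a different width.
std::uint32_t swap32(std::uint32_t in)
{
  return  (in >> 24)                    // byte 3 moves to byte 0
       | ((in >>  8) & 0x0000ff00u)     // byte 2 moves to byte 1
       | ((in <<  8) & 0x00ff0000u)     // byte 1 moves to byte 2
       |  (in << 24);                   // byte 0 moves to byte 3
}
```

Applying swap32() twice returns the original value, which makes for an easy sanity check.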

Differences between the various STL options on Solaris

Steve Clamage has provided a nice summary of the trade-offs between the various STL options. I'll summarise it here:

  • Default STL. Available as part of the OS so does not require a separate library to be shipped with the application. However, does not support the standard.
  • -library=stlport4 Much better conformance with the standard, but no internationalisation. Must be distributed with applications that use it.
  • -library=stdcxx4 (Apache). Complete implementation of standard. Available on S10U10 and onwards.

I'd also add that stlport4 and stdcxx4 typically have much better performance than the default library.

The other point that bears repetition is that you can only include one STL per application. So you cannot use different implementations for different libraries or for the application.

Sunday, October 2, 2011

Best practices for developing top-performing C/C++ Applications

I'll be presenting at Oracle Open World tomorrow. The title of the presentation is "Best practices for developing top-performing C/C++ applications". The presentation is at 11:00am in Golden Gate C1 at the Marriott Marquis.

Friday, August 19, 2011

Terminal corba errors

Every so often the terminal comes up with a corba error "Adding client to server's list failed, CORBA error:", as shown in the image.

The solution seems to be to delete

rm ~/.gconfd/saved_state ~/.gconfd/saved_state.tmp 

And then to kill gconfd

pkill gconfd-2

Seems to work for me, but I can't guarantee that it won't do anything nasty to your system.

Tuesday, August 9, 2011

Standards and headers

Every so often I encounter, or hear about, a problem with function definitions when the standard header files are included. Most often it's mmap, but sometimes it's something else. Every time I think that I should write something up. Well, it's finally happened: a short paper on how to write portable code using the standard headers.

Monday, August 1, 2011

Standard header files

Interesting (but old) blog post about the standard header files included with Solaris.

Oracle Solaris Studio 12.3 Beta Programme

Last week, we started the beta programme for Oracle Solaris Studio 12.3. You can participate by downloading the software and reporting any issues.

As with any release, there's a lot of incremental improvements wherever we find opportunities, and there's a couple of new features. The two most interesting new features are:

  • The Code Analyzer, which reports possible errors in your application, both dynamic (i.e. memory access errors) and static. The static error detection is the newest feature; it goes beyond the compile-time warnings or lint messages, and does much more detailed compile-time analysis of your code.
  • Remote development on Windows. I'm yet to try out this feature, but the IDE has the ability to run remotely on a Windows box seamlessly compiling and running on a remote server. In fact the improvements in the IDE are well worth a look.

Some of the Studio team are giving a webcast on Thursday 4th August at 9am PDT.

Thursday, July 14, 2011

Best practices for libraries and linkers (part 8)

Part 8 is the conclusion of the series on the best practices for libraries and linking. The core set of best practices are:

  • Ensure at link time that all symbols are resolved.
  • Minimise the number of symbols of global scope.
  • Specify the library search paths at link time.

Putting this series of articles together turned out to be a fair amount of work. Hopefully you can see from the scale of the topics why we chose to break it down into bite-sized chunks. I'll be happy to hear feedback on whether you found it useful, or what other topics you would like discussed.

Using symbol scoping. Libraries and linker best practices part 7

In general the compiler is going to scope symbols declared in object files as being global. This means that they can be seen and bound to by any object. There are two other settings for symbol scope - "symbolic" and "hidden".

Hidden scope is easiest to describe as it just means that the symbol can only be seen within the module and is not exported for applications or libraries to use. This is basically a locally defined symbol. There are multiple advantages to using hidden scoping when possible. It reduces the number of symbols that the linker needs to handle at runtime, so reduces start up time. It also reduces the number of names, so reduces the chance of duplicate names. Finally hidden symbols cannot be bound to externally, so they cannot cause a link order problem. This makes hidden scope a good choice for all those symbols that don't need to be exported.

The other option is symbolic scope. A symbol with symbolic scope is still available for other modules to bind to - so it is like a global symbol in that respect. However, a symbolic symbol can only be satisfied from within the library or application. So if I have an unresolved symbolic symbol foo() then that symbol can only bind within the library or application. So symbolic-scoped symbols avoid the cross-library issue that causes link order problems.

Symbols can be declared with their scoping: __global, __symbolic, or __hidden. We can also use the compiler flag -xldscope=<scope> to set the default scoping for all the symbols not otherwise scoped.
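
Here is a small sketch of how the specifiers might be used in source. The macro and function names (LIB_EXPORT, LIB_LOCAL, lib_api, lib_helper) are invented for this example. The gcc branch uses visibility attributes as an approximate analogue of the Studio specifiers; that mapping is an assumption for illustration rather than an exact equivalence.

```cpp
// Map two hypothetical macro names onto the compiler's scoping specifiers.
#if defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#  define LIB_EXPORT __symbolic
#  define LIB_LOCAL  __hidden
#elif defined(__GNUC__)
   // Assumed analogue: "protected" for __symbolic, "hidden" for __hidden.
#  define LIB_EXPORT __attribute__((visibility("protected")))
#  define LIB_LOCAL  __attribute__((visibility("hidden")))
#else
#  define LIB_EXPORT
#  define LIB_LOCAL
#endif

// Not exported: only code inside this library can bind to it.
LIB_LOCAL int lib_helper(int x) { return x * 2; }

// Exported, but references from inside the library bind to this definition.
LIB_EXPORT int lib_api(int x) { return lib_helper(x) + 1; }
```

Any symbol left unmarked would pick up whatever default the -xldscope flag sets.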

The details of all this are discussed much more thoroughly in Part 7 of the series.

The best practices for symbol scoping come in two flavours:

The easiest way of handling scoping is to declare all the defined symbols to have symbolic scoping (-xldscope=symbolic). This ensures that these symbols end up with local binding rather than pulling in definitions that are present in other libraries. The downside of this is that it could cause multiple definitions for the same symbol to become present in the address space of an application.

The other approach is to carefully define interfaces by declaring exported symbols to be __symbolic, so that other libraries can bind to them, but this library will bind to the local versions in preference. Then to declare imported symbols as __global which will ensure that the library can bind to an external definition for the symbol. Then finally use -xldscope=hidden to avoid further pollution of the name space. This is time consuming but reduces runtime link costs, and also increases the robustness of the application.

Setting the initialisation order for libraries (Best practices for libraries and linking part 6)

Part 5 of the series talked about diagnosing initialisation problems. These are situations where the libraries are loaded in the wrong order and this causes the application not to function correctly (or at all). Part 6 discusses how to resolve this problem.

The easiest, but the least reliable approach is to reorder the libraries on the link line until they get initialised in the right order. This is an easy fix since it is just a matter of changing the link line, but it's not reliable. There are various reasons why this is a poor fix. It is limited to just fixing the one application, and does not fix the root of the problem. It is also not robust, as a change in one of the libraries may cause the whole problem to recur. Better fixes involve avoiding the duplicate symbol problem that causes the library load order to be indeterminate.

If the symbols are introduced because of C++ templates, then the -instlib=<library> flag causes the compiler not to generate symbols that are defined in the listed libraries.

Direct binding is another approach which records the exact library dependencies at link time so that the linker knows exactly which libraries are required, and hence can determine the appropriate load order. This has the downside that it enables different libraries to bind to different definitions of the same symbol. This could be a useful feature, but could also introduce problems.

Tuesday, July 12, 2011

Feature Test Macros

Feature test macros are a set of macros that are either:

  • Defined by the development environment indicating that the environment conforms to a particular standard


  • Defined by the source code for the application before the header files are included to indicate that the application requires a particular environment to build

The macros define what APIs are available, and what parameters are passed through the APIs. Adherence to a particular standard (like POSIX) will define a particular set of APIs, and define their parameters. A good example of this is on Solaris where munmap changes definition depending on what standards have been requested:

$ grep munmap /usr/include/sys/*.h
/usr/include/sys/mman.h:extern int munmap(void *, size_t);
/usr/include/sys/mman.h:extern int munmap(caddr_t, size_t);
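
The second kind of macro, the one set by the application, might look like the following sketch. The function name is made up for this example. Note that some compilers predefine their own feature test macros (g++ on Linux defines _GNU_SOURCE, for instance), so the headers may end up exposing a newer level than the one requested here.

```cpp
// Request the POSIX.1-2001 interfaces. This has to appear before the first
// #include, otherwise the headers have already chosen their definitions.
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <unistd.h>

// Reports the value of _POSIX_C_SOURCE that is in effect once the headers
// have been processed (hypothetical helper, for illustration only).
long posix_level_in_effect()
{
  return _POSIX_C_SOURCE;
}
```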

The Linux man page for feature_test_macros includes useful source code (ftm.c) for reporting which feature test macros are set by default. This changes depending on the OS and compiler used. One of the big differences between Linux and Solaris is the set of feature test macros that are defined by default. Here's the output from the program compiled on a Linux box (first) and a Solaris box (second) - both using gcc.


$ gcc ftm.c
$ ./a.out
_POSIX_C_SOURCE defined: 200809L
_BSD_SOURCE defined
_SVID_SOURCE defined


$ gcc ftm.c
$ ./a.out
_FILE_OFFSET_BITS defined: 32

The list of standards that Solaris 10 adheres to is documented under man standards, the list for Linux is documented under man feature_test_macros.

Monday, July 11, 2011

OpenMP 3.1 specification released

OpenMP is a great way to produce parallel applications with the minimal amount of work. The 3.1 specification came out a couple of days ago. As should be apparent from the version number, it's more incremental than significant. The significant changes I see are:

  • Support for min and max reductions in C/C++. This was a frustrating omission from the previous versions, so I'm pleased to see that fixed here.
  • Support for thread binding. The specification introduces OMP_PROC_BIND which binds threads to cores. This is rather similar to the original SUNW_MP_PROCBIND in Studio, which only took true or false; more recent compilers allow a much finer granularity of control. Still "true" or "false" is a good start!
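
A minimal sketch of the new max reduction in C++ follows. The function name is made up for this example, and the array is assumed to hold at least one element. If the code is built without OpenMP enabled the pragma is ignored and the loop simply runs serially.

```cpp
// Finds the largest element of data[0..n-1] using an OpenMP 3.1 max
// reduction. Assumes n >= 1.
int largest(const int *data, int n)
{
  int best = data[0];
  #pragma omp parallel for reduction(max:best)
  for (int i = 1; i < n; i++)
  {
    if (data[i] > best) best = data[i];   // each thread keeps its own maximum
  }
  return best;                            // per-thread maxima have been combined
}
```

Setting OMP_PROC_BIND=true in the environment before running would then ask the runtime to bind the threads.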

Wednesday, June 29, 2011

Library initialisation in C++ - libraries and linking part 5

Part 5 of the series of articles on linking and libraries is up. This one gets into the details of what can go wrong when writing libraries in C++. The key take aways from the article are to use:

  • LD_DEBUG=init to view runtime initialisation
  • LD_DEBUG=bindings to examine how symbols are bound to libraries at runtime

Wednesday, June 1, 2011

Avoiding problems at linktime (part 4 in series)

Part 4 in the series on best practices for linking is available. The key takeaways are:

  • Avoid defining duplicate symbols. The Solaris tool lari will produce a report on this issue (besides doing a bundle of other stuff). The problem with multiple definitions of symbols is that it is not predictable which definition will be picked at runtime. This is often deterministic on a particular platform, but could change on a different platform.
  • Always define libraries as a hierarchy, with no circular dependencies. If there are circular dependencies the libraries may get loaded in an unpredictable order.

Friday, May 27, 2011

Using LD_DEBUG to examine application startup (linking best practices part 3)

Part 3 of the series on best practices for linking C/C++ applications is up. This section focuses on using LD_DEBUG to examine application startup.

The paper talks about the options LD_DEBUG=init which shows the initialisation and finalisation stages of an application's run, and LD_DEBUG=bindings which shows how the symbols are bound between the application and libraries.

Tuesday, May 24, 2011

Best practices for linking - part 2

Part 2 of the article on library linking best practices is up on OTN. This is a relatively short read about ensuring that the library records its dependencies.

The relevant options are:

  • -z defs which will cause the linker to report any unresolved symbols found in the library. This is the default for applications, but is not the default for libraries. Using this flag requires that all the libraries that are required for successful linking are listed on the link line. Doing this will ensure that the library will fail to link rather than fail at runtime.
  • The command ldd -U -r will report if the library (or executable) is linked to libraries that it does not use. This is helpful in ensuring that the minimal number of libraries are loaded in order for an application to run.

Wednesday, May 18, 2011

Profiling running applications

Sometimes you want to profile an application, but you either want to profile it after it has started running, or you want to profile it for part of a run. There are a couple of approaches that enable you to do this.

If you want to profile a running application, then there is the option (-P <pid>) for collect to attach to a PID:

$ collect -P <pid>

Behind the scenes this generates a script and passes the script to dbx, which attaches to the process, starts profiling, and then stops profiling after about 5 minutes. If your application is sensitive to being stopped for dbx to attach, then this is not the best way to go. The alternative approach is to start the application under collect, then collect the profile over the period of interest.

The flag -y <signal> will run the application under collect, but collect will not gather any data until profiling is enabled by sending the selected signal to the application. Here's an example of doing this:

First of all we need an application that runs for a bit of time. Since the compiler doesn't optimise out floating point operations unless the flag -fsimple is used, we can quickly write an app that spends a long time doing nothing:

$ more slow.c
int main()
{
  double d=0.0;
  for (long long i=0;i<10000000000LL; i++) {d+=d;}
}

$ cc -g slow.c

The next step is to run the application under collect with the option -y SIGUSR1 to indicate that collect should not start collecting data until it receives the signal USR1.

$ collect -y SIGUSR1 ./a.out &
[1] 1187
Creating experiment database ...

If we look at the generated experiment we can see that it exists, but it contains no data.

$ er_print -func
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
0.        0.         

To start gathering data we send SIGUSR1 to the application; sending the signal again stops data collection. By sending the signal twice, two seconds apart, we can collect two seconds of data:

$ kill -SIGUSR1 1187;sleep 2;kill -SIGUSR1 1187
$ er_print -func
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
2.001     2.001      
2.001     2.001      main
0.        2.001      _start

Wednesday, May 11, 2011

Best practices for linking libraries (part 1)

A while ago I was looking into some application start up problems. The problem turned out to be an issue relating to the order in which the libraries were loaded and initialised. It seemed to me that this was a rather tricky area, and it would be very helpful to document the best practices around it. I thought this would be a quick couple of pages, but it turned out to be a rather high page count, and I ended up working on the document with Steve Clamage (with Rod Evans helping out).

The first part of the document is available. This section covers basic linker good practices: using -L and -R rather than LD_LIBRARY_PATH, generating relocatable code, etc. The key take aways are:

  • Use -L to specify the path to where the libraries can be found at link time.
  • Use -R to specify the location of the libraries at run time.
  • Use the token $ORIGIN to specify a relative path for the libraries' location. This avoids the need to have a hard-coded location where the libraries can be found.

Friday, May 6, 2011

Calling functions

I was looking at some code today and it reminded me of a very common performance issue - reloading data around calls. Suppose I have some code like:

int variable;

void function(int *array)
{
  for (int i=0; i<1000; i++)
  {
     if (variable==1) { func1(array[i]); } else { func2(array[i]); }
  }
}

You might be surprised to find that "variable" is reloaded every iteration of the loop. The reason for this is that the loop calls another function - either func1() or func2() - and the compiler knows that the function might change the value of "variable", so to be correct it needs to be reloaded.

This problem can be fixed by caching a local copy of the variable. The compiler "knows" that local (or stack based) variables don't get modified by function calls.
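
A sketch of the fix is shown here. The bodies of func1() and func2() are stand-ins invented for this example, as are the name process() and the running total. Bear in mind that the transformation is only safe when the called functions really do leave the global alone, since the loop no longer sees any update to it.

```cpp
int variable = 1;                 // global, visible to every function

// Hypothetical callees, standing in for the func1()/func2() in the text.
int func1(int x) { return x + 1; }
int func2(int x) { return x - 1; }

int process(const int *array, int n)
{
  const int local = variable;     // read the global once, before the loop
  int total = 0;
  for (int i = 0; i < n; i++)
  {
    // 'local' cannot be changed by the calls, so it can stay in a register.
    total += (local == 1) ? func1(array[i]) : func2(array[i]);
  }
  return total;
}
```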

However, the problem is more general than this, in C++ you might observe a reloading of variables that are members of objects - for similar reasons. The general rule for avoiding this is to examine every load or store in the hot region of code to check whether it is necessary, or whether it has been introduced because of a function call.
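
The member-variable case can be handled the same way. The class below is hypothetical and exists only to illustrate the pattern: the member is copied into a local before the loop, so the call inside the loop does not force it to be reloaded through the this pointer.

```cpp
// Hypothetical class, used only to illustrate the member-variable case.
struct Filter
{
  int threshold;
  int accepted;

  void record(int) { accepted++; }   // stand-in for an out-of-line call

  void scan(const int *data, int n)
  {
    const int limit = threshold;     // copy the member into a local once
    for (int i = 0; i < n; i++)
    {
      if (data[i] > limit) record(data[i]);
    }
  }
};
```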

Thursday, April 28, 2011

Exploring Performance Analyzer experiments

I was recently profiling a script to see where the time went, and I ended up wanting to extract the profiles for just a single component. The structure of an analyzer experiment is that there's a single root directory, and inside that there's a single level of subdirectories representing all the child processes. Each subdirectory is a profile of a single process - these can all be loaded as individual experiments. Inside every experiment directory there is a log.xml file. This file contains a summary of what the experiment contains.

The name of the executable that was run is held on an xml "process" line. So the following script can extract a list of all the profiles of a particular application.

$ grep myapp `find . -name 'log.xml'`|grep process | sed 's/\:.*//' > myapp_profiles

Once we have a list of every time my application was run, I can extract the times from that list, and sort them using the following line:

$ grep exit `cat <myapp_profiles`|sed 's/.*tstamp=\"//'|sed 's/\".*//'|sort -n 

Once I have the list of times I can then locate an experiment with a particular runtime - it's probably going to be the longest runtime:

$ grep exit `cat <myapp_profiles`|grep 75.9 

Catching the macro bug

I have to admit a dislike for macros. I've seen plenty of codes where it has been a Herculean task to figure out exactly what source code generated the particular assembly code. So perhaps I'm biased to begin with. However, I recently hit another annoyance with macros. The following code looks pretty benign:

#include <stdio.h>
#include <sys/time.h>

int timercmp(struct timeval *end, struct timeval *begin,struct timeval *result)

However, at compile time it produces the following error.

cc error.c
"error.c", line 4: syntax error before or at: struct
"error.c", line 4: syntax error before or at: )
"error.c", line 4: warning: old-style declaration or incorrect type for: tv_sec
"error.c", line 4: syntax error before or at: )
"error.c", line 4: warning: old-style declaration or incorrect type for: tv_sec
"error.c", line 4: syntax error before or at: )
"error.c", line 4: warning: old-style declaration or incorrect type for: tv_usec
"error.c", line 4: syntax error before or at: ->
"error.c", line 4: warning: old-style declaration or incorrect type for: tv_usec
"error.c", line 4: syntax error before or at: )
"error.c", line 4: warning: old-style declaration or incorrect type for: tv_sec
"error.c", line 4: identifier redefined: result
        current : function(pointer to struct timeval {long tv_sec, long tv_usec}) returning pointer to struct timeval {long tv_sec, long tv_usec}
        previous: function(pointer to struct timeval {long tv_sec, long tv_usec}) returning pointer to struct timeval {long tv_sec, long tv_usec} : "error.c", line 4
"error.c", line 4: syntax error before or at: ->
"error.c", line 4: warning: old-style declaration or incorrect type for: tv_sec
cc: acomp failed for error.c

The C++ compiler produces fewer errors:

 CC error.c
"error.c", line 4: Error: No direct declarator preceding "(".
1 Error(s) detected.

Of course, the problem is that timercmp is a macro defined in sys/time.h. This is revealed when the preprocessed source is examined:

$ cc -P error.c
$ tail error.i

int  ( ( ( struct timeval * end ) -> tv_sec == ( struct timeval * begin ) -> tv_sec ) ? ( ( struct timeval * end ) -> tv_usec struct timeval * result ( struct timeval * begin ) -> tv_usec ) : ( ( struct timeval * end ) -> tv_sec struct timeval * result ( struct timeval * begin ) -> tv_sec ) )

Now, we can narrow the problem down more rapidly by trying to compile the preprocessed code. This takes us to the exact line with the problem, and it's obvious from inspection exactly what is going on:

$ cc error.i
"error.i", line 1135: syntax error before or at: struct
"error.i", line 1135: syntax error before or at: )

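If the name really has to be kept, one way round the clash is to put the function name in parentheses. A function-like macro is only expanded when its name is followed directly by an opening parenthesis, so the parenthesised name is left alone. The body below is invented for this sketch (it computes end minus begin); simply choosing a different name is usually the cleaner fix.

```cpp
#include <sys/time.h>

// The parentheses around the name stop the preprocessor from treating this
// as an invocation of the timercmp macro from <sys/time.h>.
int (timercmp)(struct timeval *end, struct timeval *begin, struct timeval *result)
{
  result->tv_sec  = end->tv_sec  - begin->tv_sec;
  result->tv_usec = end->tv_usec - begin->tv_usec;
  if (result->tv_usec < 0)          // borrow one second
  {
    result->tv_sec  -= 1;
    result->tv_usec += 1000000;
  }
  return result->tv_sec >= 0;       // non-zero when end is not before begin
}
```

Callers need the same trick, writing (timercmp)(&end, &begin, &result), otherwise the macro is expanded at the call site instead.
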
Monday, April 25, 2011

Using pragma opt

The Studio compiler has the ability to control the optimisation level that is applied to particular functions in an application. This can be useful if the functions are designed to work at a specific optimisation level, or if the application fails at a particular optimisation level, and you need to figure out where the problem lies.

The optimisation levels are controlled through pragma opt. The following steps need to be followed to use the pragma:

  • The directive needs to be inserted into the source file. The format of the directive is #pragma opt <level> (<function>). This needs to be inserted into the code before the start of the function definition, but after the function prototype.
  • The code needs to be compiled with the flag -xmaxopt=level. This sets the maximum optimisation level for all functions in the file - including those tagged with #pragma opt.

We can see this in action using the following code snippet. This contains two identical functions, both return the square of a global variable. However, we are using #pragma opt to control the optimisation level of the function f().

int f();
int g();

#pragma opt 2 (f)

int d;

int f()
{
  return d*d;
}

int g()
{
  return d*d;
}

The code is compiled with the flag -xmaxopt=5, this specifies the maximum optimisation level that can be applied to any functions in the file.

$ cc -O -xmaxopt=5 -S opt.c

If we compare the disassembly for the functions f() and g(), we can see that g() is more optimal as it does not reload the global data. The first listing below is f() (source line 10), and the second is g() (source line 15).

/* 000000          0 */         sethi   %hi(d),%o5

!   10                !  return d*d;

/* 0x0004         10 */         ldsw    [%o5+%lo(d)],%o4 ! volatile    // First load of d
/* 0x0008            */         ldsw    [%o5+%lo(d)],%o3 ! volatile    // Second load of d
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         mulx    %o4,%o3,%o0

/* 000000         14 */         sethi   %hi(d),%o5
/* 0x0004            */         ld      [%o5+%lo(d)],%o4               // Single load of d

!   15                !  return d*d;

/* 0x0008         15 */         sra     %o4,0,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         mulx    %o3,%o3,%o0

Friday, April 1, 2011

Profiling scripts

One feature that crept into the Oracle Solaris Studio 12.2 release was the ability for the performance analyzer to follow scripts. It is necessary to set the environment variable SP_COLLECTOR_SKIP_CHECKEXEC to use this feature - as shown below.

bash-3.00$ file `which which`
/bin/which:     executable /usr/bin/csh script
bash-3.00$ collect which
Target `which' is not a valid ELF executable
bash-3.00$ export SP_COLLECTOR_SKIP_CHECKEXEC=1
bash-3.00$ collect which
Creating experiment database ...

Monday, February 14, 2011

Interview with Jim Mauro

I was really pleased that Jim Mauro agreed to interview about developing for multicore processors. The interview has just gone live on the informit site.

Thursday, January 27, 2011

Don't initialise local strings

Consider the following code:

#include <stdio.h>

void s(int i)
{
  char string[2048]="";
  sprintf(string,"Value = %i",i);
  printf("String = %s\n",string);
}

The C standards require that if any elements of the character array string are initialised, then all of them should be. We can demonstrate this by compiling with gcc:

$ gcc -O -S f.c
$ more f.s
        .file   "f.c"
        .type   s, #function
        .proc   020
        save    %sp, -2160, %sp
        stx     %g0, [%fp-2064]
        add     %fp, -2056, %o0
        mov     0, %o1
        call    memset, 0
        mov    2040, %o2

You can see that explicitly initialising string caused all elements of string to be initialised with a call to memset(). Removing the explicit initialisation of string (the ="") avoids the call to memset().
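
A sketch of the version without the initialiser is shown below. The helper format_value() is made up for this example, and snprintf() is used in place of sprintf() so that the write stays within the buffer. The array is left uninitialised, which is fine here because it is always written before it is read.

```cpp
#include <cstdio>

// Formats into a caller-supplied buffer; returns the number of characters
// that the full text needs (hypothetical helper, for illustration only).
int format_value(char *out, std::size_t size, int i)
{
  return std::snprintf(out, size, "Value = %i", i);
}

void s(int i)
{
  char string[2048];                 // no initialiser, so no zero-fill of 2048 bytes
  format_value(string, sizeof(string), i);
  std::printf("String = %s\n", string);
}
```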

Saturday, January 15, 2011

pginfo & pgstat

A couple of very welcome commands crept into Solaris 11. They are pginfo and pgstat (these are links to the Oracle documentation site).

The two commands deal with "processor groups", this seems a bit of a misnomer to me as they are really about CPU topology. They report information and utilisation stats demonstrating the resource sharing going on on the system, and how threads are using those resources. It's probably easiest to use a couple of examples from the man pages to show this. First off pginfo:

$ pginfo -p -T
0 (System) CPUs: 0-31
`-- 3 (Data_Pipe_to_memory [system,chip]) CPUs: 0-31
    `-- 2 (Floating_Point_Unit [system,chip]) CPUs: 0-31
        |-- 1 (Integer_Pipeline [core]) CPUs: 0-3
        |-- 4 (Integer_Pipeline [core]) CPUs: 4-7
        |-- 5 (Integer_Pipeline [core]) CPUs: 8-11
        |-- 6 (Integer_Pipeline [core]) CPUs: 12-15
        |-- 7 (Integer_Pipeline [core]) CPUs: 16-19
        |-- 8 (Integer_Pipeline [core]) CPUs: 20-23
        |-- 9 (Integer_Pipeline [core]) CPUs: 24-27
        `-- 10 (Integer_Pipeline [core]) CPUs: 28-31

This shows a processor with 32 virtual CPUs, sharing a single floating point pipeline, 8 cores with a single integer pipe each - looks like an UltraSPARC T1 to me.

The output from pgstat shows the utilisation of the processor:

$ pgstat 1 2
 0  System                   -  0.4%  0-31
 3   Data_Pipe_to_memory     -  0.4%  0-31
 2    Floating_Point_Unit   0%  0.4%  0-31
 1     Integer_Pipeline     0%    0%  0-3
 4     Integer_Pipeline     0%    0%  4-7
 5     Integer_Pipeline     0%    0%  8-11
 6     Integer_Pipeline     0%  0.2%  12-15
 7     Integer_Pipeline     0%    0%  16-19
 8     Integer_Pipeline   2.8%  2.7%  20-23
 9     Integer_Pipeline   0.1%  0.2%  24-27
10     Integer_Pipeline     0%    0%  28-31

It reports software utilisation, meaning the work the operating system has assigned to the cores, and it can also report pipeline utilisation using the hardware counters. Pipeline utilisation indicates whether the core is saturated or not - each pipeline can be fully utilised before the core is running the maximal number of threads.

I'm pleased to see these tools appear. It is useful to have tools that report the topology of the system, and it is great to see tools that report actual hardware utilisation. On earlier releases of Solaris you can always use corestat to get similar data.

Wednesday, January 12, 2011

RAW pipeline hazards

When a processor stores an item of data back to memory it actually goes through quite a complex set of operations. A sketch of the activities is as follows. The first thing that needs to be done is that the cache line containing the target address of the store needs to be fetched from memory. While this is happening, the data to be stored there is placed on a store queue. When the store is the oldest item in the queue, and the cache line has been successfully fetched from memory, the data can be placed into the cache line and removed from the queue.

This works very well if data is stored and either never reused, or reused after a relatively long delay. Unfortunately it is common for data to be needed almost immediately. There are plenty of reasons why this is the case. If parameters are passed through the stack, then they will be stored to the stack, and then immediately reloaded. If a register is spilled to the stack, then the data will be reloaded from the stack shortly afterwards.

It could take some considerable number of cycles if the loads had to wait for the stores to exit the queue before they could fetch the data. So many processors implement some kind of bypassing. If a load finds the data it needs in the store queue, then it can fetch it from there. There are often some caveats associated with this bypass. For example, the store and load often have to be of the same size and to the same address, i.e. you cannot bypass a byte from a store of a word. If the bypass fails, the situation is referred to as a "RAW" hazard, meaning "Read-After-Write", and the load has to wait until the store has completed before it can retrieve the new value - this can take many cycles.

As a general rule it is best to avoid potential RAWs. Whether there will be a RAW hazard or not depends on the hardware and on the runtime situation, so avoiding the possibility is the best defense. Consider the following code which uses loads and stores of bytes to construct an integer.

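#include <stdio.h>
#include <sys/time.h>

/* Prints the time since the previous call, in ns per loop iteration. */
void tick()
{
  hrtime_t now = gethrtime();
  static hrtime_t then = 0;
  if (then>0) printf("Elapsed = %f\n", 1.0*(now-then)/100000000.0);
  then = now;
}

/* Reverses the byte order with four byte stores, then a word load. */
int func(char * value)
{
  int temp;
  ((char*)&temp)[0] = value[3];
  ((char*)&temp)[1] = value[2];
  ((char*)&temp)[2] = value[1];
  ((char*)&temp)[3] = value[0];
  return temp;
}

int main()
{
  int value = 0x01020304;
  tick();   /* start the timer */
  for (int i=0; i<100000000; i++) func((char*)&value);
  tick();   /* report the elapsed time */
}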

In the above code we're reversing the byte order by loading the bytes one-by-one, storing them into an integer in the correct position, and then loading the integer. Running this code on a test machine reports 12ns per iteration.

However, it is possible to perform the same reordering using logical operations (shifts and ORs) as follows:

/* Builds the integer in registers, so nothing is stored and reloaded. */
int func2(char* value)
{
  return (value[0]<<24) | (value[1]<<16) | (value[2]<<8) | value[3];
}

This modified routine takes about 8ns per iteration, which is significantly faster than the original code.

The actual speed up observed will depend on many factors, the most obvious being how often the code is encountered. The more important observation is that the speed up depends on the platform: some platforms will be more sensitive to the impact of RAWs than others. So the best advice is, wherever possible, to avoid passing data through the stack.