Thursday, January 27, 2011

Don't initialise local strings

Consider the following code:

#include <stdio.h>

void s(int i)
{
  char string[2048]="";
  sprintf(string,"Value = %i",i);
  printf("String = %s\n",string);
}

The C standards require that if any element of the character array string is explicitly initialised, then all of the remaining elements must be initialised too (to zero). We can demonstrate this by compiling with gcc:

$ gcc -O -S f.c
$ more f.s
        .file   "f.c"
...
        .type   s, #function
        .proc   020
s:
        save    %sp, -2160, %sp
        stx     %g0, [%fp-2064]
        add     %fp, -2056, %o0
        mov     0, %o1
        call    memset, 0
        mov     2040, %o2

You can see that explicitly initialising string caused all elements of string to be initialised with a call to memset(). Removing the explicit initialisation of string (the ="") avoids the call to memset().
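
For comparison, here is a sketch of the same routine without the explicit initialiser. Since sprintf() writes its own terminating nul character, clearing the buffer first buys nothing:

#include <stdio.h>

void s(int i)
{
  char string[2048];               /* uninitialised - no memset() emitted */
  sprintf(string,"Value = %i",i);  /* sprintf() nul-terminates the string */
  printf("String = %s\n",string);
}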

Saturday, January 15, 2011

pginfo & pgstat

A couple of very welcome commands crept into Solaris 11: pginfo and pgstat, both documented on the Oracle documentation site.

The two commands deal with "processor groups". That name seems a bit of a misnomer to me, as they are really about CPU topology. They report information and utilisation statistics that show how resources are shared on the system, and how threads are using those resources. It's probably easiest to use a couple of examples from the man pages to show this. First off, pginfo:

$ pginfo -p -T
0 (System) CPUs: 0-31
`-- 3 (Data_Pipe_to_memory [system,chip]) CPUs: 0-31
    `-- 2 (Floating_Point_Unit [system,chip]) CPUs: 0-31
        |-- 1 (Integer_Pipeline [core]) CPUs: 0-3
        |-- 4 (Integer_Pipeline [core]) CPUs: 4-7
        |-- 5 (Integer_Pipeline [core]) CPUs: 8-11
        |-- 6 (Integer_Pipeline [core]) CPUs: 12-15
        |-- 7 (Integer_Pipeline [core]) CPUs: 16-19
        |-- 8 (Integer_Pipeline [core]) CPUs: 20-23
        |-- 9 (Integer_Pipeline [core]) CPUs: 24-27
        `-- 10 (Integer_Pipeline [core]) CPUs: 28-31

This shows a processor with 32 virtual CPUs sharing a single floating point unit, and eight cores each with a single integer pipeline - looks like an UltraSPARC T1 to me.

The output from pgstat shows the utilisation of the processor:

$ pgstat 1 2
PG  RELATIONSHIP            HW    SW  CPUS
 0  System                   -  0.4%  0-31
 3   Data_Pipe_to_memory     -  0.4%  0-31
 2    Floating_Point_Unit   0%  0.4%  0-31
 1     Integer_Pipeline     0%    0%  0-3
 4     Integer_Pipeline     0%    0%  4-7
 5     Integer_Pipeline     0%    0%  8-11
 6     Integer_Pipeline     0%  0.2%  12-15
 7     Integer_Pipeline     0%    0%  16-19
 8     Integer_Pipeline   2.8%  2.7%  20-23
 9     Integer_Pipeline   0.1%  0.2%  24-27
10     Integer_Pipeline     0%    0%  28-31

It reports software utilisation - meaning the work that the operating system has assigned to the cores - and it can also report pipeline utilisation using the hardware performance counters. Pipeline utilisation indicates whether the core is saturated or not: a pipeline can be fully utilised before the core is running its maximal number of threads.
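
As a rough sketch (my own illustration, not from the man pages), a spin loop like the following should push up both the software utilisation and the integer pipeline utilisation of whichever core the thread lands on. Run one or more copies alongside pgstat 1 to watch the effect:

/* Hypothetical load generator: spin on integer work and watch the SW and
   HW utilisation figures in 'pgstat 1' climb. Interrupt with Ctrl-C. */
int main(void)
{
  volatile unsigned long total = 0;  /* volatile stops the loop being optimised away */
  for (;;) total++;                  /* integer adds keep one pipeline busy */
}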

I'm pleased to see these tools appear. It is useful to have tools that report the topology of the system, and it is great to see tools that report actual hardware utilisation. On earlier releases of Solaris you can always use corestat to get similar data.

Wednesday, January 12, 2011

RAW pipeline hazards

When a processor stores an item of data back to memory it actually goes through quite a complex set of operations. A sketch of the activity is as follows. First, the cache line containing the target address of the store is fetched from memory. While this is happening, the data to be stored is placed on a store queue. When the store is the oldest item in the queue, and the cache line has been successfully fetched from memory, the data can be placed into the cache line and removed from the queue.

This works very well if data is stored and either never reused, or reused after a relatively long delay. Unfortunately it is common for data to be needed almost immediately. There are plenty of reasons why this is the case. If parameters are passed through the stack, then they will be stored to the stack, and then immediately reloaded. If a register is spilled to the stack, then the data will be reloaded from the stack shortly afterwards.
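
As a hypothetical illustration (whether arguments actually travel via the stack depends on the ABI and on how much the compiler keeps in registers), a call like the following can turn into stores at the call site followed almost immediately by loads in the callee:

/* Sketch: on ABIs that pass larger structs in memory, the caller stores
   the fields to the stack and the callee reloads them straight away. */
struct pair { long a; long b; };

long sum(struct pair p)
{
  return p.a + p.b;   /* loads of values the caller has only just stored */
}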

It could take a considerable number of cycles if loads had to wait for stores to exit the queue before they could fetch the data, so many processors implement some kind of bypassing: if a load finds the data it needs in the store queue, it can fetch it from there. There are often caveats associated with this bypass. For example, the store and the load often have to be of the same size and to the same address - you cannot bypass a byte from a store of a word. If the bypass fails, the situation is referred to as a "RAW" (Read-After-Write) hazard, and the load has to wait until the store has completed before it can retrieve the new value - this can take many cycles.
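
As a minimal sketch of the size caveat (my example, and note that an optimising compiler may well keep x in a register, so treat it purely as an illustration of the pattern):

/* A word-sized store followed immediately by a byte-sized load from the
   same location. On processors that require the store and load sizes to
   match, the byte load cannot bypass the queued word store and must wait. */
char first_byte(int value)
{
  int x = value;             /* word-sized store - sits in the store queue */
  return ((char *)&x)[0];    /* byte-sized load - potential failed bypass  */
}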

As a general rule it is best to avoid potential RAWs. Whether a RAW hazard actually occurs depends on the hardware and on the runtime situation, so avoiding the possibility is the best defence. Consider the following code, which uses loads and stores of bytes to construct an integer.

#include <stdio.h>
#include <sys/time.h>

void tick()
{
  hrtime_t now = gethrtime();
  static hrtime_t then = 0;
  /* now-then is in nanoseconds; dividing by the 100000000 loop
     iterations in main() gives the time per iteration in ns */
  if (then>0) printf("Elapsed = %f\n", 1.0*(now-then)/100000000.0);
  then = now;
}

int func(char * value)
{
  int temp;
  /* four byte-sized stores into temp... */
  ((char*)&temp)[0] = value[3];
  ((char*)&temp)[1] = value[2];
  ((char*)&temp)[2] = value[1];
  ((char*)&temp)[3] = value[0];
  /* ...followed by a word-sized load of temp: a potential RAW hazard */
  return temp;
}

int main()
{
  int value = 0x01020304;
  tick();
  for (int i=0; i<100000000; i++) func((char*)&value);
}

In the above code we're reversing the byte order by loading the bytes one-by-one, storing them into an integer in the correct positions, and then loading the whole integer. Running this code on a test machine reports about 12ns per iteration.

However, it is possible to perform the same reordering using logical operations (shifts and ORs) as follows:

int func2(char* value)
{
  /* the same byte reversal using shifts and ORs - no byte-sized stores,
     so no potential RAW hazard; masking with 0xff avoids sign-extension
     if char is signed */
  return ((value[0]&0xff)<<24) | ((value[1]&0xff)<<16) |
         ((value[2]&0xff)<<8)  |  (value[3]&0xff);
}

This modified routine takes about 8ns per iteration, which is significantly faster than the original code.
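
Where the reordering really is a byte swap, it may also be worth checking whether the compiler can do it natively. As a sketch (assuming gcc or a compatible compiler, which provides the __builtin_bswap32() intrinsic - this variant is my addition, not one of the measurements above):

#include <stdint.h>
#include <string.h>

/* Load the four bytes as one word via memcpy (compilers optimise this to
   a plain load), then let the builtin reverse them - no byte-sized stores,
   so no potential RAW hazard. */
uint32_t func3(const char * value)
{
  uint32_t temp;
  memcpy(&temp, value, sizeof temp);  /* word-sized load                  */
  return __builtin_bswap32(temp);     /* byte swap, often one instruction */
}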

The actual speed up observed will depend on many factors, the most obvious being how often the code is encountered. The more important observation is that the speed up depends on the platform: some platforms will be more sensitive to the impact of RAWs than others. So the best advice is, wherever possible, to avoid passing data through the stack.