Tuesday, February 17, 2015

Profiling the kernel

One of the incredibly useful features in Studio is the ability to profile the kernel. The tool to do this is er_kernel. It's based around dtrace, so you either need to run it with escalated privileges, or you need to edit /etc/user_attr to add something like:

<username>::::defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel

The correct way to modify user_attr is with the command usermod:

usermod -K defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel <username>

There's two ways to run er_kernel. The default mode is to just profile the kernel:

$ er_kernel sleep 10
....
Creating experiment database ktest.1.er (Process ID: 7399) ...
....
$ er_print -limit 10 -func ktest.1.er
Functions sorted by metric: Exclusive Kernel CPU Time

Excl.     Incl.      Name
Kernel    Kernel
CPU sec.  CPU sec.
19.242    19.242     <Total>
14.869    14.869     <l_PID_7398>
 0.687     0.949     default_mutex_lock_delay
 0.263     0.263     mutex_enter
 0.202     0.202     <java_PID_248>
 0.162     0.162     gettick
 0.141     0.141     hv_ldc_tx_set_qtail
...

The we passed the command sleep 10 to er_kernel, this causes it to profile for 10 seconds. It might be better form to use the equivalent command line option -t 10.

In the profile we can see a couple of user processes together with some kernel activity. The other way to run er_kernel is to profile the kernel and user processes. We enable this mode with the command line option -F on:

$ er_kernel -F on sleep 10
...
Creating experiment database ktest.2.er (Process ID: 7630) ...
...
$ er_print -limit 5 -func ktest.2.er
Functions sorted by metric: Exclusive Total CPU Time

Excl.     Incl.     Excl.     Incl.      Name
Total     Total     Kernel    Kernel
CPU sec.  CPU sec.  CPU sec.  CPU sec.
15.384    15.384    16.333    16.333     <Total>
15.061    15.061     0.        0.        main
 0.061     0.061     0.        0.        ioctl
 0.051     0.141     0.        0.        dt_consume_cpu
 0.040     0.040     0.        0.        __nanosleep
...

In this case we can see all the userland activity as well as kernel activity. The -F option is very flexible, instead of just profiling everything, we can use -F =<regexp>syntax to specify either a PID or process name to profile:

$ er_kernel -F =7398

Thursday, February 5, 2015

Digging into microstate accounting

Solaris has support for microstate accounting. This gives huge insight into where an application and its threads are spending their time. It breaks down time into the (obvious) user and system, but also allows you to see the time spent waiting on page faults and other useful-to-know states.

This level of detail is available through the usage file in /proc/pid, there's a corresponding file for each lwp in /proc/pid/lwp/lwpid/lwpusage. You can find more details about the /proc file system in documentation, or reading my recent article about tracking memory use.

Here's an example of using it to report idle time, ie time when the process wasn't busy:

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>
#include <fcntl.h>
#include <procfs.h>

void busy()
{
  for (int i=0; i<100000; i++)
  {
   double d = i;
   while (d>0) { d=d *0.5; }
  }
}

void lazy()
{
  sleep(10);
}

double convert(timestruc_t ts)
{
  return ts.tv_sec + ts.tv_nsec/1000000000.0;
}

void report_idle()
{
  prusage_t prusage;
  int fd;
  fd = open( "/proc/self/usage", O_RDONLY);
  if (fd == -1) { return; }
  read( fd, &prusage, sizeof(prusage) );
  close( fd );
  printf("Idle percent = %3.2f\n",
  100.0*(1.0 - (convert(prusage.pr_utime) + convert(prusage.pr_stime))
               /convert(prusage.pr_rtime) ) );
}

void main()
{
  report_idle();
  busy();
  report_idle();
  lazy();
  report_idle();
}

The code has two functions that take time. The first does some redundant FP computation (that cannot be optimised out unless you tell the compiler to do FP optimisations), this part of the code is CPU bound. When run the program reports low idle time for this section of the code. The second routine calls sleep(), so the program is idle at this point waiting for the sleep time to expire, hence this section is reported as being high idle time.

Wednesday, February 4, 2015

Namespaces in C++

A porting problem I hit with regularity is using functions in the standard namespace. Fortunately, it's a relatively easy problem to diagnose and fix. But it is very common, and it's worth discussing how it happens.

C++ namespaces are a very useful feature that allows an application to use identical names for symbols in different contexts. Here's an example where we define two namespaces and place identically named functions in the two of them.

#include <iostream>

namespace ns1
{
  void hello() 
  { std::cout << "Hello ns1\n"; }
}

namespace ns2
{
  void hello() 
  { std::cout << "Hello ns2\n"; }
}

int main()
{
  ns1::hello();
  ns2::hello();
}

The construct namespace optional_name is used to introduce a namespace. In this example we have introduced two namespaces ns1 and ns2. Both namespaces have a routine called hello, but both routines can happily co-exist because they exist in different namespaces.

Behind the scenes, the namespace becomes part of the symbol name:

$ nm a.out|grep hello
[63]    |     68640|        44|FUNC |GLOB |0    |11     |__1cDns1Fhello6F_v_
[56]    |     68704|        44|FUNC |GLOB |0    |11     |__1cDns2Fhello6F_v_

When it comes to using functions declared in namespaces we can prepend the namespace to the name of the symbol, this uniquely identifies the symbol. You can see this in the example where the calls to hello() from the different namespaces are prefixed with the namespace.

However, prefixing every function call with its namespace can rapidly become very tedious. So there is a way to make this easier. First of all, let's quickly discuss the global namespace. The global namespace is the namespace that is searched if you do not specify a namespace - kind of the default namespace. If you declare a function foo() in your code, then it naturally resides in the global namespace.

We can add symbols from other namespaces into the global namespace using the using keyword. There are two ways we can do this. One way is to add the entire namespace into the global namespace. The other way is to symbols individually into the name space. To do this write using namespace <namespace>; to import the entire namespace into the global namespace, or using <namespace>::<function>; to just import a single function into the global namespace. Here's the earlier example modified to show both approaches:

#include <iostream>

namespace ns1
{
  void hello() 
  { std::cout << "Hello ns1\n"; }
}

namespace ns2
{
  void hello() 
  { std::cout << "Hello ns2\n"; }
}

int main()
{
  { 
    using namespace ns1; 
    hello(); 
  }
  { 
    using ns2::hello;
    hello(); 
  }
}

The other thing you will notice in the example is the use of std::cout. Notice that this is prefixed with the std:: namespace. This is an example of a situation where you might encounter porting problems.

The C++03 standard (17.4.1.1) says this about the C++ Standard Library "All library entities except macros, operator new and operator delete are defined within the namespace std or namespaces nested within the namespace std.". This means that, according to the standard, if you include iostream then cout will be defined in the std namespace. That's the only place you can rely on it being available.

Now, sometimes you might find a function that is in the std namespace is already available in the general namespace. For example, gcc puts all the functions that are in the std namespace into the general namespace.

Other times, you might include a header file which has already imported an entire namespace, or particular symbols from a namespace. This can happen if you change the Standard Library that you are using and the new header files contain a different set of includes and using statements.

There's one other area where you can encounter this, and that is using C library routines. All the C header files have a C++ counterpart. For example stdio.h has the counterpart cstdio. One difference between the two headers is the namespace where the routines are placed. If the C headers are used, then the symbols get placed into the global namespace, if the C++ headers are used the symbols get placed into the C++ namespace. This behaviour is defined by section D.5 of the C++03 standard. Here's an example where we use both the C and C++ header files, and need to specify the namespace for the functions from the C++ header file:

#include <cstdio>
#include <strings.h>

int main()
{
  char string[100];
  strcpy( string, "Hello" );
  std::printf( "%s\n", string );
}

Tuesday, February 3, 2015

Improving application performance using bit manipulation optimisations

My recent blog posts on bit manipulation are now available as an article up on the OTN community pages. If you want to read the individual posts they are:

Bit manipulation: Gathering bits

In the last post on bit manipulation we looked at how we could identify bytes that were greater than a particular target value, and stop when we discovered one. The resulting vector of bytes contained a zero byte for those which did not meet the criteria, and a byte containing 0x80 for those that did. Obviously we could express the result much more efficiently if we assigned a single bit for each result. The following is "lightly" optimised code for producing a bit vector indicating the position of zero bytes:

void zeros( unsigned char * array, int length, unsigned char * result )
{
  for (int i=0;i < length; i+=8)
  {
    result[i>>3] = 
    ( (array[i+0]==0) << 7) +
    ( (array[i+1]==0) << 6) +
    ( (array[i+2]==0) << 5) +
    ( (array[i+3]==0) << 4) +
    ( (array[i+4]==0) << 3) +
    ( (array[i+5]==0) << 2) +
    ( (array[i+6]==0) << 1) +
    ( (array[i+7]==0) << 0);
  }
}

The code is "lightly" optimised because it works on eight values at a time. This helps performance because the code can store results a byte at a time. An even less optimised version would split the index into a byte and bit offset and use that to update the result vector.

When we previously looked at finding zero bytes we used Mycroft's algorithm that determines whether a zero byte is present or not. It does not indicate where the zero byte is to be found. For this new problem we want to identify exactly which bytes contain zero. So we can come up with two rules that both need be true:

  • The inverted byte must have a set upper bit.
  • If we invert the byte and select the lower bits, adding one to these must set the upper bit.

Putting these into a logical operation we get (~byte & ( (~byte & 0x7f) + 1) & 0x80). For non-zero input bytes we get a result of zero, for zero input bytes we get a result of 0x80. Next we need to convert these into a bit vector.

If you recall the population count example from earlier, we used a set of operations to combine adjacent bits. In this case we want to do something similar, but instead of adding bits we want to shift them so that they end up in the right places. The code to perform the comparison and shift the results is:

void zeros2( unsigned long long* array, int length, unsigned char* result )
{
  for (int i=0; i<length; i+=8)
  {
    unsigned long long v, u;
    v = array[ i>>3 ];

    u = ~v;
    u = u & 0x7f7f7f7f7f7f7f7f;
    u = u + 0x0101010101010101;
    v = u & (~v);
    v = v & 0x8080808080808080;

    v = v | (v << 7);
    v = v | (v << 14);
    v = (v >> 56) | (v >> 28);
    result[ i>>3 ] = v;
  }
}

The resulting code runs about four times faster than the original.

Concluding remarks

So that ends this brief series on bit manipulation, I hope you've found it interesting, if you want to investigate this further there are plenty of resources on the web, but it would be hard to skip mentioning the book "The Hacker's Delight", which is a great read on this domain.

There's a couple of concluding thoughts. First of all performance comes from doing operations on multiple items of data in the same instruction. This should sound familiar as "SIMD", so a processor might often have vector instructions that already get the benefits of single instruction, multiple data, and single SIMD instruction might replace several integer operations in the above codes. The other place the performance comes from is eliminating branch instructions - particularly the unpredictable ones, again vector instructions might offer a similar benefit.

Monday, February 2, 2015

Bit manipulation: finding a range of values

We previously looked at finding zero values in an array. A similar problem is to find a value larger than some target. The vanilla code for this is pretty simple:

#include "timing.h"

int range(char * array, unsigned int length, unsigned char target)
{
  for (unsigned int i=0; i<length; i++)
  {
    if (array[i]>target) { return i; }
  }
  return -1;
}

It's possible to recode this to use bit operations, but there is a small complication. We need two versions of the routine depending on whether the target value is >127 or not. Let's start with the target greater than 127. There are two rules to finding bytes greater than this target:

  • The upper bit is set in the target value, this means that the upper bit must also be set in the bytes we examine. So we can AND the input value with 0x80, and this must be 0x80.
  • We want a bit more precision than testing the upper bit. We need to know that the value is greater than the target value. So if we clear the upper bit we get a number between 0 and 127. This is equivalent to subtracting 128 off all the bytes that have a value greater than 127. So instead of doing a comparison of is 132 greater than 192 we can do an equivalent check of is (132-128) greater than (192-128), or is 4 greater than 64? However, we want bytes where this is true to end up with their upper bit set. So we can do an ADD operation where we add sufficient to each byte to cause the result to be greater than 128 for the bytes with a value greater than the target. The operation for this is ( byte & 0x7f ) + (255-target).

The second condition is hard to understand, so consider an example where we are searching for values greater than 192. We have an input of 132. So the first of the two conditions produces 132 & 0x80 = 0x80. For the second condition we want to do (132 & 0x7f) + (255-192) = 4+63 = 68 so the second condition does not produce a value with the upper bit set. Trying again with an input of 193 we get 65 + 63 = 128 so the upper bit is set, and we get a result of 0x80 indicating that the byte is selected.

The full operation is (byte & ( (byte & 0x7f) + (255 - target) ) & 0x80).

If the target value is less than 128 we perform a similar set of operations. In this case if the upper bit is set then the byte is automatically greater than the target value. If the upper bit is not set we have to add sufficient on to cause the upper bit to be set by any value that meets the criteria.

The operation looks like (byte | ( (byte & 0x7f) + (127 - target) ) & 0x80).

Putting all this together we get the following code:

int range2( unsigned char* array, unsigned int length, unsigned char target )
{
  unsigned int i = 0;
  // Handle misalignment
  while ( (length > 0) && ( (unsigned long long) & array[i] & 7) )
  { 
    if ( array[i] > target ) { return i; }
    i++;
    length--;
  }
  // Optimised code
  unsigned long long * p = (unsigned long long*) &array[i];
  if (target < 128)
  {
    unsigned long long v8 = 127 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);

    while (length > 8) 
    {
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080; // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
      v = v + v8;
      unsigned long long r = (v | u) & 0x8080808080808080;
      if (r) { break; }
      length-=8;
      p++;
      i+=8;
    }
  }
  else
  {
    unsigned long long v8 = 255 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);
    while (length > 8)
    {
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080; // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
      v = v + v8;
      unsigned long long r = v & u;
      if (r) { break; }
      length-=8;
      p++;
      i+=8;
    }
  }

  // Handle trailing values
  while (length > 0)
  { 
    if (array[i] > target) { return i; }
    i++;
    length--;
  }
  return -1;
}

The resulting code runs about 4x faster than the original version.