tag:blogger.com,1999:blog-15272308647632149222024-03-18T02:47:27.462-07:00Darryl Gove's BlogCompilers, Processors, Performance, Parallelisation, and Optimisation.Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.comBlogger160125tag:blogger.com,1999:blog-1527230864763214922.post-87851854580880496062023-06-05T23:57:00.001-07:002023-06-05T23:57:59.467-07:00Fixing a Hohner JT60 (updating electronics)<p>The other part of my JT60 that needed work was the volume control and one of the tone controls - both potentiometers needed replacing. The component list I came up with was:</p>
<ul>
<li>Tone: 2x A250kΩ mini pots, 15mm height. The A stands for Audio taper.</li>
<li>Volume: B250kΩ mini pot, 15mm height. The B indicates linear taper.</li>
<li>Switch: DECO T867-0061.</li>
<li>Pickups: 3x GS1S - middle one with yellow wire (RWRP).</li>
</ul>
<p>The electronics look like the following pictures:</p>
<img src="https://lh3.googleusercontent.com/GKCdIfLTEsmS00hyaQjVs3LB53arunivnoLzUC4Rlzf5KanZBqB90XnZUmc8n2EUjCfp1GKrhv0_LODIGQ-mZHAlKRKigLHjp1X77EN78CvL5SK26C2CURxGEywW4IE7SFevSSYfJ9w=w2400" width="50%" alt="Photo of Hohner JT60 wiring and potentiometers.">
<img src="https://lh3.googleusercontent.com/aw0bWe-ShdaF9-bl_2vXuDt-StNkE5ZRHrON0DN2mzgeL7k0_WPJMZvFbNFK_26VozmOFxUmjwPdG187LwNkb_8rAvzkOsWdZKdTT6TFCOtAzs-8r2Zoyd3WV0SiA4t4I8fjD6qEnsg=w2400" width="50%" alt="Photo of Hohner JT60 wiring">
<p>I was unable to find a circuit diagram on-line. This is the wiring schematic for my JT60:</p>
<img src="https://lh3.googleusercontent.com/o9kQmOT_7igLNcFFrGKigMUWt9PN3DYX9bcOkHwhGjlVY5sC8KS78nsSWucw1vRK-BJLmC1-_BBqS7k-IB-1PHl9X6ucntyl8CWtRpJpX9k-etC-cQDy1bDv-4kEG3woXSTRUuKVxYw=w2400" width="50%" alt="Hohner JT60 wiring schematic.">
Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-34299373847051072502023-06-05T23:31:00.000-07:002023-06-05T23:31:30.392-07:00Fixing a Hohner Jt60 (new tremolo block)<p>I've had a Hohner JT60 "Hollywood" guitar for a long while. It's unusual in that it has an <a href="http://www.guitar-letter.de/Knowledge/DasklangbestimmendeFragezeichenineinerHohnerRevelationRTS.htm">ATN (Advanced Tonal Network)</a> (the link is to a discussion in German on the potential circuit in the ATN). The idea is that you can obtain a <a href="https://images.reverb.com/image/upload/s--R1paMkXh--/f_auto,t_supersize/v1639154706/rss9ickaxztafpy7t5ty.jpg">wide range of tones</a> from combinations of the volume and tone controls.</p>
<p>The guitar needed some work. Opening it up I found that the tremolo block had cracked and the springs had grown weak - as shown in the picture.</p>
<img src="https://lh3.googleusercontent.com/8f6dYOBCR-ajy15QWPLaxj7SaaaD-8gu3thAmJNiKeHfd5IloEP7Lmx-hCOev3eBEaoqchltRO9I2hBKxwuRe5yiJPGkqXDlT9S1pJ82DwWg5sJULmuD1utO94iz3Pp5tWNkl2GnvMc=w2400" width="50%" alt="Cracked tremolo block.">
<p>So the first fix was to replace the tremolo block with a <a href="https://www.amazon.com/gp/product/B0B4S8BK54/?th=1&_encoding=UTF8&tag=dargovsblo00-20&linkCode=ur2&linkId=228a3738baa30620c89b760b412a2f26&camp=1789&creative=9325">Musiclily block</a> that fitted the guitar. The dimensions of this tremolo are shown in the diagram - and it fits very well.</p>
<img src="https://m.media-amazon.com/images/I/71liqIPMBGL._AC_SX679_.jpg" width="50%" alt="Dimensions of the Musiclily tremolo block.">
<p>Photo of the bridge in place, before the studs and strings were replaced:</p>
<img src="https://lh3.googleusercontent.com/nKTyTBKylkP42pEKnbg3vvnTYMvC_APpiSUuLlBA574y5ETSpvSQELQBVmqNL6fBgHygnWbdhD9FpUIl1zsM7GKr333J9CETwLnXGpRf_xU1gPs17uxPj8UpB8xMOXEQaRTVpi4BYQc=w2400" width="50%" alt="New bridge in place.">
Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-44560202603659301692023-03-02T15:00:00.005-08:002023-03-02T15:00:00.201-08:00go/fast tips being published<a href="https://abseil.io/fast/">https://abseil.io/fast/</a> is external exposure of one of the projects I've been involved in at Google. For a long time I've felt that performance optimisation is the best job in the world, but to really move the needle we need to ensure that folks have access to information about writing efficient code - hence this blog and the books. At work, internally we've been gathering and documenting best practices for performance. Finally, we're able to publish some of these externally! Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-64994447021793124522021-06-15T23:09:00.003-07:002021-06-15T23:09:58.343-07:00TCMalloc hugepage paper<p>I'm pleased to announce that our <a href="https://research.google/pubs/pub50370/">paper on TCMalloc and hugepages</a> has been made available.</p>
<p>This paper describes the changes made to TCMalloc in order for it to become hugepage aware. It required a complete restructure of the "backend" - the bit that manages memory requested from the OS. The obvious change is to manage this memory in chunks of 2MiB (the x86 hugepage size). Once you have that, you need a cache that handles allocations of less than 2MiB, another that handles allocations of 2MiB or more, and so on.</p>
<p>One of the key observations in the paper is that increasing hugepage coverage improves <i>application</i> performance, not necessarily allocator performance. That is, you can spend extra time in the allocator preserving hugepage coverage, and this improves the performance of the code that uses the memory. The net result is a slight increase in time spent in the allocator in exchange for a larger reduction in time spent in the application. This makes the time spent in the allocator <i>increase</i> - making the allocator look worse!</p>
<p>To work around the fact that the allocator looks worse, you also need to look for metrics that represent the productivity of the application. If the productivity increases, then the changes to the allocator are an efficiency win - even if the allocator ends up taking a greater percentage of the entire runtime.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-53333151993276509632021-06-07T21:24:00.001-07:002021-06-07T21:24:22.494-07:00Paper on releasing memory back to the OS<p>Co-author on a <a href='https://conf.researchr.org/details/ismm-2021/ismm-2021/5/Adaptive-Huge-Page-Subrelease-for-Non-Moving-Memory-Allocators-in-Warehouse-Scale-Com'>paper</a> about releasing memory back to the OS in <a href="https://github.com/google/tcmalloc">TCMalloc</a>.</p>
<p>The quick summary is that memory returned to the OS is often requested back very rapidly. Hence waiting a bit of time before returning unused memory improves performance without increasing practical memory footprint.</p>
<p>The performance improvement comes from not needing to break up hugepages in order to return memory to the OS. Additionally, we don't spend time in the OS returning memory, or faulting it back into the address space of the process.</p>
<p>We don't see a practical increase in the RAM that an application needs, because the memory is typically requested back after a short interval - too short for anything else to be scheduled into the freed memory.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-13730125713665021322021-02-09T22:54:00.000-08:002021-02-09T22:54:41.707-08:00Featured in TheRegister<p>I was surprised to find that my Oracle blog posts were mentioned in <a href="https://www.theregister.com/2021/01/29/oracle_restores_sparc_solaris_content/">TheRegister</a>. Apparently Oracle has <a href="https://blogs.oracle.com/solaris/new-old-content-on-oracle-developer-studio">restored the content</a>. The original content appears to <a href="https://blogs.oracle.com/d/">still be there</a>, but the linked images etc. have disappeared - so the new content should probably be described as "Remastered"!</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-50199073694004861902020-12-29T14:00:00.001-08:002020-12-29T14:00:29.901-08:00Bit masking on x86<p>In theory it's pretty easy to generate a bitmask for a range of bits:</p>
<pre>
unsigned long long mask(int bits) {
return (1ull << bits) - 1;
}
</pre>
<p>You need to specify that the 1 being shifted is an unsigned long long otherwise it gets treated as a 32-bit int, and the code only works for the range 0..31.</p>
<p>However, this code fails for a shift count of 64 on x86. For example:</p>
<pre>
#include <stdio.h>
unsigned long long shift(int x) {
return (1ull << x) - 1;
}
int main() {
printf("Value %0llx\n", shift(64));
}
</pre>
<p>This returns the value 0 for the result of shifting by 64 when run on x86.</p>
<p>The reason for this can be found in the <a href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf">Intel docs</a> (Vol. 2B 4-583):</p>
<p><i>The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used). The count range is
limited to 0 to 31 (or 63 if 64-bit mode and REX.W is used).</i></p>
<p>The result of this is that the one is shifted by zero - i.e. it remains unchanged - and subtracting 1 from that produces the value zero.</p>
<p>Unfortunately, this means we need a more complex bit of code that handles shifts of 64 or more correctly:</p>
<pre>
unsigned long long mask(int bits) {
return (bits >= 64 ? ~0ull : (1ull << bits) - 1);
}
</pre>
Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-15334321242750066282020-02-12T08:24:00.002-08:002020-02-12T08:24:32.501-08:00TCMalloc - fast, hugepage aware, memory allocator<p>I'm thrilled that we've been able to release a revised version of <a href="https://abseil.io/blog/20200212-tcmalloc">TCMalloc</a>.</p>
<p>In its new default mode it holds per-virtual-CPU caches so that a thread running on a CPU gets contention-free allocation and deallocation. This relies on <a href="https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/">Restartable Sequences</a> - a Linux feature that ensures a region of code either completes without interruption by the OS, or gets restarted if it is interrupted.</p>
<p>The other compelling feature is that it is <a href="https://google.github.io/tcmalloc/design#hugepage-aware-pageheap">hugepage aware</a>. It ultimately manages memory in hugepage chunks so that it can pack data onto as few hugepages as possible.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-43431683719478839422019-11-14T21:32:00.003-08:002019-11-14T21:32:39.981-08:00Matt Godbolt on compiler optimisations.<p>This <a href="https://queue.acm.org/detail.cfm?id=3372264">article on optimisation</a> is worth a read.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-67570109843300128652015-10-12T23:37:00.001-07:002015-10-12T23:37:09.732-07:00Memory error detection (new series)<p>Just before I left Oracle I sat down with my good friend <a href="https://blogs.oracle.com/raj/">Raj</a> and we wrote up a bundle of stuff on Memory Error Detection. He's started posting it on his <a href="https://blogs.oracle.com/raj/entry/common_types_of_memory_access">blog</a>.</p>
<p>Memory errors are one of the most common types of application error - accessing past the end of arrays, or past the end of a block of allocated memory, reading memory that has not yet been initialised, and so on. What makes them particularly nasty is that they are hard to detect, and the effect of the memory access error can be arbitrarily far from the place where the error occurs. So it's useful to have tools which help you find them...</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com1tag:blogger.com,1999:blog-1527230864763214922.post-7872180928631218812015-02-17T11:35:00.002-08:002015-02-17T11:35:44.030-08:00Profiling the kernel<p>One of the incredibly useful features in Studio is the ability to profile the kernel. The tool to do this is <a href="http://docs.oracle.com/cd/E37069_01/html/E54439/er-kernel-1.html">er_kernel</a>. It's based around DTrace, so you either need to run it with escalated privileges, or you need to edit <code>/etc/user_attr</code> to add something like:</p>
<pre>
<username>::::defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel
</pre>
<p>The <a href="http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Guide">correct way to modify user_attr</a> is with the command <a href="http://docs.oracle.com/cd/E23824_01/html/821-1462/usermod-1m.html">usermod</a>:</p>
<pre>
usermod -K defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel <username>
</pre>
<p>There are two ways to run er_kernel. The default mode is to just profile the kernel:</p>
<pre>
$ er_kernel sleep 10
....
Creating experiment database ktest.1.er (Process ID: 7399) ...
....
$ er_print -limit 10 -func ktest.1.er
Functions sorted by metric: Exclusive Kernel CPU Time
Excl. Incl. Name
Kernel Kernel
CPU sec. CPU sec.
19.242 19.242 <Total>
14.869 14.869 <l_PID_7398>
0.687 0.949 default_mutex_lock_delay
0.263 0.263 mutex_enter
0.202 0.202 <java_PID_248>
0.162 0.162 gettick
0.141 0.141 hv_ldc_tx_set_qtail
...
</pre>
<p>Here we passed the command <code>sleep 10</code> to <code>er_kernel</code>; this causes it to profile for 10 seconds. It might be better form to use the equivalent command line option <code>-t 10</code>.</p>
<p>In the profile we can see a couple of user processes together with some kernel activity. The other way to run er_kernel is to profile the kernel and user processes. We enable this mode with the command line option <code>-F on</code>:</p>
<pre>
$ er_kernel -F on sleep 10
...
Creating experiment database ktest.2.er (Process ID: 7630) ...
...
$ er_print -limit 5 -func ktest.2.er
Functions sorted by metric: Exclusive Total CPU Time
Excl. Incl. Excl. Incl. Name
Total Total Kernel Kernel
CPU sec. CPU sec. CPU sec. CPU sec.
15.384 15.384 16.333 16.333 <Total>
15.061 15.061 0. 0. main
0.061 0.061 0. 0. ioctl
0.051 0.141 0. 0. dt_consume_cpu
0.040 0.040 0. 0. __nanosleep
...
</pre>
<p>In this case we can see all the userland activity as well as kernel activity. The <code>-F</code> option is very flexible: instead of just profiling everything, we can use <code>-F =<regexp></code> syntax to specify either a PID or process name to profile:</p>
<pre>
$ er_kernel -F =7398
</pre>
Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com1tag:blogger.com,1999:blog-1527230864763214922.post-84308534820390396322015-02-05T08:00:00.000-08:002015-02-05T08:00:13.092-08:00Digging into microstate accounting<p>Solaris has support for <a href="https://blogs.oracle.com/eschrock/entry/microstate_accounting_in_solaris_10">microstate accounting</a>. This gives huge insight into where an application and its threads are spending their time. It breaks down time into the (obvious) user and system, but also allows you to see the time spent waiting on page faults and other useful-to-know states.</p>
<p>This level of detail is available through the <code>usage</code> file in <code>/proc/pid</code>; there's a corresponding file for each lwp in <code>/proc/pid/lwp/lwpid/lwpusage</code>. You can find more details about the <a href="http://docs.oracle.com/cd/E23824_01/html/821-1473/proc-4.html">/proc file system</a> in the documentation, or by reading my recent article about <a href="https://community.oracle.com/docs/DOC-895149">tracking memory use</a>.</p>
<p>Here's an example of using it to report idle time, ie time when the process wasn't busy:</p>
<pre>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>
#include <fcntl.h>
#include <procfs.h>
void busy()
{
for (int i=0; i<100000; i++)
{
double d = i;
while (d>0) { d=d *0.5; }
}
}
void lazy()
{
sleep(10);
}
double convert(timestruc_t ts)
{
return ts.tv_sec + ts.tv_nsec/1000000000.0;
}
void report_idle()
{
prusage_t prusage;
int fd;
fd = open( "/proc/self/usage", O_RDONLY);
if (fd == -1) { return; }
read( fd, &prusage, sizeof(prusage) );
close( fd );
printf("Idle percent = %3.2f\n",
100.0*(1.0 - (convert(prusage.pr_utime) + convert(prusage.pr_stime))
/convert(prusage.pr_rtime) ) );
}
int main()
{
report_idle();
busy();
report_idle();
lazy();
report_idle();
}
</pre>
<p>The code has two functions that take time. The first does some redundant FP computation (that cannot be optimised out unless you tell the compiler to do FP optimisations), this part of the code is CPU bound. When run the program reports low idle time for this section of the code. The second routine calls <code>sleep()</code>, so the program is idle at this point waiting for the sleep time to expire, hence this section is reported as being high idle time.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com34tag:blogger.com,1999:blog-1527230864763214922.post-30477035359727510162015-02-04T12:23:00.000-08:002015-02-04T13:08:52.666-08:00Namespaces in C++<p>A porting problem I hit with regularity is using functions in the standard namespace. Fortunately, it's a relatively easy problem to diagnose and fix. But it is very common, and it's worth discussing how it happens.</p>
<p>C++ namespaces are a very useful feature that allows an application to use identical names for symbols in different contexts. Here's an example where we define two namespaces and place identically named functions in the two of them.</p>
<pre>
#include <iostream>
namespace ns1
{
void hello()
{ std::cout << "Hello ns1\n"; }
}
namespace ns2
{
void hello()
{ std::cout << "Hello ns2\n"; }
}
int main()
{
ns1::hello();
ns2::hello();
}
</pre>
<p>The construct <code>namespace <i>optional_name</i></code> is used to introduce a namespace. In this example we have introduced two namespaces <code>ns1</code> and <code>ns2</code>. Both namespaces have a routine called <code>hello</code>, but both routines can happily co-exist because they exist in different namespaces.</p>
<p>Behind the scenes, the namespace becomes part of the symbol name:</p>
<pre>
$ nm a.out|grep hello
[63] | 68640| 44|FUNC |GLOB |0 |11 |__1cD<b>ns1</b>F<b>hello</b>6F_v_
[56] | 68704| 44|FUNC |GLOB |0 |11 |__1cD<b>ns2</b>F<b>hello</b>6F_v_
</pre>
<p>When it comes to using functions declared in namespaces we can prepend the namespace to the name of the symbol; this uniquely identifies it. You can see this in the example where the calls to <code>hello()</code> from the different namespaces are prefixed with the namespace.</p>
<p>However, prefixing every function call with its namespace can rapidly become very tedious. So there is a way to make this easier. First of all, let's quickly discuss the <i>global</i> namespace. The global namespace is the namespace that is searched if you do not specify a namespace - kind of the default namespace. If you declare a function <code>foo()</code> in your code, then it naturally resides in the global namespace.</p>
<p>We can add symbols from other namespaces into the global namespace using the <code>using</code> keyword. There are two ways we can do this. One way is to add the entire namespace into the global namespace. The other way is to import symbols individually into the global namespace. To do this write <code>using namespace <namespace>;</code> to import the entire namespace into the global namespace, or <code>using <namespace>::<function>;</code> to just import a single function into the global namespace. Here's the earlier example modified to show both approaches:</p>
<pre>
#include <iostream>
namespace ns1
{
void hello()
{ std::cout << "Hello ns1\n"; }
}
namespace ns2
{
void hello()
{ std::cout << "Hello ns2\n"; }
}
int main()
{
{
using namespace ns1;
hello();
}
{
using ns2::hello;
hello();
}
}
</pre>
<p>The other thing you will notice in the example is the use of <code>std::cout</code>. Notice that this is prefixed with the <code>std::</code> namespace. This is an example of a situation where you might encounter porting problems.</p>
<p>The C++03 standard (17.4.1.1) says this about the C++ Standard Library "<i>All library entities except macros, operator new and operator delete are defined within the namespace std or namespaces nested within the namespace std.</i>". This means that, according to the standard, if you include <code>iostream</code> then <code>cout</code> will be defined in the <code>std</code> namespace. That's the only place you can rely on it being available.</p>
<p>Now, sometimes you might find a function that is in the std namespace is already available in the global namespace. For example, gcc puts all the functions that are in the std namespace into the global namespace.</p>
<p>Other times, you might include a header file which has already imported an entire namespace, or particular symbols from a namespace. This can happen if you change the Standard Library that you are using and the new header files contain a different set of includes and using statements.</p>
<p>There's one other area where you can encounter this, and that is using C library routines. All the C header files have a C++ counterpart. For example <code>stdio.h</code> has the counterpart <code>cstdio</code>. One difference between the two headers is the namespace where the routines are placed. If the C headers are used, then the symbols get placed into the global namespace; if the C++ headers are used, the symbols get placed into the <code>std</code> namespace. This behaviour is defined by section D.5 of the C++03 standard. Here's an example where we use both the C and C++ header files, and need to specify the namespace for the functions from the C++ header file:</p>
<pre>
#include <cstdio>
#include <string.h>
int main()
{
char string[100];
strcpy( string, "Hello" );
std::printf( "%s\n", string );
}
</pre>
Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com1tag:blogger.com,1999:blog-1527230864763214922.post-38057369197200227442015-02-03T12:19:00.001-08:002015-02-03T12:40:28.776-08:00Improving application performance using bit manipulation optimisations<p>My recent blog posts on <a href="https://community.oracle.com/docs/DOC-895106">bit manipulation are now available as an article</a> up on the <a href="https://community.oracle.com/community/server_%26_storage_systems/systems-development-and-management-tools/application_development_in_c__c%2B%2B__and_fortran">OTN community pages</a>. If you want to read the individual posts they are:</p>
<ul>
<li><a href="http://www.darrylgove.com/2015/01/improving-performance-through-bit.html">Improving performance through bit manipulation: clear last set bit</a></li>
<li><a href="http://www.darrylgove.com/2015/01/bit-manipulation-population-count.html">Bit manipulation: Population Count</a></li>
<li><a href="http://www.darrylgove.com/2015/01/finding-zeros-in-array.html">Finding zero values in an array</a></li>
<li><a href="http://www.darrylgove.com/2015/02/bit-manipulation-finding-range-of-values.html">Bit manipulation: finding a range of values</a></li>
<li><a href="http://www.darrylgove.com/2015/02/bit-manipulation-gathering-bits.html">Bit manipulation: Gathering bits</a></li>
</ul>
Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-82233370069124685002015-02-03T08:00:00.000-08:002015-02-03T08:00:11.852-08:00Bit manipulation: Gathering bits<p>In the last post on bit manipulation we looked at how we could identify bytes that were greater than a particular target value, stopping when we discovered one. The resulting vector of bytes contained a zero byte for those that did not meet the criteria, and a byte containing 0x80 for those that did. Obviously we could express the result much more efficiently if we assigned a single bit to each result. The following is "lightly" optimised code for producing a bit vector indicating the position of zero bytes:</p>
<pre>
void zeros( unsigned char * array, int length, unsigned char * result )
{
for (int i=0;i < length; i+=8)
{
result[i>>3] =
( (array[i+0]==0) << 7) +
( (array[i+1]==0) << 6) +
( (array[i+2]==0) << 5) +
( (array[i+3]==0) << 4) +
( (array[i+4]==0) << 3) +
( (array[i+5]==0) << 2) +
( (array[i+6]==0) << 1) +
( (array[i+7]==0) << 0);
}
}
</pre>
<p>The code is "lightly" optimised because it works on eight values at a time. This helps performance because the code can store results a byte at a time. An even less optimised version would split the index into a byte and bit offset and use that to update the result vector.</p>
<p>When we previously looked at finding zero bytes we used Mycroft's algorithm, which determines whether a zero byte is present or not. It does not indicate where the zero byte is to be found. For this new problem we want to identify exactly which bytes contain zero. So we can come up with two rules that both need to be true:</p>
<ul>
<li>The inverted byte must have a set upper bit.</li>
<li>If we invert the byte and select the lower bits, adding one to these must set the upper bit.</li>
</ul>
<p> Putting these into a logical operation we get <code> (~byte & ( (~byte & 0x7f) + 1) & 0x80)</code>. For non-zero input bytes we get a result of zero, for zero input bytes we get a result of 0x80. Next we need to convert these into a bit vector.</p>
<p>If you recall the population count example from earlier, we used a set of operations to combine adjacent bits. In this case we want to do something similar, but instead of adding bits we want to shift them so that they end up in the right places. The code to perform the comparison and shift the results is:</p>
<pre>
void zeros2( unsigned long long* array, int length, unsigned char* result )
{
for (int i=0; i<length; i+=8)
{
unsigned long long v, u;
v = array[ i>>3 ];
u = ~v;
u = u & 0x7f7f7f7f7f7f7f7f;
u = u + 0x0101010101010101;
v = u & (~v);
v = v & 0x8080808080808080;
v = v | (v << 7);
v = v | (v << 14);
v = (v >> 56) | (v >> 28);
result[ i>>3 ] = v;
}
}
</pre>
<p>The resulting code runs about four times faster than the original.</p>
<b>Concluding remarks</b>
<p>So that ends this brief series on bit manipulation - I hope you've found it interesting. If you want to investigate this further there are plenty of resources on the web, but it would be hard to skip mentioning the book "<a href="http://www.hackersdelight.org/">Hacker's Delight</a>", which is a great read on this domain.</p>
<p>There are a couple of concluding thoughts. First of all, performance comes from doing operations on multiple items of data in the same instruction. This should sound familiar as "SIMD": a processor will often have vector instructions that get the benefits of single instruction, multiple data, and a single SIMD instruction might replace several integer operations in the codes above. The other place the performance comes from is eliminating branch instructions - particularly the unpredictable ones - and again vector instructions might offer a similar benefit.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com2tag:blogger.com,1999:blog-1527230864763214922.post-12936947451016312122015-02-02T08:00:00.000-08:002015-02-02T10:49:49.235-08:00Bit manipulation: finding a range of values<p>We previously looked at finding zero values in an array. A similar problem is to find a value larger than some target. The vanilla code for this is pretty simple:</p>
<pre>
#include "timing.h"
int range(char * array, unsigned int length, unsigned char target)
{
for (unsigned int i=0; i<length; i++)
{
if (array[i]>target) { return i; }
}
return -1;
}
</pre>
<p>It's possible to recode this to use bit operations, but there is a small complication. We need two versions of the routine depending on whether the target value is >127 or not. Let's start with the target greater than 127. There are two rules to finding bytes greater than this target:</p>
<ul>
<li>The upper bit is set in the target value, this means that the upper bit must also be set in the bytes we examine. So we can AND the input value with 0x80, and this must be 0x80.</li>
<li>We want more precision than just testing the upper bit: we need to know that the value is greater than the target. If we clear the upper bit we get a number between 0 and 127, which is equivalent to subtracting 128 from every byte with a value greater than 127. So instead of asking whether 132 is greater than 192, we can equivalently ask whether (132-128) is greater than (192-128) - that is, whether 4 is greater than 64. However, we want the bytes where this is true to end up with their upper bit set. So we perform an ADD, adding just enough to each byte that the result reaches 128 or more exactly for the bytes greater than the target. The operation for this is <code>( byte & 0x7f ) + (255-target)</code>.</li>
</ul>
<p>The second condition is hard to understand, so consider an example where we are searching for values greater than 192. We have an input of 132. So the first of the two conditions produces <code>132 & 0x80 = 0x80</code>. For the second condition we want to do <code>(132 & 0x7f) + (255-192) = 4 + 63 = 67</code>, so the second condition does not produce a value with the upper bit set. Trying again with an input of 193 we get <code>65 + 63 = 128</code>, so the upper bit is set, and we get a result of 0x80 indicating that the byte is selected.</p>
<p>The full operation is <code>(byte & ( (byte & 0x7f) + (255 - target) ) & 0x80)</code>.</p>
<p>If the target value is less than 128 we perform a similar set of operations. In this case if the upper bit is set then the byte is automatically greater than the target value. If the upper bit is not set we have to add sufficient on to cause the upper bit to be set by any value that meets the criteria.</p>
<p>The operation looks like <code>(byte | ( (byte & 0x7f) + (127 - target) ) & 0x80)</code>.</p>
<p>Putting all this together we get the following code:</p>
<pre>
int range2( unsigned char* array, unsigned int length, unsigned char target )
{
unsigned int i = 0;
// Handle misalignment
while ( (length > 0) && ( (unsigned long long) & array[i] & 7) )
{
if ( array[i] > target ) { return i; }
i++;
length--;
}
// Optimised code
unsigned long long * p = (unsigned long long*) &array[i];
if (target < 128)
{
unsigned long long v8 = 127 - target;
v8 = v8 | (v8 << 8);
v8 = v8 | (v8 << 16);
v8 = v8 | (v8 << 32);
while (length > 8)
{
unsigned long long v = *p;
unsigned long long u;
u = v & 0x8080808080808080; // upper bit
v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
v = v + v8;
unsigned long long r = (v | u) & 0x8080808080808080;
if (r) { break; }
length-=8;
p++;
i+=8;
}
}
else
{
unsigned long long v8 = 255 - target;
v8 = v8 | (v8 << 8);
v8 = v8 | (v8 << 16);
v8 = v8 | (v8 << 32);
while (length > 8)
{
unsigned long long v = *p;
unsigned long long u;
u = v & 0x8080808080808080; // upper bit
v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
v = v + v8;
unsigned long long r = v & u;
if (r) { break; }
length-=8;
p++;
i+=8;
}
}
// Handle trailing values
while (length > 0)
{
if (array[i] > target) { return i; }
i++;
length--;
}
return -1;
}
</pre>
<p>The resulting code runs about 4x faster than the original version.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-40181166092848106622015-01-30T08:00:00.000-08:002015-01-30T08:00:05.980-08:00Finding zeros in an array<p>A common thing to want to do is to find zero values in an array; the obvious example is computing the length of a string. So we'll start out with a test harness and a simple implementation:</p>
<pre>
#include <stdio.h>
#include "timing.h"
unsigned int len(char* array)
{
unsigned int length = 0;
while( array[length] ) { length++; }
return length;
}
#define COUNT 100000
void main()
{
char array[ COUNT ];
for (int i=1; i<COUNT; i++)
{
array[i-1] = 'a';
array[i] = 0;
if ( i != len(array) ) { printf( "Error at %i\n", i ); }
}
starttime();
for (int i=1; i<COUNT; i++)
{
array[i-1] = 'a';
array[i] = 0;
len(array);
}
endtime(COUNT);
}
</pre>
<p>A chap called <a href="http://en.wikipedia.org/wiki/Alan_Mycroft">Alan Mycroft</a> came up with a very neat algorithm to simultaneously examine multiple bytes and determine whether there is a zero in them. His algorithm starts off with the idea that there are two conditions that need to be true if a byte contains the value zero. First of all the upper bit of the byte must be zero; this is true for zero and all values less than 128, so on its own it is not sufficient. The second characteristic is that if one is subtracted from the value, then the upper bit must be one. This is true for zero and all values greater than 128. Although both conditions are individually satisfied by multiple values, the only value that satisfies both conditions is zero.</p>
<p>The following code uses the Mycroft test for a string length implementation. The code contains a pre-loop to get to an eight byte aligned address.</p>
<pre>
unsigned int len2( char* array )
{
  unsigned int length = 0;
  // Handle misaligned data
  while ( ( (unsigned long long) &array[length] ) & 7 )
  {
    if ( array[length] == 0 ) { return length; }
    length++;
  }
  unsigned long long * p = (unsigned long long *) &array[length];
  unsigned long long v8, v7;
  do
  {
    v8 = *p;
    v7 = v8 - 0x0101010101010101;
    v7 = ( v7 & ~v8 ) & 0x8080808080808080;
    p++;
  }
  while ( !v7 );
  length = (char*)p - array - 8;
  while ( array[length] ) { length++; }
  return length;
}
</pre>
<p>The algorithm has one weak point. It does not always report exactly which byte is zero, just that there is a zero byte somewhere. Hence the final loop where we work out exactly which byte is zero.</p>
<p>It is a trivial extension to use this to search for a byte of any value. If we XOR the input vector with a vector of bytes containing the target value, then we get a zero byte where the target value occurs, and a non-zero byte everywhere else.</p>
<p>It is also easy to extend the code to search for other zero bit patterns. For example, if we want to find zero nibbles (ie 4 bit values), then we can change the constants to be 0x1111111111111111 and 0x8888888888888888.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com5tag:blogger.com,1999:blog-1527230864763214922.post-39130465153291869722015-01-29T08:00:00.000-08:002015-01-29T08:00:07.388-08:00Bit manipulation: Population Count<p>Population count is one of the more esoteric instructions. It's the operation to count the number of set bits in a register. It comes up with sufficient frequency that most processors have a hardware instruction to do it. However, for this example, we're going to look at coding it in software. First of all we'll write a baseline version of the code:</p>
<pre>
int popc( unsigned long long value )
{
  unsigned long long bit = 1;
  int popc = 0;
  while ( bit )
  {
    if ( value & bit ) { popc++; }
    bit = bit << 1;
  }
  return popc;
}
</pre>
<p>The above code examines every bit in the input and counts the number of set bits. The number of iterations is proportional to the number of bits in the register.</p>
<p>Most people will immediately recognise that we could make this a bit faster using the code we discussed previously that clears the last set bit: whilst there are set bits, keep clearing them; once there are none left, you're done. The advantage of this approach is that you only iterate once for every set bit in the value. So if there are no set bits, then you do not do any iterations.</p>
<pre>
int popc2( unsigned long long value )
{
  int popc = 0;
  while ( value )
  {
    popc++;
    value = value & ( value - 1 );
  }
  return popc;
}
</pre>
<p>The next thing to do is to put together a test harness that confirms that the new code produces the same results as the old code, and also measures the performance of the two implementations.</p>
<pre>
#define COUNT 1000000

void main()
{
  // Correctness test
  for ( unsigned long long i = 0; i < COUNT; i++ )
  {
    if ( popc( i + (i<<32) ) != popc2( i + (i<<32) ) )
    {
      printf( " Mismatch popc2 input %llx: %u != %u\n",
              i + (i<<32), popc( i + (i<<32) ), popc2( i + (i<<32) ) );
    }
  }
  // Performance test
  starttime();
  for ( unsigned long long i = 0; i < COUNT; i++ )
  {
    popc( i + (i<<32) );
  }
  endtime(COUNT);
  starttime();
  for ( unsigned long long i = 0; i < COUNT; i++ )
  {
    popc2( i + (i<<32) );
  }
  endtime(COUNT);
}
</pre>
<p>The new code is about twice as fast as the old code. However, the new code still contains a loop, and this can be a bit of a problem.</p>
<p><b>Branch mispredictions</b></p>
<p>The trouble with loops, and with branches in general, is that processors don't know the next instruction that will be executed after the branch until the branch has been reached, but the processor needs to have already fetched the instructions after the branch well before this. The problem is nicely summarised by Holly in <a href="http://en.wikipedia.org/wiki/Future_Echoes">Red Dwarf</a>:</p>
<p><i><center>"Look, I'm trying to navigate at faster than the speed of light, which means that before you see something, you've already passed through it."</center></i></p>
<p>So processors use branch prediction to guess whether a branch is taken or not. If the prediction is correct there is no break in the instruction stream, but if the prediction is wrong, then the processor needs to throw away all the incorrectly predicted instructions, and fetch the instructions from correct address. This is a significant cost, so ideally you don't want mispredicted branches, and the best way of ensuring that is to not have branches at all!</p>
<p>The following code is a branchless sequence for computing population count:</p>
<pre>
unsigned int popc3( unsigned long long value )
{
  unsigned long long v2;
  v2 = value >> 1;
  v2 &= 0x5555555555555555;
  value &= 0x5555555555555555;
  value += v2;                 // 2-bit counts
  v2 = value >> 2;
  v2 &= 0x3333333333333333;
  value &= 0x3333333333333333;
  value += v2;                 // 4-bit counts
  v2 = value >> 4;
  v2 &= 0x0f0f0f0f0f0f0f0f;
  value &= 0x0f0f0f0f0f0f0f0f;
  value += v2;                 // byte counts
  v2 = value >> 8;
  v2 &= 0x00ff00ff00ff00ff;
  value &= 0x00ff00ff00ff00ff;
  value += v2;                 // 16-bit counts
  v2 = value >> 16;
  v2 &= 0x0000ffff0000ffff;
  value &= 0x0000ffff0000ffff;
  value += v2;                 // 32-bit counts
  v2 = value >> 32;
  value += v2;                 // final count in the low bits
  return (unsigned int) value;
}
</pre>
<p>This instruction sequence computes the population count by initially adding adjacent bits to get a two bit result of 0, 1, or 2. It then adds the adjacent pairs of bits to get a 4 bit result of between 0 and 4. Next it adds adjacent nibbles to get a byte result, then adds pairs of bytes to get shorts, then adds shorts to get a pair of ints, which it adds to get the final value. The code contains a fair amount of AND operations to mask out the bits that are not part of the result.</p>
<p>This bit manipulation version is about two times faster than the clear-last-bit-set version, making it about four times faster than the original code. However, it is worth noting that this is a fixed cost. The routine takes the same amount of time regardless of the input value. In contrast the clear -last-bit-set version will exit early if there are no set bits. Consequently the performance gain for the code will depend on both the input value and the cost of mispredicted branches.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-34348601415548841532015-01-28T15:19:00.000-08:002015-01-28T15:19:49.102-08:00Tracking application resource use<p>One question you might ask is "how much memory is my application consuming?". Obviously you can use <a href="http://docs.oracle.com/cd/E19253-01/816-5166/prstat-1m/index.html">prstat</a> (<code>prstat -cp <pid></code> or <code>prstat -cmLp <pid></code>) to examine the behaviour of a process. But how about programmatically finding that information.</p>
<p><a href="https://community.oracle.com/community/server_%26_storage_systems/systems-development-and-management-tools/application_development_in_c__c%2B%2B__and_fortran">OTN</a> has just published an article where I demonstrate how to find out about the <a href="https://community.oracle.com/docs/DOC-895149">resource use of a process</a>, and incidentally how to put that functionality into a library that reports resource use over time.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-2485335834316205572015-01-28T12:56:00.002-08:002015-01-28T22:00:06.003-08:00Inline functions in C<p>Functions declared as <code>inline</code> are slightly more complex than might be expected. Douglas Walls has provided a <a href="https://blogs.oracle.com/dew/entry/c99_inline_function">chapter-and-verse write up</a>. But the issue bears further explanation.</p>
<p>When a function is declared as <code>inline</code> it's a hint to the compiler that the function could be inlined. It is not a command to the compiler that the function <i>must</i> be inlined. If the compiler chooses not to inline the function, then the function will be left with a function call that needs to be resolved, and at link time it will be necessary for a function definition to be provided. Consider this example:</p>
<pre>
#include <stdio.h>
inline void foo()
{
printf(" In foo\n");
}
void main()
{
foo();
}
</pre>
<p>The code provides an inline definition of foo(), so if the compiler chooses to inline the function it will use this definition. However, if it chooses not to inline the function, you will get a link time error when the linker is unable to find a suitable definition for the symbol <code>foo</code>:</p>
<pre>
$ cc -o in in.c
Undefined first referenced
symbol in file
foo in.o
ld: fatal: symbol referencing errors. No output written to in
</pre>
<p>This can be worked around by adding either "static" or "extern" to the definition of the inline function.</p>
<p>If the function is declared to be a <code>static inline</code> then, as before, the compiler may choose to inline the function. In addition the compiler will emit a locally scoped version of the function in the object file. There can be one static version per object file, so you may end up with a copy of the same function in every object file, which can be very space inefficient. However, since all the copies are locally scoped, there are no multiple-definition errors.</p>
<p>Another approach is to declare the function as <code>extern inline</code>. In this case the compiler may generate inline code, and will also generate a global instance of the function. Although multiple global instances of the function might be generated in all the object files, only one will be remain in the executable after linking. So declaring functions as <code>extern inline</code> is more space efficient.</p>
<p>This behaviour is defined by the standard. However, gcc takes a different approach, which is to treat <code>inline</code> functions by generating a global function and potentially inlining the code. Unfortunately this can cause multiply-defined symbol errors at link time, where the same <code>extern inline</code> function is declared in multiple files. For example, in the following code both in.c and in2.c include in.h which contains the definition of <code>extern inline foo() {...}</code>.</p>
<pre>
$ gcc -o in in.c in2.c
ld: fatal: symbol 'foo' is multiply-defined:
</pre>
<p>The gcc behaviour for functions declared as <code>extern inline</code> is also different. It does not emit an external definition for these functions, leading to unresolved symbol errors at link time.</p>
<p>For gcc, it is best to either declare the functions as <code>extern inline</code> and, in an additional module, provide a global definition of the function, or to declare the functions as <code>static inline</code> and live with the multiple local symbol definitions that this produces.</p>
<p>So for convenience it is tempting to use <code>static inline</code> for all compilers. This is a good workaround (ignoring the issue of duplicate local copies of functions), except for an issue around unused code.</p>
<p>The keyword <code>static</code> tells the compiler to emit a locally-scoped version of the function. Solaris Studio emits that function even if the function does not get called within the module. If that function calls a non-existent function, then you may get a link time error. Suppose you have the following code:</p>
<pre>
void error_message();
static inline unused()
{
error_message();
}
void main()
{
}
</pre>
<p>Compiling this we get the following error message:</p>
<pre>
$ cc -O i.c
"i.c", line 3: warning: no explicit type given
Undefined first referenced
symbol in file
error_message i.o
</pre>
<p>Even though the function call exists in code that is not used, there is a link error for the undefined function <code>error_message()</code>. The same error would occur if <code>extern inline</code> was used as this would cause a global version of the function to be emitted. The problem would not occur if the function were just declared as <code>inline</code> because in this case the compiler would not emit either a global or local version of the function. The same code compiles with gcc because the unused function is not generated.</p>
<p>So to summarise the options:</p>
<ul>
<li>Declare everything <code>static inline</code>, and ensure that there are no undefined functions, and that there are no functions that call undefined functions.</li>
<li>Declare everything <code>inline</code> for Studio and <code>extern inline</code> for gcc. Then provide a global version of the function in a separate file.</li>
</ul>
Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-46074267241685884042015-01-28T08:00:00.000-08:002015-01-28T08:00:06.618-08:00Improving performance through bit manipulation: clear last set bit<p><a href="http://en.wikipedia.org/wiki/Bit_manipulation">Bit manipulation</a> is one of those fun areas where you can get a performance gain from recoding a routine to use logical or arithmetic instructions rather than a more straight-forward code.</p>
<p>Of course, in doing this you need to avoid <a href="http://queue.acm.org/detail.cfm?id=1117403">the pit fall of premature optimisation</a> - where you needlessly make the code more obscure with no benefit, or a benefit that disappears as soon as you run your code on a different machine. So with that caveat in mind, let's take a look at a simple example.</p>
<p><b>Clear last set bit</b></p>
<p>This is a great starting point because it nicely demonstrates how we can sometimes replace a fair chunk of code with a much simpler set of instructions. Of course, the algorithm that uses fewer instructions is harder to understand, but in some situations the performance gain is worth it.</p>
<p>We'll start off with some classic code to solve the problem. The reason for this is two-fold. First of all we want to clearly understand the problem we're solving. Secondly, we want a reference version that we can test against to ensure that our fiendishly clever code is actually correct. So here's our starting point:</p>
<pre>
unsigned long long clearlastbit( unsigned long long value )
{
  unsigned long long bit = 1; // must be 64-bit, or bits above 31 are never found
  if ( value == 0 ) { return 0; }
  while ( !( value & bit ) )
  {
    bit = bit << 1;
  }
  value = value ^ bit;
  return value;
}
</pre>
<p>But before we start trying to improve it we need a timing harness to find out how fast it runs. The following harness uses the Solaris call <a href="http://docs.oracle.com/cd/E23824_01/html/821-1465/gethrtime-3c.html">gethrtime()</a> to return a timestamp in nanoseconds.</p>
<pre>
#include <stdio.h>
#include <sys/time.h>

static double s_time;

void starttime()
{
  s_time = 1.0 * gethrtime();
}

void endtime( unsigned long long its )
{
  double e_time = 1.0 * gethrtime();
  printf( "Time per iteration %5.2f ns\n", ( e_time - s_time ) / ( 1.0 * its ) );
  s_time = 1.0 * gethrtime();
}
</pre>
<p>The next thing we need is a workload to test the current implementation. The workload iterates over a range of numbers and repeatedly calls clearlastbit() until all the bits in the current number have been cleared.</p>
<pre>
#define COUNT 1000000

void main()
{
  starttime();
  for ( unsigned long long i = 0; i < COUNT; i++ )
  {
    unsigned long long value = i;
    while ( value ) { value = clearlastbit( value ); }
  }
  endtime(COUNT);
}
</pre>
<p><b>Big O notation</b></p>
<p>So let's take a break at this point to discuss <a href="http://en.wikipedia.org/wiki/Big_O_notation">big O notation</a>. If we look at the code for clearlastbit() we can see that it contains a loop. We'll iterate around the loop once for each bit in the input value, so for a N bit number we might iterate up to N times. We say that this computation is "order N", meaning that the cost the calculation is somehow proportional to the number of bits in the input number. This is written as O(N).</p>
<p>The order N description is useful because it gives us some idea of the cost of calling the routine. From it we know that the routine will typically take twice as long if coded for 8 byte inputs than if we used 4 byte inputs. Order N is not too bad as costs go, the ones to look out for are order N squared, or cubed etc. For these higher orders the run time to complete a computation can become huge for even comparatively small values of N.</p>
<p>If we look at the test harness, we are iterating over the function COUNT times, so effectively the entire program is O(COUNT*N), and we're exploiting the fact that this is effectively an O(N^2) cost to provide a workload that has a reasonable duration.</p>
<p>So let's return to the problem of clearing the last set bit. One obvious optimisation would be to record the last bit that was cleared, and then start the next iteration of the loop from that point. This is potentially a nice gain, but does not fundamentally change the algorithm. A better approach is to take advantage of bit manipulation so that we can avoid the loop altogether.</p>
<pre>
unsigned long long clearlastbit2( unsigned long long value )
{
  return ( value & ( value - 1 ) );
}
</pre>
<p>Ok, if you look at this code it is not immediately apparent what it does - most people would at first sight say "How can that possibly do anything useful?". The easiest way to understand it is to take an example. Suppose we pass the value ten into this function. In binary ten is encoded as 1010b. The first operation is the subtract operation which produces the result of nine, which is encoded as 1001b. We then take the AND of these two to get the result of 1000b or eight. We've cleared the last set bit because the subtract either removed the one bit (if it was set) or broke down the next largest set bit. The AND operation just keeps the bits to the left of the last set bit.</p>
<p>What is interesting about this snippet of code is that it is just three instructions. There's no loop and no branches - so most processors can execute this code very quickly. To demonstrate how much faster this code is, we need a test harness. The test harness should have two parts to it. The first part needs to validate that the new code produces the same result as the existing code. The second part needs to time the old and new code.</p>
<pre>
#define COUNT 1000000

void main()
{
  // Correctness test
  for ( unsigned long long i = 0; i < COUNT; i++ )
  {
    unsigned long long value = i;
    while ( value )
    {
      unsigned long long v2 = value;
      value = clearlastbit( value );
      if ( value != clearlastbit2( v2 ) )
      {
        printf( " Mismatch input %llx: %llx != %llx\n", v2, value, clearlastbit2( v2 ) );
      }
    }
  }
  // Performance test
  starttime();
  for ( unsigned long long i = 0; i < COUNT; i++ )
  {
    unsigned long long value = i;
    while ( value ) { value = clearlastbit( value ); }
  }
  endtime(COUNT);
  starttime();
  for ( unsigned long long i = 0; i < COUNT; i++ )
  {
    unsigned long long value = i;
    while ( value ) { value = clearlastbit2( value ); }
  }
  endtime(COUNT);
}
</pre>
<p>The final result is that the bit manipulation version of this code is about 3x faster than the original code - on this workload. Of course one of the interesting things is that the performance does depend on the input values. For example, if there are no set bits, then both codes will run in about the same amount of time.</p>
Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-13861669020759365202015-01-13T12:41:00.003-08:002015-01-13T12:52:22.766-08:00Missing semi-colon<p>Thought I'd highlight this error message:</p>
<pre>
class foo
{
foo();
}
foo::foo()
{ }
</pre>
<pre>
$ CC -c c.cpp
"c.cpp", line 6: Error: A constructor may not have a return type specification.
1 Error(s) detected.
</pre>
<p>The problem is that the class definition is not terminated with a semi-colon. It should be:</p>
<pre>
class foo
{
foo();
}; // Semi-colon
foo::foo()
{ }
</pre>
Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-11591316565409376922015-01-07T09:17:00.002-08:002015-01-07T09:17:50.330-08:00Behaviour of std::list::splice in the 2003 and 2011 C++ standards<p>There's an interesting corner case in the behaviour of std::list::splice. In the C++98/C++03 standards it is defined such that iterators referring to the spliced element(s) are invalidated. This behaviour changes in the C++11 standard, where iterators remain valid.</p>
<p>The text of the 2003 standard (section 23.2.2.4, p2, p7, p12) describes the splice operation as "destructively" moving elements from one list to another. If one list is spliced into another, then all iterators and references to that list become invalid. If an element is spliced into a list, then any iterators and references to that element become invalid, similarly if a range of elements is spliced then iterators and references to those elements become invalid.</p>
<p>This is changed in the 2011 standard (section 23.3.5.5, p2, p4, p7, p12) where the operation is still described as being destructive, but all the iterators and references to the spliced element(s) remain valid.</p>
<p>The following code demonstrates the problem:</p>
<pre>
#include <list>
#include <iostream>

int main()
{
  std::list<int> list;
  std::list<int>::iterator i;

  i = list.begin();
  list.insert( i, 5 );
  list.insert( i, 10 );
  list.insert( i, 3 );
  list.insert( i, 4 ); // i points to end
  // list contains 5 10 3 4
  i--; // i points to 4
  i--; // i points to 3
  i--; // i points to 10

  std::cout << " List contains: ";
  for ( std::list<int>::iterator l = list.begin(); l != list.end(); l++ )
  {
    std::cout << " >" << *l << "< ";
  }
  std::cout << "\n element at i = " << *i << "\n";

  std::list<int>::iterator element;
  element = list.begin();
  element++; // points to 10
  element++; // points to 3
  std::cout << " element at element = " << *element << "\n";

  list.splice( i, list, element ); // Swap 10 and 3

  std::cout << " List contains :";
  for ( std::list<int>::iterator l = list.begin(); l != list.end(); l++ )
  {
    std::cout << " >" << *l << "< ";
  }
  std::cout << "\n element at element = " << *element << '\n';
  element++; // C++03: access to a potentially invalid iterator
  std::cout << " element at element = " << *element << '\n';
}
</pre>
<p>When compiled to the 2011 standard the code is expected to work and produce output like:</p>
<pre>
List contains: >5< >10< >3< >4<
element at i = 10
element at element = 3
List contains : >5< >3< >10< >4<
element at element = 3
element at element = 10
</pre>
<p>However, the behaviour when compiled to the 2003 standard is indeterminate. It might work - if the iterator happens to remain valid, but it could also fail:</p>
<pre>
List contains: >5< >10< >3< >4<
element at i = 10
element at element = 3
List contains : >5< >3< >10< >4<
element at element = 3
element at element = Segmentation Fault (core dumped)
</pre>
Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-19597059198145028682015-01-06T13:27:00.002-08:002015-01-06T13:27:28.980-08:00New articles about Solaris Studio<p>We've started posting new articles directly into the <a href="https://community.oracle.com/community/server_%26_storage_systems/systems-development-and-management-tools/application_development_in_c__c%2B%2B__and_fortran">communities section</a> of the Oracle website. If you're not familiar with this location, it's also where you can post questions on <a href="https://community.oracle.com/community/server_%26_storage_systems/systems-development-and-management-tools/application_development_in_c__c%2B%2B__and_fortran/solaris_studio_c_c%2B%2B_fortran_compilers">languages</a> or <a href="https://community.oracle.com/community/server_%26_storage_systems/systems-development-and-management-tools/application_development_in_c__c%2B%2B__and_fortran/solaris_studio_debugging_and_analysis_tools">tools</a>.</p>
<p>With the change it should be easier to find articles relevant to developers, and it should be easy to comment on them. So hopefully this works out well. There are currently three articles listed on the <a href="https://community.oracle.com/community/server_%26_storage_systems/systems-development-and-management-tools/application_development_in_c__c%2B%2B__and_fortran/content?customTheme=otn">content page</a>. I've already posted about the article on the <a href="https://community.oracle.com/docs/DOC-894076">Performance Analyzer Overview screen</a>, so I'll quickly highlight the other two:</p>
<ul>
<li>In Studio 12.4 we've introduced <a href="https://community.oracle.com/docs/DOC-894097">finer control over debug information</a>. This allows you to reduce object file size by excluding debug info that you don't need. There's substantial control, but probably the easiest new option is <code>-g1</code> which includes a minimal set of debug info.</li>
<li>A change in Studio 12.4 is in the <a href="https://community.oracle.com/docs/DOC-891276">default way that C++ handles templates</a>. The summary is that the compiler's default mode is now in line with the way that other compilers work - you need to include template definitions in the source file being compiled. Previously the compiler would try to find the definitions in external files. This old behaviour could be confusing, so it's good that the default has changed. But it's possible that you may encounter code that was written with the expectation that the Studio compiler behaved in the old way; in this case you'll need to modify the source, or tell the compiler to revert to the older behaviour. Hopefully, most people won't even notice the change, but it's worth knowing the signs in case you encounter a problem.</li>
</ul>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0tag:blogger.com,1999:blog-1527230864763214922.post-88528919202094795302015-01-06T13:13:00.002-08:002015-01-06T13:13:10.834-08:00The Performance Analyzer Overview screen<p>A while back <a href="https://blogs.oracle.com/d/entry/new_performance_analyzer_overview_screen">I promised</a> a more complete article about the Performance Analyzer Overview screen introduced in Studio 12.4. Well, <a href="https://community.oracle.com/docs/DOC-894076">here it is</a>!</p>
<p>Just to recap, the Overview screen displays a summary of the contents of the experiment. This enables you to pick the appropriate metrics to display, so quickly allows you to find where the time is being spent, and then to use the rest of the tool to drill down into what is taking the time.</p>Darrylhttp://www.blogger.com/profile/06979874258520423011noreply@blogger.com0