Wednesday, January 28, 2015

Improving performance through bit manipulation: clear last set bit

Bit manipulation is one of those fun areas where you can get a performance gain from recoding a routine to use logical or arithmetic instructions rather than a more straight-forward code.

Of course, in doing this you need to avoid the pit fall of premature optimisation - where you needlessly make the code more obscure with no benefit, or a benefit that disappears as soon as you run your code on a different machine. So with that caveat in mind, let's take a look at a simple example.

Clear last set bit

This is a great starting point because it nicely demonstrates how we can sometimes replace a fair chunk of code with a much simpler set of instructions. Of course, the algorithm that uses fewer instructions is harder to understand, but in some situations the performance gain is worth it.

We'll start off with some classic code to solve the problem. The reason for this is two-fold. First of all we want to clearly understand the problem we're solving. Secondly, we want a reference version that we can test against to ensure that our fiendishly clever code is actually correct. So here's our starting point:

unsigned long long clearlastbit( unsigned long long value )
{
  int bit=1;
  if ( value== 0 ) { return 0; }
  while ( !(value & bit) )
  {
    bit = bit << 1;
  }
  value = value ^ bit;
  return value;
}

But before we start trying to improve it we need a timing harness to find out how fast it runs. The following harness uses the Solaris call gethrtime() to return a timestamp in nanoseconds.

#include <stdio.h>
#include <sys/time.h>

static double s_time;

void starttime()
{
  s_time = 1.0 * gethrtime();
}

void endtime(unsigned long long its)
{
  double e_time = 1.0 * gethrtime();
  printf( "Time per iteration %5.2f ns\n", (e_time-s_time) / (1.0*its) );
  s_time = 1.0 * gethrtime();
}

The next thing we need is a workload to test the current implementation. The workload iterates over a range of numbers and repeatedly calls clearlastbit() until all the bits in the current number have been cleared.

#define COUNT 1000000
void main()
{
  starttime();
  for (unsigned long long i = 0; i < COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) { value = clearlastbit(value); }
  }
  endtime(COUNT);
}

Big O notation

So let's take a break at this point to discuss big O notation. If we look at the code for clearlastbit() we can see that it contains a loop. We'll iterate around the loop once for each bit in the input value, so for a N bit number we might iterate up to N times. We say that this computation is "order N", meaning that the cost the calculation is somehow proportional to the number of bits in the input number. This is written as O(N).

The order N description is useful because it gives us some idea of the cost of calling the routine. From it we know that the routine will typically take twice as long if coded for 8 byte inputs than if we used 4 byte inputs. Order N is not too bad as costs go, the ones to look out for are order N squared, or cubed etc. For these higher orders the run time to complete a computation can become huge for even comparatively small values of N.

If we look at the test harness, we are iterating over the function COUNT times, so effectively the entire program is O(COUNT*N), and we're exploiting the fact that this is effectively an O(N^2) cost to provide a workload that has a reasonable duration.

So let's return to the problem of clearing the last set bit. One obvious optimisation would be to record the last bit that was cleared, and then start the next iteration of the loop from that point. This is potentially a nice gain, but does not fundamentally change the algorithm. A better approach is to take advantage of bit manipulation so that we can avoid the loop altogether.

unsigned long long clearlastbit2( unsigned long long value )
{
  return (value & (value-1) );
}

Ok, if you look at this code it is not immediately apparent what it does - most people would at first sight say "How can that possibly do anything useful?". The easiest way to understand it is to take an example. Suppose we pass the value ten into this function. In binary ten is encoded as 1010b. The first operation is the subtract operation which produces the result of nine, which is encoded as 1001b. We then take the AND of these two to get the result of 1000b or eight. We've cleared the last set bit because the subtract either removed the one bit (if it was set) or broke down the next largest set bit. The AND operation just keeps the bits to the left of the last set bit.

What is interesting about this snippet of code is that it is just three instructions. There's no loop and no branches - so most processors can execute this code very quickly. To demonstrate how much faster this code is, we need a test harness. The test harness should have two parts to it. The first part needs to validate that the new code produces the same result as the existing code. The second part needs to time the old and new code.

#define COUNT 1000000
void main()
{
  // Correctness test
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) 
    { 
       unsigned long long v2 = value;
       value = clearlastbit(value);
       if (value != clearlastbit2(v2)) 
       { 
         printf(" Mismatch input %llx: %llx!= %llx\n", v2, value, clearlastbit2(v2)); 
       }
    }
  }
  
  // Performance test
  starttime();
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) { value = clearlastbit(value); }
  }
  endtime(COUNT);

  starttime();
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) { value = clearlastbit2(value); }
  }
  endtime(COUNT);
}

The final result is that the bit manipulation version of this code is about 3x faster than the original code - on this workload. Of course one of the interesting things is that the performance does depend on the input values. For example, if there are no set bits, then both codes will run in about the same amount of time.