Tuesday, February 3, 2015

Bit manipulation: Gathering bits

In the last post on bit manipulation we looked at how we could identify bytes that were greater than a particular target value, and stop when we discovered one. The resulting vector of bytes contained a zero byte for those which did not meet the criteria, and a byte containing 0x80 for those that did. Obviously we could express the result much more efficiently if we assigned a single bit for each result. The following is "lightly" optimised code for producing a bit vector indicating the position of zero bytes:

void zeros( unsigned char * array, int length, unsigned char * result )
{
  for (int i=0;i < length; i+=8)
  {
    result[i>>3] = 
    ( (array[i+0]==0) << 7) +
    ( (array[i+1]==0) << 6) +
    ( (array[i+2]==0) << 5) +
    ( (array[i+3]==0) << 4) +
    ( (array[i+4]==0) << 3) +
    ( (array[i+5]==0) << 2) +
    ( (array[i+6]==0) << 1) +
    ( (array[i+7]==0) << 0);
  }
}

The code is "lightly" optimised because it works on eight values at a time. This helps performance because the code can store results a byte at a time. An even less optimised version would split the index into a byte and bit offset and use that to update the result vector.

When we previously looked at finding zero bytes we used Mycroft's algorithm that determines whether a zero byte is present or not. It does not indicate where the zero byte is to be found. For this new problem we want to identify exactly which bytes contain zero. So we can come up with two rules that both need be true:

  • The inverted byte must have a set upper bit.
  • If we invert the byte and select the lower bits, adding one to these must set the upper bit.

Putting these into a logical operation we get (~byte & ( (~byte & 0x7f) + 1) & 0x80). For non-zero input bytes we get a result of zero, for zero input bytes we get a result of 0x80. Next we need to convert these into a bit vector.

If you recall the population count example from earlier, we used a set of operations to combine adjacent bits. In this case we want to do something similar, but instead of adding bits we want to shift them so that they end up in the right places. The code to perform the comparison and shift the results is:

void zeros2( unsigned long long* array, int length, unsigned char* result )
{
  for (int i=0; i<length; i+=8)
  {
    unsigned long long v, u;
    v = array[ i>>3 ];

    u = ~v;
    u = u & 0x7f7f7f7f7f7f7f7f;
    u = u + 0x0101010101010101;
    v = u & (~v);
    v = v & 0x8080808080808080;

    v = v | (v << 7);
    v = v | (v << 14);
    v = (v >> 56) | (v >> 28);
    result[ i>>3 ] = v;
  }
}

The resulting code runs about four times faster than the original.

Concluding remarks

So that ends this brief series on bit manipulation, I hope you've found it interesting, if you want to investigate this further there are plenty of resources on the web, but it would be hard to skip mentioning the book "The Hacker's Delight", which is a great read on this domain.

There's a couple of concluding thoughts. First of all performance comes from doing operations on multiple items of data in the same instruction. This should sound familiar as "SIMD", so a processor might often have vector instructions that already get the benefits of single instruction, multiple data, and single SIMD instruction might replace several integer operations in the above codes. The other place the performance comes from is eliminating branch instructions - particularly the unpredictable ones, again vector instructions might offer a similar benefit.