Population count is one of the more esoteric instructions. It's the operation to count the number of set bits in a register. It comes up with sufficient frequency that most processors have a hardware instruction to do it. However, for this example, we're going to look at coding it in software. First of all we'll write a baseline version of the code:

```c
int popc(unsigned long long value)
{
  unsigned long long bit = 1;
  int popc = 0;
  while (bit)
  {
    if (value & bit) { popc++; }
    bit = bit << 1;
  }
  return popc;
}
```

The above code examines every bit in the input and counts the number of set bits. The number of iterations is proportional to the number of bits in the register.

Most people will immediately recognise that we could make this a bit faster using the code we discussed previously that clears the last set bit: while there are set bits, keep clearing them; once none remain, you're done. The advantage of this approach is that you only iterate once for every set bit in the value. So if there are no set bits, you do not do any iterations at all.

```c
int popc2(unsigned long long value)
{
  int popc = 0;
  while (value)
  {
    popc++;
    value = value & (value - 1);
  }
  return popc;
}
```

The next thing to do is to put together a test harness that confirms that the new code produces the same results as the old code, and also measures the performance of the two implementations.

```c
#include <stdio.h>

#define COUNT 1000000

int main(void)
{
  // Correctness test
  for (unsigned long long i = 0; i < COUNT; i++)
  {
    if (popc(i + (i << 32)) != popc2(i + (i << 32)))
    {
      printf("Mismatch popc2 input %llx: %d != %d\n",
             i + (i << 32), popc(i + (i << 32)), popc2(i + (i << 32)));
    }
  }

  // Performance test
  starttime();
  for (unsigned long long i = 0; i < COUNT; i++) { popc(i + (i << 32)); }
  endtime(COUNT);

  starttime();
  for (unsigned long long i = 0; i < COUNT; i++) { popc2(i + (i << 32)); }
  endtime(COUNT);

  return 0;
}
```

The new code is about twice as fast as the old code. However, the new code still contains a loop, and this can be a bit of a problem.

**Branch mispredictions**

The trouble with loops, and with branches in general, is that processors don't know which instruction will be executed after the branch until the branch has been resolved, yet the processor needs to have already fetched the instructions after the branch well before this. The problem is nicely summarised by Holly in Red Dwarf:

*"Look, I'm trying to navigate at faster than the speed of light, which means that before you see something, you've already passed through it."*

So processors use branch prediction to guess whether a branch is taken or not. If the prediction is correct there is no break in the instruction stream, but if the prediction is wrong, then the processor needs to throw away all the incorrectly predicted instructions, and fetch the instructions from correct address. This is a significant cost, so ideally you don't want mispredicted branches, and the best way of ensuring that is to not have branches at all!

The following code is a branchless sequence for computing population count:

```c
unsigned int popc3(unsigned long long value)
{
  unsigned long long v2;
  v2     = value >> 1;
  v2    &= 0x5555555555555555;
  value &= 0x5555555555555555;
  value += v2;
  v2     = value >> 2;
  v2    &= 0x3333333333333333;
  value &= 0x3333333333333333;
  value += v2;
  v2     = value >> 4;
  v2    &= 0x0f0f0f0f0f0f0f0f;
  value &= 0x0f0f0f0f0f0f0f0f;
  value += v2;
  v2     = value >> 8;
  v2    &= 0x00ff00ff00ff00ff;
  value &= 0x00ff00ff00ff00ff;
  value += v2;
  v2     = value >> 16;
  v2    &= 0x0000ffff0000ffff;
  value &= 0x0000ffff0000ffff;
  value += v2;
  v2     = value >> 32;
  value += v2;
  return (unsigned int) value;
}
```

This instruction sequence computes the population count by initially adding adjacent bits to get a two-bit result of 0, 1, or 2. It then adds the adjacent two-bit fields to get a four-bit result of between 0 and 4. Next it adds adjacent nibbles to get a byte result, then adds pairs of bytes to get shorts, then adds shorts to get a pair of ints, which it adds to get the final value. The code contains a fair number of AND operations to mask out the bits that are not part of each partial result.

This bit-manipulation version is about twice as fast as the clear-last-set-bit version, making it about four times faster than the original code. However, it is worth noting that this is a fixed cost: the routine takes the same amount of time regardless of the input value. In contrast, the clear-last-set-bit version will exit early if there are no set bits. Consequently the performance gain for the code will depend on both the input value and the cost of mispredicted branches.