Tuesday, December 29, 2020

Bit masking on x86

In theory it's pretty easy to generate a bitmask for a range of bits:

unsigned long long mask(int bits) {
  return mask = (1ull << bits) - 1;
}

You need to specify that the 1 being shifted is an unsigned long long otherwise it gets treated as a 32-bit int, and the code only works for the range 0..31.

However, this code fails to work on x86. For example:

#include <math.h>
#include <stdio.h>

unsigned long long shift(int x) {
    return (1ull << x) - 1;
}

int main() {
    printf("Value %0llx\n", shift(64));
}

This returns the value 0 for the result of shifting by 64 when run on x86.

The reason for this can be found in the Intel docs (Vol. 2B 4-583):

The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used). The count range is limited to 0 to 31 (or 63 if 64-bit mode and REX.W is used).

The result of this is that the one is shifted by zero - ie remains unchanged - and subtracting 1 from that produces the value zero.

Unfortunately, this means we need a more complex bit of code that handles shifts of greater than 64 correctly:

unsigned long long mask(int bits) {
  return mask = (bits >= 64 ? -1 : (1ull << bits) - 1);
}

Wednesday, February 12, 2020

TCMalloc - fast, hugepage aware, memory allocator

I'm thrilled that we've been able to release a revised version of TCMalloc.

In it's new default mode it holds per virtual CPU caches so that a thread running on a CPU can have contention free allocation and deallocation. This relies on Restartable Sequences - a Linux feature that ensures a region of code either completes without interruption by the OS, or gets restarted if there is an interruption.

The other compelling feature is that it is hugepage aware. It ultimately manages memory in hugepage chunks so that it can pack data onto as few hugepages as possible.