Thursday, March 2, 2023

go/fast tips being published is external exposure of one of the projects I've been involved in at Google. For a long time I've felt that performance optimisation is the best job in the world, but to really move the needle we need to ensure that folks have access to information about writing efficient code - hence this blog and the books. At work, internally we've been gathering and documenting best practices for performance. Finally, we're able to publish some of these externally!

Tuesday, June 15, 2021

TCMalloc hugepage paper

I'm pleased to announce that our paper on TCMalloc and hugepages has been made available.

This paper describes the changes made to TCMalloc in order for it to become hugepage aware. It required a complete restructure of the "backend" - the bit that manages memory requested from the OS. The obvious change is to manage this memory in chunks of 2MiB (x86 hugepage size). Once you have that you need a cache that handles allocations of less than 2MiB, another that handles allocations of greater than 2MiB, etc.

One of the key observations in the paper is that increasing hugepage coverage improves application performance, not necessarily allocator performance. i.e. you can spend extra time in the allocator preserving hugepage coverage, this improves the performance of the code that uses the memory. The net result is that you can slightly increase time spent in the allocator, and reduce time spent in the application. This makes the time spent in the allocator increase - making it look worse!

To work around the fact that the allocator looks worse, you also need to look for metrics that represent the productivity of the application. If the productivity increases, then the changes to the allocator are an efficiency win - even if the allocator ends up taking a greater percentage of the entire runtime.

Monday, June 7, 2021

Paper on releasing memory back to the OS

Co-author on a paper about releasing memory back to the OS in TCMalloc.

The quick summary is that memory returned to the OS is often requested back very rapidly. Hence waiting a bit of time before returning unused memory improves performance without increasing practical memory footprint.

The performance improvements comes from not needing to break up hugepages in order to return memory to the OS. Additionally we don't spend time in the OS returning memory, or faulting it back into the memory of the process.

We don't make a practical increase in the RAM that an application needs because the memory is typically requested after a short interval, and there's insufficient time to be able to schedule anything in the interval.

Tuesday, February 9, 2021

Featured in TheRegister

I was surprised to find that my Oracle blog post were mentioned in TheRegister. Apparently Oracle has restored the content. The original content appears to still be there, but the linked images etc have disappeared - so the new content should probably be described as "Remastered" !

Tuesday, December 29, 2020

Bit masking on x86

In theory it's pretty easy to generate a bitmask for a range of bits:

unsigned long long mask(int bits) {
  return mask = (1ull << bits) - 1;

You need to specify that the 1 being shifted is an unsigned long long otherwise it gets treated as a 32-bit int, and the code only works for the range 0..31.

However, this code fails to work on x86. For example:

#include <math.h>
#include <stdio.h>

unsigned long long shift(int x) {
    return (1ull << x) - 1;

int main() {
    printf("Value %0llx\n", shift(64));

This returns the value 0 for the result of shifting by 64 when run on x86.

The reason for this can be found in the Intel docs (Vol. 2B 4-583):

The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used). The count range is limited to 0 to 31 (or 63 if 64-bit mode and REX.W is used).

The result of this is that the one is shifted by zero - ie remains unchanged - and subtracting 1 from that produces the value zero.

Unfortunately, this means we need a more complex bit of code that handles shifts of greater than 64 correctly:

unsigned long long mask(int bits) {
  return mask = (bits >= 64 ? -1 : (1ull << bits) - 1);

Wednesday, February 12, 2020

TCMalloc - fast, hugepage aware, memory allocator

I'm thrilled that we've been able to release a revised version of TCMalloc.

In it's new default mode it holds per virtual CPU caches so that a thread running on a CPU can have contention free allocation and deallocation. This relies on Restartable Sequences - a Linux feature that ensures a region of code either completes without interruption by the OS, or gets restarted if there is an interruption.

The other compelling feature is that it is hugepage aware. It ultimately manages memory in hugepage chunks so that it can pack data onto as few hugepages as possible.