Friday, March 30, 2012

Inline template efficiency

I like inline templates, and use them quite extensively. Whenever I write code with them I'm always careful to check the disassembly to see that the resulting output is efficient. Here's a potential cause of inefficiency.

Suppose we want to use the mis-named Leading Zero Detect (LZD) instruction on T4 (this instruction does a count of the number of leading zero bits in an integer register - so it should really be called leading zero count). So we put together an inline template called lzd.il looking like:

.inline lzd
  lzd %o0,%o0
.end

And we throw together some code that uses it:

int lzd(int);

int a;
int c=0;

int main()
{
  for(a=0; a<1000; a++)
  {
    c=lzd(c);
  }
  return 0;
}

We compile the code with some amount of optimisation, and look at the resulting code:

$ cc -O -xtarget=T4 -S lzd.c lzd.il
$ more lzd.s
                        .L77000018:
/* 0x001c         11 */         lzd     %o0,%o0
/* 0x0020          9 */         ld      [%i1],%i3
/* 0x0024         11 */         st      %o0,[%i2]
/* 0x0028          9 */         add     %i3,1,%i0
/* 0x002c            */         cmp     %i0,999
/* 0x0030            */         ble,pt  %icc,.L77000018
/* 0x0034            */         st      %i0,[%i1]

What is surprising is that we're seeing a number of loads and stores in the code. Everything could be held in registers, so why is this happening?

The problem is that the code is only inlined at the code generation stage - when the actual instructions are generated. Earlier compiler phases see a function call. The called functions can do all kinds of nastiness to global variables (like 'a' in this code) so we need to load them from memory after the function call, and store them to memory before the function call.

Fortunately we can use a #pragma directive to tell the compiler that the routine lzd() has no side effects - meaning that it does not read or write to memory. The directive to do that is #pragma no_side_effect(<routine name>), and it needs to be placed after the declaration of the function. The new code looks like:

int lzd(int);
#pragma no_side_effect(lzd)

int a;
int c=0;

int main()
{
  for(a=0; a<1000; a++)
  {
    c=lzd(c);
  }
  return 0;
}

Now the loop looks much neater:

/* 0x0014         10 */         add     %i1,1,%i1

!   11                !  {
!   12                !    c=lzd(c);

/* 0x0018         12 */         lzd     %o0,%o0
/* 0x001c         10 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000018
/* 0x0024            */         nop