Monday, April 23, 2012

sincos()

If you are computing both the sine and cosine of an angle, then you will be twice as quick if you call sincos() than if you call cos() and sin() independently:

#include 

int main()
{
  double a,b,c;
  a=1.0;
  for (int i=0;i<100000000;i++) { b=sin(a); c=cos(a); }
}

$ cc -O sc.c -lm
$ timex ./a.out
real          19.13

vs

#include 

int main()
{
  double a,b,c;
  a=1.0;
  for (int i=0;i<100000000;i++) { sincos(a,&b,&c); }
}
$ cc -O sc.c -lm
$ timex ./a.out
real           9.80

Friday, April 20, 2012

What is -xcode=abs44?

I've talked about building 64-bit libraries with position independent code. When building 64-bit applications there are two options for the code that the compiler generates: -xcode=abs64 or -xcode=abs44, the default is -xcode=abs44. These are documented in the user guides. The abs44 and abs64 options produce 64-bit applications that constrain the code + data + BSS to either 44 bit or 64 bits of address.

These options constrain the addresses statically encoded in the application to either 44 or 64 bits. It does not restrict the address range for pointers (dynamically allocated memory) - they remain 64-bits. The restriction is in locating the address of a routine or a variable within the executable.

This is easier to understand from the perspective of an example. Suppose we have a variable "data" that we want to return the address of. Here's the code to do such a thing:

extern int data;

int * address()
{
  return &data;
}

If we compile this as a 32-bit app we get the following disassembly:

/* 000000          4 */         sethi   %hi(data),%o5
/* 0x0004            */         retl    ! Result =  %o0
/* 0x0008            */         add     %o5,%lo(data),%o0

So it takes two instructions to generate the address of the variable "data". At link time the linker will go through the code, locate references to the variable "data" and replace them with the actual address of the variable, so these two instructions will get modified. If we compile this as a 64-bit code with full 64-bit address generation (-xcode=abs64) we get the following:

/* 000000          4 */         sethi   %hh(data),%o5
/* 0x0004            */         sethi   %lm(data),%o2
/* 0x0008            */         or      %o5,%hm(data),%o4
/* 0x000c            */         sllx    %o4,32,%o3
/* 0x0010            */         or      %o3,%o2,%o1
/* 0x0014            */         retl    ! Result =  %o0
/* 0x0018            */         add     %o1,%lo(data),%o0

So to do the same thing for a 64-bit application with full 64-bit address generation takes 6 instructions. Now, most hardware cannot address the full 64-bits, hardware typically can address somewhere around 40+ bits of address (example). So being able to generate a full 64-bit address is currently unnecessary. This is where abs44 comes in. A 44 bit address can be generated in four instructions, so slightly cuts the instruction count without practically compromising the range of memory that an application can address:

/* 000000          4 */         sethi   %h44(data),%o5
/* 0x0004            */         or      %o5,%m44(data),%o4
/* 0x0008            */         sllx    %o4,12,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         add     %o3,%l44(data),%o0

Monday, April 2, 2012

Efficient inline templates and C++

I've talked before about calling inline templates from C++, I've also talked about calling inline templates efficiently. This time I want to talk about efficiently calling inline templates from C++.

The obvious starting point is that I need to declare the inline templates as being extern "C":

  extern "C"
  {
    int mytemplate(int);
  }

This enables us to call it, but the call may not be very efficient because the compiler will treat it as a function call, and may produce suboptimal code based on that premise. So we need to add the no_side_effect pragma:

  extern "C"
  {
    int mytemplate(int); 
    #pragma no_side_effect(mytemplate)
  }

However, this may still not produce optimal code. We've discussed how the no_side_effect pragma cannot be combined with exceptions, well we know that the code cannot produce exceptions, but the compiler doesn't know that. If we tell the compiler that information it may be able to produce even better code. We can do this by adding the "throw()" keyword to the template declaration:

  extern "C"
  {
    int mytemplate(int) throw(); 
    #pragma no_side_effect(mytemplate)
  }

The following is an example of how these changes might improve performance. We can take our previous example code and migrate it to C++, adding the use of a try...catch construct:

#include <iostream>

extern "C"
{
  int lzd(int);
  #pragma no_side_effect(lzd)
}

int a;
int c=0;

class myclass
{
  int routine();
};

int myclass::routine()
{
  try
  {
    for(a=0; a<1000; a++)
    {
      c=lzd(c);
    }
  }
  catch(...)
  {
    std::cout << "Something happened" << std::endl;
  }
 return 0;
}

Compiling this produces a slightly suboptimal code sequence in the hot loop:

$ CC -O -xtarget=T4 -S t.cpp t.il
...
/* 0x0014         23 */         lzd     %o0,%o0
/* 0x0018         21 */         add     %l6,1,%l6
/* 0x001c            */         cmp     %l6,1000
/* 0x0020            */         bl,pt   %icc,.L77000033
/* 0x0024         23 */         st      %o0,[%l7]

There's a store in the delay slot of the branch, so we're repeatedly storing data back to memory. If we change the function declaration to include "throw()", we get better code:

$ CC -O -xtarget=T4 -S t.cpp t.il
...
/* 0x0014         21 */         add     %i1,1,%i1
/* 0x0018         23 */         lzd     %o0,%o0
/* 0x001c         21 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000019
/* 0x0024            */         nop

The store has gone, but the code is still suboptimal - there's a nop in the delay slot rather than useful work. However, it's good enough for this example. The point I'm making is that the compiler produces the better code with both the "throw()" and the no side effect pragma.

Pragmas and exceptions

The compiler pragmas:

  #pragma no_side_effect(routinename)
  #pragma does_not_write_global_data(routinename)
  #pragma does_not_read_global_data(routinename)

are used to tell the compiler more about the routine being called, and enable it to do a better job of optimising around the routine. If a routine does not read global data, then global data does not need to be stored to memory before the call to the routine. If the routine does not write global data, then global data does not need to be reloaded after the call. The no side effect directive indicates that the routine does no I/O, does not read or write global data, and the result only depends on the input.

However, these pragmas should not be used on routines that throw exceptions. The following example indicates the problem:

#include <iostream>

extern "C"
{
  int exceptional(int);
  #pragma no_side_effect(exceptional)
}

int exceptional(int a)
{
  if (a==7)
  {
    throw 7;
  }
  else
  {
   return a+1;
  } 
}


int a;
int c=0;

class myclass
{
  public:
  int routine();
};

int myclass::routine()
{
  for(a=0; a<1000; a++)
  {
    c=exceptional(c);
  }
 return 0;
}

int main()
{
  myclass f;
  try
  {
    f.routine();
  }
  catch(...)
  {
    std::cout << "Something happened" << a << c << std::endl;
  }
  
}

The routine "exceptional" is declared as having no side effects, however it can throw an exception. The no side effects directive enables the compiler to avoid storing global data back to memory, and retrieving it after the function call, so the loop containing the call to exceptional is quite tight:

$ CC -O -S test.cpp
...
                        .L77000061:
/* 0x0014         38 */         call    exceptional     ! params =  %o0 ! Result =  %o0
/* 0x0018         36 */         add     %i1,1,%i1
/* 0x001c            */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000061
/* 0x0024            */         nop

However, when the program is run the result is incorrect:

$ CC -O t.cpp
$ ./a.out
Something happend00

If the code had worked correctly, the output would have been "Something happened77" - the exception occurs on the seventh iteration. Yet, the current code produces a message that uses the original values for the variables 'a' and 'c'.

The problem is that the exception handler reads global data, and due to the no side effects directive the compiler has not updated the global data before the function call. So these pragmas should not be used on routines that have the potential to throw exceptions.