Wednesday, December 12, 2012

Compiling for T4

I've recently had quite a few queries about compiling for T4 based systems. So it's probably a good time to review what I consider to be the best practices.

  • Always use the latest compiler. Being in the compiler team, this is bound to be something I'd recommend :) But the serious points are that (a) Every release the tools get better and better, so you are going to be much more effective using the latest release (b) Every release we improve the generated code, so you will see things get better (c) Old releases cannot know about new hardware.
  • Always use optimisation. You should use at least -O to get some amount of optimisation. -xO4 is typically even better as this will add within-file inlining.
  • Always generate debug information, using -g. This allows the tools to attribute information to lines of source. This is particularly important when profiling an application.
  • The default target of -xtarget=generic is often sufficient. This setting is designed to produce a binary that runs well across all supported platforms. If the binary is going to be deployed on only a subset of architectures, then it is possible to produce a binary that only uses the instructions supported on these architectures, which may lead to some performance gains. I've previously discussed which chips support which architectures, and I'd recommend that you take a look at the chart that goes with the discussion.
  • Crossfile optimisation (-xipo) can be very useful - particularly when the hot source code is distributed across multiple source files. If you're allowed to have something as geeky as favourite compiler optimisations, then this is mine!
  • Profile feedback (-xprofile=[collect: | use:]) will help the compiler make the best code layout decisions, and is particularly effective with crossfile optimisations. But what makes this optimisation really useful is that codes that are dominated by branch instructions don't typically improve much with "traditional" compiler optimisation, but often do respond well to being built with profile feedback.
  • The macro flag -fast aims to provide a one-stop "give me a fast application" flag. This usually gives a best performing binary, but with a few caveats. It assumes the build platform is also the deployment platform, it enables floating point optimisations, and it makes some relatively weak assumptions about pointer aliasing. It's worth investigating.
  • SPARC64 processor, T3, and T4 implement floating point multiply accumulate instructions. These can substantially improve floating point performance. To generate them the compiler needs the flag -fma=fused and also needs an architecture that supports the instruction (at least -xarch=sparcfmaf).
  • The most critical advise is that anyone doing performance work should profile their application. I cannot overstate how important it is to look at where the time is going in order to determine what can be done to improve it.

I also presented at Oracle OpenWorld on this topic, so it might be helpful to review those slides.

Wednesday, December 5, 2012

Library order is important

I've written quite extensively about link ordering issues, but I've not discussed the interaction between archive libraries and shared libraries. So let's take a simple program that calls a maths library function:

#include <math.h>

int main()
{
  for (int i=0; i<10000000; i++)
  {
    sin(i);
  }
}

We compile and run it to get the following performance:

bash-3.2$ cc -g -O fp.c -lm
bash-3.2$ timex ./a.out

real           6.06
user           6.04
sys            0.01

Now most people will have heard of the optimised maths library which is added by the flag -xlibmopt. This contains optimised versions of key mathematical functions, in this instance, using the library doubles performance:

bash-3.2$ cc -g -O -xlibmopt fp.c -lm
bash-3.2$ timex ./a.out

real           2.70
user           2.69
sys            0.00

The optimised maths library is provided as an archive library (libmopt.a), and the driver adds it to the link line just before the maths library - this causes the linker to pick the definitions provided by the static library in preference to those provided by libm. We can see the processing by asking the compiler to print out the link line:

bash-3.2$ cc -### -g -O -xlibmopt fp.c -lm
/usr/ccs/bin/ld ... fp.o -lmopt -lm -o a.out...

The flag to the linker is -lmopt, and this is placed before the -lm flag. So what happens when the -lm flag is in the wrong place on the command line:

bash-3.2$ cc -g -O -xlibmopt -lm fp.c
bash-3.2$ timex ./a.out

real           6.02
user           6.01
sys            0.01

If the -lm flag is before the source file (or object file for that matter), we get the slower performance from the system maths library. Why's that? If we look at the link line we can see the following ordering:

/usr/ccs/bin/ld ... -lmopt -lm fp.o -o a.out 

So the optimised maths library is still placed before the system maths library, but the object file is placed afterwards. This would be ok if the optimised maths library were a shared library, but it is not - instead it's an archive library, and archive library processing is different - as described in the linker and library guide:

"The link-editor searches an archive only to resolve undefined or tentative external references that have previously been encountered."

An archive library can only be used resolve symbols that are outstanding at that point in the link processing. When fp.o is placed before the libmopt.a archive library, then the linker has an unresolved symbol defined in fp.o, and it will search the archive library to resolve that symbol. If the archive library is placed before fp.o then there are no unresolved symbols at that point, and so the linker doesn't need to use the archive library. This is why libmopt needs to be placed after the object files on the link line.

On the other hand if the linker has observed any shared libraries, then at any point these are checked for any unresolved symbols. The consequence of this is that once the linker "sees" libm it will resolve any symbols it can to that library, and it will not check the archive library to resolve them. This is why libmopt needs to be placed before libm on the link line.

This leads to the following order for placing files on the link line:

  • Object files
  • Archive libraries
  • Shared libraries

If you use this order, then things will consistently get resolved to the archive libraries rather than to the shared libaries.

Tuesday, December 4, 2012

It could be worse....

As "guest" pointed out, in my file I/O test I didn't open the file with O_SYNC, so in fact the time was spent in OS code rather than in disk I/O. It's a straightforward change to add O_SYNC to the open() call, but it's also useful to reduce the iteration count - since the cost per write is much higher:

...
#define SIZE 1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT|O_SYNC,S_IWGRP|S_IWOTH|S_IWUSR);
...

Running this gave the following results:

Time per iteration   0.000065606310 MB/s
Time per iteration   2.709711563906 MB/s
Time per iteration   0.178590114758 MB/s

Yup, disk I/O is way slower than the original I/O calls. However, it's not a very fair comparison since disks get written in large blocks of data and we're deliberately sending a single byte. A fairer result would be to look at the I/O operations per second; which is about 65 - pretty much what I'd expect for this system.

It's also interesting to examine at the profiles for the two cases. When the write() was trapping into the OS the profile indicated that all the time was being spent in system. When the data was being written to disk, the time got attributed to sleep. This gives us an indication how to interpret profiles from apps doing I/O. It's the sleep time that indicates disk activity.

Write and fprintf for file I/O

fprintf() does buffered I/O, where as write() does unbuffered I/O. So once the write() completes, the data is in the file, whereas, for fprintf() it may take a while for the file to get updated to reflect the output. This results in a significant performance difference - the write works at disk speed. The following is a program to test this:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>

static double s_time;

void starttime()
{
  s_time=1.0*gethrtime();
}

void endtime(long its)
{
  double e_time=1.0*gethrtime();
  printf("Time per iteration %5.2f MB/s\n", (1.0*its)/(e_time-s_time*1.0)*1000);
  s_time=1.0*gethrtime();
}

#define SIZE 10*1024*1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT,S_IWGRP|S_IWOTH|S_IWUSR);
  for (int i=0; i<SIZE; i++)
  {
    write(file,"a",1);
  }
  close(file);
  endtime(SIZE);
}

void test_fprintf()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
  }
  fclose(file);
  endtime(SIZE);
}

void test_flush()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
    fflush(file);
  }
  fclose(file);
  endtime(SIZE);
}


int main()
{
  test_write();
  test_fprintf();
  test_flush();
}

Compiling and running I get 0.2MB/s for write() and 6MB/s for fprintf(). A large difference. There's three tests in this example, the third test uses fprintf() and fflush(). This is equivalent to write() both in performance and in functionality. Which leads to the suggestion that fprintf() (and other buffering I/O functions) are the fastest way of writing to files, and that fflush() should be used to enforce synchronisation of the file contents.