Monday, April 1, 2013

OpenMP and language level parallelisation

The C11 and C++11 standards introduced some very useful features into the language. In particular they provided language-level access to threading and synchronisation primitives. So using the new standards we can write multithreaded code that compiles and runs on standard compliant platforms. I've tackled translating Windows and POSIX threads before, but not having to use a shim is fantastic news.

There's some ideas afoot to do something similar for higher level parallelism. I have a proposal for consideration at the April meetings - leveraging the existing OpenMP infrastructure.

Pretty much all compilers use OpenMP, a large chunk of shared memory parallel programs are written using OpenMP. So, to me, it seems a good idea to leverage the existing OpenMP library code, and existing developer knowledge. The paper is not arguing that we need take the OpenMP syntax - that is something that can be altered to fit the requirements of the language.

What do you think?

Thursday, March 14, 2013

The pains of preprocessing

Ok, so I've encountered this twice in 24 hours. So it's probably worth talking about it.

The preprocessor does a simple text substitution as it works its way through your source files. Sometimes this has "unanticipated" side-effects. When this happens, you'll normally get a "hey, this makes no sense at all" error from the compiler. Here's an example:

$ more c.c
#include <ucontext.h>
#include <stdio.h>

int main()
{
  int FS;
  FS=0;
  printf("FS=%i",FS);
}

$ CC c.c
$ CC c.c
"c.c", line 6: Error: Badly formed expression.
"c.c", line 7: Error: The left operand must be an lvalue.
2 Error(s) detected.

A similar thing happens with g++:

$  /pkg/gnu/bin/g++ c.c
c.c: In function 'int main()':
c.c:6:7: error: expected unqualified-id before numeric constant
c.c:7:6: error: lvalue required as left operand of assignment

The Studio C compiler gives a bit more of a clue what is going on. But it's not something you can rely on:

$ cc c.c
"c.c", line 6: syntax error before or at: 1
"c.c", line 7: left operand must be modifiable lvalue: op "="

As you can guess the issue is that FS gets substituted. We can find out what happens by examining the preprocessed source:

$ CC -P c.c
$ tail c.i
int main ( )
{
int 1 ;
1 = 0 ;
printf ( "FS=%i" , 1 ) ;
}

You can confirm this using -xdumpmacros to dump out the macros as they are defined. You can combine this with -H to see which header files are included:

$ CC -xdumpmacros c.c 2>&1 |grep FS
#define _SYS_ISA_DEFS_H
#define _FILE_OFFSET_BITS 32
#define REG_FSBASE 26
#define REG_FS 22
#define FS 1
....

If you're using gcc you should use the -E option to get preprocessed source, and the -dD option to get definitions of macros and the include files.

Wednesday, December 12, 2012

Compiling for T4

I've recently had quite a few queries about compiling for T4 based systems. So it's probably a good time to review what I consider to be the best practices.

  • Always use the latest compiler. Being in the compiler team, this is bound to be something I'd recommend :) But the serious points are that (a) Every release the tools get better and better, so you are going to be much more effective using the latest release (b) Every release we improve the generated code, so you will see things get better (c) Old releases cannot know about new hardware.
  • Always use optimisation. You should use at least -O to get some amount of optimisation. -xO4 is typically even better as this will add within-file inlining.
  • Always generate debug information, using -g. This allows the tools to attribute information to lines of source. This is particularly important when profiling an application.
  • The default target of -xtarget=generic is often sufficient. This setting is designed to produce a binary that runs well across all supported platforms. If the binary is going to be deployed on only a subset of architectures, then it is possible to produce a binary that only uses the instructions supported on these architectures, which may lead to some performance gains. I've previously discussed which chips support which architectures, and I'd recommend that you take a look at the chart that goes with the discussion.
  • Crossfile optimisation (-xipo) can be very useful - particularly when the hot source code is distributed across multiple source files. If you're allowed to have something as geeky as favourite compiler optimisations, then this is mine!
  • Profile feedback (-xprofile=[collect: | use:]) will help the compiler make the best code layout decisions, and is particularly effective with crossfile optimisations. But what makes this optimisation really useful is that codes that are dominated by branch instructions don't typically improve much with "traditional" compiler optimisation, but often do respond well to being built with profile feedback.
  • The macro flag -fast aims to provide a one-stop "give me a fast application" flag. This usually gives a best performing binary, but with a few caveats. It assumes the build platform is also the deployment platform, it enables floating point optimisations, and it makes some relatively weak assumptions about pointer aliasing. It's worth investigating.
  • SPARC64 processor, T3, and T4 implement floating point multiply accumulate instructions. These can substantially improve floating point performance. To generate them the compiler needs the flag -fma=fused and also needs an architecture that supports the instruction (at least -xarch=sparcfmaf).
  • The most critical advise is that anyone doing performance work should profile their application. I cannot overstate how important it is to look at where the time is going in order to determine what can be done to improve it.

I also presented at Oracle OpenWorld on this topic, so it might be helpful to review those slides.

Wednesday, December 5, 2012

Library order is important

I've written quite extensively about link ordering issues, but I've not discussed the interaction between archive libraries and shared libraries. So let's take a simple program that calls a maths library function:

#include <math.h>

int main()
{
  for (int i=0; i<10000000; i++)
  {
    sin(i);
  }
}

We compile and run it to get the following performance:

bash-3.2$ cc -g -O fp.c -lm
bash-3.2$ timex ./a.out

real           6.06
user           6.04
sys            0.01

Now most people will have heard of the optimised maths library which is added by the flag -xlibmopt. This contains optimised versions of key mathematical functions, in this instance, using the library doubles performance:

bash-3.2$ cc -g -O -xlibmopt fp.c -lm
bash-3.2$ timex ./a.out

real           2.70
user           2.69
sys            0.00

The optimised maths library is provided as an archive library (libmopt.a), and the driver adds it to the link line just before the maths library - this causes the linker to pick the definitions provided by the static library in preference to those provided by libm. We can see the processing by asking the compiler to print out the link line:

bash-3.2$ cc -### -g -O -xlibmopt fp.c -lm
/usr/ccs/bin/ld ... fp.o -lmopt -lm -o a.out...

The flag to the linker is -lmopt, and this is placed before the -lm flag. So what happens when the -lm flag is in the wrong place on the command line:

bash-3.2$ cc -g -O -xlibmopt -lm fp.c
bash-3.2$ timex ./a.out

real           6.02
user           6.01
sys            0.01

If the -lm flag is before the source file (or object file for that matter), we get the slower performance from the system maths library. Why's that? If we look at the link line we can see the following ordering:

/usr/ccs/bin/ld ... -lmopt -lm fp.o -o a.out 

So the optimised maths library is still placed before the system maths library, but the object file is placed afterwards. This would be ok if the optimised maths library were a shared library, but it is not - instead it's an archive library, and archive library processing is different - as described in the linker and library guide:

"The link-editor searches an archive only to resolve undefined or tentative external references that have previously been encountered."

An archive library can only be used resolve symbols that are outstanding at that point in the link processing. When fp.o is placed before the libmopt.a archive library, then the linker has an unresolved symbol defined in fp.o, and it will search the archive library to resolve that symbol. If the archive library is placed before fp.o then there are no unresolved symbols at that point, and so the linker doesn't need to use the archive library. This is why libmopt needs to be placed after the object files on the link line.

On the other hand if the linker has observed any shared libraries, then at any point these are checked for any unresolved symbols. The consequence of this is that once the linker "sees" libm it will resolve any symbols it can to that library, and it will not check the archive library to resolve them. This is why libmopt needs to be placed before libm on the link line.

This leads to the following order for placing files on the link line:

  • Object files
  • Archive libraries
  • Shared libraries

If you use this order, then things will consistently get resolved to the archive libraries rather than to the shared libaries.

Tuesday, December 4, 2012

It could be worse....

As "guest" pointed out, in my file I/O test I didn't open the file with O_SYNC, so in fact the time was spent in OS code rather than in disk I/O. It's a straightforward change to add O_SYNC to the open() call, but it's also useful to reduce the iteration count - since the cost per write is much higher:

...
#define SIZE 1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT|O_SYNC,S_IWGRP|S_IWOTH|S_IWUSR);
...

Running this gave the following results:

Time per iteration   0.000065606310 MB/s
Time per iteration   2.709711563906 MB/s
Time per iteration   0.178590114758 MB/s

Yup, disk I/O is way slower than the original I/O calls. However, it's not a very fair comparison since disks get written in large blocks of data and we're deliberately sending a single byte. A fairer result would be to look at the I/O operations per second; which is about 65 - pretty much what I'd expect for this system.

It's also interesting to examine at the profiles for the two cases. When the write() was trapping into the OS the profile indicated that all the time was being spent in system. When the data was being written to disk, the time got attributed to sleep. This gives us an indication how to interpret profiles from apps doing I/O. It's the sleep time that indicates disk activity.

Write and fprintf for file I/O

fprintf() does buffered I/O, where as write() does unbuffered I/O. So once the write() completes, the data is in the file, whereas, for fprintf() it may take a while for the file to get updated to reflect the output. This results in a significant performance difference - the write works at disk speed. The following is a program to test this:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>

static double s_time;

void starttime()
{
  s_time=1.0*gethrtime();
}

void endtime(long its)
{
  double e_time=1.0*gethrtime();
  printf("Time per iteration %5.2f MB/s\n", (1.0*its)/(e_time-s_time*1.0)*1000);
  s_time=1.0*gethrtime();
}

#define SIZE 10*1024*1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT,S_IWGRP|S_IWOTH|S_IWUSR);
  for (int i=0; i<SIZE; i++)
  {
    write(file,"a",1);
  }
  close(file);
  endtime(SIZE);
}

void test_fprintf()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
  }
  fclose(file);
  endtime(SIZE);
}

void test_flush()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
    fflush(file);
  }
  fclose(file);
  endtime(SIZE);
}


int main()
{
  test_write();
  test_fprintf();
  test_flush();
}

Compiling and running I get 0.2MB/s for write() and 6MB/s for fprintf(). A large difference. There's three tests in this example, the third test uses fprintf() and fflush(). This is equivalent to write() both in performance and in functionality. Which leads to the suggestion that fprintf() (and other buffering I/O functions) are the fastest way of writing to files, and that fflush() should be used to enforce synchronisation of the file contents.

Thursday, October 18, 2012

Mixing Java and native code

This was a bit of surprise to me. The slides are available from my presentation at JavaOne on mixed language development. What I wasn't expecting was that there would also be a video of the presentation.