Friday, October 21, 2011

Endianness

SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant byte earliest in memory (at the lowest address). Conversely, little-endian means that numbers are stored with the least significant byte earliest in memory.

Think of big-endian as writing numbers as we would normally do. For example, one thousand, one hundred and twenty would be written as 1120 using a big-endian format. Written as little-endian, however, it would be 0211 - the least significant digit would be recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed upon. For example, in networking, data is defined as being big-endian (often called network byte order). So to handle network packets, little-endian machines need to convert the data before using it.
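As a rough illustration of the byte ordering described above, here is a minimal sketch. It assumes a POSIX system where htonl() and ntohl() are available from <arpa/inet.h>; the helper names bytes_in_memory, host_is_little_endian and to_network are made up for this example.

```cpp
#include <arpa/inet.h>  // htonl, ntohl (POSIX; assumed to be available)
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy the bytes of a 32-bit value out in the
// order in which they sit in memory.
void bytes_in_memory(std::uint32_t value, unsigned char out[4])
{
  std::memcpy(out, &value, sizeof(value));
}

// Hypothetical helper: true if the least significant byte is stored first.
bool host_is_little_endian()
{
  unsigned char b[4];
  bytes_in_memory(0x01020304u, b);
  return b[0] == 0x04;
}

// Hypothetical helper: convert a host value to network (big-endian) order.
// On a big-endian machine htonl() leaves the value unchanged.
std::uint32_t to_network(std::uint32_t host_value)
{
  return htonl(host_value);
}
```

Whatever the host endianness, the bytes of to_network(0x01020304) should sit in memory as 01 02 03 04, and ntohl() converts the value back for use on the host.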

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

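template <class T>
T swapslow(T in)
{
  T out;
  char * pcin = (char*)&in;    // view the input as individual bytes
  char * pcout = (char*)&out;  // view the output as individual bytes

  for (unsigned int i=0; i<sizeof(T); i++)
  {
    // The last input byte becomes the first output byte, and so on.
    pcout[i] = pcin[sizeof(T)-1-i];
  }
  return out;
}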

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

The first thing to look at is instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for a four-byte int, I've got 10 memory operations: one store, four loads, four stores, and one load.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

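template <class T>
T swap(T in)
{
  // Intended for unsigned integer types; shifting signed values
  // may not be portable.
  T out=0;
  for (unsigned int i=0; i<sizeof(T); i++)
  {
    out<<=8;          // make room for the next byte
    out|=(in&255);    // append the lowest byte of the input
    in>>=8;           // move on to the next input byte
  }
  return out;
}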

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte (two shifts, an AND, and an OR). This is a higher instruction count than the one load and one store per byte in the first version. However, we can usually issue more logical operations per cycle, and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.
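For a fixed size, the same shift-and-mask idea can be written without the loop. The following is only a sketch for 32-bit unsigned values; the name swap32 is made up for this example, and whether it is faster than the loop will depend on the compiler and the hardware.

```cpp
#include <cstdint>

// Hypothetical fixed-size variant: reverse the four bytes of a
// 32-bit unsigned value using only masks, shifts and ORs.
std::uint32_t swap32(std::uint32_t in)
{
  return ((in & 0x000000ffu) << 24) |  // byte 0 moves to byte 3
         ((in & 0x0000ff00u) <<  8) |  // byte 1 moves to byte 2
         ((in & 0x00ff0000u) >>  8) |  // byte 2 moves to byte 1
         ((in & 0xff000000u) >> 24);   // byte 3 moves to byte 0
}
```

Using an unsigned type keeps all of the shifts well defined, and the whole conversion stays in registers, with no stores or loads of the data being swapped.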

However, you will usually see a greater performance gain from avoiding the RAW hazards. RAW hazards are hardware dependent: some processors may be engineered to avoid them, in which case you will only see the problem on particular hardware. This means that your application may run well on one machine, but poorly on another.

Differences between the various STL options on Solaris

Steve Clamage has provided a nice summary of the trade-offs between the various STL options. I'll summarise it here:

  • Default STL. Available as part of the OS, so it does not require a separate library to be shipped with the application. However, it does not support the standard.
  • -library=stlport4. Much better conformance with the standard, but no internationalisation. Must be distributed with applications that use it.
  • -library=stdcxx4 (Apache). Complete implementation of the standard. Available on S10U10 and onwards.

I'd also add that stlport4 and stdcxx4 typically have much better performance than the default library.

The other point that bears repetition is that you can only include one STL per application. So you cannot use different implementations for different libraries or for the application.

Sunday, October 2, 2011

Best practices for developing top-performing C/C++ Applications

I'll be presenting at Oracle Open World tomorrow. The title of the presentation is "Best practices for developing top-performing C/C++ applications". The presentation is at 11:00am in Golden Gate C1 at the Marriott Marquis.