Tuesday, February 9, 2021

Featured in TheRegister

I was surprised to find that my Oracle blog post were mentioned in TheRegister. Apparently Oracle has restored the content. The original content appears to still be there, but the linked images etc have disappeared - so the new content should probably be described as "Remastered" !

Tuesday, December 29, 2020

Bit masking on x86

In theory it's pretty easy to generate a bitmask for a range of bits:

unsigned long long mask(int bits) {
  return mask = (1ull << bits) - 1;
}

You need to specify that the 1 being shifted is an unsigned long long otherwise it gets treated as a 32-bit int, and the code only works for the range 0..31.

However, this code fails to work on x86. For example:

#include <math.h>
#include <stdio.h>

unsigned long long shift(int x) {
    return (1ull << x) - 1;
}

int main() {
    printf("Value %0llx\n", shift(64));
}

This returns the value 0 for the result of shifting by 64 when run on x86.

The reason for this can be found in the Intel docs (Vol. 2B 4-583):

The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used). The count range is limited to 0 to 31 (or 63 if 64-bit mode and REX.W is used).

The result of this is that the one is shifted by zero - ie remains unchanged - and subtracting 1 from that produces the value zero.

Unfortunately, this means we need a more complex bit of code that handles shifts of greater than 64 correctly:

unsigned long long mask(int bits) {
  return mask = (bits >= 64 ? -1 : (1ull << bits) - 1);
}

Wednesday, February 12, 2020

TCMalloc - fast, hugepage aware, memory allocator

I'm thrilled that we've been able to release a revised version of TCMalloc.

In it's new default mode it holds per virtual CPU caches so that a thread running on a CPU can have contention free allocation and deallocation. This relies on Restartable Sequences - a Linux feature that ensures a region of code either completes without interruption by the OS, or gets restarted if there is an interruption.

The other compelling feature is that it is hugepage aware. It ultimately manages memory in hugepage chunks so that it can pack data onto as few hugepages as possible.

Monday, October 12, 2015

Memory error detection (new series)

Just before I left Oracle I sat down with my good friend Raj and we wrote up a bundle of stuff on Memory Error Detection. He's started posting it on his blog.

Memory errors are one of the common types of application error - accessing past the end of arrays, or past the end of a block of allocated memory, reading memory that has not yet been initialised, and so on. What makes them particularly nasty is that they are hard to detect, and the effect of the memory access error can be arbitrarily far from the place where the error occurs. So it's useful to have tools which help you find them.....

Tuesday, February 17, 2015

Profiling the kernel

One of the incredibly useful features in Studio is the ability to profile the kernel. The tool to do this is er_kernel. It's based around dtrace, so you either need to run it with escalated privileges, or you need to edit /etc/user_attr to add something like:

<username>::::defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel

The correct way to modify user_attr is with the command usermod:

usermod -K defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel <username>

There's two ways to run er_kernel. The default mode is to just profile the kernel:

$ er_kernel sleep 10
....
Creating experiment database ktest.1.er (Process ID: 7399) ...
....
$ er_print -limit 10 -func ktest.1.er
Functions sorted by metric: Exclusive Kernel CPU Time

Excl.     Incl.      Name
Kernel    Kernel
CPU sec.  CPU sec.
19.242    19.242     <Total>
14.869    14.869     <l_PID_7398>
 0.687     0.949     default_mutex_lock_delay
 0.263     0.263     mutex_enter
 0.202     0.202     <java_PID_248>
 0.162     0.162     gettick
 0.141     0.141     hv_ldc_tx_set_qtail
...

The we passed the command sleep 10 to er_kernel, this causes it to profile for 10 seconds. It might be better form to use the equivalent command line option -t 10.

In the profile we can see a couple of user processes together with some kernel activity. The other way to run er_kernel is to profile the kernel and user processes. We enable this mode with the command line option -F on:

$ er_kernel -F on sleep 10
...
Creating experiment database ktest.2.er (Process ID: 7630) ...
...
$ er_print -limit 5 -func ktest.2.er
Functions sorted by metric: Exclusive Total CPU Time

Excl.     Incl.     Excl.     Incl.      Name
Total     Total     Kernel    Kernel
CPU sec.  CPU sec.  CPU sec.  CPU sec.
15.384    15.384    16.333    16.333     <Total>
15.061    15.061     0.        0.        main
 0.061     0.061     0.        0.        ioctl
 0.051     0.141     0.        0.        dt_consume_cpu
 0.040     0.040     0.        0.        __nanosleep
...

In this case we can see all the userland activity as well as kernel activity. The -F option is very flexible, instead of just profiling everything, we can use -F =<regexp>syntax to specify either a PID or process name to profile:

$ er_kernel -F =7398

Thursday, February 5, 2015

Digging into microstate accounting

Solaris has support for microstate accounting. This gives huge insight into where an application and its threads are spending their time. It breaks down time into the (obvious) user and system, but also allows you to see the time spent waiting on page faults and other useful-to-know states.

This level of detail is available through the usage file in /proc/pid, there's a corresponding file for each lwp in /proc/pid/lwp/lwpid/lwpusage. You can find more details about the /proc file system in documentation, or reading my recent article about tracking memory use.

Here's an example of using it to report idle time, ie time when the process wasn't busy:

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>
#include <fcntl.h>
#include <procfs.h>

void busy()
{
  for (int i=0; i<100000; i++)
  {
   double d = i;
   while (d>0) { d=d *0.5; }
  }
}

void lazy()
{
  sleep(10);
}

double convert(timestruc_t ts)
{
  return ts.tv_sec + ts.tv_nsec/1000000000.0;
}

void report_idle()
{
  prusage_t prusage;
  int fd;
  fd = open( "/proc/self/usage", O_RDONLY);
  if (fd == -1) { return; }
  read( fd, &prusage, sizeof(prusage) );
  close( fd );
  printf("Idle percent = %3.2f\n",
  100.0*(1.0 - (convert(prusage.pr_utime) + convert(prusage.pr_stime))
               /convert(prusage.pr_rtime) ) );
}

void main()
{
  report_idle();
  busy();
  report_idle();
  lazy();
  report_idle();
}

The code has two functions that take time. The first does some redundant FP computation (that cannot be optimised out unless you tell the compiler to do FP optimisations), this part of the code is CPU bound. When run the program reports low idle time for this section of the code. The second routine calls sleep(), so the program is idle at this point waiting for the sleep time to expire, hence this section is reported as being high idle time.