Performance

Contents ๐Ÿ”—︎

Performance ๐Ÿ”—︎

How do we increase a program’s performance?

We would like to get the same amount of work done, in less time.

There are a few techniques that we can use to do this. We will discuss some here.

Loop Unrolling ๐Ÿ”—︎

Loop unrolling refers to increasing the number of instructions completed per loop.

For example, take a regular for-loop expression.

for (int i = 0; i < 4; i++) {
    A[i] = B[i] + C[i];
}

Now, consider using the loop unrolling technique.

for (int i = 0; i < 4; i+=2) {
    A[i] = B[i] + C[i];
    A[i+1] = B[i+1] + C[i+1];
}

Consider the unrolled for loop. For each loop we need to add some things, increment i, do a comparison and keep going.

When the loop is ‘rolled’ the increment of i and the comparison in the for loop occur less often since we are doing more work in each loop. This leads to less overhead, and we are able to complete the instructions more quickly.

This is a very simple example though, and unrolling may not always have a positive effect.

There is a short article by CS Professor Daniel Lemire on this topic here .

An interesting read is also this GitHub issue “cmd/compile: add unrolling stage for automatic loop unrolling #51302 ” for golang, where folks are discussing adding loop unrolling into Golang.

Multiple Accumulators ๐Ÿ”—︎

Consider another simple example.

sum = 0;
for (int i = 0; i < N; i++) {
    sum += A[i];
}

To speed this up, we can use multiple accumulators.

sum1 = sum2 = 0;
for (int i = 0; i < N; i+=2) {
    sum1 += A[i];
    sum2 += A[i+1];
}
sum = sum1 + sum2;

Each time we do a loop, there is some overhead. We need to increment i, check the if case etc. If we can do more operations per loop, then we can speed up our program. Similar to loop unrolling where we do less loops by unrolling the loop, now we try to do more operations per loop.

An Example ๐Ÿ”—︎

When could we actually use multiple accumulators?

Consider the following example, where we calculate the average age of users by unpacking a uint64 with 8 uint8 integers.

// packing the uint64
packedAges := make([]uint64, (len(userLines)+7)/8)
for i, line := range userLines {
  age, _ := strconv.Atoi(line[2])
  // do the packing
  packedAges[i/8] |= uint64(age) << ((i % 8) * 8)
}

And then unpacking using multiple accumulators.

type UserData struct {
 numAges    int
 packedAges []uint64
 payments   []uint32
}

func AverageAge(users UserData) float64 {
 var sum0, sum1, sum2, sum3, sum4, sum5, sum6, sum7 uint32
 for _, packed := range users.packedAges {
  sum0 += uint32((packed >> 0) & 0xFF)
  sum1 += uint32((packed >> 8) & 0xFF)
  sum2 += uint32((packed >> 16) & 0xFF)
  sum3 += uint32((packed >> 24) & 0xFF)
  sum4 += uint32((packed >> 32) & 0xFF)
  sum5 += uint32((packed >> 40) & 0xFF)
  sum6 += uint32((packed >> 48) & 0xFF)
  sum7 += uint32((packed >> 56) & 0xFF)
 }
 return float64(sum0+sum1+sum2+sum3+sum4+sum5+sum6+sum7) / float64(users.numAges)
}

Clearly the code is not very nice to read, but it does have some performance benefits.

Understanding your data is also key here. Using a uint8 is possible because we know that the maximum size of our data will fit in this range.

SIMD ๐Ÿ”—︎

One of the fastest ways to get speedup for your program is to use what is called “SIMD” (Single Instruction, Multiple Data). These instructions allow us to conduct operations on multiple data with just a single instruction. If you’ve heard of things like ‘vectorisation ’ this concept is quite similar.

An interesting article here on the stack overflow blog, about multiple accumulators and SIMD (Single Input, Multiple Data) instructions.

OpenBLAS is C library that utilises these SIMD instructions. Pandas/Numpy and other similar libraries use this to enable significant performance increases.