Micro benchmarking: finding the max value

I have seen a Pluralsight course recently on performance by Sasha Goldstein. At one point, he mentioned how he had used processor pipelining to find a maximum number of an array of integers even faster the than the regular for/while loop. This got me thinking, can we get the presented solution even faster? (the catch: he also presented some other features of the clr/framework which will help with this)

So how can get even faster (in certain situations) finding a maximum integer in an unordered array?

In this investigation, I am going to use micro-benchmarking, which is certainly a way of measuring one or a couple of instructions, but I find the results difficult to generalize: things being measured are correct, but small changes to the code or the environment can have significant impact on the results.

In my tests I will use an array of 2MB, which easily fits into the processor cache of my machine. The array is filled with integers [0..10000). It is important to have it the whole array fit in the processor cache, so we can avoid measuring the latency for bringing a chunk of data from memory to the processor.

I will use (optimized) Release mode builds, and try to avoid GC to kick in (by using keepalive). I repeat tests 7 times, and remove the best and worst results. From the rest of the measurements an average is calculated.

Math.Max

Let's start with the regular solution, we use Math.Max to find the largest element with a for loop:

for(int i = 0; i < array.Length; ++i)
{
  max = Math.Max(max, array[i]);
}

mathmax

Measurements show that in average this loop takes 3876 ticks. Note, this the regular loop, with a simple Max inside. We shall easily find better solutions than this.

Linq

In the next approach, we will use Linq to solve the problem in the "most readable" way.

max = array.Max();

Clearly this solution needs the least amount of code. On this other side, this approach seems to be even worse with a couple of folds, the avarege takes 37120,6 ticks. Linq is not this slow generally, it can do very nice optimizations and achieve nearly the speed of conventional use-cases, but it isn't the case this time.

linq

The Regular

Let's see one more regular approach. But this time, instead of using linq, or Math.Max method, we will simply find the max with an if statement. The benefits of this: no inlinening needed from the JIT compiler's side or there is no method call happening.

for(int i = 0; i < array.Length; i++)
{
  if(array[i] > max)
    max = array[i];
}

The average of this solutions show 3958,2 ticks, which is very similar result to the Math.Max approach.

regular

The Pipelined

In the following solution, we will try to utilize processor's instruction pipelining. In this case we will use the regular if branches to find the max value, but not only one, but two of them inside one loop iteration. The main criteria is to avoid dependency between the instructions. For this reason, I introduce a maxB variable next to the the regular max variable.

for(int i = 0; i < array.Length; i+=2)
{
  if(array[i] > max)
    max = array[i];

  if(array[i + 1] > maxB)
    maxB = array[i + 1];
}
if(maxB > max)
{
  max = maxB;
}

This approach has a significant boost, it takes an average of 2953,6 ticks. But note, that as soon as we cannot fit the array in the processor's cache, this approach will slow down, due to bringing data in from the memory (currently) takes significantly longer than the benefits we win by pipelining instructions. In this measurement this solution is 25% faster to regular approach, and 92% faster to Linq.

pipeline

Vectors (Single Instruction Multiple Data)

The final solution will use a completely different approach. Instead of depending on regular approaches or the pipelining, we will use SIMD, which is another feature we have on our processors, and they are enabled to us by the .net with RyuJIT. To use it, we need to add a package System.Numerics.Vectors to our application. SIMD is used mostly for mathematical/graphical calculations. I would find it interesting to use it with real-time market data processing as well, where quantity/price attributes are processed the same way on each tick (ie. each value is multiplied with the same constant value). Vectors use a special SIMD register to process data. The size of this register can vary by processor type, for this reason, I present here a solution, which fills as many integers in the SIMD register as possible on the given processor type.

Vector maxV = new Vector(-1);
int count = Vector.Count;
for(int i = 0; i < array.Length; i += count)
{
  Vector data = new Vector(array, i);
  maxV = Vector.Max(data, maxV);
}
for(int i = 0; i < count; i++)
{
  if(maxV[i] > max)
    max = maxV[i];
}

In this case, we cannot use our regular max variable, but we use a Vector of max values, pre-initialized with a negative number (note we only have natural numbers in our test array). In each iteration, we take a vector of the array, compare it with our max vector. Each element of the vectors are compared individually. Finally, we have a separate for loop to find the max value of our vector.

vectors

This solution has a significant boost on the performance, the average takes 808,4 ticks.