Sum with SIMD

One of my projects requires to calculate an average over about a hundred integer values. These values are available in an array, I used the Sum extension method from Linq to calculate the sum of them.

I have learned two things about this method:

  • With the current version of .NET, it does seem to use vectorization to calculate the sum of items.

  • It uses checked arithmetic, meaning that it throws an exception in case of an overflow.

This is the (optimized) assembly code generated by the JIT for the Sum method:

G_M000_IG03:                ;; offset=0020H
       mov      r9d, r8d
       add      eax, dword ptr [rdx+4*r9]
       jo       SHORT G_M000_IG05
       inc      r8d
       cmp      r8d, ecx
       jl       SHORT G_M000_IG03

In this post I am searching for optimization opportunities using Single instruction, multiple data (SIMD) with .NET 8 Preview 4.

SIMD

I plan to focus on summing items using SIMD. I will create two implementations, one using checked and one using unchecked semantics.

Sum with unchecked operations

This option is easier, as when using SIMD, operations overflow irrespective of the usage of the checked keyword in C#.

Note that the below code samples are for demonstration purposes only, do not use them in production environment.

public int SimdOverflow()
{
    var vResult = Vector<int>.Zero;
    var vLength = Vector<int>.Count;
    int index = 0;
    while (index + vLength <= _source.Length)
    {
        vResult += Vector.LoadUnsafe<int>(ref _source[index]);
        index += vLength;
    }

    Span<int> buffer = stackalloc int[vLength];
    _source.AsSpan(index).CopyTo(buffer);
    vResult += Vector.LoadUnsafe<int>(ref buffer[0]);
    vResult.CopyTo(buffer);
    return Vector.Sum(vResult);
}

As one could expect, the code iterates over the input. Loads the next batch of integers into a Vector<int> type and sums up the values in a local variable. After the very last iteration, there might be a few items left in the collection that would not make up a full vector. In this case the vector might contain uninitialized elements. To overcome this the last iteration writes the elements to a temporary buffer initialized with zeros, which is later used to create the last vector to be summed.Finally, the vector elements are summed up with the Sum method. In this case this Sum method is a static method implemented by Vector, not the one provided by Linq.

Sum with Checked semantics

In this second case, overflow should be prevented. The overall idea behind this method is remarkably similar to the previous with two differences:

  • An added CheckOverflow() method call before the add operations.

  • At the end, this method uses a checked block to sum up the remaining items as well as the elements of the final vector.

 public int SimdNoOverflow()
{
    var vResult = Vector<int>.Zero;
    var vLength = Vector<int>.Count;
    int index = 0;
    while (index + vLength <= _source.Length)
    {
        var s = Vector.LoadUnsafe<int>(ref _source[index]);
        CheckOverflow(vResult, s);
        vResult += s;
        index += vLength;
    }

    Span<int> buffer = stackalloc int[vLength];
    vResult.CopyTo(buffer);
    int result = 0;
    checked
    {
        for (int i = 0; i < vLength; i++)
            result += buffer[i];
        for (int i = index; i < _source.Length; i++)
            result += _source[index];
        return result;
    }
}

private readonly Vector<int> MaxValues = new Vector<int>(int.MaxValue);
private readonly Vector<int> MinValues = new Vector<int>(int.MinValue);

private void CheckOverflow(Vector<int> m1, Vector<int> m2)
{
    // m2 > 0 && m1 > Max - m2
    // m2 < 0 && m1 < Min - m2
    var mask = Vector.GreaterThan(m2, Vector<int>.Zero);
    var overflow = Vector.GreaterThan(m1, MaxValues - m2) & mask;
    mask = Vector.Negate(mask) ^ Vector<int>.One;
    overflow += Vector.GreaterThan(MinValues - m2, m1) & mask;
    if (overflow != Vector<int>.Zero)
        ThrowOverflow();
}

The CheckOverflow() method throws an exception when an overflow is detected. For each element it checks, if the two operands are summed, the operation can be done safely without overshooting the Max or Min values of int:

  • m2 > 0 && m1 > Max - m2

  • m2 < 0 && m1 < Min - m2

Internally it is using a mask to filter the positive and negative elements of the source vector.

Benchmarks

I have used BenchmarkDotNet to evaluate the performance of the methods on an input of an array with 100 items. The test machine uses AVX 256 registers, which means it can store 8 integers in a Vector<int> type.

|         Method |      Mean |     Error |    StdDev |
|--------------- |----------:|----------:|----------:|
|           Linq | 41.671 ns | 0.5031 ns | 0.4201 ns |
|   SimdOverflow |  7.107 ns | 0.1111 ns | 0.1039 ns |
| SimdNoOverflow | 20.887 ns | 0.2893 ns | 0.2706 ns |

While the overflow version of the method gains x6 times performance boost which is good, given the extra work required to vectorize the input.However, with the overflow checking, a significant performance is lost both by summing the non-vectorizable part of the input, as well as the additional operations to run to check for overflows.

This code certainly can be further optimized, it shows that one may get started with SIMD optimizations easily for the sum operation.