## Sum with SIMD

### 10/07/2023

### 5 minutes

One of my projects requires to calculate an average over about a hundred integer values. These values are available in an array, I used the Sum extension method from Linq to calculate the sum of them.

I have learned two things about this method:

With the current version of .NET, it does seem to use vectorization to calculate the sum of items.

It uses

`checked`

arithmetic, meaning that it throws an exception in case of an overflow.

This is the (optimized) assembly code generated by the JIT for the `Sum`

method:

G_M000_IG03: ;; offset=0020H mov r9d, r8d add eax, dword ptr [rdx+4*r9] jo SHORT G_M000_IG05 inc r8d cmp r8d, ecx jl SHORT G_M000_IG03

In this post I am searching for optimization opportunities using Single instruction, multiple data (SIMD) with .NET 8 Preview 4.

## SIMD

I plan to focus on summing items using SIMD. I will create two implementations, one using `checked`

and one using `unchecked`

semantics.

### Sum with unchecked operations

This option is easier, as when using SIMD, operations overflow irrespective of the usage of the `checked`

keyword in C#.

Note that the below code samples are for demonstration purposes only, do not use them in production environment.

public int SimdOverflow() { var vResult = Vector<int>.Zero; var vLength = Vector<int>.Count; int index = 0; while (index + vLength <= _source.Length) { vResult += Vector.LoadUnsafe<int>(ref _source[index]); index += vLength; } Span<int> buffer = stackalloc int[vLength]; _source.AsSpan(index).CopyTo(buffer); vResult += Vector.LoadUnsafe<int>(ref buffer[0]); vResult.CopyTo(buffer); return Vector.Sum(vResult); }

As one could expect, the code iterates over the input. Loads the next batch of integers into a `Vector<int>`

type and sums up the values in a local variable. After the very last iteration, there might be a few items left in the collection that would not make up a full vector. In this case the vector might contain uninitialized elements. To overcome this the last iteration writes the elements to a temporary *buffer* initialized with zeros, which is later used to create the last vector to be summed.Finally, the vector elements are summed up with the `Sum`

method. In this case this `Sum`

method is a static method implemented by `Vector`

, not the one provided by Linq.

### Sum with Checked semantics

In this second case, overflow should be prevented. The overall idea behind this method is remarkably similar to the previous with two differences:

An added

`CheckOverflow()`

method call before the add operations.At the end, this method uses a

`checked`

block to sum up the remaining items as well as the elements of the final vector.

public int SimdNoOverflow() { var vResult = Vector<int>.Zero; var vLength = Vector<int>.Count; int index = 0; while (index + vLength <= _source.Length) { var s = Vector.LoadUnsafe<int>(ref _source[index]); CheckOverflow(vResult, s); vResult += s; index += vLength; } Span<int> buffer = stackalloc int[vLength]; vResult.CopyTo(buffer); int result = 0; checked { for (int i = 0; i < vLength; i++) result += buffer[i]; for (int i = index; i < _source.Length; i++) result += _source[index]; return result; } } private readonly Vector<int> MaxValues = new Vector<int>(int.MaxValue); private readonly Vector<int> MinValues = new Vector<int>(int.MinValue); private void CheckOverflow(Vector<int> m1, Vector<int> m2) { // m2 > 0 && m1 > Max - m2 // m2 < 0 && m1 < Min - m2 var mask = Vector.GreaterThan(m2, Vector<int>.Zero); var overflow = Vector.GreaterThan(m1, MaxValues - m2) & mask; mask = Vector.Negate(mask) ^ Vector<int>.One; overflow += Vector.GreaterThan(MinValues - m2, m1) & mask; if (overflow != Vector<int>.Zero) ThrowOverflow(); }

The `CheckOverflow()`

method throws an exception when an overflow is detected. For each element it checks, if the two operands are summed, the operation can be done safely without overshooting the Max or Min values of `int`

:

m2 > 0 && m1 > Max - m2

m2 < 0 && m1 < Min - m2

Internally it is using a mask to filter the positive and negative elements of the source vector.

### Benchmarks

I have used BenchmarkDotNet to evaluate the performance of the methods on an input of an array with 100 items. The test machine uses AVX 256 registers, which means it can store 8 integers in a `Vector<int>`

type.

| Method | Mean | Error | StdDev | |--------------- |----------:|----------:|----------:| | Linq | 41.671 ns | 0.5031 ns | 0.4201 ns | | SimdOverflow | 7.107 ns | 0.1111 ns | 0.1039 ns | | SimdNoOverflow | 20.887 ns | 0.2893 ns | 0.2706 ns |

While the overflow version of the method gains x6 times performance boost which is good, given the extra work required to vectorize the input.However, with the overflow checking, a significant performance is lost both by summing the non-vectorizable part of the input, as well as the additional operations to run to check for overflows.

This code certainly can be further optimized, it shows that one may get started with SIMD optimizations easily for the sum operation.