First look at SIMD in .NET

Recently I have been reading about the performance of .NET applications, and I ran into the following post explaining how the processor and its caches affect our code.

Processor pipelining shifted my interest to Single Instruction Multiple Data (SIMD). SIMD support was added to the .NET ecosystem with RyuJIT, as the following blog post details.

There are a couple of mathematical and graphical use-cases described where this technology is applicable, but I still see a great opportunity ahead wherever massive amounts of real-time numerical data need to be processed (and we are not willing to use a GPU, for instance). Out of curiosity I created a couple of examples and compared the performance of these use-cases to regular code.

Note that the following examples use Vector<T>, which is only available through a NuGet package at the time of writing. Only the regular Vector2/3/4 types are available in the full framework.

Let's have arrays of doubles:

double[] items = new double[64 * 1024 * 1024];
Vector<double>[] vectors = new Vector<double>[16 * 1024 * 1024];

In my example I will be using 4 doubles per vector, as the machine used for the tests has 256-bit SIMD registers. The following expression returns the number of items of a given type I can store in a vector:

Vector<double>.Count

With this in mind, I have chosen the vectors array to be 4 times 'smaller', but it eventually contains the same number of doubles.
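A quick way to verify this on your own machine is a small sketch like the one below (the printed values are assumptions that depend on your CPU and JIT):

```csharp
using System;
using System.Numerics;

// On a CPU with 256-bit AVX registers Vector<double>.Count is 4;
// on an older 128-bit SSE2 machine it would be 2.
Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");
Console.WriteLine($"Doubles per vector:   {Vector<double>.Count}");
```

If IsHardwareAccelerated is false, the Vector<T> operations fall back to scalar code and none of the speedups below apply.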

In my initial approach I tried to compare the operations by creating Vectors from the double array on the fly. This comparison produced misleading results: if we need to copy the doubles back from the Vector to the array, we incur a huge amount of extra memory copying, which takes a significant amount of time.
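A minimal sketch of that round-trip pattern (array sizes here are illustrative, not the ones from the benchmark) shows why: two copies bracket a single SIMD addition, so memory traffic dominates.

```csharp
using System.Numerics;

double[] items = new double[1024];
var one = Vector<double>.One;
int width = Vector<double>.Count; // 4 on a 256-bit machine

for (int i = 0; i < items.Length; i += width)
{
    var v = new Vector<double>(items, i); // copy from the array into a register
    v += one;                             // one SIMD addition
    v.CopyTo(items, i);                   // copy the result back to the array
}
```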

With the following loop I iterate over the vector array and add a vector of ones (v1) to each element, which increments every double. I repeated the test, and the average run took 218 ms.

var v1 = Vector<double>.One; // a vector of four 1.0 values
for(int i = 0; i < vectors.Length; i += 1)
{
  vectors[i] += v1;
}

While incrementing each double one by one takes an average of 220 ms:

for(int i = 0; i < items.Length; i += 1)
{
  items[i]++;
}

It seems the performance difference is less than a couple of milliseconds. The dominant cost of both loops appears to be moving data from memory into the CPU caches and registers.

To prove this point, let's repeat the test with a much smaller array, one that can fit in the L1 cache. With doubles (64 bits each) and 32 KB of L1 cache, the loop below takes 52 ms to execute. Note that I am doing the same number of operations as above, but this time I iterate 16384 times over a chunk of memory that can plausibly fit in L1.

for(int j = 0; j < 16384; j += 1)
{
  for(int i = 0; i < 4096; i += 1)
  {
    items[i]++;
  }
}

Repeating the same with SIMD and vectors takes 20 ms on average. That is already much better performance.

for(int j = 0; j < 16384; j += 1)
{
  for(int i = 0; i < 1024; i += 1)
  {
    vectors[i] += v1;
  }
}
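For reference, the two cache-resident loops can be timed with a minimal Stopwatch harness along these lines (a sketch, not the exact benchmark code; the numbers will vary by machine and should ideally be averaged over warmed-up runs):

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

// 4096 doubles = 32 KB, sized to fit in a typical L1 data cache;
// 1024 Vector<double> elements hold the same 4096 doubles.
double[] items = new double[4096];
var vectors = new Vector<double>[1024];
var v1 = Vector<double>.One;

var sw = Stopwatch.StartNew();
for (int j = 0; j < 16384; j += 1)
    for (int i = 0; i < 4096; i += 1)
        items[i]++;
sw.Stop();
Console.WriteLine($"scalar: {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int j = 0; j < 16384; j += 1)
    for (int i = 0; i < 1024; i += 1)
        vectors[i] += v1;
sw.Stop();
Console.WriteLine($"SIMD:   {sw.ElapsedMilliseconds} ms");
```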

Let's take this to the next step and repeat the test using only a single pair of vectors or a single pair of doubles. We are still using the arrays defined at the beginning, but only touch the first two elements. The following non-SIMD code takes 238 ms to run.

for(int i = 0; i < items.Length; i += 1)
{
  items[0]++;
  items[1]++;
}

With SIMD, the code below takes 80 ms, which is about 3 times faster. Note that in the SIMD case I only iterate up to vectors.Length, but the number of double additions is the same as in the non-SIMD case (non-SIMD: 64*1024*1024*2 = 134217728; SIMD: 16*1024*1024*2*4 = 134217728).

for(int i = 0; i < vectors.Length; i += 1)
{
  vectors[0] += v1;
  vectors[1] += v1;
}

Let's further reduce the effect of memory access and replace the indexed arrays with local variables. The non-SIMD solution takes 76 ms, while the SIMD solution below, performing the same number of additions on doubles, takes 28 ms.

var v = new Vector<double>(new double[] { 2, 3, 4, 5 });
var z = new Vector<double>(new double[] { 2, 3, 4, 5 });
var v1 = new Vector<double>(new double[] { 1, 1, 1, 1 });
...
for(int i = 0; i < vectors.Length; i += 1)
{
  v = v + v1;
  z = z + v1;
}

Note that the above measurements do not compensate for the integer additions of the loop counters themselves.

The key takeaway is that there are use-cases where SIMD is extremely fast. However, when the data is not in the proper format, when conversion is needed, or when memory has to be moved into the processor cache, the performance gain is easily lost. My suggestion is to always carefully evaluate SIMD for a given algorithm, and then prove the decision by measuring the performance gains or losses.
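One practical detail the examples above sidestep (by using array sizes divisible by four) is that real array lengths are rarely a multiple of Vector<double>.Count. A common pattern is a vectorized bulk loop plus a scalar tail; a sketch is below, with AddOne as a hypothetical helper name:

```csharp
using System.Numerics;

double[] data = new double[7]; // deliberately not a multiple of the vector width
AddOne(data);

static void AddOne(double[] data)
{
    var one = Vector<double>.One;
    int width = Vector<double>.Count;
    int i = 0;

    // Vectorized bulk: process full vectors while a whole one still fits.
    for (; i <= data.Length - width; i += width)
    {
        var v = new Vector<double>(data, i) + one;
        v.CopyTo(data, i);
    }

    // Scalar tail: the leftover 0..width-1 elements.
    for (; i < data.Length; i += 1)
        data[i]++;
}
```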