 # Hello, I am Laszlo

## Introduction

In this post I will look into how someone can implement Sin/Cosin functions with SIMD in NET 5. NET 5 does not have a wrapper on SIMD trigonometric functions, so it seems a good exercise to implement it.

Using SIMD, one may gain performance advantage on large data sets. Later in this post I will run a performance benchmark to compare the SIMD implementation with `Math.Cos()` and `MathF.Cos()` methods.

## Considerations

For the implementation I have made these following considerations. There are multiple methods to estimate a good enough value for cos/sin functions. Because cosine to sine is just a 90 degree (PI/2) offset on the input, it is enough to have an algorithm that calculates either sin or cos, while translating the input for the counterpart. Another observation to make is that the function is periodic, with a period of 2*PI.

There are multiple algorithms to be look at, each has different characteristics considering performance (CPU and memory), complexity and accuracy.

• One algorithm is CORDIC which is used by most calculators. The algorithm is efficient on constrained memory and CPU environments.

• Another approach is to pre-calculate a lookup table of input-output values using a more precise calculation, and use linear interpolation to estimate the sin/cos value of a given input that falls in-between of two known values in the table.

• Use Taylor series to approximate sine/cosine.

• Using Chebyshev polynomials.

For this exercise, I have chosen to use Taylor series, as it is one of the mathematical tools I learned previously that provides a good approximation for small input values.

## Implementation

At this stage I have been through a couple of iterations of the implementations, and here I present the most recent one. See the detailed description below the current implementation:

``````public class MathSimd
{
private readonly Vector256<double> _pi = new Vector<double>(Math.PI).AsVector256();
private readonly Vector256<double> _one = new Vector<double>(1).AsVector256();
private readonly Vector256<double> _negPi = new Vector<double>(-Math.PI).AsVector256();
private readonly Vector256<double> _cosCosLimit = new Vector<double>(Math.PI / 4).AsVector256();
private readonly Vector256<double> _cosSinLimit = new Vector<double>(-Math.PI / 4).AsVector256();
private readonly Vector256<double> _cosOffset = new Vector<double>(Math.PI / 2).AsVector256();
private readonly Vector256<double> _negOne = new Vector<double>(-1).AsVector256();

private readonly Vector256<double> _cos0 = new Vector<double>(BitConverter.ToDouble(new byte[] { 0, 0, 0, 0, 0, 0, 224, 191, })).AsVector256();
private readonly Vector256<double> _sin0 = new Vector<double>(BitConverter.ToDouble(new byte[] { 85, 85, 85, 85, 85, 85, 197, 191, })).AsVector256();

private readonly Vector256<double> _cos1 = new Vector<double>(BitConverter.ToDouble(new byte[] { 85, 85, 85, 85, 85, 85, 165, 63, })).AsVector256();
private readonly Vector256<double> _sin1 = new Vector<double>(BitConverter.ToDouble(new byte[] { 17, 17, 17, 17, 17, 17, 129, 63, })).AsVector256();

private readonly Vector256<double> _cos2 = new Vector<double>(BitConverter.ToDouble(new byte[] { 23, 108, 193, 22, 108, 193, 86, 191, })).AsVector256();
private readonly Vector256<double> _sin2 = new Vector<double>(BitConverter.ToDouble(new byte[] { 26, 160, 1, 26, 160, 1, 42, 191, })).AsVector256();

private readonly Vector256<double> _cos3 = new Vector<double>(BitConverter.ToDouble(new byte[] { 26, 160, 1, 26, 160, 1, 250, 62, })).AsVector256();
private readonly Vector256<double> _sin3 = new Vector<double>(BitConverter.ToDouble(new byte[] { 52, 199, 86, 165, 227, 29, 199, 62, })).AsVector256();

private readonly Vector256<double> _cos4 = new Vector<double>(BitConverter.ToDouble(new byte[] { 92, 159, 120, 183, 79, 126, 146, 190, })).AsVector256();
private readonly Vector256<double> _sin4 = new Vector<double>(BitConverter.ToDouble(new byte[] { 228, 68, 245, 103, 69, 230, 90, 190, })).AsVector256();

private readonly Vector256<double> _cos5 = new Vector<double>(BitConverter.ToDouble(new byte[] { 152, 216, 248, 239, 216, 238, 33, 62, })).AsVector256();
private readonly Vector256<double> _sin5 = new Vector<double>(BitConverter.ToDouble(new byte[] { 9, 109, 168, 19, 70, 18, 230, 61, })).AsVector256();

private readonly Vector256<double> _cos6 = new Vector<double>(BitConverter.ToDouble(new byte[] { 157, 124, 192, 168, 116, 57, 169, 189, })).AsVector256();
private readonly Vector256<double> _sin6 = new Vector<double>(BitConverter.ToDouble(new byte[] { 31, 184, 51, 231, 243, 231, 106, 189, })).AsVector256();

public void Cos(ReadOnlySpan<double> input, Span<double> output)
{
Vector256<double> vInput = new Vector<double>(input).AsVector256();

var mutlipleOf = Avx2.RoundToZero(Avx2.Divide(vInput, _pi));
vInput = Avx2.Add(vInput, Avx2.Multiply(mutlipleOf, _negPi));

var negativeSinMask = Avx2.CompareEqual(Avx2.And(mutlipleOf, _one), _one);

var byteMask = Avx2.CompareGreaterThan(vInput, _cosCosLimit);
var taylorX = Avx2.BlendVariable(vInput, Avx2.Add(vInput, _negPi), byteMask);
var neg = Avx2.BlendVariable(_one, _negOne, negSign);

byteMask = Avx2.CompareLessThan(taylorX, _cosSinLimit);

var taylorBegin = Avx2.BlendVariable(_one, taylorX, byteMask);
var taylorAccumulator = Vector256<double>.Zero;
var taylorX2 = Avx2.Multiply(taylorX, taylorX);

taylorAccumulator = Avx2.Add(taylorAccumulator, Avx2.Multiply(taylorBegin, Avx2.BlendVariable(_cos0, _sin0, byteMask)));
var taylorExp = Avx2.Multiply(taylorBegin, taylorX2);
taylorAccumulator = Avx2.Add(taylorAccumulator, Avx2.Multiply(taylorExp, Avx2.BlendVariable(_cos1, _sin1, byteMask)));
taylorExp = Avx2.Multiply(taylorExp, taylorX2);
taylorAccumulator = Avx2.Add(taylorAccumulator, Avx2.Multiply(taylorExp, Avx2.BlendVariable(_cos2, _sin2, byteMask)));
taylorExp = Avx2.Multiply(taylorExp, taylorX2);
taylorAccumulator = Avx2.Add(taylorAccumulator, Avx2.Multiply(taylorExp, Avx2.BlendVariable(_cos3, _sin3, byteMask)));
taylorExp = Avx2.Multiply(taylorExp, taylorX2);
taylorAccumulator = Avx2.Add(taylorAccumulator, Avx2.Multiply(taylorExp, Avx2.BlendVariable(_cos4, _sin4, byteMask)));
taylorExp = Avx2.Multiply(taylorExp, taylorX2);
taylorAccumulator = Avx2.Add(taylorAccumulator, Avx2.Multiply(taylorExp, Avx2.BlendVariable(_cos5, _sin5, byteMask)));
taylorExp = Avx2.Multiply(taylorExp, taylorX2);
taylorAccumulator = Avx2.Add(taylorAccumulator, Avx2.Multiply(taylorExp, Avx2.BlendVariable(_cos6, _sin6, byteMask)));

taylorAccumulator = Avx2.Add(Avx2.Multiply(taylorAccumulator, taylorX2), taylorBegin);
taylorAccumulator = Avx2.Multiply(taylorAccumulator, neg);

taylorAccumulator.AsVector().TryCopyTo(output);
}
}``````

There is a single public `Cos()` method implemented, which is calculating the cosine. For brevity of this post, I will focus on cosine in the rest of this post.

The Taylor series of sine and cosine functions follows as:

cos x = 1 - x4 / 4! - x^6 / 6! + ...

sin x = x - x5 / 5! - x^7 / 7! + ...

For a good enough accuracy we may calculate the first n terms.

However the period of cosine is 2PI, the first and second half values are equal in their absolute value, and just negated by -1. The goal is to the calculation for values in `[0, PI[` range, and then correct the sign of the final value based on the input value.

Taylor series (sin/cos) is very accurate in the `[-PI/4, PI/4]` range. The `[0, PI[` range may be divided into 3 segments:

• `[0, PI/4[` may remain still cosine values may be calculated directly on this path

• `[PI/4, 3*PI/4[` range represents the negated values of sine function on range `[-PI/4, PI/4[`. Hence this input range can be shifted and sine calculation may be applied.

• `[3*PI/4, PI/2[` range equals to the negated values of cosine function on `[-PI/4, 0[` With this, all values may be calculated with the most accurate input range of sine/cosine Taylor series. The two variables to store for an input value is:

• if the final value should be negated by -1 and

• if sine or cosine Taylor series should be used.

The next consideration is how to use SIMD operations for the calculation. The two approaches:

1. The input of the function is a single `double` number; utilize SIMD to calculate a single value faster.

2. The input of the function is an array of `double`s; SIMD is applied to execute the same operations on all input values at the same time.

Let's examines the first approach. At first sight there might be a lot of independent computations that could be done parallel. For example each term of the series could be a calculated parallel in a single SIMD register. However, each term also builds on top of the previous term, for example x2 * x * x or 4! = 2! * 3 * 4. This seems to produce a lot of duplicate calculations, as x^2 would be calculated for each term. Once a SIMD register reaches its target value, it would sit idle until registers with higher terms still process further calculations.

This makes the second approach as a natural fit: calculate the sin/cos for multiple floating point numbers at the same time. In this case each SIMD register is assigned to input number. The algorithm can re-use already calculated term's values. For each input number first calculate x4, reuse the already calculated result of x^2 and multiply it by x * x.

For a good enough accuracy I found that calculating 7 terms of the series is good enough for `double` input values. By determining what is 'good enough' I compared the results with the built in `Math.Cos` method. For each term I have also pre-calculated a part of the term: the sign and the denominator values are constant, so pre-calculating them is safe for later re-use. After doing the pre-calculation, I have serialized the `double`s as `byte`s values, so I can maintain a high precision while hardcoding them in the source. These values are hardcoded as the `_cos0`, `_cos1`, `_sin0`, `_sin1`, etc. readonly fields.

With this all background, ready to look at the `void Cos(ReadOnlySpan<double> input, Span<double> output)` method. In the first two lines, the input values are normalized for the `[0, PI[` segment. Next an initial negative mask is calculated based on the factor of normalization. If the value is odd or even by using the `Add` and `Compare` methods.

The next 4+2 lines shift the input values to the `[-PI/4, PI/4]` range while also maintaining the negative mask. First every value above PI/4 is shifted left by PI, so the third segment get shifted into the `[-PI/4, 0[` range, and then the values for the second segment are shifted to right by PI. The mask for this last shift can be directly re-used for the rest of series calculation for masking where sin/cos Taylor series shall be used.

The first term of the series is calculated, which is either 1 or the input value based on the sin/cos mask. Then the rest of the terms are aggregated and calculated without a loop. This yields better performance by eliminating houskeeping of the for loop. Finally, the aggregate value is multiplied by x2, we shall yield a higher accuracy, and values will be "more continuous" as pointed out by some of the articles on the web.

The last step is to fix the sign of the final result using the negative mask.

## Performance Benchmarks

I used BenchmarkDotNet to compare the SIMD algorithm with an equal length of `Math.Cos` invocations in a loop. For simplicity the size of the input array is a multiple of the size of the SIMD register on my machine.

``````// * Summary *

Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-preview.5.21302.13
[Host]     : .NET Core 5.0.7 (CoreCLR 5.0.721.25508, CoreFX 5.0.721.25508), X64 RyuJIT
DefaultJob : .NET Core 5.0.7 (CoreCLR 5.0.721.25508, CoreFX 5.0.721.25508), X64 RyuJIT

|    Method |     Mean |    Error |   StdDev |
|---------- |---------:|---------:|---------:|
|   MathCLR | 17.02 ns | 0.361 ns | 0.370 ns |
| MathSIMD3 | 16.56 ns | 0.354 ns | 0.508 ns |``````

With increased number of input values:

``````// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19043
Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-preview.5.21302.13
[Host]     : .NET Core 5.0.7 (CoreCLR 5.0.721.25508, CoreFX 5.0.721.25508), X64 RyuJIT
DefaultJob : .NET Core 5.0.7 (CoreCLR 5.0.721.25508, CoreFX 5.0.721.25508), X64 RyuJIT

|    Method |      Mean |    Error |   StdDev |
|---------- |----------:|---------:|---------:|
| MathSIMD3 |  92.04 ns | 0.982 ns | 0.767 ns |
|   MathCLR | 113.90 ns | 0.458 ns | 0.428 ns |
``````

Next I adopted the above algorithm to handle `float` types instead of `double`. This may be compared to `MathF.Cos` respectively. Here are the benchmark results:

``````// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19043
Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-preview.5.21302.13
[Host]     : .NET Core 5.0.7 (CoreCLR 5.0.721.25508, CoreFX 5.0.721.25508), X64 RyuJIT
DefaultJob : .NET Core 5.0.7 (CoreCLR 5.0.721.25508, CoreFX 5.0.721.25508), X64 RyuJIT

|     Method |     Mean |    Error |   StdDev |
|----------- |---------:|---------:|---------:|
|   MathFCLR | 53.15 ns | 0.288 ns | 0.256 ns |
| MathSIMD3F | 11.48 ns | 0.214 ns | 0.178 ns |``````

## Conclusion

There are use-cases when using a custom SIMD sine and cosine calculation has performance advantage. However, I rather find the content of this post as an interesting coding exercise rather than a production ready solution. Spending more time understanding the inner workings of CPU, using further benchmarking tools, such as Intel VTune could further optimize CPU and memory utilization. Interesting enough, a previous, seemingly less optimal SIMD implementation yields slightly better results in some cases, as the CPU can better align instructions.