Missing Number Performance Tests 2
02/01/2020
4 minutes
A couple of weeks ago, I had a post about a tipical interview question: finding a missing number from an unordered array.
The task goes as follows:
Given an array with integers from 1 to N. The order of the integers is unknown. One of the randomly chosen number is replaced with 0. Create a method that receives the array as an input and returns the replaced number.
I am not going into the details of edge cases, input validation, error handling, etc. This post purely checks the performance aspects of some of the possible solutions. I also assume that N=x*8
.
I have previously presented three solutions:
A naive solution
One usually accepted solution
A fast solution
In this post, I propose a 4th solution to find the replaced number using .net core 3 hardware intrinsics. This solution is not cross-platform nor general purpose and does not provide the same fallbacks as existing .net core SIMD. Methods using System.Runtime.Intrinsics
will end up being unsafe too, as loading data into Avx registers requires the use of a pointer type. As long as we are sure that the target users of the application are equipped with the right processor, we can leverage the performance boost provided by the hardware instrinsics.
Vectorization with Intrinsics
The method is similar to the SIMD solution in the previous post. It first calculates the expected sum of the numbers in the array. It initializes a sum vector of integeres with the value of 0.
Next the items span is pinned so we can load the values into the Avx registers. Pinning will avoid the Garbage Collector to compact empty memory before this object. Pined objects could cause demotion and extra load on the garbage collection. On the other hand, in such a code that is optimized for high performance, we shall avoid allocations that could trigger garbage collection.
Next in a loop we load 8 integers (Int32
* 8 fits into 256 bit register) into the Avx registers and add it to the sum vector.
Once the loop completes, we need sum up the 8 integers of the vector. HorizontalAdd
adds two integeres horizontaly.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
x0+x1 | x2+x3 | x0+x1 | x2+x3 | x4+x5 | x6+x7 | x4+x5 | x6+x7 |
x0+x1+x2+x3 | x0+x1+x2+x3 | x0+x1+x2+x3 | x0+x1+x2+x3 | x4+x5+x6+x7 | x4+x5+x6+x7 | x4+x5+x6+x7 | x4+x5+x6+x7 |
The header row indicates the index of the integer number in the register.
1st row indicates the initial state of the vector to sum up horizontaly.
2nd row indicates the result after the first
HorizontalAdd
3rd row indicates the result after the second
HorizontalAdd
Finally, ToScalar()
is used to get the first value of the lower (128 bit) and upper (128 bit) part of the register. Both numbers extracted from the expected will result the missing number, which value is being returned.
public static unsafe int VectorizeAvx(ReadOnlySpan<int> items) { int expected = (items.Length + 1) * items.Length >> 1; Vector256<int> sum = Vector256<int>.Zero; fixed (int* pItem = items) { int i = 0; while (i < items.Length) { sum = Avx2.Add(sum, Avx2.LoadVector256(pItem + i)); i += 8; } } sum = Avx2.HorizontalAdd(sum, sum); sum = Avx2.HorizontalAdd(sum, sum); return expected - sum.ToScalar() - sum.GetUpper().ToScalar(); }
Note, the current solution is unsafe, do not use this method in production code, as it lacks handling edge cases, exceptions and checks if Avx2 is supported on the target machine.
Conclusion
For benchmarking, I used BenchmarkDotNet library, with one randomized integer array of length 8192 items. The table below shows that the previous results (Contains, Sum, Vectorize) and using the Avx instrinsics as well. Compared to Contains using the formula in Sum increases the preformance, several magnitudes, and vectorization further boosts execution.
Finally, using the Avx registers we can reach a more than 6 times improvement to the Sum solution, and nearly 40% improvement on the generic SIMD (Vectorize) solution.
Method | Mean | Error | StdDev |
---|---|---|---|
Contains | 1,685,797.5 ns | 21,441.62 ns | 19,007.43 ns |
Sum | 4,857.7 ns | 96.79 ns | 115.22 ns |
Vectorize | 1,229.8 ns | 13.98 ns | 13.07 ns |
VectorizeAvx | 737.4 ns | 6.63 ns | 5.53 ns |
Which one do I prefer? Unless knowing the exact platform where the application runs and the need to squeeze every nanosecond, I would use the more general System.Numerics
which provides a safer, cross-platform solution.