An approach to 95th Percentile with SIMD
11/05/2023
6 minutes
Introduction
In a previous post I looked into using Single instruction, multiple data, SIMD to Sum
an array of integer elements and to find the Median
value of sample.
When working with statistical data, we often need to find the 95th percentile element of a sample. Calculating the Sum
is useful for finding the mean value but finding the 95th percentile element requires a different approach. However, the P95 algorithm is similar to the way the Median
element is found.
The 95th percentile is a statistical term that means the value below which 95% of the data in a given set is found. For example, if you have a set of test scores, the 95th percentile score is the one that is higher than 95% of the other scores. You can calculate the 95th percentile using this formula: n = (P/100) x N, where P = percentile, N = number of values in the data set, and n = ordinal rank of the given set. The 95th percentile is often used to measure service response times, as it allows for burstable usage patterns.
There are multiple algorithms to calculate find a P percentile value of a data, in this post I will focus approaches that sort the input data.
Let me first set assumptions to the original problem:
the input sample are positive integers (
uint[]
).
With this assumption we can use the same approaches to find P percentile as described in the previous post. In the previous post, I searched for function f(sample, n)
so that it returns the nth element of the ordered items.
In this post, I adapt the previous implementations of a method f
that is given inputs of an int[] source
and an int n
number to the percentile problem. From data types point of view an IReadOnlyCollection<int>
would be more accurate semantically however, for the brevity in this post I will use int[]
. Code samples can be adapted to use the above interface type if required.
Heaps
Heaps provide a unique data structure that fits well for the solution of the above problem. A max-heap is a tree structure that keeps the maximum element on the top. We can use the heapsort algorithm solve the problem of percentile. Heapsort orders items of a collection. For a collection of n
items, it builds a heap in O(n)
then applies the siftDown()
operation n
times. Each invocation of siftDown()
returns the next item in the ordered sequence. The complexity of siftDown()
operation is O(log n)
. Altogether this runs the algorithm with O(n * log n)
performance to completely order a collection. However, we do not need to run the siftDown()
n
times to solve the problem of percentile. When searching for P percentile, we only need n - (P/100) x n
times to invoke siftDown()
where n
is the size of the input data.
Heapify
operation is used typically for building a heap. Heapify internally uses the siftDown()
operation. Another favorable advantage of using heaps is that it can easily represented by an array of items. In this case an uint[]
can represent the heap structure. In case of a binary heap, a parent node has 2 children. When stored by an array a parent node with index i
would have children at 2i+1
and 2i+2
indexes. A child at index i
has its parent at floor((i-1)/2)
index.
SIMD
The advantage of SIMD is to process multiple data with the same instruction. To apply SIMD on heap operation, we need to identify the places where multiple data could be processed by the same operation. In case of heap, a closer look on siftDown()
suggests parallelizing the comparison and swap operations. siftDown()
compares all children with the parent, and in case of max-heap swaps the maximum child with the parent, so that the parent becomes the new maximum.
d-ary heap is a heap, where nodes have d children instead of 2. With this in mind, we could build 16-heap, while keeping the semantics of the heap described above. For further details, please refer to the previous post.
16-heap with Positive Numbers
The code example for finding the median element can be easily adapted for the problem. Instead of building and iterating the items of a min-heap, we can build and iterate items of a max-heap. This way, we don't need to iterate 95% of the items but only 5% percent. The adapt the code in the previous post the following changes are applied:
Min method invocations are replaced with Max method invocations.
Maximum values applied to the unset elements of the array are replaced with 0.
Performance Benchmarks
This section compares the SIMD approach with the non-SIMD solutions described in the previous post. I am focusing on inputs with elements 100, 1000, 10000, and respectively fetching the respective value for the 95th percentile (P=95
).
Note, that in the previous post, finding the median element meant to iteration / order half of the elements from the input. However, to find P95 only the top 5% of the elements need to be sorted. This brings a nice performance boost for the SIMD approach.
In all test cases the heap SIMD solution bring a performance gain compared to the OrderBy() and Linq approaches:
Source Length 100
| Method | Mean | Error | StdDev | |---------------------------- |---------:|--------:|--------:| | GetByLinq | 403.7 ns | 4.51 ns | 4.22 ns | | GetByOrder | 397.9 ns | 4.25 ns | 3.77 ns | | GetNth16HeapPositiveNumbers | 142.0 ns | 2.11 ns | 1.98 ns |
Source Length 1000
| Method | Mean | Error | StdDev | |---------------------------- |---------:|----------:|----------:| | GetByLinq | 4.565 us | 0.0899 us | 0.0924 us | | GetByOrder | 6.068 us | 0.0716 us | 0.0598 us | | GetNth16HeapPositiveNumbers | 1.417 us | 0.0149 us | 0.0116 us |
Source Length 10000
| Method | Mean | Error | StdDev | |---------------------------- |----------:|---------:|---------:| | GetByLinq | 75.08 us | 1.446 us | 1.665 us | | GetByOrder | 377.75 us | 2.281 us | 2.134 us | | GetNth16HeapPositiveNumbers | 17.08 us | 0.183 us | 0.171 us |
While in the previous post it was a reasonable suggestion to use the built in methods to find the median value, for P95, P90 percentiles, it is less clear. In this case, for very tight loops, a developer might consider the optimization of using SIMD, given that the application is running on a known hardware.