An approach to pseudo-Median with SIMD

In a previous posts I looked into using Single instruction, multiple data, SIMD to find the Median value in a sample of integer values as well as I looked into generating pairwise averages.

The definition of median in case of

  • an odd sample size, find 'middle' element of the ordered samples.

  • an even sample size, find the 2 elements in the middle of an ordered sample size, and calculate the average of those.

I have recently came across a blog post from Andrey Akinshin proposing to benefits of using Hodges-Lehmann location estimator to calculate a pseudo-median value.

The Hodges-Lehmann estimator uses Walsh (pairwise) averages as the base of the median.

In this post I will combine the previously shown SIMD solutions for pairwise averages and median, so that it calculates pseudo-median.

Hodges-Lehmann location estimator

To better understand the benefits of using Hodges-Lehmann location estimator compared to median is detailed in Andrey's post. In short it provides better statistical efficiency for smaller sample sizes. When compared to mean, it is less sensitive to outliers.

While this post does not introduce a fundamentally new idea, it combines two previous findings to create a SIMD based Hodges-Lehmann location estimator.

[Benchmark]
public uint SimdMedian() => GetNth16HeapPositiveNumbers(SimdWalsh());

[Benchmark]
public uint RegularMedian() => GetByOrder(RegularWalsh());

For detailed explanation and implementation of the methods above, please review my previous posts in the topic: An approach to Walsh pais with SIMD and An approach to median with SIMD.

The SimdMedian uses a SIMD approach to generate pairwise averages, then uses the 16-heap positive number implementation of heap sort to iterate to the median number.

The RegularMedian method uses the regular non-SIMD solution to generate the pairwise averages and then Array.Sort and indexing to read out the median value.

Benchmarks

The previous posts discussed the individual performance characteristics of the pairwise average generation as well as finding the median number. For the pairwise averages the SIMD solution provides a strong performance gain. In the case of fetching the median value, the results are more nuanced. It only provides a performance gain for large inputs, and even in those cases the gain does not justify the extra complexity. This made GetNth16HeapPositiveNumbers method less attractive for the general usage.

Based on Andrey Akinshin's post, Hodges-Lehmann location estimator excels on small samples. Although the input sample might be as small, the pairwise averages result quadratic number of pairs. This means that even for small input of 100 samples, the median has to order and search ~5000 elements. This is exactly the range where the GetNth16HeapPositiveNumbers starts yielding performance gains.

Calculating pseudo-median of 100 random integers:

|        Method |     Mean |    Error |   StdDev |
|-------------- |---------:|---------:|---------:|
|    SimdMedian | 61.39 us | 0.566 us | 0.501 us |
| RegularMedian | 92.58 us | 0.856 us | 0.801 us |

Note that due to the nature of these algorithms certain inputs work better, and others worse. For example, if the input is a fully ordered array, the results will be within <5% range for the SIMD and non-SIMD solution. The SIMD based pseudo-median a more competitive solution compared to the SIMD based media. However, the suggestion is still to do careful measurements and evaluation to decide if the performance gain worth the additional complexity of this algorithm.