Finding Duplicate Integers

This post is a continuation of using vectorization to achieve better performance. In this post I am going to look at the performance aspects of another typical interview programming exercise.

The task usually goes like this:

Given an array with the integers from 1 to N. The order of the integers is unknown. One randomly chosen number is duplicated by overwriting another number. Create a method that receives the array as input and returns the duplicated number. For example, the input [2,5,4,3,1,7,3] should return 3 as the duplicate, since the number 6 was overwritten by another 3.

I am not going into the details of edge cases, input validation, error handling, etc. This post purely focuses on the performance aspects of some of the possible solutions. I also assume that N = x * 8.
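To make the N = x * 8 assumption concrete: it means the input length is a multiple of the SIMD register width used later (Vector<int>.Count is 8 on AVX2 hardware), so the vectorized solution never has to handle a remainder slice. A minimal guard, my addition rather than part of the original solutions, could look like this (using System.Diagnostics and System.Numerics):

internal static void AssertVectorizableLength(ReadOnlySpan<int> items)
{
    // Assumption check (my addition): the vectorized solution relies on the
    // input length being a multiple of the SIMD width.
    Debug.Assert(items.Length % Vector<int>.Count == 0);
}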

Next, I will present four solutions:

  • A naive solution

  • A naive solution vectorized

  • One usually accepted solution

  • A fast solution

In the last section I compare the performance of each solution head-to-head, in terms of memory usage and execution time.

Naive solution

The following naive solution iterates over each item of the input array and tests it for equality with the items in the remainder of the array. It has an operation count of O(N*N), which does not make it fast, but it also does not need any allocated memory to perform the calculation.

internal static int DuplicateNaive(int[] items)
{
    for (int i = 0; i < items.Length; i++)
        for (int j = i + 1; j < items.Length; j++)
            if (items[i] == items[j])
                return items[i];
    return -1;
}
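To make the expected behavior concrete, here is a short usage example with the input from the task statement:

int[] input = { 2, 5, 4, 3, 1, 7, 3 };
Console.WriteLine(DuplicateNaive(input)); // prints 3, as 6 was overwritten by a second 3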

Naive solution vectorized

The next solution uses SIMD (Single Instruction, Multiple Data) to improve the performance of the previous one. It is based on a similar algorithm: it selects the next number to test, then iterates over the rest of the array to check for equality. It still has an O(N*N) operation count, but because of the vectorization it executes faster. The idea is to partition the array into fixed-size slices that can be vectorized: [0..7], [8..15], and so on. In each iteration it takes the next Vector<int>-sized number of items. For each partition to test, it has an inner for loop and an inner while loop. First, for each number in this partition, it creates a vector whose elements are all the chosen number. For example, for the partition [2,3,9,5,8,7,1,6] it will create 8 vectors:

  • {2,2,2,2,2,2,2,2}

  • {3,3,3,3,3,3,3,3}

  • {9,9,9,9,9,9,9,9}

  • etc.

These vectors are stored in a stack-allocated buffer of Vector<int> values. We can do this because the size of this buffer is limited (Vector<int>.Count elements) and the method is not recursive. The same for loop also tests each number in the partition one-by-one against the other items of the partition.

Second, the inner while loop takes each of the vectors created above and tests it for equality against a new vector created from the next slice of the remaining items. If any pair of corresponding elements of the two vectors is equal, a non-zero mask vector is returned, meaning we have found the duplicate item.

internal static int DuplicateVectorize(ReadOnlySpan<int> items)
{
    var vlength = Vector<int>.Count;
    Span<Vector<int>> currentSliceVs = stackalloc Vector<int>[vlength];
    while (items.Length > 0)
    {
        // Take the next vector-sized slice of the input.
        var currentSlice = items.Slice(0, vlength);
        for (int i = 0; i < vlength; i++)
        {
            // Test the current item against the rest of the slice scalar-wise.
            for (int j = i + 1; j < vlength; j++)
                if (currentSlice[i] == currentSlice[j])
                    return currentSlice[j];
            // Broadcast the current item to all elements of a vector.
            currentSliceVs[i] = new Vector<int>(currentSlice[i]);
        }
        // Compare the broadcast vectors against the remaining slices.
        var remaining = items.Slice(vlength);
        while (remaining.Length > 0)
        {
            var remainingV = new Vector<int>(remaining);
            for (int i = 0; i < vlength; i++)
            {
                // A non-zero mask means at least one element matched.
                var mask = Vector.Equals(remainingV, currentSliceVs[i]);
                if (mask != Vector<int>.Zero)
                    return currentSliceVs[i][0];
            }
            remaining = remaining.Slice(vlength);
        }
        items = items.Slice(vlength);
    }
    return -1;
}
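To illustrate the mask semantics the method relies on, here is a small standalone sketch (the lane values are examples of mine, not part of the solution; it needs using System; using System.Numerics;):

// Build a lane-sized buffer without assuming a particular SIMD width.
int[] lanes = new int[Vector<int>.Count];
for (int i = 0; i < lanes.Length; i++)
    lanes[i] = i + 1;                        // {1,2,3,...}, so one lane holds 3

var needle = new Vector<int>(3);             // broadcast: {3,3,...,3}
var mask = Vector.Equals(needle, new Vector<int>(lanes));
Console.WriteLine(mask != Vector<int>.Zero); // True: at least one lane matched

Vector.Equals sets each matching lane to -1 (all bits set) and the rest to 0, which is why comparing the mask against Vector<int>.Zero is enough to detect a hit.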

Usually accepted solution

One of the more usually accepted solutions uses an additional set data structure. It iterates over each item of the input and adds the item to the set. A set does not store duplicates, so once we try to add an item that has already been added, HashSet<int> returns false. The operation count can be estimated as O(N), as we iterate over the input once, and the hash set's Add operation can be estimated as O(1). By Big-O notation's definition this solution is faster than the previous ones on big enough arrays. On the other hand, the performance gain is traded for using more memory. While the previous solutions allocate at most a negligible amount on the stack, this one needs to allocate a set on the heap, which will result in a GC penalty at some point later.

internal static int DuplicateHashSet(ReadOnlySpan<int> items)
{
    HashSet<int> set = new HashSet<int>(items.Length);
    foreach (var item in items)
    {
        if (!set.Add(item))
            return item;
    }
    return -1;
}

Note that I provide a capacity for the HashSet during construction. Without that, the HashSet would need to grow its internal data structure, which is a CPU and memory hit. The following table shows the characteristics of the DuplicateHashSet method on a sample array, with and without the capacity defined (the one suffixed with C uses the capacity):

|              Method |         Mean |      Error |     StdDev |   Gen 0 |   Gen 1 |   Gen 2 | Allocated |
|-------------------- |-------------:|-----------:|-----------:|--------:|--------:|--------:|----------:|
|    DuplicateHashSet |    604.82 us |  12.027 us |  13.851 us | 94.7266 | 94.7266 | 94.7266 |  538624 B |
|   DuplicateHashSetC |    367.63 us |   7.345 us |  13.431 us | 66.4063 | 66.4063 | 66.4063 |  280424 B |
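For completeness, the capacity-less variant measured above differs only in how the set is constructed; its code is not shown here, but my reconstruction would be:

internal static int DuplicateHashSetNoCapacity(ReadOnlySpan<int> items)
{
    // No capacity hint: the set starts small and grows while items are added.
    var set = new HashSet<int>();
    foreach (var item in items)
    {
        if (!set.Add(item))
            return item;
    }
    return -1;
}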

Fast solution with Arrays

The last solution uses a similar approach to the previous one with sets. Instead of a HashSet<int>, it uses a bool array. The length of the array matches the largest number of the input. The algorithm iterates over the input items and sets a flag at the index corresponding to the item's value. It can do this because the input values are between [1..N]. This approach can be faster than the HashSet one, because it avoids hashing the items. It can also use less memory: it only stores a bool flag for each number, it does not need to store the value itself.

internal static int DuplicateArray(ReadOnlySpan<int> items)
{
    var numbers = new bool[items.Length+1];
    foreach (var item in items)
    {
        if (numbers[item] == false)
            numbers[item] = true;
        else
            return item;
    }
    return -1;
}
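If memory is the primary concern, a further step in the same direction, which I have not benchmarked here, is to store one bit per number instead of one byte, for example with System.Collections.BitArray:

internal static int DuplicateBitArray(ReadOnlySpan<int> items)
{
    // A sketch only, not one of the benchmarked solutions: one bit per number,
    // roughly 8x less memory than the bool array, at the cost of bit twiddling.
    var seen = new BitArray(items.Length + 1);
    foreach (var item in items)
    {
        if (seen[item])
            return item;
        seen[item] = true;
    }
    return -1;
}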

Benchmarks

For benchmarking, I used the BenchmarkDotNet library, with one randomized integer array of 16,384 items. The table below shows that the naive solution is the slowest. Vectorization increases the performance. When we trade memory for speed, we can get several orders of magnitude better execution time.

|             Method |         Mean |      Error |     StdDev |   Gen 0 |   Gen 1 |   Gen 2 | Allocated |
|------------------- |-------------:|-----------:|-----------:|--------:|--------:|--------:|----------:|
|     DuplicateNaive | 41,235.67 us | 780.593 us | 730.167 us |       - |       - |       - |         - |
| DuplicateVectorize | 10,882.24 us | 213.860 us | 285.497 us |       - |       - |       - |         - |
|   DuplicateHashSet |    367.63 us |   7.345 us |  13.431 us | 66.4063 | 66.4063 | 66.4063 |  280424 B |
|     DuplicateArray |     17.90 us |   0.357 us |   0.697 us |  1.0376 |       - |       - |   16416 B |

Why does DuplicateArray allocate 16416 bytes for 16384 items? We allocate a one-item-larger array to avoid shifting the values to be 0-based, and an array on the heap also requires a sync block, a method table (MT) pointer and a length field. On a 64-bit machine, each of these is 8 bytes, resulting in 16384+1+3*8=16409 bytes. The last 7 bytes are padding, so that the total size is a multiple of 8 bytes.
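Finally, for reproducibility, here is a minimal sketch of the kind of BenchmarkDotNet harness behind the table above. The seed, the shuffle and the container class name Duplicates are my assumptions, not the original setup:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class DuplicateBenchmarks
{
    private int[] _items;

    [GlobalSetup]
    public void Setup()
    {
        var rnd = new Random(42);            // fixed seed: an assumption
        _items = Enumerable.Range(1, 16384)
                           .OrderBy(_ => rnd.Next())
                           .ToArray();
        // Overwrite one randomly chosen position with the value of another one.
        int victim = rnd.Next(_items.Length);
        int source = (victim + 1 + rnd.Next(_items.Length - 1)) % _items.Length;
        _items[victim] = _items[source];
    }

    // 'Duplicates' is a hypothetical container class for the methods above.
    [Benchmark] public int Naive() => Duplicates.DuplicateNaive(_items);
    [Benchmark] public int Vectorize() => Duplicates.DuplicateVectorize(_items);
    [Benchmark] public int HashSet() => Duplicates.DuplicateHashSet(_items);
    [Benchmark] public int Array() => Duplicates.DuplicateArray(_items);
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<DuplicateBenchmarks>();
}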