Vectorized Longest Increasing Subsequence

In the Longest Increasing Subsequence (LIS) problem, the aim is to find an increasing subsequence of the elements of a given input. The subsequence does not need to be contiguous. The longest such subsequence is the solution to the problem.

For example, for the input sequence 0, 8, 4, 5, 2, the length of the LIS is 3, because the longest increasing subsequence is 0, 4, 5.

In this post I will not focus on the most efficient implementation of LIS. Instead, I will present the most common approach to solving the problem using Dynamic Programming, and then a vectorized version of the same algorithm.

The Base Solution

The idea behind the Dynamic Programming approach is to reduce the problem to sub-problems: solve each sub-problem once and store the intermediate results for future calculations.

In the case of the LIS problem, let xi denote the i-th element of the sequence. For every index j < i, assume the length of the longest increasing subsequence ending at xj is already known. To solve the sub-problem for xi, it needs to be compared with each such xj: whenever xi is larger, the subsequence ending at xj can be extended by one element. The longest increasing subsequence ending at xi is then the longest such extension, or xi alone when no preceding element is smaller.

In practice, a typical implementation may look like the one below. Note that it is neither the most efficient algorithm (it runs in O(n^2)) nor the most efficient implementation. It also omits edge-case handling (e.g., null inputs or empty arrays).

int Lis(int[] input)
{
    // dp[i] holds the length of the longest increasing subsequence ending at index i.
    var dp = new int[input.Length];
    dp[0] = 1;
    for (int i = 1; i < input.Length; i++)
    {
        var longest = 1;
        for (int j = 0; j < i; j++)
        {
            // input[i] can extend any increasing subsequence ending in a smaller element.
            if (input[i] > input[j])
            {
                longest = Math.Max(longest, dp[j] + 1);
            }
        }
        dp[i] = longest;
    }
    return dp.Max(); // requires using System.Linq;
}

First, a temporary array, dp, is allocated with the same length as the input. Index i of this array stores the LIS value for the input range [0..i]. For the first element (at index zero), the LIS is 1. Next, the items of the input sequence are iterated. Each value at index i is compared with the values preceding it (indexes j, where j < i). When the value at index i is greater than the value at index j, the algorithm takes j's corresponding LIS value from the dp array and adds one to it: dp[j] + 1. The largest such value for the current index is persisted in dp[i]. Finally, the result is the maximum value in the dp array.
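A quick sanity check with the example from the beginning of the post: the dp array evolves to [1, 2, 2, 3, 2], whose maximum is the answer.

Console.WriteLine(Lis(new[] { 0, 8, 4, 5, 2 })); // prints 3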

Vectorization

Let me present a vectorized implementation. While the algorithm and the main idea remain the same, the implementation differs in a few places. Vectorization usually means using SIMD (Single Instruction, Multiple Data) instructions. Vectors are stored in special SIMD registers, and different CPUs/architectures are equipped with different register sizes (e.g. 64, 128, 256 and 512 bits). The machine used for this blog post has 256-bit SIMD registers, which means it can process 8 Int32 values with a single instruction. In theory, an algorithm may therefore be sped up by a factor of 8 on this machine. In practice it is rare to see such improvements, as loading data into the SIMD registers and saving the results also takes a few operations. Also note that the vectorized algorithm is still O(n^2), because Big O notation discards constant factors.

In .NET, Vector<T> represents a single vector for parallel algorithms. The JIT uses hardware intrinsics to decide at runtime which concrete register size and instructions can be used on the architecture it runs on. A developer could also use the lower-level fixed-size types directly, such as Vector256<T>. However, an algorithm using Vector256<T> directly might not leverage the full capabilities of a CPU with larger registers, or it might fail on a CPU without 256-bit registers. To use the lower-level types, a developer needs to check the capabilities of the machine and branch to a corresponding implementation. While one can branch the code based on the supported instruction set, it is easier to use the higher-level type when possible.
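If one targets the fixed-size types directly, the dispatch typically looks something like the sketch below (LisDispatch, LisVector512, LisVector256 and LisScalar are hypothetical names, not code from this post):

// Requires using System.Runtime.Intrinsics;
int LisDispatch(int[] input)
{
    if (Vector512.IsHardwareAccelerated)
        return LisVector512(input); // hypothetical 512-bit implementation
    if (Vector256.IsHardwareAccelerated)
        return LisVector256(input); // hypothetical 256-bit implementation
    return LisScalar(input);        // hypothetical scalar fallback
}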

In the following algorithm, a temporary dp array is rented from an array pool and returned to the pool before the method returns. Inputs shorter than the number of elements that fit into a vector need special handling: a helper method, LisRosGeneric, calculates the results in the same way as the non-vectorized version shown previously. As discussed earlier, the number of elements that fit into a vector depends on the generic type T and the underlying architecture. Using short as the generic type, my CPU with 256-bit registers fits 16 values into a vector, while using long it fits only 4.
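The per-type element counts can be verified directly with Vector<T>.Count; the outputs in the comments are what this 256-bit (AVX2) machine prints:

Console.WriteLine(Vector<short>.Count); // 16
Console.WriteLine(Vector<int>.Count);   // 8
Console.WriteLine(Vector<long>.Count);  // 4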

When the input size is larger than the vector size, each index of the input is iterated and a vectorized inner loop is exercised:

// Requires using System.Buffers; (ArrayPool<T>) and using System.Numerics; (Vector<T>).
T LisSimdGeneric<T>(T[] input) where T : IBinaryInteger<T>
{
    var dp = ArrayPool<T>.Shared.Rent(input.Length);
    try
    {
        LisRosGeneric<T>(input.AsSpan(0, input.Length > Vector<T>.Count ? Vector<T>.Count : input.Length), dp);
        if (input.Length < Vector<T>.Count)
            return MaxGeneric<T>(dp[0..input.Length]);

        for (int i = Vector<T>.Count; i < input.Length; i++)
        {
            var vMax = Vector<T>.One;
            var vI = new Vector<T>(input[i]);
            Vector<T> vCurrent;
            Vector<T> mask;
            Vector<T> vIncrementedDp;
            for (int j = 0; j < i - Vector<T>.Count; j += Vector<T>.Count)
            {
                vCurrent = Vector.LoadUnsafe(ref input[j]);
                mask = Vector.GreaterThan(vI, vCurrent);
                vIncrementedDp = Vector.LoadUnsafe(ref dp[j]) + Vector<T>.One;
                vMax = Vector.Max(vMax, Vector.ConditionalSelect(mask, vIncrementedDp, Vector<T>.One));
            }

            // Last round: load the final, possibly overlapping chunk that ends exactly at index i
            vCurrent = Vector.LoadUnsafe(ref input[i - Vector<T>.Count]);
            mask = Vector.GreaterThan(vI, vCurrent);
            vIncrementedDp = Vector.LoadUnsafe(ref dp[i - Vector<T>.Count]) + Vector<T>.One;
            vMax = Vector.Max(vMax, Vector.ConditionalSelect(mask, vIncrementedDp, Vector<T>.One));
            dp[i] = MaxVector(vMax);
        }
        return MaxVectorized<T>(dp.AsSpan(0, input.Length));
    }
    finally
    {
        ArrayPool<T>.Shared.Return(dp);
    }
}

For each index i, a vector vI is created that holds the value of input[i] in every element. Next, the input elements in the range [0..i - Vector<T>.Count) are iterated in batches matching the element count of the vector. The inner loop must not iterate all the way up to i, because loading a full vector near the end of that range could read out-of-range memory and cause an access violation. Instead, the very last iteration loads a vector that ends exactly at index i, starting Vector<T>.Count elements before it. As a result, a few indexes of the input may be processed twice. This is not a problem for the algorithm's correctness, but it may cost some performance for smaller inputs.
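Before dissecting the inner loop, note that the code above calls four helpers that are not shown in this post: LisRosGeneric, MaxGeneric, MaxVector and MaxVectorized. The sketches below are my reconstructions based on the call sites, not necessarily the original implementations:

// Scalar LIS over spans, writing per-index lengths into dp ("Ros" presumably stands for ReadOnlySpan).
void LisRosGeneric<T>(ReadOnlySpan<T> input, Span<T> dp) where T : IBinaryInteger<T>
{
    dp[0] = T.One; // edge cases (empty input) omitted, as in the rest of the post
    for (int i = 1; i < input.Length; i++)
    {
        var longest = T.One;
        for (int j = 0; j < i; j++)
            if (input[i] > input[j])
                longest = T.Max(longest, dp[j] + T.One);
        dp[i] = longest;
    }
}

// Scalar maximum over a span, used for inputs shorter than one vector.
T MaxGeneric<T>(ReadOnlySpan<T> values) where T : IBinaryInteger<T>
{
    var max = values[0];
    for (int i = 1; i < values.Length; i++)
        max = T.Max(max, values[i]);
    return max;
}

// Horizontal maximum of a single vector.
T MaxVector<T>(Vector<T> v) where T : IBinaryInteger<T>
{
    var max = v[0];
    for (int i = 1; i < Vector<T>.Count; i++)
        max = T.Max(max, v[i]);
    return max;
}

// Vectorized maximum over a span; assumes values.Length >= Vector<T>.Count, which
// holds at its call site. Note the same overlapping-tail trick as described above.
T MaxVectorized<T>(ReadOnlySpan<T> values) where T : IBinaryInteger<T>
{
    var vMax = new Vector<T>(values[..Vector<T>.Count]);
    int i = Vector<T>.Count;
    for (; i <= values.Length - Vector<T>.Count; i += Vector<T>.Count)
        vMax = Vector.Max(vMax, new Vector<T>(values.Slice(i, Vector<T>.Count)));
    vMax = Vector.Max(vMax, new Vector<T>(values[^Vector<T>.Count..]));
    return MaxVector(vMax);
}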

Let's discuss the body of the inner loop:

vCurrent = Vector.LoadUnsafe(ref input[j]);
mask = Vector.GreaterThan(vI, vCurrent);
vIncrementedDp = Vector.LoadUnsafe(ref dp[j]) + Vector<T>.One;
vMax = Vector.Max(vMax, Vector.ConditionalSelect(mask, vIncrementedDp, Vector<T>.One));

We load a slice of the input into a vector and compare it to the vector built from the item at index i. This produces a mask of all-ones and zero elements indicating which items vI is greater than. Next, we load the corresponding slice of the dp array into a new vector. Optimistically, every element of this vector is incremented by 1, indicating that a longer subsequence has been found. In reality, this should only apply to elements where the mask indicates that vI was larger. Therefore, the new maximum vector is calculated by taking the element-wise maximum of the existing maximum vector and the incremented vector filtered by the mask.
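A standalone demonstration of the masking idea with concrete numbers may help. The example below assumes Vector<int>.Count == 8 (a 256-bit/AVX2 machine), and the values are made up for illustration:

var vI       = new Vector<int>(5);
var vCurrent = new Vector<int>(new[] { 0, 8, 4, 5, 2, 9, 1, 7 });
var mask     = Vector.GreaterThan(vI, vCurrent); // all-bits-set where 5 > element, zero otherwise
var dpSlice  = new Vector<int>(new[] { 1, 2, 2, 3, 2, 4, 1, 3 });
var picked   = Vector.ConditionalSelect(mask, dpSlice + Vector<int>.One, Vector<int>.One);
// picked is [2, 1, 3, 1, 3, 1, 2, 1]: incremented dp where 5 was greater, 1 elsewhere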

Performance

Comparing the performance of these two solutions reveals a few very interesting facts. I used .NET 8 with BenchmarkDotNet on a machine supporting AVX2 to execute the performance comparison. The input is an array of 10000 int values shuffled randomly. The results indicate that the SIMD solution is close to 20 times faster than the original one:

| Method         | Mean       | Error     | StdDev    |
|--------------- |-----------:|----------:|----------:|
| Lis            | 116.377 ms | 2.2566 ms | 2.1108 ms |
| LisSimdGeneric |   6.081 ms | 0.0816 ms | 0.0723 ms |
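
For reference, the benchmark can be set up along these lines (a sketch; the LisBenchmarks class and the LisAlgorithms wrapper are my own naming, not the post's original harness):

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class LisBenchmarks
{
    private int[] _input = null!;

    [GlobalSetup]
    public void Setup()
    {
        _input = Enumerable.Range(0, 10_000).ToArray();
        new Random(42).Shuffle(_input); // Random.Shuffle is available since .NET 8
    }

    [Benchmark(Baseline = true)]
    public int Lis() => LisAlgorithms.Lis(_input);

    [Benchmark]
    public int LisSimdGeneric() => LisAlgorithms.LisSimdGeneric<int>(_input);
}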

This performance gain cannot be justified by the SIMD instructions alone, given the theoretical maximum speedup of 8 established earlier.

If we re-run the performance test on a fully ordered input, the results reveal yet another interesting fact. Now the performance improvement is only about 3-fold:

| Method         | Mean      | Error     | StdDev    |
|--------------- |----------:|----------:|----------:|
| Lis            | 19.551 ms | 0.1689 ms | 0.1497 ms |
| LisSimdGeneric |  5.937 ms | 0.0903 ms | 0.0845 ms |

What may cause such a huge deviation? To find the answer, we need to understand how CPUs work a little better. Let me apply the [HardwareCounters(HardwareCounter.BranchMispredictions, HardwareCounter.BranchInstructions)] attribute to gain some further understanding.
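Applied to the hypothetical benchmark class sketched earlier (hardware counters use the ETW-based diagnoser, so this needs an elevated run on Windows):

[HardwareCounters(HardwareCounter.BranchMispredictions, HardwareCounter.BranchInstructions)]
public class LisBenchmarks
{
    // benchmarks as before
}

Re-running the shuffled-input benchmark with this attribute reports the counters below.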

| Method         | BranchMispredictions/Op | BranchInstructions/Op |
|--------------- |------------------------:|----------------------:|
| Lis            |              14,332,068 |           128,980,091 |
| LisSimdGeneric |                  28,687 |            19,056,026 |

CPUs use branch prediction to speed up code execution. In this case the CPU tries to predict which branch of the input[i] > input[j] condition will execute. While CPUs are really good at figuring out patterns, a randomly shuffled input array does not contain such patterns. When branch prediction fails, the CPU needs to discard the speculated work and execute the other branch, wasting precious cycles. This shows up in the results above: the Branch Mispredictions counter is several orders of magnitude higher for the non-SIMD implementation.

This is partially due to the fact that the Lis method has a branching instruction in its inner loop, while the SIMD implementation is branch-free.
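Placing the two inner-loop bodies side by side makes the difference visible:

// Scalar inner loop: a data-dependent branch per element, hard to predict on random input.
if (input[i] > input[j])
{
    longest = Math.Max(longest, dp[j] + 1);
}

// SIMD inner loop: both outcomes are computed and blended with a mask, no branch at all.
vMax = Vector.Max(vMax, Vector.ConditionalSelect(mask, vIncrementedDp, Vector<T>.One));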

Conclusion

In this post I introduced a typical interview exercise, the Longest Increasing Subsequence problem. I outlined the most common implementation in C#, and then showed how one could vectorize it. The examples are not production-ready code, so please don't use them as-is in production applications. Finally, I compared the performance of the two solutions and revealed an interesting finding: the inner loop of the SIMD implementation is branch-free, which achieves considerable additional gains in this case.