Laszlo - Vectorized Subset Sum

Vectorized Subset Sum

08/25/2024

8 minutes

In this series I am implementing Dynamic Programming Katas with vectorized solutions.

The first post explored the Vectorized Longest Increasing Subsequence problem, and in the second post I dived into the Checkerboard problem.

In this post I will explore a version of the Subset sum problem. I will not focus on the most efficient algorithm or implementation for this problem. Instead, I will present the most common approach to solve the problem using Dynamic Programming and then present a vectorized version of the algorithm.

Introduction

In the subset sum problem a set of integers and a target number is given. The task is to decide whether any subset of the integers adds up precisely to the target number. The subset sum problem has multiple variants. In this post I focus on the simplest variant that allows only positive integers in the input.

For example, given the following input would result these outputs:

([3, 4, 2, 6], 5) => true, because 3+2=5
([3, 4, 2, 6], 11) => true, because 3+2+6=11
([3, 4, 2, 6], 14) => false, because no subset of integers sums 14

The Naive Solution

The idea behind the Dynamic Programming approach is to reduce the problem to a sub-problem. Solve the sub-problem recursively and store the intermediate results for the future calculations.

In this case the sub-problem is the following: have a set of numbers and all the possible sum values that any subset of this set can produce. Take one (yet unvisited) number from the input, and add it to this set; then create all the new possible sums with this new number. When the target sum is produced as a possible sum, the algorithm may terminate and return true. Otherwise, if all inputs are visited and the algorithm has not terminated yet, return false.

The simplest implementation fills a table (dp) row-by-row. The columns of this table are the numbers from [0..n], where n is the target sum. The rows of the table are the input integers. For each cells determine the value:

if (input[i] > j) then dp[i][j] = dp[i-1][j], where j is the column index.
else dp[i][j] = dp[i-1][j] || dp[i-1][j-input[i-1]]

For example, for the input of [3, 4, 2, 6] and a target number of 5, the following table would be created (where the first row stands for an empty set):

	0	1	2	3	4	5
	T	F	F	F	F	F
3	T	F	F	T	F	F
4	T	F	F	T	T	F
2	T	F	T	T	T	T
6	T	F	T	T	T	T

The corresponding C# implementation could look like the following: allocate table with dimensions of (input count + 1) * (target sum + 1). Fill the first row (corresponding to the empty set) with false for all cells except for column 0, which in this case would be true. Then iterate all rows and columns and apply the rules described above. The code snippet is commented for further understanding.

Note that the code snippets are not production ready. They demonstrate the overall idea of the algorithms but lack handing edge cases.

internal bool SubsetSum(int[] input, int sum)
{
    // Allocate a table
    var dp = new bool[input.Length + 1][];

    // Fill in the first row
    for (int i = 0; i < dp.Length; i++)
        dp[i] = new bool[sum + 1];
    dp[0][0] = true;

    // Fill in ther rest of the rows
    for (int i = 1; i < dp.Length; i++)
    {
        for (int j = 0; j < dp[i].Length; j++)
        {
            if (j < input[i - 1])
                dp[i][j] = dp[i - 1][j];
            else
                dp[i][j] = dp[i - 1][j] || dp[i - 1][j - input[i - 1]];
        }

        // Return early, when the targget sum is set to true.
        if (dp[i][sum])
            return true;
    }

    return dp[input.Length][sum];
}

Vectorization

Let me present a vectorized implementation. While the algorithm and the main idea are the same, the implementation differs in a few places. Vectorization usually means using the SIMD. Vectors are stored in special SIMD registers, different CPUs/architectures are equipped with different register sizes (e.g. 64, 128, 256 and 512 bits). The machine used for this blog post has 256-bit long SIMD registers. That means it can process 32 byte values with a single instruction. That means an algorithm using bytes may be sped up by a factor of 32 on this machine. In practice it is rare to see such improvements, as loading data into the SIMD registers and saving the results also takes a few extra operations. Also note, that the vectorized algorithm is still O(n*m), because constant factors are eliminated by the definition of Big O notation.

In .NET Vector<T> represents a single vector for parallel algorithms. The JIT uses intrinsics to decide at runtime which concrete type and instructions can be used on the given architecture it runs on. A developer could also choose to use those lower-level types directly, such as Vector256<T>. However, an algorithm using Vector256<T> directly might not leverage all the capabilities of CPU supporting larger registers or fail to on CPU not having 256-bit long registers. To use lower-level types a developer would need to check the supported capabilities of the machine and branch to a corresponding implementation. While one can branch the code based on the supported instruction set, it is easier to use the higher-level types when possible.

For the vectorized implementation I commented on the code snippet, and that is followed by a more detailed description beneath.

internal bool SubsetSumSimd(int[] input, int sum)
{
    // Rent two rows instead of a complete table
    var dp0 = ArrayPool<byte>.Shared.Rent(sum + 1 + Vector<byte>.Count);
    var dp1 = ArrayPool<byte>.Shared.Rent(sum + 1 + Vector<byte>.Count);
    ReadOnlySpan<byte[]> dp = [dp0, dp1];
    try
    {
        var currentRow = 0;
        var prevRow = 1;

        // Fill in the first row
        for (int i = 0; i < sum + 1; i += Vector<byte>.Count)
            Vector.StoreUnsafe(Vector<byte>.Zero, ref dp[currentRow][i]);
        dp[0][0] = 1;

        for (int i = 0; i < input.Length; i++)
        {
            // Swap row indexes
            var tempRowIndex = prevRow;
            prevRow = currentRow;
            currentRow = tempRowIndex;

            // Copy previous row until j<input[i]
            var copyToLength = Math.Min(input[i], sum + 1);
            dp[prevRow].AsSpan(0, copyToLength).CopyTo(dp[currentRow]);

            // Fill the rest of the row
            for (int j = input[i]; j < sum + 1; j += Vector<byte>.Count)
            {
                var vPrevRow = Vector.LoadUnsafe(ref dp[prevRow][j]);
                var vPrevRowOffset = Vector.LoadUnsafe(ref dp[prevRow][j - input[i]]);
                Vector.StoreUnsafe(Vector.BitwiseOr(vPrevRow, vPrevRowOffset), ref dp[currentRow][j]);
            }

            // Return early, when the targget sum is set to true.
            if (dp[currentRow][sum] == 1)
                return true;
        }

        return dp[currentRow][sum] == 1;
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(dp[0]);
        ArrayPool<byte>.Shared.Return(dp[1]);
    }
}

The key differences in the vectorized implementation:

It uses bytes instead of bools to store the state. This is due to the lack of SIMD instructions supporting bools in the .NET 8. However, a bool can be stored as 0 and 1 in a byte.
Instead of allocating a table, it rents only 2 arrays because only the previous row is required when filling in a current row. This implementation swaps the two arrays (previous and current) before each iteration.
Renting the arrays helps to save heap allocations. The variable dp is a ReadOnlySpan<byte[]>, so that it is allocated on the stack. Note, that this is only the size of 2 pointers to the underlying arrays, hence it safe from stackoverflow point of view.
The previous and current arrays are a Vector<T>.Count longer to the target sum value. This helps with the last vector calculated for a row. Otherwise a vector might read values beyond the end of the array, that would result in a memory access violation. This way the algorithm can still read safely a complete vector at the last relevant index of the array. A few 'uninitialized' entries might also be processed, but we can avoid falling back to a non-vectorized solution.
Uses CopyTo method to copy values input[i] < j, as it is already vectorized.

Performance

I have compared the performance of the two implementations using BenchmarkDotNet and .NET 8.0.2.

The benchmarking machine supports AVX2 instruction set. The input is an array of integers from the [1..512] range in a random permutation. The target sum is 100000. The vectorized algorithm is not only 33 times faster but is close to allocation free (besides renting the arrays):

Method	Mean	Error	StdDev	Gen0	Gen1	Gen2	Allocated
SubsetSum	46.849 ms	0.9171 ms	1.2553 ms	900.0000	900.0000	900.0000	51320466 B
SubsetSumSimd	1.451 ms	0.0287 ms	0.0488 ms	-	-	-	1 B

Conclusion

In this post I introduced the Subset Sum problem. I have outlined the most common implementation in C#. Then shown how one could consider vectorizing this problem. The examples are not production ready code, please don't use them as-is in prod apps. Finally, I compared the performance of the two solutions.

performance C# vector SIMD

Hello, I am Laszlo

.NET developer