The Knapsack Problem with Hardware Intrinsics
03/07/2020
9 minutes
Acknowledgements
I would like to thank Kristof for reviewing this post. He gave me great ideas for exploring the problem further and for gaining a better understanding of AVX. Thank you, Kristof.
Introduction to the Problem
The Knapsack problem is an optimization task: given a set of items, each with a weight and a value, create a subset of these items so that their total weight is less than or equal to a given limit, and their total value is maximized.
An example is packing for a flight: each person may carry luggage with a maximum allowed weight defined by the airline. While you pack your bag, you select the carry-on items that give you the maximum comfort but still fit within the weight limit altogether.
Read more on the Knapsack problem's history and the solving method's complexity.
This post focuses on the 0/1 knapsack problem, which is a special case of the more generic knapsack problem family. In this case each item can be picked at most once. Let's also assume that each item's weight is a positive integer, and that we are given a maximum total weight, max.
Solution
The goal of this post is to provide two implementations of the 0/1 knapsack problem and compare their performance. For the whole picture, a third solution is also included in the comparison, but its implementation is not detailed, for brevity.
I will use a Dynamic Programming optimization method to solve the problem. Most people associate dynamic programming with computing a sort of table. Instead, one characteristic of dynamic programming (dp) is to optimize the solutions of sub-problems and combine these sub-solutions into an optimized overall solution. The other characteristic of dp is having overlapping sub-problems, meaning that to solve a problem F(n), we can leverage the result of F(n-1), as F(n) only adds small additional sub-problems on top of F(n-1). For example, if F(n) = F(n-1) * F(n-2), we can combine the results of F(n-1) and F(n-2) instead of recalculating every intermediate result from F(n-1) and F(n-2) down to F(0).
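To make the overlapping sub-problems tangible, here is a minimal memoization sketch of such an F. The base cases F(0) = 2 and F(1) = 3 are made up purely for illustration; the point is that each sub-result is computed once, stored, and then reused.

static long F(int n, long[] memo)
{
    if (memo[n] != 0)
        return memo[n]; // reuse a previously computed sub-result
    long result = n < 2
        ? (n == 0 ? 2 : 3) // hypothetical base cases, for illustration only
        : F(n - 1, memo) * F(n - 2, memo);
    return memo[n] = result; // store it for later calls
}

// Usage: F(10, new long[11]) computes each F(k) exactly once.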
Because of these characteristics, creating a table fits very well with solving dp problems, as the table holds the previously calculated results. Without going further into the theory, let's see the solution of the 0/1 knapsack problem.
Let's define a 2-dimensional table, where m[i, w] is the maximum value that can be achieved with a total weight less than or equal to w, using the first i items. m[i, w] is defined recursively:
m[0, w] = 0
m[i, w] = m[i-1, w] if the weight of item i is more than w (wi > w)
m[i, w] = max(m[i-1, w], m[i-1, w-wi] + vi) otherwise
where wi is the weight of item i and vi is the value of item i. In other words, m[i, w] is the maximum of the best value within weight w without item i, and the best value within weight w-wi (so that item i still fits within the weight limit w) plus the value of item i.
Great, where is my table?
Assume we have the following items and a weight limit max of 4.
item | weight | value |
---|---|---|
1 | 3 | 7 |
2 | 1 | 9 |
3 | 4 | 5 |
We can easily see that taking the items with weights 3 and 1 gives us the maximum value of 16. Here is the table showing the result:
item \ weight | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 7 | 7 |
2 | 0 | 9 | 9 | 9 | 16 |
3 | 0 | 9 | 9 | 9 | 16 |
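For example, the bottom-right cell follows directly from the recursive definition: m[3, 4] = max(m[2, 4], m[2, 0] + 5) = max(16, 5) = 16, and one row above, m[2, 4] = max(m[1, 4], m[1, 3] + 9) = max(7, 16) = 16, which is where the final value of 16 comes from.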
Each row can be calculated from left to right, and the rows from top to bottom. In the next section I will put this algorithm into C#.
Code
The Regular Solution
The regular C# solution goes as follows:
First, create an empty table. For performance reasons I use a single-dimensional, zero-indexed array.
Set the first row to 0.
Iterate over the items (rows).
For each item (row), fill the row from left to right by calculating m[i, w] according to the rules detailed above.
Note that this code is not production ready. It does not validate inputs, and it does not use DDD to avoid invalid usage. It also assumes that the values and weights arrays have the same size and are valid inputs for the 0/1 knapsack problem.
public static int Regular(int[] values, int[] weights, int max)
{
    max++;
    int[] table = new int[(values.Length + 1) * max];
    for (int i = 0; i < max; i++)
        table[i] = 0;
    for (int item = 1; item <= values.Length; item++)
    {
        int itemWeight = weights[item - 1];
        int itemValue = values[item - 1];
        for (int w = 0; w < max; w++)
        {
            if (itemWeight > w)
            {
                table[item * max + w] = table[(item - 1) * max + w];
            }
            else
            {
                table[item * max + w] = Math.Max(
                    table[(item - 1) * max + w],
                    table[(item - 1) * max + w - itemWeight] + itemValue);
            }
        }
    }
    return table[^1];
}
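As a quick sanity check, calling this method with the example items from the table above (weights 3, 1, 4 and values 7, 9, 5, with a weight limit of 4) should return the expected 16:

int[] values = { 7, 9, 5 };
int[] weights = { 3, 1, 4 };
Console.WriteLine(Regular(values, weights, 4)); // prints 16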
Using Hardware Intrinsics
The second solution uses the very same logic but has some optimizations applied:
Using the fact that the elements of a newly allocated integer array are initialized to 0 by default, so the first row does not need to be set explicitly.
Using an array copy when w < wi.
Using AVX2 vector operations otherwise.
public unsafe static int Avx(int[] values, int[] weights, int max)
{
    max++;
    int[] table = new int[(values.Length + 1) * max];
    for (int item = 1; item <= values.Length; item++)
    {
        int itemWeight = weights[item - 1];
        var itemValueSlice = Vector256.Create(values[item - 1]);
        Array.Copy(table, (item - 1) * max, table, item * max, Math.Min(itemWeight, max));
        int w = itemWeight;
        fixed (int* t = table)
        {
            for (w = itemWeight; w < max - 8; w += 8)
            {
                var excludeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w);
                var includeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w - itemWeight);
                includeItemSlice = Avx2.Add(includeItemSlice, itemValueSlice);
                var maxSlice = Avx2.Max(excludeItemSlice, includeItemSlice);
                Avx2.Store(t + item * max + w, maxSlice);
            }
        }
        var itemValue = values[item - 1];
        for (; w < max; w++)
        {
            table[item * max + w] = Math.Max(
                table[(item - 1) * max + w],
                table[(item - 1) * max + w - itemWeight] + itemValue);
        }
    }
    return table[^1];
}
Let me detail the inner loop using the AVX2 hardware intrinsics:
var excludeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w);
var includeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w - itemWeight);
includeItemSlice = Avx2.Add(includeItemSlice, itemValueSlice);
var maxSlice = Avx2.Max(excludeItemSlice, includeItemSlice);
Avx2.Store(t + item * max + w, maxSlice);
The first line loads 8 integers (8 × 32 bits) from the array into a 256-bit register. These values hold m[i-1, w], m[i-1, w+1], m[i-1, w+2], etc.
The second line loads 8 integers from the array into another 256-bit register; these values hold m[i-1, w-wi], m[i-1, w-wi+1], m[i-1, w-wi+2], etc.
The third line calculates m[i-1, w-wi] + vi, m[i-1, w-wi+1] + vi, m[i-1, w-wi+2] + vi, etc.
The fourth line calculates the pairwise maximum of the two vectors above.
The last line writes the result back into the array.
At the end of the inner loop, the last few entries of the row must be processed with a scalar, entry-by-entry calculation, as we cannot safely fit them into a 256-bit register anymore.
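Note that this implementation assumes the CPU supports AVX2. One way to stay on the safe side is to check Avx2.IsSupported at runtime and fall back to the scalar implementation otherwise; the following sketch assumes both methods shown above live in the same class:

// Minimal dispatch sketch: use the vectorized path only when the CPU supports AVX2.
public static int Knapsack(int[] values, int[] weights, int max)
{
    return Avx2.IsSupported
        ? Avx(values, weights, max)      // the AVX2-based implementation above
        : Regular(values, weights, max); // scalar fallback
}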
Recursive Solution
The last solution, named "Recursive", uses recursion to calculate the maximum value. For brevity this implementation is not detailed here, but it finds the result using fewer calculations compared to the previous implementations.
Benchmarks
To compare the performance of these implementations, I used BenchmarkDotNet to measure the mean execution time and memory allocation.
I have executed 3 batches of tests with different numbers of items:
100 items
1000 items
10000 items
The items are in random order, with random weights and values generated at the beginning of each batch of tests. Within each batch I executed the tests three times, parameterizing the total weight limit to 100, 1000 and 2000.
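The exact benchmark harness is not shown in this post. The sketch below only illustrates how such a BenchmarkDotNet setup could look for the 100-item batch; the class name, field names, and random generation details are my assumptions, and it presumes the Regular and Avx methods are accessible from the benchmark class.

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class KnapsackBenchmarks
{
    private int[] _values;
    private int[] _weights;

    [Params(100, 1000, 2000)]
    public int MaxSize { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        // Random weights and values generated once, at the beginning of the batch.
        var random = new Random(42);
        const int itemCount = 100; // 1000 and 10000 for the other batches
        _values = Enumerable.Range(0, itemCount).Select(_ => random.Next(1, 100)).ToArray();
        _weights = Enumerable.Range(0, itemCount).Select(_ => random.Next(1, 100)).ToArray();
    }

    [Benchmark]
    public int KnapsackRegular() => Regular(_values, _weights, MaxSize);

    [Benchmark]
    public int KnapsackAvx() => Avx(_values, _weights, MaxSize);
}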
100 items results
Method | MaxSize | Mean | Error | StdDev | Allocated |
---|---|---|---|---|---|
KnapsackRegular | 100 | 16.567 us | 0.3240 us | 0.4097 us | 39.88 KB |
KnapsackAvx | 100 | 4.688 us | 0.0898 us | 0.0840 us | 39.88 KB |
KnapsackRecursive | 100 | 83.431 us | 1.6492 us | 2.0857 us | 39.88 KB |
KnapsackRegular | 1000 | 261.778 us | 4.9368 us | 4.6178 us | 394.95 KB |
KnapsackAvx | 1000 | 119.208 us | 2.2837 us | 2.4435 us | 394.95 KB |
KnapsackRecursive | 1000 | 1,227.043 us | 19.7313 us | 15.4049 us | 394.96 KB |
KnapsackRegular | 2000 | 525.410 us | 7.7401 us | 7.2401 us | 789.49 KB |
KnapsackAvx | 2000 | 233.600 us | 4.4236 us | 4.7332 us | 789.48 KB |
KnapsackRecursive | 2000 | 2,061.847 us | 39.1731 us | 36.6425 us | 789.49 KB |
1000 items results
Method | MaxSize | Mean | Error | StdDev | Allocated |
---|---|---|---|---|---|
KnapsackRegular | 100 | 212.7 us | 4.09 us | 3.83 us | 394.96 KB |
KnapsackAvx | 100 | 140.6 us | 1.85 us | 1.54 us | 394.95 KB |
KnapsackRecursive | 100 | 854.2 us | 16.57 us | 25.80 us | 394.95 KB |
KnapsackRegular | 1000 | 2,166.9 us | 36.18 us | 33.84 us | 3914.1 KB |
KnapsackAvx | 1000 | 1,261.1 us | 24.17 us | 22.61 us | 3914.1 KB |
KnapsackRecursive | 1000 | 12,801.6 us | 249.91 us | 245.44 us | 3914.17 KB |
KnapsackRegular | 2000 | 3,710.4 us | 46.44 us | 41.17 us | 7824.25 KB |
KnapsackAvx | 2000 | 1,292.6 us | 7.59 us | 6.73 us | 7824.26 KB |
KnapsackRecursive | 2000 | 28,933.0 us | 190.60 us | 168.96 us | 7824.24 KB |
10000 items results
Method | MaxSize | Mean | Error | StdDev | Allocated |
---|---|---|---|---|---|
KnapsackRegular | 100 | 1.988 ms | 0.0256 ms | 0.0200 ms | 3.85 MB |
KnapsackAvx | 100 | 1.345 ms | 0.0294 ms | 0.0440 ms | 3.85 MB |
KnapsackRegular | 1000 | 20.095 ms | 0.3934 ms | 0.3680 ms | 38.19 MB |
KnapsackAvx | 1000 | 14.590 ms | 0.2231 ms | 0.2087 ms | 38.19 MB |
KnapsackRegular | 2000 | 40.392 ms | 0.3806 ms | 0.3374 ms | 76.34 MB |
KnapsackAvx | 2000 | 27.966 ms | 0.2346 ms | 0.2194 ms | 76.34 MB |
Memory
We may conclude that the allocated memory is the same for each implementation. It is easy to see that most of it is the array allocated at the beginning. For the simplest case, new int[(values.Length + 1) * max] with values.Length == 100 and max == 100 (incremented to 101 inside the method), the allocated array is 101 * 101 * 4 bytes / 1024 ≈ 39.85 KB. There are some additional object and array headers I have not added to this result, but as we can see, most of the allocated memory is due to the array being used.
Mean Execution Time
The main interesting question is how vectorization performs compared to the original implementation. As seen in most test cases, it gains a 2-4 times improvement over the non-vectorized implementation. One might also ask why the results for the recursive implementation are missing from the 10000 items batch: with this many items, it throws a StackOverflowException. Every recursive implementation can be converted to a non-recursive one, which could be done in this case for further comparisons, but for brevity it is omitted from this post.
Another aspect to consider is the CPU cache sizes (L1, L2, etc.): the longer the rows and the larger the weight limit, the more often we need to hit main memory to fetch data instead of finding it in the CPU cache. Accessing RAM is significantly slower than using data from the CPU cache, and this can undermine the benefits of vectorization.
Conclusion
Using AVX registers can speed up execution time. Even though the recursive algorithm is expected to be the fastest, as it requires less computation on the given inputs, in practice it was the slowest. Not every problem can benefit from using AVX: some problems gain a significant performance improvement, while others pay an additional penalty, or their gain is undermined by other factors. The suggestion is to measure the code to verify whether a given use-case benefits from vectorization or not.
As a last note, the current implementation can be improved to allocate less memory by only keeping the current row and the previous row in memory.
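For illustration, here is a minimal sketch of that idea applied to the regular implementation. This is not the code that was benchmarked above, just the two-row layout spelled out:

// Sketch: keep only the previous and the current row instead of the full table.
public static int RegularTwoRows(int[] values, int[] weights, int max)
{
    max++;
    int[] previous = new int[max]; // row i-1, zero-initialized by default
    int[] current = new int[max];  // row i
    for (int item = 1; item <= values.Length; item++)
    {
        int itemWeight = weights[item - 1];
        int itemValue = values[item - 1];
        for (int w = 0; w < max; w++)
        {
            current[w] = itemWeight > w
                ? previous[w]
                : Math.Max(previous[w], previous[w - itemWeight] + itemValue);
        }
        (previous, current) = (current, previous); // swap the row buffers
    }
    return previous[max - 1];
}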