The Knapsack Problem with Hardware Intrinsics
03/07/2020
9 minutes
Acknowledgements
I would like to thank Kristof for reviewing this post. He gave me great ideas for exploring the problem further and for gaining a better understanding of AVX. Thank you, Kristof.
Introduction to the Problem
The Knapsack problem is an optimization task: given a set of items, each with a weight and a value, create a subset of these items so that their total weight is less than or equal to a given limit, and their total value is maximized.
An example is packing for a flight: each person may carry luggage with a maximum allowed weight defined by the airline. While you pack your bag, you select the carry-on items that give you the maximum comfort but still fit within the weight limit altogether.
Read more on the Knapsack problem's history and the solving method's complexity.
This post focuses on the 0/1 knapsack problem, which is a special case of the more generic knapsack problem family. In this case each item can be picked at most once. Let's also assume that each item's weight is a positive integer, and that we are given a maximum total weight, max.
Solution
The goal of this post is to provide two implementations of the 0/1 knapsack problem and compare their performance. For the whole picture, a third solution is also included in the comparison, but its implementation is not detailed, for brevity.
I will use a Dynamic Programming optimization method to solve the problem. Most people associate dynamic programming with computing a sort of table. Instead, one characteristic of dynamic programming (dp) is to optimize the solutions of sub-problems and combine these sub-solutions into an optimized overall solution. The other characteristic of dp is having overlapping sub-problems, meaning that to solve a problem F(n), we can leverage the result of F(n-1), as F(n) only adds small additional sub-problems on top of F(n-1). For example, if F(n) = F(n-1) * F(n-2), we can combine the results of F(n-1) and F(n-2) instead of recalculating every intermediate result from F(n-1) and F(n-2) down to F(0).
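To make the overlapping sub-problems tangible, here is a minimal memoization sketch of such an F. The base cases F(0) = 2 and F(1) = 3 are made up purely for illustration; the point is that each sub-result is computed once, stored, and then reused.

static long F(int n, long[] memo)
{
    if (memo[n] != 0)
        return memo[n]; // reuse a previously computed sub-result
    long result = n < 2
        ? (n == 0 ? 2 : 3) // hypothetical base cases, for illustration only
        : F(n - 1, memo) * F(n - 2, memo);
    return memo[n] = result; // store it for later calls
}

// Usage: F(10, new long[11]) computes each F(k) exactly once.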
Because of these characteristics, creating a table fits very well with solving dp problems, as the table holds the previously calculated results. Without going further into the theory, let's see the solution of the 0/1 knapsack problem.
Let's define a 2-dimensional table, where m[i, w] is the maximum value that can be achieved with a total weight less than or equal to w, using the first i items. m[i, w] is defined recursively:
m[0, w] = 0
m[i, w] = m[i-1, w] if the weight of item i is more than w (wi > w)
m[i, w] = max(m[i-1, w], m[i-1, w-wi] + vi) otherwise
where wi is the weight of item i and vi is the value of item i. In other words, m[i, w] is the maximum of the best value within weight w without item i, and the best value within weight w-wi (so that item i still fits within the weight limit w) plus the value of item i.
Great, where is my table?
Assume we have the following items and a weight limit max of 4.
item | weight | value |
---|---|---|
1 | 3 | 7 |
2 | 1 | 9 |
3 | 4 | 5 |
We can easily see that taking the items with weights 3 and 1 gives us the maximum value of 16. Here is the table showing the result:
item \ weight | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 7 | 7 |
2 | 0 | 9 | 9 | 9 | 16 |
3 | 0 | 9 | 9 | 9 | 16 |
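For example, the bottom-right cell follows directly from the recursive definition: m[3, 4] = max(m[2, 4], m[2, 0] + 5) = max(16, 5) = 16, and one row above, m[2, 4] = max(m[1, 4], m[1, 3] + 9) = max(7, 16) = 16, which is where the final value of 16 comes from.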
Each row can be calculated from left to right, and the rows from top to bottom. In the next section I will put this algorithm into C#.
Code
The Regular Solution
The regular C# solution goes as follows:
First, create an empty table. For performance reasons I use a single-dimensional, zero-indexed array.
Set the first row to 0.
Iterate over the items (rows).
For each item (row), fill the row from left to right by calculating m[i, w] according to the rules detailed above.
Note that this code is not production ready. It does not validate inputs, and it does not use DDD to avoid invalid usage. It also assumes that the values and weights arrays have the same size and are valid inputs for the 0/1 knapsack problem.
public static int Regular(int[] values, int[] weights, int max)
{
    max++;
    int[] table = new int[(values.Length + 1) * max];
    for (int i = 0; i < max; i++)
        table[i] = 0;
    for (int item = 1; item <= values.Length; item++)
    {
        int itemWeight = weights[item - 1];
        int itemValue = values[item - 1];
        for (int w = 0; w < max; w++)
        {
            if (itemWeight > w)
            {
                table[item * max + w] = table[(item - 1) * max + w];
            }
            else
            {
                table[item * max + w] = Math.Max(
                    table[(item - 1) * max + w],
                    table[(item - 1) * max + w - itemWeight] + itemValue);
            }
        }
    }
    return table[^1];
}
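As a quick sanity check, calling this method with the example items from the table above (weights 3, 1, 4 and values 7, 9, 5, with a weight limit of 4) should return the expected 16:

int[] values = { 7, 9, 5 };
int[] weights = { 3, 1, 4 };
Console.WriteLine(Regular(values, weights, 4)); // prints 16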
Using Hardware Intrinsics
The second solution uses the very same logic but has some optimizations applied:
Using the fact that the elements of a newly allocated integer array are initialized to 0 by default, so the first row does not need to be set explicitly.
Using an array copy when w < wi.
Using AVX2 vector operations otherwise.
public unsafe static int Avx(int[] values, int[] weights, int max)
{
    max++;
    int[] table = new int[(values.Length + 1) * max];
    for (int item = 1; item <= values.Length; item++)
    {
        int itemWeight = weights[item - 1];
        var itemValueSlice = Vector256.Create(values[item - 1]);
        Array.Copy(table, (item - 1) * max, table, item * max, Math.Min(itemWeight, max));
        int w = itemWeight;
        fixed (int* t = table)
        {
            for (w = itemWeight; w < max - 8; w += 8)
            {
                var excludeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w);
                var includeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w - itemWeight);
                includeItemSlice = Avx2.Add(includeItemSlice, itemValueSlice);
                var maxSlice = Avx2.Max(excludeItemSlice, includeItemSlice);
                Avx2.Store(t + item * max + w, maxSlice);
            }
        }
        var itemValue = values[item - 1];
        for (; w < max; w++)
        {
            table[item * max + w] = Math.Max(
                table[(item - 1) * max + w],
                table[(item - 1) * max + w - itemWeight] + itemValue);
        }
    }
    return table[^1];
}
Let me detail the inner loop using the AVX2 hardware intrinsics:
var excludeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w);
var includeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w - itemWeight);
includeItemSlice = Avx2.Add(includeItemSlice, itemValueSlice);
var maxSlice = Avx2.Max(excludeItemSlice, includeItemSlice);
Avx2.Store(t + item * max + w, maxSlice);
The first line loads 8 integers (8 × 32 bits) from the array into a 256-bit register. These values hold m[i-1, w], m[i-1, w+1], m[i-1, w+2], etc.
The second line loads 8 integers from the array into another 256-bit register; these values hold m[i-1, w-wi], m[i-1, w-wi+1], m[i-1, w-wi+2], etc.
The third line calculates m[i-1, w-wi] + vi, m[i-1, w-wi+1] + vi, m[i-1, w-wi+2] + vi, etc.
The fourth line calculates the pairwise maximum of the two vectors above.
The last line writes the result back into the array.
At the end of the inner loop, the last few entries of the row must be processed with a scalar, entry-by-entry calculation, as we cannot safely fit them into a 256-bit register anymore.
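Note that this implementation assumes the CPU supports AVX2. One way to stay on the safe side is to check Avx2.IsSupported at runtime and fall back to the scalar implementation otherwise; the following sketch assumes both methods shown above live in the same class:

// Minimal dispatch sketch: use the vectorized path only when the CPU supports AVX2.
public static int Knapsack(int[] values, int[] weights, int max)
{
    return Avx2.IsSupported
        ? Avx(values, weights, max)      // the AVX2-based implementation above
        : Regular(values, weights, max); // scalar fallback
}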
Recursive Solution
The last solution, named "Recursive", uses recursion to calculate the maximum value. For brevity this implementation is not detailed here, but it finds the result using fewer calculations compared to the previous implementations.
Benchmarks
To compare the performance of these implementations, I used BenchmarkDotNet to measure the mean execution time and memory allocation.
I have executed 3 batches of tests with different numbers of items:
100 items
1000 items
10000 items
The items are in random order, with random weights and values generated at the beginning of each batch of tests. Within each batch I executed the tests three times, parameterizing the total weight limit to 100, 1000 and 2000.
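The exact benchmark harness is not shown in this post. The sketch below only illustrates how such a BenchmarkDotNet setup could look for the 100-item batch; the class name, field names, and random generation details are my assumptions, and it presumes the Regular and Avx methods are accessible from the benchmark class.

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class KnapsackBenchmarks
{
    private int[] _values;
    private int[] _weights;

    [Params(100, 1000, 2000)]
    public int MaxSize { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        // Random weights and values generated once, at the beginning of the batch.
        var random = new Random(42);
        const int itemCount = 100; // 1000 and 10000 for the other batches
        _values = Enumerable.Range(0, itemCount).Select(_ => random.Next(1, 100)).ToArray();
        _weights = Enumerable.Range(0, itemCount).Select(_ => random.Next(1, 100)).ToArray();
    }

    [Benchmark]
    public int KnapsackRegular() => Regular(_values, _weights, MaxSize);

    [Benchmark]
    public int KnapsackAvx() => Avx(_values, _weights, MaxSize);
}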
100 items results
Method | MaxSize | Mean | Error | StdDev | Allocated |
---|---|---|---|---|---|
KnapsackRegular | 100 | 16.567 us | 0.3240 us | 0.4097 us | 39.88 KB |
KnapsackAvx | 100 | 4.688 us | 0.0898 us | 0.0840 us | 39.88 KB |
KnapsackRecursive | 100 | 83.431 us | 1.6492 us | 2.0857 us | 39.88 KB |
KnapsackRegular | 1000 | 261.778 us | 4.9368 us | 4.6178 us | 394.95 KB |
KnapsackAvx | 1000 | 119.208 us | 2.2837 us | 2.4435 us | 394.95 KB |
KnapsackRecursive | 1000 | 1,227.043 us | 19.7313 us | 15.4049 us | 394.96 KB |
KnapsackRegular | 2000 | 525.410 us | 7.7401 us | 7.2401 us | 789.49 KB |
KnapsackAvx | 2000 | 233.600 us | 4.4236 us | 4.7332 us | 789.48 KB |
KnapsackRecursive | 2000 | 2,061.847 us | 39.1731 us | 36.6425 us | 789.49 KB |
1000 items results
Method | MaxSize | Mean | Error | StdDev | Allocated |
---|---|---|---|---|---|
KnapsackRegular | 100 | 212.7 us | 4.09 us | 3.83 us | 394.96 KB |
KnapsackAvx | 100 | 140.6 us | 1.85 us | 1.54 us | 394.95 KB |
KnapsackRecursive | 100 | 854.2 us | 16.57 us | 25.80 us | 394.95 KB |
KnapsackRegular | 1000 | 2,166.9 us | 36.18 us | 33.84 us | 3914.1 KB |
KnapsackAvx | 1000 | 1,261.1 us | 24.17 us | 22.61 us | 3914.1 KB |
KnapsackRecursive | 1000 | 12,801.6 us | 249.91 us | 245.44 us | 3914.17 KB |
KnapsackRegular | 2000 | 3,710.4 us | 46.44 us | 41.17 us | 7824.25 KB |
KnapsackAvx | 2000 | 1,292.6 us | 7.59 us | 6.73 us | 7824.26 KB |
KnapsackRecursive | 2000 | 28,933.0 us | 190.60 us | 168.96 us | 7824.24 KB |
10000 items results
Method | MaxSize | Mean | Error | StdDev | Allocated |
---|---|---|---|---|---|
KnapsackRegular | 100 | 1.988 ms | 0.0256 ms | 0.0200 ms | 3.85 MB |
KnapsackAvx | 100 | 1.345 ms | 0.0294 ms | 0.0440 ms | 3.85 MB |
KnapsackRegular | 1000 | 20.095 ms | 0.3934 ms | 0.3680 ms | 38.19 MB |
KnapsackAvx | 1000 | 14.590 ms | 0.2231 ms | 0.2087 ms | 38.19 MB |
KnapsackRegular | 2000 | 40.392 ms | 0.3806 ms | 0.3374 ms | 76.34 MB |
KnapsackAvx | 2000 | 27.966 ms | 0.2346 ms | 0.2194 ms | 76.34 MB |
Memory
We may conclude that the allocated memory is the same for each implementation. It is easy to see that most of it is the array allocated at the beginning. For the simplest case, new int[(values.Length + 1) * max] with values.Length == 100 and max == 100 (incremented to 101 inside the method), the allocated array is 101 * 101 * 4 bytes / 1024 ≈ 39.85 KB. There are some additional object and array headers I have not added to this result, but as we can see, most of the allocated memory is due to the array being used.
Mean Execution Time
The main interesting question is how vectorization performs compared to the original implementation. As seen in most test cases, it gains a 2-4 times improvement over the non-vectorized implementation. One might also ask why the results for the recursive implementation are missing from the 10000 items batch: with this many items, it throws a StackOverflowException. Every recursive implementation can be converted to a non-recursive one, which could be done in this case for further comparisons, but for brevity it is omitted from this post.
Another aspect to consider is the CPU cache sizes (L1, L2, etc.): the longer the rows and the larger the weight limit, the more often we need to hit main memory to fetch data instead of finding it in the CPU cache. Accessing RAM is significantly slower than using data from the CPU cache, and this can undermine the benefits of vectorization.
Conclusion
Using AVX registers can speed up execution time. Even though the recursive algorithm is expected to be the fastest, as it requires less computation on the given inputs, in practice it was the slowest. Not every problem can benefit from using AVX: some problems gain a significant performance improvement, while others pay an additional penalty, or their gain is undermined by other factors. The suggestion is to measure the code to verify whether a given use-case benefits from vectorization or not.
As a last note, the current implementation can be improved to allocate less memory by only keeping the current row and the previous row in memory.
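For illustration, here is a minimal sketch of that idea applied to the regular implementation. This is not the code that was benchmarked above, just the two-row layout spelled out:

// Sketch: keep only the previous and the current row instead of the full table.
public static int RegularTwoRows(int[] values, int[] weights, int max)
{
    max++;
    int[] previous = new int[max]; // row i-1, zero-initialized by default
    int[] current = new int[max];  // row i
    for (int item = 1; item <= values.Length; item++)
    {
        int itemWeight = weights[item - 1];
        int itemValue = values[item - 1];
        for (int w = 0; w < max; w++)
        {
            current[w] = itemWeight > w
                ? previous[w]
                : Math.Max(previous[w], previous[w - itemWeight] + itemValue);
        }
        (previous, current) = (current, previous); // swap the row buffers
    }
    return previous[max - 1];
}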