 # Hello, I am Laszlo

## Acknowledgements

I would like to thank Kristof for reviewing this post. He gave me great ideas to further discover the problem and to have a better understanding on AVX. Thank you, Kristof.

## Introduction to the Problem

The Knapsack problem is an optimization task: given a set of items, each with a weight and a value, create a subset of these items so that their total weight is less than or equal to a given limit, and their total value is maximized.

An example can be when flying an airplane: each person may carry a luggage with a maximum allowed weight defined by the airline. While you pack your bag, select the carry-on items that give you the maximum comfort, but still altogether fit the weight limit.

Read more on the Knapsack problem's history and the solving method's complexity.

This post is focusing on the 0/1 knapsack problem, which is a special case of the more generic Knapsack problem family. In this case all items can be picked at most once. Let's also assume that each item's weight is a positive integer, and we are given a maximum total weight: `max`.

## Solution

The goal of this post is to provide two implementations to the 0/1 knapsack problem and compare them from performance point of view. For the whole picture, a third solution is also added for the comparison, but its implementation is not detailed for brevity.

I will use a Dynamic Programming optimization method to solve the problem. From dynamic programming most people associate on computing a sort of a table. Instead, one characteristic of dynamic programming (dp) is to optimize the solution of a sub-problem, and combine these sub-solutions to an optimized solution. The other characteristic of dp is having overlapping sub-problems, meaning that for solving a problem `F(n)`, we can leverage the result of `F(n-1)`, as `F(n)` only adds small additional sub-problems on top of `F(n-1)`. For example, if `F(n) = F(n-1) * F(n-2)` we can combine the results of `F(n-1)` and `F(n-2)` instead of recalculating every intermediate result from `F(n-1)` and `F(n-2)` to `F(0)`.

Because of these characteristics, the creation of a table fits very well with solving dp problems, as the table holds a memory for the previously calculated results. Without going further into the details of theory, let's see what is the solution of the 0/1 knapsack problem.

Let's define a 2-dimensional table, where `m[i,w]` is the maximum value that can be achieved within the weight less than or equal to w, using the first i items. `m[i,w]` defined recursively:

• m[0,w] = 0

• m[i, w] = m[i-1, w] if the weight of item i is more than w

• m[i, w] = max(m[i-1, w], m[i-1, w-wi] + vi) otherwise. wi: the weight of item i, vi: the value of item i; In other words `m[i,w]` is the maximum of the value within w weight (without item i) AND the value within w-wi (so we can include item i within the weight limit of w) plus the value of item i.

Great, where is my table?

Assume we have the following items and a weight limit `max` of 4.

item

weight

value

1

3

7

2

1

9

3

4

5

We can easily see that taking items with weight 3 and 1 is giving us the maximum value of 16. Here is the table showing the result:

item

0

1

2

3

4

0

0

0

0

0

1

0

0

0

7

7

2

0

9

9

9

16

3

0

9

9

9

16

Each row can be calculated from left to right, and then rows from the top to the bottom. In the next section I will put this algorithm into C#.

## Code

### The Regular Solution

The regular C# solution goes as follows:

1. First create an empty table. For performance reasons I use a single dimensional, zero indexed array.

2. Set 0 for the first row.

3. Iterate over the items (rows).

4. For each item (row) fill the row from left to right by calculating `m[i,w]` according to the rules detailed above.

Note, that this code is not production ready. It does not validate inputs, it does not use DDD to avoid invalid usage. It also assumes that size of `values` and `weights` arrays are the same, and valid inputs for the 0/1 knapsack problem.

``````public static int Regular(int[] values, int[] weights, int max)
{
max++;
int[] table = new int[(values.Length + 1) * max];
for (int i = 0; i < max; i++)
table[i] = 0;

for (int item = 1; item <= values.Length; item++)
{
int itemWeight = weights[item - 1];
int itemValue = values[item - 1];
for (int w = 0; w < max; w++)
{
if (itemWeight > w)
{
table[item * max + w] = table[(item - 1) * max + w];
}
else
{
table[item * max + w] = Math.Max(table[(item - 1) * max + w], table[(item - 1) * max + w - itemWeight] + itemValue);
}
}

}
return table[^1];
}``````

### Using Hardware Intrinsics

The second solution uses the very same logic but has some optimizations applied:

• Using the fact that the values of an integer array are set to 0 by definition.

• Using array copy when w <= wi.

• Using AVX2 vector operations otherwise.

``````public unsafe static int Avx(int[] values, int[] weights, int max)
{
max++;
int[] table = new int[(values.Length + 1) * max];

for (int item = 1; item <= values.Length; item++)
{
int itemWeight = weights[item - 1];
var itemValueSlice = Vector256.Create(values[item - 1]);
Array.Copy(table, (item - 1) * max, table, item * max, Math.Min(itemWeight, max));

int w = itemWeight;
fixed (int* t = table)
{
for (w = itemWeight; w < max - 8; w += 8)
{
var excludeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w);
var includeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w - itemWeight);
var maxSlice = Avx2.Max(excludeItemSlice, includeItemSlice);
Avx2.Store(t + item * max + w, maxSlice);
}
}
var itemValue = values[item - 1];
for (; w < max; w++)
{
table[item * max + w] = Math.Max(table[(item - 1) * max + w], table[(item - 1) * max + w - itemWeight] + itemValue);
}
}
return table[^1];
}``````

Let me detail the inner loop using the AVX2 hardware intrinsics:

``````var excludeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w);
var includeItemSlice = Avx2.LoadVector256(t + (item - 1) * max + w - itemWeight);
var maxSlice = Avx2.Max(excludeItemSlice, includeItemSlice);
Avx2.Store(t + item * max + w, maxSlice);``````
1. The first line loads 8 integers from the array (8 * 32 bit) into a 256 bit long register. These values hold m[i-1, w], m[i-1, w+1], m[i-1, w+2], etc.

2. The second line loads 8 integers from the array into a 256 bit long register, these values hold m[i-1, w-wi], m[i-1, w-wi+1], m[i-1, w-wi+2], etc.

3. The third line calculates m[i-1, w-wi]+vi, m[i-1, w-wi+1]+vi, m[i-1, w-wi+2]+vi, etc.

4. The fourth line calculates the pairwise maximum of the above vectors.

5. The last line writes the result into the array.

At the end of the inner loop, the last couple of items of the row must be processed by an item-by-item calculation as we cannot fit them safely into a 256 bit long register anymore.

### Recursion Solution

The last solution named "Recursive" using recursion to calculate the maximum value. For brevity this implementation is not detailed here, but it finds the result using less calculations compared to the previous implementations.

## Benchmarks

To compare the performance of these implementations I used BenchmarkDotNet to compare the mean execution time and memory allocation.

I have executed 3 batches of tests with different number of items:

1. 100 items

2. 1000 items

3. 10000 items

Items have random order, with random weights and values generated at the beginning of each batch of tests. Within each batch I executed the tests three times, parameterizing the total weight limit to: 100, 1000 and 2000.

### 100 items results

Method

MaxSize

Mean

Error

StdDev

Allocated

KnapsackRegular

100

16.567 us

0.3240 us

0.4097 us

39.88 KB

KnapsackAvx

100

4.688 us

0.0898 us

0.0840 us

39.88 KB

KnapsackRecursive

100

83.431 us

1.6492 us

2.0857 us

39.88 KB

KnapsackRegular

1000

261.778 us

4.9368 us

4.6178 us

394.95 KB

KnapsackAvx

1000

119.208 us

2.2837 us

2.4435 us

394.95 KB

KnapsackRecursive

1000

1,227.043 us

19.7313 us

15.4049 us

394.96 KB

KnapsackRegular

2000

525.410 us

7.7401 us

7.2401 us

789.49 KB

KnapsackAvx

2000

233.600 us

4.4236 us

4.7332 us

789.48 KB

KnapsackRecursive

2000

2,061.847 us

39.1731 us

36.6425 us

789.49 KB

### 1000 items results

Method

MaxSize

Mean

Error

StdDev

Allocated

KnapsackRegular

100

212.7 us

4.09 us

3.83 us

394.96 KB

KnapsackAvx

100

140.6 us

1.85 us

1.54 us

394.95 KB

KnapsackRecursive

100

854.2 us

16.57 us

25.80 us

394.95 KB

KnapsackRegular

1000

2,166.9 us

36.18 us

33.84 us

3914.1 KB

KnapsackAvx

1000

1,261.1 us

24.17 us

22.61 us

3914.1 KB

KnapsackRecursive

1000

12,801.6 us

249.91 us

245.44 us

3914.17 KB

KnapsackRegular

2000

3,710.4 us

46.44 us

41.17 us

7824.25 KB

KnapsackAvx

2000

1,292.6 us

7.59 us

6.73 us

7824.26 KB

KnapsackRecursive

2000

28,933.0 us

190.60 us

168.96 us

7824.24 KB

Method

MaxSize

Mean

Error

StdDev

Allocated

KnapsackRegular

100

1.988 ms

0.0256 ms

0.0200 ms

3.85 MB

KnapsackAvx

100

1.345 ms

0.0294 ms

0.0440 ms

3.85 MB

KnapsackRegular

1000

20.095 ms

0.3934 ms

0.3680 ms

38.19 MB

KnapsackAvx

1000

14.590 ms

0.2231 ms

0.2087 ms

38.19 MB

KnapsackRegular

2000

40.392 ms

0.3806 ms

0.3374 ms

76.34 MB

KnapsackAvx

2000

27.966 ms

0.2346 ms

0.2194 ms

76.34 MB

## Memory

We may conclude that the allocated memory for each implementation is the same. It can be easily proven that most of this is allocated by the array at the beginning. For the simplest case, `new int[(values.Length + 1) * max]` where max == 100 and values.Length == 100, the allocated array is 101 * 100 * 4 / 1024 = 39 KB. There is some additional object headers and array headers I have not added to this result, but as we see most allocated memory is due to the array being used.

## Mean Execution Time

The main interesting question how vectorization performs compared to the original implementation. As seen in most test-cases it gains a 2-4 times improvement compared to the non-vectorized implementation. One might also ask, why the results for the recursive implementation is missing from the 10000 items batch: with this many items, it throws Stackoverflow exception. Every recursive implementation can be modified to a non-recursive one, which could be done in this case for further comparisons, but for brevity of this post it is omitted.

Another question to consider is the CPU cache sizes (L1, L2, etc): the longer lines we have and the bigger weights we have the more often we will need to hit the memory to fetch data instead of the having them in the CPU cache. Accessing RAM is significantly slower compared to using data from the CPU cache, and this can undermine the benefits of vectorization.

## Conclusion

Using AVX registers can speed up execution time. Even though the recursive algorithm is expected to be the fastest as it requires less computation on the given inputs, in practice it was the slowest. Not every problem can benefit using AVX, and some problems may gain significant performance improvement, while others pay an additional penalty, or their gain is undermined by further components. The suggestion is to measure the code, to make sure that a given use-case can benefit from vectorization or not.

As a last note, the current implementation can be improved to allocate less memory by only keeping the current row and the last row in memory.