Laszlo - Vectorized Checkerboard

Vectorized Checkerboard

08/04/2024

8 minutes

In this series I am implementing Dynamic Programming Katas with vectorized implementations.

The first post explored the Vectorized Longest Increasing Subsequence problem, and in this post I will dive into the Checkerboard problem.

Introduction

Consider a checkerboard with n*n squares and a cost function c(i, j) which returns a cost associated with the indexes of (i,j). The checker starts in the first row. We would like to know the shortest path (the sum of the minimum costs at each visited row) to get to the last row. The checker could move only diagonally left forward, diagonally right forward, or straight forward. That is, a checker on (1,3) can move to (2,2), (2,3) or (2,4).

For example, a cost square:

| 5 | 6 | 1 | 4 | 3 |
| 3 | 1 | 1 | 2 | 1 |
| 8 | 3 | 9 | 2 | 3 |
| 2 | 4 | 7 | 3 | 5 |
| 1 | 5 | 4 | 6 | 6 |
| 3 | 8 | 3 | 6 | 7 |

When standing at o, the checker may move to positions marked with x:

|   |   |   |   |   |
|   |   |   |   |   |
|   |   |   |   |   |
|   |   |   |   |   |
|   | x | x | x |   |
|   |   | o |   |   |

The solution is detailed in the Checkerboard, but the idea is the following: to each (i,j) position the checker may move from either of (i-1, j-1), (i-1, j), (i-1, j+1) positions, where i-1 indicates the previous row.

To calculate the cost of the shortest path we can create table q with n rows and n+2 columns. Write 'infinite' to the first and last columns of the table. The cells of the first row [1..n+1] should be filled with the corresponding cells of the costs table.

The remainder of the cells to be filled by the following formula: min(q(i-1, j-1), q(i-1, j), q(i-1, j+1)) + cost(i,j)

Finally, the cost of the shortest path is the min(q(n,[1..n+1])), the minimum value of the last row.

The Base Solution

The base solution implements exactly the above description. The Checkerboard method receives the cost table as a jagged array. At this point a production ready code should validate the input. A temp table is allocated with n rows and n+2 columns. The first for loop allocates the inner arrays of the jagged array. For the rest of the steps the code snippet is commented for better understanding:

private int Checkerboard(int[][] input)
{
    int rows = N;
    int columns = N + 2;
    var temp = new int[rows][];
    
    // Allocate the jagged array
    for (int i = 0; i < rows; i++)
        temp[i] = new int[columns];

    // Set infinite to the first and last coloumns
    for (int i = 0; i < rows; i++)
    {
        temp[i][0] = int.MaxValue;
        temp[i][^1] = int.MaxValue;
    }

    // Copy the first row of the cost table to the first row of 'q'
    for (int i = 1; i < columns - 1; i++)
        temp[0][i] = input[0][i - 1];

    // Fill the rest of 'q'
    for (int i = 1; i < rows; i++)
    {
        for (int j = 1; j < columns - 1; j++)
        {
            var left = temp[i - 1][j - 1];
            var below = temp[i - 1][j];
            var right = temp[i - 1][j + 1];
            temp[i][j] = Math.Min(Math.Min(left, below), right) + inpu[i][j - 1];
        }
    }

    // Return the min value of the last row of 'q'
    return temp[^1].Min();
}

Note that the code snippets are not production ready. They demonstrate the overall idea of the algorithms but lack handing edge cases.

Vectorization

Let me present a vectorized implementation. While the algorithm and the main idea are the same, the implementation differs in a few places. Vectorization usually means using the SIMD. Vectors are stored in special SIMD registers, different CPUs/architectures are equipped with different register sizes (e.g. 64, 128, 256 and 512 bits). The machine used for this blog post has 256-bit long SIMD registers. That means it can process 8 Int32 values with a single instruction. That means an algorithm may be sped up by a factor of 8 on this machine. In practice it is rare to see such improvements, as loading data into the SIMD registers and saving the results also takes a few extra operations. Also note, that the vectorized algorithm is still O(n^2), because constant factors are eliminated by the definition of Big O notation.

In .NET Vector<T> represents a single vector for parallel algorithms. The JIT uses intrinsics to decide at runtime which concrete type and instructions can be used on the given architecture it runs on. A developer could also choose to use those lower-level types directly, such as Vector256<T>. However, an algorithm using Vector256<T> directly might not leverage all the capabilities of CPU supporting larger registers or fail to on CPU not having 256-bit long registers. To use lower-level types a developer would need to check the supported capabilities of the machine and branch to a corresponding implementation. While one can branch the code based on the supported instruction set, it is easier to use the higher-level types when possible.

I commented the vectorized implementation in the code snippet, and that is followed by a more detailed description beneath.

private T CheckerboardSimd<T>(T[][] input) where T : INumber<T>, IMinMaxValue<T>
{
    int rows = N;
    int columns = N + 2;
    var dp = ArrayPool<T[]>.Shared.Rent(rows);
    try
    {
        // Rent arrays from an ArrayPool to create the jagged array
        // Sets infinite for the the first and last column
        for (int i = 0; i < rows; i++)
        {
            dp[i] = ArrayPool<T>.Shared.Rent(columns);
            dp[i][0] = T.MaxValue;
            dp[i][N + 1] = T.MaxValue;
        }

        // Copy the first row of the cost table to the first row of 'q'
        input[0].CopyTo(dp[0].AsSpan(1));

        // Fill the rest of 'q'
        for (int i = 1; i < rows; i++)
        {
            Vector<T> vCurrent;
            int j;
            int rowBelow = i - 1;
            for (j = 1; j < columns - Vector<T>.Count - 1; j += Vector<T>.Count)
            {
                vCurrent = Vector.LoadUnsafe(in dp[rowBelow][j - 1]);
                vCurrent = Vector.Min(vCurrent, Vector.LoadUnsafe(in dp[rowBelow][j]));
                vCurrent = Vector.Min(vCurrent, Vector.LoadUnsafe(in dp[rowBelow][j + 1]));
                Vector.StoreUnsafe(vCurrent + Vector.LoadUnsafe(in input[i][j - 1]), ref dp[i][j]);
            }
            j = columns - Vector<T>.Count - 1;
            vCurrent = Vector.LoadUnsafe(in dp[rowBelow][j - 1]);
            vCurrent = Vector.Min(vCurrent, Vector.LoadUnsafe(in dp[rowBelow][j]));
            vCurrent = Vector.Min(vCurrent, Vector.LoadUnsafe(in dp[rowBelow][j + 1]));
            Vector.StoreUnsafe(vCurrent + Vector.LoadUnsafe(in input[i][j - 1]), ref dp[i][j]);
        }
        
        // Return the min value of the last row of 'q'. The implementation of MinVectorized is omitted for brevity.
        return MinVectorized<T>(dp[^1].AsSpan(1, N));
    }
    finally
    {
        // Return the rented arrays.
        foreach (var row in dp)
            ArrayPool<T>.Shared.Return(row);
        ArrayPool<T[]>.Shared.Return(dp);
    }
}

Note, that this code is generic, and it constrains T to INumber<T>, IMinMaxValue<T>, so that it is a type that has a max value. This allows us to write T.MaxValue using the static abstract methods of C#.

This implementation does not handle inputs smaller to the vector count. In a production code, one may want to fall back to the base solution in such cases, as vectorization might only bring negligible performance benefits.

The rest of the code is similar to the base solution, but instead of handling a single item at a time, it handles vCurrent vector in a single instruction. The vectorized implementation has an additional iteration after the inner for loop that handles the remaining items of a single row, by fetching items for a vector from the end of the current row. This way some items may be calculated two times, but the algorithm does not need to fallback to a non-vectorized implementation.

Performance

I have compared the performance of the two implementations using BenchmarkDotNet and .NET 8.0.2.

The benchmarking machine supports AVX2 instruction set. The input is an array of 512*512 table of int numbers shuffled randomly. The results indicate that SIMD solution being close to 7 times faster than the original solution:

Method	Mean	Error	StdDev	Gen0	Gen1	Allocated
Checkerboard	1,436.8 us	16.52 us	15.45 us	169.9219	125.0000	1044.02 KB
CheckerboardSimd	193.9 us	1.95 us	1.82 us	83.0078	82.7637	510.98 KB

I also notice that the performance gain varies greatly by the input array. For really large arrays the performance advantage shrinks, because a lot more memory needs to be allocated (ArrayPool allocates new arrays above a certain size). The CPU will also need to bring in new memory segments to the CPU cache, so the performance will be bottlenecked by the speed of updating the cache lines.

Conclusion

In this post I introduced the checkerboard problem. I have outlined the most common implementation in C#. Then shown how one could consider vectorizing this problem. The examples are not production ready code, please don't use them as-is in prod apps. Finally, I compared the performance of the two solutions.

performance C# vector SIMD

Hello, I am Laszlo

.NET developer