Laszlo - Performance Improving a Sudoku Resolver .net application

Performance Improving a Sudoku Resolver .net application

09/20/2020

19 minutes

In this post, I will investigate how to improve the performance of a sample Sudoku resolver application. The base Sudoku resolver is a .net core console application written in C#. The original source has been written only in a couple of hours, while performance was not the main priority. In this post, I will gradually improve the performance of the application and document the process step-by-step. These articles will also touch on how some of the tools such as PerfView, dotnet-counters and BenchmarkDotNet can be used to measure performance. The goal of the optimization will be to reach a mean execution time below 500 us with a unimodal distribution and minimal allocation per run. By run I mean once the application is started, a run is parsing an input of a Sudoku and providing a solution for it. The application may execute multiple runs during its lifetime as a process. For repeatability of the performance tests, I have chosen a sample Sudoku input that is used throughout all the performance tests.

The goal of this post is to represent the ideas that may be used to optimize applications. I will try to avoid showing huge classes or large code, but I will show slices of the code that is modified in an iterative fashion. At this point, I would like to highlight that I am not optimizing the algorithm to resolve a Sudoku, I am optimizing the code that executes the algorithm to resolve the Sudoku. The algorithm itself will not change throughout the optimization process.

Why does performance matter in these days? In a cloud first, serverless world, where computation is bought on memory and CPU used by our application, we can only serve more request for the save money if our application is efficient - also we end up with a more "environment friendly application" by using less power with potentially less CO2 emissions.

Base Application

Let me first take a look at the application I use as the base for the performance optimizations. To have an overall understanding of the application structure this section details the key classes and methods. The reader does not need to understand every bit of detail of the algorithm at this point. Note, that I am not changing the key ideas on how the algorithm works, I only improve the code executing the same algorithm.

The resolver consists of 3 classes.

TableArray operates on top of an array as the name suggests. It represents a state of the Sudoku game. It has a reference to an array, and it has the main responsibility to translate row and column values into a single dimensional array index. It only considers operations on individual cell level. It also provides methods such as

Set: sets a given value to a cell
Unset: removes a possible value from a cell
Valid: validates that a cell has at least one possible value
GetAvailable: returns a list of possible values for a given cell
Clone: returns a deep clone of the state represented by the TableArray

The table array operates on top of a bool array. Each cell of the Sudoku table is represented by 9 bools, each bool indicating if a given number is still a possible value for the given cell.

Table is a static class; it has the main responsibility to maintain a valid table according to the rules of Sudoku. It has methods to

Initialize: sets all possible values for all cells, then iterates the input and sets the input values in the given cells
Set: sets a number for a given TableArray's cell. It makes sure that all neighboring cells are updated too: removing the same value from the possible values in each neighboring cell (row, column, and the 3x3 area), and validating the table after the update.
Resolve: as a recursive method, it sets the next possible number to a given cell, then invokes itself with a clone of the TableArray and the next cell to set. If there is no possible number to set or the TableArray became invalid, it returns to the caller (previous cell), which may try to set another possible number.

The initial Table uses an ArrayPool<bool> type to rent a bool array for the next state for the Resolve method. This new array is then passed as the destination for TableArray's Clone method.

Program initializes the application by creating the necessary types / asking the user for input and printing result. During the performance measurements, I typically replaced the user interacting Console Read/Write methods with the predefined input set programmatically and wrapped the app logic with a while(true) loop or used BenchmarkDotNet's bootstrap code.

Initial Run

For the initial Run I used dotnet counters ps to find the PID of the application then dotnet counters monitor -p 11136 to visualize built in performance counters.

[System.Runtime]
    # of Assemblies Loaded                             7
    % Time in GC (since last GC)                       0
    Allocation Rate (Bytes / sec)              8 151 328
    CPU Usage (%)                                     12
    Exceptions / sec                                   0
    GC Heap Size (MB)                                  2
    Gen 0 GC / sec                                   120
    Gen 0 Size (B)                               119 456
    Gen 1 GC / sec                                     0
    Gen 1 Size (B)                               208 376
    Gen 2 GC / sec                                     0
    Gen 2 Size (B)                                63 928
    LOH Size (B)                                  19 640
    Monitor Lock Contention Count / sec                0
    Number of Active Timers                            0
    ThreadPool Completed Work Items / sec              1
    ThreadPool Queue Length                            0
    ThreadPool Threads Count                           2
    Working Set (MB)                                  22

It soon turns out that the application allocates quite a lot of memory, by looking at the Allocation Rate column. Let's dive into what is being allocated with PerfView. I start a new PerfView session and collect data for a couple of seconds with .Net Sample Alloc checked under Advanced Options.The collected trace data on allocations show the following top allocations:

Name                                                               	Exc %	    Exc	Exc Ct	Inc %	      Inc
Type SudokuResolver.TableArray                                     	 76,6	 57 400	     0	 76,6	   57 400
Type System.Boolean[]                                              	 12,3	  9 200	     0	 12,3	    9 200
Type System.Int32[]                                                	 10,1	  7 600	     0	 10,1	    7 600
Type System.ValueTuple`3[System.Int32,System.Int32,System.Int32][] 	  0,9	    700	     0	  0,9	      700

The above table is cut for brevity. It is easy to point out that the app allocates huge number of TableArray objects and bool array objects. Bool arrays are understandable as I use that for representing state throughout the recursive algorithm. I also use integer arrays that are used for returning the possible numbers for a given cell. Both arrays are pooled with ArrayPool, so one might wonder how it could be further optimized. More on that later. For now, let me focus on the TableArray type.

Lastly, let's benchmark the Run method with BenchmarkDotNet:

Runtime = .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT; GC = Concurrent Workstation
Mean = 1.455 ms, StdErr = 0.007 ms (0.49%), N = 16, StdDev = 0.029 ms
Min = 1.401 ms, Q1 = 1.440 ms, Median = 1.451 ms, Q3 = 1.475 ms, Max = 1.509 ms
IQR = 0.035 ms, LowerFence = 1.387 ms, UpperFence = 1.527 ms
ConfidenceInterval = [1.426 ms; 1.484 ms] (CI 99.9%), Margin = 0.029 ms (2.00% of Mean)
Skewness = 0.14, Kurtosis = 2.25, MValue = 2
-------------------- Histogram --------------------
[1.395 ms ; 1.424 ms) | @@
[1.424 ms ; 1.461 ms) | @@@@@@@@@
[1.461 ms ; 1.524 ms) | @@@@@
---------------------------------------------------


// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.302
  [Host]     : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT
  DefaultJob : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT


| Method |     Mean |     Error |    StdDev |   Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |---------:|----------:|----------:|--------:|------:|------:|----------:|
|    Run | 1.455 ms | 0.0291 ms | 0.0286 ms | 23.4375 |     - |     - |  69.59 KB |

The table shows that the application does not accomplish the performance goals set earlier. The mean operation runs slower as expected and the allocated memory per run is larger than desired.

Before optimization may begin, make sure to have good unit and integration tests that verify the application behavior. Optimizations may easily break the app; hence it is key to get a timely feedback proving that changes do not break the behavior. Run tests often, in each iteration of the optimization process.

Optimization 1: struct

Thinking about TableArray type, one could realize that it could be implemented as a struct. It has only one field, a reference to the underlying array, so it has a small memory footprint. As well as Table is a static class, TableArray-s are always passed as input/output arguments, there are no references hold on to them. Technically I only needed to switch from class to struct, no further changes required. For changes like this it is always important to confirm that the rest of the application does not end up boxing this type. Let us review the effect of the change:

dotnet counters show reduced Allocation Rate, seems like a good start:

[System.Runtime]
    # of Assemblies Loaded                             7
    % Time in GC (since last GC)                       0
    Allocation Rate (Bytes / sec)              2 908 440

Reviewing the allocations with PerfView shows TableArray gone from the blotter. If there was boxing, TableArray would still show up in the table.

Name                                                               	Exc %	    Exc	Exc Ct	Inc %	      Inc
Type System.Int32[]                                      '/          	 83,3	 11 500	     0	 83,3	   11 500
Type System.Boolean[]                                              	 12,3	  1 700	     0	 12,3	    1 700
Type System.ValueTuple`3[System.Int32,System.Int32,System.Int32][] 	  4,3	    600	     0	  4,3	      600

Finally, BenchmarkDotNet we can confirm that nearly 2/3 of the allocated memory per run is saved.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.302
  [Host]     : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT
  DefaultJob : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT


| Method |     Mean |     Error |    StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |---------:|----------:|----------:|-------:|------:|------:|----------:|
|    Run | 1.430 ms | 0.0277 ms | 0.0330 ms | 7.8125 |     - |     - |  25.93 KB |

Optimization 2: Select cell Remove

The second optimization still investigates TableArray type. It has a private method Select, which selects a cell of the Sudoku table by row and column values and fills an input argument of Span<Cell>, where each Cell struct represents an integer number and bool. Naming of this type is inaccurate, do not mix Cell type with a cell of the Sudoku table. Here a Cell type represents is a possible number and if that number is still available or ruled out. So, a given cell is represented by 9 Cell structs. Methods Valid and GetAvailable use this private helper to gather underlying cell data. This is not too efficient as validation only need to find if any numbers are set (does not need all other values provided by Select), while GetAvailable method still must filter out the unset values. In this next step, I removed the Select method with Cell struct, and implement the necessary logic in the aforementioned methods. This saved unnecessary stack allocations and iterations to populate the Span<Cell>. The effect of this is clearly visible on the mean execution time measured by BenchmarkDotNet.

| Method |     Mean |    Error |   StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |---------:|---------:|---------:|-------:|------:|------:|----------:|
|    Run | 838.6 us | 13.14 us | 12.29 us | 7.8125 |     - |     - |  25.93 KB |

I categorize this change as removing over-abstractions. It is hard to spot these types of changes because abstractions are generally good. Developers practice abstraction day-to-day, so we tend to abstract things away. Sometimes we push abstractions too far, which may result a larger performance hit. Removing such abstractions may be only feasible if the code remains readable.

Optimization 3: uint

The next optimization is still focusing on TableArray type. It operates on top of a bool array. Each cell is represented by 9 bools which are allocated on 9 bytes (one byte for each bool). These bytes are not aligned one after the other, but it will skew 8-byte alignment typically preferred on x64 machines. How else could I represent possible cell values?

One idea is to use an unsigned integer type to represent a cell. An unsigned integer is 32 bit long, where each of the last 9 bits can represent bit-by-bit one of the possible values. Now, note this is not going to be the final solution, but it is our next step for this optimization process. Representing a cell with bools required 9 bytes, this solution requires 4 bytes. In case the app uses less memory, Clone needs to copy less amount of data. Another big advantage of this is how CPU works with memory in modern machines. Cache lines are used to read memroy into the CPU cache, where a cache line is typically 64 bytes long. So, in case of 9x9 Sudoku table, using 9 bool per cell results 9x9x9/64 ~ 12 cache lines, while using uint it is ~ 6 cache lines to operate with. By reducing the number of cache lines, I can save significant amount of CPU cycles, that otherwise would just wait on the memory completing the read operation.

With this change though, I need to re-implement some of the methods of TableArray, so that they operate on bits of uint instead of booleans. Let us take a look at two examples:

public void Unset(int value, int row, int column)
{
    _table[row * 9 + column] &= (uint)~(1 << (value - 1));
}

public int Get(int row, int column)
{
    uint i = _table.Span[row * 9 + column];
    if (Popcnt.PopCount(i) != 1)
        return -1;
    return (int)(32 - Lzcnt.LeadingZeroCount(i));
}

Unset: de-selects the given value from the possible values of a cell. Shifts the single bit of 1 to the left by the value given, so we select the bit we want to unset. Next, it calculates the complement of the number. Finally, the result is applied on the cell's number with bitwise AND operations.
Get: uses PopCount to count the number of 1 bit in the number. If there more or zero 1 bits, -1 is return indicating that we do not have a definite value set for the cell. Otherwise it is using the LeadingZeroCount intrinsic to count the leading zeros and return the number set for the cell.

These are 2 of the most complex methods of TableArray.

Re-running BenchmarkDotNet tests shown significant improvement on the mean execution time and some improvement on the allocations too:

| Method |     Mean |   Error |  StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |---------:|--------:|--------:|-------:|------:|------:|----------:|
|    Run | 421.2 us | 5.58 us | 4.95 us | 5.3711 |     - |     - |  17.13 KB |

Optimization 4: stackalloc

One other remarkably high allocation is made for type int array. Integer arrays are allocated from an ArrayPool<int> by the Table class. When Resolve method requests for the possible numbers of a cell, it pre-allocates an integer array. This array is filled by TableArray. Resolve method uses this array to iterates over the possible numbers of a cell during the Sudoku resolution. Reviewing this code pointed out 3 important observations:

as a bug, not all rented arrays have been returned to ArrayPool<int>
the size of the array to be rented is finite
returned integers values are in the [1..9] range

Using the second observation, we know that worst case 9 numbers may be returned, and a maximum of depth of recursion is 81 (as we iterate over the 81 cells).The third observation means that we can use the byte type instead of integers, using only 1/4th of memory compared to Int32.

Combining point 2 and 3 we can tell that the memory required for the possible numbers is 81x9 = 729 bytes, hence I was more confident on replacing the ArrayPool implementation with stackalloc. This seems a very risky change for a recursive method, but the above calculation should provide the confidence that the application will not stack overflow. A third optimization is also added so that the Resolve method requests the number of possible values of a cell from TableArray, and only then allocates an array that fits these numbers.

Applying the changes improves both the mean execution time, and the allocated memory too:

| Method |     Mean |   Error |  StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |---------:|--------:|--------:|-------:|------:|------:|----------:|
|    Run | 321.8 us | 6.17 us | 5.77 us | 3.4180 |     - |     - |  10.17 KB |

PerfView also show that the biggest allocation left is the unsigned integers allocated for the representing the state for Resolve:

Name                    	 Exc %
Type System.UInt32[]    	 100,0

Optimization 5: ushort

Next, I would like to further optimize the allocated memory. Previously, I have switched from representing a cell with 9 bools to uint. It does not take much to take this a step further. As the 9 possible numbers are represented by one-one bit, we only need 9 bits to represent all possible values for a cell. An unsigned integer though has a 32-bit memory footprint. More than 2 bytes are unused. Instead of using uint, I switch to use ushort. This will further reduce the memory footprint and cache line faults. Re-running the benchmarks shows, that nearly half the memory required per run.

| Method |     Mean |   Error |  StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |---------:|--------:|--------:|-------:|------:|------:|----------:|
|    Run | 318.9 us | 6.23 us | 6.12 us | 1.9531 |     - |     - |   5.86 KB |

This change only requires re-writing the type of data structure in use. It does not require changing bit manipulation or application to logic. In that sense it is a simple and easy change with good benefits.

Optimization 6: Rent Smarter

Looking at further the allocated memory for persisting the state in the recursive Resolve method, I realize some minor optimization. Resolve method iterates over the possible numbers for a given cell. At each iteration it creates a copy of the current state, that is used by the exploration of the next state. In each iteration an array is rented and returned for this copied state. Instead of renting an array for each iteration of the possible cell values, I can rent the table once, as the size is constant, and the values are overwritten by the copy inside the loop. This saves some operations with ArrayPool<ushort>, which results in a faster execution:

| Method |     Mean |   Error |  StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |---------:|--------:|--------:|-------:|------:|------:|----------:|
|    Run | 298.5 us | 5.17 us | 4.84 us | 1.9531 |     - |     - |   5.86 KB |

Optimization 7: ref struct

Before I get into this optimization, I ran a PerfView session and investigated the CPU stacks view for the application.

getspan before

It seems that the application spends quite a lot of time in a Span property getter highlighted. TableArray as a struct is used to run cell level operations. Each time it needs to modify a cell, a TableArray is created which operates on top of a Memory<ushort> type. To index into the memory the Span property is used to access the underlying array. To mitigate this accessTableArray would need to replace its Memory<T> field to a Span<T> field, which is only possible if it becomes a ref struct itself.

This change from TableArray's point of view is rather simple: add a ref keyword to the declaration, replace Memory<ushort> with Span<ushort> and remove the usages of accessing the previous Span property of Memory<T>.

Table class also need to go through minor refactoring: TableArray has been passed around by Resolve method as an element of a value tuple. It means TableArray would become a generic type argument, which is not allowed for ref structs. With simple refactoring I replaced the usage of TableArray as a parameter with Memory<ushort>. I have also added a simple helper method to get a TableArray created from a Memory<ushort>.

After the changes, there are some clear improvements visible with PerfView session and BenchmarkDotNet analysis as well:

getspan after

| Method |     Mean |   Error |  StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |---------:|--------:|--------:|-------:|------:|------:|----------:|
|    Run | 219.6 us | 4.10 us | 3.64 us | 1.4648 |     - |     - |   4.75 KB |

Optimization 7: ArrayPool Create

Taking a second look at the previous PerfView session, I noticed that execution spends a significant percentage in ArrayPool's Rent and Return methods.

arrayPool

By default, I have used a shared ArrayPool. A shared array pool means that some necessary thread synchronization must take place during Rent and Return. Because the Sudoku resolver is single threaded, we can avoid the synchronization by using a private instance of the array pool. Changing implementation is a simple refactoring:

private static readonly ArrayPool<ushort> _tablePool = ArrayPool<ushort>.Create(128, 100);

I create a private ArrayPool<ushort> with a max array length of 128 (to represent a state of the Sudoku table I need an array with length of 81), and a max arrays per bucket is set to 100 (the maximum 'depth' of states is 81 - one for each cell).

The non-shared instance of ArrayPool uses a different pool implementation, which is faster in this use-case:

arrayPool

BenchmarkDotNet measurements show a significant step forward: both Mean execution time as well as the allocated memory per run has decreased.

| Method |     Mean |   Error |  StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |---------:|--------:|--------:|------:|------:|------:|----------:|
|    Run | 154.5 us | 2.65 us | 2.47 us |     - |     - |     - |     384 B |

Optimization 8: Fixed Memory Allocation

Taking the previous thought a step forward, I realize that to resolve any Sudoku a fixed amount of memory is required. There are 81 states to store in memory, both with 81 cells (each cell represented as an ushort). The algorithm also created 2 additinal state copies:

from the initial input
for the final result

With this knowledge, I replaced the ArrayPool used with a fixed sized array allocation. Each Resolve method will work on a slice of this array which is mapped to the given state being worked on.

private readonly ushort[] _tableArrayPool = new ushort[81 * (81 + 2)];

This allocation happens once, at the startup of the application. Each subsequent run will not need to allocate, but clear the previous results stored in the array for a given state before using it.

Re-running benchmarks shows further improvement on execution time, as well as the per run allocated memory:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.302
  [Host]     : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT
  DefaultJob : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT


| Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |---------:|---------:|---------:|------:|------:|------:|----------:|
|    Run | 92.91 us | 1.838 us | 2.043 us |     - |     - |     - |         - |

Conclusion

At this point I feel happy with the current implementation and performance properties of the application.

With all these changes, the application performance has improved x15 times faster to initial execution time, and ~69KB less allocation per run. The initial performance goals are met and superseded. In each step, only a small iteration has been made to the application to reach the end goal. Also note, that adequate measurements are helping in each step to steer the optimization path to next step. In this simple example optimization has been easy, I had a full understanding of the code, allocations classes and their responsibilities, but the same approach can be easily adopted on larger projects too.

In this blog post I have not covered one crucial step in each optimization step: testing. Make sure to have good unit tests that covers the refactored application logic. Each optimization step requires validating that the previous behavior has not been broken. Good unit tests can give only the confidence fast enough that the application still works correctly.

C# performance dotnet benchmarking perfview dotnet-counters

Hello, I am Laszlo

.NET developer