SkipInitLocal or ArrayPool

Many data processing algorithms require a temporary data storage for certain operations. I have been recently working with one such algorithm in the sound processing domain. In my case a fixed sized temporary byte array has been required for a method execution. In a performance sensitive code, one would prefer to achieve a non-allocating solution for this problem.

In the particular case it was a tight loop for processing live sound data. Allocating a new array for every iteration of the loop is excessive. However, I could also not use a pre-allocated array for the purpose, due to the structure of the code. There are still a few options to acquire a temporary array:

ArrayPool type provides a shared pool of arrays, from which a developer might rent an array, and return it once it is no longer needed. This array exists on the heap as a regular array, and the ArrayPool makes sure to provide an array when renting which is at least the size asked for. Because the array is returned to the pool after usage, the next iteration of the algorithm may re-use the same array, avoiding any excess allocation.

Stackalloc on the other hand allocates a block of memory on the stack. Which means that once the method is unrolled this memory is also released. Because the allocation incurs on the stack instead of the heap, this will not increase the workload of the Garbage Collector. The stack is limited in size (remember StackOverflowException), which limits the size of array we may allocate.

Skip Locals Init

.NET 5 adds a new attribute [SkipLocalsInit] in the hands of the developers. [SkipLocalsInit] applied on a method instructs the runtime to leave the stack allocated memory uninitialized (values are not set to 0). This is an unsafe operation which needs to be explicitly enabled in csproj by the following property:

<AllowUnsafeBlocks>True</AllowUnsafeBlocks>

This attribute may give a performance advantage in very tight loops. In the above example, this attribute may be applied on the method using stackalloc safely, because the algorithm writes a new value into every element of the array.

With this background knowledge, I have designed a few performance tests to determine the characteristics of these solutions.

Benchmark

Each performance test acquires a temporary byte array, sets each element of the array with a const value and returns the length of the array while also releasing it. The reason for setting each element is because with skip locals init, the user would likely (except some edge cases) set values or zero out the unset elements. It also resembles more closely with the algorithm in question above. The subject of the code measured looks as the following code snippet in case of allocating on the stack:

Span<byte> buffer = stackalloc byte[Size];
for (int i = 0; i < buffer.Length; i++)
    buffer[i] = 1;

While the same code with ArrayPool:

var data = ArrayPool<byte>.Shared.Rent(Size);
var buffer = data.AsSpan(0, Size);
for (int i = 0; i < buffer.Length; i++)
    buffer[i] = 1;
ArrayPool<byte>.Shared.Return(data, false);
return buffer.Length;

The benchmarks are parameterized by array sizes, starting from 256 bytes, 2048 bytes and 8192 bytes.

No Concurrency

In the first test I compared the vanilla performance of the different approaches.

|             Method | Size |        Mean |     Error |      StdDev | Ratio | RatioSD | Code Size | Allocated |
|------------------- |----- |------------:|----------:|------------:|------:|--------:|----------:|----------:|
|      SkipLocalInit |  256 |  1,004.1 ns |  19.94 ns |    33.32 ns |  1.00 |    0.00 |     249 B |         - |
| StackAllocWithZero |  256 |    923.9 ns |  18.06 ns |    33.02 ns |  0.92 |    0.04 |     218 B |         - |
|          ArrayPool |  256 |  1,146.1 ns |  22.23 ns |    30.43 ns |  1.13 |    0.04 |     296 B |         - |
|                    |      |             |           |             |       |         |           |           |
|      SkipLocalInit | 2048 |  7,423.9 ns | 147.93 ns |   243.06 ns |  1.00 |    0.00 |     249 B |         - |
| StackAllocWithZero | 2048 |  7,420.1 ns | 145.43 ns |   269.56 ns |  1.00 |    0.05 |     218 B |         - |
|          ArrayPool | 2048 |  7,535.4 ns | 149.26 ns |   291.11 ns |  1.01 |    0.05 |     296 B |         - |
|                    |      |             |           |             |       |         |           |           |
|      SkipLocalInit | 8192 | 29,814.0 ns | 484.34 ns |   453.05 ns |  1.00 |    0.00 |     249 B |         - |
| StackAllocWithZero | 8192 | 30,575.1 ns | 607.28 ns |   722.92 ns |  1.02 |    0.03 |     218 B |         - |
|          ArrayPool | 8192 | 29,823.7 ns | 592.85 ns | 1,326.00 ns |  0.99 |    0.05 |     296 B |         - |

The key takeaways on my machine of this measurement are:

  • for small arrays there is no big performance advantage of [SkipLocalsInit] (for large arrays, there is an advantage)

  • for small arrays stack allocation is faster to renting from ArrayPool

  • for large arrays acquiring arrays from ArrayPool is comparable with stack allocation with [SkipLocalsInit].

Stressing ArrayPool

A careful reader may notice, that ArrayPool is actually using a shared instance of the pool. This might come at a tradeoff in some scenarios. I have outlined a few typical use-cases with multiple threads exercising the shared pool instance, but I found no degradation of performance compared to the counterpart test methods.

Reading the ArrayPool implementation, shows that it is very cleverly designed. It uses a thread local storage for each array size, and if one not available only tries to access a locked stack assigned to a processor core.

This meant two further tests to execute:

  • renting multiple arrays in benchmark

  • renting multiple arrays while also having other threads exercising the same pool instance

In the first case I observed an increased overhead for the ArrayPool test case, however my measurements still shown a very close performance to stack allocation with [SkipLocalsInit] for all tested array sizes.

The second case is more interesting, where I shall note that the background threads themself may alter the outcome of the benchmark. However I observed the same trends over multiple execution of the tests. In this case I observed a larger performance degradation, which also shown in the case of acquiring the 8192 bytes long array:

|             Method | Size |       Mean |     Error |    StdDev |     Median | Ratio | RatioSD | Code Size | Allocated |
|------------------- |----- |-----------:|----------:|----------:|-----------:|------:|--------:|----------:|----------:|
|      SkipLocalInit |  256 |   3.031 us | 0.0594 us | 0.1533 us |   3.019 us |  1.00 |    0.00 |     372 B |         - |
| StackAllocWithZero |  256 |   4.037 us | 0.1848 us | 0.5448 us |   4.155 us |  1.32 |    0.21 |     308 B |         - |
|          ArrayPool |  256 |   5.821 us | 0.1736 us | 0.5065 us |   5.911 us |  1.95 |    0.18 |     253 B |         - |
|                    |      |            |           |           |            |       |         |           |
|
|      SkipLocalInit | 2048 |  26.212 us | 0.5231 us | 1.0804 us |  26.354 us |  1.00 |    0.00 |     372 B |         - |
| StackAllocWithZero | 2048 |  37.317 us | 1.1431 us | 3.3706 us |  38.161 us |  1.37 |    0.14 |     308 B |         - |
|          ArrayPool | 2048 |  32.815 us | 0.6514 us | 1.4297 us |  32.972 us |  1.25 |    0.08 |     253 B |         - |
|                    |      |            |           |           |            |       |         |           |
|
|      SkipLocalInit | 8192 | 123.391 us | 2.4292 us | 5.2292 us | 123.504 us |  1.00 |    0.00 |     372 B |         - |
| StackAllocWithZero | 8192 | 142.248 us | 2.8234 us | 6.4302 us | 144.509 us |  1.15 |    0.07 |     308 B |         - |
|          ArrayPool | 8192 | 132.240 us | 2.6411 us | 6.6743 us | 132.640 us |  1.07 |    0.06 |     253 B |         - |

Conclusion

ArrayPool is implemented with very good performance characteristics, and it is generally a good choice for most use cases. If someone is going for the extra performance, in very tight loops stack allocation with [SkipLocalsInit] might have some advantage. However using [SkipLocalsInit] attribute requires to mark the project with the unsafe flag.