False Cache Sharing with Spans

Introduction

False cache sharing is a leaky abstraction of the CPU cache. CPUs cache memory in cache lines. Typically a cache line is 64 bytes long. That means when a single byte is required from the memory, the CPU rather brings in a full cache line, 64 bytes of data. This is beneficial from spatial/temporal locality principals. CPU cache is usually hierarchical, it has L1, L2 and L3 levels, each different size and speed. L1 and L2 is owned by the actual processor core, while L3 is a shared cache among all CPU cores.

You can find more details on CPU caching and false sharing in Pro .Net Memory Management by Konrak Kokosa.

Lower cache level being owned by the actual CPU cores means that when a data is modified in a cache line, it has to synchronize with other cores sharing the same cache line. This can have a negative performance effect: even in a synchronization free (at application code level) multithreaded application, the CPU cache might become a bottleneck by keeping their cache lines up-to-date.

This performance penalty can be demonstrated with the following code:

Implementation

public async Task Indexer()
{
    var tasks = new List<Task<int>>();
    for (int i = 0; i < 8; i++)
    {
        int index = Gap + i * Offset;
        tasks.Add(Task.Run(() => Calculate(index)));
    }
    await Task.WhenAll(tasks).ConfigureAwait(false);
}

public int Calculate(int index)
{
    for (int i = 0; i < Iterations; i++)
    {
        _data[index] = _data[index] + 1;
    }
    return _data[index];
}

The Indexer method starts 8 tasks, selects an index for each task, and invokes Caclulate method with that index. Then all task completions are awaited. The Caclulate updates the given index of the array, by incrementing its value.

The index is chosen carefully by setting Gap and Offset properties. Gap is an initial gap between the objects header of the array and the first selected index. Offset is an offset between selected indexes.I ran the above code with the following Gap and Offset configurations:

  • Gap: 0, Offset: 1

  • Gap: 0, Offset: 16

  • Gap: 16, Offset: 16

As an example when Gap is 0 and Offset is 16, the following indexes are chosen:[0,16,32,...]

I am using a similar architecture and gap/offset values as used in Pro .Net Memory Management book.

Benchmarks

Running benchmarks results the following perf data:

Method

Offset

Gap

Mean

Error

StdDev

Median

Indexer

1

0

907.2 ms

126.60 ms

367.30 ms

653.6 ms

Indexer

16

0

592.0 ms

115.38 ms

340.19 ms

301.6 ms

Indexer

16

16

283.7 ms

5.60 ms

5.23 ms

283.7 ms

Note, that we have a huge variance, but mean and median should give a good understanding on how Offset and Gap impacts the results. To clarify it further, in the first case all threads are sharing the same cache line.

In the second case, all indexes fall into separate cache lines, but the head of the array (containing the length used for boundary checks) is still shared, and the cache line is updated regularly by the first index (index 0).

In the third case the initial gap makes sure that the first index, does not sit on the same cache line as the array's header, so no cache lines collide. Also note that the Error rate and the StdDev are much lower as well.

Can we avoid using the gap between the array header and the first index?

Using Spans

This section investigates if we can avoid using the initial gap between the array header and the first element. For this we need to avoid boundary checks of the array, which happens when we index into it. One plausible solution is to use Span which is a type- and memory-safe representation of a contiguous region of memory.

In this followings it will represent the indexes to be updated. It is pointing to a 1 length long slice of the array. Let's look at the updated code:

public async Task Span()
{
    var tasks = new List<Task<int>>();
    for (int i = 0; i < 8; i++)
    {
        int index = Gap + i * Offset;
        tasks.Add(Task.Run(() => Calculate(_data.AsSpan().Slice(index, 1))));
    }
    await Task.WhenAll(tasks).ConfigureAwait(false);
}

public int Calculate(Span<int> postoupdate)
{
    for (int i = 0; i < Iterations; i++)
    {
        postoupdate[0] = postoupdate[0] + 1;
    }
    return postoupdate[0];
}

Span method - in the above example - is very similar to the Indexer method. The only difference is it creating a Span and a Slice before it invokes the Caclulate method. Calculate method is also very similar to the previous implementation, though it is using a 0th item of the Span argument to update the same index of the array as earlier. Note, that Span points into the array at the calculated index, hence the 0th element of the Span is the item as pointed by indexer in the previous example.

Spans are ref structs, that means they can only live on the stack, which far away of the array allocated on the heap. These spans might be located very close to each other, but we are only reading the values of them: a reference to the actual location of the array slice.

On the other hand we do not need to add additional boundary checks on the array during updating the given index, as those are done once upfront, during slicing.Which means false sharing should only be the problem if the offset is small enough between the indexed elements, but independent from the gaps.

Running the benchmarks again:

I used BenchmarkDotNet to measure the performance of the above examples.

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363 Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores .NET Core SDK=3.1.101DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

Method

Offset

Gap

Mean

Error

StdDev

Median

Indexer

1

0

907.2 ms

126.60 ms

367.30 ms

653.6 ms

Indexer

16

0

592.0 ms

115.38 ms

340.19 ms

301.6 ms

Indexer

16

16

283.7 ms

5.60 ms

5.23 ms

283.7 ms

Span

1

0

701.3 ms

4.51 ms

4.22 ms

701.7 ms

Span

16

0

304.9 ms

3.24 ms

3.03 ms

303.8 ms

Span

16

16

304.4 ms

2.09 ms

1.86 ms

304.3 ms

Shows that the Span solution is independent from the Gap property's value from performance points of view. Also note, that using spans gives a much smaller error rate and StdDev.