Testing Locks is Difficult

I have recently worked with multithreaded code using shared state that required mutual exclusion. In .NET there are numerous types to accomplish mutual exclusion, and the C# language even provides a lock keyword. Depending on the use case one can use Monitor, lock on reference types, SemaphoreSlim, ReaderWriterLockSlim, Interlocked, SpinLock, concurrent collections, or the Lock type introduced in .NET 9. This latter type also works with the lock language keyword, and it brings much of the underlying native implementation into managed code. This results in some minor performance improvements compared to locking on any other reference type.
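
A minimal sketch of the change, assuming a class guarding a simple counter (the class and member names are illustrative):

// On .NET 9+ the lock keyword recognizes System.Threading.Lock and
// emits Lock.EnterScope instead of Monitor.Enter/Exit.
public sealed class Counter
{
    // Before the change: private readonly object _sync = new();
    private readonly System.Threading.Lock _sync = new();
    private int _count;

    public int Count
    {
        get { lock (_sync) return _count; }
    }

    public void Increment()
    {
        lock (_sync) _count++;
    }
}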

Achieving the above improvement seems straightforward: change the type of the object used with the lock keyword. However, lock might still be slower than other types of locks, as the 'best' locking solution depends on many factors:

  • number of threads accessing the lock

  • the frequency at which threads access the lock

  • duration inside the lock

  • contention

  • CPU, architecture, OS, etc.

  • memory model

  • etc.

In general, certain locks fit certain use cases better (for example, collections, async code, granular locks, etc.), which drives the decision on the lock that best suits an application. Proving that one locking solution is more performant than another is only possible by measuring the performance of an actual application under a production-like load. This also means that the same code might perform differently from application to application.

Unit Testing is Difficult

Consider a lock that a developer would like to refactor. Existing unit tests might already validate the corresponding application logic, but in practice it is rather rare to see unit tests for the mutual exclusion itself.

The most common way I have seen mutual exclusion being tested: start multiple threads performing some operations and, once they are done, validate the resulting state. If mutual exclusion was broken, the state might be corrupted. However, nothing guarantees that the test threads were actually running in parallel. They might be concurrent but not parallel, depending on the CPUs, OS thread scheduling, or .NET thread pooling. Repeating the unit test 100 or 1000 times can eliminate some of these factors statistically, but not all of them, and repeating the same test 1000 times also makes the overall test execution longer. Depending on the type of lock and the memory model, executing these tests on different architectures (arm, x64, etc.) might also be necessary.
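
A sketch of that common pattern, assuming xUnit and the Counter class from the earlier sketch:

using System.Linq;
using System.Threading.Tasks;
using Xunit;

public class CounterTests
{
    [Fact]
    public async Task Increment_FromManyThreads_DoesNotLoseUpdates()
    {
        const int Threads = 4;
        const int IterationsPerThread = 100_000;
        var counter = new Counter();

        // Start the workers; nothing guarantees they truly run in parallel.
        var tasks = Enumerable.Range(0, Threads)
            .Select(_ => Task.Run(() =>
            {
                for (int i = 0; i < IterationsPerThread; i++)
                    counter.Increment();
            }));
        await Task.WhenAll(tasks);

        // If mutual exclusion were broken, updates could be lost, and only
        // sometimes - hence the habit of repeating the test many times.
        Assert.Equal(Threads * IterationsPerThread, counter.Count);
    }
}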

CPUs might reorder instructions within the constraints of the memory model, which can result in fields being read and written in unexpected orders. That makes unit tests validating mutual exclusion even more difficult, as the tests themselves must now take care of using the right synchronization too; in certain cases one needs memory barriers via Interlocked or Volatile. Testing correctness under race conditions would require step-by-step execution of multiple threads; however, instrumenting the code for such step-by-step execution alters the behavior of the lock, as the duration of and contention for the lock change due to the instrumentation.
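
For example, code (or a test) that publishes a result on one thread and polls it from another needs explicit barriers, otherwise the reader can observe the flag without the data; a sketch:

using System.Threading;

public sealed class OneShotResult
{
    private int _value;
    private bool _ready;

    public void Publish(int value)
    {
        _value = value;
        // Volatile.Write has release semantics: the _value store above
        // cannot be reordered past this flag store.
        Volatile.Write(ref _ready, true);
    }

    public bool TryRead(out int value)
    {
        // Volatile.Read has acquire semantics: if we observe the flag,
        // we are guaranteed to also observe the _value written before it.
        if (Volatile.Read(ref _ready))
        {
            value = _value;
            return true;
        }
        value = 0;
        return false;
    }
}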

Another aspect that is difficult to unit test: how can one verify that all access to the shared state is covered by mutual exclusion? Or that thread exhaustion is avoided? Answering these questions is hard, and unit testing them is equally difficult.
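
One partial mitigation (an assertion, not a proof) is to check lock ownership inside every member that touches the shared state; a sketch using the .NET 9 Lock.IsHeldByCurrentThread property (Monitor.IsEntered offers the same check for plain monitors):

using System.Diagnostics;
using System.Threading;

public sealed class Inventory
{
    private readonly Lock _sync = new();
    private int _stock;

    public void Remove(int amount)
    {
        lock (_sync) RemoveCore(amount);
    }

    private void RemoveCore(int amount)
    {
        // Fails fast in debug builds if any call path reaches the
        // shared state without holding the lock.
        Debug.Assert(_sync.IsHeldByCurrentThread);
        _stock -= amount;
    }
}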

Often such tests are incorrectly marked as flaky (or are inherently flaky), when in fact a rare bug in the application logic is the cause.

Benchmarking is Difficult

Benchmarking is just as difficult as unit testing. While correctness is already asserted by the unit tests, benchmarks typically concern the performance of the application or of a given code segment (microbenchmarks). Performance benchmarks face the same challenges regarding environment, scheduling, etc. as the unit tests do.
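
For microbenchmarks in .NET, BenchmarkDotNet is a common choice; a minimal single-threaded sketch (illustrative only, not the harness behind the numbers below):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class LockOverheadBenchmark
{
    private readonly object _sync = new();
    private int _shared;

    // Measures the cost of one lock acquisition around a trivial update.
    [Benchmark]
    public void MonitorIncrement()
    {
        lock (_sync) _shared++;
    }
}

public static class Program
{
    // BenchmarkDotNet handles warmup, iteration counts, and the
    // statistical analysis of the measurements.
    public static void Main() => BenchmarkRunner.Run<LockOverheadBenchmark>();
}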

There are many aspects of locking to be measured:

  • throughput

  • time required for entering / exiting the lock

  • latency (lock acquisition duration)

  • etc.

However, these aspects depend heavily on the nature of the code requiring mutual exclusion. If the threads rarely require exclusion, and only for short periods of time, locking will be less dominant. If many threads access the lock at the same time, locking might become a dominant factor in the application's performance.

Here is a quick benchmark using mutual exclusion for incrementing an integer. Each run is described by the number of threads incrementing the shared integer, the locking mechanism used, and whether a thread increments the number 1 or 100 times per lock acquisition. (Note that Interlocked always increments by 1.)

1 thread, lock per increment
     Monitor:     18,028 ns/change
        Lock:     16,192 ns/change
 Interlocked:      6,509 ns/change

4 threads, lock per increment
     Monitor:    110,754 ns/change
        Lock:    104,302 ns/change
 Interlocked:     67,172 ns/change

1 thread, lock per 100 increments
     Monitor:      4,471 ns/change
        Lock:      4,494 ns/change
 Interlocked:      6,325 ns/change

4 threads, lock per 100 increments
     Monitor:      5,749 ns/change
        Lock:      5,443 ns/change
 Interlocked:     63,877 ns/change
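
The measured loop bodies look roughly like the following sketch (the timing and thread-coordination harness is omitted; n is 1 or 100 depending on the granularity):

using System.Threading;

public class IncrementWorkload
{
    private readonly object _monitorSync = new();
    private readonly Lock _lockSync = new();
    private int _shared;

    // One lock acquisition guarding n increments (n = 1 or n = 100).
    public void LockPerNIncrements(int n)
    {
        lock (_lockSync)   // or lock (_monitorSync) for the Monitor rows
        {
            for (int i = 0; i < n; i++)
                _shared++;
        }
    }

    // Interlocked has no lock object; each increment is its own atomic
    // read-modify-write, which is why it always increments by 1 at a time.
    public void InterlockedIncrements(int n)
    {
        for (int i = 0; i < n; i++)
            Interlocked.Increment(ref _shared);
    }
}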

On my machine, the same locking solutions perform vastly differently with different numbers of threads or slightly different granularity of the code. These variables affect the overall results; hence a production-like load is always necessary. Based on the load, the same application might require different locking solutions in highly optimized scenarios.

Interlocked is Not as Fast

Generally, Interlocked-based solutions are considered 'fast'. For granular state changes, the Interlocked.CompareExchange or Interlocked.Increment pattern is often the suggested approach. However, it is still several times slower than the non-locked equivalents of these operations. Based on uops.info, on an Intel Alder Lake processor lock inc is 36 times slower than a regular inc. Similarly, lock cmpxchg is several times slower than cmpxchg:

Instruction                   Latency     Throughput   Uops
lock inc dword ptr [rdx]      [17;≤42]    18.00        8 / 6
lock cmpxchg [r10], edx       [14;≤43]    19.04        10 / 8
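
For reference, the Interlocked.CompareExchange pattern mentioned above is a retry loop like the following sketch; on x64 the CompareExchange call itself compiles down to the lock cmpxchg shown above:

using System.Threading;

public static class AtomicOps
{
    // Lock-free increment built on compare-and-swap.
    public static int IncrementWithCas(ref int location)
    {
        int current, desired;
        do
        {
            current = Volatile.Read(ref location);
            desired = current + 1;
            // The store happens only if 'location' still equals 'current';
            // otherwise another thread won the race and we retry with the
            // freshly observed value.
        } while (Interlocked.CompareExchange(ref location, desired, current) != current);
        return desired;
    }
}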

Conclusion

Mutual exclusion and locking face many challenges. Often the simplest solutions are the safest; however, testing and proving that the mutual exclusion is correct and complete is not straightforward. Performance-critical applications should avoid locking, and only enter mutual exclusion when it is unavoidable for synchronizing shared state.