Faster Bound Checks

In a recent pull request I received some comments on how to do bound checks better when indexing into arrays. The original code received a span and an index (scan). If the index was larger to the span's length, it returned minus one. Otherwise it indexes into the buffer and does a further lookup.

private static int ReadHex(int scan, ReadOnlySpan<char> buffer)
{
    if (scan >= buffer.Length)
        return -1;
    int c = buffer[scan];
    return c >= CharToHexLookup.Length ? -1 : CharToHexLookup[c];
}

After implementing the suggestion, the code looks nearly identical, but it is using uint types for the comparing the index variable with the length of the bugger:

private static int ReadHex(int scan, ReadOnlySpan<char> buffer)
{
    if ((uint)scan >= (uint)buffer.Length)
        return -1;
    int c = buffer[scan];
    return c >= CharToHexLookup.Length ? -1 : CharToHexLookup[c];
}

A careful reader may ask why this would be better. It is clearly less readable, and also a reader might even worry on the extra casting.

From the context of the code it was clear that index can never be a negative value, and a length of a span should neither be negative, hence casting is a safe operation in this case.

The above change is suggested for performance reasons in very tight loops and performance sensitive code paths. The implementation with the unit comparison executes faster, as the JIT compiler may omit the extra bound check when indexing into the span: buffer[scan]. In this post I decided to look into the details of this behavior.

Performance Measurement

First of all, let's confirm the performance benefits. I used BenchmarkDotNet for the comparison. I filled an array with random chars, and iterated all the 4096 elements through the ReadHex method.

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1237 (21H1/May2021Update)
Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-rc.1.21463.6
  [Host]     : .NET 6.0.0 (6.0.21.45113), X64 RyuJIT
  DefaultJob : .NET 6.0.0 (6.0.21.45113), X64 RyuJIT


|        Method |     Mean |     Error |    StdDev |
|-------------- |---------:|----------:|----------:|
| NoBoundChecks | 6.188 us | 0.1215 us | 0.1703 us |
|   BoundChecks | 8.701 us | 0.1717 us | 0.2916 us |

The results show that the one without the bound checks (the one with the uint comparison) executes faster. This validates the original suggestion on the pull request. The next step is to look into the JIT compiled code. I used WinDBG tool for the purpose. Because of recent changes in compilation, only quick JITTED assembly codes were resolved on my first attempt. After careful comparison of those results, I found no difference in the compiled code.

In a second attempt I added the following properties to the csproj to disable quick JIT and tiered compilation features.

<TieredCompilation>false</TieredCompilation>
<TieredCompilationQuickJit>false</TieredCompilationQuickJit>
<TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>

Assembly Code

Compiling the application with these properties, WinDBG shows clearly the difference between the two implementations. Here comes the JITTED code of the two methods, and some further explanation below.

Without Bound Checks

0:012> !name2ee ForLoopBoundChecking.dll!ForLoopBoundChecking.BoundchecksBenchmarks.ReadHex
Module:      00007ff7afd95df0
Assembly:    ForLoopBoundChecking.dll
Token:       0000000006000007
MethodDesc:  00007ff7afe1b908
Name:        ForLoopBoundChecking.BoundchecksBenchmarks.ReadHex(Int32, System.ReadOnlySpan`1<Char>)
JITTED Code Address: 00007ff7afcea580
0:012> !U /d 00007ff7afcea580
Normal JIT generated code
ForLoopBoundChecking.BoundchecksBenchmarks.ReadHex(Int32, System.ReadOnlySpan`1<Char>)
ilAddr is 00000230ACFF215C pImport is 000001EDA08FF300
Begin 00007FF7AFCEA580, size 38

C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 61:
>>> 00007ff7`afcea580 4c8b02          mov     r8,qword ptr [rdx]
00007ff7`afcea583 8b4208          mov     eax,dword ptr [rdx+8]
00007ff7`afcea586 3bc8            cmp     ecx,eax
00007ff7`afcea588 7206            jb      00007ff7`afcea590

C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 62:
00007ff7`afcea58a b8ffffffff      mov     eax,0FFFFFFFFh
00007ff7`afcea58f c3              ret

C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 63:
00007ff7`afcea590 4863c1          movsxd  rax,ecx
00007ff7`afcea593 410fb70440      movzx   eax,word ptr [r8+rax*2]

C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 64:
00007ff7`afcea598 3d00010000      cmp     eax,100h
00007ff7`afcea59d 7d13            jge     00007ff7`afcea5b2
00007ff7`afcea59f 4863c0          movsxd  rax,eax
00007ff7`afcea5a2 48ba5831ffac30020000 mov rdx,offset ForLoopBoundChecking!COM+_Entry_Point <PERF> (ForLoopBoundChecking+0x3158) (00000230`acff3158)
00007ff7`afcea5ac 480fbe0410      movsx   rax,byte ptr [rax+rdx]
00007ff7`afcea5b1 c3              ret
00007ff7`afcea5b2 b8ffffffff      mov     eax,0FFFFFFFFh
00007ff7`afcea5b7 c3              ret

With Bound Checks

0:012> !name2ee ForLoopBoundChecking.dll!ForLoopBoundChecking.BoundchecksBenchmarks.ReadHexBoundChecks
Module:      00007ff7afd95df0
Assembly:    ForLoopBoundChecking.dll
Token:       0000000006000008
MethodDesc:  00007ff7afe1b920
Name:        ForLoopBoundChecking.BoundchecksBenchmarks.ReadHexBoundChecks(Int32, System.ReadOnlySpan`1<Char>)
JITTED Code Address: 00007ff7afcea920
0:012> !U /d 00007ff7afcea920
Normal JIT generated code
ForLoopBoundChecking.BoundchecksBenchmarks.ReadHexBoundChecks(Int32, System.ReadOnlySpan`1<Char>)
ilAddr is 00000230ACFF21A0 pImport is 000001EDA08FF1A0
Begin 00007FF7AFCEA920, size 52

C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 69:
>>> 00007ff7`afcea920 4883ec28        sub     rsp,28h
00007ff7`afcea924 4c8b02          mov     r8,qword ptr [rdx]
00007ff7`afcea927 8b5208          mov     edx,dword ptr [rdx+8]
00007ff7`afcea92a 3bca            cmp     ecx,edx
00007ff7`afcea92c 7c0a            jl      00007ff7`afcea938

C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 70:
00007ff7`afcea92e b8ffffffff      mov     eax,0FFFFFFFFh
00007ff7`afcea933 4883c428        add     rsp,28h
00007ff7`afcea937 c3              ret

C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 71:
00007ff7`afcea938 3bca            cmp     ecx,edx
00007ff7`afcea93a 7330            jae     00007ff7`afcea96c
00007ff7`afcea93c 4863c1          movsxd  rax,ecx
00007ff7`afcea93f 410fb70440      movzx   eax,word ptr [r8+rax*2]

C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 72:
00007ff7`afcea944 3d00010000      cmp     eax,100h
00007ff7`afcea949 7d17            jge     00007ff7`afcea962
00007ff7`afcea94b 4863c0          movsxd  rax,eax
*** WARNING: Unable to verify checksum for C:\Users\user\source\repos\Playground\ForLoopBoundChecking\bin\Release\net6.0\ForLoopBoundChecking.dll
00007ff7`afcea94e 48ba5831ffac30020000 mov rdx,offset ForLoopBoundChecking!COM+_Entry_Point <PERF> (ForLoopBoundChecking+0x3158) (00000230`acff3158)
00007ff7`afcea958 480fbe0410      movsx   rax,byte ptr [rax+rdx]
00007ff7`afcea95d 4883c428        add     rsp,28h
00007ff7`afcea961 c3              ret
00007ff7`afcea962 b8ffffffff      mov     eax,0FFFFFFFFh
00007ff7`afcea967 4883c428        add     rsp,28h
00007ff7`afcea96b c3              ret
00007ff7`afcea96c e89f9fc95f      call    coreclr!JIT_RngChkFail (00007ff8`0f984910)
00007ff7`afcea971 cc              int     3

The first thing to notice is the code size is much smaller in case of the no bound checks. The generated code size with bound checks is 52, while without them is 38. The key difference is that in line Program.cs @ 71, there are compare (cmp) and jump (jae) instructions emitted before the movsxd in the code sample of With Bound Checks, while these instructions are missing in the optimized case. If this check fails it jumps to coreclr!JIT_RngChkFail to execute the code of throwing an IndexOutOfRangeException. However, a few lines above in Program.cs @ 69 the same check is already done, and we return -1 if the check fails. So the second check in Program.cs @ 71 may be omitted, as well as the instructions to throw the exception.

Conclusion

With these optimizations the code is smaller and faster to execute, resulting a net positive performance improvement. I am glad this suggestion has been added to the pull request as it is a good example for educating developers on best practices. However, we shall keep in mind to apply such optimizations only along with careful performance measurements, and in very tight loops. Otherwise, the performance gains might not compensate for the readability concerns.