Faster Bound Checks
02/06/2022
6 minutes
In a recent pull request I received some comments on how to do bound checks better when indexing into arrays. The original code received a span and an index (scan). If the index was larger to the span's length, it returned minus one. Otherwise it indexes into the buffer and does a further lookup.
private static int ReadHex(int scan, ReadOnlySpan<char> buffer) { if (scan >= buffer.Length) return -1; int c = buffer[scan]; return c >= CharToHexLookup.Length ? -1 : CharToHexLookup[c]; }
After implementing the suggestion, the code looks nearly identical, but it is using uint
types for the comparing the index variable with the length of the bugger:
private static int ReadHex(int scan, ReadOnlySpan<char> buffer) { if ((uint)scan >= (uint)buffer.Length) return -1; int c = buffer[scan]; return c >= CharToHexLookup.Length ? -1 : CharToHexLookup[c]; }
A careful reader may ask why this would be better. It is clearly less readable, and also a reader might even worry on the extra casting.
From the context of the code it was clear that index can never be a negative value, and a length of a span should neither be negative, hence casting is a safe operation in this case.
The above change is suggested for performance reasons in very tight loops and performance sensitive code paths. The implementation with the unit
comparison executes faster, as the JIT compiler may omit the extra bound check when indexing into the span: buffer[scan]
. In this post I decided to look into the details of this behavior.
Performance Measurement
First of all, let's confirm the performance benefits. I used BenchmarkDotNet for the comparison. I filled an array with random chars, and iterated all the 4096 elements through the ReadHex
method.
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1237 (21H1/May2021Update)
Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-rc.1.21463.6
[Host] : .NET 6.0.0 (6.0.21.45113), X64 RyuJIT
DefaultJob : .NET 6.0.0 (6.0.21.45113), X64 RyuJIT
| Method | Mean | Error | StdDev |
|-------------- |---------:|----------:|----------:|
| NoBoundChecks | 6.188 us | 0.1215 us | 0.1703 us |
| BoundChecks | 8.701 us | 0.1717 us | 0.2916 us |
The results show that the one without the bound checks (the one with the uint
comparison) executes faster. This validates the original suggestion on the pull request. The next step is to look into the JIT compiled code. I used WinDBG tool for the purpose. Because of recent changes in compilation, only quick JITTED assembly codes were resolved on my first attempt. After careful comparison of those results, I found no difference in the compiled code.
In a second attempt I added the following properties to the csproj
to disable quick JIT and tiered compilation features.
<TieredCompilation>false</TieredCompilation> <TieredCompilationQuickJit>false</TieredCompilationQuickJit> <TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>
Assembly Code
Compiling the application with these properties, WinDBG shows clearly the difference between the two implementations. Here comes the JITTED code of the two methods, and some further explanation below.
Without Bound Checks
0:012> !name2ee ForLoopBoundChecking.dll!ForLoopBoundChecking.BoundchecksBenchmarks.ReadHex
Module: 00007ff7afd95df0
Assembly: ForLoopBoundChecking.dll
Token: 0000000006000007
MethodDesc: 00007ff7afe1b908
Name: ForLoopBoundChecking.BoundchecksBenchmarks.ReadHex(Int32, System.ReadOnlySpan`1<Char>)
JITTED Code Address: 00007ff7afcea580
0:012> !U /d 00007ff7afcea580
Normal JIT generated code
ForLoopBoundChecking.BoundchecksBenchmarks.ReadHex(Int32, System.ReadOnlySpan`1<Char>)
ilAddr is 00000230ACFF215C pImport is 000001EDA08FF300
Begin 00007FF7AFCEA580, size 38
C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 61:
>>> 00007ff7`afcea580 4c8b02 mov r8,qword ptr [rdx]
00007ff7`afcea583 8b4208 mov eax,dword ptr [rdx+8]
00007ff7`afcea586 3bc8 cmp ecx,eax
00007ff7`afcea588 7206 jb 00007ff7`afcea590
C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 62:
00007ff7`afcea58a b8ffffffff mov eax,0FFFFFFFFh
00007ff7`afcea58f c3 ret
C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 63:
00007ff7`afcea590 4863c1 movsxd rax,ecx
00007ff7`afcea593 410fb70440 movzx eax,word ptr [r8+rax*2]
C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 64:
00007ff7`afcea598 3d00010000 cmp eax,100h
00007ff7`afcea59d 7d13 jge 00007ff7`afcea5b2
00007ff7`afcea59f 4863c0 movsxd rax,eax
00007ff7`afcea5a2 48ba5831ffac30020000 mov rdx,offset ForLoopBoundChecking!COM+_Entry_Point <PERF> (ForLoopBoundChecking+0x3158) (00000230`acff3158)
00007ff7`afcea5ac 480fbe0410 movsx rax,byte ptr [rax+rdx]
00007ff7`afcea5b1 c3 ret
00007ff7`afcea5b2 b8ffffffff mov eax,0FFFFFFFFh
00007ff7`afcea5b7 c3 ret
With Bound Checks
0:012> !name2ee ForLoopBoundChecking.dll!ForLoopBoundChecking.BoundchecksBenchmarks.ReadHexBoundChecks
Module: 00007ff7afd95df0
Assembly: ForLoopBoundChecking.dll
Token: 0000000006000008
MethodDesc: 00007ff7afe1b920
Name: ForLoopBoundChecking.BoundchecksBenchmarks.ReadHexBoundChecks(Int32, System.ReadOnlySpan`1<Char>)
JITTED Code Address: 00007ff7afcea920
0:012> !U /d 00007ff7afcea920
Normal JIT generated code
ForLoopBoundChecking.BoundchecksBenchmarks.ReadHexBoundChecks(Int32, System.ReadOnlySpan`1<Char>)
ilAddr is 00000230ACFF21A0 pImport is 000001EDA08FF1A0
Begin 00007FF7AFCEA920, size 52
C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 69:
>>> 00007ff7`afcea920 4883ec28 sub rsp,28h
00007ff7`afcea924 4c8b02 mov r8,qword ptr [rdx]
00007ff7`afcea927 8b5208 mov edx,dword ptr [rdx+8]
00007ff7`afcea92a 3bca cmp ecx,edx
00007ff7`afcea92c 7c0a jl 00007ff7`afcea938
C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 70:
00007ff7`afcea92e b8ffffffff mov eax,0FFFFFFFFh
00007ff7`afcea933 4883c428 add rsp,28h
00007ff7`afcea937 c3 ret
C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 71:
00007ff7`afcea938 3bca cmp ecx,edx
00007ff7`afcea93a 7330 jae 00007ff7`afcea96c
00007ff7`afcea93c 4863c1 movsxd rax,ecx
00007ff7`afcea93f 410fb70440 movzx eax,word ptr [r8+rax*2]
C:\Users\user\source\repos\Playground\ForLoopBoundChecking\Program.cs @ 72:
00007ff7`afcea944 3d00010000 cmp eax,100h
00007ff7`afcea949 7d17 jge 00007ff7`afcea962
00007ff7`afcea94b 4863c0 movsxd rax,eax
*** WARNING: Unable to verify checksum for C:\Users\user\source\repos\Playground\ForLoopBoundChecking\bin\Release\net6.0\ForLoopBoundChecking.dll
00007ff7`afcea94e 48ba5831ffac30020000 mov rdx,offset ForLoopBoundChecking!COM+_Entry_Point <PERF> (ForLoopBoundChecking+0x3158) (00000230`acff3158)
00007ff7`afcea958 480fbe0410 movsx rax,byte ptr [rax+rdx]
00007ff7`afcea95d 4883c428 add rsp,28h
00007ff7`afcea961 c3 ret
00007ff7`afcea962 b8ffffffff mov eax,0FFFFFFFFh
00007ff7`afcea967 4883c428 add rsp,28h
00007ff7`afcea96b c3 ret
00007ff7`afcea96c e89f9fc95f call coreclr!JIT_RngChkFail (00007ff8`0f984910)
00007ff7`afcea971 cc int 3
The first thing to notice is the code size is much smaller in case of the no bound checks. The generated code size with bound checks is 52, while without them is 38. The key difference is that in line Program.cs @ 71, there are compare (cmp) and jump (jae) instructions emitted before the movsxd in the code sample of With Bound Checks, while these instructions are missing in the optimized case. If this check fails it jumps to coreclr!JIT_RngChkFail to execute the code of throwing an IndexOutOfRangeException
. However, a few lines above in Program.cs @ 69 the same check is already done, and we return -1 if the check fails. So the second check in Program.cs @ 71 may be omitted, as well as the instructions to throw the exception.
Conclusion
With these optimizations the code is smaller and faster to execute, resulting a net positive performance improvement. I am glad this suggestion has been added to the pull request as it is a good example for educating developers on best practices. However, we shall keep in mind to apply such optimizations only along with careful performance measurements, and in very tight loops. Otherwise, the performance gains might not compensate for the readability concerns.