Inlining Code 2
08/23/2020
6 minutes
This post is the second part of a two-post series. It discusses inlining in relation to DCE and exception handling.
Introduction
This post is about code inlining by the JIT compiler and Dead Code Elimination (DCE).
Code inlining: the call to a method at the call site is replaced with the body of the invoked method.
DCE: modern compilers are smart enough to detect code segments that have no effect, so they can be removed.
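As a refresher on the two definitions above, here is a minimal sketch of what inlining followed by DCE conceptually does. The class and method names (InlineDceSketch, IsAboveLimit, AsWritten, AfterInliningAndDce) are made up for illustration, and the second method is a hand-written approximation, not actual JIT output.

```csharp
public static class InlineDceSketch
{
    // Small helper that is a natural inlining candidate.
    private static bool IsAboveLimit(int value) => value > 500;

    // The code as written in C#.
    public static int AsWritten()
    {
        int hits = 0;
        // After inlining, this condition becomes "0 > 500", which is always false.
        if (IsAboveLimit(0))
        {
            hits++; // dead code once the condition is known to be false
        }
        return hits;
    }

    // Roughly what remains once the call is inlined and the dead branch is removed.
    public static int AfterInliningAndDce() => 0;
}
```

Both methods return the same value; only the amount of work needed to get there differs.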
I will walk through two examples showing these compiler features in action with regard to exception handling and recursive methods. At this point, I expect the reader to have a good understanding of both of the above compiler features.
Code inlining is really powerful, but it has a couple of restrictions. The compiler may not be able to inline code in the following cases:
After inlining, the resulting method would be too 'large'.
The method to be inlined has a try-catch block.
The method to be inlined is virtual.
The method to be inlined is recursive.
The method to be inlined has the NoInlining attribute.
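The last restriction in the list above can be sketched as follows. The InliningHints class and its methods are hypothetical. MethodImplOptions.NoInlining and MethodImplOptions.AggressiveInlining come from System.Runtime.CompilerServices. The second option is a hint to the JIT and does not guarantee that inlining happens.

```csharp
using System.Runtime.CompilerServices;

public static class InliningHints
{
    // Tells the JIT not to inline this method, even though it is tiny.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int AddNoInline(int a, int b) => a + b;

    // Asks the JIT to inline this method if it can; this is a hint only.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int AddAggressive(int a, int b) => a + b;
}
```

Both methods behave identically from the caller's point of view. The attribute only influences the generated machine code.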
In this post I will look at try-catch blocks in relation to DCE and inlining.
Methods with Try-Catch blocks
I have implemented the following two classes:
public class ExInlineAndDceHandle
{
    private const int _param = 0;

    public int InnerLoop()
    {
        int j = 0;
        for (int i = 0; i < Program.FooCouter; i++)
        {
            if (Foo(_param))
                j++;
        }
        return j;
    }

    public bool Foo(int limit)
    {
        if (limit > 500)
        {
            HandleEx();
        }
        return false;
    }

    public bool HandleEx()
    {
        try { return false; }
        catch { return true; }
    }
}

public class ExInlineAndDce
{
    private const int _param = 0;

    public int InnerLoop()
    {
        int j = 0;
        for (int i = 0; i < Program.FooCouter; i++)
        {
            if (Foo(_param))
                j++;
        }
        return j;
    }

    public bool Foo(int limit)
    {
        if (limit > 500)
        {
            try { return false; }
            catch { return true; }
        }
        return false;
    }
}
ExInlineAndDceHandle and ExInlineAndDce both implement the same logic. Neither of them does anything useful. In both cases we end up invoking Foo, which returns false. The only difference between the two classes is that ExInlineAndDceHandle has the exception handling in a separate method: HandleEx. Note that neither of the Foo methods executes the exception handling code, as in both cases limit is 0, well below 500.
Let's benchmark both InnerLoop methods, then look at the JITTED code to explain the results. For benchmarking I use BenchmarkDotNet; the FooCouter static field is set to 10000000.
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.200
  [Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Method | Mean | Error | StdDev |
---|---|---|---|
ExInlineAndDceHandle | 3.372 ms | 0.0652 ms | 0.0640 ms |
ExInlineAndDce | 24.038 ms | 0.3469 ms | 0.3075 ms |
The results show that the ExInlineAndDceHandle solution is roughly 7 times faster. To understand the reason, let's look at the JITTED code. I use WinDBG to look into the generated assembly code.
ExInlineAndDceHandle
InlineCode.ExInlineAndDceHandle.InnerLoop()
Begin 00007FFA41E4B200, size e
00007ffa`41e4b200 33c0 xor eax,eax
00007ffa`41e4b202 ffc0 inc eax
00007ffa`41e4b204 3d80969800 cmp eax,989680h
00007ffa`41e4b209 7cf7 jl 00007ffa`41e4b202
00007ffa`41e4b20b 33c0 xor eax,eax
00007ffa`41e4b20d c3 ret
!name2ee InlineCode ExInlineAndDceHandle.Foo
Not JITTED yet. Use !bpmd -md 00007FFA4175FBF8 to break on run.
The generated InnerLoop method is super compact. The Foo method is completely inlined and removed, as it only returns false. Incrementing the variable j is also removed, as the if condition on the Foo call is always false. Basically, an empty loop is left. 989680h (10000000 in decimal) is the FooCouter static field's value.
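To make the listing above easier to read, here is a hand-written C# approximation of what the generated code does. The names InnerLoopAfterJit and LoopCount are hypothetical, with LoopCount standing in for the FooCouter value. This is a reading aid for the assembly, not something the compiler emits.

```csharp
public static class InnerLoopAfterJit
{
    // Stand-in for the FooCouter value; 0x989680 is 10000000 in decimal.
    public const int LoopCount = 0x989680;

    public static int InnerLoop()
    {
        int i = 0;              // xor eax,eax
        do
        {
            i++;                // inc eax
        } while (i < LoopCount); // cmp eax,989680h / jl
        return 0;               // xor eax,eax / ret
    }
}
```

Neither the call to Foo nor the variable j appears anywhere: only the loop counter survives, and the method returns the constant 0.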
ExInlineAndDce
The InnerLoop method:
00007ffa`41e4b260 57 push rdi
00007ffa`41e4b261 56 push rsi
00007ffa`41e4b262 53 push rbx
00007ffa`41e4b263 4883ec20 sub rsp,20h
00007ffa`41e4b267 488bf1 mov rsi,rcx
00007ffa`41e4b26a 33ff xor edi,edi
00007ffa`41e4b26c 33db xor ebx,ebx
00007ffa`41e4b26e 488bce mov rcx,rsi
00007ffa`41e4b271 33d2 xor edx,edx
00007ffa`41e4b273 e8c86e84ff call 00007ffa`41692140 (InlineCode.ExInlineAndDce.Foo(Int32), mdToken: 0000000006000012)
00007ffa`41e4b278 85c0 test eax,eax
00007ffa`41e4b27a 7402 je 00007ffa`41e4b27e
00007ffa`41e4b27c ffc7 inc edi
00007ffa`41e4b27e ffc3 inc ebx
00007ffa`41e4b280 81fb80969800 cmp ebx,989680h
00007ffa`41e4b286 7ce6 jl 00007ffa`41e4b26e
00007ffa`41e4b288 8bc7 mov eax,edi
00007ffa`41e4b28a 4883c420 add rsp,20h
00007ffa`41e4b28e 5b pop rbx
00007ffa`41e4b28f 5e pop rsi
00007ffa`41e4b290 5f pop rdi
00007ffa`41e4b291 c3 ret
This method does not inline Foo; it executes a call instruction within the loop. The generated code for the Foo method is also more complex. Please look at the size of this code; I will not explain it in detail.
00007ffa`41e4b2b0 55 push rbp
00007ffa`41e4b2b1 4883ec20 sub rsp,20h
00007ffa`41e4b2b5 488d6c2420 lea rbp,[rsp+20h]
00007ffa`41e4b2ba 33c0 xor eax,eax
00007ffa`41e4b2bc 8945fc mov dword ptr [rbp-4],eax
00007ffa`41e4b2bf 488945f0 mov qword ptr [rbp-10h],rax
00007ffa`41e4b2c3 488965e0 mov qword ptr [rbp-20h],rsp
00007ffa`41e4b2c7 48894d10 mov qword ptr [rbp+10h],rcx
00007ffa`41e4b2cb 895518 mov dword ptr [rbp+18h],edx
00007ffa`41e4b2ce 817d18f4010000 cmp dword ptr [rbp+18h],1F4h
00007ffa`41e4b2d5 7e07 jle 00007ffa`41e4b2de
00007ffa`41e4b2d7 33c0 xor eax,eax
00007ffa`41e4b2d9 8945fc mov dword ptr [rbp-4],eax
00007ffa`41e4b2dc eb08 jmp 00007ffa`41e4b2e6
00007ffa`41e4b2de 33c0 xor eax,eax
00007ffa`41e4b2e0 488d6500 lea rsp,[rbp]
00007ffa`41e4b2e4 5d pop rbp
00007ffa`41e4b2e5 c3 ret
00007ffa`41e4b2e6 8b45fc mov eax,dword ptr [rbp-4]
00007ffa`41e4b2e9 488d6500 lea rsp,[rbp]
00007ffa`41e4b2ed 5d pop rbp
00007ffa`41e4b2ee c3 ret
00007ffa`41e4b2ef 55 push rbp
00007ffa`41e4b2f0 4883ec10 sub rsp,10h
00007ffa`41e4b2f4 488b29 mov rbp,qword ptr [rcx]
00007ffa`41e4b2f7 48892c24 mov qword ptr [rsp],rbp
00007ffa`41e4b2fb 488d6d20 lea rbp,[rbp+20h]
00007ffa`41e4b2ff 488955f0 mov qword ptr [rbp-10h],rdx
00007ffa`41e4b303 c745fc01000000 mov dword ptr [rbp-4],1
00007ffa`41e4b30a 488d05d5ffffff lea rax,[00007ffa`41e4b2e6]
00007ffa`41e4b311 4883c410 add rsp,10h
00007ffa`41e4b315 5d pop rbp
00007ffa`41e4b316 c3 ret
The whole Foo method is JITTED including the try-catch part, even though that branch is not executed. DCE does not remove this branch, so the generated code is larger, and because the method contains the try-catch block, it cannot be inlined.
At the C# level, we would expect both methods to have similar performance: in both cases the InnerLoop method is identical, and Foo differs only in a code path that is never executed. Still, they have completely different performance characteristics.
In the first case, Foo contains no try-catch block of its own, so it can be inlined. DCE then removes the never-taken HandleEx invocation, which speeds up overall execution.
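The same idea can be applied outside of this artificial benchmark: keep the frequently called method free of try-catch and move the exception handling into a separate helper. The sketch below uses hypothetical names (SafeDivider, Divide, DivideSlow) and is only one possible way to structure such code. Whether the JIT actually inlines Divide depends on the runtime version and its heuristics.

```csharp
using System;

public static class SafeDivider
{
    // Hot path: no try-catch here, so the method stays an inlining candidate.
    public static int Divide(int dividend, int divisor)
    {
        if (divisor == 0)
        {
            // Rare case: delegate to the helper that owns the exception handling.
            return DivideSlow(dividend, divisor);
        }
        return dividend / divisor;
    }

    // Cold path: the try-catch block lives in its own method.
    private static int DivideSlow(int dividend, int divisor)
    {
        try
        {
            return dividend / divisor;
        }
        catch (DivideByZeroException)
        {
            return 0;
        }
    }
}
```

Calling SafeDivider.Divide(10, 2) takes the hot path, while SafeDivider.Divide(10, 0) falls back to the helper and returns 0.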