Inlining Code
08/01/2020
6 minutes
This post is the first part of a 2 posts long series, discussing code inlining in relation to DCE and recursive methods.
Introduction
Code Inline: call of the method at the callsite is replaced with the invoked method body.
DCE: Modern compilers are smart enough to detect code segments, that has no effects, hence they can be removed.
I will walk to through two examples showing these compiler features in action in regards to exception handling and recursive methods. At this point, I expect the reader to have a good understanding on both of the above compiler features.
Code Inlining is a really powerful, but it has a couple of restrictions. The compiler may not be able inline code in the following cases:
After the inlining, the result method would be too 'large'.
The method to be inlined has a try-catch block.
The method to be inlined is virtual.
The method to be inlined is recursive.
The method to be inlined has
NoInlining
attribute.
In this post I will look at recursive methods in relation to DCE and Inlining.
Recursive methods
I have the following to classes implementated:
public class InlineAndDce { private const int _param = 0; public int InnerLoop() { int j = 0; for (int i = 0; i < Program.FooCouter; i++) { if (Foo(_param) || Foo(_param)) j++; } return j; } public bool Foo(int limit) { if (limit > 500) { Foo(--limit); } return false; } } public class InlineAndDce500 { private const int _param = 501; public int InnerLoop() { int j = 0; for (int i = 0; i < Program.FooCouter; i++) { if (Foo(_param)) j++; } return j; } public bool Foo(int limit) { if (limit > 500) { return Foo(--limit); } return false; } }
InlineAndDce
and InlineAndDce500
both implement the same logic. Neither of them do anything useful. In both cases we end up invoking Foo
twice.
InlineAndDce
in the InnerLoop
method invokes Foo
twice. Inside Foo
method, it always returns false.
InlineAndDce500
in its InnerLoop
method invokes Foo
once. But because InlineAndDce500
has _param
set to 501, the Foo
method recursively invokes itself once.
Let's benchmark both InnerLoop
methods, then look at the JITTED code for further explanation of the results. For Benchmarking I use BenchmarkDotNet. Note, that FooCouter
static field is set to 10000000.
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363, Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores, .NET Core SDK=3.1.200[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Method | Mean | Error | StdDev |
---|---|---|---|
InnerLoop | 3.373 ms | 0.0674 ms | 0.0631 ms |
InnerLoop500 | 24.457 ms | 0.4890 ms | 0.5232 ms |
Results show the InnerLoop
solution is several times faster. To understand the reason, let's look at the JITTED code. I use WinDBG, to look into the generated assembly code.
InlineAndDce
00007ffa`41e79880 33c0 xor eax,eax
00007ffa`41e79882 ffc0 inc eax
00007ffa`41e79884 3d80969800 cmp eax,989680h
00007ffa`41e79889 7cf7 jl 00007ffa`41e79882
00007ffa`41e7988b 33c0 xor eax,eax
00007ffa`41e7988d c3 ret
!name2ee InlineCode InlineAndDce.Foo Not JITTED yet. Use !bpmd -md 00007FFA4178FBF8 to break on run.
The generated InnerLoop
method is super compact. Foo
method is completly inlined and removed, as it only returns false. Incrementing variable j
is also removed, as the if
condition around the Foo statement always returns false. Basically an empty loop is left. 989680h is the FooCouter
static field's value.
InlineAndDce500
The InnerLoop
method:
00007ffa`41e798e0 57 push rdi 00007ffa`41e798e1 56 push rsi 00007ffa`41e798e2 53 push rbx 00007ffa`41e798e3 4883ec20 sub rsp,20h 00007ffa`41e798e7 488bf1 mov rsi,rcx 00007ffa`41e798ea 33ff xor edi,edi 00007ffa`41e798ec 33db xor ebx,ebx 00007ffa`41e798ee 488bce mov rcx,rsi 00007ffa`41e798f1 baf4010000 mov edx,1F4h 00007ffa`41e798f6 e83d8884ff call 00007ffa`416c2138 (InlineCode.InlineAndDce500.Foo(Int32), mdToken: 000000000600000B) 00007ffa`41e798fb 0fb6c0 movzx eax,al 00007ffa`41e798fe 85c0 test eax,eax 00007ffa`41e79900 7402 je 00007ffa`41e79904 00007ffa`41e79902 ffc7 inc edi 00007ffa`41e79904 ffc3 inc ebx 00007ffa`41e79906 81fb80969800 cmp ebx,989680h 00007ffa`41e7990c 7ce0 jl 00007ffa`41e798ee 00007ffa`41e7990e 8bc7 mov eax,edi 00007ffa`41e79910 4883c420 add rsp,20h 00007ffa`41e79914 5b pop rbx 00007ffa`41e79915 5e pop rsi 00007ffa`41e79916 5f pop rdi 00007ffa`41e79917 c3 ret
This method does not inline the Foo
, it makes a call instruction within the loop. The generated code for the Foo
method is also more complex:
00007ffa`41e79930 55 push rbp
00007ffa`41e79931 4883ec30 sub rsp,30h
00007ffa`41e79935 488d6c2430 lea rbp,[rsp+30h]
00007ffa`41e7993a 48894d10 mov qword ptr [rbp+10h],rcx
00007ffa`41e7993e 895518 mov dword ptr [rbp+18h],edx
00007ffa`41e79941 817d18f4010000 cmp dword ptr [rbp+18h],1F4h
00007ffa`41e79948 7e21 jle 00007ffa`41e7996b
00007ffa`41e7994a 8b4d18 mov ecx,dword ptr [rbp+18h]
00007ffa`41e7994d ffc9 dec ecx
00007ffa`41e7994f 894dfc mov dword ptr [rbp-4],ecx
00007ffa`41e79952 8b4dfc mov ecx,dword ptr [rbp-4]
00007ffa`41e79955 894d18 mov dword ptr [rbp+18h],ecx
00007ffa`41e79958 488b4d10 mov rcx,qword ptr [rbp+10h]
00007ffa`41e7995c 8b55fc mov edx,dword ptr [rbp-4]
00007ffa`41e7995f e8d48784ff call 00007ffa`416c2138 (InlineCode.InlineAndDce500.Foo(Int32), mdToken: 000000000600000B)
00007ffa`41e79964 90 nop
00007ffa`41e79965 488d6500 lea rsp,[rbp]
00007ffa`41e79969 5d pop rbp
00007ffa`41e7996a c3 ret
00007ffa`41e7996b 33c0 xor eax,eax
00007ffa`41e7996d 488d6500 lea rsp,[rbp]
00007ffa`41e79971 5d pop rbp
00007ffa`41e79972 c3 ret
The whole method is JITTED as a recursive method, as the recursive branch is executed in this case. DCE is not able the remove this branch, the generated code is larger and recursive, hence it cannot be inlined.
Although at C# level, we expect both methods to have a similar performance: in both cases Foo
method is indentical and invoked twice, the overall result is the same, but with a completely different performance characteristic.
In the first case DCE was able to remove the recurision, hence enabling inlining and speeding up overall execution. In the next post I will look at a similar example, but in that case with exceptions instead of recursive methods.