Inlining Code 2

This post is the second part of a two-part series discussing inlining in relation to DCE and exception handling.

Introduction

This post is about code inlining by the JIT compiler and Dead Code Elimination (DCE).

  • Code Inlining: the call to a method at the call site is replaced with the body of the invoked method.

  • DCE: modern compilers are smart enough to detect code segments that have no effect, so those segments can be removed.

I will walk you through two examples showing these compiler features in action with regard to exception handling and recursive methods. At this point, I expect the reader to have a good understanding of both of the above compiler features.

Code inlining is really powerful, but it has a couple of restrictions. The compiler may not be able to inline code in the following cases:

  • After inlining, the resulting method would be too 'large'.

  • The method to be inlined has a try-catch block.

  • The method to be inlined is virtual.

  • The method to be inlined is recursive.

  • The method to be inlined is marked with the NoInlining attribute, that is [MethodImpl(MethodImplOptions.NoInlining)].
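
The last restriction is the only one in the list that the developer sets explicitly. As a minimal sketch (the class and method names below are hypothetical and not part of the benchmarked code), this is how a method can be excluded from inlining, next to the opposite hint:

```csharp
using System.Runtime.CompilerServices;

// Hypothetical helper class, used only to illustrate the attributes.
public static class InliningHints
{
    // The JIT will not inline this method; every call stays a real call.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int AddNoInline(int a, int b) => a + b;

    // This only asks the JIT to inline when it can; it is a hint, not a guarantee.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int AddInline(int a, int b) => a + b;
}
```

Both methods return the same value; only the generated machine code at the call site may differ.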

In this post I will look at try-catch blocks in relation to DCE and inlining.

Methods with Try-Catch blocks

I have implemented the following two classes:

public class ExInlineAndDceHandle
{
    private const int _param = 0;
    public int InnerLoop()
    {
        int j = 0;
        for (int i = 0; i < Program.FooCouter; i++)
        {
            if (Foo(_param))
                j++;
        }
        return j;
    }
    public bool Foo(int limit)
    {
        if (limit > 500)
        {
            HandleEx();
        }
        return false;
    }
    public bool HandleEx()
    {
        try
        {
            return false;
        }
        catch
        {
            return true;
        }
    }
}

public class ExInlineAndDce
{
    private const int _param = 0;
    public int InnerLoop()
    {
        int j = 0;
        for (int i = 0; i < Program.FooCouter; i++)
        {
            if (Foo(_param))
                j++;
        }
        return j;
    }
    public bool Foo(int limit)
    {
        if (limit > 500)
        {
            try
            {
                return false;
            }
            catch
            {
                return true;
            }
        }
        return false;
    }
}

ExInlineAndDceHandle and ExInlineAndDce both implement the same logic. Neither of them does anything useful. In both cases we end up invoking Foo, which returns false. The only difference between the two classes is that ExInlineAndDceHandle has the exception handling in a separate method: HandleEx. Note that neither of the Foo methods executes the exception handling code, as in both cases the limit argument is 0, well below 500.

Let's benchmark both InnerLoop methods, then look at the jitted code to explain the results. For benchmarking I use BenchmarkDotNet; the FooCouter static field is set to 10000000.

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.200
  [Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT

|               Method |      Mean |     Error |    StdDev |
|--------------------- |----------:|----------:|----------:|
| ExInlineAndDceHandle |  3.372 ms | 0.0652 ms | 0.0640 ms |
|       ExInlineAndDce | 24.038 ms | 0.3469 ms | 0.3075 ms |

The results show that the ExInlineAndDceHandle solution is roughly seven times faster (3.372 ms versus 24.038 ms). To understand the reason, let's look at the jitted code. I use WinDbg to look into the generated assembly code.

ExInlineAndDceHandle

InlineCode.ExInlineAndDceHandle.InnerLoop()
Begin 00007FFA41E4B200, size e
00007ffa`41e4b200 33c0            xor     eax,eax
00007ffa`41e4b202 ffc0            inc     eax
00007ffa`41e4b204 3d80969800      cmp     eax,989680h
00007ffa`41e4b209 7cf7            jl      00007ffa`41e4b202
00007ffa`41e4b20b 33c0            xor     eax,eax
00007ffa`41e4b20d c3              ret
!name2ee InlineCode ExInlineAndDceHandle.Foo
Not JITTED yet. Use !bpmd -md 00007FFA4175FBF8 to break on run.

The generated InnerLoop method is super compact. The Foo method is completely inlined and removed, as it only returns false; the output above also shows that Foo itself has not even been jitted. Incrementing the variable j is removed as well, because the if condition around the Foo call always evaluates to false. Basically an empty loop is left. 989680h is the value of the FooCouter static field (10000000 in decimal).
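
To make the two steps easier to follow, here is a source-level sketch of what the JIT effectively does to InnerLoop. This is only an illustration: the JIT works on IL and its own internal representation, not on C#, and the class name, the method names and the local FooCouter constant are assumptions made for this example.

```csharp
// Hypothetical illustration class; it is not part of the benchmarked code.
public static class InnerLoopStages
{
    // Stand-in for Program.FooCouter from the post (10000000 = 989680h).
    private const int FooCouter = 10_000_000;

    // Step 1, inlining: the body of Foo is pasted into the loop,
    // with the constant argument 0 taking the place of 'limit'.
    public static int AfterInlining()
    {
        int j = 0;
        for (int i = 0; i < FooCouter; i++)
        {
            int limit = 0;
            if (limit > 500)
            {
                HandleEx();       // never reached, because 0 > 500 is false
            }
            bool result = false;  // the only value Foo can return here
            if (result)
                j++;
        }
        return j;
    }

    // Step 2, DCE: the unreachable branch and the never-taken increment
    // are removed, which matches the assembly listing: an empty loop, then return 0.
    public static int AfterDce()
    {
        for (int i = 0; i < FooCouter; i++)
        {
        }
        return 0;
    }

    // Same shape as HandleEx in the post.
    private static bool HandleEx()
    {
        try
        {
            return false;
        }
        catch
        {
            return true;
        }
    }
}
```

Both methods return 0; the second one is simply the first one with everything that can never have an effect stripped away.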

ExInlineAndDce

The InnerLoop method:

00007ffa`41e4b260 57              push    rdi
00007ffa`41e4b261 56              push    rsi
00007ffa`41e4b262 53              push    rbx
00007ffa`41e4b263 4883ec20        sub     rsp,20h
00007ffa`41e4b267 488bf1          mov     rsi,rcx
00007ffa`41e4b26a 33ff            xor     edi,edi
00007ffa`41e4b26c 33db            xor     ebx,ebx
00007ffa`41e4b26e 488bce          mov     rcx,rsi
00007ffa`41e4b271 33d2            xor     edx,edx
00007ffa`41e4b273 e8c86e84ff      call    00007ffa`41692140 (InlineCode.ExInlineAndDce.Foo(Int32), mdToken: 0000000006000012)
00007ffa`41e4b278 85c0            test    eax,eax
00007ffa`41e4b27a 7402            je      00007ffa`41e4b27e
00007ffa`41e4b27c ffc7            inc     edi
00007ffa`41e4b27e ffc3            inc     ebx
00007ffa`41e4b280 81fb80969800    cmp     ebx,989680h
00007ffa`41e4b286 7ce6            jl      00007ffa`41e4b26e
00007ffa`41e4b288 8bc7            mov     eax,edi
00007ffa`41e4b28a 4883c420        add     rsp,20h
00007ffa`41e4b28e 5b              pop     rbx
00007ffa`41e4b28f 5e              pop     rsi
00007ffa`41e4b290 5f              pop     rdi
00007ffa`41e4b291 c3              ret

This method does not inline Foo; it executes a call instruction within the loop on every iteration. The generated code for the Foo method is also more complex. Without going into a deeper explanation, just look at the size of this code:

00007ffa`41e4b2b0 55              push    rbp
00007ffa`41e4b2b1 4883ec20        sub     rsp,20h
00007ffa`41e4b2b5 488d6c2420      lea     rbp,[rsp+20h]
00007ffa`41e4b2ba 33c0            xor     eax,eax
00007ffa`41e4b2bc 8945fc          mov     dword ptr [rbp-4],eax
00007ffa`41e4b2bf 488945f0        mov     qword ptr [rbp-10h],rax
00007ffa`41e4b2c3 488965e0        mov     qword ptr [rbp-20h],rsp
00007ffa`41e4b2c7 48894d10        mov     qword ptr [rbp+10h],rcx
00007ffa`41e4b2cb 895518          mov     dword ptr [rbp+18h],edx
00007ffa`41e4b2ce 817d18f4010000  cmp     dword ptr [rbp+18h],1F4h
00007ffa`41e4b2d5 7e07            jle     00007ffa`41e4b2de
00007ffa`41e4b2d7 33c0            xor     eax,eax
00007ffa`41e4b2d9 8945fc          mov     dword ptr [rbp-4],eax
00007ffa`41e4b2dc eb08            jmp     00007ffa`41e4b2e6
00007ffa`41e4b2de 33c0            xor     eax,eax
00007ffa`41e4b2e0 488d6500        lea     rsp,[rbp]
00007ffa`41e4b2e4 5d              pop     rbp
00007ffa`41e4b2e5 c3              ret
00007ffa`41e4b2e6 8b45fc          mov     eax,dword ptr [rbp-4]
00007ffa`41e4b2e9 488d6500        lea     rsp,[rbp]
00007ffa`41e4b2ed 5d              pop     rbp
00007ffa`41e4b2ee c3              ret
00007ffa`41e4b2ef 55              push    rbp
00007ffa`41e4b2f0 4883ec10        sub     rsp,10h
00007ffa`41e4b2f4 488b29          mov     rbp,qword ptr [rcx]
00007ffa`41e4b2f7 48892c24        mov     qword ptr [rsp],rbp
00007ffa`41e4b2fb 488d6d20        lea     rbp,[rbp+20h]
00007ffa`41e4b2ff 488955f0        mov     qword ptr [rbp-10h],rdx
00007ffa`41e4b303 c745fc01000000  mov     dword ptr [rbp-4],1
00007ffa`41e4b30a 488d05d5ffffff  lea     rax,[00007ffa`41e4b2e6]
00007ffa`41e4b311 4883c410        add     rsp,10h
00007ffa`41e4b315 5d              pop     rbp
00007ffa`41e4b316 c3              ret

The whole Foo method is jitted, including the try-catch part, even though that branch is never executed. DCE does not remove this branch, so the generated code is larger and still contains the try-catch block; hence Foo cannot be inlined.

At the C# level, we would expect both methods to have similar performance: in both cases the InnerLoop method is identical, and Foo differs only in a code path that is never executed. Still, they have completely different performance characteristics.

In the first case, DCE was able to remove the HandleEx invocation, which enabled inlining of the Foo method and sped up overall execution.
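
The same idea can be applied on purpose: keep the try-catch out of the frequently called method and move it into a separate helper. Below is a minimal sketch under assumptions; DigitParser, ParseOrZero and ParseSlow are hypothetical names, and whether the JIT really inlines the fast method depends on the runtime version and its heuristics.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical example class, not taken from the benchmarked code.
public static class DigitParser
{
    // Fast path: contains no try-catch block, so it stays an inlining candidate.
    public static int ParseOrZero(string text)
    {
        if (text.Length == 1 && text[0] >= '0' && text[0] <= '9')
            return text[0] - '0';

        return ParseSlow(text);
    }

    // Slow path: the exception handling lives here, outside the fast method.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int ParseSlow(string text)
    {
        try
        {
            return int.Parse(text);
        }
        catch (FormatException)
        {
            return 0; // not a number
        }
        catch (OverflowException)
        {
            return 0; // a number, but outside the range of int
        }
    }
}
```

For example, DigitParser.ParseOrZero("7") takes the fast path, while DigitParser.ParseOrZero("abc") goes through ParseSlow and falls back to 0.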