Inlining Code

This post is the first part of a 2 posts long series, discussing code inlining in relation to DCE and recursive methods.

Introduction

  • Code Inline: call of the method at the callsite is replaced with the invoked method body.

  • DCE: Modern compilers are smart enough to detect code segments, that has no effects, hence they can be removed.

I will walk to through two examples showing these compiler features in action in regards to exception handling and recursive methods. At this point, I expect the reader to have a good understanding on both of the above compiler features.

Code Inlining is a really powerful, but it has a couple of restrictions. The compiler may not be able inline code in the following cases:

  • After the inlining, the result method would be too 'large'.

  • The method to be inlined has a try-catch block.

  • The method to be inlined is virtual.

  • The method to be inlined is recursive.

  • The method to be inlined has NoInlining attribute.

In this post I will look at recursive methods in relation to DCE and Inlining.

Recursive methods

I have the following to classes implementated:

public class InlineAndDce
{
    private const int _param = 0;
    public int InnerLoop()
    {
        int j = 0;
        for (int i = 0; i < Program.FooCouter; i++)
        {
            if (Foo(_param) || Foo(_param))
                j++;
        }
        return j;
    }

    public bool Foo(int limit)
    {
        if (limit > 500)
        {
            Foo(--limit);
        }
        return false;
    }
}

public class InlineAndDce500
{
    private const int _param = 501;
    public int InnerLoop()
    {
        int j = 0;
        for (int i = 0; i < Program.FooCouter; i++)
        {
            if (Foo(_param))
                j++;
        }
        return j;
    }

    public bool Foo(int limit)
    {
        if (limit > 500)
        {
            return Foo(--limit);
        }
        return false;
    }
}

InlineAndDce and InlineAndDce500 both implement the same logic. Neither of them do anything useful. In both cases we end up invoking Foo twice.

InlineAndDce in the InnerLoop method invokes Foo twice. Inside Foo method, it always returns false.

InlineAndDce500 in its InnerLoop method invokes Foo once. But because InlineAndDce500 has _param set to 501, the Foo method recursively invokes itself once.

Let's benchmark both InnerLoop methods, then look at the JITTED code for further explanation of the results. For Benchmarking I use BenchmarkDotNet. Note, that FooCouter static field is set to 10000000.

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363, Intel Core i5-1035G4 CPU 1.10GHz, 1 CPU, 8 logical and 4 physical cores, .NET Core SDK=3.1.200[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT

Method

Mean

Error

StdDev

InnerLoop

3.373 ms

0.0674 ms

0.0631 ms

InnerLoop500

24.457 ms

0.4890 ms

0.5232 ms

Results show the InnerLoop solution is several times faster. To understand the reason, let's look at the JITTED code. I use WinDBG, to look into the generated assembly code.

InlineAndDce

00007ffa`41e79880 33c0            xor     eax,eax
00007ffa`41e79882 ffc0            inc     eax
00007ffa`41e79884 3d80969800      cmp     eax,989680h
00007ffa`41e79889 7cf7            jl      00007ffa`41e79882
00007ffa`41e7988b 33c0            xor     eax,eax
00007ffa`41e7988d c3              ret
!name2ee InlineCode InlineAndDce.Foo
Not JITTED yet. Use !bpmd -md 00007FFA4178FBF8 to break on run.

The generated InnerLoop method is super compact. Foo method is completly inlined and removed, as it only returns false. Incrementing variable j is also removed, as the if condition around the Foo statement always returns false. Basically an empty loop is left. 989680h is the FooCouter static field's value.

InlineAndDce500

The InnerLoop method:

00007ffa`41e798e0 57              push    rdi
00007ffa`41e798e1 56              push    rsi
00007ffa`41e798e2 53              push    rbx
00007ffa`41e798e3 4883ec20        sub     rsp,20h
00007ffa`41e798e7 488bf1          mov     rsi,rcx
00007ffa`41e798ea 33ff            xor     edi,edi

00007ffa`41e798ec 33db            xor     ebx,ebx
00007ffa`41e798ee 488bce          mov     rcx,rsi
00007ffa`41e798f1 baf4010000      mov     edx,1F4h
00007ffa`41e798f6 e83d8884ff      call    00007ffa`416c2138 (InlineCode.InlineAndDce500.Foo(Int32), mdToken: 000000000600000B)
00007ffa`41e798fb 0fb6c0          movzx   eax,al
00007ffa`41e798fe 85c0            test    eax,eax
00007ffa`41e79900 7402            je      00007ffa`41e79904
00007ffa`41e79902 ffc7            inc     edi
00007ffa`41e79904 ffc3            inc     ebx
00007ffa`41e79906 81fb80969800    cmp     ebx,989680h
00007ffa`41e7990c 7ce0            jl      00007ffa`41e798ee
00007ffa`41e7990e 8bc7            mov     eax,edi
00007ffa`41e79910 4883c420        add     rsp,20h
00007ffa`41e79914 5b              pop     rbx
00007ffa`41e79915 5e              pop     rsi
00007ffa`41e79916 5f              pop     rdi
00007ffa`41e79917 c3              ret

This method does not inline the Foo, it makes a call instruction within the loop. The generated code for the Foo method is also more complex:

00007ffa`41e79930 55              push    rbp
00007ffa`41e79931 4883ec30        sub     rsp,30h
00007ffa`41e79935 488d6c2430      lea     rbp,[rsp+30h]
00007ffa`41e7993a 48894d10        mov     qword ptr [rbp+10h],rcx
00007ffa`41e7993e 895518          mov     dword ptr [rbp+18h],edx
00007ffa`41e79941 817d18f4010000  cmp     dword ptr [rbp+18h],1F4h
00007ffa`41e79948 7e21            jle     00007ffa`41e7996b
00007ffa`41e7994a 8b4d18          mov     ecx,dword ptr [rbp+18h]
00007ffa`41e7994d ffc9            dec     ecx
00007ffa`41e7994f 894dfc          mov     dword ptr [rbp-4],ecx
00007ffa`41e79952 8b4dfc          mov     ecx,dword ptr [rbp-4]
00007ffa`41e79955 894d18          mov     dword ptr [rbp+18h],ecx
00007ffa`41e79958 488b4d10        mov     rcx,qword ptr [rbp+10h]
00007ffa`41e7995c 8b55fc          mov     edx,dword ptr [rbp-4]
00007ffa`41e7995f e8d48784ff      call    00007ffa`416c2138 (InlineCode.InlineAndDce500.Foo(Int32), mdToken: 000000000600000B)
00007ffa`41e79964 90              nop
00007ffa`41e79965 488d6500        lea     rsp,[rbp]
00007ffa`41e79969 5d              pop     rbp
00007ffa`41e7996a c3              ret
00007ffa`41e7996b 33c0            xor     eax,eax
00007ffa`41e7996d 488d6500        lea     rsp,[rbp]
00007ffa`41e79971 5d              pop     rbp
00007ffa`41e79972 c3              ret

The whole method is JITTED as a recursive method, as the recursive branch is executed in this case. DCE is not able the remove this branch, the generated code is larger and recursive, hence it cannot be inlined.

Although at C# level, we expect both methods to have a similar performance: in both cases Foo method is indentical and invoked twice, the overall result is the same, but with a completely different performance characteristic.

In the first case DCE was able to remove the recurision, hence enabling inlining and speeding up overall execution. In the next post I will look at a similar example, but in that case with exceptions instead of recursive methods.