Method Size and CPU

Usually, assembly code size is a good metric to track for performance optimizations. In this post I aim to understand why smaller assembly code size can be beneficial and how it affects performance.

Problem

To answer this question, I need two methods to compare. But how can I create two methods that perform essentially the same operations yet differ in the amount of code executed?

Instead of writing two methods to compare, I decided to write ~2000 methods:

public int Test0000(int remaining) => remaining == 0 ? 1 : Test0001(remaining - 1) + 1;
public int Test0001(int remaining) => remaining == 0 ? 1 : Test0002(remaining - 1) + 1;
//...
public int Test1998(int remaining) => remaining == 0 ? 1 : Test1999(remaining - 1) + 1;
public int Test1999(int remaining) => remaining == 0 ? 1 : Test2000(remaining - 1) + 1;
public int Test2000(int remaining) => remaining == 0 ? 1 : 1;

Every method has the same body: each calls the next-numbered method and returns its result incremented by 1. The last method in the call chain, Test2000, returns the value 1 as a stopping condition. The rest of the methods can also stop early by branching on the remaining parameter, which is decremented at each call.
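The post doesn't show how the ~2000 methods were produced; since they follow a strictly mechanical pattern, they can be generated. A minimal sketch of such a generator (an assumed approach, not the author's actual tooling):

```csharp
using System;
using System.Text;

class MethodChainGenerator
{
    // Emits count chained methods plus the terminal TestNNNN method.
    public static string Generate(int count)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < count; i++)
            sb.AppendLine($"public int Test{i:D4}(int remaining) => remaining == 0 ? 1 : Test{i + 1:D4}(remaining - 1) + 1;");
        // Terminal method: no further call, always returns 1.
        sb.AppendLine($"public int Test{count:D4}(int remaining) => remaining == 0 ? 1 : 1;");
        return sb.ToString();
    }

    static void Main() => Console.WriteLine(Generate(2000));
}
```

The generated text can be pasted into (or emitted as) a partial class during build.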

This way, invoking Test0000(1999) returns the value 2000, while invoking Test0000(0) returns 1. The goal, however, is to make the code do about the same amount of work regardless of the input parameter:

  • we make roughly the same number of method calls

  • eventually we return the value of 2000 regardless of the input parameter

To achieve this, a helper method is set up:

public static int Range(CacheLineMethodsBenchmark b, int methodRange)
{
    var r = 0;
    var loopRepeat = 2000 / methodRange;
    for (int i = 0; i < loopRepeat; i++)
        r += b.Test0000(methodRange - 1);
    return r;
}

CacheLineMethodsBenchmark is the type that contains the 2000 methods, while methodRange is an input argument that takes the following values later in the benchmarks: 1, 10, 20, 40, 50, 100, 200, 400, 500, 1000, 2000. Each of these divides 2000 evenly, so loopRepeat is always an integer. When methodRange is 1, the for loop in the Range method runs from 0 to 1999. Each iteration calls Test0000(0), so we end up with 2000 method calls and a final value of 2000. When methodRange is 2000, the for loop has a single iteration, but that iteration invokes Test0000(1999), which also returns 2000. Notice that we end up making the same number of method calls in both cases. However, the former reaches this value with a small amount of JITted code (only the Range and Test0000 methods are executed), while the latter executes 2000 chained methods.
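The call-count arithmetic can be sanity-checked with a small instrumented sketch. Here a single counting method stands in for the 2000 generated ones (a simplification for illustration; the real benchmark calls distinct method bodies):

```csharp
using System;

class CallCountCheck
{
    public static int Calls;

    // Stand-in for Test0000..Test2000: same shape, one method, call counter added.
    public static int Test(int remaining)
    {
        Calls++;
        return remaining == 0 ? 1 : Test(remaining - 1) + 1;
    }

    // Same structure as the Range helper from the post.
    public static int Range(int methodRange)
    {
        var r = 0;
        var loopRepeat = 2000 / methodRange;
        for (int i = 0; i < loopRepeat; i++)
            r += Test(methodRange - 1);
        return r;
    }

    static void Main()
    {
        foreach (var methodRange in new[] { 1, 10, 2000 })
        {
            Calls = 0;
            var result = Range(methodRange);
            // Every variant returns 2000 and makes exactly 2000 calls.
            Console.WriteLine($"methodRange={methodRange}: result={result}, calls={Calls}");
        }
    }
}
```

For every valid methodRange the result is 2000 and exactly 2000 calls are made, confirming the variants do the same amount of work.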

To avoid being cheated by the JIT's tiered compilation, all methods are attributed with NoInlining and NoOptimization. I also pre-JIT all 2000 methods during benchmark startup, so all methods are JITted at the same time and in order; based on my observation, their assembly code is then laid out contiguously in memory.
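The post doesn't show the startup code; a minimal sketch of how the methods can be attributed and force-JITted, using RuntimeHelpers.PrepareMethod (which compiles a method ahead of its first call). PreJitSketch and its single method are illustrative stand-ins for the benchmark type:

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

class PreJitSketch
{
    // NoInlining + NoOptimization keeps each method a distinct,
    // unoptimized piece of JITted code, unaffected by tiered compilation.
    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    public int Test0000(int remaining) => remaining == 0 ? 1 : 1;

    static void Main()
    {
        // Force-JIT every TestXXXX method in declaration order at startup.
        foreach (var m in typeof(PreJitSketch).GetMethods(
                     BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly))
        {
            if (m.Name.StartsWith("Test"))
                RuntimeHelpers.PrepareMethod(m.MethodHandle);
        }
        Console.WriteLine("pre-JIT done");
    }
}
```

Whether JITting in declaration order guarantees a contiguous code layout is a runtime implementation detail; the post relies on it as an observed behavior, not a documented one.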

Executing this benchmark might already hint at some answers to my question, but I'd like to add a few more tests first.

Extending The Benchmark

I prepared one additional method, Test, which does the same work as Test0000 to Test2000, but recursively.

public int Test(int remaining) => remaining == 0 ? 1 : Test(remaining - 1) + 1;

It also has a wrapper method similar to Range, which calls the recursive Test instead of Test0000. In the benchmark this wrapper is also named Test, and I consider it the baseline for the comparison.

public static int Test(CacheLineMethodsBenchmark b, int methodRange)
{
    var r = 0;
    var loopRepeat = 2000 / methodRange;
    for (int i = 0; i < loopRepeat; i++)
        r += b.Test(methodRange - 1);
    return r;
}

Exception Handling

Finally, to significantly increase the size of the methods, I surrounded each body with a try-catch block. In one case the exception is handled by printing it to the console and rethrowing it:

public int Test0000(int remaining) { try { return remaining == 0 ? 1 : Test0001(remaining - 1) + 1; } catch(Exception e) { Console.WriteLine(e); throw; } }

and in the other case the exception handler is more complex, writing an interpolated string to the console:

public int Test0000(int remaining) { try { return remaining == 0 ? 1 : Test0001(remaining - 1) + 1; } catch(Exception e) { Console.WriteLine($"hello {e.Message.Substring(0,10)} error."); throw; } }

Note that none of these exception handlers execute in the benchmarks; they only artificially inflate the code size, to 118 and 252 bytes respectively.

Performance Measurements

In the below table:

  • Range denotes the call chain for Test0000 with the simple exception handler (method size: 118 bytes)

  • RangeExtended denotes the call chain for Test0000 with the complex exception handler (method size: 252 bytes)

  • Test denotes the recursive Test method without an exception handler

  • the MethodRange variable drives the depth of the call chain

|        Method | MethodRange |      Mean |     Error |    StdDev |    Median | InstructionRetired/Op | BranchMispredictions/Op | CacheMisses/Op | Allocated |
|-------------- |------------ |----------:|----------:|----------:|----------:|----------------------:|------------------------:|---------------:|----------:|
|         Range |           1 |  3.855 us | 0.0297 us | 0.0263 us |  3.854 us |                38,195 |                       2 |              0 |         - |
| RangeExtended |           1 |  4.139 us | 0.0133 us | 0.0117 us |  4.141 us |                46,209 |                       2 |              0 |         - |
|          Test |           1 |  3.545 us | 0.0131 us | 0.0102 us |  3.545 us |                40,184 |                       2 |              1 |         - |
|         Range |          10 |  2.676 us | 0.0142 us | 0.0133 us |  2.681 us |                27,347 |                       3 |              0 |         - |
| RangeExtended |          10 |  2.818 us | 0.0157 us | 0.0131 us |  2.816 us |                35,356 |                       3 |              0 |         - |
|          Test |          10 |  3.350 us | 0.0669 us | 0.1866 us |  3.425 us |                31,181 |                       2 |              0 |         - |
|         Range |          20 |  6.633 us | 0.0746 us | 0.0623 us |  6.619 us |                27,101 |                     391 |              1 |         - |
| RangeExtended |          20 |  7.899 us | 0.1574 us | 0.2715 us |  7.836 us |                35,191 |                     393 |              1 |         - |
|          Test |          20 |  4.058 us | 0.0459 us | 0.0407 us |  4.054 us |                30,725 |                       4 |              0 |         - |
|         Range |          40 | 15.893 us | 0.3179 us | 0.6706 us | 16.128 us |                27,675 |                   1,155 |              3 |         - |
| RangeExtended |          40 | 17.262 us | 0.2451 us | 0.2293 us | 17.286 us |                35,708 |                   1,106 |              2 |         - |
|          Test |          40 |  8.273 us | 0.1245 us | 0.1103 us |  8.287 us |                30,864 |                     355 |              1 |         - |
|         Range |         100 | 23.881 us | 0.4604 us | 0.5654 us | 23.691 us |                28,647 |                   1,653 |              5 |         - |
| RangeExtended |         100 | 24.835 us | 0.4126 us | 0.7950 us | 24.579 us |                36,929 |                   1,650 |              7 |         - |
|          Test |         100 |  8.661 us | 0.1667 us | 0.1784 us |  8.621 us |                30,981 |                     388 |              2 |         - |
|         Range |         200 | 29.165 us | 0.4010 us | 0.3348 us | 29.003 us |                29,842 |                   1,834 |              6 |         - |
| RangeExtended |         200 | 30.533 us | 0.4496 us | 0.3755 us | 30.415 us |                38,245 |                   1,831 |              6 |         - |
|          Test |         200 |  8.866 us | 0.1748 us | 0.1717 us |  8.793 us |                31,164 |                     393 |              2 |         - |
|         Range |         400 | 33.325 us | 0.1992 us | 0.1663 us | 33.310 us |                31,432 |                   1,922 |              9 |         - |
| RangeExtended |         400 | 33.168 us | 0.4016 us | 0.3756 us | 33.093 us |                39,923 |                   1,917 |              8 |         - |
|          Test |         400 |  8.751 us | 0.0836 us | 0.0782 us |  8.739 us |                31,389 |                     392 |              1 |         - |
|         Range |         500 | 35.903 us | 0.7029 us | 0.8095 us | 35.670 us |                32,485 |                   1,947 |             14 |         - |
| RangeExtended |         500 | 36.021 us | 0.6336 us | 0.7042 us | 35.860 us |                41,175 |                   1,937 |             16 |         - |
|          Test |         500 |  8.968 us | 0.1652 us | 0.1379 us |  8.971 us |                31,504 |                     405 |              2 |         - |
|         Range |        1000 | 41.753 us | 0.1436 us | 0.1343 us | 41.760 us |                33,563 |                   2,000 |             10 |         - |
| RangeExtended |        1000 | 46.011 us | 0.7531 us | 0.7044 us | 46.138 us |                43,286 |                   2,137 |             13 |         - |
|          Test |        1000 |  9.533 us | 0.1263 us | 0.2926 us |  9.472 us |                31,885 |                     428 |              3 |         - |
|         Range |        2000 | 78.023 us | 1.4033 us | 1.1718 us | 77.724 us |                40,788 |                   3,270 |             23 |         - |
| RangeExtended |        2000 | 82.178 us | 1.1754 us | 1.5283 us | 81.659 us |                52,923 |                   3,312 |             40 |         - |
|          Test |        2000 | 12.105 us | 0.1840 us | 0.1721 us | 12.146 us |                32,180 |                     478 |              2 |         - |

The following observations can be made:

  • None of the solutions allocate on the GC heap.

  • The more times Test recursively calls itself, the more branch mispredictions happen, which slows execution. Cache misses do not increase significantly, likely because the same code is executed over and over.

  • RangeExtended performs slightly worse than Range. Comparing them at the same MethodRange values, the larger TestXXXX methods result in more cache misses but roughly the same number of branch mispredictions.

  • The deeper the call chain for the Range and RangeExtended methods (i.e., the more code executed), the more branch mispredictions and cache misses are observed.

  • Both cache misses and branch mispredictions seem to degrade performance.

  • The performance difference between the best- and worst-case scenarios is more than 30-fold.

Intel VTune Measurements

Making these observations is interesting, but not conclusive on its own. To better understand the performance differences, I used Intel VTune to profile CPU performance, with the analysis type set to Microarchitecture Exploration. Comparing test runs of the RangeExtended method with the methodRange parameter set to 2000 and to 10 respectively, the following observations can be made:

  • Execution Time difference is significant (72.687 s vs. 11.294 s), but not as large as in the microbenchmark above

  • Measurement reliability is above 0.970 (1 is the theoretical maximum)

  • CPI rate is 5.148 vs 0.285 (where 0.25 is the best on current CPUs)

  • Instruction Retired: difference between the runs is less than 0.08%

  • Front-End bound latency has increased by 70%

  • Retiring decreased by 82.1%

What do these measures mean?

According to the user guide of Intel VTune:

Front-End Bound metric represents a slots fraction where the processor's Front-End undersupplies its Back-End. Front-End denotes the first part of the processor core responsible for fetching operations that are executed later on by the Back-End part. Within the Front-End, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uOps). Front-End Bound metric denotes unutilized issue-slots when there is no Back-End stall (bubbles where Front-End delivered no uOps while Back-End could have accepted them). For example, stalls due to instruction-cache misses would be categorized as Front-End Bound.

Under the Front-End bound operations the key differences are:

ICache Misses (29.3% vs. 0.2%): To introduce new uOps into the pipeline, the core must either fetch them from a decoded instruction cache, or fetch the instructions themselves from memory and then decode them. In the latter path, the requests to memory first go through the L1I (level 1 instruction) cache that caches the recent code working set. Front-end stalls can accrue when fetched instructions are not present in the L1I. Possible reasons are a large code working set or fragmentation between hot and cold code. In the latter case, when a hot instruction is fetched into the L1I, any cold code on its cache line is brought along with it. This may result in the eviction of other, hotter code.

ITLB Overhead (35.5% vs. 0.6%): In x86 architectures, mappings between virtual and physical memory are facilitated by a page table, which is kept in memory. To minimize references to this table, recently-used portions of the page table are cached in a hierarchy of 'translation look-aside buffers', or TLBs, which are consulted on every virtual address translation. As with data caches, the farther a request has to go to be satisfied, the worse the performance impact. This metric estimates the performance penalty of page walks induced on ITLB (instruction TLB) misses.

Branch Resteers (58.2% vs. 0.4%), due to mispredicted resteers: This metric measures the fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage.

FrontEnd Bandwidth / DSB Coverage (5.2% vs. 99.1%): Fraction of uOps delivered by the DSB (known as Decoded ICache or uOp Cache).

Retiring metric represents a Pipeline Slots fraction utilized by useful work, meaning the issued uOps that eventually get retired. Ideally, all Pipeline Slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum possible number of uOps retired per cycle has been achieved.

Retiring for Light operations (4.7% vs. 86.9%): instructions that require no more than one uop.

With the Front-End stalled, the Back-End could not utilize the CPU's resources as efficiently as it otherwise could have.

Conclusion

With the above sample code benchmarked, we can see a huge performance difference depending on the depth of the executed call chain. The deeper the call chain, the more code is executed; and the more code executed, the more Front-End bound the CPU became:

  • branch mispredictions increased

  • the CPU's instruction caches became less effective (L1 instruction cache, decoded uOp cache) and ITLB overhead grew

Optimizing for smaller assembly code can be a good target in general; however, careful measurement is needed to really understand how the CPU processes the code. In this case, one implementation performed significantly worse because it stalled certain stages of the CPU instruction pipeline.