
Object Stack Allocation in .NET 10

Object Stack Allocation is a performance optimization in .NET 10 that allows certain objects to be allocated on the stack instead of the heap. The JIT compiler identifies object allocations that do not escape the method and may decide to place such non-escaping objects on the stack. Stack-allocated objects are freed automatically when the stack frame is released as the method returns, reducing the load on the Garbage Collector (GC).

Recent improvements in this area have been detailed in the Performance Improvements in .NET 10 blog post.

In this post, I will investigate the current state and limitations of this feature. Please note that the findings reveal some of the current internal limitations of the runtime, which may change or be adjusted in future releases.

Using Stopwatch

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public long DoWork()
{
    var sw = Stopwatch.StartNew();
    int i = 65;
    Nop();
    sw.Stop();
    return i + sw.ElapsedTicks;
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Nop() { }

To observe the generated code, run the application with the DOTNET_JitDisasm environment variable set, for example $env:DOTNET_JitDisasm="Test:DoWork()" in PowerShell. It prints the following assembly code to the console on a Windows 11 x64 machine:

G_M000_IG01:                ;; offset=0x0000
       push     rsi
       push     rbx
       sub      rsp, 56
       vzeroupper
       xor      eax, eax
       mov      qword ptr [rsp+0x30], rax
       mov      qword ptr [rsp+0x28], rax

G_M000_IG02:                ;; offset=0x0015
       lea      rcx, [rsp+0x30]
       mov      rax, 0x7FF8F15A3FB0

G_M000_IG03:                ;; offset=0x0024
       call     rax ; Interop+Kernel32:QueryPerformanceCounter(ptr):int
       mov      rbx, qword ptr [rsp+0x30]
       call     [Test:Nop()]
       lea      rcx, [rsp+0x28]
       mov      rax, 0x7FF8F15A3FB0

G_M000_IG04:                ;; offset=0x0040
       call     rax ; Interop+Kernel32:QueryPerformanceCounter(ptr):int
       mov      rsi, qword ptr [rsp+0x28]
       sub      rsi, rbx
       cmp      dword ptr [(reloc 0x7ff85a53e808)], 0
       jne      SHORT G_M000_IG07

G_M000_IG05:                ;; offset=0x0053
       lea      rax, [rsi+0x41]

G_M000_IG06:                ;; offset=0x0057
       add      rsp, 56
       pop      rbx
       pop      rsi
       ret

G_M000_IG07:                ;; offset=0x005E
       call     CORINFO_HELP_POLL_GC
       jmp      SHORT G_M000_IG05

; Total bytes of code 101

This optimization does not kick in with Tier 0 compilation of the code; hence I added the [MethodImpl(MethodImplOptions.AggressiveOptimization)] attribute to the DoWork() method.

The blocks labeled G_M000_IG03 and G_M000_IG04 contain the inlined code for the Start and Stop calls on the Stopwatch object. The block at G_M000_IG05 adds 65 (0x41) to the result value.

An important observation: the more code inlined, the better escape analysis can perform.

When I run the above code and measure the allocated memory in bytes, it prints 0.

public void Measurement()
{
    long beginning = GC.GetTotalAllocatedBytes(true);
    for (int i = 0; i < 100; i++)
        sum += DoWork();

    var allocated = GC.GetTotalAllocatedBytes(true) - beginning;
    Console.WriteLine($"Allocated: {allocated}, Sum: {sum}");
}

Console output: Allocated: 0, Sum: 6889

Custom Objects

Let's replace the Stopwatch type with a custom type (which similarly returns a timestamp).

public long DoWork()
{
    var obj = new MyClass();
    Nop();
    return 65 + obj.Get();
}
//...
public class MyClass
{
    public long Get() => Stopwatch.GetTimestamp();
}

Without the AggressiveOptimization attribute, it prints Allocated: 2400, Sum: 249270223149128 on the console. With the attribute, it prints Allocated: 0, Sum: 7393.

Size of the Custom Objects

Does the size of the allocated object limit stack object allocation? To test this, I added an InlineArray to MyClass type:

public class MyClass
{
    private MyInlineType<byte> _arr;
    public long Get() => Stopwatch.GetTimestamp() + _arr[0];
}

[InlineArray(520)]
public struct MyInlineType<T>
{
    public T _value;
}

The maximum inline-array size that still got stack-allocated on my test machine is 520 bytes. Interestingly, allocating up to 8 such objects in the same method still got them all stack-allocated, while the 9th reverted everything to heap allocation.
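As a sketch of the multi-object observation (the exact count and sizes are machine- and runtime-specific; the class names here are illustrative), a method shaped like this still had every instance stack-allocated in my tests:

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;

public class MultiAllocTest
{
    // Mirrors the 520-byte inline-array version of MyClass from above.
    public class MyClass
    {
        private MyInlineType<byte> _arr;
        public long Get() => Stopwatch.GetTimestamp() + _arr[0];
    }

    [InlineArray(520)]
    public struct MyInlineType<T>
    {
        public T _value;
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public long DoWork()
    {
        // Eight instances: all stack-allocated in my tests;
        // adding a ninth reverted them all to heap allocation.
        var o1 = new MyClass(); var o2 = new MyClass();
        var o3 = new MyClass(); var o4 = new MyClass();
        var o5 = new MyClass(); var o6 = new MyClass();
        var o7 = new MyClass(); var o8 = new MyClass();
        Nop();
        return 65 + o1.Get() + o2.Get() + o3.Get() + o4.Get()
                  + o5.Get() + o6.Get() + o7.Get() + o8.Get();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Nop() { }
}
```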

While this seems to point to a limit, it cannot be generalized. A class with as few as 18 long fields will sometimes not be stack-allocated, depending on how the fields are initialized. For example, a class with 18 fields initialized like this is heap-allocated:

private long _a01 = (Stopwatch.GetTimestamp() * 1); // does NOT get stack-allocated

However, changing the multiplication to a division makes the type stack-allocated:

private long _a01 = (Stopwatch.GetTimestamp() / 1); // does get stack-allocated

The case above is discussed in detail here.

For arrays, I observed a similar size limit for stack allocation: the maximum byte[] size that still got stack-allocated was 512 elements, and the maximum int[] size was 128 elements (512 bytes in both cases).
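A sketch of the array case (the 512-element limit is what I observed on my machine; it may differ elsewhere):

```csharp
using System.Runtime.CompilerServices;

public class ArrayAllocTest
{
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public long DoWork()
    {
        // 512-element byte[] with a constant length that never escapes:
        // stack-allocated in my tests; larger sizes moved back to the heap.
        var buffer = new byte[512];
        Nop();
        buffer[0] = 65;
        return buffer[0] + buffer[^1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Nop() { }
}
```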

Escaping Into Sub-Frames

Is a stack-allocated object allowed to 'escape' into callee stack frames?

public long DoWork()
{
    var obj = new MyClass();
    var value = Nop(obj);
    return 65 + value + obj.Get();
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static long Nop(MyClass obj) => 0;

Without NoInlining (when the Nop method is likely inlined), the object can still be stack-allocated. This even works if Nop invokes a Nop2 method, passing the same object along. With NoInlining, the object is heap-allocated, even though the obj parameter is never used.

Try-Finally Blocks

Try-finally blocks do not prevent this optimization: an object used across a try-finally can still be stack-allocated.
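For completeness, a sketch of the shape I tested (minimal versions of MyClass and Nop, as defined earlier in the post, are inlined here so the snippet stands alone):

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;

public class TryFinallyTest
{
    public class MyClass
    {
        public long Get() => Stopwatch.GetTimestamp();
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public long DoWork()
    {
        var obj = new MyClass(); // still stack-allocated despite the try-finally
        try
        {
            Nop();
            return 65 + obj.Get();
        }
        finally
        {
            Nop();
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Nop() { }
}
```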

Try-Catch Blocks

In case of a simple try-catch block as below, the MyClass instance is stack-allocated.

public long DoWork()
{
    try
    {
        var obj = new MyClass();
        Nop();
        return 65 + obj.Get();
    }
    catch (Exception)
    {
        throw;
    }
}

However, when obj is used in the catch block, it is heap allocated:

public long DoWork()
{
    var obj = new MyClass();
    try
    {
        Nop();
        return 65 + obj.Get();
    }
    catch (Exception)
    {
        // 👇 Heap allocated in catch block.
        return obj.Get();
    }
}

Lock Statement

Locking an object prevents it from being stack-allocated.

public long DoWork()
{
    var obj = new MyClass();
    
    // 👇 Heap allocated when locked.
    lock (obj)
    {
        Nop();
        return 65 + obj.Get();
    }
}

However, locking other objects (e.g. a Lock instance) has no impact on whether the MyClass instance is stack-allocated.
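As a sketch, locking on a dedicated System.Threading.Lock (available since .NET 9) instead of obj left the MyClass instance stack-allocated in my tests:

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

public class LockTest
{
    public class MyClass
    {
        public long Get() => Stopwatch.GetTimestamp();
    }

    private readonly Lock _gate = new();

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public long DoWork()
    {
        var obj = new MyClass(); // stack-allocated: it is not the lock target

        // 👇 Locking a separate Lock instance does not make obj escape.
        lock (_gate)
        {
            Nop();
            return 65 + obj.Get();
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Nop() { }
}
```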

Constructor

The constructor of a stack-allocated object is still executed. The following type, with a sleep in the constructor, gets stack-allocated, yet it still waits for 15 ms on construction. This behavior is expected, but it is worth noting: object stack allocation does not skip the execution of the constructor.

public class MyClass
{
    private long _a00;
    public MyClass()
    {
        Thread.Sleep(15);
        _a00 = Stopwatch.GetTimestamp();
    }
    public long Get() => Stopwatch.GetTimestamp() + _a00;
}

Finalizer

A finalizer will prevent the object from being stack-allocated as it makes the object escape onto the finalizer queue.

public class MyClass
{
    private long _a00;
    public long Get() => Stopwatch.GetTimestamp() + _a00;
    // 👇 Heap allocated with finalizer.
    ~MyClass() => _a00 = 0;
}

Disposable

Disposable objects can be stack-allocated, whether or not they are actually disposed, including when used with a using statement.

public class MyClass : IDisposable
{
    private long _a00;
    public void Dispose() => _a00 = 0;
    public long Get() => Stopwatch.GetTimestamp() + _a00;
}
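Used with the type above, a caller shaped like this still showed zero allocations in my measurements, while Dispose() runs as usual when the scope exits:

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;

public class DisposableTest
{
    public class MyClass : IDisposable
    {
        private long _a00;
        public void Dispose() => _a00 = 0;
        public long Get() => Stopwatch.GetTimestamp() + _a00;
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public long DoWork()
    {
        // The using statement does not prevent stack allocation;
        // Dispose() is still invoked at the end of the scope.
        using var obj = new MyClass();
        Nop();
        return 65 + obj.Get();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Nop() { }
}
```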

With Measurement Optimized

One interesting behavior change happens when the loop count in the Measurement() method is increased from 100 to 100_000 or higher. This triggers Tier 1 optimization for the Measurement() method itself. During this optimization, the DoWork() method may get inlined into Measurement(), but the stack allocation optimization is no longer applied to the inlined allocation. This results in an interesting situation where a previously 'non-allocating' method (thanks to optimizations) becomes allocating again.
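A sketch of the setup that exhibited this (only the loop count differs from the earlier Measurement() method; minimal MyClass, DoWork and Nop are inlined so the snippet stands alone):

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public class TieringTest
{
    private long sum;

    public class MyClass
    {
        public long Get() => Stopwatch.GetTimestamp();
    }

    public long DoWork()
    {
        var obj = new MyClass();
        Nop();
        return 65 + obj.Get();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Nop() { }

    public void Measurement()
    {
        long beginning = GC.GetTotalAllocatedBytes(true);

        // 100_000 iterations are enough to promote Measurement() to Tier 1;
        // DoWork() may then get inlined here, and its allocation can move
        // back to the heap even though DoWork() alone measured as non-allocating.
        for (int i = 0; i < 100_000; i++)
            sum += DoWork();

        var allocated = GC.GetTotalAllocatedBytes(true) - beginning;
        Console.WriteLine($"Allocated: {allocated}, Sum: {sum}");
    }
}
```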

Conclusion

Object Stack Allocation in .NET 10 is a powerful optimization that can reduce GC pressure by allocating certain objects on the stack rather than the heap. Through these experiments, I have identified several key characteristics and limitations of this feature.

While this feature shows great promise for performance optimization, it's important to note that these findings represent the current state in .NET 10 and may evolve in future releases. Developers should avoid making strong assumptions about exactly what will be stack-allocated, as the JIT's decisions may vary based on multiple factors including the runtime version, platform, and broader optimization context.