The Machine Under the Code: Stack Frames and Function Calls in x86-64

Dima May 28, 2026

Introduction

Every call stack trace you have ever read — in a debugger, a crash report, or a profiler — is a human-readable rendering of a data structure that the CPU maintains in memory. That structure is the call stack, and it exists because the CPU has no built-in concept of functions. The call instruction is just a jump that saves its return address. The ret instruction is just a jump to the address that call saved. Everything else — argument passing, local variables, register preservation — is a convention that compilers follow consistently so that functions written by different people can call each other without corrupting each other's state.

Understanding the x86-64 System V calling convention is not academic. When a crash report shows a segfault at an address like 0x00007ffd..., that address typically falls in the stack region, and a corrupted stack pointer or an overrun stack buffer is a prime suspect. When GDB shows <unknown> frames in a backtrace, the frame pointer chain is often broken. When a function returning a struct silently produces garbage, a clobbered register may have overwritten the return value. These failures are far easier to diagnose for someone who knows what the stack is supposed to look like at each instruction.

This tutorial builds a working multi-function assembly program step by step: the call and ret mechanism, the function prologue and epilogue, argument passing in registers, local variable allocation on the stack, and a complete example of two functions calling each other with correct stack discipline.


Background

The stack is a region of memory that grows downward toward lower addresses on x86-64. The stack pointer rsp always points to the last item pushed. push rax decrements rsp by 8 and writes rax to the memory at the new [rsp]. pop rax reads from [rsp] into rax and increments rsp by 8.
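To make the push and pop semantics concrete, here is a minimal sketch that performs both by hand. It assumes NASM on Linux and the same nasm/ld build command used later in this tutorial; the test value 42 and the choice of r8 and rbx as scratch registers are illustrative only.

```nasm
section .text
    global _start

_start:
    mov r8, rsp            ; remember the starting stack pointer
    mov rax, 42            ; arbitrary test value

    ; hand-written equivalent of: push rax
    sub rsp, 8             ; the stack grows toward lower addresses
    mov [rsp], rax         ; store the value at the new top of stack

    ; hand-written equivalent of: pop rbx
    mov rbx, [rsp]         ; read the value at the top of stack
    add rsp, 8             ; release the slot

    mov rdi, 1             ; assume failure
    cmp rbx, 42
    jne .exit              ; the value did not round-trip
    cmp rsp, r8
    jne .exit              ; rsp did not return to its starting value
    xor edi, edi           ; success: exit status 0
.exit:
    mov rax, 60            ; sys_exit
    syscall
```

The program prints nothing; running echo $? afterward should report 0, meaning both the value and rsp came back unchanged.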

The call label instruction does two things: pushes the address of the instruction after the call (the return address) onto the stack, then jumps to label. The ret instruction pops the top of the stack into rip, returning execution to the caller.

The System V AMD64 ABI defines the calling convention for Linux. The rules:

  • Integer and pointer arguments go in rdi, rsi, rdx, rcx, r8, r9 (in that order).
  • The return value goes in rax.
  • Caller-saved registers (rax, rcx, rdx, rsi, rdi, r8, r9, r10, r11) may be destroyed by the callee; the caller must save them before the call if it needs them afterward.
  • Callee-saved registers (rbx, rbp, r12-r15) must be preserved by the callee; if a function uses them, it must save and restore them.
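
The rules above can be sketched in a few lines. This is an illustrative example, not part of the tutorial's program: sum3 is a hypothetical helper, and the values are arbitrary. The caller places three arguments in rdi, rsi, and rdx, reads the result from rax, and saves r10 itself because r10 is caller-saved.

```nasm
section .text
    global _start

; sum3(rdi, rsi, rdx) -> rax   (hypothetical helper for illustration)
; Leaf function: touches only rax and its argument registers.
sum3:
    mov rax, rdi
    add rax, rsi
    add rax, rdx
    ret

_start:
    mov r10, 100           ; r10 is caller-saved: a callee may destroy it
    sub rsp, 16            ; reserve a slot, keeping rsp 16-byte aligned
    mov [rsp], r10         ; the caller saves the value it needs after the call

    mov rdi, 1             ; first integer argument
    mov rsi, 2             ; second integer argument
    mov rdx, 3             ; third integer argument
    call sum3              ; rax = 1 + 2 + 3 = 6

    mov r10, [rsp]         ; the caller restores its saved value
    add rsp, 16

    add rax, r10           ; 6 + 100 = 106
    mov rdi, rax           ; exit status = 106
    mov rax, 60            ; sys_exit
    syscall
```

This particular sum3 never touches r10, so the save is not strictly needed here; the point is that the caller cannot know that without reading the callee, and the convention says it should not have to.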

The function prologue establishes the stack frame:

push rbp        ; save caller's base pointer
mov  rbp, rsp   ; set up new frame pointer
sub  rsp, N     ; allocate N bytes for local variables

The function epilogue tears it down:

mov rsp, rbp    ; restore stack pointer
pop rbp         ; restore caller's base pointer
ret             ; return to caller

rbp (the base pointer) holds a fixed reference point within the current frame. Local variables live at negative offsets from rbp (e.g., [rbp - 8]), as do any callee-saved registers pushed after the prologue. The caller's saved rbp sits at [rbp], the return address at [rbp + 8], and arguments passed on the stack start at [rbp + 16].
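
The frame layout is easiest to see with a function that takes more than six integer arguments, since the seventh travels on the stack. The sketch below is a hypothetical example (the function name seventh and its arguments are made up for illustration); it reads the stack-passed argument at a positive offset from rbp and keeps a local at a negative offset.

```nasm
section .text
    global _start

; seventh(a1..a6 in registers, a7 on the stack) -> rax = a7
; Hypothetical function, used only to show frame offsets.
seventh:
    push rbp
    mov  rbp, rsp
    sub  rsp, 16           ; one local qword, rounded up to 16 bytes

    ; Frame layout at this point:
    ;   [rbp + 16]  seventh argument (pushed by the caller)
    ;   [rbp + 8]   return address   (pushed by call)
    ;   [rbp]       caller's rbp     (pushed by the prologue)
    ;   [rbp - 8]   local variable
    mov rax, [rbp + 16]    ; load the stack-passed argument
    mov [rbp - 8], rax     ; store it in the local slot
    mov rax, [rbp - 8]     ; return the local's value

    mov rsp, rbp
    pop rbp
    ret

_start:
    mov rdi, 1
    mov rsi, 2
    mov rdx, 3
    mov rcx, 4
    mov r8,  5
    mov r9,  6
    sub rsp, 8             ; padding so rsp is 16-byte aligned at the call
    push 7                 ; the seventh argument goes on the stack
    call seventh
    add rsp, 16            ; the caller removes the argument and the padding

    mov rdi, rax           ; exit status = 7
    mov rax, 60            ; sys_exit
    syscall
```

The round trip through the local slot is deliberately redundant; it exists only so that both a positive and a negative offset appear in one function.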


Practical Scenario

A game engine's scripting subsystem needs a math utility library written in assembly for performance on its inner loop calculations. The library must expose functions that can be called from C code. Two functions are needed: one that computes the maximum of two integers, and one that computes the absolute difference between them. The difference function calls the maximum function internally.

Both functions must follow the System V AMD64 ABI exactly; otherwise the C compiler's generated code, which assumes the ABI, can see clobbered registers or a corrupted stack when it calls into the assembly library. The team needs to see the complete call frame layout to verify that registers are saved and restored correctly and that rsp is 16-byte aligned before every call. The ABI requires this alignment unconditionally; it matters in practice because callees may use aligned SSE loads and stores on stack data.


The Problem

A first attempt at writing multiple assembly functions ignores the calling convention and manages no state at all.

Create a new file:

touch program.asm

Assemble and run using:

nasm -f elf64 program.asm -o program.o && ld -o program program.o && ./program

Start with the following contents in program.asm:

section .data
    result_msg db "Result: ", 0
    newline    db 10

section .bss
    digit_buf resb 20

section .text
    global _start

; print integer in rax to stdout
print_int:
    lea rdi, [digit_buf + 19]
    mov rcx, 0
.loop:
    xor rdx, rdx
    mov rbx, 10
    div rbx
    add dl, '0'
    dec rdi
    mov [rdi], dl
    inc rcx
    test rax, rax
    jnz .loop
    mov rax, 1
    mov rsi, rdi
    mov rdx, rcx
    mov rdi, 1
    syscall
    ret

_start:
    ; compute max(7, 12) inline — no function
    mov rax, 7
    mov rbx, 12
    cmp rax, rbx
    jge already_max
    mov rax, rbx
already_max:
    ; print "Result: "
    push rax
    mov rax, 1
    mov rdi, 1
    mov rsi, result_msg
    mov rdx, 8
    syscall
    pop rax
    call print_int
    mov rax, 1
    mov rdi, 1
    mov rsi, newline
    mov rdx, 1
    syscall

    mov rax, 60
    mov rdi, 0
    syscall


Result: 12


The logic is inlined in _start because there is no function to call. Every place in the program that needs a maximum must duplicate the comparison code. There is no reuse, no clear separation of concerns, and print_int uses rbx as a scratch register without saving it — a correct caller that stores something important in rbx before calling print_int would find that value corrupted on return. The convention is the contract; violating it is silent corruption.
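
The silent corruption described above can be reproduced in isolation. In this minimal sketch, bad_helper is a hypothetical callee that overwrites rbx without saving it, exactly as the first print_int does; the values 77 and 10 are arbitrary.

```nasm
section .text
    global _start

; bad_helper: hypothetical callee that breaks the convention.
; It overwrites rbx (callee-saved) and never restores it.
bad_helper:
    mov rbx, 10
    ret

_start:
    mov rbx, 77            ; the caller keeps a value in rbx, trusting the convention
    call bad_helper        ; rbx is now 10, with no error or warning
    mov rdi, rbx           ; exit status shows what the caller sees in rbx
    mov rax, 60            ; sys_exit
    syscall
```

The exit status is 10 rather than 77. Nothing crashes and nothing is reported, which is what makes this class of bug hard to find.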


The call Instruction and Return Address

call label is syntactic sugar for two operations: push rip_after_call followed by jmp label. The return address sits at [rsp] immediately after the call. ret pops it back into rip. If anything modifies rsp between the call and the ret without a matching correction, the ret jumps to garbage.
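
The equivalence can be checked by replacing a call with its two parts. The following is a sketch for a separate scratch file, not for program.asm; answer is a hypothetical function, and unlike a real call the emulation needs a scratch register (rax here) to hold the return address.

```nasm
section .text
    global _start

; answer() -> rax = 42   (hypothetical leaf function)
answer:
    mov rax, 42
    ret                    ; pops whatever is at [rsp] into rip

_start:
    ; hand-written equivalent of: call answer
    lea rax, [rel .after]  ; address of the instruction after the "call"
    push rax               ; push it as the return address
    jmp answer             ; transfer control
.after:
    mov rdi, rax           ; exit status = 42
    mov rax, 60            ; sys_exit
    syscall
```

ret cannot tell the difference: it jumps to whatever qword sits at [rsp], regardless of how it got there.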

Replace the entire program.asm with the following, which adds a proper max_int function:

section .data
    result_msg db "Result: ", 0
    newline    db 10

section .bss
    digit_buf resb 20

section .text
    global _start

; max_int(rdi, rsi) -> rax
; Returns the larger of two integers.
; Preserves: rbx, rbp, r12-r15 (callee-saved — none used here)
; Clobbers:  rax (return value)
max_int:
    cmp rdi, rsi
    jge .rdi_wins
    mov rax, rsi
    ret
.rdi_wins:
    mov rax, rdi
    ret

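; print_int: prints the unsigned integer in rax to stdout
; Note: takes its argument in rax rather than rdi, so it is a private helper, not an ABI-conforming function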
print_int:
    push rbx               ; rbx is callee-saved — must preserve it
    lea rdi, [digit_buf + 19]
    mov rcx, 0
.loop:
    xor rdx, rdx
    mov rbx, 10
    div rbx
    add dl, '0'
    dec rdi
    mov [rdi], dl
    inc rcx
    test rax, rax
    jnz .loop
    push rcx
    push rdi
    mov rax, 1
    mov rsi, rdi
    mov rdx, rcx
    mov rdi, 1
    syscall
    pop rdi
    pop rcx
    pop rbx                ; restore rbx before returning
    ret

_start:
    ; call max_int(7, 12)
    mov rdi, 7
    mov rsi, 12
    call max_int           ; return value in rax

    push rax               ; save result before write syscall clobbers rax
    mov rax, 1
    mov rdi, 1
    mov rsi, result_msg
    mov rdx, 8
    syscall
    pop rax                ; restore result

    call print_int

    mov rax, 1
    mov rdi, 1
    mov rsi, newline
    mov rdx, 1
    syscall

    mov rax, 60
    mov rdi, 0
    syscall


Result: 12


max_int receives its arguments in rdi and rsi, the first two integer registers in the System V AMD64 ABI. It returns its result in rax. It uses no callee-saved registers, needs no stack locals, and calls nothing, so no prologue or epilogue is needed: a leaf function that does not touch rbx, rbp, or r12-r15 needs nothing more than its logic and a ret. print_int uses rbx as the divisor, so it saves and restores rbx around its body.

Why this is better

The maximum logic is now in one place. Any code in this binary can call max_int(rdi, rsi) and receive the result in rax. Callee-saved registers are preserved correctly: a caller that holds a value in rbx can call print_int and trust that rbx still holds that value afterward.


The Function Prologue and Stack Frame

When a function needs local variables, it allocates space on the stack by decrementing rsp in the prologue. The base pointer rbp provides a stable frame reference — local variables are at [rbp - 8], [rbp - 16], etc., regardless of what happens to rsp later in the function body.

Add a new function abs_diff that demonstrates a full prologue and epilogue. Replace program.asm with:

section .data
    result_msg db "Result: ", 0
    newline    db 10

section .bss
    digit_buf resb 20

section .text
    global _start

; max_int(rdi, rsi) -> rax
max_int:
    cmp rdi, rsi
    jge .rdi_wins
    mov rax, rsi
    ret
.rdi_wins:
    mov rax, rdi
    ret

; abs_diff(rdi, rsi) -> rax
; Computes |rdi - rsi| by calling max_int internally.
; Uses a full prologue/epilogue because it calls another function
; and needs to preserve the original arguments across that call.
abs_diff:
    push rbp
    mov  rbp, rsp
    sub  rsp, 16           ; allocate space for two local qwords

    mov [rbp - 8],  rdi   ; save first argument: a
    mov [rbp - 16], rsi   ; save second argument: b

    call max_int           ; max_int(rdi=a, rsi=b) -> rax = max(a, b)

    ; |a - b| = max(a, b) - min(a, b), and min(a, b) = a + b - max(a, b)
    mov rcx, [rbp - 8]    ; reload a
    mov rdx, [rbp - 16]   ; reload b
    add rcx, rdx           ; a + b
    sub rcx, rax           ; a + b - max = min(a, b)
    sub rax, rcx           ; max - min = |a - b|

    mov rsp, rbp
    pop rbp
    ret

; print_int: prints integer in rax to stdout
print_int:
    push rbx
    lea rdi, [digit_buf + 19]
    mov rcx, 0
.loop:
    xor rdx, rdx
    mov rbx, 10
    div rbx
    add dl, '0'
    dec rdi
    mov [rdi], dl
    inc rcx
    test rax, rax
    jnz .loop
    push rcx
    push rdi
    mov rax, 1
    mov rsi, rdi
    mov rdx, rcx
    mov rdi, 1
    syscall
    pop rdi
    pop rcx
    pop rbx
    ret

_start:
    ; compute abs_diff(20, 35) -> should print 15
    mov rdi, 20
    mov rsi, 35
    call abs_diff

    push rax
    mov rax, 1
    mov rdi, 1
    mov rsi, result_msg
    mov rdx, 8
    syscall
    pop rax

    call print_int

    mov rax, 1
    mov rdi, 1
    mov rsi, newline
    mov rdx, 1
    syscall

    mov rax, 60
    mov rdi, 0
    syscall


Result: 15


abs_diff pushes rbp, sets rbp = rsp, then subtracts 16 from rsp to carve out two local variable slots. The original arguments rdi and rsi are saved to [rbp - 8] and [rbp - 16] before the call to max_int, because rdi and rsi are caller-saved and any callee is free to overwrite them (this max_int happens not to, but abs_diff must not rely on that). After max_int returns, abs_diff reloads the originals from the stack, computes the absolute difference, and restores rsp and rbp before returning.

Why this is better

Without a stack frame, abs_diff would need another safe place to keep its arguments across the call to max_int. A function that calls another function must assume that rdi, rsi, and all other caller-saved registers are destroyed by the callee. The stack frame is the standard, ABI-compatible way to preserve values across calls, and debuggers, profilers, and frame-pointer-based unwinders all know how to walk it.

Note: The System V AMD64 ABI requires rsp to be 16-byte aligned before a call instruction. _start begins with rsp aligned (the kernel sets it up that way). Each call pushes 8 bytes (the return address), misaligning rsp by 8. The push rbp in the prologue pushes another 8 bytes, restoring 16-byte alignment inside the function body. When allocating local space with sub rsp, N, N must be a multiple of 16 to maintain alignment for any nested call.
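
The alignment arithmetic in the note can be verified at run time. This sketch uses a hypothetical helper, check_align, that builds a frame as described and returns the low four bits of rsp; it assumes, as stated above, that _start receives a 16-byte aligned rsp.

```nasm
section .text
    global _start

; check_align() -> rax = 0 if rsp is 16-byte aligned after the prologue
; Hypothetical helper for illustration.
check_align:
    push rbp               ; entry: rsp = 16n + 8; after the push: 16n
    mov  rbp, rsp
    sub  rsp, 32           ; a multiple of 16 keeps the alignment
    mov  rax, rsp
    and  rax, 15           ; the low four bits are zero when aligned
    mov  rsp, rbp
    pop  rbp
    ret

_start:
    call check_align       ; rax = 0 when the frame is aligned
    mov rdi, rax           ; exit status 0 = aligned
    mov rax, 60            ; sys_exit
    syscall
```

Changing sub rsp, 32 to sub rsp, 24 makes the exit status 8, which is the misalignment any nested call would inherit.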


Callee-Saved Registers in Practice

When a function needs to use rbx, r12, r13, r14, or r15 as long-lived scratch registers across multiple operations, it must save them at the top of the function and restore them before returning. The convention guarantees callers can rely on these values being unchanged.

Replace the abs_diff function with a version that uses r12 and r13 as register-resident locals instead of memory:

; abs_diff(rdi, rsi) -> rax
; Uses r12 and r13 as callee-saved locals to avoid memory loads.
abs_diff:
    push rbp
    mov  rbp, rsp
    push r12               ; r12 is callee-saved
    push r13               ; r13 is callee-saved

    mov r12, rdi           ; r12 = a (preserved across call)
    mov r13, rsi           ; r13 = b (preserved across call)

    ; rdi and rsi already set — call max_int(a, b)
    call max_int           ; rax = max(a, b)

    ; compute min(a, b) = a + b - max(a, b)
    mov rcx, r12
    add rcx, r13
    sub rcx, rax           ; rcx = min(a, b)
    sub rax, rcx           ; rax = max - min = |a - b|

    pop r13
    pop r12
    mov rsp, rbp
    pop rbp
    ret


Result: 15


r12 and r13 are callee-saved, so abs_diff pushes them at the top and pops them at the bottom. Inside the function, they hold a and b across the call to max_int without needing memory loads afterward, which keeps the function body free of reloads. The calling convention requires every conforming callee, max_int included, to return with r12 and r13 unchanged.

Why this is better

Callee-saved registers are the assembly-level equivalent of local variables that survive function calls. When a function's inner loop calls a helper repeatedly, using r12-r15 to hold loop state avoids memory accesses that a stack-based approach would require. The compiler does this automatically when optimizing; writing it by hand teaches you exactly what the optimizer is doing.
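
The loop case might look like the following sketch. Both functions are hypothetical and exist only for illustration: square deliberately overwrites rcx and rdx to show that caller-saved registers cannot hold loop state across a call, while sum_squares keeps its counter in r12 and its running total in r13.

```nasm
section .text
    global _start

; square(rdi) -> rax   (hypothetical helper)
; Deliberately overwrites caller-saved registers to show they cannot be trusted.
square:
    mov rax, rdi
    imul rax, rdi
    xor ecx, ecx           ; clobber rcx: allowed, it is caller-saved
    xor edx, edx           ; clobber rdx: allowed, it is caller-saved
    ret

; sum_squares(rdi = n) -> rax = 1^2 + 2^2 + ... + n^2
sum_squares:
    push rbp
    mov  rbp, rsp
    push r12               ; callee-saved: loop counter
    push r13               ; callee-saved: running total

    mov r12, rdi           ; r12 = n
    xor r13d, r13d         ; r13 = 0
.loop:
    test r12, r12
    jz .done
    mov rdi, r12
    call square            ; rax = r12 * r12; r12 and r13 survive the call
    add r13, rax
    dec r12
    jmp .loop
.done:
    mov rax, r13           ; return the total

    pop r13
    pop r12
    pop rbp
    ret

_start:
    mov rdi, 4
    call sum_squares       ; 16 + 9 + 4 + 1 = 30
    mov rdi, rax           ; exit status = 30
    mov rax, 60            ; sys_exit
    syscall
```

The two extra pushes also keep rsp 16-byte aligned at the call to square: the return address and rbp account for 16 bytes, and r12 and r13 for another 16.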


Summary

This tutorial built a multi-function x86-64 assembly program following the System V AMD64 ABI, starting from inlined duplicated logic and arriving at correct, composable functions with proper stack discipline. Every piece of the calling convention was demonstrated in running code:

  • call label pushes the return address onto the stack and jumps; ret pops that address back into rip — any modification to rsp between the two that is not matched by a correction will cause ret to jump to an arbitrary address.
  • The function prologue (push rbp / mov rbp, rsp / sub rsp, N) establishes a stable frame reference and allocates local variable space; the epilogue (mov rsp, rbp / pop rbp / ret) tears it down in reverse.
  • The System V AMD64 ABI passes the first six integer arguments in rdi, rsi, rdx, rcx, r8, r9 and returns the result in rax; functions that only use these registers need no prologue or epilogue.
  • Caller-saved registers (rax, rcx, rdx, rsi, rdi, r8-r11) are assumed destroyed by any call; a function must save them to its own stack frame (or to a callee-saved register) before calling another function if it needs the values afterward.
  • Callee-saved registers (rbx, rbp, r12-r15) must be pushed at the start of any function that uses them and popped before ret; callers can rely on these registers being unchanged across any well-behaved function call.
  • rsp must be 16-byte aligned before a call; the push rbp in the prologue compensates for the 8-byte return address pushed by call, and sub rsp, N must use a multiple of 16 to maintain that alignment through nested calls.
  • Local variables live at negative offsets from rbp, while the saved rbp at [rbp] and the return address at [rbp + 8] link each frame to its caller; frame-pointer-based debuggers and stack unwinders follow that chain to reconstruct a call trace, and breaking it produces the <unknown> frames and corrupted backtraces that are otherwise hard to explain.
