Assembly
The Machine Under the Code: Stack Frames and Function Calls in x86-64
Introduction
Every call stack trace you have ever read — in a debugger, a crash report, or a profiler — is a human-readable rendering of a data structure that the CPU maintains in memory. That structure is the call stack, and it exists because the CPU has no built-in concept of functions. The call instruction is just a jump that saves its return address. The ret instruction is just a jump to the address that call saved. Everything else — argument passing, local variables, register preservation — is a convention that compilers follow consistently so that functions written by different people can call each other without corrupting each other's state.
Understanding the x86-64 System V calling convention is not academic. When a crash report shows a fault at an address like 0x00007ffd..., that address lies in the range where Linux typically places the stack, which points toward a bad stack access or a corrupted stack pointer. When GDB shows <unknown> frames in a backtrace, the unwinder has usually lost the frame chain. When a function silently returns garbage, one likely cause is a register clobber that overwrote the return value. These failures are far easier to diagnose for someone who knows what the stack is supposed to look like at each instruction.
This tutorial builds a working multi-function assembly program step by step: the call and ret mechanism, the function prologue and epilogue, argument passing in registers, local variable allocation on the stack, and a complete example of two functions calling each other with correct stack discipline.
Background
The stack is a region of memory that grows downward toward lower addresses on x86-64. The stack pointer rsp always points to the last item pushed. push rax decrements rsp by 8 and writes rax to the new rsp. pop rax reads from [rsp] and increments rsp by 8.
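The push and pop behavior described above can be checked with a short standalone sketch. It is not part of the tutorial's program.asm; the file name pushpop.asm is a hypothetical choice, and the program reports its result through the exit status rather than printing.

```nasm
; pushpop.asm (hypothetical file name): measures how far push moves rsp
section .text
global _start

_start:
    mov rbx, rsp        ; remember rsp before the push
    mov rax, 42
    push rax            ; rsp -= 8, then [rsp] = 42
    sub rbx, rsp        ; rbx = old rsp - new rsp = 8
    pop rdi             ; rdi = 42, rsp += 8
    add rdi, rbx        ; 42 + 8 = 50
    mov rax, 60         ; sys_exit
    syscall             ; exit status 50
```

Assembled and linked with the same nasm and ld invocation used later for program.asm, the process should exit with status 50, which `echo $?` displays.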
The call label instruction does two things: pushes the address of the instruction after the call (the return address) onto the stack, then jumps to label. The ret instruction pops the top of the stack into rip, returning execution to the caller.
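A minimal sketch can confirm where the return address lands. The routine check_ret and the label after_call are hypothetical names chosen for this illustration; the routine reads [rsp] on entry and compares it with the address of the instruction that follows the call.

```nasm
; callret.asm (hypothetical file name): inspects the return address pushed by call
section .text
global _start

; check_ret: returns 1 in rax if [rsp] holds the address of after_call, else 0
check_ret:
    mov rax, [rsp]          ; on entry, [rsp] is the return address pushed by call
    mov rcx, after_call     ; address of the instruction following the call
    xor edx, edx            ; clear rdx before the compare (xor changes the flags)
    cmp rax, rcx
    sete dl                 ; dl = 1 when the two addresses match
    mov rax, rdx
    ret                     ; pops the return address into rip

_start:
    call check_ret
after_call:
    mov rdi, rax            ; exit status 1 means the addresses matched
    mov rax, 60             ; sys_exit
    syscall
```

On x86-64 Linux the process should exit with status 1, showing that the value at [rsp] on entry is exactly the address of the instruction after the call.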
The System V AMD64 ABI defines the calling convention for Linux. The rules:
- Integer and pointer arguments go in rdi, rsi, rdx, rcx, r8, r9 (in that order).
- The return value goes in rax.
- Caller-saved registers (rax, rcx, rdx, rsi, rdi, r8, r9, r10, r11) may be destroyed by the callee; the caller must save them before the call if it needs them afterward.
- Callee-saved registers (rbx, rbp, r12–r15) must be preserved by the callee; if a function uses them, it must save and restore them.
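The register assignments in this list can be exercised with a small sketch. sum6 is a hypothetical helper invented for this illustration; it takes six integer arguments in the ABI's register order and returns their sum in rax.

```nasm
; sum6.asm (hypothetical file name): six integer arguments in ABI order
section .text
global _start

; sum6(a, b, c, d, e, f) -> rax = a + b + c + d + e + f
sum6:
    mov rax, rdi        ; 1st argument
    add rax, rsi        ; 2nd argument
    add rax, rdx        ; 3rd argument
    add rax, rcx        ; 4th argument
    add rax, r8         ; 5th argument
    add rax, r9         ; 6th argument
    ret                 ; result is returned in rax

_start:
    mov rdi, 1
    mov rsi, 2
    mov rdx, 3
    mov rcx, 4
    mov r8, 5
    mov r9, 6
    call sum6           ; rax = 21
    mov rdi, rax        ; use the sum as the exit status
    mov rax, 60         ; sys_exit
    syscall
```

The process should exit with status 21. sum6 is a leaf function that touches only caller-saved registers, so it has nothing to save or restore.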
The function prologue establishes the stack frame:
push rbp ; save caller's base pointer
mov rbp, rsp ; set up new frame pointer
sub rsp, N ; allocate N bytes for local variables
The function epilogue tears it down:
mov rsp, rbp ; restore stack pointer
pop rbp ; restore caller's base pointer
ret ; return to caller
rbp (the base pointer) holds a fixed reference point within the current frame. Local variables live at negative offsets from rbp (e.g., [rbp - 8]), as do any callee-saved registers pushed after the prologue. The caller's saved rbp sits at [rbp], the return address at [rbp + 8], and arguments passed on the stack start at [rbp + 16].
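The positive-offset side of the frame can be illustrated with a sketch in which an argument is passed on the stack. The routine seventh is a hypothetical example; it assumes the caller has pushed one extra argument beyond the six register arguments, and it simply returns that value.

```nasm
; frame.asm (hypothetical file name): reading a stack-passed argument via rbp
section .text
global _start

; seventh(six register arguments unused here, g on the stack) -> rax = g
seventh:
    push rbp
    mov rbp, rsp
    ; [rbp]      = caller's saved rbp
    ; [rbp + 8]  = return address pushed by call
    ; [rbp + 16] = first stack-passed argument
    mov rax, [rbp + 16]
    pop rbp
    ret

_start:
    sub rsp, 8          ; padding so rsp is 16-byte aligned at the call
    push 77             ; stack-passed argument
    call seventh
    add rsp, 16         ; caller removes the argument and the padding
    mov rdi, rax        ; exit status = the value read from the stack
    mov rax, 60         ; sys_exit
    syscall
```

The process should exit with status 77. Note that the caller, not the callee, removes the stack argument after the call returns.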
Practical Scenario
A game engine's scripting subsystem needs a math utility library written in assembly for performance on its inner loop calculations. The library must expose functions that can be called from C code. Two functions are needed: one that computes the maximum of two integers, and one that computes the absolute difference between them. The difference function calls the maximum function internally.
Both functions must follow the System V AMD64 ABI exactly; otherwise the compiled C code and the assembly library will disagree about which registers and stack slots are safe to use, and calls across the boundary will corrupt state. The team needs to see the complete call frame layout to verify that registers are saved and restored correctly and that rsp is 16-byte aligned before every call. The ABI requires this alignment unconditionally; it matters in practice because code that uses aligned SSE loads and stores on stack data can fault when the stack is misaligned.
The Problem
A first attempt at writing multiple assembly functions ignores the calling convention: the comparison logic is inlined in _start, and the one helper routine overwrites a callee-saved register without saving it.
Create a new file:
touch program.asm
Assemble and run using:
nasm -f elf64 program.asm -o program.o && ld -o program program.o && ./program
section .data
result_msg db "Result: ", 0
newline db 10
section .bss
digit_buf resb 20
section .text
global _start
; print integer in rax to stdout
print_int:
lea rdi, [digit_buf + 19]
mov rcx, 0
.loop:
xor rdx, rdx
mov rbx, 10
div rbx
add dl, '0'
dec rdi
mov [rdi], dl
inc rcx
test rax, rax
jnz .loop
mov rax, 1
mov rsi, rdi
mov rdx, rcx
mov rdi, 1
syscall
ret
_start:
; compute max(7, 12) inline — no function
mov rax, 7
mov rbx, 12
cmp rax, rbx
jge already_max
mov rax, rbx
already_max:
; print "Result: "
push rax
mov rax, 1
mov rdi, 1
mov rsi, result_msg
mov rdx, 8
syscall
pop rax
call print_int
mov rax, 1
mov rdi, 1
mov rsi, newline
mov rdx, 1
syscall
mov rax, 60
mov rdi, 0
syscall
Result: 12
The logic is inlined in _start because there is no function to call. Every place in the program that needs a maximum must duplicate the comparison code. There is no reuse, no clear separation of concerns, and print_int uses rbx as a scratch register without saving it — a correct caller that stores something important in rbx before calling print_int would find that value corrupted on return. The convention is the contract; violating it is silent corruption.
The call Instruction and Return Address
call label behaves like two operations performed by a single instruction: push the address of the next instruction, then jmp label. On entry to the callee, the return address sits at [rsp]. ret pops it back into rip. If anything changes rsp between the call and the ret without a matching correction, ret pops whatever happens to be at [rsp] and jumps there.
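The equivalence can be sketched by building a call by hand. The labels callee and back are hypothetical names for this illustration; the code pushes a return address itself and then jumps, and the ordinary ret in the callee returns to that address.

```nasm
; manualcall.asm (hypothetical file name): push + jmp standing in for call
section .text
global _start

callee:
    mov rdi, 9          ; leave a marker value for the exit status
    ret                 ; pops the hand-pushed address into rip

_start:
    mov rax, back       ; the address that call would have pushed
    push rax            ; step 1 of call: push the return address
    jmp callee          ; step 2 of call: jump to the target
back:
    mov rax, 60         ; sys_exit; rdi still holds 9
    syscall
```

The process should exit with status 9, showing that ret does not care how the address reached the top of the stack. Real code should still use call, which is shorter and lets the CPU's return predictor pair each ret with its call.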
Replace the entire program.asm with the following, which adds a proper max_int function:
section .data
result_msg db "Result: ", 0
newline db 10
section .bss
digit_buf resb 20
section .text
global _start
; max_int(rdi, rsi) -> rax
; Returns the larger of two integers.
; Preserves: rbx, rbp, r12-r15 (callee-saved — none used here)
; Clobbers: rax (return value)
max_int:
cmp rdi, rsi
jge .rdi_wins
mov rax, rsi
ret
.rdi_wins:
mov rax, rdi
ret
; print_int: prints integer in rax to stdout
print_int:
push rbx ; rbx is callee-saved — must preserve it
lea rdi, [digit_buf + 19]
mov rcx, 0
.loop:
xor rdx, rdx
mov rbx, 10
div rbx
add dl, '0'
dec rdi
mov [rdi], dl
inc rcx
test rax, rax
jnz .loop
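mov rax, 1 ; sys_write
mov rsi, rdi ; rsi = start of the digit string
mov rdx, rcx ; rdx = number of digits
mov rdi, 1 ; fd 1 = stdout
syscall ; the kernel clobbers rax, rcx, and r11; none is needed afterward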
pop rbx ; restore rbx before returning
ret
_start:
; call max_int(7, 12)
mov rdi, 7
mov rsi, 12
call max_int ; return value in rax
push rax ; save result before write syscall clobbers rax
mov rax, 1
mov rdi, 1
mov rsi, result_msg
mov rdx, 8
syscall
pop rax ; restore result
call print_int
mov rax, 1
mov rdi, 1
mov rsi, newline
mov rdx, 1
syscall
mov rax, 60
mov rdi, 0
syscall
Result: 12
max_int receives its arguments in rdi and rsi, the first two integer registers in the System V AMD64 ABI, and returns its result in rax. It is a leaf function: it calls nothing, needs no stack locals, and touches no callee-saved registers, so it needs no prologue or epilogue, only its logic and a ret. print_int uses rbx as the divisor, so it saves and restores rbx around its body. Note that print_int still takes its input in rax rather than rdi; that private convention works for callers inside this file but is not how a C caller would pass the argument.
Why this is better
The maximum logic is now in one place. Any code in this binary can call max_int(rdi, rsi) and receive the result in rax. Callee-saved registers are preserved correctly: a caller that holds a value in rbx can call print_int and trust that rbx still holds that value afterward.
The Function Prologue and Stack Frame
When a function needs local variables, it allocates space on the stack by decrementing rsp in the prologue. The base pointer rbp provides a stable frame reference — local variables are at [rbp - 8], [rbp - 16], etc., regardless of what happens to rsp later in the function body.
Add a new function abs_diff that demonstrates a full prologue and epilogue. Replace program.asm with:
section .data
result_msg db "Result: ", 0
newline db 10
section .bss
digit_buf resb 20
section .text
global _start
; max_int(rdi, rsi) -> rax
max_int:
cmp rdi, rsi
jge .rdi_wins
mov rax, rsi
ret
.rdi_wins:
mov rax, rdi
ret
; abs_diff(rdi, rsi) -> rax
; Computes |rdi - rsi| by calling max_int internally.
; Uses a full prologue/epilogue because it calls another function
; and needs to preserve the original arguments across that call.
abs_diff:
push rbp
mov rbp, rsp
sub rsp, 16 ; allocate space for two local qwords
mov [rbp - 8], rdi ; save first argument: a
mov [rbp - 16], rsi ; save second argument: b
call max_int ; max_int(rdi=a, rsi=b) -> rax = max(a, b)
mov rcx, [rbp - 8] ; reload a
mov rdx, [rbp - 16] ; reload b
; |a - b| = max(a, b) - min(a, b), and min(a, b) = a + b - max(a, b)
add rcx, rdx ; rcx = a + b
sub rcx, rax ; rcx = a + b - max = min(a, b)
sub rax, rcx ; rax = max - min = |a - b|
mov rsp, rbp
pop rbp
ret
; print_int: prints integer in rax to stdout
print_int:
push rbx
lea rdi, [digit_buf + 19]
mov rcx, 0
.loop:
xor rdx, rdx
mov rbx, 10
div rbx
add dl, '0'
dec rdi
mov [rdi], dl
inc rcx
test rax, rax
jnz .loop
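mov rax, 1 ; sys_write
mov rsi, rdi ; rsi = start of the digit string
mov rdx, rcx ; rdx = number of digits
mov rdi, 1 ; fd 1 = stdout
syscall ; the kernel clobbers rax, rcx, and r11; none is needed afterward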
pop rbx
ret
_start:
; compute abs_diff(20, 35) -> should print 15
mov rdi, 20
mov rsi, 35
call abs_diff
push rax
mov rax, 1
mov rdi, 1
mov rsi, result_msg
mov rdx, 8
syscall
pop rax
call print_int
mov rax, 1
mov rdi, 1
mov rsi, newline
mov rdx, 1
syscall
mov rax, 60
mov rdi, 0
syscall
Result: 15
abs_diff pushes rbp, sets rbp = rsp, then subtracts 16 from rsp to carve out two local variable slots. The original arguments rdi and rsi are saved to [rbp - 8] and [rbp - 16] before the call to max_int, because both registers are caller-saved and the callee is free to overwrite them. This particular max_int happens to leave them alone, but abs_diff must not rely on that. After max_int returns, abs_diff reloads the originals from the stack, computes the absolute difference, and restores rsp and rbp before returning.
Why this is better
abs_diff needs somewhere safe to keep its arguments across the call to max_int. A function that calls another function must assume that rdi, rsi, and all other caller-saved registers are destroyed by the callee. A stack frame is the standard, ABI-compatible place for such values, and the rbp chain it creates is something debuggers and profilers can walk.
Note: The System V AMD64 ABI requires rsp to be 16-byte aligned before a call instruction. _start begins with rsp aligned (the kernel sets it up that way). Each call pushes 8 bytes (the return address), misaligning rsp by 8. The push rbp in the prologue pushes another 8 bytes, restoring 16-byte alignment inside the function body. When allocating local space with sub rsp, N, the sum of N and the bytes of any additional pushes must be a multiple of 16 to maintain alignment for any nested call.
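The alignment arithmetic in the note can be observed directly. The routine probe is a hypothetical helper for this sketch; it records rsp modulo 16 on entry and again after push rbp, and packs both values into the exit status. The sketch assumes, as the note says, that rsp is 16-byte aligned when _start begins.

```nasm
; align.asm (hypothetical file name): observes rsp alignment around a prologue
section .text
global _start

; probe -> rax = (rsp mod 16 at entry) * 16 + (rsp mod 16 after push rbp)
probe:
    mov rax, rsp
    and rax, 15         ; 8 at entry: call pushed an 8-byte return address
    shl rax, 4          ; move the entry value into the high nibble
    push rbp
    mov rbp, rsp
    mov rcx, rsp
    and rcx, 15         ; 0 here: push rbp restored 16-byte alignment
    or rax, rcx
    pop rbp
    ret

_start:
    call probe          ; rsp is 16-byte aligned at process entry
    mov rdi, rax        ; expected value: 8 * 16 + 0 = 128
    mov rax, 60         ; sys_exit
    syscall
```

The process should exit with status 128: the high nibble 8 is the misalignment on entry, and the low nibble 0 is the restored alignment after push rbp.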
Callee-Saved Registers in Practice
When a function needs to use rbx, r12, r13, r14, or r15 as long-lived scratch registers across multiple operations, it must save them at the top of the function and restore them before returning. The convention guarantees callers can rely on these values being unchanged.
Replace the abs_diff function with a version that uses r12 and r13 as register-resident locals instead of memory:
; abs_diff(rdi, rsi) -> rax
; Uses r12 and r13 as callee-saved locals to avoid memory loads.
abs_diff:
push rbp
mov rbp, rsp
push r12 ; r12 is callee-saved
push r13 ; r13 is callee-saved
mov r12, rdi ; r12 = a (preserved across call)
mov r13, rsi ; r13 = b (preserved across call)
; rdi and rsi already set — call max_int(a, b)
call max_int ; rax = max(a, b)
; compute min(a, b) = a + b - max(a, b)
mov rcx, r12
add rcx, r13
sub rcx, rax ; rcx = min(a, b)
sub rax, rcx ; rax = max - min = |a - b|
pop r13
pop r12
mov rsp, rbp
pop rbp
ret
Result: 15
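r12 and r13 are callee-saved, so abs_diff pushes them at the top and pops them at the bottom. Inside the function, they hold a and b across the call to max_int without needing memory loads afterward. For a single call the saving is nil, since the two pushes and pops touch memory just as the spills did; the approach pays off when the values stay live across many calls. The calling convention guarantees that max_int returns with r12 and r13 unchanged. Because the two pops leave rsp equal to rbp, the mov rsp, rbp in the epilogue is redundant here but harmless.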
Why this is better
Callee-saved registers are the assembly-level equivalent of local variables that survive function calls. When a function's inner loop calls a helper repeatedly, using r12–r15 to hold loop state avoids memory accesses that a stack-based approach would require. The compiler does this automatically when optimizing; writing it by hand teaches you exactly what the optimizer is doing.
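The loop case can be sketched as follows. Both double_it and sum_doubles are hypothetical routines written for this illustration. double_it deliberately zeroes several caller-saved registers, as any callee is allowed to do, while sum_doubles keeps its loop counter and running total in r12 and r13 so they survive every call.

```nasm
; loopstate.asm (hypothetical file name): loop state held in callee-saved registers
section .text
global _start

; double_it(rdi) -> rax = 2 * rdi; trashes caller-saved registers on purpose
double_it:
    lea rax, [rdi + rdi]
    xor ecx, ecx
    xor edx, edx
    xor esi, esi
    xor edi, edi
    ret

; sum_doubles(n) -> rax = 2*1 + 2*2 + ... + 2*n
sum_doubles:
    push r12
    push r13
    push rbx            ; unused; the third push keeps rsp 16-byte aligned at the call
    mov r12, rdi        ; r12 = loop counter
    xor r13d, r13d      ; r13 = running total
.next:
    test r12, r12
    jz .done
    mov rdi, r12
    call double_it      ; may destroy rdi, rsi, rcx, rdx; r12 and r13 survive
    add r13, rax
    dec r12
    jmp .next
.done:
    mov rax, r13
    pop rbx
    pop r13
    pop r12
    ret

_start:
    mov rdi, 4
    call sum_doubles    ; 2 + 4 + 6 + 8 = 20
    mov rdi, rax
    mov rax, 60         ; sys_exit
    syscall
```

The process should exit with status 20. This sketch omits the rbp frame, so the odd number of pushes does the alignment work that push rbp did in abs_diff.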
Summary
This tutorial built a multi-function x86-64 assembly program following the System V AMD64 ABI, starting from inlined duplicated logic and arriving at correct, composable functions with proper stack discipline. Every piece of the calling convention was demonstrated in running code:
- call label pushes the return address onto the stack and jumps; ret pops that address back into rip. Any modification to rsp between the two that is not matched by a correction will cause ret to jump to an arbitrary address.
- The function prologue (push rbp / mov rbp, rsp / sub rsp, N) establishes a stable frame reference and allocates local variable space; the epilogue (mov rsp, rbp / pop rbp / ret) tears it down in reverse.
- The System V AMD64 ABI passes the first six integer arguments in rdi, rsi, rdx, rcx, r8, r9 and returns the result in rax; leaf functions that only use these registers need no prologue or epilogue.
- Caller-saved registers (rax, rcx, rdx, rsi, rdi, r8–r11) are assumed destroyed by any call; a function must save them in its stack frame or in a callee-saved register before calling another function if it needs the values afterward.
- Callee-saved registers (rbx, rbp, r12–r15) must be pushed at the start of any function that uses them and popped before ret; callers can rely on these registers being unchanged across any well-behaved function call.
- rsp must be 16-byte aligned before a call; the push rbp in the prologue compensates for the 8-byte return address pushed by call, and sub rsp, N together with any further pushes must total a multiple of 16 to maintain that alignment through nested calls.
- Local variables live at negative offsets from rbp, and the saved rbp at [rbp] links each frame to its caller. That chain is what frame-pointer-based debuggers and unwinders follow to reconstruct a call trace; breaking it produces the <unknown> frames and corrupted backtraces that are hard to explain without this knowledge.