Registers, Arithmetic, and Control Flow in x86-64 Assembly
Introduction
Every high-level operation, whether an addition, a comparison, or a loop, compiles down to a handful of x86-64 instructions operating directly on CPU registers. Understanding those registers and the instructions that manipulate them is what separates a developer who can read disassembly from one who cannot. When a C compiler emits a tight inner loop, when a profiler reports that an arithmetic operation is unexpectedly slow, or when a debugger shows register values at a crash site, the developer who knows what rax, rdx, and the flags register contain at each moment can diagnose the problem quickly. Everyone else has to experiment blindly.
The x86-64 register file is not a flat array of identically sized slots. Every general-purpose register has four overlapping views (64-bit, 32-bit, 16-bit, and 8-bit) that refer to the same underlying storage. Code that uses the wrong view introduces silent truncation bugs that are hard to find, because the value looks correct until it overflows a boundary the programmer did not know existed. Signed division requires an explicit sign-extension step that beginners often forget; skipping it yields either a wrong quotient or a divide exception, which on Linux arrives as SIGFPE and terminates a program that has no handler installed. Conditional branches depend on flags that the CPU sets automatically, and knowing which instruction sets which flag, and which flag a given jump instruction reads, is the prerequisite for writing any loop or comparison correctly.
This tutorial builds an integer statistics calculator in x86-64 assembly that computes the sum, average, minimum, and maximum of a fixed array of values. Every step introduces one cluster of concepts: the register hierarchy, the mov variants, signed arithmetic with add/sub/imul/idiv, the flags register, and conditional plus unconditional branching with cmp, jz/jne/jl/jg, and jmp.
Background
The x86-64 CPU has sixteen general-purpose registers. Each one has a name at four different widths:
- rax (64-bit) / eax (32-bit) / ax (16-bit) / al (8-bit low; ah is the 8-bit high byte of ax)
- rbx / ebx / bx / bl
- rcx / ecx / cx / cl
- rdx / edx / dx / dl
- rsi, rdi, rsp, rbp, r8–r15 follow the same pattern
Writing to eax (32-bit) zero-extends into rax — the upper 32 bits are cleared automatically. Writing to ax (16-bit) or al (8-bit) does not clear the upper bits; the rest of rax is preserved. This asymmetry is intentional and matters when mixing sizes.
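The asymmetry can be checked with a small standalone program. The following is a minimal sketch, not part of the tutorial's calculator: the file name widths.asm is hypothetical, and the program reports its findings through the process exit status rather than by printing. Assemble and link it with the same nasm and ld commands used for program.asm, run it, and then inspect the status with echo $?.

```nasm
; widths.asm (hypothetical file name)
section .text
global _start
_start:
    mov rax, -1          ; rax = 0xFFFFFFFFFFFFFFFF
    mov al, 0            ; 8-bit write: rax = 0xFFFFFFFFFFFFFF00, upper 56 bits preserved
    mov rbx, rax         ; keep a copy of that value
    mov eax, 5           ; 32-bit write: rax = 0x0000000000000005, upper 32 bits cleared

    xor edi, edi         ; exit status starts at 0
    cmp rax, 5
    jne .exit
    or  edi, 1           ; bit 0 set: the eax write zero-extended into rax
    cmp rbx, -256        ; -256 = 0xFFFFFFFFFFFFFF00
    jne .exit
    or  edi, 2           ; bit 1 set: the al write kept the upper bits
.exit:
    mov eax, 60          ; sys_exit
    syscall
```

An exit status of 3 means both observations held: the 32-bit write cleared the upper half of rax, and the 8-bit write left it alone.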
The flags register (rflags) holds single-bit results set automatically after arithmetic and comparison instructions:
- ZF (Zero Flag): set when the result is zero
- SF (Sign Flag): set when the result is negative (the sign bit is 1)
- CF (Carry Flag): set on unsigned overflow or borrow
- OF (Overflow Flag): set on signed overflow
cmp a, b computes a - b and sets flags without storing the result. Conditional jump instructions read specific flags: jz jumps if ZF=1, jne jumps if ZF=0, jl jumps if SF≠OF (signed less than), jg jumps if ZF=0 and SF=OF (signed greater than).
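To see why the choice of jump matters, the sketch below compares the same two bit patterns twice, once with the signed jump jl and once with jb, its unsigned counterpart that reads CF. It is an illustration under assumptions: the values -1 and 1 are arbitrary, and the result is reported through the exit status (check it with echo $? after running).

```nasm
section .text
global _start
_start:
    mov rax, -1          ; signed: -1, unsigned: 0xFFFFFFFFFFFFFFFF
    mov rbx, 1
    xor edi, edi         ; exit status starts at 0

    cmp rax, rbx         ; computes rax - rbx and sets flags only
    jl  .signed_less     ; SF != OF: taken, because -1 < 1 as signed values
    jmp .check_unsigned
.signed_less:
    or  edi, 1           ; bit 0 set: signed comparison said "less"
.check_unsigned:
    cmp rax, rbx         ; compare again, since or changed the flags
    jb  .unsigned_less   ; CF = 1 would mean "below"; not taken here
    jmp .exit
.unsigned_less:
    or  edi, 2           ; bit 1 set: unsigned comparison said "below"
.exit:
    mov eax, 60          ; sys_exit
    syscall
```

The exit status is 1: the signed jump sees -1 as less than 1, while the unsigned jump sees 0xFFFFFFFFFFFFFFFF as far above 1 and falls through.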
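The 64-bit form of idiv divides the 128-bit value formed by rdx:rax by its operand, leaving the quotient in rax and the remainder in rdx. Before dividing, rdx must hold the sign extension of rax: cqo does this for 64-bit operands (rax into rdx:rax), and cdq does it for 32-bit operands (eax into edx:eax). Forgetting this step leaves whatever was previously in rdx as the upper half of the dividend, which produces either a wrong quotient or, when the quotient does not fit in 64 bits, a divide exception.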
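A negative dividend is where the sign-extension step visibly matters. The sketch below divides -371 by 8; the dividend is a hypothetical value chosen to mirror the tutorial's sum with the sign flipped, and the quotient is negated before exit only so that it fits in an exit status.

```nasm
section .text
global _start
_start:
    mov rax, -371        ; signed dividend
    cqo                  ; rdx = 0xFFFFFFFFFFFFFFFF, the sign of rax copied into every bit
    mov rbx, 8
    idiv rbx             ; rax = -46 (quotient truncates toward zero), rdx = -3
    neg rax              ; 46, so the result fits in an exit status
    mov rdi, rax
    mov eax, 60          ; sys_exit
    syscall
```

Running the program and then echo $? shows 46. Replacing cqo with xor edx, edx would instead present idiv with a large positive 128-bit dividend and give an unrelated quotient.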
Practical Scenario
A systems team is building a performance monitoring tool that samples CPU utilisation readings from hardware counters at fixed intervals and stores them as integer percentages in a shared memory segment. At the end of each measurement window, the tool must compute the sum, average, minimum, and maximum of the collected readings entirely in assembly — no C runtime, no library calls — because the tool runs in a context where the C runtime is not initialised and calling it would corrupt the thread state the monitoring system is trying to observe.
The team needs a reference implementation that demonstrates how to traverse an array declared in .data, perform signed integer arithmetic using all four arithmetic instructions, and select minimum and maximum values using conditional branches. The same register patterns and branching idioms appear verbatim in hot loops throughout the codebase, so getting them right in this canonical form matters for everything built on top.
The Problem
A first attempt iterates the array with hardcoded iteration logic that does not generalise to different array sizes and omits sign-extension before division.
Create a new file:
touch program.asm
Assemble and run using:
nasm -f elf64 program.asm -o program.o && ld -o program program.o && ./program
section .data
values dq 42, 17, 89, 55, 23, 76, 8, 61
count equ 8
newline db 10
section .bss
result_buf resb 20
section .text
global _start
_start:
; sum all values manually (hardcoded unroll — wrong approach)
mov rax, [values + 0]
add rax, [values + 8]
add rax, [values + 16]
add rax, [values + 24]
add rax, [values + 32]
add rax, [values + 40]
add rax, [values + 48]
add rax, [values + 56]
; divide by count — missing cdq/cqo, rdx may contain garbage
mov rbx, count
div rbx ; WRONG: unsigned division of a possibly signed dividend, and rdx was never initialised
; print rax as decimal
lea rdi, [result_buf + 19]
mov rcx, 0
.conv:
xor rdx, rdx
mov rbx, 10
div rbx
add dl, '0'
dec rdi
mov [rdi], dl
inc rcx
test rax, rax
jnz .conv
mov rax, 1
mov rsi, rdi
mov rdx, rcx
mov rdi, 1
syscall
mov rax, 1
mov rdi, 1
mov rsi, newline
mov rdx, 1
syscall
mov rax, 60
xor rdi, rdi
syscall
46
Three problems are already present. First, the unrolled addition is hardcoded to exactly eight elements: changing the array requires rewriting the entire addition block. Second, the first div rbx uses div (unsigned division) and never clears rdx before executing; the program prints the right answer here only because rdx happens to be zero at that point. If any prior code leaves a non-zero value there, the division operates on a 128-bit dividend and produces a wildly incorrect quotient or a divide exception. Third, minimum and maximum are never computed at all. Each of these flaws needs its own fix.
Register Aliases: Choosing the Right Width
Each general-purpose register exposes its storage at four widths. Using the wrong width silently truncates values. The value 89 stored in rax and then read through al gives 89 — but the value 300 stored in rax and read through al gives 44 (300 mod 256) with no warning, no exception, and no flag set.
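The 300-versus-44 arithmetic can be confirmed directly. This is a minimal sketch that stores 300 in rax and then inspects it through two narrower views; it signals the outcome through the exit status (echo $? after running) rather than printing.

```nasm
section .text
global _start
_start:
    mov rax, 300         ; 300 = 0x12C
    xor edi, edi         ; exit status starts at 0
    cmp al, 44           ; al sees only the low byte, 0x2C = 44
    jne .exit
    or  edi, 1           ; bit 0 set: the 8-bit view reads 44
    cmp ax, 300          ; ax is wide enough to hold the whole value
    jne .exit
    or  edi, 2           ; bit 1 set: the 16-bit view still reads 300
.exit:
    mov eax, 60          ; sys_exit
    syscall
```

An exit status of 3 confirms both views: the same storage reads as 44 through al and as 300 through ax, with no flag or exception marking the difference.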
Replace the entire content of program.asm to add a dedicated print routine that uses al correctly as a single-byte ASCII holder while keeping the full 64-bit value in rax:
section .data
values dq 42, 17, 89, 55, 23, 76, 8, 61
count equ 8
newline db 10
section .bss
digit_buf resb 20
section .text
global _start
; print_int: prints the integer in rax as decimal followed by newline
; clobbers: rax, rbx, rcx, rdx, rsi, rdi
print_int:
lea rsi, [digit_buf + 19]
mov rcx, 0
.loop:
xor rdx, rdx
mov rbx, 10
div rbx ; rax = quotient, rdx = remainder
add dl, '0' ; dl is the low 8 bits of rdx — one ASCII digit
dec rsi
mov [rsi], dl ; store the ASCII byte
inc rcx
test rax, rax
jnz .loop
mov rax, 1
mov rdi, 1
; rsi already points to first digit
mov rdx, rcx
syscall
mov rax, 1
mov rdi, 1
lea rsi, [newline]
mov rdx, 1
syscall
ret
_start:
mov rax, 42 ; 42 fits in al without truncation
call print_int
mov rax, 371 ; 371 does NOT fit in a byte — must use rax, not al
call print_int
mov rax, 60
xor rdi, rdi
syscall
42
371
dl (the low byte of rdx) holds one ASCII digit at a time — the remainder of dividing by 10 is always 0–9, so one byte suffices. The full quotient stays in rax (64-bit) throughout. Using al to hold the full accumulated sum would have silently wrapped at 256.
Why this is better
Mixing register widths without intent is one of the most common sources of silent data corruption in assembly. Making the width selection explicit — dl for a single digit, rax for the full running total — means the code documents its invariants in the instructions themselves. The print_int subroutine also eliminates the repeated inline conversion block, making the remaining sections shorter and easier to reason about.
mov with Register, Immediate, and Memory Operands
mov has three operand forms that work differently. mov rax, 42 loads an immediate constant. mov rax, rbx copies a register. mov rax, [label] loads eight bytes from a memory address (for rax, a qword). The square brackets are the dereference operator — omitting them loads the address itself, not the value at that address.
Replace the entire content of program.asm to demonstrate all three forms while loading the first value from the array:
section .data
values dq 42, 17, 89, 55, 23, 76, 8, 61
count equ 8
newline db 10
section .bss
digit_buf resb 20
section .text
global _start
print_int:
lea rsi, [digit_buf + 19]
mov rcx, 0
.loop:
xor rdx, rdx
mov rbx, 10
div rbx
add dl, '0'
dec rsi
mov [rsi], dl
inc rcx
test rax, rax
jnz .loop
mov rax, 1
mov rdi, 1
mov rdx, rcx
syscall
mov rax, 1
mov rdi, 1
lea rsi, [newline]
mov rdx, 1
syscall
ret
_start:
; Form 1: immediate — load a constant directly into a register
mov rax, 100
; Form 2: register-to-register copy
mov rbx, rax ; rbx = 100
; Form 3: memory load — load the first element of values (42)
mov rcx, [values] ; rcx = 42 (8 bytes, matching rcx's qword size)
; lea loads an address, not the value at that address
lea rdx, [values] ; rdx = address of values array, NOT 42
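; print_int overwrites rbx, rcx, and rdx, so park the copies in r12–r14 first
mov r12, rbx
mov r13, rcx
mov r14, rdx
; print all four to confirm
call print_int ; rax is still 100: prints 100
mov rax, r12
call print_int ; prints 100 (the value copied into rbx)
mov rax, r13
call print_int ; prints 42 (the value loaded into rcx)
mov rax, r14
call print_int ; prints the address that lea placed in rdx, a large number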
mov rax, 60
xor rdi, rdi
syscall
100
100
42
4194368
The fourth line prints the address of the values array, a raw memory address, not the 42 stored there. lea rdx, [values] loads the pointer; mov rdx, [values] would load the value. The exact number depends on the linker version and section layout, so expect a value different from the one shown; for a statically linked, non-position-independent binary like this one it stays the same from run to run.
Why this is better
The distinction between mov and lea, and between [label] and label, is where most assembly beginners make their first memory access error. Printing all four forms and seeing their outputs side by side makes the difference concrete. The address printed for lea confirms that pointer arithmetic operates on 64-bit values — the same register rdx that holds a single-digit number also holds a 7-digit address without truncation.
Note: NASM requires that the memory operand width match the register width unless an explicit size override (qword, dword, word, byte) is used. mov rax, [values] loads a qword because rax is 64-bit. To load only the first byte, write movzx rax, byte [values] (zero-extend) or movsx rax, byte [values] (sign-extend).
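The difference between the two extending loads shows up as soon as the byte has its top bit set. The sketch below assumes a hypothetical one-byte variable named sample holding 0xF0 and reports the result through the exit status.

```nasm
section .data
sample db 0xF0                 ; hypothetical byte: 240 if unsigned, -16 if signed

section .text
global _start
_start:
    movzx rax, byte [sample]   ; zero-extend: rax = 240
    movsx rbx, byte [sample]   ; sign-extend: rbx = -16 (0xFFFFFFFFFFFFFFF0)
    add rax, rbx               ; 240 + (-16) = 224
    mov rdi, rax
    mov eax, 60                ; sys_exit
    syscall
```

The exit status is 224. Had both loads produced the same value, the sum would have been 480 or -32 instead, so the result confirms that the two instructions interpret the same byte differently.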
add and sub: Summing an Array with a Loop
add dest, src adds src to dest and stores the result in dest. sub dest, src subtracts. Both update the flags register. A loop over an array uses a base register holding the start address, an offset computed from a counter multiplied by the element size, and a conditional jump that exits when the counter reaches the element count.
Replace the entire content of program.asm with a version that sums the array using a proper loop:
section .data
values dq 42, 17, 89, 55, 23, 76, 8, 61
count equ 8
newline db 10
section .bss
digit_buf resb 20
section .text
global _start
print_int:
lea rsi, [digit_buf + 19]
mov rcx, 0
.loop:
xor rdx, rdx
mov rbx, 10
div rbx
add dl, '0'
dec rsi
mov [rsi], dl
inc rcx
test rax, rax
jnz .loop
mov rax, 1
mov rdi, 1
mov rdx, rcx
syscall
mov rax, 1
mov rdi, 1
lea rsi, [newline]
mov rdx, 1
syscall
ret
_start:
xor rax, rax ; sum accumulator = 0
xor rcx, rcx ; loop counter = 0
sum_loop:
cmp rcx, count
jge sum_done
mov rbx, [values + rcx * 8] ; load element at index rcx
add rax, rbx ; accumulate into rax
inc rcx
jmp sum_loop
sum_done:
call print_int ; prints the sum
mov rax, 60
xor rdi, rdi
syscall
371
values + rcx * 8 uses x86-64 scaled-index addressing, written here in NASM syntax: rcx holds the element index, and multiplying by 8 converts it to a byte offset because each dq element occupies 8 bytes. add rax, rbx accumulates, so rax grows by the loaded value on each iteration. cmp rcx, count followed by jge sum_done exits the loop once the counter reaches the array length.
Why this is better
The unrolled version required one add instruction per element and silently broke whenever the array size changed. The loop version handles any count stored in the count constant — change the array declaration and the count constant, and the loop adapts automatically. This is the pattern that compilers emit for array traversal, and recognising it in disassembly is a prerequisite for performance analysis.
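One way to avoid editing the count by hand is to let the assembler derive it from the array's size. The sketch below is a variation under assumptions, not the tutorial's listing: it uses a shorter hypothetical array so that the sum fits in an exit status, and it relies on NASM's $ token, which evaluates to the current assembly position, so the equ line must directly follow the array.

```nasm
section .data
values dq 10, 20, 30, 40, 50          ; hypothetical readings
count  equ ($ - values) / 8           ; bytes emitted so far divided by 8 bytes per dq

section .text
global _start
_start:
    xor eax, eax                      ; sum = 0 (the 32-bit write zero-extends into rax)
    xor ecx, ecx                      ; index = 0
.sum_loop:
    cmp rcx, count
    jge .sum_done
    add rax, [values + rcx * 8]       ; add the element straight from memory
    inc rcx
    jmp .sum_loop
.sum_done:
    mov rdi, rax                      ; exit status = sum
    mov eax, 60                       ; sys_exit
    syscall
```

Running it and then echo $? shows 150. Appending or removing elements changes count automatically at assembly time, so the loop needs no other edit.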
imul and idiv with cqo: Signed Multiply and Divide
imul reg, reg multiplies two signed integers and stores the result in the destination register. idiv reg divides the signed 128-bit value rdx:rax by the operand, storing the quotient in rax and the remainder in rdx. The critical prerequisite: before idiv, execute cqo to sign-extend rax into rdx:rax. Without it, rdx contains whatever an earlier instruction left there, and the division operates on a garbage 128-bit dividend.
Replace the entire content of program.asm to compute the average using cqo + idiv:
section .data
values dq 42, 17, 89, 55, 23, 76, 8, 61
count equ 8
newline db 10
section .bss
digit_buf resb 20
section .text
global _start
print_int:
lea rsi, [digit_buf + 19]
mov rcx, 0
.loop:
xor rdx, rdx
mov rbx, 10
div rbx
add dl, '0'
dec rsi
mov [rsi], dl
inc rcx
test rax, rax
jnz .loop
mov rax, 1
mov rdi, 1
mov rdx, rcx
syscall
mov rax, 1
mov rdi, 1
lea rsi, [newline]
mov rdx, 1
syscall
ret
_start:
; --- compute sum ---
xor rax, rax
xor rcx, rcx
sum_loop:
cmp rcx, count
jge sum_done
mov rbx, [values + rcx * 8]
add rax, rbx
inc rcx
jmp sum_loop
sum_done:
mov r12, rax ; save sum in r12 (callee-saved, not clobbered by print_int)
; --- compute average: sum / count ---
mov rax, r12
cqo ; sign-extend rax into rdx:rax (sets rdx = 0 for positive rax)
mov rbx, count
idiv rbx ; rax = quotient (average), rdx = remainder
call print_int ; prints 46
; --- demonstrate imul: multiply average by 2 ---
mov rax, 46 ; print_int overwrote rax, so reload the average (46 for this array)
imul rax, rax, 2 ; three-operand form: rax = rax * 2
call print_int ; prints 92
mov rax, 60
xor rdi, rdi
syscall
46
92
cqo copies the sign bit of rax into every bit of rdx, making rdx:rax a properly sign-extended 128-bit representation of the signed integer in rax. For a positive sum like 371, cqo sets rdx to zero — but for a negative sum it would set rdx to 0xFFFFFFFFFFFFFFFF, which is the correct two's-complement sign extension. imul rax, rax, 2 uses the three-operand form where the third argument is an immediate multiplier stored directly in the instruction encoding.
Why this is better
Using div (unsigned) on a signed value and skipping cqo before idiv are two of the most common arithmetic bugs in x86-64 assembly. Both stay silent as long as the dividend is non-negative and rdx happens to be zero, and both turn catastrophic the moment either condition fails. Making cqo a mandatory step before every 64-bit idiv is the correct discipline, and using imul for signed multiplication instead of mul prevents sign interpretation errors in the product.
Note: imul has three forms: one-operand (imul rbx multiplies rax by rbx and stores the 128-bit result in rdx:rax), two-operand (imul rax, rbx computes rax *= rbx, with the result truncated to 64 bits), and three-operand (imul rax, rbx, 5 computes rax = rbx * 5, also truncated to 64 bits). The two- and three-operand forms are the ones used in practice because the one-operand form overwrites rdx.
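The three forms can be exercised side by side. The following sketch uses small hypothetical operands and adds the three products together so that a single exit status covers all of them.

```nasm
section .text
global _start
_start:
    ; one-operand form: rdx:rax = rax * rbx, so rdx is overwritten
    mov rax, 6
    mov rbx, 7
    imul rbx             ; rax = 42, rdx = 0
    mov r12, rax         ; keep the first product

    ; two-operand form: destination *= source
    mov rcx, 3
    imul rcx, rbx        ; rcx = 3 * 7 = 21

    ; three-operand form: destination = source * immediate
    imul rsi, rbx, 5     ; rsi = 7 * 5 = 35

    lea rdi, [r12 + rcx] ; 42 + 21 = 63
    add rdi, rsi         ; 63 + 35 = 98
    mov eax, 60          ; sys_exit
    syscall
```

The exit status is 98. Only the first multiplication touched rdx; the other two left every register except their destination unchanged.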
The Flags Register and Conditional Jumps: Finding Min and Max
cmp a, b subtracts b from a and discards the result, but sets ZF, SF, OF, and CF exactly as if the subtraction had been stored. Conditional jump instructions read specific flag combinations: jl (jump if less, signed) jumps when SF≠OF, jg (jump if greater, signed) jumps when ZF=0 and SF=OF, jz jumps when ZF=1, jne jumps when ZF=0. Chaining cmp with the right conditional jump implements every comparison-based selection.
Replace the entire content of program.asm with the final version that computes sum, average, minimum, and maximum:
section .data
values dq 42, 17, 89, 55, 23, 76, 8, 61
count equ 8
newline db 10
msg_sum db "sum=", 0
msg_avg db "avg=", 0
msg_min db "min=", 0
msg_max db "max=", 0
section .bss
digit_buf resb 20
section .text
global _start
print_label:
; rsi = pointer to null-terminated label string
; find length
xor rcx, rcx
.scan:
cmp byte [rsi + rcx], 0
jz .done_scan
inc rcx
jmp .scan
.done_scan:
mov rax, 1
mov rdi, 1
mov rdx, rcx
syscall
ret
print_int:
lea rsi, [digit_buf + 19]
mov rcx, 0
.loop:
xor rdx, rdx
mov rbx, 10
div rbx
add dl, '0'
dec rsi
mov [rsi], dl
inc rcx
test rax, rax
jnz .loop
mov rax, 1
mov rdi, 1
mov rdx, rcx
syscall
mov rax, 1
mov rdi, 1
lea rsi, [newline]
mov rdx, 1
syscall
ret
_start:
; initialise: load first element as starting min and max
mov r13, [values] ; r13 = min
mov r14, [values] ; r14 = max
xor r12, r12 ; r12 = sum
xor rcx, rcx ; loop counter
stats_loop:
cmp rcx, count
jge stats_done
mov rax, [values + rcx * 8]
; accumulate sum
add r12, rax
; update min: if rax < r13, then r13 = rax
cmp rax, r13
jge not_min
mov r13, rax
not_min:
; update max: if rax > r14, then r14 = rax
cmp rax, r14
jle not_max
mov r14, rax
not_max:
inc rcx
jmp stats_loop
stats_done:
; compute average
mov rax, r12
cqo
mov rbx, count
idiv rbx
mov r15, rax ; r15 = average
; print sum
lea rsi, [msg_sum]
call print_label
mov rax, r12
call print_int
; print average
lea rsi, [msg_avg]
call print_label
mov rax, r15
call print_int
; print min
lea rsi, [msg_min]
call print_label
mov rax, r13
call print_int
; print max
lea rsi, [msg_max]
call print_label
mov rax, r14
call print_int
mov rax, 60
xor rdi, rdi
syscall
sum=371
avg=46
min=8
max=89
The min/max selection uses jge and jle — signed comparison jumps. The CPU sets SF and OF when executing cmp rax, r13; jge reads SF and OF together (jumps when SF=OF) rather than reading CF the way unsigned jae would. Initialising both r13 and r14 to the first element rather than to 0 or a sentinel value means the algorithm is correct even if all elements are negative.
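The selection logic handles negative values, but print_int divides with unsigned div and would print a negative result as a very large positive number. The sketch below pairs the signed minimum scan with a sign-aware print routine. Everything in it is illustrative: print_signed and out_buf are hypothetical names, and the array is a made-up sample with mixed signs.

```nasm
section .data
values  dq -42, 17, -89, 55          ; hypothetical readings with mixed signs
count   equ 4

section .bss
out_buf resb 24                      ; sign + up to 19 digits + newline fits in 24 bytes

section .text
global _start

; print_signed (hypothetical helper): prints the signed integer in rax, then a newline
; clobbers: rax, rbx, rcx, rdx, rsi, rdi, r8, r11
print_signed:
    lea rsi, [out_buf + 23]
    mov byte [rsi], 10               ; newline goes in the last byte
    mov r8, rax                      ; remember the original value for the sign check
    test rax, rax
    jns .digits                      ; SF = 0: non-negative, nothing to negate
    neg rax                          ; work on the magnitude
.digits:
    mov rbx, 10
.loop:
    xor edx, edx                     ; unsigned divide: upper half of the dividend is zero
    div rbx                          ; rax = quotient, rdx = remainder (0 to 9)
    add dl, '0'
    dec rsi
    mov [rsi], dl                    ; digits are written right to left
    test rax, rax
    jnz .loop
    test r8, r8
    jns .write
    dec rsi
    mov byte [rsi], '-'              ; prepend the sign for negative input
.write:
    lea rdx, [out_buf + 24]
    sub rdx, rsi                     ; length = end of buffer - first character
    mov eax, 1                       ; sys_write
    mov edi, 1                       ; stdout
    syscall
    ret

_start:
    mov r13, [values]                ; min starts as the first element
    mov ecx, 1                       ; index of the next element to examine
.scan:
    cmp rcx, count
    jge .done
    mov rax, [values + rcx * 8]
    cmp rax, r13
    jge .next                        ; signed test: skip unless rax < r13
    mov r13, rax
.next:
    inc rcx
    jmp .scan
.done:
    mov rax, r13
    call print_signed                ; prints -89
    mov eax, 60                      ; sys_exit
    xor edi, edi
    syscall
```

With jge replaced by the unsigned jae, the same scan would settle on 17, because the negative elements would compare as enormous unsigned values.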
Why this is better
A single loop that updates the sum, minimum, and maximum together loads each element once, instead of once per statistic as separate scan passes would. More importantly, the jge/jle signed comparisons would silently give wrong results if replaced with their unsigned equivalents jae/jbe when the array mixes negative and non-negative values, because an unsigned comparison treats every negative value as a very large positive one. Choosing the right conditional jump for the value's signedness is not a style preference; it is a correctness requirement.
Note: The registers r12–r15, along with rbx, rbp, and rsp, are callee-saved in the System V AMD64 ABI: a routine called from C code must preserve them across calls. In a standalone assembly program with no C runtime the ABI is not binding, and print_int freely overwrites rbx, but keeping long-lived values in r12–r15 and short-lived temporaries in caller-saved registers (rax, rcx, rdx, rsi, rdi, r8–r11) builds the habit the ABI expects.
Summary
This tutorial built an integer statistics calculator in x86-64 assembly that computes sum, average, minimum, and maximum over a .data array, covering the core instructions for integer arithmetic and conditional control flow.
- Each general-purpose register (rax, rbx, rcx, rdx, rsi, rdi, r8–r15) has four overlapping width aliases: writing to the 32-bit alias (eax) zero-extends into the 64-bit register, but writing to the 16-bit or 8-bit alias leaves the upper bits unchanged.
- mov rax, imm loads a constant; mov rax, rbx copies a register; mov rax, [label] loads from memory; lea rax, [label] loads the address itself. In mov the square brackets dereference; lea computes the bracketed address without reading memory.
- add and sub update the flags register after every operation; use scaled-index addressing ([base + index * scale]) to walk arrays without computing byte offsets manually.
- cqo sign-extends rax into rdx:rax and must precede every 64-bit idiv; omitting it leaves rdx with stale data and causes a silently incorrect quotient or a divide exception.
- imul has three forms: the two-operand form imul dst, src and the three-operand form imul dst, src, imm are the safest for most uses because they do not implicitly overwrite rdx.
- cmp a, b subtracts b from a and sets ZF, SF, CF, and OF without storing the result; the subsequent conditional jump reads the appropriate flag combination: jl/jg for signed comparisons, jb/ja for unsigned.
- Initialise min and max to the first array element, not to zero or a large sentinel, so the algorithm is correct for arrays that contain only negative values.
- r12–r15 are callee-saved in the System V AMD64 ABI and are the right choice for values that must survive across subroutine calls.