
Registers, Arithmetic, and Control Flow in x86-64 Assembly

Dima, Jun 24, 2026

Introduction

Every arithmetic or control-flow construct in a high-level language (an addition, a comparison, a loop) compiles down to a handful of x86-64 instructions operating directly on CPU registers. Understanding those registers and the instructions that manipulate them is what separates a developer who can read disassembly from one who cannot. When a C compiler emits a tight inner loop, when a profiler reports that an arithmetic operation is unexpectedly slow, or when a debugger shows register values at a crash site, the developer who knows what rax, rdx, and the flags register contain at each moment can diagnose the problem directly. Everyone else has to experiment blindly.

The x86-64 register file is not a flat array of identically sized slots. Every general-purpose register has four overlapping views (64-bit, 32-bit, 16-bit, and 8-bit) that refer to the same underlying storage. Code that uses the wrong view introduces silent truncation bugs that are hard to find, because the value looks correct until it crosses a width boundary the programmer did not account for. Signed division requires an explicit sign-extension step that beginners often forget; the result is either a wrong quotient or a divide exception (#DE), which Linux delivers as SIGFPE and which terminates the process unless a signal handler is installed. Conditional branches depend on flags that the CPU sets automatically, and knowing which instruction sets which flag, and which flags a given jump instruction reads, is the prerequisite for writing any loop or comparison correctly.

This tutorial builds an integer statistics calculator in x86-64 assembly that computes the sum, average, minimum, and maximum of a fixed array of values. Every step introduces one cluster of concepts: the register hierarchy, the mov variants, signed arithmetic with add/sub/imul/idiv, the flags register, and conditional plus unconditional branching with cmp, jz/jne/jl/jg, and jmp.

Background

The x86-64 CPU has sixteen general-purpose registers. Each one has a name at four different widths:

rax (64-bit) / eax (32-bit) / ax (16-bit) / al (8-bit low, ah 8-bit high)
rbx / ebx / bx / bl
rcx / ecx / cx / cl
rdx / edx / dx / dl
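rsi, rdi, rsp, rbp follow the same pattern (rsi / esi / si / sil); r8–r15 use suffixes instead (r8 / r8d / r8w / r8b). Only rax, rbx, rcx, and rdx have a high-byte alias (ah, bh, ch, dh).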

Writing to eax (32-bit) zero-extends into rax — the upper 32 bits are cleared automatically. Writing to ax (16-bit) or al (8-bit) does not clear the upper bits; the rest of rax is preserved. This asymmetry is intentional and matters when mixing sizes.
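The asymmetry can be observed directly. The following is a minimal sketch, assuming Linux x86-64 and the nasm/ld toolchain used later in this tutorial; it is a standalone program (not part of the calculator), the bit pattern is an arbitrary test value, and the result is reported through the process exit status purely as an illustration device.

```nasm
; Sketch: an 8-bit write preserves the upper bits, a 32-bit write clears them.
section .text
    global _start

_start:
    mov rax, 0x1122334455667788
    mov al, 0xFF            ; only the low byte changes: rax = 0x11223344556677FF
    mov rbx, 0x1122334455667788
    mov ebx, 1              ; 32-bit write zero-extends: rbx = 0x0000000000000001

    shr rax, 56             ; keep the top byte: 0x11 = 17, so it survived the al write
    add rax, rbx            ; 17 + 1 = 18
    mov rdi, rax            ; exit status = 18
    mov eax, 60             ; sys_exit
    syscall
```

Assembled and linked into a hypothetical binary named demo, running `./demo; echo $?` should print 18: 17 from the preserved top byte of rax plus 1 from the zero-extended rbx.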

The flags register (rflags) holds single-bit results set automatically after arithmetic and comparison instructions:

ZF (Zero Flag): set when the result is zero
SF (Sign Flag): set when the result is negative (the sign bit is 1)
CF (Carry Flag): set on unsigned overflow or borrow
OF (Overflow Flag): set on signed overflow

cmp a, b computes a - b and sets flags without storing the result. Conditional jump instructions read specific flags: jz jumps if ZF=1, jne jumps if ZF=0, jl jumps if SF≠OF (signed less than), jg jumps if ZF=0 and SF=OF (signed greater than).
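To see the difference between CF and OF concretely, the sketch below performs two 8-bit additions and captures the resulting flags. It assumes Linux x86-64 and uses the setcc instructions (seto, setc, setz), which this tutorial does not otherwise cover; they simply copy one flag into a byte register. The exit status is used only to make the captured flags visible.

```nasm
; Sketch: signed overflow sets OF, unsigned wrap-around sets CF (and here ZF).
section .text
    global _start

_start:
    mov al, 127
    add al, 1               ; 127 + 1 = 128 = -128 as signed byte: OF=1, CF=0
    seto bl                 ; bl = 1

    mov al, 255
    add al, 1               ; 255 + 1 wraps to 0: CF=1, ZF=1, OF=0
    setc cl                 ; cl = 1 (setcc does not modify flags)
    setz dl                 ; dl = 1

    movzx edi, bl           ; bit 0 = OF from the first add
    movzx ecx, cl
    movzx edx, dl
    lea edi, [rdi + rcx*2]  ; bit 1 = CF from the second add
    lea edi, [rdi + rdx*4]  ; bit 2 = ZF from the second add
    mov eax, 60             ; sys_exit with status 1 + 2 + 4 = 7
    syscall
```

An exit status of 7 means all three captured flags were set as described: the first addition overflowed only in the signed sense, the second only in the unsigned sense.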

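With a 64-bit operand, idiv divides the 128-bit value formed by rdx:rax by that operand, leaving the quotient in rax and the remainder in rdx; with a 32-bit operand it divides the 64-bit value in edx:eax. Before dividing, the upper half must hold the sign extension of the lower half: cqo sign-extends rax into rdx:rax, and cdq sign-extends eax into edx:eax. Forgetting this step leaves stale data in rdx, which yields either a wrong quotient or, when the quotient does not fit in the destination register, a divide exception.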
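The 32-bit pairing of cdq with idiv does not appear in the calculator itself, so here is a minimal standalone sketch of it, assuming Linux x86-64. The operands -100 and 7 are arbitrary, and the quotient is negated only so that it fits in an exit status.

```nasm
; Sketch: 32-bit signed division with cdq.
section .text
    global _start

_start:
    mov eax, -100
    cdq                     ; sign-extend eax into edx:eax (edx = 0xFFFFFFFF)
    mov ecx, 7
    idiv ecx                ; eax = -14 (truncates toward zero), edx = -2

    neg eax                 ; 14, so the result is visible as an exit status
    mov edi, eax
    mov eax, 60             ; sys_exit
    syscall
```

The expected exit status is 14. Note that idiv truncates toward zero, so the remainder takes the sign of the dividend: -100 = (-14 × 7) + (-2).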

Practical Scenario

A systems team is building a performance monitoring tool that samples CPU utilisation readings from hardware counters at fixed intervals and stores them as integer percentages in a shared memory segment. At the end of each measurement window, the tool must compute the sum, average, minimum, and maximum of the collected readings entirely in assembly — no C runtime, no library calls — because the tool runs in a context where the C runtime is not initialised and calling it would corrupt the thread state the monitoring system is trying to observe.

The team needs a reference implementation that demonstrates how to traverse an array declared in .data, perform signed integer arithmetic using all four arithmetic instructions, and select minimum and maximum values using conditional branches. The same register patterns and branching idioms appear verbatim in hot loops throughout the codebase, so getting them right in this canonical form matters for everything built on top.

The Problem

A first attempt iterates the array with hardcoded iteration logic that does not generalise to different array sizes and omits sign-extension before division.

Create a new file:

touch program.asm

Assemble and run using:

nasm -f elf64 program.asm -o program.o && ld -o program program.o && ./program

section .data
    values  dq 42, 17, 89, 55, 23, 76, 8, 61
    count   equ 8
    newline db 10

section .bss
    result_buf resb 20

section .text
    global _start

_start:
    ; sum all values manually (hardcoded unroll — wrong approach)
    mov rax, [values + 0]
    add rax, [values + 8]
    add rax, [values + 16]
    add rax, [values + 24]
    add rax, [values + 32]
    add rax, [values + 40]
    add rax, [values + 48]
    add rax, [values + 56]

    ; divide by count — missing cdq/cqo, rdx may contain garbage
    mov rbx, count
    div rbx        ; WRONG: unsigned div on a possibly signed dividend; rdx was never initialised

    ; print rax as decimal
    lea rdi, [result_buf + 19]
    mov rcx, 0
.conv:
    xor rdx, rdx
    mov rbx, 10
    div rbx
    add dl, '0'
    dec rdi
    mov [rdi], dl
    inc rcx
    test rax, rax
    jnz .conv

    mov rax, 1
    mov rsi, rdi
    mov rdx, rcx
    mov rdi, 1
    syscall

    mov rax, 1
    mov rdi, 1
    mov rsi, newline
    mov rdx, 1
    syscall

    mov rax, 60
    xor rdi, rdi
    syscall

Three problems are already present. First, the unrolled addition is hardcoded to exactly eight elements — changing the array requires rewriting the entire addition block. Second, the first div rbx uses div (unsigned division) and never clears rdx before executing; if any prior code leaves a non-zero value there, the division operates on a 128-bit dividend and produces a wildly incorrect quotient or a divide exception. Third, minimum and maximum are never computed at all. Each of these flaws needs its own fix.

Register Aliases: Choosing the Right Width

Each general-purpose register exposes its storage at four widths. Using the wrong width silently truncates values. The value 89 stored in rax and then read through al gives 89 — but the value 300 stored in rax and read through al gives 44 (300 mod 256) with no warning, no exception, and no flag set.
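The 300-to-44 truncation can be reproduced in a few instructions. This is a minimal sketch, assuming Linux x86-64; it is a standalone program that returns the byte read through al as its exit status.

```nasm
; Sketch: reading a 64-bit value through its 8-bit alias keeps only the low byte.
section .text
    global _start

_start:
    mov rax, 300            ; 300 = 0x12C
    movzx edi, al           ; al = 0x2C = 44; the 0x1 in bit 8 is silently dropped
    mov eax, 60             ; sys_exit
    syscall
```

Running the resulting binary and then `echo $?` should print 44, with no exception and no diagnostic of any kind.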

Replace the entire content of program.asm to add a dedicated print routine that uses dl as a single-byte ASCII holder while keeping the full 64-bit value in rax:

section .data
    values  dq 42, 17, 89, 55, 23, 76, 8, 61
    count   equ 8
    newline db 10

section .bss
    digit_buf resb 20

section .text
    global _start

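; print_int: prints the unsigned integer in rax as decimal followed by newline
; clobbers: rax, rbx, rcx, rdx, rsi, rdi, r11 (syscall overwrites rcx and r11)
print_int:
    lea rsi, [digit_buf + 20]   ; one past the end; a 64-bit value needs up to 20 digits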
    mov rcx, 0
.loop:
    xor rdx, rdx
    mov rbx, 10
    div rbx           ; rax = quotient, rdx = remainder
    add dl, '0'       ; dl is the low 8 bits of rdx — one ASCII digit
    dec rsi
    mov [rsi], dl     ; store the ASCII byte
    inc rcx
    test rax, rax
    jnz .loop

    mov rax, 1
    mov rdi, 1
    ; rsi already points to first digit
    mov rdx, rcx
    syscall

    mov rax, 1
    mov rdi, 1
    lea rsi, [newline]
    mov rdx, 1
    syscall
    ret

_start:
    mov rax, 42       ; 42 fits in al without truncation
    call print_int

    mov rax, 371      ; 371 does NOT fit in a byte — must use rax, not al
    call print_int

    mov rax, 60
    xor rdi, rdi
    syscall

42
371

dl (the low byte of rdx) holds one ASCII digit at a time — the remainder of dividing by 10 is always 0–9, so one byte suffices. The full quotient stays in rax (64-bit) throughout. Using al to hold the full accumulated sum would have silently wrapped at 256.

Mixing register widths without intent is one of the most common sources of silent data corruption in assembly. Making the width selection explicit — dl for a single digit, rax for the full running total — means the code documents its invariants in the instructions themselves. The print_int subroutine also eliminates the repeated inline conversion block, making the remaining sections shorter and easier to reason about.

`mov` with Register, Immediate, and Memory Operands

mov has three operand forms that work differently. mov rax, 42 loads an immediate constant. mov rax, rbx copies a register. mov rax, [label] loads eight bytes from a memory address (for rax, a qword). The square brackets are the dereference operator — omitting them loads the address itself, not the value at that address.

Replace the entire content of program.asm to demonstrate all three forms while loading the first value from the array:

section .data
    values  dq 42, 17, 89, 55, 23, 76, 8, 61
    count   equ 8
    newline db 10

section .bss
    digit_buf resb 20

section .text
    global _start

print_int:
    lea rsi, [digit_buf + 20]   ; one past the end of the 20-byte buffer
    mov rcx, 0
.loop:
    xor rdx, rdx
    mov rbx, 10
    div rbx
    add dl, '0'
    dec rsi
    mov [rsi], dl
    inc rcx
    test rax, rax
    jnz .loop

    mov rax, 1
    mov rdi, 1
    mov rdx, rcx
    syscall

    mov rax, 1
    mov rdi, 1
    lea rsi, [newline]
    mov rdx, 1
    syscall
    ret

_start:
    ; print_int overwrites rax, rbx, rcx, and rdx, so the demo values live in r12-r15

    ; Form 1: immediate: load a constant directly into a register
    mov r12, 100

    ; Form 2: register-to-register copy
    mov r13, r12       ; r13 = 100

    ; Form 3: memory load: load the first element of values (42)
    mov r14, [values]  ; r14 = 42  (8 bytes, matching r14's qword size)

    ; lea loads an address, not the value at that address
    lea r15, [values]  ; r15 = address of values array, NOT 42

    ; print all four to confirm
    mov rax, r12
    call print_int     ; prints 100

    mov rax, r13
    call print_int     ; prints 100

    mov rax, r14
    call print_int     ; prints 42

    mov rax, r15
    call print_int     ; prints the address, a large number

    mov rax, 60
    xor rdi, rdi
    syscall

The fourth line of output is the address of the values array: a raw memory address, not the 42 stored there. lea r15, [values] loads the pointer; mov r15, [values] would load the value. For a non-position-independent static executable, which is what this ld invocation typically produces, the address is fixed at link time and is the same on every run.

The distinction between mov and lea, and between [label] and label, is where many assembly beginners make their first memory access error. Printing all four forms and seeing their outputs side by side makes the difference concrete. The demo values are held in r12–r15 because print_int overwrites rax, rbx, rcx, and rdx on every call; a value left in rbx or rcx would not survive until it was printed. The address printed for lea confirms that pointers are ordinary 64-bit values: a register of the same width holds the two-digit 42 and a multi-digit address alike.

Note: NASM requires that the memory operand width match the register width unless an explicit size override (qword, dword, word, byte) is used. mov rax, [values] loads a qword because rax is 64-bit. To load only the first byte, write movzx rax, byte [values] (zero-extend) or movsx rax, byte [values] (sign-extend).
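The difference between the two extending loads shows up as soon as the byte has its top bit set. The following minimal sketch assumes Linux x86-64; the label sample and its value 0xFF are chosen for illustration, and the exit status is again only a way to display the result.

```nasm
; Sketch: movzx treats the byte as unsigned, movsx treats it as signed.
section .data
    sample db 0xFF          ; hypothetical test byte: 255 unsigned, -1 signed

section .text
    global _start

_start:
    movzx rax, byte [sample]   ; zero-extend: rax = 255
    movsx rcx, byte [sample]   ; sign-extend: rcx = -1 (0xFFFFFFFFFFFFFFFF)
    add rax, rcx               ; 255 + (-1) = 254
    mov rdi, rax               ; exit status = 254
    mov eax, 60                ; sys_exit
    syscall
```

An exit status of 254 confirms that the same byte in memory produced two different 64-bit values depending on which extension was requested.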

`add` and `sub`: Summing an Array with a Loop

add dest, src adds src to dest and stores the result in dest. sub dest, src subtracts. Both update the flags register. A loop over an array uses a base register holding the start address, an offset computed from a counter multiplied by the element size, and a conditional jump that exits when the counter reaches the element count.

Replace the entire content of program.asm with a version that sums the array using a proper loop:

section .data
    values  dq 42, 17, 89, 55, 23, 76, 8, 61
    count   equ 8
    newline db 10

section .bss
    digit_buf resb 20

section .text
    global _start

print_int:
    lea rsi, [digit_buf + 20]   ; one past the end of the 20-byte buffer
    mov rcx, 0
.loop:
    xor rdx, rdx
    mov rbx, 10
    div rbx
    add dl, '0'
    dec rsi
    mov [rsi], dl
    inc rcx
    test rax, rax
    jnz .loop

    mov rax, 1
    mov rdi, 1
    mov rdx, rcx
    syscall

    mov rax, 1
    mov rdi, 1
    lea rsi, [newline]
    mov rdx, 1
    syscall
    ret

_start:
    xor rax, rax       ; sum accumulator = 0
    xor rcx, rcx       ; loop counter = 0

sum_loop:
    cmp rcx, count
    jge sum_done

    mov rbx, [values + rcx * 8]   ; load element at index rcx
    add rax, rbx                   ; accumulate into rax
    inc rcx
    jmp sum_loop

sum_done:
    call print_int     ; prints the sum

    mov rax, 60
    xor rdi, rdi
    syscall

[values + rcx * 8] uses x86 scaled-index addressing: rcx holds the element index, and the scale factor 8 converts it to a byte offset because each dq element occupies 8 bytes. add rax, rbx accumulates, so rax grows by the loaded value on each iteration. cmp rcx, count followed by jge sum_done exits the loop once the counter reaches the array length.
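Because sub updates the flags just as cmp does, a loop can also count down and branch on the zero flag without a separate comparison. The sketch below is one possible variant, not the form used in the rest of this tutorial; it assumes Linux x86-64 and a non-empty array, and the names sample and sample_count and the three values are invented so that the sum fits in an exit status.

```nasm
; Sketch: count-down loop that branches on the ZF produced by sub.
section .data
    sample       dq 5, 10, 20               ; hypothetical small array
    sample_count equ ($ - sample) / 8       ; 3 elements

section .text
    global _start

_start:
    xor eax, eax                    ; sum = 0
    mov rcx, sample_count           ; remaining elements (must be > 0)
.next:
    add rax, [sample + rcx*8 - 8]   ; add element at index rcx - 1
    sub rcx, 1                      ; sets ZF when the counter reaches zero
    jnz .next                       ; loop while ZF = 0

    mov rdi, rax                    ; exit status = 20 + 10 + 5 = 35
    mov eax, 60                     ; sys_exit
    syscall
```

The elements are visited from last to first, which does not matter for a sum. The expected exit status is 35.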

The unrolled version required one add instruction per element and silently broke whenever the array size changed. The loop version handles any count stored in the count constant — change the array declaration and the count constant, and the loop adapts automatically. This is the pattern that compilers emit for array traversal, and recognising it in disassembly is a prerequisite for performance analysis.
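Keeping the array and the count constant in sync by hand is itself a small maintenance risk. One common NASM idiom, shown here as a sketch rather than as part of the tutorial's listing, derives the count from the assembler's current-position symbol `$`. It only works when the equ line immediately follows the array definition.

```nasm
; Sketch: let the assembler compute the element count.
section .data
    values  dq 42, 17, 89, 55, 23, 76, 8, 61
    count   equ ($ - values) / 8    ; bytes since 'values' divided by qword size

section .text
    global _start

_start:
    mov edi, count                  ; exit status = number of elements
    mov eax, 60                     ; sys_exit
    syscall
```

With the eight-element array above the exit status is 8; adding or removing an element changes count automatically at assembly time.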

`imul` and `idiv` with `cqo`: Signed Multiply and Divide

imul reg, reg multiplies two signed integers and stores the low 64 bits of the product in the destination register. idiv reg divides the signed 128-bit value rdx:rax by the operand, storing the quotient in rax and the remainder in rdx. The critical prerequisite: before idiv, execute cqo to sign-extend rax into rdx:rax. Without it, rdx contains whatever the previous instruction left there, and the division operates on a garbage 128-bit dividend.

Replace the entire content of program.asm to compute the average using cqo + idiv:

section .data
    values  dq 42, 17, 89, 55, 23, 76, 8, 61
    count   equ 8
    newline db 10

section .bss
    digit_buf resb 20

section .text
    global _start

print_int:
    lea rsi, [digit_buf + 20]   ; one past the end of the 20-byte buffer
    mov rcx, 0
.loop:
    xor rdx, rdx
    mov rbx, 10
    div rbx
    add dl, '0'
    dec rsi
    mov [rsi], dl
    inc rcx
    test rax, rax
    jnz .loop

    mov rax, 1
    mov rdi, 1
    mov rdx, rcx
    syscall

    mov rax, 1
    mov rdi, 1
    lea rsi, [newline]
    mov rdx, 1
    syscall
    ret

_start:
    ; --- compute sum ---
    xor rax, rax
    xor rcx, rcx
sum_loop:
    cmp rcx, count
    jge sum_done
    mov rbx, [values + rcx * 8]
    add rax, rbx
    inc rcx
    jmp sum_loop
sum_done:
    mov r12, rax       ; save sum in r12 (callee-saved, not clobbered by print_int)

    ; --- compute average: sum / count ---
    mov rax, r12
    cqo                ; sign-extend rax into rdx:rax (sets rdx = 0 for positive rax)
    mov rbx, count
    idiv rbx           ; rax = quotient (average), rdx = remainder

    call print_int     ; prints 46

    ; --- demonstrate imul: multiply average by 2 ---
    mov rax, 46
    imul rax, rax, 2   ; three-operand form: rax = rax * 2
    call print_int     ; prints 92

    mov rax, 60
    xor rdi, rdi
    syscall

46
92

cqo copies the sign bit of rax into every bit of rdx, making rdx:rax a properly sign-extended 128-bit representation of the signed integer in rax. For a positive sum like 371, cqo sets rdx to zero — but for a negative sum it would set rdx to 0xFFFFFFFFFFFFFFFF, which is the correct two's-complement sign extension. imul rax, rax, 2 uses the three-operand form where the third argument is an immediate multiplier stored directly in the instruction encoding.

Using div (unsigned) on a signed value and skipping cqo before idiv are two common arithmetic bugs in x86-64 assembly. Both often go unnoticed because the values are non-negative and small enough that rdx happens to be zero, and both fail badly otherwise. Making cqo a mandatory step before every 64-bit idiv is the correct discipline, and using imul rather than mul for signed multiplication prevents sign interpretation errors in the upper half of a full-width product.
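The silent failure mode is easy to reproduce with a negative dividend. The following sketch, assuming Linux x86-64, divides -7 by 2 twice: once with cqo, and once with rdx merely zeroed, which is the correct preparation for unsigned div but not for idiv. The operands are arbitrary illustration values.

```nasm
; Sketch: zeroing rdx instead of sign-extending gives a wrong signed quotient.
section .text
    global _start

_start:
    mov rcx, 2

    mov rax, -7
    cqo                     ; rdx = all ones: rdx:rax is -7 as a 128-bit value
    idiv rcx                ; rax = -3, rdx = -1
    mov r8, rax             ; keep the correct quotient

    mov rax, -7
    xor edx, edx            ; WRONG for signed input: rdx:rax is now 2^64 - 7
    idiv rcx                ; rax = 0x7FFFFFFFFFFFFFFC, no exception raised

    xor edi, edi            ; clear the status before cmp (xor changes flags)
    cmp rax, r8
    setne dil               ; 1 if the two quotients differ
    mov eax, 60             ; sys_exit
    syscall
```

The expected exit status is 1: the second division completes without any fault, yet its quotient is a huge positive number rather than -3.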

Note: imul has three forms: one-operand (imul rbx multiplies rax by rbx and writes the full 128-bit product to rdx:rax), two-operand (imul rax, rbx computes rax *= rbx, truncated to 64 bits), and three-operand (imul rax, rbx, 5 computes rax = rbx * 5, also truncated). The two- and three-operand forms are the ones used most in practice because the one-operand form clobbers rdx.
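The three forms can be chained in a short standalone program. This is a sketch assuming Linux x86-64, with small arbitrary operands so that the final product fits in an exit status.

```nasm
; Sketch: one-, two-, and three-operand imul.
section .text
    global _start

_start:
    mov rax, 3
    mov rbx, 4
    imul rbx                ; one-operand: rdx:rax = rax * rbx = 12 (rdx overwritten with 0)

    mov rcx, 5
    imul rcx, rax           ; two-operand: rcx = rcx * rax = 60

    imul rdi, rcx, 2        ; three-operand: rdi = rcx * 2 = 120
    mov eax, 60             ; sys_exit with status 120
    syscall
```

The expected exit status is 120. Only the first multiplication touches rdx; the other two leave it alone.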

The Flags Register and Conditional Jumps: Finding Min and Max

cmp a, b subtracts b from a and discards the result, but sets ZF, SF, OF, and CF exactly as if the subtraction had been stored. Conditional jump instructions read specific flag combinations: jl (jump if less, signed) jumps when SF≠OF, jg (jump if greater, signed) jumps when ZF=0 and SF=OF, jz jumps when ZF=1, jne jumps when ZF=0. Chaining cmp with the right conditional jump implements every comparison-based selection.

Replace the entire content of program.asm with the final version that computes sum, average, minimum, and maximum:

section .data
    values  dq 42, 17, 89, 55, 23, 76, 8, 61
    count   equ 8
    newline db 10
    msg_sum db "sum=", 0
    msg_avg db "avg=", 0
    msg_min db "min=", 0
    msg_max db "max=", 0

section .bss
    digit_buf resb 20

section .text
    global _start

print_label:
    ; rsi = pointer to null-terminated label string
    ; find length
    xor rcx, rcx
.scan:
    cmp byte [rsi + rcx], 0
    jz .done_scan
    inc rcx
    jmp .scan
.done_scan:
    mov rax, 1
    mov rdi, 1
    mov rdx, rcx
    syscall
    ret

print_int:
    lea rsi, [digit_buf + 20]   ; one past the end of the 20-byte buffer
    mov rcx, 0
.loop:
    xor rdx, rdx
    mov rbx, 10
    div rbx
    add dl, '0'
    dec rsi
    mov [rsi], dl
    inc rcx
    test rax, rax
    jnz .loop

    mov rax, 1
    mov rdi, 1
    mov rdx, rcx
    syscall

    mov rax, 1
    mov rdi, 1
    lea rsi, [newline]
    mov rdx, 1
    syscall
    ret

_start:
    ; initialise: load first element as starting min and max
    mov r13, [values]  ; r13 = min
    mov r14, [values]  ; r14 = max
    xor r12, r12       ; r12 = sum
    xor rcx, rcx       ; loop counter

stats_loop:
    cmp rcx, count
    jge stats_done

    mov rax, [values + rcx * 8]

    ; accumulate sum
    add r12, rax

    ; update min: if rax < r13, then r13 = rax
    cmp rax, r13
    jge not_min
    mov r13, rax
not_min:

    ; update max: if rax > r14, then r14 = rax
    cmp rax, r14
    jle not_max
    mov r14, rax
not_max:

    inc rcx
    jmp stats_loop

stats_done:
    ; compute average
    mov rax, r12
    cqo
    mov rbx, count
    idiv rbx
    mov r15, rax       ; r15 = average

    ; print sum
    lea rsi, [msg_sum]
    call print_label
    mov rax, r12
    call print_int

    ; print average
    lea rsi, [msg_avg]
    call print_label
    mov rax, r15
    call print_int

    ; print min
    lea rsi, [msg_min]
    call print_label
    mov rax, r13
    call print_int

    ; print max
    lea rsi, [msg_max]
    call print_label
    mov rax, r14
    call print_int

    mov rax, 60
    xor rdi, rdi
    syscall

sum=371
avg=46
min=8
max=89

The min/max selection uses jge and jle — signed comparison jumps. The CPU sets SF and OF when executing cmp rax, r13; jge reads SF and OF together (jumps when SF=OF) rather than reading CF the way unsigned jae would. Initialising both r13 and r14 to the first element rather than to 0 or a sentinel value means the algorithm is correct even if all elements are negative.

A single loop that updates the sum, minimum, and maximum together loads each element once, where a separate pass per statistic would load it three times. More importantly, the jge/jle signed comparisons would silently give wrong results if replaced with their unsigned equivalents jae/jbe when the array contains negative values. Choosing the right conditional jump for the value's signedness is not a style preference; it is a correctness requirement.
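The effect of picking the wrong family of jumps can be shown with a single pair of values. The sketch below assumes Linux x86-64 and compares -5 with 3 twice, once with the signed jge and once with the unsigned jae; the values and label names are illustrative only.

```nasm
; Sketch: the same cmp, interpreted by a signed and an unsigned jump.
section .text
    global _start

_start:
    mov rax, -5
    mov rbx, 3
    xor edi, edi              ; result bits, returned as the exit status

    cmp rax, rbx
    jge .not_signed_less      ; signed view: -5 < 3, so this jump is NOT taken
    or  edi, 1                ; bit 0: "less" according to the signed jump
.not_signed_less:

    cmp rax, rbx
    jae .not_unsigned_below   ; unsigned view: 0xFFFFFFFFFFFFFFFB >= 3, jump IS taken
    or  edi, 2                ; bit 1: "below" according to the unsigned jump
.not_unsigned_below:

    mov eax, 60               ; sys_exit
    syscall
```

The expected exit status is 1: only the signed comparison recognises -5 as the smaller value. A min/max loop built on jae/jbe would therefore treat every negative reading as larger than every positive one.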

Note: The registers rbx, rbp, and r12–r15 are callee-saved in the System V AMD64 ABI: a routine called from C code must preserve them across calls. In a standalone assembly program with no C runtime this restriction does not apply, which is why print_int can overwrite rbx freely. Keeping the long-lived statistics in r12–r15 still builds the habit of holding long-lived values in callee-saved registers and short-lived temporaries in caller-saved ones (rax, rcx, rdx, rsi, rdi, r8–r11).
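A routine that needs a callee-saved register as scratch space can honour the convention by saving it on entry and restoring it before returning. The following is a minimal sketch assuming Linux x86-64; double_it is a hypothetical helper invented for this illustration and is not part of the calculator.

```nasm
; Sketch: preserving a callee-saved register with push/pop.
section .text
    global _start

; double_it (hypothetical): returns rdi * 2 in rax, preserves rbx
double_it:
    push rbx                ; save the caller's rbx
    mov rbx, rdi
    add rbx, rbx            ; use rbx as scratch
    mov rax, rbx
    pop rbx                 ; restore the caller's rbx
    ret

_start:
    mov rbx, 7              ; long-lived value in a callee-saved register
    mov rdi, 20
    call double_it          ; rax = 40
    add rax, rbx            ; 40 + 7 = 47, so rbx survived the call
    mov rdi, rax
    mov eax, 60             ; sys_exit
    syscall
```

The expected exit status is 47. Without the push/pop pair, rbx would come back as 40 and the status would be 80.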

Summary

This tutorial built an integer statistics calculator in x86-64 assembly that computes sum, average, minimum, and maximum over a .data array, covering the full instruction set for arithmetic and conditional control flow.

Each general-purpose register (rax, rbx, rcx, rdx, rsi, rdi, r8–r15) has four overlapping width aliases: writing to the 32-bit alias (eax) zero-extends into the 64-bit register, but writing to the 16-bit or 8-bit alias leaves the upper bits unchanged.
mov rax, imm loads a constant; mov rax, rbx copies a register; mov rax, [label] loads from memory; lea rax, [label] loads the address itself — the square brackets are the dereference operator.
add and sub update the flags register after every operation; use scaled-index addressing ([base + index * scale]) to walk arrays without computing byte offsets manually.
cqo sign-extends rax into rdx:rax and must precede every 64-bit idiv; omitting it leaves rdx with stale data, which can produce a silently incorrect quotient or a divide exception.
imul has three forms: the two- and three-operand forms (imul dst, src and imul dst, src, imm) are the usual choice because they do not implicitly clobber rdx.
cmp a, b subtracts b from a and sets ZF, SF, CF, and OF without storing the result; the subsequent conditional jump reads the appropriate flag combination: jl/jg for signed comparisons, jb/ja for unsigned.
Initialise min and max to the first array element, not to zero or a large sentinel, so the algorithm is correct for arrays that contain only negative values.
r12–r15 are callee-saved in the System V AMD64 ABI and are the right choice for values that must survive across subroutine calls.