Interested in Implementation of Big Integers (u256, u1024, u3141, etc.)

I wanted to write a regex crossword solver for fun in Zig, and so I’ve been doing all sorts of wacky stuff with like using a u256 to represent the set of candidates for a particular character.

So, I was curious about how this is actually implemented. For example, if I wrote:

const std = @import("std");

pub noinline fn foobarbaz() void {
    const x: u1024 = 1 << 1021;
    const y: u1024 = 1 << 1022;
    const z = x + y;
    std.debug.print("{}\n", .{z});
}

pub fn main() void {
    foobarbaz();
}

What would the assembly look like? I am not an assembly pro, but I would guess that we would “somehow allocate 1024 bits on the stack for x,y,z” and “somehow add them” but the “somehow add them” is the part I want to actually try and understand!

So with some help from the Zig discord I built my file like this:

zig build-obj main.zig -fstrip -O ReleaseSmall -femit-asm=main.a

and when I look at the assembly I see

main.foobarbaz:
	.cfi_startproc
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	sub	rsp, 2024
	.cfi_offset rbx, -56
	.cfi_offset r12, -48
	.cfi_offset r13, -40
	.cfi_offset r14, -32
	.cfi_offset r15, -24
	lea	rbx, [rbp - 960]
	lea	rsi, [rbp - 1024]
	push	64
	pop	rdx
	mov	rdi, rbx
	call	debug.lockStderr
	mov	r15, qword ptr [rbx]
	movabs	rax, 6917529027641081856
	mov	r10d, 1025
	xor	edi, edi
	mov	qword ptr [rbp - 144], 0
	mov	qword ptr [rbp - 200], 0
	mov	qword ptr [rbp - 176], 0
    ...omitting the rest

now, I look at this and the only part I recognize is push rbp and move rbp, rsp :stuck_out_tongue:
I don’t know if staring at this assembly and picking it apart is going to be useful, so I wanted to ask:

1 Like

Actually, I had an idea about the 6917529027641081856.

Python 3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print(f"{6917529027641081856:b}")
110000000000000000000000000000000000000000000000000000000000000
>>> 

So perhaps it’s somehow taking that huge number and then trying to repeatedly multiply it by 2 until it gets $2^1021 + 2^1022$?

Because x y and z are all statically known constants I would expect that the addition happens as a part of the compilation process and that the resulting assembly just encodes the result in some way that llvm finds convenient.

For the addition to happen at runtime I think you need to have runtime inputs/variables.

4 Likes

You’re absolutely right! (for all intents and purposes this is a joke please don’t ban me)

Anyways, changing the program to this:

const std = @import("std");

pub noinline fn foobarbaz(x: u1024, y: u1024) u1024 {
    return x + y;
}

pub fn main(process: std.process.Init) !void {
    var stdin_buffer: [1024]u8 = undefined;
    var stdin_reader = std.Io.File.stdin().reader(process.io, &stdin_buffer);

    var in = try stdin_reader.interface.takeDelimiterExclusive('\n');
    stdin_reader.interface.toss(1);
    const x = try std.fmt.parseInt(u1024, in, 10);
    in = try stdin_reader.interface.takeDelimiterExclusive('\n');
    stdin_reader.interface.toss(1);
    const y = try std.fmt.parseInt(u1024, in, 10);

    std.debug.print("{}\n", .{foobarbaz(x, y)});
}

(reading from stdin ripped straight from Right way to read user input from stdin in zig 0.15.2 )
and looking at the assembly I see:

main.foobarbaz:
	.cfi_startproc
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	add	rsi, qword ptr [rbp + 104]
	mov	rax, rdi
	adc	rdx, qword ptr [rbp + 112]
	adc	rcx, qword ptr [rbp + 120]
	adc	r8, qword ptr [rbp + 128]
	mov	qword ptr [rdi + 16], rcx
	mov	rcx, qword ptr [rbp + 144]
	mov	qword ptr [rdi + 8], rdx
	mov	qword ptr [rdi], rsi
	mov	rsi, qword ptr [rbp + 152]
	mov	rdx, qword ptr [rbp + 160]
	adc	r9, qword ptr [rbp + 136]
	mov	qword ptr [rdi + 24], r8
	adc	rcx, qword ptr [rbp + 16]
	mov	qword ptr [rdi + 32], r9
	adc	rsi, qword ptr [rbp + 24]
	mov	qword ptr [rdi + 40], rcx
	adc	rdx, qword ptr [rbp + 32]
	mov	qword ptr [rdi + 48], rsi
	mov	rsi, qword ptr [rbp + 168]
	adc	rsi, qword ptr [rbp + 40]
	mov	qword ptr [rdi + 56], rdx
	mov	rdx, qword ptr [rbp + 176]
	adc	rdx, qword ptr [rbp + 48]
	mov	qword ptr [rdi + 64], rsi
	mov	rsi, qword ptr [rbp + 184]
	adc	rsi, qword ptr [rbp + 56]
	mov	qword ptr [rdi + 72], rdx
	mov	rdx, qword ptr [rbp + 192]
	adc	rdx, qword ptr [rbp + 64]
	mov	qword ptr [rdi + 80], rsi
	mov	rsi, qword ptr [rbp + 200]
	adc	rsi, qword ptr [rbp + 72]
	mov	qword ptr [rdi + 88], rdx
	mov	rdx, qword ptr [rbp + 208]
	adc	rdx, qword ptr [rbp + 80]
	mov	qword ptr [rdi + 96], rsi
	mov	rsi, qword ptr [rbp + 216]
	adc	rsi, qword ptr [rbp + 88]
	mov	qword ptr [rdi + 104], rdx
	mov	rdx, qword ptr [rbp + 224]
	adc	rdx, qword ptr [rbp + 96]
	mov	qword ptr [rdi + 112], rsi
	mov	qword ptr [rdi + 120], rdx
	pop	rbp
	.cfi_def_cfa rsp, 8
	ret

Which looks more approachable. I think I should go sleep now but I’ll try to understand what this does tomorrow. Thanks!

1 Like

It’s long addition, the same way we did it in grade school. Only instead of adding a digit to a digit, the CPU can add, say, a u64 to a u64. Other than that detail, “carry the one” works the same. The two-u64-add can consume a carry, and produce a carry. Hence the op is called “add with carry”.

The carry is stored as a temporary semi-hidden state, in the CPU flags. You clear it before you do your chained adds, so you don’t accidentally add 1 from whatever the last math happened to be. In this particular case the code gen accomplishes that by using add instead of adc on the first pair to be added.

The exact bit width it ends up chunking it out to is CPU architecture and build flag dependent. The carry is always just 1 bit tho

6 Likes

The plan is for 512 bit integers to use Kogge-Stone Vector Addition on AVX-512 targets. The concept easily extends to 1024 bit integers with a little overhead. See llvm-project#132601

6 Likes