zig build run works, but running the binary directly segfaults

When I run it with zig build run, the output is:
cost: 16269 ms
But when I run the binary directly, I get:
fish: Job 1, './zig-out/bin/test' terminated by signal SIGSEGV (Address boundary error)

const std = @import("std");

pub fn main() !void {
    const n: usize = 1024;
    var a: [n][n]f64 = undefined;
    var b: [n][n]f64 = undefined;
    var c: [n][n]f64 = undefined;

    var rng = std.rand.DefaultPrng.init(0);
    const random = rng.random();
    for (0..n) |i| {
        for (0..n) |j| {
            a[i][j] = random.float(f64);
            b[i][j] = random.float(f64);
            c[i][j] = 0;
        }
    }

    var now = try std.time.Timer.start();

    for (0..n) |i| {
        for (0..n) |j| {
            for (0..n) |k| {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    const duration = now.read();
    std.debug.print("cost: {d} ms\n", .{duration / std.time.ns_per_ms});
}

1 Like

If you reduce the size of n, do you get the same behavior?
3 * (1024 * 1024) * @sizeOf(f64) = 24 MiB is a lot of stack.

To be clear, that won’t say why zig build run works and the other doesn’t - I’m just trying to eliminate some possibilities here.
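For reference, a quick sketch just spelling out that arithmetic (nothing here beyond the sizes already mentioned above):

```zig
const std = @import("std");

// Back-of-envelope check: three 1024x1024 matrices of f64 need
// 3 * 1024 * 1024 * 8 = 25165824 bytes, i.e. 24 MiB, while the
// default main-thread stack on Linux is typically only 8 MiB.
pub fn main() void {
    const n: usize = 1024;
    const bytes = 3 * n * n * @sizeOf(f64);
    // prints: 25165824 bytes = 24 MiB
    std.debug.print("{d} bytes = {d} MiB\n", .{ bytes, bytes / (1024 * 1024) });
}
```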

1 Like

Yes, after I changed n to 500 there was no error, so it must be a stack overflow.

Just saw a similar issue here: How can you allocate big chunks using FixedBufferAllocator? - #6 by jjmerelo

It looks like stack overflows aren’t being reported as such, and the program either executes garbage instructions or segfaults. I can’t speak for what the core team is doing on this issue at the moment.

The real question is why zig build run doesn’t overflow, but you’d have to check the output of your matrix multiplication to see whether it misbehaves. One way would be to set every input value to 1 and check that each element of the product equals 1024: the sum over k of a[i][k] * b[k][j] with all entries equal to 1 is exactly n, i.e. 1024 here (or N in general for square matrices).

I’m suggesting this because I’d like to see whether the inner product behaves as expected; if it doesn’t, then we’d truly have a discrepancy.
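A minimal sketch of that check (hypothetical code, using a small n so the arrays still fit on the stack):

```zig
const std = @import("std");

pub fn main() void {
    const n: usize = 64; // small enough to keep the arrays on the stack
    var a: [n][n]f64 = undefined;
    var b: [n][n]f64 = undefined;
    for (0..n) |i| {
        for (0..n) |j| {
            a[i][j] = 1.0;
            b[i][j] = 1.0;
        }
    }
    // Every element of C = A * B is a sum of n terms 1.0 * 1.0,
    // so it must come out to exactly n.
    const expected: f64 = @floatFromInt(n);
    for (0..n) |i| {
        for (0..n) |j| {
            var sum: f64 = 0;
            for (0..n) |k| sum += a[i][k] * b[k][j];
            std.debug.assert(sum == expected);
        }
    }
    std.debug.print("all {d} elements equal {d}\n", .{ n * n, n });
}
```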

1 Like

I wonder whether an easy fix for this program could be to turn the stack variables into global mutable variables (put the data outside the function), I would imagine that the compiler reserves pre-allocated memory for those that can have arbitrary size. But I haven’t checked.
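A sketch of that idea (same caveat as above: I haven’t benchmarked it, but file-scope variables have static lifetime, so they don’t land on main’s stack frame):

```zig
const std = @import("std");

const n: usize = 1024;

// File-scope ("container-level") variables have static lifetime and are
// placed in the executable's data/bss segment, not on the stack.
var a: [n][n]f64 = undefined;
var b: [n][n]f64 = undefined;
var c: [n][n]f64 = undefined;

pub fn main() void {
    a[0][0] = 1.0;
    b[0][0] = 2.0;
    c[0][0] = a[0][0] * b[0][0];
    std.debug.print("c[0][0] = {d}\n", .{c[0][0]});
}
```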

1 Like

A different way, besides making them global variables, is to make them container variables.
Container variables have static lifetime.


const std = @import("std");

pub fn main() !void {
    const n: usize = 1024;
    const S = struct {
        var a: [n][n]f64 = undefined;
        var b: [n][n]f64 = undefined;
        var c: [n][n]f64 = undefined;
    };

    var rng = std.rand.DefaultPrng.init(0);
    const random = rng.random();
    for (0..n) |i| {
        for (0..n) |j| {
            S.a[i][j] = random.float(f64);
            S.b[i][j] = random.float(f64);
            S.c[i][j] = 0;
        }
    }
    ...
2 Likes

I just learned (from watching Andrew’s last stream) that a, b and c are still global variables, only accessible from within main().

5 Likes

I always use a heap allocator for sizes like this, because the compiler otherwise tries to place the variables on the stack, and depending on the hardware that can quickly lead to problems.
Here is an example of how I would do it:

const std = @import("std");

fn allocVector(allocator: std.mem.Allocator, n: usize) ![][]f64 {
    const vec = try allocator.alloc([]f64, n);
    for (vec) |*row| {
        row.* = try allocator.alloc(f64, n);
    }
    return vec;
}

pub fn main() !void {
    const n: usize = 1024;

    const allocator = std.heap.page_allocator;

    const a = try allocVector(allocator, n);
    defer {
        for (a) |row| allocator.free(row);
        allocator.free(a); // free each row before the outer slice
    }

    const b = try allocVector(allocator, n);
    defer {
        for (b) |row| allocator.free(row);
        allocator.free(b);
    }

    const c = try allocVector(allocator, n);
    defer {
        for (c) |row| allocator.free(row);
        allocator.free(c);
    }

    var rng = std.rand.DefaultPrng.init(0);
    const random = rng.random();
    for (0..n) |i| {
        for (0..n) |j| {
            a[i][j] = random.float(f64);
            b[i][j] = random.float(f64);
            c[i][j] = 0;
        }
    }

    var now = try std.time.Timer.start();

    for (0..n) |i| {
        for (0..n) |j| {
            for (0..n) |k| {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    const duration = now.read();
    std.debug.print("cost: {d} ms\n", .{duration / std.time.ns_per_ms});
}

If you use a FixedBufferAllocator backed by a stack buffer instead of the page allocator, you also get an error message telling you that the stack is too small:

...
    // const allocator = std.heap.page_allocator;
    var buffer: [n * n * n * 64]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buffer);
    const allocator = fba.allocator();
...
LLVM Emit Object... warning: bound_err.zig:12:0: stack frame size (68719477576) exceeds limit (4294967295) in function 'bound_err.main'
[1]    30990 segmentation fault  zig run bound_err.zig

1 Like

Global? I thought they are container-level?

https://ziglang.org/documentation/master/#Container-Level-Variables

Their access is local, but their lifetime is global.
If the function that declares the container is called multiple times, the container data remains the same.
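A small hypothetical example of that static lifetime: the counter inside the container keeps its value from one call to the next.

```zig
const std = @import("std");

fn nextId() u32 {
    const S = struct {
        var count: u32 = 0; // static lifetime: persists across calls
    };
    S.count += 1;
    return S.count;
}

pub fn main() void {
    // Each call sees the value left behind by the previous one.
    std.debug.print("{d} {d} {d}\n", .{ nextId(), nextId(), nextId() });
}
```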

1 Like

I also follow this implementation in other files. When I free the memory, I first traverse the outer array, free each sub-array, and finally free the outer array itself. Won’t there be problems if you free the outer array without first freeing the sub-arrays? I once forgot to free the sub-arrays, so the program kept allocating memory and eventually ran out.

We’re starting to get off topic here, but a better implementation is to allocate one flat array and then either index into it with arithmetic or divide it into per-row slices. If you don’t use an allocator that pools memory sensibly even for small row sizes (which page_allocator certainly won’t), you’re likely taking a cache miss every time you step from one row to the next.

If you do the contiguous multi-slice option, you can free the entire thing in one call…

var entire = A[0]; // get your first slice which points to first element
entire.len = N * N; // set length to entire length of matrix
allocator.free(entire);

I don’t like this option either, because you have to store the slices somewhere, which wastes memory and is another way to miss cache. If you’re going to do matrix work on the CPU, the lowest-overhead element access is A[i * N + j], and you can free the whole thing in one go. You can make a wrapper struct for this if you want multi-index access…

const Matrix = struct {
    M: usize,
    N: usize,
    data: []f64,

    pub inline fn get(self: Matrix, i: usize, j: usize) f64 {
        return self.data[i * self.N + j];
    }
    pub inline fn set(self: Matrix, i: usize, j: usize, v: f64) void {
        self.data[i * self.N + j] = v;
    }
    pub fn init(allocator: std.mem.Allocator, M: usize, N: usize) !Matrix {
        return Matrix {
            .M = M,
            .N = N,
            .data = try allocator.alloc(f64, M * N),
        };
    }
    pub fn deinit(self: Matrix, allocator: std.mem.Allocator) void {
        allocator.free(self.data);
    }
};

I’m partial to just doing the index math directly though.
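For completeness, a usage sketch (the struct is restated compactly here so the example is self-contained; the sizes and values are made up):

```zig
const std = @import("std");

// Compact restatement of the Matrix struct above, for a runnable example.
const Matrix = struct {
    M: usize,
    N: usize,
    data: []f64,

    pub fn init(allocator: std.mem.Allocator, M: usize, N: usize) !Matrix {
        return .{ .M = M, .N = N, .data = try allocator.alloc(f64, M * N) };
    }
    pub fn deinit(self: Matrix, allocator: std.mem.Allocator) void {
        allocator.free(self.data); // one allocation, one free
    }
    pub fn get(self: Matrix, i: usize, j: usize) f64 {
        return self.data[i * self.N + j];
    }
    pub fn set(self: Matrix, i: usize, j: usize, v: f64) void {
        self.data[i * self.N + j] = v;
    }
};

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    const m = try Matrix.init(allocator, 2, 3);
    defer m.deinit(allocator);

    m.set(0, 2, 42.0);
    std.debug.print("m[0][2] = {d}\n", .{m.get(0, 2)});
}
```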

At this point, we should probably open a new topic.

3 Likes

This approach is great :+1:, efficient and safer. I’ll use it to optimize my code.

1 Like