When I run the program with zig build run, the output is:
cost: 16269 ms
But when I run the binary directly, I get:
fish: Job 1, './zig-out/bin/test' terminated by signal SIGSEGV (Address boundary error)
const std = @import("std");

pub fn main() !void {
    const n: usize = 1024;
    var a: [n][n]f64 = undefined;
    var b: [n][n]f64 = undefined;
    var c: [n][n]f64 = undefined;
    var rng = std.rand.DefaultPrng.init(0);
    const random = rng.random();
    for (0..n) |i| {
        for (0..n) |j| {
            a[i][j] = random.float(f64);
            b[i][j] = random.float(f64);
            c[i][j] = 0;
        }
    }
    var now = try std.time.Timer.start();
    for (0..n) |i| {
        for (0..n) |j| {
            for (0..n) |k| {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    const duration = now.read();
    std.debug.print("cost: {d} ms\n", .{duration / std.time.ns_per_ms});
}
It looks like stack overflows aren’t being reported as such: the program either runs garbage instructions or segfaults. I can’t speak to what the core team is doing on this issue at the moment.
The real question is why zig build run doesn’t overflow, but you’d have to check the output of your matrix multiplication to see whether it misbehaves. One way would be to set every input value to 1 and check that each element of the product equals 1024. Each output element is the sum of 1024 products a[i][k] * b[k][j], and with every factor equal to 1 that sum is exactly 1024 (or N more generally for square N×N matrices).
I’m suggesting this because I’d like to see whether the inner product behaves as expected; if it doesn’t, we’d truly have a discrepancy.
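To make that check concrete, here’s a small sketch of the sanity test I mean (n is shrunk to 64 so the arrays comfortably fit on any stack; summing 1.0 sixty-four times is exact in f64, so exact comparison is safe):

```zig
const std = @import("std");

test "all-ones matrix product yields n in every cell" {
    const n: usize = 64; // small enough to live on the stack
    var a: [n][n]f64 = undefined;
    var b: [n][n]f64 = undefined;
    var c: [n][n]f64 = undefined;
    for (0..n) |i| {
        for (0..n) |j| {
            a[i][j] = 1;
            b[i][j] = 1;
            c[i][j] = 0;
        }
    }
    for (0..n) |i| {
        for (0..n) |j| {
            for (0..n) |k| {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    // Small integer sums are exact in f64, so exact equality is fine here.
    const nf: f64 = @floatFromInt(n);
    for (0..n) |i| {
        for (0..n) |j| {
            try std.testing.expectEqual(nf, c[i][j]);
        }
    }
}
```

If those assertions pass under zig build run but the direct binary crashes before printing, that points at the stack, not the arithmetic.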
I wonder whether an easy fix for this program would be to turn the stack variables into global mutable variables (put the data outside the function). I would imagine the compiler reserves pre-allocated memory for those, which can have arbitrary size. But I haven’t checked.
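Untested, but a minimal sketch of what I mean; only the declarations move out of main, since container-level variables are placed in the binary’s data/BSS segment rather than in a stack frame:

```zig
const std = @import("std");

const n: usize = 1024;

// Container-level ("global") variables live in the data/BSS segment,
// not in any stack frame, so their size is not limited by stack size.
var a: [n][n]f64 = undefined;
var b: [n][n]f64 = undefined;
var c: [n][n]f64 = undefined;

pub fn main() !void {
    var rng = std.rand.DefaultPrng.init(0);
    const random = rng.random();
    for (0..n) |i| {
        for (0..n) |j| {
            a[i][j] = random.float(f64);
            b[i][j] = random.float(f64);
            c[i][j] = 0;
        }
    }
    // ... same timing and multiplication loops as in the original ...
}
```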
I always use a heap allocator for sizes like these, since the compiler otherwise tries to place the variables on the stack, and depending on the hardware this can quickly lead to problems.
Here is an example of how I would do it:
const std = @import("std");

fn allocVector(allocator: std.mem.Allocator, n: usize) ![][]f64 {
    const vec = try allocator.alloc([]f64, n);
    for (vec) |*row| {
        row.* = try allocator.alloc(f64, n);
    }
    return vec;
}

pub fn main() !void {
    const n: usize = 1024;
    const allocator = std.heap.page_allocator;
    const a = try allocVector(allocator, n);
    defer allocator.free(a);
    const b = try allocVector(allocator, n);
    defer allocator.free(b);
    const c = try allocVector(allocator, n);
    defer allocator.free(c);
    var rng = std.rand.DefaultPrng.init(0);
    const random = rng.random();
    for (0..n) |i| {
        for (0..n) |j| {
            a[i][j] = random.float(f64);
            b[i][j] = random.float(f64);
            c[i][j] = 0;
        }
    }
    var now = try std.time.Timer.start();
    for (0..n) |i| {
        for (0..n) |j| {
            for (0..n) |k| {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    const duration = now.read();
    std.debug.print("cost: {d} ms\n", .{duration / std.time.ns_per_ms});
}
If you use a fixed-buffer allocator backed by a stack array instead of the heap allocator, you also get a warning telling you that the stack frame is too large:
...
// const allocator = std.heap.page_allocator;
var buffer: [n * n * n * 64]u8 = undefined;
var fba = std.heap.FixedBufferAllocator.init(&buffer);
const allocator = fba.allocator();
...
LLVM Emit Object... warning: bound_err.zig:12:0: stack frame size (68719477576) exceeds limit (4294967295) in function 'bound_err.main'
[1] 30990 segmentation fault zig run bound_err.zig
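If you do want a fixed-buffer setup that works at this size, one option (a sketch, not tested at n = 1024) is to back the FixedBufferAllocator with a heap-allocated buffer instead of a stack array:

```zig
const std = @import("std");

test "FixedBufferAllocator over a heap-allocated buffer" {
    const n: usize = 128; // kept small for the sketch
    // Rough capacity for one n x n matrix of f64 plus its row slices,
    // with slack for allocator alignment; the buffer itself lives on the heap.
    const cap = 2 * n * n * @sizeOf(f64);
    const buffer = try std.heap.page_allocator.alloc(u8, cap);
    defer std.heap.page_allocator.free(buffer);

    var fba = std.heap.FixedBufferAllocator.init(buffer);
    const allocator = fba.allocator();

    const rows = try allocator.alloc([]f64, n);
    for (rows) |*row| {
        row.* = try allocator.alloc(f64, n);
    }
    rows[n - 1][n - 1] = 1.0;
    try std.testing.expectEqual(@as(f64, 1.0), rows[n - 1][n - 1]);
    // Individual frees are mostly no-ops in a FixedBufferAllocator;
    // dropping the backing buffer releases everything at once.
}
```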
Their access is local, but their lifetime is global. If the function that declares the container is called multiple times, the container data remains the same.
I also follow this implementation in other files. When I free memory, I first traverse the large array, freeing each sub-array, and finally free the large array. Won’t there be problems if you free the large array without first freeing the sub-arrays? I forgot to free the sub-arrays once, so I kept allocating memory and eventually ran out.
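For what it’s worth, a freeVector helper (the name is mine, mirroring allocVector above) makes that order explicit, and std.testing.allocator fails a test run if anything is left allocated, so it catches exactly the leak you describe:

```zig
const std = @import("std");

// Freeing only the outer slice leaks every row: the allocator never sees
// those row pointers again. Free the rows first, then the slice of rows.
fn freeVector(allocator: std.mem.Allocator, vec: [][]f64) void {
    for (vec) |row| {
        allocator.free(row);
    }
    allocator.free(vec);
}

test "rows freed before the outer slice" {
    // std.testing.allocator reports any allocation still live at the end
    // of the test, so removing the row frees makes `zig test` fail.
    const allocator = std.testing.allocator;
    const n: usize = 4;
    const vec = try allocator.alloc([]f64, n);
    for (vec) |*row| {
        row.* = try allocator.alloc(f64, n);
    }
    freeVector(allocator, vec);
}
```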
We’re starting to get off topic here, but a better implementation is to allocate one flat array and then either index into it with arithmetic or split it into one slice per row. If you don’t use an allocator that pools small allocations properly (which page_allocator certainly won’t), you’re likely taking a cache miss every time you step to the next row.
If you do the contiguous multi-slice option, you can free the entire thing in one call…
var entire = A[0]; // get your first slice which points to first element
entire.len = N * N; // set length to entire length of matrix
allocator.free(entire);
I don’t like this option either, because you have to store the row slices somewhere, which wastes memory and is another way to miss cache. If you’re going to do matrix work on the CPU, the lowest-overhead element access is A[i * N + j], and you can free the whole thing in one go. You can make a wrapper struct around this if you want multi-index access…
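As a sketch of that wrapper idea (the names here are mine, not from any library): one contiguous allocation, the A[i * N + j] arithmetic hidden behind an at method, and a single free in deinit:

```zig
const std = @import("std");

// Hypothetical flat-matrix wrapper: one contiguous allocation,
// index math for element access, one free for the whole thing.
const Matrix = struct {
    data: []f64,
    n: usize,

    fn init(allocator: std.mem.Allocator, n: usize) !Matrix {
        const data = try allocator.alloc(f64, n * n);
        @memset(data, 0);
        return .{ .data = data, .n = n };
    }

    fn deinit(self: Matrix, allocator: std.mem.Allocator) void {
        allocator.free(self.data);
    }

    // Returns a pointer so the cell can be both read and written.
    fn at(self: Matrix, i: usize, j: usize) *f64 {
        return &self.data[i * self.n + j];
    }
};

test "flat matrix wrapper" {
    const allocator = std.testing.allocator;
    const m = try Matrix.init(allocator, 8);
    defer m.deinit(allocator);
    m.at(2, 3).* = 1.5;
    try std.testing.expectEqual(@as(f64, 1.5), m.at(2, 3).*);
}
```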