@chung-leong I’ll try that if we can’t find something else.
@LucasSantos91 Good point. The backend is fairly extensive so I’ll do my best to show what’s happening here succinctly. Let me know if there’s something else I’m missing.
Here’s the scheme: essentially, I’m generating a whole library of C wrappers for a C++/CUDA backend that then make their way into the Zig code. All of the functions are automatically tagged with `extern "C"`, compiled to a static library, and the function declarations that Zig sees are just `extern`. So far, this has been totally fine. Here’s an example of the generated declarations that get brought in via `@cImport`:
```c
// Top of generated declarations... Zig struggled with size_t,
// so len_t is just unsigned long long:
#if defined(__cplusplus)
#define EXTERN_C extern "C"
#else
#define EXTERN_C extern
#endif

EXTERN_C void launch_reduce_ij_i_r16(
    const void* src,
    void* dst,
    double alpha,
    len_t m,
    len_t n,
    StreamContext stream
);

EXTERN_C void launch_reduce_ij_i_r32(
    const void* src,
    void* dst,
    double alpha,
    len_t m,
    len_t n,
    StreamContext stream
);
// ...
```
These get grouped into function arrays where each function array is generated as follows:
```zig
// Top of kernels file:
const decls = @import("cimport.zig").C;

fn dispatch_array(tuple: anytype) [tuple.len]*const @TypeOf(tuple[0]) {
    return tuple;
}

pub const reduce_ij_i = dispatch_array(.{
    decls.launch_reduce_ij_i_r16,
    decls.launch_reduce_ij_i_r32,
    decls.launch_reduce_ij_i_r64,
});
// and so on...
```
These functions are then called by using runtime-type information at the call site. Here’s the offending example:
```zig
const key = core.dkey(y);
core.kernels.permutate_ij_ji[key](
    y.stream(), // <--- This works, but only in this position
    x.data_ptr(),
    y.data_ptr(),
    1.0, // alpha
    xs[0], xs[1],
    // y.stream() <--- The rest of the API is in this position
);
```
There are a few things I’m going to try. Instead of using the GCC compiler (which is technically an external discrepancy, as CUDA prefers Clang), I’m going to try using Clang instead.
This probably has nothing to do with it, but instead of using `*const foo`, I may just try `foo` to see if that has anything to do with it. Edit: those are runtime values apparently, so they need to remain as pointers.
Also, these are compiled with the NVCC compiler on top of that, which is NVIDIA’s CUDA compiler. I know that Clang can compile CUDA as well, so I may just switch over to that and see if it has better luck. NVCC dispatches to a host compiler to handle the C++ code and then compiles the device code to a CUBIN file (their binary format) or to PTX instructions for the devices from there.
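If I do try that switch, Clang drives CUDA compilation with flags along these lines (a sketch only; the CUDA path and the sm_80 arch are assumptions about my target, not taken from the build above):

```shell
# Compile a CUDA translation unit with Clang instead of NVCC.
# --cuda-path and the GPU arch below are target-specific assumptions.
clang++ -x cuda --cuda-path=/usr/local/cuda \
        --cuda-gpu-arch=sm_80 -O2 -c kernels.cu -o kernels.o

# Then archive into the static library that the Zig build links, as before.
ar rcs libkernels.a kernels.o
```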
That’s the setup.