Test passing with -Doptimize=Debug but crashes with -Doptimize=ReleaseSafe

Hi! I’ve been working on an interpreter for the past while, but I’ve recently hit a snag when doing a release build.

My test suite has been passing just fine on debug builds, with both the native and LLVM backends. However, when I ran the tests on the release build, it crashed. The reported error is “switch on corrupt value”, but it highlights try as the error point, which doesn’t make sense. Here’s an excerpt (full log here):

       thread 635153 panic: switch on corrupt value
       /tmp/zicl/src/Heap.zig:2538:28: 0x1142863 in duplicateSingle (zicl-test)
                           .str = try dest_heap.duplicateObjString(handle),
                                  ^

The exact commands:
Working: zig build test -Doptimize=Debug -Duse-llvm=true -Dtrace-mem=false -Dtest-filter="fn command"
Failed: zig build test -Doptimize=ReleaseSafe -Duse-llvm=true -Dtrace-mem=false -Dtest-filter="fn command"

Version info:
uname -r: 7.0.7-arch1-1
zig version: 0.16.0

I’m fairly new to systems programming, so I’m not quite sure how to go forwards with this. I’ve looked at it in lldb, but I only know basic introspection and that didn’t get me anywhere.

What are productive avenues to explore/concepts to learn about that would help me with debugging this?

One tip: try running your debug build with Valgrind. It might catch the issue earlier.

Another tip is that the debug information might have gotten messed up by optimizations, but the panic message is much more likely to be correct, so look for a switch nearby.

3 Likes

Thanks for the tip on looking at the surrounding area! I’ve been slowly whittling down the problematic area in question thanks to that tip, but I’m still feeling utterly lost.

I’ve been able to get it to switch between crashing on run, and running just fine by toggling these two lines:

pub noinline fn duplicateSingle(dest_heap: *Heap, handle: Handle) error{ OutOfMemory, MultiItemObject }!Object {
    // Makes it work
    // ioutil.debug("Magic sparkles: {}\n", .{handle.tag()});
    // Doesn't make it work
    ioutil.debug("Magic sparkles: {}\n", .{@intFromEnum(handle.tag())});

(This is still with running zig build test -Doptimize=ReleaseSafe -Duse-llvm=true -Dtrace-mem=false -Dtest-filter="fn command".)

When I run the code with @intFromEnum (broken version), the logs are mostly in sync, but the returned value from duplicateSingle was almost always undefined, as evidenced by all the following src item entries (I printed src item whenever the returned value was undefined, this line).

Working version (tag ids for those following along):

Magic sparkles: .reference
Magic sparkles: .cached_local_var
Single dup, src tag 13 dup tag 0
Magic sparkles: .reference
Magic sparkles: .string
Single dup, src tag 13 dup tag 7
Magic sparkles: .reference
Magic sparkles: .cached_local_var
Single dup, src tag 13 dup tag 0
...

Broken version:

Magic sparkles: 13
Magic sparkles: 14
Single dup, src tag 13 dup tag 21
src item: .{ .index = 257, .heap = 0, ._padding = 0 }, tag: .reference, value: *
Magic sparkles: 13
Magic sparkles: 7
Single dup, src tag 13 dup tag 7
Magic sparkles: 13
Magic sparkles: 14
Single dup, src tag 13 dup tag 21
...

I know if I invoked illegal behavior in my code, technically anything can happen, but this feels like something else.

EDIT: I forgot to put the branch where all of this came from, it’s at GitHub - smj-edison/zicl at strange-happenings · GitHub

Note that asking us to debug LLM output is against the AI policy, part 3. Can you narrow it down to a repro made by you?

2 Likes

Ah sorry, that’s my bad! I’ll make sure to remember that going forward.

I’m working on creating a minimal reproduction right now by slowing whittling away at my code, until I can find something small enough to rewrite, so I’ll get back to you this discussion once I’ve prepared that.

1 Like

TL;DR as far as I can tell, duplicateObjString is returning an Object, but it should be returning a StrOrPtr. Essentially, duplicateObjString is returning a value that was created with a u5 bitmask, but I can’t understand why it would be using a u5 bitmask, because StrOrPtr doesn’t even have a u5. The only place where I use u5 is for an object Tag, or for the allocation order, but duplicateObjString shouldn’t return either of those.

I’ve been able to get the code down to 2200 lines, but every time I try narrowing it more the crash disappears. The reproduction is now certified LLM free, as I only ever used LLMs for the auxiliary parts. All code used in this post is here.

Object handles are deterministic between runs, since the repro is single threaded and uses a custom allocator. Because of that, it’s possible to use watchpoints for specific object indexes. I’ve added a gdb_helpers.py script, which adds $tag_ptr(index), so a watchpoint can be set for suspect object tags. All debugging is being done in rr (and by extension, gdb).

Quick reminder of the layout of Object:

pub const Object = packed struct(u128) {
    pub const null_string: StrOrPtr = .{ .str = .{ .index = 0, .len = 0 } };

    pub const StrOrPtr = packed struct(u59) {
        str: packed struct {
            index: u32,
            len: u26,
        },
        padding: u1 = 0,
    };

    pub const Head = packed struct(u64) {
        str: StrOrPtr = Object.null_string,
        tag: Tag,
    };

    head: Head,
    body: Body, // A 64-bit union, where the value depends on the tag type.
}

The crash in question happens at object 17:

$ zig build test -Doptimize=ReleaseSafe
test
└─ run test zicl-test 1 pass, 1 crash (2 total)
error: 'objutil.test.fn command' terminated with signal ABRT with stderr:
       thread 287119 panic: switch on corrupt value
       /home/mason/dev/zig/zicl/src/Heap.zig:972:42: 0x108b577 in duplicateSingle (zicl-test)
                       error.MultiItemObject => return src.body.reference.reference(),
                                                ^
       /home/mason/dev/zig/zicl/src/Heap.zig:1049:55: 0x1087266 in duplicate (zicl-test)
                       new_item.* = dest_heap.duplicateSingle(handle) catch |e| switch (e) {
                                                             ^
       /home/mason/dev/zig/zicl/src/objutil.zig:469:35: 0x1088bad in createVariable (zicl-test)
           const put_result = try dictPut(variables, name, value);
                                         ^
       /home/mason/dev/zig/zicl/src/objutil.zig:500:31: 0x1088f5f in setVariable (zicl-test)
                   try createVariable(name, value);
                                     ^
       /usr/lib/zig/compiler/test_runner.zig:140:68: 0x1078cbd in mainServer (zicl-test)
                       const status: TestResults.Status = if (test_fn.func()) |v| s: {
                                                                          ^
       /usr/lib/zig/compiler/test_runner.zig:71:26: 0x107b458 in main (zicl-test)
               return mainServer(init) catch |err| panic("internal test runner failure: {t}", .{err});
                                ^
       /usr/lib/zig/std/start.zig:699:88: 0x1027fc3 in callMain (zicl-test)
           if (fn_info.params[0].type.? == std.process.Init.Minimal) return wrapMain(root.main(.{
                                                                                              ^
       /usr/lib/zig/std/start.zig:190:5: 0x1027a9d in _start (zicl-test)
           asm volatile (switch (native_arch) {
           ^
failed command: ./.zig-cache/o/15ea9ec21f702cf6c6520c66e5d9be20/zicl-test --cache-dir=./.zig-cache --seed=0x23de3581 --listen=-
...
$ rr record ./.zig-cache/o/15ea9ec21f702cf6c6520c66e5d9be20/zicl-test
...
$ rr replay
(rr) source gdb_helpers.py
(rr) c
Continuing.
thread 287380 panic: switch on corrupt value
...
(rr) tag 17
object[17] tag = 21 (unknown(21))  [tag byte @ 0x7fda747f9117]

Note that object 17 has a tag type of 21, which is not a valid tag type. It is however the bits 0b10101, or undefined shifted by 1. There are no parts of my code that set an object’s head to undefined. This is the only object with an invalid tag type (I checked), so it must be what the “switch on corrupt value” was referring to. Let’s see where it got set to 21:

(rr) watch -l *$tag_ptr(17)
Hardware watchpoint 1: -location *$tag_ptr(17)
(rr) rc
Continuing.

Program received signal SIGABRT, Aborted.
...
(rr) rc
Continuing.

Hardware watchpoint 1: -location *$tag_ptr(17)

Old value = 168 '\250'
New value = 0 '\000'
0x0000000001087c88 in objutil.setCollectionLength (new_len=4) at objutil.zig:141
141                     new_item.* = old_item.peek().*;
(rr) p old_item
$1 = {bits = 13}

Looks like object 17 inherited the broken values from object 13. Is all of 13 undefined?

(rr) p/x ((unsigned char (*)[16])objects_ptr_for_gdb)[13]
$19 = {0xc, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0xa8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}

No, just the tag value (tag value is upper five bits of the eighth element, containing 0xA8. 0xA8 >> 3 equals 0x15, e.g. 21). So where did 13 get an invalid tag type?

(rr) watch -l *$tag_ptr(13)
Hardware watchpoint 2: -location *$tag_ptr(13)
(rr) rc
Continuing.

Hardware watchpoint 2: -location *$tag_ptr(13)

Old value = 168 '\250'
New value = 0 '\000'
0x000000000108727e in Heap.duplicate (dest_heap=0x10a1048 <Heap.heaps>, src_handle=...) at Heap.zig:1049
1049                    new_item.* = dest_heap.duplicateSingle(handle) catch |e| switch (e) {

I went into layout split mode at this point.

 0x1087276 <objutil.dictPut+1510>        vmovaps -0x1b0(%rbp),%xmm0
>0x108727e <objutil.dictPut+1518>        vmovaps %xmm0,(%r14)

Looks like the value is returned from the previous function call?

(rr) p $rbp
$22 = (void *) 0x7fffb714c160
(rr) x/16xb $rbp - 0x1b0
0x7fffb714bfb0: 0x0c    0x00    0x00    0x00    0x03    0x00    0x00    0xa8
0x7fffb714bfb8: 0x08    0x00    0x00    0x00    0x00    0x00    0x00    0x00

There’s that 0xA8 again. Let’s see where it came from:

(rr) watch -l *(unsigned char *)($rbp - 0x1b0 + 7)
Hardware watchpoint 3: -location *(unsigned char *)($rbp - 0x1b0 + 7)
(rr) rc
Continuing.

Hardware watchpoint 3: -location *(unsigned char *)($rbp - 0x1b0 + 7)

Old value = 168 '\250'
New value = 0 '\000'
0x000000000108b381 in Heap.duplicateSingle (dest_heap=<optimized out>, handle=...) at Heap.zig:971
 0x108b378 <Heap.duplicateSingle+392>    vmovaps 0x80(%rsp),%xmm0
>0x108b381 <Heap.duplicateSingle+401>    vmovaps %xmm0,(%rbx)
(rr) watch -l *(unsigned char *)($rsp + 0x80 + 7)
Hardware watchpoint 4: -location *(unsigned char *)($rsp + 0x80 + 7)
(rr) rc
Continuing.

Hardware watchpoint 4: -location *(unsigned char *)($rsp + 0x80 + 7)

Old value = 168 '\250'
New value = 0 '\000'
 0x108b348 <Heap.duplicateSingle+344>    mov    0x50(%rsp),%rax
>0x108b34d <Heap.duplicateSingle+349>    mov    %rax,(%rbx)
(rr) watch -l *(unsigned char *)($rsp + 0x50 + 7)
Hardware watchpoint 5: -location *(unsigned char *)($rsp + 0x50 + 7)
(rr) rc
Continuing.

Hardware watchpoint 5: -location *(unsigned char *)($rsp + 0x50 + 7)

Old value = 168 '\250'
New value = 170 '\252'
0x000000000108b1a0 in Heap.duplicateObjString (dest_heap=<optimized out>, handle=...) at Heap.zig:935
 0x108b183 <Heap.duplicateObjString+419> movw   $0x0,0x8(%rbx)
 0x108b189 <Heap.duplicateObjString+425> movabs $0xf800000000000000,%rax
 0x108b193 <Heap.duplicateObjString+435> and    (%rbx),%rax
 0x108b196 <Heap.duplicateObjString+438> shl    $0x20,%r14
 0x108b19a <Heap.duplicateObjString+442> or     %rax,%r14
 0x108b19d <Heap.duplicateObjString+445> or     %r15,%r14
>0x108b1a0 <Heap.duplicateObjString+448> mov    %r14,(%rbx)
 0x108b1a3 <Heap.duplicateObjString+451> add    $0x20,%rsp
 0x108b1a7 <Heap.duplicateObjString+455> pop    %rbx
 0x108b1a8 <Heap.duplicateObjString+456> pop    %r12

Okay, we’re getting to the end here, since this is where we combine everything into %r14. What are its constituents?

(rr) p/x (unsigned long long)$rax
$48 = 0xa800000000000000

Wait, why is this a bitmask for a u5? We’re in duplicateObjString, which has the signature pub fn duplicateObjString(dest_heap: *Heap, handle: Handle) !Object.StrOrPtr. StrOrPtr is a u59, so why is our function returning a u5? StrOrPtr has a u26, a u32, and a u1, but not a u5, so why does the 0xf800000000000000 mask appear? This function isn’t inlined either, since right after these commands it pops all the registers and returns. Also, shouldn’t (%rbx) be initialized to zero? What is the point of the movw?

This is as far as I’ve gotten, and I’m feeling rather flummoxed.

EDIT: here’s build artifacts if anyone wants to look through the assembly or is struggling to reproduce: zicl/repro at dd6006d96a1881f42639d03a642fe9d21740c56e · smj-edison/zicl · GitHub. .Ltmp48622 is the start of the problematic section.

just for my understanding: your ai wrote code that you don’t understand and now you’re asking the community for help, to understand the ai code?

No, the u5 bitmask I was talking about is with the movabs instruction, the 0xF800000000000000. And like I mentioned, the reproduction code has no AI generated content, it’s all my code.

The point I was confused on is the function that contains the movabs instruction, duplicateObjString, returns an StrOrPtr (backed by a u59), according to its signature. The assembly emitted is building a u5 bitmask though, which doesn’t make sense. And that u5 bitmask is and’ed with (%rbx), and (%rbx) was never fully initialized. So either I’ve invoked some illegal behavior exposed by the optimization build, or there’s a miscompilation with returning a packed struct.

More oddities I’ve found:

rm -r crash-cache; zig test -OReleaseSafe src/root.zig --cache-dir crash-cache crashes, as expected.
rm -r crash-cache; zig test -OReleaseSafe src/root.zig -fstrip --cache-dir crash-cache works, so stripping debug info causes the issue to go away?
rm -r crash-cache; zig test -OReleaseSafe -fopt-bisect-limit=0 src/root.zig --cache-dir crash-cache causes an assertion failure inside of LLVM:

/usr/include/c++/15.2.1/bits/stl_vector.h:1263: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = LiveDebugValues::LocIdx; _Alloc = std::allocator<LiveDebugValues::LocIdx>; reference = LiveDebugValues::LocIdx&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.

I think I’ve narrowed down a reproduction, though it doesn’t crash:

pub const StrOrPtr = packed struct(u59) {
    index: u32 = 0,
    len: u26 = 0,
    _padding: u1 = 0,
};

pub noinline fn makeStrOrPtr(idx: u32, len: u26) !StrOrPtr {
    if (len > 0) {
        return .{ .index = idx, .len = len };
    }
    return .{};
}

export var global_result: u64 = 0;

test "repro" {
    const s = makeStrOrPtr(42, 5) catch unreachable;
    const widened: u64 = @as(u59, @bitCast(s));
    @atomicStore(u64, &global_result, widened, .seq_cst);
}

Run with zig test -OReleaseSafe -femit-llvm-ir=repro.ll -femit-asm=repro.s -fno-emit-bin repro.zig.

This generates the following IR:

; Function Attrs: mustprogress nofree noinline norecurse nosync nounwind willreturn memory(argmem: readwrite) uwtable
define internal fastcc void @repro_crash.makeStrOrPtr(ptr noalias nonnull captures(none) initializes((8, 10)) %0) unnamed_addr #0 align 1 !dbg !6133 {
Entry:
    #dbg_value(i32 42, !6147, !DIExpression(), !6148)
    #dbg_value(i26 5, !6149, !DIExpression(), !6148)
  %1 = getelementptr inbounds nuw i8, ptr %0, i64 8, !dbg !6150
  store i16 0, ptr %1, align 2, !dbg !6150
  %2 = load i64, ptr %0, align 8, !dbg !6150
  %3 = and i64 %2, -576460752303423488, !dbg !6150
  %4 = or disjoint i64 %3, 21474836522, !dbg !6150
  store i64 %4, ptr %0, align 8, !dbg !6150
  ret void, !dbg !6153
}

Note that %2 is a read from sret, but sret hasn’t been initialized yet. The asm looked strikingly similar to the crash I found as well:

repro_crash.makeStrOrPtr:
.Lfunc_begin0:
	.loc	74 7 0
	.cfi_startproc
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
.Ltmp0:
	.loc	74 9 9 prologue_end
	mov	word ptr [rdi + 8], 0
	movabs	rax, -576460752303423488
	and	rax, qword ptr [rdi]
	movabs	rcx, 21474836522
	or	rcx, rax
	mov	qword ptr [rdi], rcx
.Ltmp1:
	.loc	74 0 0 epilogue_begin is_stmt 0
	pop	rbp
	.cfi_def_cfa rsp, 8
	ret
.Ltmp2:
.Lfunc_end0:
	.size	repro_crash.makeStrOrPtr, .Lfunc_end0-repro_crash.makeStrOrPtr
	.cfi_endproc

	.p2align	4
	.type	start._start,@function

There’s the same and rax, qword ptr [rdi], which is reading from [rdi] before it’s initialized. I think the reason the crash has been so finicky is because it depended on a previous function filling the stack with undefined.