TL;DR as far as I can tell, duplicateObjString is returning an Object, but it should be returning a StrOrPtr. Essentially, duplicateObjString is returning a value that was created with a u5 bitmask, but I can’t understand why it would be using a u5 bitmask, because StrOrPtr doesn’t even have a u5. The only place where I use u5 is for an object Tag, or for the allocation order, but duplicateObjString shouldn’t return either of those.
I’ve been able to get the code down to 2200 lines, but every time I try narrowing it more the crash disappears. The reproduction is now certified LLM free, as I only ever used LLMs for the auxiliary parts. All code used in this post is here.
Object handles are deterministic between runs, since the repro is single threaded and uses a custom allocator. Because of that, it’s possible to use watchpoints for specific object indexes. I’ve added a gdb_helpers.py script, which adds $tag_ptr(index), so a watchpoint can be set for suspect object tags. All debugging is being done in rr (and by extension, gdb).
Quick reminder of the layout of Object:
pub const Object = packed struct(u128) {
pub const null_string: StrOrPtr = .{ .str = .{ .index = 0, .len = 0 } };
pub const StrOrPtr = packed struct(u59) {
str: packed struct {
index: u32,
len: u26,
},
padding: u1 = 0,
};
pub const Head = packed struct(u64) {
str: StrOrPtr = Object.null_string,
tag: Tag,
};
head: Head,
body: Body, // A 64-bit union, where the value depends on the tag type.
}
The crash in question happens at object 17:
$ zig build test -Doptimize=ReleaseSafe
test
└─ run test zicl-test 1 pass, 1 crash (2 total)
error: 'objutil.test.fn command' terminated with signal ABRT with stderr:
thread 287119 panic: switch on corrupt value
/home/mason/dev/zig/zicl/src/Heap.zig:972:42: 0x108b577 in duplicateSingle (zicl-test)
error.MultiItemObject => return src.body.reference.reference(),
^
/home/mason/dev/zig/zicl/src/Heap.zig:1049:55: 0x1087266 in duplicate (zicl-test)
new_item.* = dest_heap.duplicateSingle(handle) catch |e| switch (e) {
^
/home/mason/dev/zig/zicl/src/objutil.zig:469:35: 0x1088bad in createVariable (zicl-test)
const put_result = try dictPut(variables, name, value);
^
/home/mason/dev/zig/zicl/src/objutil.zig:500:31: 0x1088f5f in setVariable (zicl-test)
try createVariable(name, value);
^
/usr/lib/zig/compiler/test_runner.zig:140:68: 0x1078cbd in mainServer (zicl-test)
const status: TestResults.Status = if (test_fn.func()) |v| s: {
^
/usr/lib/zig/compiler/test_runner.zig:71:26: 0x107b458 in main (zicl-test)
return mainServer(init) catch |err| panic("internal test runner failure: {t}", .{err});
^
/usr/lib/zig/std/start.zig:699:88: 0x1027fc3 in callMain (zicl-test)
if (fn_info.params[0].type.? == std.process.Init.Minimal) return wrapMain(root.main(.{
^
/usr/lib/zig/std/start.zig:190:5: 0x1027a9d in _start (zicl-test)
asm volatile (switch (native_arch) {
^
failed command: ./.zig-cache/o/15ea9ec21f702cf6c6520c66e5d9be20/zicl-test --cache-dir=./.zig-cache --seed=0x23de3581 --listen=-
...
$ rr record ./.zig-cache/o/15ea9ec21f702cf6c6520c66e5d9be20/zicl-test
...
$ rr replay
(rr) source gdb_helpers.py
(rr) c
Continuing.
thread 287380 panic: switch on corrupt value
...
(rr) tag 17
object[17] tag = 21 (unknown(21)) [tag byte @ 0x7fda747f9117]
Note that object 17 has a tag type of 21, which is not a valid tag type. It is however the bits 0b10101, or undefined shifted by 1. There are no parts of my code that set an object’s head to undefined. This is the only object with an invalid tag type (I checked), so it must be what the “switch on corrupt value” was referring to. Let’s see where it got set to 21:
(rr) watch -l *$tag_ptr(17)
Hardware watchpoint 1: -location *$tag_ptr(17)
(rr) rc
Continuing.
Program received signal SIGABRT, Aborted.
...
(rr) rc
Continuing.
Hardware watchpoint 1: -location *$tag_ptr(17)
Old value = 168 '\250'
New value = 0 '\000'
0x0000000001087c88 in objutil.setCollectionLength (new_len=4) at objutil.zig:141
141 new_item.* = old_item.peek().*;
(rr) p old_item
$1 = {bits = 13}
Looks like object 17 inherited the broken values from object 13. Is all of 13 undefined?
(rr) p/x ((unsigned char (*)[16])objects_ptr_for_gdb)[13]
$19 = {0xc, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0xa8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
No, just the tag value (tag value is upper five bits of the eighth element, containing 0xA8. 0xA8 >> 3 equals 0x15, e.g. 21). So where did 13 get an invalid tag type?
(rr) watch -l *$tag_ptr(13)
Hardware watchpoint 2: -location *$tag_ptr(13)
(rr) rc
Continuing.
Hardware watchpoint 2: -location *$tag_ptr(13)
Old value = 168 '\250'
New value = 0 '\000'
0x000000000108727e in Heap.duplicate (dest_heap=0x10a1048 <Heap.heaps>, src_handle=...) at Heap.zig:1049
1049 new_item.* = dest_heap.duplicateSingle(handle) catch |e| switch (e) {
I went into layout split mode at this point.
0x1087276 <objutil.dictPut+1510> vmovaps -0x1b0(%rbp),%xmm0
>0x108727e <objutil.dictPut+1518> vmovaps %xmm0,(%r14)
Looks like the value is returned from the previous function call?
(rr) p $rbp
$22 = (void *) 0x7fffb714c160
(rr) x/16xb $rbp - 0x1b0
0x7fffb714bfb0: 0x0c 0x00 0x00 0x00 0x03 0x00 0x00 0xa8
0x7fffb714bfb8: 0x08 0x00 0x00 0x00 0x00 0x00 0x00 0x00
There’s that 0xA8 again. Let’s see where it came from:
(rr) watch -l *(unsigned char *)($rbp - 0x1b0 + 7)
Hardware watchpoint 3: -location *(unsigned char *)($rbp - 0x1b0 + 7)
(rr) rc
Continuing.
Hardware watchpoint 3: -location *(unsigned char *)($rbp - 0x1b0 + 7)
Old value = 168 '\250'
New value = 0 '\000'
0x000000000108b381 in Heap.duplicateSingle (dest_heap=<optimized out>, handle=...) at Heap.zig:971
0x108b378 <Heap.duplicateSingle+392> vmovaps 0x80(%rsp),%xmm0
>0x108b381 <Heap.duplicateSingle+401> vmovaps %xmm0,(%rbx)
(rr) watch -l *(unsigned char *)($rsp + 0x80 + 7)
Hardware watchpoint 4: -location *(unsigned char *)($rsp + 0x80 + 7)
(rr) rc
Continuing.
Hardware watchpoint 4: -location *(unsigned char *)($rsp + 0x80 + 7)
Old value = 168 '\250'
New value = 0 '\000'
0x108b348 <Heap.duplicateSingle+344> mov 0x50(%rsp),%rax
>0x108b34d <Heap.duplicateSingle+349> mov %rax,(%rbx)
(rr) watch -l *(unsigned char *)($rsp + 0x50 + 7)
Hardware watchpoint 5: -location *(unsigned char *)($rsp + 0x50 + 7)
(rr) rc
Continuing.
Hardware watchpoint 5: -location *(unsigned char *)($rsp + 0x50 + 7)
Old value = 168 '\250'
New value = 170 '\252'
0x000000000108b1a0 in Heap.duplicateObjString (dest_heap=<optimized out>, handle=...) at Heap.zig:935
0x108b183 <Heap.duplicateObjString+419> movw $0x0,0x8(%rbx)
0x108b189 <Heap.duplicateObjString+425> movabs $0xf800000000000000,%rax
0x108b193 <Heap.duplicateObjString+435> and (%rbx),%rax
0x108b196 <Heap.duplicateObjString+438> shl $0x20,%r14
0x108b19a <Heap.duplicateObjString+442> or %rax,%r14
0x108b19d <Heap.duplicateObjString+445> or %r15,%r14
>0x108b1a0 <Heap.duplicateObjString+448> mov %r14,(%rbx)
0x108b1a3 <Heap.duplicateObjString+451> add $0x20,%rsp
0x108b1a7 <Heap.duplicateObjString+455> pop %rbx
0x108b1a8 <Heap.duplicateObjString+456> pop %r12
Okay, we’re getting to the end here, since this is where we combine everything into %r14. What are its constituents?
(rr) p/x (unsigned long long)$rax
$48 = 0xa800000000000000
Wait, why is this a bitmask for a u5? We’re in duplicateObjString, which has the signature pub fn duplicateObjString(dest_heap: *Heap, handle: Handle) !Object.StrOrPtr. StrOrPtr is a u59, so why is our function returning a u5? StrOrPtr has a u26, a u32, and a u1, but not a u5, so why does the 0xf800000000000000 mask appear? This function isn’t inlined either, since right after these commands it pops all the registers and returns. Also, shouldn’t (%rbx) be initialized to zero? What is the point of the movw?
This is as far as I’ve gotten, and I’m feeling rather flummoxed.
EDIT: here’s build artifacts if anyone wants to look through the assembly or is struggling to reproduce: zicl/repro at dd6006d96a1881f42639d03a642fe9d21740c56e · smj-edison/zicl · GitHub. .Ltmp48622 is the start of the problematic section.