DWARF parsing and stack traces

Hi, posting here because this feels too big for a Zulip chat.

While trying to get stack traces in a pretty specific use case I stumbled upon a dynamic library that has a separate debuginfo file that contains some DW_UT_partial units.

As expected after reading the std/debug/Dwarf.zig code, the parsing fails because the parser gives up when encountering a unit that is not a DW_UT_compile:

fn scanAllFunctions(di: *Dwarf, gpa: Allocator, endian: Endian) ScanError!void {
    var fr: Reader = .fixed(di.section(.debug_info).?);
    var this_unit_offset: u64 = 0;

    while (this_unit_offset < fr.buffer.len) {
        fr.seek = @intCast(this_unit_offset);

        const unit_header = try readUnitHeader(&fr, endian);
        if (unit_header.unit_length == 0) return;
        const next_offset = unit_header.header_length + unit_header.unit_length;

        const version = try fr.takeInt(u16, endian);
        if (version < 2 or version > 5) return bad();

        var address_size: u8 = undefined;
        var debug_abbrev_offset: u64 = undefined;
        if (version >= 5) {
            const unit_type = try fr.takeByte();
            if (unit_type != DW.UT.compile) return bad(); // <=== here
            address_size = try fr.takeByte();
            debug_abbrev_offset = try readFormatSizedInt(&fr, unit_header.format, endian);
        } else {
            debug_abbrev_offset = try readFormatSizedInt(&fr, unit_header.format, endian);
            address_size = try fr.takeByte();
        }

and so I get stack traces like this one:

info: loading system libc...
info: testing libc printf segfault...
Segmentation fault at address 0x8
???:?:?: 0x7f0f7fb5155d in ??? (/lib64/libc.so.6)
???:?:?: 0x7f0f7fa45697 in ??? (/lib64/libc.so.6)
???:?:?: 0x7f0f7fa46513 in ??? (/lib64/libc.so.6)
???:?:?: 0x7f0f7fa3a2c2 in ??? (/lib64/libc.so.6)
{project_path}/examples/segfault.zig:37:15: 0x12a64b6 in main (segfault.zig)
{zig_install_dir}/lib/std/start.zig:750:30: 0x12a7013 in callMain (std.zig)
{zig_install_dir}/lib/std/start.zig:203:5: 0x12663a1 in _start (std.zig)

It makes me a little sad because, when the separate debuginfo file is not accessible, it at least gets function names from the .so file's symbol table:

info: loading system libc...
info: testing libc printf segfault...
Segmentation fault at address 0x8
???:?:?: 0x7f862093155d in __strlen_avx2 (/lib64/libc.so.6)
???:?:?: 0x7f8620825697 in __printf_buffer (/lib64/libc.so.6)
???:?:?: 0x7f8620826513 in __vfprintf_internal (/lib64/libc.so.6)
???:?:?: 0x7f862081a2c2 in __printf (/lib64/libc.so.6)
{project_path}/examples/segfault.zig:37:15: 0x12a64b6 in main (segfault.zig)
{zig_install_dir}/lib/std/start.zig:750:30: 0x12a7013 in callMain (std.zig)
{zig_install_dir}/lib/std/start.zig:203:5: 0x12663a1 in _start (std.zig)

I made a small patch to the std lib to test this hypothesis, and it went well until encountering a DW_AT_abstract_origin referencing a DIE in another unit, which seems to be unsupported judging from this comment:

// Follow the DIE it points to and repeat
const ref_offset = try this_die_obj.getAttrRef(AT.abstract_origin, this_unit_offset, next_offset);
fr.seek = @intCast(ref_offset);
this_die_obj = (try parseDie(
    &fr,
    attrs_bufs[2],
    abbrev_table, // wrong abbrev table for different cu
    unit_header.format,
    endian,
    address_size,
)) orelse return bad();

So I guarded the parseDie call with a check that ref_offset is within the range of the current unit, breaking with null otherwise, so that a cross-unit reference no longer fails the whole DWARF file. After that, stack traces were finally “more complete”:

info: loading system libc...
info: testing libc printf segfault...
Segmentation fault at address 0x8
../sysdeps/x86_64/multiarch/strlen-avx2.S:76: 0x7fbfd605055d in __strlen_avx2 (../sysdeps/x86_64/multiarch/strlen-avx2.S)
/usr/src/debug/glibc-2.43-4.fc44.x86_64/stdio-common/vfprintf-process-arg.c:443:17: 0x7fbfd5f44697 in __printf_buffer (vfprintf-internal.c)
/usr/src/debug/glibc-2.43-4.fc44.x86_64/stdio-common/vfprintf-internal.c:1548:7: 0x7fbfd5f45513 in __vfprintf_internal (vfprintf-internal.c)
/usr/src/debug/glibc-2.43-4.fc44.x86_64/stdio-common/printf.c:33:10: 0x7fbfd5f392c2 in __printf (printf.c)
{project_path}/examples/segfault.zig:37:15: 0x103762f in main (segfault)
{zig_install_dir}/lib/std/start.zig:203:5: 0x1021eed in _start (segfault)
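
The range check described above can be sketched in isolation as follows. This is only an illustration, not the std code: refIsInUnit is a made-up helper name, and unit_size plays the role of next_offset in the excerpt (header length plus unit length, i.e. the total size of the unit).

```zig
const std = @import("std");

/// Hypothetical helper (not part of std.debug.Dwarf): true when a
/// .debug_info offset falls inside the unit that starts at `unit_offset`
/// and spans `unit_size` bytes, header included.
fn refIsInUnit(ref_offset: u64, unit_offset: u64, unit_size: u64) bool {
    // Written with a subtraction so that unit_offset + unit_size cannot overflow.
    return ref_offset >= unit_offset and ref_offset - unit_offset < unit_size;
}

test "same-unit references are followed, cross-unit ones are skipped" {
    // Unit covering [0x100, 0x140).
    try std.testing.expect(refIsInUnit(0x120, 0x100, 0x40));
    try std.testing.expect(!refIsInUnit(0x0ff, 0x100, 0x40));
    try std.testing.expect(!refIsInUnit(0x140, 0x100, 0x40));
}
```

With a check like this, a DIE whose DW_AT_abstract_origin points into another unit simply ends up without a name instead of aborting the scan.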

Questions are:

  • is it worth an issue or a PR, given how niche the use case is?
  • if it is, how would one create a test for it? I can’t see a way besides using specific ELF and debuginfo files that reproduce the issue without the patch, but that would mean including a binary blob fixture somewhere…

And one opinion:

I think failing to parse DWARF from the separate debuginfo file should fall back to getting info from the original ELF file if possible instead of giving up completely. Note that this is doable in userland thanks to customizable SelfInfo, but this behavior by default makes more sense to me.
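
To illustrate the fallback I have in mind, here is a rough sketch. Everything in it is hypothetical: Symbol, LookupError, LookupFn and lookupWithFallback are names invented for this example and do not exist in std.debug, and the real SelfInfo interface looks different.

```zig
const std = @import("std");

/// Hypothetical result type for this sketch.
const Symbol = struct {
    name: []const u8,
    /// Source file; only known when DWARF parsing succeeded.
    file: ?[]const u8,
};

const LookupError = error{ InvalidDebugInfo, MissingDebugInfo };

/// Hypothetical signature shared by both lookup sources.
const LookupFn = *const fn (address: u64) LookupError!Symbol;

/// Try the separate debuginfo file first; if that fails for any reason,
/// fall back to the symbol table of the ELF file itself.
fn lookupWithFallback(address: u64, fromDebugInfo: LookupFn, fromSymtab: LookupFn) LookupError!Symbol {
    return fromDebugInfo(address) catch return fromSymtab(address);
}

// Stand-in for a debuginfo file that the DWARF parser rejects.
fn brokenDebugInfo(address: u64) LookupError!Symbol {
    _ = address;
    return error.InvalidDebugInfo;
}

// Stand-in for a symbol table lookup that only knows function names.
fn symtabOnly(address: u64) LookupError!Symbol {
    _ = address;
    return .{ .name = "__printf", .file = null };
}

test "a broken debuginfo file still yields a function name" {
    const sym = try lookupWithFallback(0x1000, &brokenDebugInfo, &symtabOnly);
    try std.testing.expectEqualStrings("__printf", sym.name);
    try std.testing.expect(sym.file == null);
}
```

The point is only the control flow: a parse failure in the richer source degrades the result instead of discarding it.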

Also, vaguely related: https://codeberg.org/ziglang/zig/issues/31790.

I understand loading dynamic libraries without linking libc and still having perfect stack traces is not the most wanted feature for the zig std lib, but hey, it is my thing :slight_smile:


Following up here instead of on Zulip (#std > std.debug.Dwarf refactor) to reduce noise there.

Like I said on Zulip, PR #35610 should fix the second issue, concerning references to DIEs in other units (untested, since I haven’t encountered a binary whose debug info contains such attributes), but the first issue isn’t handled: compile units are still the only supported unit type.

Seems like partial units are basically identical to full compile units (at least in terms of structure), so it probably shouldn’t be a problem to allow them.
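
To make the structural point concrete, here is a minimal sketch (not the std.debug.Dwarf code). In DWARF 5, DW_UT_compile (0x01) and DW_UT_partial (0x03) share the same unit header layout, while skeleton, split and type units append extra fields. The helper names hasCompileUnitHeaderLayout and parseV5Fields are invented for illustration, and the sketch assumes 32-bit DWARF and little-endian data.

```zig
const std = @import("std");

// DW_UT_* unit type codes defined by DWARF 5.
const DW_UT_compile: u8 = 0x01;
const DW_UT_partial: u8 = 0x03;

/// Hypothetical helper: true when a v5 unit header has the plain layout
/// `unit_type, address_size, debug_abbrev_offset` with no trailing fields.
fn hasCompileUnitHeaderLayout(unit_type: u8) bool {
    return unit_type == DW_UT_compile or unit_type == DW_UT_partial;
}

const UnitFields = struct { unit_type: u8, address_size: u8, debug_abbrev_offset: u32 };

const ParseError = error{ Truncated, UnsupportedVersion, UnsupportedUnitType };

/// Parses the fields that follow unit_length in a 32-bit DWARF v5 unit
/// header (little-endian): version, unit_type, address_size, debug_abbrev_offset.
fn parseV5Fields(bytes: []const u8) ParseError!UnitFields {
    if (bytes.len < 8) return error.Truncated;
    const version = std.mem.readInt(u16, bytes[0..2], .little);
    if (version != 5) return error.UnsupportedVersion;
    const unit_type = bytes[2];
    if (!hasCompileUnitHeaderLayout(unit_type)) return error.UnsupportedUnitType;
    return .{
        .unit_type = unit_type,
        .address_size = bytes[3],
        .debug_abbrev_offset = std.mem.readInt(u32, bytes[4..8], .little),
    };
}

test "a partial unit header parses like a compile unit header" {
    // version 5, DW_UT_partial, 8-byte addresses, abbrev offset 0x10
    const partial = [_]u8{ 0x05, 0x00, 0x03, 0x08, 0x10, 0x00, 0x00, 0x00 };
    const fields = try parseV5Fields(&partial);
    try std.testing.expectEqual(@as(u8, 8), fields.address_size);
    try std.testing.expectEqual(@as(u32, 0x10), fields.debug_abbrev_offset);
}
```

Whether the rest of the scan (ranges, names, line tables) behaves well on partial units is a separate question that this sketch does not answer.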

I don’t have any binaries with debug info that contains partial units or DIE refs into other units, so I can’t check. If you can pull my PR, remove the if (unit_type != DW.UT.compile) return bad() line, build with --zig-lib-dir <my-zig-branch>/lib, and report back on your stack traces, I’d appreciate it and could investigate if something fails. Otherwise, if you can share the binaries that have that debug info, I could look into it.

I’ll do it as soon as I get time :slight_smile:

In the meantime, here is my patch to prevent the current parser from failing on DW_AT_abstract_origin and DW_UT_partial units:

--- a/lib/std/debug/Dwarf.zig
+++ b/custom_std_lib/lib/std/debug/Dwarf.zig
@@ -393,7 +393,11 @@ fn scanAllFunctions(di: *Dwarf, gpa: Allocator, endian: Endian) ScanError!void {
         var debug_abbrev_offset: u64 = undefined;
         if (version >= 5) {
             const unit_type = try fr.takeByte();
-            if (unit_type != DW.UT.compile) return bad();
+            if (unit_type != DW.UT.compile) {
+                // Only full compile units can contribute function ranges here.
+                this_unit_offset += next_offset;
+                continue;
+            }
             address_size = try fr.takeByte();
             debug_abbrev_offset = try readFormatSizedInt(&fr, unit_header.format, endian);
         } else {
@@ -477,6 +481,10 @@ fn scanAllFunctions(di: *Dwarf, gpa: Allocator, endian: Endian) ScanError!void {
 
                                 // Follow the DIE it points to and repeat
                                 const ref_offset = try this_die_obj.getAttrRef(AT.abstract_origin, this_unit_offset, next_offset);
+                                // Only same-unit references can be followed with the current unit's
+                                // abbrev table and base attributes. Cross-unit references are valid
+                                // DWARF, but unsupported here; keep the range and omit the optional name.
+                                if (ref_offset < this_unit_offset or ref_offset >= this_unit_offset + next_offset) break :x null;
                                 fr.seek = @intCast(ref_offset);
                                 this_die_obj = (try parseDie(
                                     &fr,
@@ -492,6 +500,10 @@ fn scanAllFunctions(di: *Dwarf, gpa: Allocator, endian: Endian) ScanError!void {
 
                                 // Follow the DIE it points to and repeat
                                 const ref_offset = try this_die_obj.getAttrRef(AT.specification, this_unit_offset, next_offset);
+                                // Only same-unit references can be followed with the current unit's
+                                // abbrev table and base attributes. Cross-unit references are valid
+                                // DWARF, but unsupported here; keep the range and omit the optional name.
+                                if (ref_offset < this_unit_offset or ref_offset >= this_unit_offset + next_offset) break :x null;
                                 fr.seek = @intCast(ref_offset);
                                 this_die_obj = (try parseDie(
                                     &fr,
@@ -588,7 +600,11 @@ fn scanAllCompileUnits(di: *Dwarf, gpa: Allocator, endian: Endian) ScanError!voi
         var debug_abbrev_offset: u64 = undefined;
         if (version >= 5) {
             const unit_type = try fr.takeByte();
-            if (unit_type != UT.compile) return bad();
+            if (unit_type != UT.compile) {
+                // Only full compile units can contribute address lookup entries here.
+                this_unit_offset += next_offset;
+                continue;
+            }
             address_size = try fr.takeByte();
             debug_abbrev_offset = try readFormatSizedInt(&fr, unit_header.format, endian);
         } else {

I can give you the debuginfo file with the DW_UT_partial unit if you want, but I don’t know how to do it conveniently (6.8M).

But let me emphasize that a problem I currently consider important is that, when parsing the external debuginfo file fails while unwinding, it does not fall back to using the ELF file itself to at least get function names.