Strange behavior with std.ArrayList (garbled output)

I’m at a bit of a loss. Trying to implement a function that takes a directory name and returns a list of all files in that directory as std.ArrayList.

So far I’ve come up with the following code:

const std = @import("std");
const DirectoryListing = @import("DirectoryListing");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();
    const dirname: []const u8 = "/etc";
    const filelist = try getDirListing(allocator, dirname);
    for (filelist.items) |element| {
        std.debug.print("  {s} \n", .{element});
    }
}

fn getDirListing(allocator: std.mem.Allocator, directory: []const u8) !std.ArrayList([]const u8) {
    var dir = try std.fs.cwd().openDir(directory, .{ .iterate = true });
    defer dir.close();
    var dirIterator = dir.iterate();
    var files: std.ArrayList([]const u8) = .empty;
    errdefer files.deinit(allocator);
    while (try dirIterator.next()) |dirContent| {
        if (dirContent.kind == .file) {
            std.debug.print("\n{s}\n", .{dirContent.name});
            files.append(allocator, dirContent.name) catch unreachable;
            for (files.items) |element| {
                std.debug.print("{s}, ", .{element});
            }
        }
    }
    return files;
}

Looping through the items of the returned ArrayList in the main function gets me basically just garbled output (mainly special characters).
Doing the same loop inside the function gives me the correct output but once it’s getting beyond 22 entries it all starts to fall apart and the output gets more and more garbled.

I’m pretty new to zig and would appreciate any suggestions how to fix this.

dirContent.name is a pointer provided by an iterator, it becomes invalid after the iterator performs a next(). You need to copy its content in order to access it long-term. In this case, using an arena is recommended.

Made some modifications based on your coding style. In practice, there may be better styles, such as using ArrayList(u8) as the underlying structure and using indices instead of pointers.

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();
    const dirname: []const u8 = "/etc";
    var filelist, var arena = try getDirListing(allocator, dirname);
    defer arena.deinit();
    defer filelist.deinit(allocator);
    std.debug.print("\n\nstart:\n", .{});
    for (filelist.items) |element| {
        std.debug.print("  {s} \n", .{element});
    }
}

fn getDirListing(allocator: std.mem.Allocator, directory: []const u8) !struct { std.ArrayList([]const u8), std.heap.ArenaAllocator } {
    var dir = try std.fs.cwd().openDir(directory, .{ .iterate = true });
    defer dir.close();
    var dirIterator = dir.iterate();
    var files: std.ArrayList([]const u8) = .empty;
    errdefer files.deinit(allocator);
    var arena: std.heap.ArenaAllocator = .init(allocator);
    errdefer arena.deinit();
    while (try dirIterator.next()) |dirContent| {
        if (dirContent.kind == .file) {
            const name = try arena.allocator().dupe(u8, dirContent.name);
            std.debug.print("\n{s}\n", .{name});
            files.append(allocator, name) catch unreachable;
            for (files.items) |element| {
                std.debug.print("{s}, ", .{element});
            }
        }
    }
    return .{ files, arena };
}

Since you (OP) mentioned you’re new to Zig, I’ll take the liberty of expanding on this valuable comment a bit more:

Your ArrayList is a list of slices, e.g. pointers to memory with a length. You have no control over what’s contained at the memory address until you take ownership of that block of memory. In your example, the block of memory will initially contain the file you expect in the loop. However, once the loop continues to the next iteration, you no longer have any guarantee what was there. That’s why you need to do something like allocator.dupe(…) or std.fmt.allocPrint(…) to copy the value to a new block of memory you will own.

As to why you should use an arena, it’s so you don’t have to track & free the individual allocations (assuming you don’t need to, of course, which I think is fair given the use-case). Let the arena do the allocations so that it can free everything in bulk when needed.

Don’t forget to deinit the ArrayList in main. Alternatively also init it using the arena.

1 Like

Welcome to Ziggit @mplindner!

What happens next?

I suspect a struct representing “a directory full of file names” would be more useful for whatever happens next, than an ArrayList full of strings. But the specifics depend on the purpose.

Thanks a bunch! That really helps.
Guess my biggest mistake was to assume that .append would do a copy as it wants an allocator and a value. Guess that’s actually just for allocating the memory for the pointer to the data.

Thanks!, that’s really good to know. I suspected it had something to do memory allocation. But, as I assumed that .append would basically copy the value into a new memory location as it’s asking for an allocator and the value it didn’t occur to me that this might be the culprit…

The ultimate goal is to build a configurable file poller that reads directory names, (RegEx) file masks, etc. from a config file, retrieves all file names for each directory, does a regex match and then triggers processing for matching files.
So, I’ll need to be able to deal with sets of 0 to thousands of filenames depending on what’s in each directory at the time of polling.

In that case, I suggest you have one std.ArrayList(u8) while reading the directory. Append every filename as a null-terminated string, then use array_list.toOwnedSliceSentinel(0) to end up with a field which is of type [:0]const u8. We’ll call this .file_text. It will make your life easier if the first item you put on the ArrayList is itself 0: prepend 0 for each name, rather than appending, because toOwnedSliceSentinel will also add one.

Conceptually, you’ll have a second slice of type [][:0]const u8, with each of the file names in it, call that field .file_names: these are just pointers to the regions of .file_text, with the length of the file included. If you make slice[0] == 0 (you’ll see why), the first of these points to [1].

It’s perfectly reasonable to also store .file_names with that exact type, you would use an ArrayList([:0]const u8) to build the slice, and then just .toOwnedSlice() to assign the field. However, it would be more space efficient (in 64 bit systems, at least) to store them as an index into .file_text, plus the length, using u32s. It’s simple to write a function for your struct which produces the slices on-demand from that. But you may not want to bother: computers without gigabytes of memory are pretty rare these days, the extra complexity may not be worthwhile.

This way, all searching within the directory is just searching .file_text. The names you find are bounded with 0, so: find a string or regex match, look backward and forward for 0, that’s your file name. Operating systems like filenames to be null-terminated, and also forbid null bytes in a filename.

Also, this way you just free the two slices, and if you allocated the struct on the heap, you destroy that last. An arena is not actually getting you much if you take this approach.

It’s actually not 100% clear to me that you need a separate indexed array of all files, either. Maybe just the concatenation of the names is sufficient to your purpose.

2 Likes

Good ideas. Two minor notes that are probably obvious:

  • If file_names (the slice of names) is needed, note that it can’t be constructed until after file_text is completed (not as you add names to it), since the pointers to names in the file_text ArrayList will change as the ArrayList grows.

  • If there is a requirement to rescan the file system each time processing is triggered, then storing the whole thing is a waste. The OP didn’t say anything about whether changes to the set of files are handled. If files will be added and deleted to a stored set by watching for directory changes, then a different data structure may be needed.

1 Like

Not a minor note! This is a very good argument for using indices the whole way through, those can’t be invalidated by a realloc.

I suppose the lengths could be used to do a fixup pass on the pointers, but at that point, indices are looking pretty nice.

2 Likes