Idiomatic Complex Formatting

peanut · March 28, 2024, 6:24pm

Hey y’all,

I have been messing around with some more complex formatting, in this case for matrices. After reading through some threads here and working with the std.testing.allocator, I have arrived at the following example for a memory-leak free formatting function that accepts runtime values. The issue is that what I came up with feels ugly.

What are some examples you’ve come across that showcase idiomatic complex formatting in Zig?

What I came up with and some notes:

clearAndFree() is necessary to prevent the output from persisting between formatting calls.
defer alloc.free(s) appears to be necessary to free the memory allocated by std.fmt.allocPrint (and so prevent a memory leak).

pub fn format(mat: *Matrix, alloc: std.mem.Allocator, out: *std.ArrayList(u8)) ![]const u8 {
    const label = @typeName(Matrix)[0..];
    try out.appendSlice(label);
    try out.appendSlice("[\n"[0..]);
    for (0..rows) |ridx| {
        try out.appendSlice("\t["[0..]);
        for (0..cols) |cidx| {
            const s = try std.fmt.allocPrint(alloc, "{any}", .{mat.get(ridx, cidx)});
            defer alloc.free(s);
            try out.appendSlice(s);
            if (cidx != cols - 1) {
                try out.append(',');
            }
        }
        try out.append(']');
        if (ridx != rows - 1) {
            try out.appendSlice(",\n"[0..]);
        }
    }
    try out.appendSlice("]\n");

    return out.items;
} // ... rest of the struct

// invocation in a test block...

    var alloc = std.testing.allocator;
    const arrayListString = std.ArrayList(u8);
    var out = arrayListString.init(alloc);
    defer out.deinit();

    std.debug.print("\nlower: {s}\n", .{try lower.format(alloc, &out)});
    out.clearAndFree();
    std.debug.print("\nupper: {s}\n", .{try upper.format(alloc, &out)});
    out.clearAndFree();

AndrewCodeDev · March 28, 2024, 6:59pm

It’s worth looking into the writer API for building/formatting strings. ArrayList has a writer member function that returns one such object.

Printing matrices (and tensors more generally) is ugly work. In this case, the line:

const s = try std.fmt.allocPrint(alloc, "{any}", .{mat.get(ridx, cidx)});

…will print some really nasty floating point values. If you want them to be readable, I recommend picking a decimal cutoff and using something like {d:.4} or some other decimal limit to help format them.

Honestly, you’ve got a decent idea going here… you’re trying to make something clean and tidy but I think you can cut down on some noise using writer instead of a direct ArrayList(u8) parameter and handle a few edge cases.
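As a rough sketch of the suggestion above, here is what writing through ArrayList(u8).writer() with a {d:.4} cutoff could look like. Mat2 and its write method are hypothetical stand-ins for the OP's type, and the code assumes the standard library as it was around the time of this thread (Zig 0.11/0.12, where ArrayList carries its allocator); newer releases may differ.

```zig
const std = @import("std");

// Hypothetical stand-in for the matrix type in the thread; name and layout are assumptions.
const Mat2 = struct {
    data: [2][2]f32,

    // Streams the matrix into any writer, so no temporary strings are allocated here.
    pub fn write(mat: Mat2, writer: anytype) !void {
        for (mat.data) |row| {
            try writer.writeByte('[');
            for (row, 0..) |value, cidx| {
                if (cidx != 0) try writer.writeAll(", ");
                // {d:.4} prints a decimal with four digits after the point.
                try writer.print("{d:.4}", .{value});
            }
            try writer.writeAll("]\n");
        }
    }
};

test "write a matrix through ArrayList(u8).writer()" {
    var out = std.ArrayList(u8).init(std.testing.allocator);
    defer out.deinit();

    const m = Mat2{ .data = .{ .{ 1, 2.5 }, .{ 3, 4 } } };
    try m.write(out.writer());
    std.debug.print("\n{s}", .{out.items});
}
```

The per-element allocPrint and its matching free disappear because print writes straight into the list.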

AndrewCodeDev · March 28, 2024, 7:12pm

One of these days, I’ll have to tackle printing tensors in my library… my plan of action would be something like…

1. Make a base case for Rank-1 tensors (vectors) that optionally prints as a row or column and gives the option to enclose in brackets (like [...]).
2. Make a Rank-2 (matricial) version that parameterizes the vector version for each row.
3. Make a Rank-3 (vector-of-matrices) version that dispatches to step 2 to print axis-wise matrices.
4. Repeat until I hit the max rank limit. Depending on memory layout, may have to interleave prints (probably not, just a consideration).

Basically, you can “clean” things up by working up the chain of outer-dimensions. That said, if you’re only going with matrices and vectors, then you can give each one a single function to format their output (same as what you’re doing here).
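The first two steps of that plan might be sketched like this. VecStyle, writeVector and writeMatrix are made-up names for illustration, not anything from an existing library, and the helpers only assume a writer with writeAll, writeByte and print.

```zig
const std = @import("std");

// Hypothetical options for the rank-1 base case; the field names are assumptions.
const VecStyle = struct {
    separator: []const u8 = ", ", // use "\n" to print the vector as a column
    brackets: bool = true,
};

// Rank-1 base case: print a slice of values with the chosen style.
fn writeVector(writer: anytype, values: []const f32, style: VecStyle) !void {
    if (style.brackets) try writer.writeByte('[');
    for (values, 0..) |v, i| {
        if (i != 0) try writer.writeAll(style.separator);
        try writer.print("{d:.2}", .{v});
    }
    if (style.brackets) try writer.writeByte(']');
}

// Rank-2 case: dispatch to the rank-1 printer once per row.
fn writeMatrix(writer: anytype, comptime rows: usize, comptime cols: usize, m: [rows][cols]f32) !void {
    for (m) |row| {
        try writeVector(writer, &row, .{});
        try writer.writeByte('\n');
    }
}

test "rank-2 printing built on the rank-1 helper" {
    var buf: [64]u8 = undefined;
    var stream = std.io.fixedBufferStream(&buf);
    const m = [2][2]f32{ .{ 1, 2 }, .{ 3, 4 } };
    try writeMatrix(stream.writer(), 2, 2, m);
    std.debug.print("\n{s}", .{stream.getWritten()});
}
```

A rank-3 printer would follow the same pattern and call writeMatrix once per slice.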

peanut · March 28, 2024, 7:53pm

Thanks, I’ll check out the writer. And yeah, this is a method for a Matrix (rank 2) generic. I think if I really wanted I could determine the buffer size at comptime as well since the matrix type includes the dimensions and the type for the values.

My biggest issue is actually with the calling function. Having to clearAndFree is… fragile? I guess I could wrap the print call and move the allocator or ArrayList instantiations there, but that generally goes against the advice I’ve read that functions shouldn’t encapsulate their own hidden allocators.

castholm · March 28, 2024, 8:28pm

The idiomatic way of implementing custom formatting would be to implement a format function. std.fmt.format, which std.debug.print and other formatting functions that take a format string with placeholders are derived from, has this to say in its doc comment:

If a formatted user type contains a function of the type
pub fn format(
   value: ?,
   comptime fmt: []const u8,
   options: std.fmt.FormatOptions,
   writer: anytype,
) !void
with ? being the type formatted, this function will be called instead of the default implementation. This allows user types to be formatted in a logical manner instead of dumping all fields of the type.

This means that if your struct implements a format function, you can print it simply by passing it to a print function, like std.debug.print("{}", .{mat}). See std.SemanticVersion.format for one example of such a function.

A port of your original function would look something like this (writer is assumed to be a std.io.Writer):

pub fn format(
    mat: Matrix,
    comptime fmt: []const u8,
    options: std.fmt.FormatOptions,
    writer: anytype,
) !void {
    _ = fmt;
    _ = options;
    try writer.writeAll(@typeName(Matrix));
    try writer.writeAll("[\n");
    for (0..rows) |ridx| {
        try writer.writeAll("\t[");
        for (0..cols) |cidx| {
            try writer.print("{any}", .{mat.get(ridx, cidx)});
            if (cidx != cols - 1) {
                try writer.writeByte(',');
            }
        }
        try writer.writeByte(']');
        if (ridx != rows - 1) {
            try writer.writeAll(",\n");
        }
    }
    try writer.writeAll("]\n");
}

The format function doesn’t allocate unless the writer implementation allocates.
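To illustrate that last point, here is a small sketch that formats into a stack buffer with std.fmt.bufPrint, so no allocator is involved at all. Vec2 is a hypothetical type, and the four-argument format signature is the one quoted from the doc comment above (it may not match newer Zig releases).

```zig
const std = @import("std");

// Hypothetical two-element type used only for this illustration.
const Vec2 = struct {
    x: f32,
    y: f32,

    pub fn format(
        v: Vec2,
        comptime fmt: []const u8,
        options: std.fmt.FormatOptions,
        writer: anytype,
    ) !void {
        _ = fmt;
        _ = options;
        try writer.print("({d:.1}, {d:.1})", .{ v.x, v.y });
    }
};

test "format into a stack buffer without an allocator" {
    var buf: [32]u8 = undefined;
    // bufPrint writes into buf and returns the written slice; nothing is heap-allocated.
    const text = try std.fmt.bufPrint(&buf, "{}", .{Vec2{ .x = 1, .y = 2 }});
    std.debug.print("\n{s}\n", .{text});
}
```

The same format function works unchanged with std.debug.print or an ArrayList writer, so the caller decides where the bytes go and whether anything needs to be freed.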

Addendum: If you don’t control the struct you want to format, you can implement a function that returns a std.fmt.Formatter. See std.zig.fmtId for an example.
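A minimal sketch of that addendum, modeled on the std.zig.fmtId pattern as I understand it in the Zig 0.11/0.12-era standard library. fmtVec3 and formatVec3 are hypothetical names, and a plain [3]f32 stands in for a type you don't control.

```zig
const std = @import("std");

// Formatting function for a type we do not own (here a plain [3]f32).
fn formatVec3(
    v: [3]f32,
    comptime fmt: []const u8,
    options: std.fmt.FormatOptions,
    writer: anytype,
) !void {
    _ = fmt;
    _ = options;
    try writer.print("<{d:.1}, {d:.1}, {d:.1}>", .{ v[0], v[1], v[2] });
}

// Hypothetical helper: wrap the value so that "{}" calls formatVec3.
fn fmtVec3(v: [3]f32) std.fmt.Formatter(formatVec3) {
    return .{ .data = v };
}

test "print a foreign type through std.fmt.Formatter" {
    var buf: [32]u8 = undefined;
    const text = try std.fmt.bufPrint(&buf, "{}", .{fmtVec3(.{ 1, 2, 3 })});
    std.debug.print("\n{s}\n", .{text});
}
```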

AndrewCodeDev · March 28, 2024, 9:07pm

In this case, @castholm’s answer will give you the tidiest looking code. If you want to expand it, I would add another step here.

Create helper-functions that are called by the format functions so you can build a cumulative case.

Again, in the case where you just want two functions to call (one for matrices, one for vectors) then you’re not gaining much ground by making things composable. If that’s what you want, I’d directly do what @castholm is recommending here.

castholm · March 28, 2024, 9:31pm

If your matrix type is implemented as something like struct { cols: [4]Vector4 } you can effectively compose them by first implementing a format function for Vector4 and then simply calling writer.print("{}", .{mat.cols[i]}) in Matrix’s format function.
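Roughly, that composition could look like the following, shrunk to two dimensions to keep it short. Vector2 and Matrix2 are hypothetical types, and the format signature is again the four-argument one from this thread.

```zig
const std = @import("std");

// Hypothetical column-vector type with its own format function.
const Vector2 = struct {
    x: f32,
    y: f32,

    pub fn format(v: Vector2, comptime fmt: []const u8, options: std.fmt.FormatOptions, writer: anytype) !void {
        _ = fmt;
        _ = options;
        try writer.print("[{d:.1}, {d:.1}]", .{ v.x, v.y });
    }
};

// Hypothetical matrix type stored as an array of column vectors.
const Matrix2 = struct {
    cols: [2]Vector2,

    pub fn format(m: Matrix2, comptime fmt: []const u8, options: std.fmt.FormatOptions, writer: anytype) !void {
        _ = fmt;
        _ = options;
        for (m.cols) |col| {
            // "{}" dispatches to Vector2.format.
            try writer.print("{}\n", .{col});
        }
    }
};

test "matrix formatting composed from vector formatting" {
    var buf: [64]u8 = undefined;
    const m = Matrix2{ .cols = .{ .{ .x = 1, .y = 2 }, .{ .x = 3, .y = 4 } } };
    const text = try std.fmt.bufPrint(&buf, "{}", .{m});
    std.debug.print("\n{s}", .{text});
}
```

Note that each column lands on its own line here, which is exactly the layout question raised in the replies that follow.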

AndrewCodeDev · March 28, 2024, 9:34pm

Yes, but that actually is quite restrictive - it depends on if you always want the same behaviour for every print and you may not.

For instance, in a matrix, you may decide that you don’t want to enclose rows with square brackets [...] but on vectors you do.

Likewise, it assumes a format. If your default thinking is row-major ordering and that’s all you support, sure. If you have column-major ordering and you default to printing vectors vertically (which is the typical assumption in most tensor calculus), then you’ll end up printing things incorrectly.

So for instance, I could have a single function with a parameterized layout that says how to print each sub-rank (for column-wise, it could be "\n"; for row-wise, it might be ", "). Mixing these formats may lead to an illegible print if you try to row-wise print a bunch of column vectors.

Again, it depends on the circumstance and how flexible you want to be. It looks like OP is going for a single data format and thus can always safely assume that the data in the matrix is properly matched to a single print function.

Sze · March 28, 2024, 9:48pm

I think that std.fmt.Formatter would still allow you to explicitly define a different function that should do the formatting.

I think to compose column-major ordered formatters, you could theoretically write a utility that basically uses custom writers to buffer the output from the substeps and then acts like vim in visual block selection mode allowing you to paste one rectangle of text besides another.

Figuring out how to do that without adding unnecessary overhead might be a challenge.
Maybe it could be done with a bunch of cleverly implemented writers that cooperate with another (and are possibly parameterized with domain knowledge / custom settings).

AndrewCodeDev · March 28, 2024, 9:54pm

Sure, there are probably standard ways to get around this. The important thing is that for printing complex data structures (the topic here being idiomatic complex formatting), it’s probably best to realize that the situation is inherently complex and a single answer isn’t going to fit all needs.

What you’re suggesting actually makes that point quite elegantly. We now need extra utilities that we can dispatch to depending on the circumstance. If you choose to do that with free-functions, that’s one possible answer. If you go with a standard formatter, that’s another. It depends on how you want your composition to play out if you’re going in that direction.

For instance, in my tensor library, it assumes and only supports row-major because that’s the default thinking for most people. To be clear, row major vs col major is actually just a problem of transposition, but that’s an extra step and you have to reverse your indices and recalculate strides every time you decide to show vs calculate. It’s a non-trivial programming issue that has a clear mathematical answer.

If we’re talking about a single type of structure that can assume data formats, this is a really easy problem to solve and you can parameterize it however you choose, really.

AndrewCodeDev · March 28, 2024, 10:21pm

Let me give a simple example of how you can solve this formatting issue with the example that @castholm has provided (by the way, not disagreeing that this is a valid approach, just want to make the assumptions clear).

Let’s say I have a structure like the following:

3x3 matrix: vectors = [3]Vector3 // can be columns or rows

Let’s say we assume that it’s column major. So x.vectors[0] is actually a column. Now this sucks to print - if you print one at a time, they’ll stack up on each other. Let’s assume a matrix like so:

v0 v1 v2
0, 3, 6,
1, 4, 7,
2, 5, 8,

If we print each vector flat as is, we’ll get the following:

print v0: 0, 1, 2
print v1: 3, 4, 5
print v2: 6, 7, 8

So you can see we’ve printed the transpose. That means if we transpose our matrix first:

v0 v1 v2
0, 1, 2,
3, 4, 5,
6, 7, 8,

We now get:

print v0: 0, 3, 6
print v1: 1, 4, 7
print v2: 2, 5, 8

That’s the original one we started off with conceptually speaking.

If we assume that printing is really a debugging thing and this isn’t going to happen during heavy compute sessions, you can call transpose on your matrix and print its transpose depending on the ordering and use only one set of formatting calls on the entire thing. This generalizes to Rank-N tensors as transpose actually just becomes the reverse permutation of modes (for a matrix, the values just swap… so it’s a two element reversal for changing ordering).
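Here is a small sketch of the transpose-then-print idea using the 3x3 example above. transpose and writeVectors are hypothetical helpers, integers are used to keep the output simple, and the printer knows nothing about the storage order.

```zig
const std = @import("std");

// Returns the transpose of a square matrix stored as n vectors of n values.
fn transpose(comptime n: usize, m: [n][n]i32) [n][n]i32 {
    var t: [n][n]i32 = undefined;
    for (m, 0..) |vec, i| {
        for (vec, 0..) |value, j| {
            t[j][i] = value;
        }
    }
    return t;
}

// Prints each stored vector flat on its own line, whatever the ordering means.
fn writeVectors(writer: anytype, comptime n: usize, m: [n][n]i32) !void {
    for (m) |vec| {
        for (vec, 0..) |value, j| {
            if (j != 0) try writer.writeAll(", ");
            try writer.print("{d}", .{value});
        }
        try writer.writeByte('\n');
    }
}

test "transpose column-major storage before printing" {
    // Column-major storage: v0 = {0,1,2}, v1 = {3,4,5}, v2 = {6,7,8}.
    const column_major = [3][3]i32{ .{ 0, 1, 2 }, .{ 3, 4, 5 }, .{ 6, 7, 8 } };
    var buf: [64]u8 = undefined;
    var stream = std.io.fixedBufferStream(&buf);
    try writeVectors(stream.writer(), 3, transpose(3, column_major));
    // The rows now read 0,3,6 / 1,4,7 / 2,5,8, matching the conceptual matrix.
    std.debug.print("\n{s}", .{stream.getWritten()});
}
```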

So @Sze, I wouldn’t actually change anything. If we do the math right and can allow for printing to first do transposition, the problem is solved in all cases with the assumption of a single series of formatting concerns.

Sze · March 28, 2024, 10:39pm

I agree, what I wrote wasn’t in the context of data types that have operations like transpose, I was more thinking of arbitrary data types, where you maybe want some quick way to recombine the existing formatting of multiple elements of a compound data type, but maybe still have preferences about how it should be printed.

Basically I was thinking out loud about what could you do to rearrange outputs of existing formatters, only looking at what they output. I think writing utilities around that would be helpful for quick debugging output, but building up helper functions and using domain knowledge, probably is often a better way for formatters that aren’t just for debugging.

AndrewCodeDev · March 28, 2024, 10:46pm

I think what you said here is the gist of my whole point (and the best advice on the topic):

I think for complex cases, there is no substitute for domain knowledge. The great thing about standard utilities is they give the user a channel to communicate to an existing backend. That channel can provide design choices that impose restrictions itself, however.

I’m very content to say that for genuinely involved cases, there isn’t an idiomatic way to do something at the level of the standard library - once you have your data setup correctly, you can then hand it to the formatter which can happily punch out bytes.