GrooveBasin MusicBrainz lookup in Zig 0.15.2

My CS background is somewhat limited (and about four years behind me now), so I’ve been trying to get a better grasp on things with Zig and some hobby projects. One such project involves calls to MusicBrainz; I had some ideas about how I might approach that, which I eventually implemented, and after some time I got around to looking at GrooveBasin to see how Andrew did it. Beyond much nicer-looking code, he had a comment about using streaming JSON that I didn’t quite understand, but I eventually decided he must have meant using Readers and Writers. My memory of such concepts was about as strong as a breeze against a brick wall, so I watched some of the recent talks about how the new Io interface works, and I think I pieced it together, but I have some questions about what I eventually produced versus the original GrooveBasin code.

The original (note that according to git blame this code appears to have been written just prior to the release of 0.12):

pub fn lookup(
    arena: Allocator,
    http_client: *std.http.Client,
    recording_id: []const u8,
) !Response {
    var server_header_buffer: [16 * 1024]u8 = undefined;
    // musicbrainz can return a lot of data; this should switch to use the json
    // streaming API.
    const json_read_buffer = try arena.alloc(u8, 2 * 1024 * 1024);
    var req = try http_client.open(.GET, .{
        .scheme = "https",
        .host = .{ .percent_encoded = "musicbrainz.org" },
        .path = .{ .percent_encoded = try std.fmt.allocPrint(arena, "/ws/2/recording/{s}", .{recording_id}) },
        .query = .{ .percent_encoded = "inc=work-rels+artist-credits+releases+discids" },
    }, .{
        .server_header_buffer = &server_header_buffer,
        .headers = .{
            .user_agent = .{ .override = player.http_user_agent },
        },
        .extra_headers = &.{
            .{ .name = "accept", .value = "application/json" },
        },
    });
    defer req.deinit();

    try req.send();
    try req.wait();

    if (req.response.status != .ok)
        return error.HttpRequestFailed;

    const content_type = req.response.content_type orelse
        return error.HttpResponseMissingContentType;

    const mime_type_end = std.mem.indexOf(u8, content_type, ";") orelse content_type.len;
    const mime_type = content_type[0..mime_type_end];

    if (!std.ascii.eqlIgnoreCase(mime_type, "application/json"))
        return error.HttpResponseNotJson;

    const raw_json = json_read_buffer[0..(try req.readAll(json_read_buffer))];
    const response = try std.json.parseFromSliceLeaky(Response, arena, raw_json, .{
        .ignore_unknown_fields = true,
    });

    return response;
}

My own:

pub fn lookup(arena: std.mem.Allocator, http_client: *std.http.Client, recording_id: []const u8) !Response {
    // Size chosen based on GrooveBasin
    var json_read_buffer: [2 * 1024 * 1024]u8 = undefined;
    var writer = std.Io.Writer.fixed(&json_read_buffer);
    const res = try http_client.fetch(.{
        .method = .GET,
        .location = .{
            .uri = .{
                .scheme = "https",
                .host = .{ .percent_encoded = "musicbrainz.org" },
                .path = .{ .percent_encoded = try std.fmt.allocPrint(arena, "/ws/2/recording/{s}", .{recording_id}) },
                .query = .{ .percent_encoded = "inc=work-rels+artist-credits+releases+discids" },
            },
        },
    .headers = .{ .user_agent = .{ .override = "my_super_awesome_user_agent" } },
        .extra_headers = &.{.{ .name = "accept", .value = "application/json" }},
        .response_writer = &writer,
    });

    if (res.status != .ok) return error.HttpRequestFailed;
    if (!(try std.json.validate(arena, writer.buffered()))) return error.MalformedJson;

    var reader = std.Io.Reader.fixed(writer.buffered());
    var json_reader = std.json.Scanner.Reader.init(arena, &reader);

    const response = try std.json.parseFromTokenSourceLeaky(Response, arena, &json_reader, .{ .ignore_unknown_fields = true });
    //const response = try std.json.parseFromSliceLeaky(Response, arena, writer.buffered(), .{ .ignore_unknown_fields = true });
    try writer.flush();

    return response;
}
  1. Is setting up the json_reader necessary? Without a Reader, I would have parsed straight from the Writer’s buffer (shown in the commented-out line). Does having the Reader improve performance by moving only small pieces of data at a time, rather than (potentially) much larger blocks of memory, or does it maybe not matter?
  2. If I should have the Reader (or in future situations where it is warranted), does it make sense for me to make its buffer writer.buffered(), or is there something dangerous there? Somehow “reaching into” the object makes me uncomfortable (I understand that probably sounds silly), but it’s what made my initial test pass, so I figured I must have been doing something right.
  3. Perhaps this is more specifically a question for Andrew (and my intuition is that this is sorta scrupulous anyway), but I’m wondering whether my error checking for JSON leaves something to be desired. Before, one would check via Content-Type, and seeing as you can’t check that when using std.http.Client.fetch(), I chose to use std.json.validate(). The goal of checking that it’s (valid) JSON is still achieved, but he had access to validate() even when he wrote the original code, so it makes me question whether it’s the right choice.

Those were my main points of curiosity, but if anything else sticks out, I’d be happy to hear about it. I’m having a blast learning more about Zig/programming in general!


In this case, using parseFromTokenSourceLeaky instead of parseFromSliceLeaky is presumably roughly equivalent, since in both cases you have already loaded the entire raw json data into memory. What I believe the comment in the GrooveBasin code is suggesting is that you give parseFromTokenSourceLeaky a reader whose inner reader comes straight from the Response, similar to how the decompression works in http.Client.fetch.

(so in the streaming json case, there would be another step in the chain where it’d go response reader → maybe decompressing reader → json reader)

If set up that way, then the buffer you use would not need to be able to fit the full json response, and the program would not need to hold the entire json data in memory at once–instead, it’d parse it in chunks.
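
Sketched out (untested, and using request()/receiveHead() rather than fetch(), since fetch() doesn’t hand you the reader), the chain would look roughly like this, where res is the std.http.Client.Response and arena an allocator:

var transfer_buffer: [64]u8 = undefined;
var decompress: std.http.Decompress = undefined;
const decompress_buffer = try arena.alloc(u8, res.head.content_encoding.minBufferCapacity());

// response reader -> (maybe) decompressing reader
const body_reader = res.readerDecompressing(&transfer_buffer, &decompress, decompress_buffer);

// -> json reader: the parser pulls chunks through the reader as it needs them,
// so the full response body never has to sit in one buffer at once.
var json_reader = std.json.Scanner.Reader.init(arena, body_reader);
const response = try std.json.parseFromTokenSourceLeaky(Response, arena, &json_reader, .{
    .ignore_unknown_fields = true,
});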

There’s a nice article on this topic that is also very applicable to how Zig’s Reader/Writer works in general (“compaction” in the article is what Zig calls “rebase”).

(However, Zig’s streaming json parser isn’t designed to only use the buffer’s memory and nothing else; it will allocate extra memory as needed, and also allocate memory for the parsed values, etc.)


The best real example of this in the Zig codebase might be the package fetching code, where it can set up a chain that goes something like: request reader → decompressing reader → gz decompressing reader → tar.pipeToFileSystem

It reads the head, sets up the response reader, and then hands that reader to the next link in the chain.

So for JSON you’d give that reader to std.json.Scanner.Reader.init and call parseFromTokenSourceLeaky with that.
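
In code, that last step would be something like (untested sketch, with response_reader being the reader obtained above and arena an allocator):

var json_reader = std.json.Scanner.Reader.init(arena, response_reader);
const parsed = try std.json.parseFromTokenSourceLeaky(Response, arena, &json_reader, .{
    .ignore_unknown_fields = true,
});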


Wow, that was all fantastic information, thank you! fetch() doesn’t expose a reader; I guess the idea with it is specifically to load everything into memory at once, so I switched back to making a request() and followed Andrew’s/std.http.Client’s code from there. I definitely feel like I have a better understanding of things now. Lots of fun.

Final result:

pub fn lookup(arena: std.mem.Allocator, http_client: *std.http.Client, recording_id: []const u8) !Response {
    var req = try http_client.request(.GET, .{
        .scheme = "https",
        .host = .{ .percent_encoded = "musicbrainz.org" },
        .path = .{ .percent_encoded = try std.fmt.allocPrint(arena, "/ws/2/recording/{s}", .{recording_id}) },
        .query = .{ .percent_encoded = "inc=work-rels+artist-credits+releases+discids" },
    }, .{
        .headers = .{ .user_agent = .{ .override = "my_super_awesome_user_agent" } },
        .extra_headers = &.{.{ .name = "accept", .value = "application/json" }},
    });

    defer req.deinit();

    try req.sendBodiless();
    // I feel confident in saying MB won't redirect
    var res = try req.receiveHead(&.{});

    if (res.head.status != .ok) return error.HttpRequestFailed;

    const content_type = res.head.content_type orelse return error.HttpResponseMissingContentType;
    const mime_type_end = std.mem.indexOf(u8, content_type, ";") orelse content_type.len;
    const mime_type = content_type[0..mime_type_end];

    if (!std.ascii.eqlIgnoreCase(mime_type, "application/json")) return error.HttpResponseNotJson;

    var transfer_buffer: [64]u8 = undefined;
    var decompress: std.http.Decompress = undefined;
    const decompress_buffer = try arena.alloc(u8, res.head.content_encoding.minBufferCapacity());
    const response_reader = res.readerDecompressing(&transfer_buffer, &decompress, decompress_buffer);

    var json_reader = std.json.Scanner.Reader.init(arena, response_reader);

    const response = try std.json.parseFromTokenSourceLeaky(Response, arena, &json_reader, .{ .ignore_unknown_fields = true });
    return response;
}

Choosing to declare decompress as undefined ourselves and then letting readerDecompressing() initialize it for the correct encoding feels weird, but I guess it’s nicer than receiving both a Reader and some random union. I’ll chalk that up to my lack of familiarity with the stuff; I trust the Zig team is making better design decisions than I am :laughing:

Thank you again!
