Get HTML page content

Hi! I am looking for some references (which, obviously, I have not found) on how to fetch the HTML content of a web page. Any guidance? (v 0.15.1)

Here’s just about the simplest way to do it:

test "get and print your thread" {
	var client: std.http.Client = .{
		.allocator = std.testing.allocator,
	};
	defer client.deinit();
	
	const stdout_writer_buf: []u8 = try std.testing.allocator.alloc(u8, 4096);
	defer std.testing.allocator.free(stdout_writer_buf);
	
	const stdout: std.fs.File = std.fs.File.stdout();
	// Release lock on stdout so we can print
	try stdout.lock(.none);

	var stdout_writer: std.Io.Writer = stdout.writer(stdout_writer_buf).interface;
	
	const fetch: std.http.Client.FetchResult = try client.fetch(.{
		// Print to stdout
		.response_writer = &stdout_writer,
		.location = .{
			.url = "https://ziggit.dev/t/get-html-page-content/11894",
		},
	});
	try std.testing.expect(fetch.status == .ok);
}

If you want to write the response body into memory or into a file that you then save, just use a std.Io.Writer.Allocating or a std.fs.File.Writer with a non-stdout file.
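
For future readers, here is a minimal offline sketch of that idea, assuming Zig 0.15.1. It makes no HTTP request: the hypothetical helper writeBody stands in for whatever produces the body, and the file name page.html is made up for illustration. The point is only that both destinations are reached through a pointer to a std.Io.Writer.

```zig
const std = @import("std");

// Hypothetical helper (not from the thread): a producer only needs a *std.Io.Writer.
fn writeBody(w: *std.Io.Writer) !void {
    try w.writeAll("<html>hello</html>");
}

test "same writer interface, two destinations" {
    const gpa = std.testing.allocator;

    // Destination 1: memory, via std.Io.Writer.Allocating.
    var allocating = std.Io.Writer.Allocating.init(gpa);
    defer allocating.deinit();
    try writeBody(&allocating.writer);
    try std.testing.expectEqualStrings("<html>hello</html>", allocating.written());

    // Destination 2: a file in a temporary directory, via std.fs.File.Writer.
    var tmp = std.testing.tmpDir(.{});
    defer tmp.cleanup();
    const file = try tmp.dir.createFile("page.html", .{ .read = true });
    defer file.close();

    var file_buf: [256]u8 = undefined;
    var file_writer = file.writer(&file_buf);
    try writeBody(&file_writer.interface);
    // The file writer is buffered, so flush before reading the file back.
    try file_writer.interface.flush();

    var read_back: [64]u8 = undefined;
    const n = try file.preadAll(&read_back, 0);
    try std.testing.expectEqualStrings("<html>hello</html>", read_back[0..n]);
}
```

In a real request, &allocating.writer or &file_writer.interface would be passed as .response_writer instead of being handed to writeBody.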

Does this work with 0.15.1? It fails for me. I was looking more into std.Io.Writer, as I thought it could be the way to go.

I’m running 0.15.1. What error is the test returning for you?

lib/std/posix.zig:5483:27: 0x10431482b in flock (test)
            .OPNOTSUPP => return error.FileLocksNotSupported,
                          ^
/Users/nevendrean/.zvm/0.15.1/lib/std/fs/File.zig:2175:25: 0x10431355b in lock (test)
            else => |e| return e,

when it reaches try stdout.lock(.none);

and I tried to remove this lock, but I get another error:

.BADF => return error.NotOpenForWriting

The var stdout_writer: std.Io.Writer = stdout.writer(stdout_writer_buf).interface; line is incorrect - it copies the std.Io.Writer interface out of the parent struct, and since the interface uses @fieldParentPtr to find the parent struct, the best case is that it fails to flush. See "Zig 0.15.1 reader/writer: Don't make copies of @fieldParentPtr()-based interfaces" for reference.
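
A minimal sketch of the difference, assuming Zig 0.15.1 and writing to stderr only to keep the example small. The test name and message are illustrative. It keeps the std.fs.File.Writer in its own variable, takes a pointer to its interface field, and checks that the parent can be recovered from that pointer, which is what a copied interface cannot offer.

```zig
const std = @import("std");

test "point at the interface instead of copying it" {
    var buf: [64]u8 = undefined;

    // The File.Writer holds the file handle and write state; it must stay in place.
    var file_writer: std.fs.File.Writer = std.fs.File.stderr().writer(&buf);

    // Correct: a pointer to the field inside the parent struct.
    const w: *std.Io.Writer = &file_writer.interface;

    // Recover the parent from the field pointer, the same way a
    // @fieldParentPtr-based interface does internally.
    const parent: *std.fs.File.Writer = @alignCast(@fieldParentPtr("interface", w));
    try std.testing.expect(parent == &file_writer);

    try w.writeAll("written through a pointer to the interface\n");
    try w.flush();

    // Incorrect (do not do this): `var copy = file_writer.interface;`
    // `&copy` does not live inside any File.Writer, so the same recovery
    // would yield a pointer to unrelated memory.
}
```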

I tried:

const stdout_writer_buf: []u8 = try std.testing.allocator.alloc(u8, 4096);
defer std.testing.allocator.free(stdout_writer_buf);
 
var file_writer: std.fs.File.Writer = std.fs.File.stdout().writer(stdout_writer_buf);
    
const writer_ptr = &file_writer.interface;

const fetch: std.http.Client.FetchResult = try client.fetch(.{
        // Print to stdout
        .method = .GET,
        .response_writer = writer_ptr,
        .location = .{
            .url = "https://example.com",
        },
});
try std.testing.expect(fetch.status == .ok);
try file_writer.interface.flush();

It compiles, but hangs.

This is all rather baffling because the test code that I posted originally not only compiles and runs for me, but even prints correctly despite me forgetting to flush.
I also edited it to include a flush() call, and it still works perfectly - no change.

I get the file lock thing - that’s an underlying syscall difference in behaviour between operating systems, sure.
But apparently I can copy the stdout writer interface, and it not only prints but also succeeds at flushing?

In any case, I found out that we can use stderr instead of stdout for the printing part, since std.debug has a nice API for that, and that API should hopefully be more platform-agnostic.
Here’s the adjusted test code:

test "get and print your thread" {
	var client: std.http.Client = .{
		.allocator = std.testing.allocator,
	};
	defer client.deinit();
	
	const stderr_writer_buf: []u8 = try std.testing.allocator.alloc(u8, 4096);
	defer std.testing.allocator.free(stderr_writer_buf);
	
	std.debug.lockStdErr();
	defer std.debug.unlockStdErr();

	var stderr_writer: *std.Io.Writer = std.debug.lockStderrWriter(stderr_writer_buf);
	defer std.debug.unlockStderrWriter();
	
	const fetch: std.http.Client.FetchResult = try client.fetch(.{
		// Print to stderr
		.response_writer = stderr_writer,
		.location = .{
			.url = "https://ziggit.dev/t/get-html-page-content/11894",
		},
	});
	try std.testing.expect(fetch.status == .ok);
	try stderr_writer.flush();
}

(P.S: Judging by std.debug’s documentation, lockStderrWriter() returns an unbuffered writer, so the flush() call isn’t actually necessary - if stdout behaves the same way, this could potentially explain why it was printing with no flush needed.
Also, the std.debug.lockStdErr(); defer std.debug.unlockStdErr(); statements seem to be unnecessary.)

Copying the interface out of the parent, then using it, is undefined behaviour.

It just happened to work by bad luck. It’s bad luck because you don’t know there is a problem until it bites you later.

This compiles, and I do get the content. But it seems a bit confusing to use stderr_writer, doesn’t it?

I also found this post:
https://ziggit.dev/t/im-too-dumb-for-zigs-new-io-interface/11645/4

Using the interface field const writer_ptr = &file_writer.interface; as the .response_writer is wrong. Which function could be used?

No, it is correct to use a pointer to that field.

I am not sure how you came to that interpretation.

I am just saying that setting .response_writer = &file_writer.interface compiles but is incorrect in the fetch options, as it makes the call hang. Also, the std/http/Client codebase does not test fetch, so I don’t know where

In case of any interest for a future reader, this post: https://ziggit.dev/t/unable-to-make-https-post-request/11955 presents an answer.

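const std = @import("std");

test "fetch" {
    // Assumed definitions; the snippet as posted left these out.
    const allocator = std.testing.allocator;
    const url = "https://example.com";

    // Collects the response body in memory.
    var allocating = std.Io.Writer.Allocating.init(allocator);
    defer allocating.deinit();

    var client: std.http.Client = .{
        .allocator = allocator,
    };
    defer client.deinit();

    const response = try client.fetch(.{
        .method = .GET,
        .location = .{ .url = url },
        // Pointer to the writer field, not a copy of it.
        .response_writer = &allocating.writer,
    });

    try std.testing.expect(response.status == .ok);
    std.debug.print("{s}\n", .{allocating.written()});

    // allocating.toOwnedSlice()     <-- to use the result
}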
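
A small offline sketch of the toOwnedSlice() step mentioned in the comment, assuming Zig 0.15.1. The helper collectBody is hypothetical and takes the body as a parameter. In the test above the bytes would come from client.fetch instead. The caller ends up owning the slice and must free it.

```zig
const std = @import("std");

// Hypothetical helper (not from the thread): returns the written bytes
// as a slice owned by the caller.
fn collectBody(gpa: std.mem.Allocator, body: []const u8) ![]u8 {
    var allocating = std.Io.Writer.Allocating.init(gpa);
    defer allocating.deinit();

    try allocating.writer.writeAll(body);

    // Transfers ownership of the written bytes to the caller.
    return allocating.toOwnedSlice();
}

test "keep the body after the writer is gone" {
    const gpa = std.testing.allocator;

    const html = try collectBody(gpa, "<html></html>");
    defer gpa.free(html);

    try std.testing.expectEqualStrings("<html></html>", html);
}
```

Compared with written(), which returns a view that is only valid while the Allocating writer lives, the owned slice can outlive the writer.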

Next, I am looking at how to get streams.