Where to enforce minimum buffer len?

Hi all,

In the new I/O interface, how do I describe a Reader that always reads with a length of 8 KB?

My first impression is to require a user to pass in a buf of at least 8 KB, which is nice. But how does that interact with a peek across the 8 KB boundary? The buffer now has to be at least 8 KB plus some previously buffered data.

So do I return 0 from stream if I need more unused capacity? Or do I just assert that I need that much unused capacity in stream?

A typical peek implementation will rebase (@memmove the buffered data to the start of the buffer) and fill the remaining capacity when the requested number of bytes to peek is greater than the number of bytes available.

This is also how std.Io.Reader.defaultPeek works. I haven’t chased the implementation details to confirm, but I assume you will hit an unreachable somewhere if you attempt to peek more bytes than the buffer’s total capacity; otherwise you can safely peek up to that amount (normal read/EOF errors aside).

You meant defaultRebase, which does in fact assert that the buffer is large enough. The documentation for the vtable function states it should: Asserts 'capacity' is within buffer capacity, or that the stream ends within 'capacity' bytes.

I see, but the problem is that while the seek amount definitely fits inside the buffer capacity, the “underlying read” amount might not fit inside the buffer capacity.

So one example: the buffer is 8192 bytes. peek(5) might need to return 3 buffered bytes and 2 freshly read bytes, which is done by reading from the underlying source, so I need at least 3 + 8192 bytes for the buffer. Is this enforced anywhere? Or rather, where and how should I enforce it?

I see that the contract of rebase is just 5 <= 8192, which is satisfied here. But the “minimum buffer len” requirement of the reader is not: it needs an unused capacity of 8192 bytes. And it is not clear where to put that check: do you just return 0 from stream and have it asserted somewhere in the Reader functions instead of looping forever, or do you add 8192 to the amount requested in rebase yourself so that it requires (requested_amount + 8192) bytes of buffer capacity, or… what?

Hope this clarifies.

TL;DR: You can peek up to the total capacity of your buffer. The seek/end positions have no bearing on that. You don’t need to concern yourself with manually rebasing.

  1. Assume that you have a buffer with a total capacity of 8192 bytes.
  2. The current seek position is 8190, and the end position is 8192.
  3. You attempt to peek 4 bytes.
  4. This triggers a rebase:
    • The 2 buffered bytes are moved to the beginning of the buffer
    • Seek is set to 0, end is set to 2.
    • The reader streams in more data, filling it to capacity
    • Seek is still 0, end is now 8192
  5. Control flow returns to the peek call you made, which has still not returned.
  6. It can now return the peeked bytes, which are now in the buffer.

This all happens transparently: you can just call peek and cast your worries aside. It will either return the bytes or an error if a read failed, no different from any other read operation.

Conversely, if you try to peek 9000 bytes, you will trigger an assert, as the requested number of bytes cannot exceed the buffer capacity.
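The rebase in step 4 above is just moving bytes and resetting two indices. Here is a minimal sketch of that arithmetic; the function and field names are assumptions for illustration, not the actual std.Io.Reader internals:

```zig
const std = @import("std");

// Sketch of what a defaultRebase-style move does: shift the still-buffered
// bytes to the start, then reset the two cursors. Illustrative only.
fn rebase(buffer: []u8, seek: *usize, end: *usize) void {
    const buffered = buffer[seek.*..end.*];
    // Overlapping forward copy moves buffered data to the start of the buffer.
    std.mem.copyForwards(u8, buffer[0..buffered.len], buffered);
    seek.* = 0;
    end.* = buffered.len;
}

test "peek near the end of an 8192-byte buffer rebases to seek=0, end=2" {
    var buffer = [_]u8{0} ** 8192;
    var seek: usize = 8190;
    var end: usize = 8192;
    rebase(&buffer, &seek, &end);
    try std.testing.expectEqual(@as(usize, 0), seek);
    try std.testing.expectEqual(@as(usize, 2), end);
    // The underlying read now has buffer.len - end = 8190 bytes to fill.
    try std.testing.expectEqual(@as(usize, 8190), buffer.len - end);
}
```

After the rebase, the reader fills the unused capacity from the source and the original peek can be satisfied from the front of the buffer.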

The reader streams in more data, filling it to the capacity.

This is exactly where it fails: I need a mandatory unused capacity of 8192 bytes due to the requirement of the underlying reader, but currently I only have 8190 due to the 2 buffered bytes! I cannot do a partial read of 8190 because 8192 is the minimum buffer len I need.

So: do I just assert and fail there, or should I handle it somewhere else that has the responsibility to check and report it?

I assume I must be missing something here, since the stream documentation says:

Implementations are encouraged to utilize mandatory minimum buffer sizes combined with short reads (returning a value less than limit ) in order to minimize complexity.

My question in this post is exactly about this. Hope this clarifies it further.

That is talking about the total buffer size, not the available capacity.

Whether there is already buffered data is irrelevant; it’s going to be data you would have gotten anyway.

Now that confuses me. By “irrelevant”, do you mean I can just toss the buffered data inside stream to make space for my freshly read bytes? That would break peek, because it still needs the buffered data.

Am I missing something here? :smiling_face_with_tear:

If you need to always ensure there is 8192 available for a read, such as if you are reading chunked data, decompressing, etc. where that size is needed, you can simply implement a custom stream function that never reads more than 8192 bytes instead of defaulting to fill the complete unused capacity. Your buffer size could then just be 8192 plus the maximum number of bytes you will ever need to peek.
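One way to sketch that suggestion (the names chunk_len and nextReadLen here are hypothetical, not std API): a stream implementation for a fixed-chunk source never performs a partial read, so it either consumes one full chunk of unused capacity or requests nothing until more room is available.

```zig
const std = @import("std");

const chunk_len = 8192; // fixed size every underlying read must use (assumption)

/// Hypothetical helper: how many bytes a fixed-chunk stream implementation
/// would request from the source, given the unused capacity in the buffer.
fn nextReadLen(unused_capacity: usize) usize {
    // Never a partial chunk: either one full chunk fits, or nothing is read.
    return if (unused_capacity >= chunk_len) chunk_len else 0;
}

test "a fixed-chunk read does not fit in 8190 bytes of unused capacity" {
    try std.testing.expectEqual(@as(usize, 0), nextReadLen(8190));
    try std.testing.expectEqual(@as(usize, chunk_len), nextReadLen(8192));
}
```

With a buffer sized max_peek + chunk_len, a rebase always leaves at least chunk_len bytes unused, so the zero-length case never occurs in practice.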

Say I have data that looks like [3 byte header][10 byte data]
Logically, I would use a 13-byte buffer, but let’s use 10 to explain why your problem doesn’t exist.
I have a reader with a 10 byte buffer.

I take(3); first, the reader fills the buffer, so it looks like this: [3-byte header][7 bytes of the data].

Then it takes the 3 requested bytes, so the buffer looks like this: [3-byte header (now read, considered unused)][7 bytes of the data],
and I have a slice of the header bytes to do stuff with.

Now I pass to my data processing, which take(10)s.
It’d rebase the buffer to look like [7 bytes of data][3 bytes unused],
then it’d fill the buffer: [10 bytes of data],
and the processing has a slice of all 10 bytes of data.

It is of course more efficient with a 13-byte buffer to account for my extra header, but it shouldn’t break any code that isn’t doing something sketchy.

The minimum buffer size is not how much you want the underlying reads to be; it’s how much you need to store temporarily to do your processing.

then it’d fill the buffer: [10 bytes of data]

Again, this is where it fails: I cannot do a partial read of 3 bytes just to fill the buffer. It is always a 10-byte read. In your example, I always need a 10-byte empty region of the buffer to fulfill an underlying read!

For the reason why, ForeverZer0 is correct: I am working with chunk- and decompression-related stuff, so I always need that exact amount.

I mean: I understand that the buffer must be larger to fit this; logically, buf.len >= peek + 8192. I could stuff a std.mem.Allocator inside my Reader to satisfy the hunger for buffer space.

Still, my question is: where do I enforce this? Where do I assert that one needs to pass a bigger buffer? Do I return an error in stream or rebase? Do I return 0 in stream when it happens and let somewhere else std.debug.assert it? Do I assert it myself inside rebase or stream?

Or… Where to enforce minimum buffer len?, as per the title.

When the reader is being initialized and a buffer is being passed to it.
The minimum size should be documented, and backed up with an assert/error return that executes on the first line of the function body.

Since it’s also the underlying read that has the restriction, you would also assert in that read. As the issue in their case comes from previously buffered data, it can’t be asserted in init alone.
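Putting both suggestions together, a hypothetical reader (ChunkReader, chunk_len, and the function names here are all assumptions, not std API) would assert the total buffer length at init time and the unused capacity at read time:

```zig
const std = @import("std");

const chunk_len = 8192; // minimum size of every underlying read (assumption)

const ChunkReader = struct {
    buffer: []u8,
    max_peek: usize,

    /// Hypothetical init: enforce the documented minimum buffer length here.
    fn init(buffer: []u8, max_peek: usize) ChunkReader {
        std.debug.assert(buffer.len >= max_peek + chunk_len);
        return .{ .buffer = buffer, .max_peek = max_peek };
    }

    /// Hypothetical read-time check: after a rebase, at most max_peek bytes
    /// remain buffered, so one full chunk must still fit past `end`.
    fn assertCanRead(self: ChunkReader, end: usize) void {
        std.debug.assert(self.buffer.len - end >= chunk_len);
    }
};

test "a buffer of max_peek + chunk_len always leaves room for one chunk" {
    var buf = [_]u8{0} ** (chunk_len + 16);
    const r = ChunkReader.init(&buf, 16);
    // Worst case after a rebase: max_peek bytes are still buffered.
    r.assertCanRead(r.max_peek);
}
```

The init assert documents the contract for callers; the read-time assert catches the buffered-data case that init alone cannot see.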

Makes sense, so buf.len >= max_peek + 8192. That is reasonable, I can do that.

Still, I am pretty sure that I need to assert peek_amount <= max_peek. Where should I put that as part of the contract of Reader: rebase, or stream? Do I return 0 and let some functions handle it? Do I std.debug.assert it, or return error.ReadFailed, etc.?

Sorry if I sounded patronising.

That makes more sense, but I am confused where that requirement comes from. Before, I assumed it was processing of the data. But now it sounds like a driver or library restriction (or protocol).

As @vulpesx pointed out, inside your stream function would be another good place for some safety checks. If you are restricted to fixed-size reads of 8192 bytes that cannot be done partially, I would sprinkle asserts all over the place to ensure things are performing as expected. As long as you aren’t modifying state to make the assertion, they are free, so no harm in being zealous with them.

For example, files opened with O_DIRECT require read()s to always be page-aligned. So you always need a buffer of at least one page to do a single read().

Thanks. So I think the answer is:

  • How much to enforce: take peek into account; enforce the minimum buffer len to be max_peek + 8192 instead of just 8192. I think this is the key thing I missed.
  • How to enforce: use std.debug.assert instead of returning errors or zeroes.
  • Where to enforce: in stream, where you assert that there is enough buffer space. This also takes care of the case where peek_amount > max_peek.

It makes sense now. I appreciate all the help; thanks @vulpesx and @ForeverZer0.