The cost/benefit of virtual functions and interfaces

How would a basic usage of, for example, an ArrayList look with inlining? Imagine a function that creates an ArrayList, resizes it, fills it in a loop, and returns it.

const std = @import("std");

pub export fn foo1() *anyopaque {
    const size: u8 = 32;
    var buffer: [size]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buffer);
    const allocator = fba.allocator();
    var arr = @call(.always_inline, std.ArrayList(u8).init, .{allocator});
    @call(.always_inline, std.ArrayList(u8).resize, .{ &arr, size }) catch unreachable;
    for (0..size) |i| {
        arr.items[i] = @intCast(i);
    }
    // return a pointer so that arr is not optimized away
    // (only for inspecting codegen; the pointer dangles once foo1 returns)
    return &arr;
}

It does not look great: imagine there are more ArrayList functions, and all of them have to be wrapped in @call. And if you forget to do it, the compiler won’t notify you.

But what if a function foo2 takes an Allocator as a parameter? What should the author of this function do? @call(.always_inline) every ArrayList function? It would help only if foo2 itself is inlined, and the author of the function can’t control that.
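
To make the situation concrete, here is a rough sketch of such a foo2 (its body is my illustration, not code from the thread):

const std = @import("std");

// foo2 only sees the type-erased std.mem.Allocator, so even with every
// ArrayList call inlined via @call, the allocator’s vtable calls stay
// indirect unless foo2 itself is inlined into a caller that knows the
// concrete allocator.
pub fn foo2(allocator: std.mem.Allocator, size: usize) !std.ArrayList(u8) {
    var arr = @call(.always_inline, std.ArrayList(u8).init, .{allocator});
    try @call(.always_inline, std.ArrayList(u8).resize, .{ &arr, size });
    return arr;
}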

Personally, I don’t want to specify inlining frequently, because it may hurt performance. I want to rely on the compiler for that.

That’s why I believe that the approach of @LucasSantos91 is better. Static polymorphism is a more basic building block than dynamic polymorphism, because you can get the latter from the former (by passing std.mem.Allocator as a comptime Allocator argument; a minimal sketch follows the lists below), but not vice versa. Advantages:

  1. you don’t have to remember to wrap function calls with @call(.always_inline);
  2. you let the compiler decide which functions should be inlined;
  3. it gives more possibilities to the user (the user may decide whether they want to pass a concrete allocator or the interface std.mem.Allocator).

Disadvantages stem from the fact that it is harder to use.

  1. The user has to understand whether to pass their ConcreteAllocator or *ConcreteAllocator. I don’t know how to improve it properly.
  2. It may lead to code bloat if used extensively. It can be mitigated by passing std.mem.Allocator at call sites (the user decides where exactly they want to do that).
  3. Duck typing with anytype. I believe it should be solved by something like concepts from C++.
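
To make it concrete, here is a minimal sketch of that building block (Buffer is a made-up example, not @LucasSantos91’s actual code):

const std = @import("std");

// The allocator *type* is a comptime parameter. Any type that duck-types
// std.mem.Allocator’s alloc/free methods works; instantiating with
// std.mem.Allocator itself recovers dynamic dispatch as a special case.
fn Buffer(comptime Allocator: type) type {
    return struct {
        const Self = @This();

        allocator: Allocator,
        items: []u8,

        fn init(allocator: Allocator, size: usize) !Self {
            return .{
                .allocator = allocator,
                .items = try allocator.alloc(u8, size),
            };
        }

        fn deinit(self: *Self) void {
            self.allocator.free(self.items);
        }
    };
}

// Dynamic polymorphism recovered from the static building block:
const DynBuffer = Buffer(std.mem.Allocator);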

Hey Tosti - good to hear from you again, it’s been a while.

Yeah, and I think @LucasSantos91’s idea is good and can be broadened out to the more general case. In general, I think relying on concretization where we can is usually the right approach. It’s basically saying “I know what I want, and I want it here.”

Where that can be applied, it should be, and this is another case of that. The strategy here is to push the erased thing further down the chain and use it as a fallback (if we’re still trying to tackle the problem of strategizing around erased things).
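
As a rough sketch of that strategy (the names here are made up, not anyone’s actual code):

const std = @import("std");

// The comptime-allocator flavor is the primary building block; a single
// instantiation with the erased std.mem.Allocator is kept as the
// fallback, pushed down the chain to where the concrete type genuinely
// isn’t known.
fn Pool(comptime Allocator: type) type {
    return struct {
        allocator: Allocator,
        // ... methods would call allocator.alloc / allocator.free ...
    };
}

// Concretize where we know what we want; fall back to erasure elsewhere.
const DynPool = Pool(std.mem.Allocator);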


Indirect calls typically require one or two cycles more. In practice they often take no more time than direct calls, since the call instruction can run in parallel with other instructions. Whether a call is fast depends on whether the branch target predictor manages to correctly guess the target address. If not, we pay a penalty. The penalty is lower for a direct call, since the mistake can be detected earlier, right after decode. For an indirect call, we have to wait for the load instruction to finish. If the actual target isn’t in the L1 cache, we pay another penalty, and a bigger one still (30 to 40 cycles) if it’s not in the L2 cache either. Cache misses are less likely for direct calls, I believe, since the prefetcher would have brought in the code ahead of time.
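
In Zig terms, the two call shapes being compared look roughly like this (a minimal sketch; the names are mine):

const std = @import("std");

fn addOne(x: u64) u64 {
    return x + 1;
}

pub fn main() void {
    var acc: u64 = 0;

    // Direct call: the target is encoded in the instruction, so a
    // misprediction can be detected right after decode.
    acc = addOne(acc);

    // Indirect call: the target must first be loaded through the
    // function pointer, so the CPU confirms the prediction later.
    var fp: *const fn (u64) u64 = &addOne;
    std.mem.doNotOptimizeAway(&fp); // keep the compiler from devirtualizing
    acc = fp(acc);

    std.debug.print("{}\n", .{acc});
}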

Simpler CPUs like the Intel Atom don’t perform sophisticated branch target prediction; they just use the last address seen. A situation where two virtual functions alternately evict each other from the branch target buffer (BTB) is quite possible.
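
For instance, a pattern like the following (a made-up sketch, not from the post) would defeat a last-address-seen predictor: one indirect call site whose target alternates on every iteration.

const std = @import("std");

fn f(x: u64) u64 {
    return x + 1;
}

fn g(x: u64) u64 {
    return x * 2;
}

pub fn main() void {
    const targets = [_]*const fn (u64) u64{ &f, &g };
    var acc: u64 = 0;
    for (0..1000) |i| {
        // The target alternates f, g, f, g, ... so a predictor that only
        // remembers the last seen address mispredicts every call.
        acc = targets[i % 2](acc);
    }
    std.debug.print("{}\n", .{acc});
}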
