The cost/benefit of virtual functions and interfaces

I’ve been thinking about this on and off for years now and I’ve had different takes depending on the year and circumstance.

Virtual functions are deeply disputed with regard to their performance characteristics (or lack thereof). However, there’s a talk I’d like to introduce to start this discussion: https://www.youtube.com/watch?v=i5MAXAxp_Tw

In this talk, the presenter makes some important points about issues surrounding benchmarks and arrives at several surprising results. One result that seems sound is the profound role that CPU caches play in the cost of calls more generally. The takeaway the presenter offers is that a highly circumstantial set of factors determines the performance of virtual functions.

I think this is an important subject because many of our important interfaces in the standard library are virtual table based (think Allocator). We’ve also had topics about the apparent lack of meta-programming interfaces that could be supplemented with interface types like Allocator if the cost is affordable. I’m hoping to garner some conversation around these trade-offs and maybe even see people post some numbers from their systems with their CPU specs.

If anyone is interested, we could draft a virtual-vs-concrete Zig program that users could download and run on their systems, to see whether we get convergence/divergence similar to what people report with GCC and Clang. I think it would be a valuable experiment to get some data points behind the talking points. Anyone interested in that?
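As a starting point, here’s a rough sketch of what such a program could look like (the Shape/Square names and the numbers are made up, and a real version would need care to keep the optimizer from deleting the loops):

const std = @import("std");

const Shape = struct {
    ctx: *const anyopaque,
    areaFn: *const fn (ctx: *const anyopaque) u64,

    fn area(self: Shape) u64 {
        return self.areaFn(self.ctx);
    }
};

const Square = struct {
    side: u64,

    fn area(self: *const Square) u64 {
        return self.side * self.side;
    }

    fn areaErased(ctx: *const anyopaque) u64 {
        const self: *const Square = @ptrCast(@alignCast(ctx));
        return self.area();
    }
};

pub fn main() !void {
    const iterations: usize = 100_000_000;
    const sq = Square{ .side = 3 };
    const iface = Shape{ .ctx = &sq, .areaFn = &Square.areaErased };

    var timer = try std.time.Timer.start();
    var sum: u64 = 0;

    // Concrete call: the compiler can see exactly which function is called.
    for (0..iterations) |_| sum +%= sq.area();
    const concrete_ns = timer.lap();

    // "Virtual" call: goes through a type-erased pointer and a function pointer.
    for (0..iterations) |_| sum +%= iface.area();
    const virtual_ns = timer.lap();

    // Keep `sum` observable so the loops aren't removed entirely.
    std.mem.doNotOptimizeAway(sum);
    std.debug.print("concrete: {d} ns, virtual: {d} ns\n", .{ concrete_ns, virtual_ns });
}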

9 Likes

Another interesting data point would be the difference between virtual calls and single function pointers - if your interface only requires one function to be implemented, then a single function pointer is probably a better approach.

For me, the main benefit of virtual calls through interfaces is flexibility. If Allocator were a concrete type instead, I could generate every combination of some data structure with every allocator I use it with, but those data structures would themselves be separate types, which would make it awkward to work with these hyper-specialized data structures.

Sometimes you just want to use a list of lists, where some of the sublists use this allocator and some use that one. (Or other data structures that may use one allocator in one situation and another in a different situation.)

Basically, every time you want to write code that doesn’t care what the allocator is (because some other code already cares about that and made a decision), the allocator interface seems useful, so you don’t have to generate n variants with different allocators.
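For example, here’s a rough sketch of the list-of-lists case (this assumes the managed std.ArrayList API, which differs between Zig versions):

const std = @import("std");

pub fn main() !void {
    var gpa_state = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa_state.deinit();
    const gpa = gpa_state.allocator();

    var arena_state = std.heap.ArenaAllocator.init(gpa);
    defer arena_state.deinit();
    const arena = arena_state.allocator();

    // Because std.mem.Allocator is type-erased, sublists backed by different
    // allocators still have the same type and can live in one outer list.
    var outer = std.ArrayList(std.ArrayList(i32)).init(gpa);
    defer outer.deinit();

    try outer.append(std.ArrayList(i32).init(gpa));
    try outer.append(std.ArrayList(i32).init(arena));
}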

So my main question is, what would be the proposed alternative to interfaces?
How do you avoid a combinatorial explosion of the code size, in a different, possibly faster way?

2 Likes

I think that’s a very interesting question and, for the record, we see the same problem; I’m in total agreement about the issue of combinatorial explosion.

The important question that I’m driving at is why people opt for one method over the other. In my estimation (and in the estimation of the presenter), it’s the commonly held rule that indirection is less performant. This leads to alternative implementations like unions (or variants, as some folks call them). The idea there is that a switch statement will perform better than an indirection (or a double indirection for virtual methods).

If this is only circumstantially true, as the presenter argues, then we can resolve many issues, like the virality of anytype, by using interfaces instead where customization points are ideal.

Even today, when I write a virtual table, I furrow my brow and wonder if I’m accepting sub-optimal performance. However, if this is more directly influenced by hardware considerations like caching, then that changes my impression of what the actual cost is.

What I’m trying to call into question is the validity of performance assumptions. If languages like C++ still have open debates about this sort of thing, I’m sure you’ll agree that it isn’t a closed issue for Zig. The only way I know how to answer this question is to measure, but there are pitfalls there regarding things like AMD vs Intel hardware, and I can only see what’s directly on my machine. That’s why I’m suggesting a public project.

1 Like

If you disallow adding new interface implementations at runtime (for example, by dynamically linking a library), then the interface fat pointer and the tagged union are logically the same. (If you allow adding new implementations at runtime, the tagged union is unlikely to be able to add a case at runtime.)

I would argue that it would be neat if you could write your program in a way where you can switch between both (in this case where the set of implementations is known).

With a compiler that supports hot-code reloading, it might even be conceivable that it could reload and add a new implementation into all the relevant switch expressions. The cool thing about that would be that both versions would then be very similar (in terms of how they can be used).

Without that, the interface fat pointer seems more like the open-ended solution and the tagged union like the closed one.
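To make the comparison concrete, here’s a rough sketch of the two shapes of dispatch (the ShapeUnion/ShapeIface/Circle/Square types are made up for illustration):

// Closed set of implementations: dispatch via a tagged union and a switch.
const ShapeUnion = union(enum) {
    circle: Circle,
    square: Square,

    pub fn area(self: ShapeUnion) f64 {
        return switch (self) {
            .circle => |c| c.area(),
            .square => |s| s.area(),
        };
    }
};

// Open set of implementations: dispatch via a fat pointer (context + function pointer).
const ShapeIface = struct {
    ctx: *const anyopaque,
    areaFn: *const fn (ctx: *const anyopaque) f64,

    pub fn area(self: ShapeIface) f64 {
        return self.areaFn(self.ctx);
    }
};

const Circle = struct {
    radius: f64,
    pub fn area(self: Circle) f64 {
        return 3.141592653589793 * self.radius * self.radius;
    }
};

const Square = struct {
    side: f64,
    pub fn area(self: Square) f64 {
        return self.side * self.side;
    }
};

With a closed set of implementations, both express the same dispatch; the switch just enumerates the cases explicitly, while the fat pointer leaves the set open.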

Introducing linking or hot swapping will definitely change things, and I think it would be interesting as well.

For the sake of this topic, here’s a situation:

We have a function foo that can take in an object bar and call methods on it. We can write foo in several ways, but here are just two:

pub fn foo(bar: anytype) void {
    bar.doSomething();
    // more stuff
    bar.doSomethingElse();
}
pub fn foo(bar: VirtualInterface) void {
    bar.doSomething();
    // more stuff
    bar.doSomethingElse();
}
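For concreteness, VirtualInterface isn’t defined above; a hypothetical sketch of it would be a fat pointer over a small vtable:

pub const VirtualInterface = struct {
    ctx: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        doSomething: *const fn (ctx: *anyopaque) void,
        doSomethingElse: *const fn (ctx: *anyopaque) void,
    };

    pub fn doSomething(self: VirtualInterface) void {
        self.vtable.doSomething(self.ctx);
    }

    pub fn doSomethingElse(self: VirtualInterface) void {
        self.vtable.doSomethingElse(self.ctx);
    }
};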

The common perception here is that bar: VirtualInterface performs worse. Even for this simple example, there are a lot of assumptions upfront. If those assumptions depend on hardware considerations though, then we may actually get stronger typing and less ambiguous APIs without suffering the performance penalties that are often assumed upfront.

1 Like

The problem here is assumptions, and as we all know, assumptions kill (kudos to those who get the reference). For example, recently at school a friend of mine and I were discussing the choice of data structures for a school project (in C). He wanted to use a linked list, and I told him he would be better off using an array, since in most cases I can memmove faster than he can insert/remove in a linked list (at least up to a certain point). We took the time to run tests on the school’s old computer and I was right: for the input size we were expecting (a few KB), the memmove was indeed faster when doing a lot of operations. But when I ran the same test on my M2, the linked list was actually faster for the size we were expecting.

This is a silly example, but the point is that hardware is moving faster than we can all imagine, and today’s truth might not be tomorrow’s truth. So while considering performance for the features one might want to add to a language, or the API design one might choose, it’s important to keep that in mind.

As a rule of thumb, from my understanding of modern hardware, the main sources of slowdown in modern CPUs are cache misses (poor cache locality) and branch mispredictions; more often than not the CPU is simply waiting for memory to give it something to work with. In that regard, less indirection and more locality would probably be the best option.

But recently I stumbled upon a video discussing recent progress in memory hardware that would make it significantly cheaper, larger, and faster. The larger part is not as important as the faster part: faster caches could potentially make the cost of indirection less relevant in the future. Of course this is highly speculative, but maybe it’s a good reason not to be too concerned with that sort of implementation detail. Or maybe I’m wrong.

1 Like

Apologies in advance for a slightly cranky tone here, didn’t have my morning pu-erh yet :slight_smile:

In my opinion, this is not a good talk. It needlessly mystifies the topic, without communicating the absolute minimum amount of clarity that is required in any virtual-vs-concrete call discussion. It’s like a tribal shaman pointing at the thunder and lightning and claiming them to be the wrath of gods, instead of, you know, explaining this electricity thing.

This is the core fact about virtual functions:

You don’t compare indirect calls with direct calls. You compare a call with fully inlining the function at the call site.

The first benchmark they present is excellent to make this point, but the point isn’t made in the presentation.

The code looks roughly like this (pardon me for my C++):

// Somewhere in your codebase
struct Base {
    int concrete() { return 7; }
    virtual int virt() { return 7; }
};

// Call-site
void benchmark(Base* b, int iteration_count) {
    Timer timer; // Made up class

    timer.start();
    for(int i = 0; i < iteration_count; i++) {
        b->concrete();
    }
    timer.print_elapsed();    

    timer.start();
    for(int i = 0; i < iteration_count; i++) {
        b->virt();
    }
    timer.print_elapsed();    
}

The question is, which loop is going to be faster, the one with virtual calls or the one with concrete ones?

The thing is, the answer here is crystal clear. You might get a different result, but that would reveal a flaw in your benchmark or in your program.

The concrete loop should be infinitely faster than the virtual one. The concrete loop should be compiled away, while, for the virtual one, there must be iteration_count calls present.

The thing about a concrete call is that it’s concrete — there’s one specific function in the program being called, the compiler knows about it, and it should be able to inline it and then constant-fold the entire thing to nothing.

In contrast, the thing about a virtual call is that it is unknowable. There might be any number of implementations of virt in the code base, and one could even be loaded at runtime from a dynamic library (.so). As the function could do anything (e.g., print to the screen or change the program’s global state), it would be incorrect for the compiler to eliminate the call.


Now, of course you might get a different result in practice. For example, you might notice that the concrete call is not a no-op. This generally means that you’ve messed up something about your physical architecture, so that the compiler lost access to the body of the function. E.g., in Rust, you might be hitting effects described in Inline In Rust.

You might also see that your virtual calls take no time. That means that the compiler was able to devirtualize the call. This is surprising — it means that somehow the compiler figured out that there is in fact only one function that can actually be called here and inlined it, going against your intention as a programmer to declare the code as runtime polymorphic. This likely means that you should just change your code to explicitly communicate to the compiler that no virtual call is possible there.


EDIT: to clarify, I am not trivializing the problem here. Where exactly you put the type-erasure/virtual-call boundary is a hard design problem! Similarly to how lightning is a pretty complicated phenomenon. But if you start addressing the issue with the “wrath of Caches”, I don’t think you’ll get anywhere.

7 Likes

I didn’t take it as cranky. As per usual, you make good points.

This was actually brought up at one point, as was the point about the fallacy of comparing customization points to direct calls. They aren’t the same thing and shouldn’t be treated as such.

I’d like to point out that I think the presenter should have been more curious about the generated assembly. I’m not in agreement with the presentation on a lot of points, but it brings up the issue in a way that most people can start a discussion from.

In fact, my take here is quite the opposite of yours, but to a similar conclusion. I think the issue was oversimplified (which can have the same effect as mystification). Now, that works effectively as a presentation strategy because most people probably won’t crack the assembly open either. That said, relatability doesn’t always warrant avoiding the cost of rigor.

That’s why I’m curious about getting some data on this. Unknowable in this case just means “depends on the circumstances”. In other words, if you bet money on it, you’ll probably lose a lot of dough. In practice, I don’t think someone can make a safe bet on this, but observing trends is not beyond the pale here.

For sure, and it’s not a trivial problem. Some folks have an allergic reaction to virtual tables that may be well warranted, and betting on de-virtualization to fix slow code paths isn’t a strong enough stance, imo.

All this said, I think having a utility where people can see the effect of virtualization and whether or not their machine handles it like they expect isn’t a bad thing for the Zig community to have. That’s why I’m interested in brainstorming it rather than trying to get people to pick a side.

2 Likes

Faster hardware just means that good utilization of that hardware is even better; it doesn’t make badly performing software better / less bad. Specific forms of indirection on certain hardware might be OK; I think testing different cases is a good idea, to see what works better. I also agree that today’s hardware may perform differently than tomorrow’s hardware.

Ideally you would have a multitude of programs that are implementing many variations, to give you a way to tell which cases work best on your particular hardware.

In the case where different concrete types need to interact, like in the list-of-lists example you mentioned, you can simply instantiate the meta-type with a type erasure. Since the current std.mem.Allocator is already a type erasure, you could have something like this:

pub fn List(comptime T: type, comptime Allocator: type) type;

Then the concrete type is instantiated like this:

const Element = List(i32, std.mem.Allocator);
const ConcreteList = List(Element, MyConcreteAllocator);

The top-level list uses an allocator directly, but the elements use a runtime-determined allocator. This allows users to pay only for what they need. When they need the runtime polymorphism, they can get it (and pay for it) by using the indirection through std.mem.Allocator. Those who don’t need the indirection, like the top-level list, use whatever allocator they want directly, without paying the cost of indirection. This technique also allows us to avoid binary size explosion, as there are only two instantiations of the List, one with the std.mem.Allocator type erasure and one with the concrete allocator, even if the individual elements use a lot of other allocators.
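For illustration, here’s a minimal sketch of what such a List could look like (hypothetical, not std code; it assumes the allocator type exposes alloc/free-style methods directly, the way std.mem.Allocator does):

pub fn List(comptime T: type, comptime Allocator: type) type {
    return struct {
        const Self = @This();

        items: []T = &.{},
        allocator: Allocator,

        pub fn init(allocator: Allocator) Self {
            return .{ .allocator = allocator };
        }

        pub fn deinit(self: *Self) void {
            // With a concrete Allocator this call is statically resolved and can
            // be inlined; with std.mem.Allocator it goes through the fat pointer.
            self.allocator.free(self.items);
        }
    };
}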

3 Likes

Faster hardware just means that good utilization of that hardware is even better, it doesn’t make badly performing software better / less bad.

Yes, this is entirely true. What I was saying is more like: if something was slow and now it’s fast, maybe it comes to a point where it’s actually on par with, or so close to, the original fast option that it’s not worth trying to “optimize” it, if that makes sense. For example, in my first test the array/memmove was around 400% faster on the school computer in that micro-micro-benchmark for that very specific input size, but on my computer it was actually faster to use a linked list. Again, the nature of the example is not the important part of the argument. What’s important to consider is that one particular operation that used to cost 400 cycles on a CPU might cost 30 cycles on a new architecture. And if the fast operation went from 100 cycles down to 25, then yes, the faster option is indeed still faster, and at scale those 5 cycles might still make a significant difference. But as hardware progresses more and more, those sorts of micro-decisions become less effective at predicting the performance signature of a particular program or function.

As for virtual function calls specifically, it has already been said, but I have a hard time imagining that significant progress in hardware/compilers/linkers could reduce the cost of that type of operation. But it’s also worth considering that the cost of virtualization you pay upfront might lead to a more efficient and, in the end, faster solution, like in the case of the Allocator interface: what you lose upfront can pay for itself later if your allocation strategy improves the performance of your program.

Stupid question.

Consider a function that is doing something with a relatively large object (a struct or an array).
Say, I want to just increment each element of an array by a given value (it does not matter whether this is done with SIMD or “manually”).

And imagine I have two implementations.

  • direct call, pass the entire array by value and return the result by value too.
  • indirect call (via a virtual method table), but pass a pointer and do the modifications in place

Which one will be faster? I guess the second one.
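Here’s a rough sketch of the two variants (made-up names; a plain function pointer stands in for the virtual call):

const std = @import("std");

const N = 4096;

// 1. Direct call: the whole array is passed and returned by value, so the
//    compiler may have to copy it in and out (unless it optimizes the copies away).
fn incrementByValue(arr: [N]u64, delta: u64) [N]u64 {
    var out = arr;
    for (&out) |*e| e.* += delta;
    return out;
}

// 2. Indirect call through a function pointer, but the array is modified in place.
const IncrementFn = *const fn (arr: *[N]u64, delta: u64) void;

fn incrementInPlace(arr: *[N]u64, delta: u64) void {
    for (arr) |*e| e.* += delta;
}

pub fn main() void {
    var data = [_]u64{0} ** N;
    data = incrementByValue(data, 1);

    const f: IncrementFn = &incrementInPlace;
    f(&data, 1);

    std.debug.print("{d}\n", .{data[0]});
}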

Invoking a function is not only the call itself; there are also procedures for passing arguments and returning the result. Why are you talking about the call per se only? Passing arguments / returning the result can have a much greater influence on performance than the difference between a direct and an indirect call.

If I am talking nonsense, just never mind :slight_smile:

1 Like

Even if we ignore the direct vs indirect function call, the issue of passing by value and returning vs passing a pointer is an excellent question on its own, and I’ve asked myself that many times.

Wild variability in performance is probably an even stronger argument against virtual functions than lower performance. Who wants to deal with this sort of unpredictability? Imagine, your code is working fine. Then your colleague willy-nilly sticks an inline in front of a function. The large increase in branch instructions overwhelms the branch predictors of all but the latest and greatest CPU. Your colleague happens to be using that.

1 Like

Maybe, at least while there are still many other slow parts, but this is sort of irrelevant in a discussion about what is faster, or what it costs.

Yes, you are right that there may be cases where you make the personal decision “good enough for now” or “almost the same”, but that doesn’t necessarily mean it is good enough for everyone.

I think it is fine to argue that the gained flexibility/convenience is worth the dip in performance in many cases; however, I don’t think that paying this cost upfront somehow improves performance. Compared to what?

Also it would be good to know what you are trading, instead of just saying it will hopefully buy me something.

I am more interested in the performance of making the choice.

That choosing between different implementations can have an algorithmic benefit is clear, but here we are more focused on different ways in which making that choice could be done and what performance implications that has.

You can always come up with examples where some program spends 90% of its time on some other large problem, or has other problems that add bigger slowdowns.

1 Like

You can find “virtual functions” in every Unix-like OS. I mean that (for instance) read() is polymorphic: when you read from a terminal (or a terminal emulator, like xterm), the code being executed in kernel space is not the code executed when you read from a socket; the execution paths are completely different.

It seems to me you are mixing it all together. Correct me if my impression is wrong.

Let’s decompose “function call”:

  • step 1: pass arguments
  • step 2: do a call (directly or indirectly)
  • step 3: return result

I guess the overhead introduced at step 2 (direct vs indirect) is negligible in comparison with the overhead that can be “achieved” at steps 1 and 3. How exactly you call a function is mostly controlled by the compiler (including the programmer’s desire to inline), but the way you pass arguments (and also how you return the result) is mostly controlled by the programmer, isn’t it?

There is almost no problem for a compiler designer to implement (emit machine code for) step 2, but steps 1 and 3 are a torture :slight_smile:

This is precisely why I’d like to move this discussion towards examples and I have a very simple one to start with that shows the compiler’s ability to inline a function pointer.

In this case, I have an interface struct with a single function pointer, and a concrete type that returns a value. I’m const-qualifying everything, and the compiler’s ability on ReleaseFast is impressive here:

pub const Interface = struct {
    func: *const fn(*const anyopaque) usize,
    ctx: *const anyopaque,

    pub inline fn call(self: Interface) usize {
        return self.func(self.ctx);
    }
};

pub const Concrete = struct {
    x: usize = 42,

    pub fn foo(ctx: *const anyopaque) usize {
        const self: *const Concrete = @ptrCast(@alignCast(ctx));
        return self.x;
    }

    pub fn interface(self: *const Concrete) Interface {
        return Interface {
            .func = &foo,
            .ctx = self,
        };
    }
};

pub export fn bar() usize {
    const concrete = Concrete{ };

    const interface = concrete.interface();

    return interface.call();
}

Compiled with ReleaseFast:

bar:
        mov     eax, 42
        ret

Now, this is a very simple example, but I’d like to know the extent of this. In this case, that’s the same as a concrete function call.

This is why I’d like to experiment with this and move away from the philosophical discussions so we can really see what the effects of these interface types are under different circumstances.

I’ll play around with changing const qualifications and adding more indirection and complexity, but ya gotta start somewhere.

3 Likes

For instance, this is not affected by default values. By changing the value of the struct in the function call, we see that we get the same effect:

pub export fn bar() usize {
    const concrete = Concrete{ .x = 32 };
    const interface = concrete.interface();
    return interface.call();
}
bar:
        mov     eax, 32
        ret

Likewise, changing const interface and const concrete to var interface and var concrete has the same outcome:

pub export fn bar() usize {
    var concrete = Concrete{ .x = 32 };
    var interface = concrete.interface();
    return interface.call();
}
bar:
        mov     eax, 32
        ret

So how far can we push this? What factors cause this to fail and are there general rules we can develop to understand the actual cost of interface types and indirection?

I had a thought about introducing ambiguity here using functions that take an interface and came up with an example that taught me something interesting about using interfaces and function pointers more generally.

Using the example above, I’ll make an ambiguity point in the code:

pub fn useInterface(interface: Interface) usize {
    return interface.call();
}

useInterface has ambiguity because it doesn’t have the surrounding context regarding which type the interface came from. In the example above, the compiler quickly figured out that we were going through concrete. First, we can check this assumption by forcing the call to be always inline:

pub export fn bar1() usize {
    const concrete = Concrete{ };
    const interface = concrete.interface();
    return @call(.always_inline, useInterface, .{ interface });
}

As expected, we get the following again:

bar1:
        mov     eax, 42
        ret

Now if we outline the function instead, we see the ambiguity here:

bar1:
        mov     edi, offset example.Concrete.foo
        mov     esi, offset __anon_1458
        jmp     example.useInterface

example.useInterface:
        mov     rax, rdi
        mov     rdi, rsi
        jmp     rax

example.Concrete.foo:
        mov     rax, qword ptr [rdi]
        ret

__anon_1458:
        .quad   42

In bar1 we have a jump instruction to example.useInterface and useInterface now has no special knowledge of the type being used to generate the interface variable.
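(The snippet that produces this outlined version isn’t shown above; one hypothetical way to force it is to disallow inlining at the call site:)

pub export fn bar1() usize {
    const concrete = Concrete{};
    const interface = concrete.interface();
    // .never_inline keeps useInterface as a real call, so the function pointer
    // inside the Interface cannot be seen through at this call site.
    return @call(.never_inline, useInterface, .{interface});
}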

So one interesting factor here is that, for functions parameterized with an Interface type, you can control their behavior much better using inlining, which allows the compiler to see through the function pointer using the surrounding context. In other words, you can get direct-call-like behavior depending on how you handle these ambiguity points, where possible.

Worthy of note: if you provide both versions (one inlined, one outlined), you can actually change the behavior of the inlined function as well:

bar1:
        mov     rax, qword ptr [rip + __anon_1459]
        ret

bar2:
        mov     edi, offset example.Concrete.foo
        mov     esi, offset __anon_1459
        jmp     example.useInterface

example.useInterface:
        mov     rax, rdi
        mov     rdi, rsi
        jmp     rax

example.Concrete.foo:
        mov     rax, qword ptr [rdi]
        ret

__anon_1459:
        .quad   42

Here, bar1 is the inlined function call. We can see that because there is an outlined version, it references __anon_1459 even though it has the surrounding context regarding which type created it. It no longer directly returns 42.

This is part of the solution to my original question:

If the foo that takes an interface bar is inlined where bar’s parent type is observable, then yes, it can have identical behavior. That does not mean that they are the same at all; they serve two separate purposes. But this helps spell out the cost of indirection-based types that deploy some kind of erasure.

2 Likes