How to break a switch body?

zigo · September 11, 2023, 8:27am

If looks only the h way works in the following code.
Am I missing something?

const std = @import("std");

fn f() void {
	var n:usize = 5;
	switch (n) blk: {
		else => {
			break: blk;
		},
	}
}

fn g() void {
	var n:usize = 5;
	blk: switch (n) {
		else => {
			break: blk;
		},
	}
}

fn h() void {
	var n:usize = 5;
	blk: { switch (n) {
		else => {
			break: blk;
		},
	}}
}

pub fn main() void {
	f();
	g();
	h();
}

AndrewCodeDev · September 11, 2023, 8:47am

Check out this thread if you haven’t already: Multiline IF blocks as expressions?

From the Zig documentation:

All branches of a switch expression must be able to be coerced to a common type.

We’d have a real conundrum on our hands given that you can return values from a switch. For instance, if we assign from a switch statement but could also arbitrarily break them, you could have inconsistent return values (one would be void, the other might be int).

To be clear, when you say “break from a switch”, are you referring to the classic case of fallthrough (which Zig does not have)?

IntegratedQuantum · September 11, 2023, 8:54am

Since there is no case fallthrough in zig, breaking from the inner block also has the same effect:

switch (n) {
	else => blk: {
		if(someCondition) break :blk;
		...
	},
}

In my opinion this is more readable because the label is closer to the break.

kristoff · September 11, 2023, 9:46am

IntegratedQuantum:

Since there is no case fallthrough in zig, breaking from the inner block also has the same effect:
switch (n) {
	else => blk: {
		if(someCondition) break :blk;
		...
	},
}
In my opinion this is more readable because the label is closer to the break.

This is the preferred way of doing it.

zigo · September 11, 2023, 4:19pm

Since there is no case fallthrough in zig, breaking from the inner block also has the same effect: …

I’m aware of this. But there are tons of branches. Labeling every one seems verbose.

zigo · September 11, 2023, 4:24pm

From the Zig documentation:
All branches of a switch expression must be able to be coerced to a common type.
We’d have a real conundrum on our hands given that you can return values from a switch. For instance, if we assign from a switch statement but could also arbitrarily break them, you could have inconsistent return values (one would be void, the other might be int).

I’m not sure I totally understand these. Is there any bad of the pattern used in function h?

To be clear, when you say “break from a switch”, are you referring to the classic case of fallthrough (which Zig does not have)?

I mean making execution jump out of the whole switch block.

AndrewCodeDev · September 11, 2023, 9:19pm

No problem, let’s have a look at some pseudo code to understand what is meant by that statement in the documentation. I’m using pseudo code because the example would not compile, so this is a counter factual.

Let’s say that I want to have a variable assigned to the result of a switch statement:

var result = switch (value) {
    /// Branch 1: 
    ///     return an integer

    /// Branch 2:
    ///     if predicate, break the switch
    ///     else return an integer
}

If we go through branch 1, we’ll return an integer - so the type of “result” would be the same integer type.

However, in branch 2, if the predicate is true, we’ll just break. So that leaves the question - if we break, then what is the type of “result”? It’s not an integer, because we aren’t returning an integer, we’re just breaking out of the statement (it would probably be “void” because we aren’t returning anything). That’s the problem - one code path returns integers, the other code path can return void.

So returning to the documentation:

All branches of a switch expression must be able to be coerced to a common type.

We can’t coerce the types to be each other (void → int or int → void).

zigo · September 12, 2023, 5:50am

Thanks for the more detailed explanations. So the switch and if-else blocks are always treated as expressions. That is reasonable.

andrewrk · September 14, 2023, 5:42am

introduce labeled continue syntax inside a switch expression

opened 06:39AM - 12 Mar 21 UTC

andrewrk

proposal accepted bounty

## Background The `goto` keyword was removed in #630. This remains the right …call because all the control flow in Zig can be expressed in a better way: `continue` to goto backwards, and `break` to goto forwards. However, in C, there is another concept, called "computed goto". This is described in #5950 and briefly discussed in #2162. This concept is not currently possible in Zig. It is possible to model the desired semantics with existing control flow features quite simply, but it is not possible to obtain the desired machine code, even in optimized builds. ## Problem Statement For example ([godbolt link](https://godbolt.org/z/fKd7Y3)): ```zig const Inst = extern struct { tag: Tag, const Tag = extern enum { add, addwrap, alloc, alloc_mut, alloc_inferred, alloc_inferred_mut, anyframe_type, array_cat, array_mul, array_type, array_type_sentinel, indexable_ptr_len, }; }; export fn entry(inst_list_ptr: [*]const Inst, inst_list_len: usize, map: [*]u32) void { const inst_list = inst_list_ptr[0..inst_list_len]; for (inst_list) |inst, i| { map[i] = switch (inst.tag) { .add => analyze_add(inst), .addwrap => analyze_addwrap(inst), .alloc => analyze_alloc(inst), .alloc_mut => analyze_alloc_mut(inst), .alloc_inferred => analyze_alloc_inferred(inst), .alloc_inferred_mut => analyze_alloc_inferred_mut(inst), .anyframe_type => analyze_anyframe_type(inst), .array_cat => analyze_array_cat(inst), .array_mul => analyze_array_mul(inst), .array_type => analyze_array_type(inst), .array_type_sentinel => analyze_array_type_sentinel(inst), .indexable_ptr_len => analyze_indexable_ptr_len(inst), }; } } extern fn analyze_add(inst: Inst) u32; extern fn analyze_addwrap(inst: Inst) u32; extern fn analyze_alloc(inst: Inst) u32; extern fn analyze_alloc_mut(inst: Inst) u32; extern fn analyze_alloc_inferred(inst: Inst) u32; extern fn analyze_alloc_inferred_mut(inst: Inst) u32; extern fn analyze_anyframe_type(inst: Inst) u32; extern fn analyze_array_cat(inst: Inst) u32; extern fn analyze_array_mul(inst: Inst) u32; extern fn analyze_array_type(inst: Inst) u32; extern fn analyze_array_type_sentinel(inst: Inst) u32; extern fn analyze_indexable_ptr_len(inst: Inst) u32; ``` In the generated machine code, each prong ends up jumping back to the loop condition, before getting re-dispatched to the next prong: ```asm .LBB0_3: xor edi, edi call analyze_add jmp .LBB0_15 .LBB0_4: mov edi, 1 call analyze_addwrap jmp .LBB0_15 ``` The reason this machine code is not what we desire is described in [this paper](http://www.cs.toronto.edu/~matz/dissertation/matzDissertation-latex2html/node6.html) in the section "Direct Threading" and "The Context Problem": > Mispredicted branches pose a serious challenge to modern processors because they threaten to starve the processor of instructions. The problem is that before the destination of the branch is known the execution of the pipeline may run dry. To perform at full speed, modern CPUs need to keep their pipelines full by correctly predicting branch targets. > This problem is even worse for direct call threading and switch dispatch. For these techniques there is only one dispatch branch and so all dispatches share the same BTB entry. Direct call threading will mispredict all dispatches except when the same virtual instruction body is dispatched multiple times consecutively. They explain it in a nice, intuitive way here: > Another perspective is that the destination of the indirect dispatch branch is unpredictable because its destination is not correlated with the hardware pc. Instead, its destination is correlated to the vPC. We refer to this lack of correlation between the hardware pc and vPC as the context problem. We choose the term context following its use in context sensitive inlining [#!Grove_Chambers_2002!#] because in both cases the context of shared code (in their case methods, in our case virtual instruction bodies) is important to consider. So the problem statement here is that we want to be able to write zig code that outputs machine code that matches this Direct Threading pattern. In one sense, it is an optimization problem, since we can model the same semantics with other language constructs and other machine code. But in another sense, it is more fundamental than an optimization problem, because Zig is a language that wants to generate *optimal* machine code, meaning it is possible to write Zig code that generates machine code equivalent or better to what you could write by hand. In short summary, we want to be able to express zig code where each switch prong jumps directly to the next prong, instead of all switch prongs sharing the same indirect jump, in order to benefit the branch predictor. ## Research Dump ### Can LLVM Do the Optimization? In this example ([godbolt link](https://godbolt.org/z/nb5vrP)), I changed the loop to while(true) and manually inlined the continue expression into each switch prong, with a continue. It does not get much simpler than this; we are practically begging LLVM to do the optimization. ```zig const Inst = extern struct { tag: Tag, const Tag = extern enum { add, addwrap, alloc, alloc_mut, alloc_inferred, alloc_inferred_mut, anyframe_type, array_cat, array_mul, }; }; export fn entry(inst_list_ptr: [*]const Inst, inst_list_len: usize, map: [*]u32) void { const inst_list = inst_list_ptr[0..inst_list_len]; var i: usize = 0; while (true) { const inst = inst_list[i]; switch (inst.tag) { .add => { map[i] = analyze_add(inst); i += 1; if (i < inst_list_len) continue; }, .addwrap => { map[i] = analyze_addwrap(inst); i += 1; if (i < inst_list_len) continue; }, .alloc => { map[i] = analyze_alloc(inst); i += 1; if (i < inst_list_len) continue; }, .alloc_mut => { map[i] = analyze_alloc_mut(inst); i += 1; if (i < inst_list_len) continue; }, .alloc_inferred => { map[i] = analyze_alloc_inferred(inst); i += 1; if (i < inst_list_len) continue; }, .alloc_inferred_mut => { map[i] = analyze_alloc_inferred_mut(inst); i += 1; if (i < inst_list_len) continue; }, .anyframe_type => { map[i] = analyze_anyframe_type(inst); i += 1; if (i < inst_list_len) continue; }, .array_cat => { map[i] = analyze_array_cat(inst); i += 1; if (i < inst_list_len) continue; }, .array_mul => { map[i] = analyze_array_mul(inst); i += 1; if (i < inst_list_len) continue; }, } break; } } extern fn analyze_add(inst: Inst) u32; extern fn analyze_addwrap(inst: Inst) u32; extern fn analyze_alloc(inst: Inst) u32; extern fn analyze_alloc_mut(inst: Inst) u32; extern fn analyze_alloc_inferred(inst: Inst) u32; extern fn analyze_alloc_inferred_mut(inst: Inst) u32; extern fn analyze_anyframe_type(inst: Inst) u32; extern fn analyze_array_cat(inst: Inst) u32; extern fn analyze_array_mul(inst: Inst) u32; ``` Snippet of assembly: ```asm .LBB0_3: mov dword ptr [r14 + 4*rbx], eax inc rbx cmp rbx, r15 jae .LBB0_4 .LBB0_1: mov eax, dword ptr [r12 + 4*rbx] jmp qword ptr [8*rax + .LJTI0_0] .LBB0_2: xor edi, edi call analyze_add jmp .LBB0_3 .LBB0_6: mov edi, 2 call analyze_alloc jmp .LBB0_3 .LBB0_5: mov edi, 1 call analyze_addwrap jmp .LBB0_3 .LBB0_7: mov edi, 3 call analyze_alloc_mut jmp .LBB0_3 ``` Here, LLVM actually figured out the continue expression was duplicated N times, and *un-inlined it*, putting the code back how it was! So crafty. ### EDIT: New Discovery > It does not get much simpler than this Wrong! After typing up this whole proposal, I realized that I did not try that optimization with using an "end" tag in the above code. Here is the case, modified ([godbolt link](https://godbolt.org/z/T3v881)): ```zig const Inst = extern struct { tag: Tag, const Tag = extern enum { end, add, addwrap, alloc, alloc_mut, alloc_inferred, alloc_inferred_mut, anyframe_type, array_cat, array_mul, }; }; export fn entry(inst_list: [*]const Inst, map: [*]u32) void { var i: usize = 0; while (true) { const inst = inst_list[i]; switch (inst.tag) { .end => return, .add => { map[i] = analyze_add(inst); i += 1; continue; }, .addwrap => { map[i] = analyze_addwrap(inst); i += 1; continue; }, .alloc => { map[i] = analyze_alloc(inst); i += 1; continue; }, .alloc_mut => { map[i] = analyze_alloc_mut(inst); i += 1; continue; }, .alloc_inferred => { map[i] = analyze_alloc_inferred(inst); i += 1; continue; }, .alloc_inferred_mut => { map[i] = analyze_alloc_inferred_mut(inst); i += 1; continue; }, .anyframe_type => { map[i] = analyze_anyframe_type(inst); i += 1; continue; }, .array_cat => { map[i] = analyze_array_cat(inst); i += 1; continue; }, .array_mul => { map[i] = analyze_array_mul(inst); i += 1; continue; }, } break; } } extern fn analyze_add(inst: Inst) u32; extern fn analyze_addwrap(inst: Inst) u32; extern fn analyze_alloc(inst: Inst) u32; extern fn analyze_alloc_mut(inst: Inst) u32; extern fn analyze_alloc_inferred(inst: Inst) u32; extern fn analyze_alloc_inferred_mut(inst: Inst) u32; extern fn analyze_anyframe_type(inst: Inst) u32; extern fn analyze_array_cat(inst: Inst) u32; extern fn analyze_array_mul(inst: Inst) u32; ``` Two example prongs from the machine code: ```asm .LBB0_2: mov edi, 1 call analyze_add mov dword ptr [r14 + rbx], eax add rbx, 4 mov eax, dword ptr [r15 + rbx] jmp qword ptr [8*rax + .LJTI0_0] .LBB0_3: mov edi, 2 call analyze_addwrap mov dword ptr [r14 + rbx], eax add rbx, 4 mov eax, dword ptr [r15 + rbx] jmp qword ptr [8*rax + .LJTI0_0] ``` It's perfect! This is exactly what we wanted. This compromises the entire proposal. I will still post the proposal but this new discovery makes it seem unnecessary, since, in fact, we are hereby observing #2162 already implemented and working inside LLVM. ## Real Actual Use Case Here's one in the self-hosted compiler: https://github.com/ziglang/zig/blob/e9a038c33bbf171695b08540536f307b9e418173/src/zir_sema.zig#L30 This switch is inside a loop over ZIR instructions. In optimized builds, we noticed non-trivial amount of time spent in the overhead of this dispatch, when analyzing a recursive comptime fibonacci function call. This pattern also exists in: * The tokenizer * The parser * astgen * sema (this is the linked one above) * codegen * translate-c * zig fmt (pretty much in every stage of the pipeline) ## Other Possible Solution: Tail Calls Tail calls solve this problem. Each switch prong would `return foo()` (tail call) and `foo()` at the end of its business would inline call a function which would do the switch and then tail call the next prong. This is reasonable in the sense that it is doable right now; however there are some problems: * As far as I understand, tail calls don't work on some architectures. - (what are these? does anybody know?) * I'm also concerned about trying to debug when doing dispatch with tail calls. * It forces you to organize your logic into functions. That's another jump that maybe you did not want in your hot path. ## Proposal I propose to add `continue :label expression` syntax, and the ability to label switch expressions. Here is an example: ```zig const Inst = extern struct { tag: Tag, const Tag = extern enum { end, add, addwrap, alloc, alloc_mut, alloc_inferred, alloc_inferred_mut, anyframe_type, array_cat, array_mul, }; }; export fn entry(inst_list: [*]const Inst, map: [*]u32) void { var i: usize = 0; sw: switch (inst_list[i].tag) { .end => return, .add => { map[i] = analyze_add(inst_list[i]); i += 1; continue :sw inst_list[i].tag; }, .addwrap => { map[i] = analyze_addwrap(inst_list[i]); i += 1; continue :sw inst_list[i].tag; }, .alloc => { map[i] = analyze_alloc(inst_list[i]); i += 1; continue :sw inst_list[i].tag; }, .alloc_mut => { map[i] = analyze_alloc_mut(inst_list[i]); i += 1; continue :sw inst_list[i].tag; }, .alloc_inferred => { map[i] = analyze_alloc_inferred(inst_list[i]); i += 1; continue :sw inst_list[i].tag; }, .alloc_inferred_mut => { map[i] = analyze_alloc_inferred_mut(inst_list[i]); i += 1; continue :sw inst_list[i].tag; }, .anyframe_type => { map[i] = analyze_anyframe_type(inst_list[i]); i += 1; continue :sw inst_list[i].tag; }, .array_cat => { map[i] = analyze_array_cat(inst_list[i]); i += 1; continue :sw inst_list[i].tag; }, .array_mul => { map[i] = analyze_array_mul(inst_list[i]); i += 1; continue :sw inst_list[i].tag; }, } } extern fn analyze_add(inst: Inst) u32; extern fn analyze_addwrap(inst: Inst) u32; extern fn analyze_alloc(inst: Inst) u32; extern fn analyze_alloc_mut(inst: Inst) u32; extern fn analyze_alloc_inferred(inst: Inst) u32; extern fn analyze_alloc_inferred_mut(inst: Inst) u32; extern fn analyze_anyframe_type(inst: Inst) u32; extern fn analyze_array_cat(inst: Inst) u32; extern fn analyze_array_mul(inst: Inst) u32; ``` The new labeled continue syntax is syntactically unambiguous at a glance that it jumps to a `switch` expression, because it is the only form where `continue` accepts an operand. More details: * labeled `continue` with an operand on a loop would be a compile error * labeled `break` with a switch would be OK. ### How to Lower this to LLVM Note: I wrote this section before the EDIT: New Discovery section. One idea I had was to put the switchbr instruction inside each prong. I did some LLVM IR surgery to try out this idea ([godbolt link](https://godbolt.org/z/PK6or7)): ```llvm SwitchProngAdd: ; preds = %WhileBody %9 = load i64, i64* %i, align 8 %10 = load i32*, i32** %map, align 8 %11 = getelementptr inbounds i32, i32* %10, i64 %9 %12 = bitcast %Inst* %inst to i32* %13 = load i32, i32* %12, align 4 %14 = call i32 @analyze_add(i32 %13) store i32 %14, i32* %11, align 4 %15 = load i64, i64* %i, align 8 %16 = add nuw i64 %15, 1 store i64 %16, i64* %i, align 8 %17 = load i64, i64* %i, align 8 %18 = load %Inst*, %Inst** %inst_list, align 8 %19 = getelementptr inbounds %Inst, %Inst* %18, i64 %17 %20 = getelementptr inbounds %Inst, %Inst* %19, i32 0, i32 0 %a20 = load i32, i32* %20, align 4 switch i32 %a20, label %SwitchElse18 [ i32 0, label %SwitchProngEnd i32 1, label %SwitchProngAdd i32 2, label %SwitchProngAddWrap i32 3, label %SwitchProngAlloc i32 4, label %SwitchProngAllocMut i32 5, label %SwitchProngAllocInferred i32 6, label %SwitchProngAllocInferredMut i32 7, label %SwitchProngAnyframeType i32 8, label %SwitchProngArrayCat i32 9, label %SwitchProngArrayMul ] SwitchProngAddWrap: ; preds = %WhileBody %21 = load i64, i64* %i, align 8 %22 = load i32*, i32** %map, align 8 %23 = getelementptr inbounds i32, i32* %22, i64 %21 %24 = bitcast %Inst* %inst to i32* %25 = load i32, i32* %24, align 4 %26 = call i32 @analyze_addwrap(i32 %25) store i32 %26, i32* %23, align 4 %27 = load i64, i64* %i, align 8 %28 = add nuw i64 %27, 1 store i64 %28, i64* %i, align 8 %29 = load i64, i64* %i, align 8 %30 = load %Inst*, %Inst** %inst_list, align 8 %31 = getelementptr inbounds %Inst, %Inst* %30, i64 %29 %32 = getelementptr inbounds %Inst, %Inst* %31, i32 0, i32 0 %a32 = load i32, i32* %32, align 4 switch i32 %a32, label %SwitchElse18 [ i32 0, label %SwitchProngEnd i32 1, label %SwitchProngAdd i32 2, label %SwitchProngAddWrap i32 3, label %SwitchProngAlloc i32 4, label %SwitchProngAllocMut i32 5, label %SwitchProngAllocInferred i32 6, label %SwitchProngAllocInferredMut i32 7, label %SwitchProngAnyframeType i32 8, label %SwitchProngArrayCat i32 9, label %SwitchProngArrayMul ] ``` The machine code for the prongs looks like this: ```asm <snip> .LBB0_8: # %SwitchProngAnyframeType mov edi, r12d call analyze_anyframe_type mov dword ptr [r14 + 4*rbx], eax mov eax, dword ptr [r15 + 4*rbx + 4] add rbx, 1 jmp qword ptr [8*rax + .LJTI0_7] .LBB0_9: # %SwitchProngArrayCat mov edi, r12d call analyze_array_cat mov dword ptr [r14 + 4*rbx], eax mov eax, dword ptr [r15 + 4*rbx + 4] add rbx, 1 jmp qword ptr [8*rax + .LJTI0_8] .LBB0_10: # %SwitchProngArrayMul mov edi, r12d call analyze_array_mul mov dword ptr [r14 + 4*rbx], eax mov eax, dword ptr [r15 + 4*rbx + 4] add rbx, 1 jmp qword ptr [8*rax + .LJTI0_9] <snip> ``` Pretty nice. This is exactly what we want - there is an indirect jump in each prong directly to the next prong. But the problem is that even though we should have the same jump table 9 times, LLVM duplicates the jump table 9 times: ```asm .LJTI0_0: .quad .LBB0_1 .quad .LBB0_2 .quad .LBB0_3 .quad .LBB0_4 .quad .LBB0_5 .quad .LBB0_6 .quad .LBB0_7 .quad .LBB0_8 .quad .LBB0_9 .quad .LBB0_10 .LJTI0_1: .quad .LBB0_1 .quad .LBB0_2 .quad .LBB0_3 .quad .LBB0_4 .quad .LBB0_5 .quad .LBB0_6 .quad .LBB0_7 .quad .LBB0_8 .quad .LBB0_9 .quad .LBB0_10 <snip> ``` The duplicated jump tables are problematic, because in reality there could reasonably be about 150-200 instruction tags, which makes the jump table 600-800 bytes. This is fine; for example my L1 cache size is 256 KiB. But I wouldn't want to multiply that jump table by 200! It would be 156 KiB just for the jump tables alone. That would wreak havoc on the cache. Unless this improves upstream, the best strategy to lower this language feature will be for Zig to manually create the jump table itself instead of relying on LLVM to do it, using LLVM's ability to take the address of basic blocks and put them into an array. This will essentially generate the same code that you would get in Clang if you used computed goto in the traditional way. ### How to Lower this in Self-Hosted Backends We have lots of options here. It would be quite straightforward, since we have full control over AIR, as well as the backend code generation. ### OK But Is The Perf Actually Good? I don't know. I think realistically in order to benchmark this and find out if the machine code performs better we have to implement it first.