Performance "double" loop

Quick question: would loop 1 be faster than loop 2?

const src_line: []T = src.data[src_idx..src_idx + copy_width];
const dst_line: []T = dst.data[dst_idx..dst_idx + copy_width];

// solution 1
for(src_line, dst_line) |s, *d|
{
    copy(s, &d.*);
}

// solution 2
for(0..copy_width) |i|
{
    copy(src_line[i], &dst_line[i]);
}

When in doubt, measure.
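
For what it's worth, a rough way to measure could look like the sketch below. It assumes a Zig version where `std.time.Timer` and `std.mem.doNotOptimizeAway` are available, and the `copy` body, the buffer size, and the repeat count are placeholders I made up, not the original poster's code. Build with `-O ReleaseFast`, otherwise the numbers say little.

```zig
const std = @import("std");

// Hypothetical stand-in for the copy routine in the question.
fn copy(x: i64, y: *i64) void {
    if (x > 0) y.* = x;
}

// Solution 1: iterate both slices in lockstep.
fn loop1(src_line: []const i64, dst_line: []i64) void {
    for (src_line, dst_line) |s, *d| copy(s, d);
}

// Solution 2: iterate an index range and index both slices.
fn loop2(src_line: []const i64, dst_line: []i64) void {
    for (0..src_line.len) |i| copy(src_line[i], &dst_line[i]);
}

// Runs `f` 1000 times and returns the elapsed time in nanoseconds.
fn timeNs(comptime f: fn ([]const i64, []i64) void, src: []const i64, dst: []i64) !u64 {
    var timer = try std.time.Timer.start();
    for (0..1000) |_| {
        f(src, dst);
        // Keep the optimizer from discarding the work.
        std.mem.doNotOptimizeAway(dst.ptr);
    }
    return timer.read();
}

pub fn main() !void {
    var src: [4096]i64 = undefined;
    var dst = [_]i64{0} ** 4096;
    // Mix of negative and positive values so both branches of copy run.
    for (&src, 0..) |*s, i| s.* = @as(i64, @intCast(i)) - 2048;

    std.debug.print("loop1: {d} ns\n", .{try timeNs(loop1, &src, &dst)});
    std.debug.print("loop2: {d} ns\n", .{try timeNs(loop2, &src, &dst)});
}
```

Something like `zig run -O ReleaseFast bench.zig` should run it. Expect noise between runs, so repeat it a few times before drawing conclusions.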

Side note: you can also write the slices like this:

const src_line: []T = src.data[src_idx..][0..copy_width];
const dst_line: []T = dst.data[dst_idx..][0..copy_width];
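
To show that both spellings select the same window, here is a small sketch. The function names and sample data are made up for illustration.

```zig
const std = @import("std");

// Start index plus explicit end index.
fn windowA(data: []const u8, idx: usize, width: usize) []const u8 {
    return data[idx .. idx + width];
}

// Slice from the start index, then take `width` elements.
fn windowB(data: []const u8, idx: usize, width: usize) []const u8 {
    return data[idx..][0..width];
}

pub fn main() void {
    const data = [_]u8{ 10, 11, 12, 13, 14, 15 };
    const a = windowA(&data, 2, 3);
    const b = windowB(&data, 2, 3);
    // Same pointer and same length: both cover elements 12, 13, 14.
    std.debug.assert(a.ptr == b.ptr and a.len == b.len);
    std.debug.print("{any} {any}\n", .{ a, b });
}
```

One practical difference: when the length is comptime-known, the second form yields a pointer to a fixed-size array rather than a slice, which can be handy.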

The two loops are identical.

Dereferencing and taking the address cancel each other out, so you could write this:

for(src_line, dst_line) |s, *d|
    copy(s, d);

Compare the disassembly of both options. It's easier than measuring, and often it gives a conclusive answer without any measurement. I'm pretty sure both options generate the same machine code.
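
One way to set that comparison up, as a sketch: put each loop in its own exported function so both show up as separate symbols in the output. The file name, the function names, and the `copy` body are assumptions for illustration. Exported functions cannot take slices, so this version passes a pointer and a length instead.

```zig
// loops.zig
// Assumed command to emit assembly next to the object file:
//   zig build-obj -O ReleaseFast -femit-asm=loops.s loops.zig
// Then compare the bodies of loop1 and loop2 in loops.s.

// Hypothetical stand-in for the copy routine in the question.
fn copy(x: i64, y: *i64) void {
    if (x > 0) y.* = x;
}

// Solution 1: iterate both slices in lockstep.
export fn loop1(src: [*]const i64, dst: [*]i64, n: usize) void {
    for (src[0..n], dst[0..n]) |s, *d| copy(s, d);
}

// Solution 2: iterate an index range and index both slices.
export fn loop2(src: [*]const i64, dst: [*]i64, n: usize) void {
    const src_line = src[0..n];
    const dst_line = dst[0..n];
    for (0..n) |i| copy(src_line[i], &dst_line[i]);
}
```

Compiler Explorer also supports Zig, if pasting the two functions into a browser is more convenient than building locally.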

That is interesting! And good to know.

OK, great. I was already unhappy with how &d.* looks.

In my copy routine (which I adjusted to match your example), however, I do have to dereference y.

fn copy(x: i64, y: *i64) void
{
    if (x > 0)
    {
        y.* = x;
    }
}

(Note that my actual version is generic over a comptime type.)
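
A generic version along those lines might look like the sketch below. The signatures are my guess at the shape, not the poster's actual code, and `copyLine` is a made-up helper name.

```zig
const std = @import("std");

// Generic over the element type; T is assumed to be a numeric type
// so that the comparison with 0 compiles.
fn copy(comptime T: type, x: T, y: *T) void {
    if (x > 0) {
        y.* = x; // y is a pointer, so writing through it needs the dereference
    }
}

// Copies one line, skipping source values that are not positive.
fn copyLine(comptime T: type, src_line: []const T, dst_line: []T) void {
    // d is already a *T here, so it can be passed straight through.
    for (src_line, dst_line) |s, *d| copy(T, s, d);
}

pub fn main() void {
    const src = [_]u8{ 0, 7, 0, 9 };
    var dst = [_]u8{ 1, 1, 1, 1 };
    copyLine(u8, &src, &dst);
    // dst now holds 1, 7, 1, 9: zeros in src leave dst untouched.
    std.debug.print("{any}\n", .{dst});
}
```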

I was wondering whether the optimizer could vectorize this with SIMD when the elements are u8s.

Compile it and check the assembly, my friend. If you see SIMD instructions emitted for the platform(s) you care about, then it does SIMD.

If you are unfamiliar with assembly, look for vector registers. On x86 that means xmm, ymm, or zmm registers. On ARM (AArch64) the vector registers are named v0 through v31, but they also show up under aliases such as q0, d0, or s0.
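
For the u8 case specifically, a minimal file to inspect could be the following sketch. The file and function names are made up, and whether the conditional store actually gets vectorized depends on the target CPU and the LLVM version, so no promises about what shows up.

```zig
// copy_u8.zig
// Assumed commands, on an x86 machine:
//   zig build-obj -O ReleaseFast -femit-asm=copy_u8.s copy_u8.zig
//   grep -E 'xmm|ymm|zmm' copy_u8.s
// Any hits inside copy_line_u8 suggest the loop was vectorized.

// Hypothetical u8 variant of the copy routine from the thread.
fn copy(x: u8, y: *u8) void {
    if (x > 0) y.* = x;
}

// Exported so the symbol is kept and easy to find in the assembly.
export fn copy_line_u8(src: [*]const u8, dst: [*]u8, n: usize) void {
    for (src[0..n], dst[0..n]) |s, *d| copy(s, d);
}
```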

I am somewhat familiar with assembly, though no expert, but I cannot find the assembly output anywhere. I still have to figure out some details.
Currently in VS Code I cannot build or debug, only run…