New Zigger exploring performance

Hi all,

I’m only a week or so into my Zig journey (so everything I’m doing is on 0.15.1), but I’m liking what I’m seeing. In order to help me learn the language I decided to take some C code and port it to Zig. My day job is all around CPU design, and there’s this really old crusty benchmark called Dhrystone that still gets used. It’s an awful benchmark because all it really does is measure the speed of strcmp. It stopped being a useful CPU benchmark 40 years ago, but there’s enough awful C pointer manipulation in it that I thought I’d learn from porting it.

As I was going along I made a note of the “performance” of the code:

  • Started with the original, compiled by basic old CC – 26M Dhrystones
  • Switched on -O2 – 67M Dhrystones
  • Moved to Zig CC – 154M Dhrystones

“Wow!” I’m thinking. “Amazing what an optimising compiler can do.”

  • Did a translate-c on the code and built with zig build --release=fast – 100M Dhrystones

“Ok, that’s pretty good out of the box. Code is noisy as hell though…”

  • Removed export from everything, freeing up the compiler – 118M Dhrystones
  • Removed unused symbols – No change
  • Removed all (about a thousand) @import("std") statements – No change
  • Replaced printf / fprintf / exit / atoi calls with Zig versions – 68M Dhrystones

“WHAT!?!?!”

For those interested, it’s this change. Nothing inside the timed section is changed (between Begin_Time = and End_Time =), yet if you look at the generated code it’s very different. “Poop”-ing the two versions against each other gives me:

Benchmark 1 (15 runs): ./dhrystone-de008fc 500000000
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          4.26s  ± 10.6ms    4.24s  … 4.28s           0 ( 0%)        0%
  peak_rss           1.52MB ± 64.6KB    1.39MB … 1.63MB          0 ( 0%)        0%
  cpu_cycles         20.1G  ± 24.3M     20.0G  … 20.1G           0 ( 0%)        0%
  instructions       70.0G  ± 44.9      70.0G  … 70.0G           0 ( 0%)        0%
  cache_references    286K  ± 32.1K      230K  …  332K           0 ( 0%)        0%
  cache_misses       16.8K  ± 4.00K     12.4K  … 26.9K           0 ( 0%)        0%
  branch_misses      14.1K  ±  244      13.6K  … 14.6K           0 ( 0%)        0%
Benchmark 2 (8 runs): ./dhrystone-f1b6789 500000000
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          7.50s  ±  134ms    7.34s  … 7.66s           0 ( 0%)        💩+ 76.1% ±  1.7%
  peak_rss           1.28MB ± 32.9KB    1.25MB … 1.32MB          0 ( 0%)        ⚡- 15.7% ±  3.4%
  cpu_cycles         36.0G  ±  576M     35.4G  … 36.5G           0 ( 0%)        💩+ 79.3% ±  1.5%
  instructions       77.5G  ±  160      77.5G  … 77.5G           0 ( 0%)        💩+ 10.7% ±  0.0%
  cache_references    453K  ± 67.4K      367K  …  569K           0 ( 0%)        💩+ 58.6% ± 15.0%
  cache_misses       28.2K  ± 6.17K     21.3K  … 39.9K           0 ( 0%)        💩+ 68.1% ± 26.2%
  branch_misses      24.9K  ±  658      24.0K  … 25.9K           0 ( 0%)        💩+ 76.3% ±  2.8%

Can anybody help me understand why removing calls to glibc caused such different outcomes to unrelated code?


Welcome to Ziggit, @WeeBull!

One thing I see here is that you are using std.debug.print, which is not equivalent to printf/fprintf. It has a lot more going on behind the scenes (locking, and writing to stderr rather than stdout).

You probably want to get std.fs.File.stdout() or std.fs.File.stderr() and use that file to write your output. You can use the functions on File directly, or get a Writer and use that interface for printing. If you use the Writer, don’t forget to flush()!

Example:


const std = @import("std");

const stdout = std.fs.File.stdout();
var stdout_buf: [256]u8 = undefined; // Buffer can be as big or as small as needed.
var writer = stdout.writer(&stdout_buf);

// Later on
try writer.interface.print("Something that needs {d} formatting", .{10});

// Don't forget
try writer.interface.flush();

So I agree, that is probably more equivalent. I’ll give it a go and report back.

Still… I wouldn’t expect anything I do before the timer starts or after it ends to affect what’s in the middle. All the prints are before and after.

No significant change. I changed all the std.debug.print and log.info statements to writer.interface.print with catch unreachable on the end (to avoid any error handling code). Looking at poop, it did reduce the number of instructions executed back to the original number, but all the other stats are still the same.

Benchmark 3 (9 runs): ./zig-out/bin/dhrystone 500000000
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          7.37s  ±  101ms    7.24s  … 7.54s           0 ( 0%)        💩+ 73.0% ±  1.3%
  peak_rss           1.28MB ± 66.3KB    1.13MB … 1.32MB          1 (11%)        ⚡- 15.9% ±  4.2%
  cpu_cycles         35.4G  ±  584M     34.8G  … 36.3G           0 ( 0%)        💩+ 76.3% ±  1.6%
  instructions       70.0G  ±  129      70.0G  … 70.0G           0 ( 0%)          -  0.0% ±  0.0%
  cache_references    494K  ± 48.4K      428K  …  567K           0 ( 0%)        💩+ 74.0% ± 13.1%
  cache_misses       41.6K  ± 4.92K     35.4K  … 49.1K           0 ( 0%)        💩+164.8% ± 22.8%
  branch_misses      23.7K  ±  431      23.2K  … 24.5K           0 ( 0%)        💩+ 68.9% ±  2.4%

(Delta is with respect to Benchmark 1 - above)

Yeah, you have a good point. It doesn’t affect anything inside the hot loop, so not sure why it drops so much. Perhaps someone smarter than me can chime in and give a bit more guidance.

No worries @Southporter. It was a good thing to exclude.

One extra clue might be that I’ve noticed the binary size varies. The printf version is 40k; the std.fs.File.stdout version is 209k (down from 277k for std.debug.print). Maybe it’s a memory-layout / cache type of thing. The original whole executable could probably fit in the L1 cache… certainly in L2.

It’s probably that, given that the number of instructions executed is similar. The codegen of std.fmt isn’t the best, and is pretty verbose. Maybe try ReleaseSmall?


Interesting… That gets me back up from 68M to 83M: 40% more instructions, but 22% faster.

Benchmark 2 (10 runs): ./zig-out/bin/dhrystone 500000000
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          6.17s  ± 81.3ms    6.05s  … 6.28s           0 ( 0%)        💩+ 44.7% ±  1.0%
  peak_rss           1.27MB ± 40.2KB    1.20MB … 1.31MB          0 ( 0%)        ⚡- 16.1% ±  2.7%
  cpu_cycles         29.3G  ±  346M     28.8G  … 29.8G           0 ( 0%)        💩+ 45.7% ±  0.9%
  instructions        100G  ±  146       100G  …  100G           0 ( 0%)        💩+ 42.9% ±  0.0%
  cache_references    400K  ± 44.0K      346K  …  478K           0 ( 0%)        💩+ 38.2% ± 14.6%
  cache_misses       21.4K  ± 3.87K     15.5K  … 29.0K           0 ( 0%)          + 20.2% ± 21.6%
  branch_misses      19.8K  ±  822      18.8K  … 21.4K           0 ( 0%)        💩+ 40.2% ±  3.4%

I’ve been back over all the versions and --release=small is always about the same performance. The only exception is the very first translate-c version which had all the symbols exported, and that was ~20% slower.

Looks like it’s just ReleaseFast that’s on a knife-edge.

Here are a bunch of changes I made, and it runs a lot faster with them.

I think one reason it doesn’t run as fast as the C version is that a program half written in C and half written in Zig is less likely to be well optimized across the language boundary, due to calling conventions/ABI and subtle differences between C and Zig.

That is why I started to use more Zig constructs (@memcpy, std.mem.orderZ); it would take more time to move more of the code over to Zig style. Not completely sure everything is right, but I don’t have more time right now.

const std = @import("std");
const log = std.log;

const Enumeration = enum(u8) {
    Ident_1,
    Ident_2,
    Ident_3,
    Ident_4,
    Ident_5,
};

const One_Thirty = i32;
const One_Fifty = i32;
const Capital_Letter = u8;
const Boolean = bool;
const Str_30 = [31]u8;
const Arr_1_Dim = [50]i32;
const Arr_2_Dim = [50][50]i32;

const struct_unnamed_29 = extern struct {
    Enum_Comp: Enumeration = .Ident_1,
    Int_Comp: i32 = 0,
    Str_Comp: [31]u8 = std.mem.zeroes([31]u8),
};
const struct_unnamed_30 = extern struct {
    E_Comp_2: Enumeration = .Ident_1,
    Str_2_Comp: [31]u8 = std.mem.zeroes([31]u8),
};
const struct_unnamed_31 = extern struct {
    Ch_1_Comp: u8 = 0,
    Ch_2_Comp: u8 = 0,
};
const union_unnamed_28 = extern union {
    var_1: struct_unnamed_29,
    var_2: struct_unnamed_30,
    var_3: struct_unnamed_31,
};
const struct_record = extern struct {
    Ptr_Comp: [*c]struct_record = std.mem.zeroes([*c]struct_record),
    Discr: Enumeration = .Ident_1,
    variant: union_unnamed_28 = std.mem.zeroes(union_unnamed_28),
};
const Rec_Type = struct_record;
const Rec_Pointer = [*c]struct_record;

// Global Variables

// For Portability's sake we allocate two records in global space
// rather than using malloc to get them off the heap.
var RecA_Glob: Rec_Type = .{
    .Ptr_Comp = &RecB_Glob,
    .Discr = .Ident_1,
    .variant = .{ .var_1 = .{
        .Enum_Comp = .Ident_3,
        .Int_Comp = 40,
    } },
};
var RecB_Glob: Rec_Type = .{};

var Ptr_Glob: Rec_Pointer = &RecA_Glob;
var Next_Ptr_Glob: Rec_Pointer = &RecB_Glob;
var Int_Glob: i32 = 0;
var Bool_Glob: bool = false;
var Ch_1_Glob: u8 = 0;
var Ch_2_Glob: u8 = 0;
var Arr_1_Glob: [50]i32 = std.mem.zeroes([50]i32);
var Arr_2_Glob: [50][50]i32 = std.mem.zeroes([50][50]i32);

fn Func_1(Ch_1_Par_Val: Capital_Letter, Ch_2_Par_Val: Capital_Letter) Enumeration {
    const Ch_1_Loc: Capital_Letter = Ch_1_Par_Val;
    const Ch_2_Loc: Capital_Letter = Ch_1_Loc;
    if (@as(i32, @bitCast(@as(u32, Ch_2_Loc))) != @as(i32, @bitCast(@as(u32, Ch_2_Par_Val)))) return .Ident_1 else {
        Ch_1_Glob = Ch_1_Loc;
        return .Ident_2;
    }
}

fn Func_2(Str_1_Par_Ref: [*c]u8, Str_2_Par_Ref: [*c]u8) bool {
    var Int_Loc: One_Thirty = 2;
    var Ch_Loc: Capital_Letter = undefined;
    while (Int_Loc <= @as(i32, 2)) if (Func_1((blk: {
        const tmp = Int_Loc;
        if (tmp >= 0) break :blk Str_1_Par_Ref + @as(usize, @intCast(tmp)) else break :blk Str_1_Par_Ref - ~@as(usize, @bitCast(@as(isize, @intCast(tmp)) +% -1));
    }).*, (blk: {
        const tmp = Int_Loc + @as(i32, 1);
        if (tmp >= 0) break :blk Str_2_Par_Ref + @as(usize, @intCast(tmp)) else break :blk Str_2_Par_Ref - ~@as(usize, @bitCast(@as(isize, @intCast(tmp)) +% -1));
    }).*) == .Ident_1) {
        Ch_Loc = 'A';
        Int_Loc += @as(One_Thirty, @bitCast(@as(i32, 1)));
    };
    if ((@as(i32, @bitCast(@as(u32, Ch_Loc))) >= @as(i32, 'W')) and (@as(i32, @bitCast(@as(u32, Ch_Loc))) < @as(i32, 'Z'))) {
        Int_Loc = 7;
    }
    if (@as(i32, @bitCast(@as(u32, Ch_Loc))) == @as(i32, 'R')) return true else {
        if (std.mem.orderZ(u8, Str_1_Par_Ref, Str_2_Par_Ref) == .gt) {
            Int_Loc += @as(One_Thirty, @bitCast(@as(i32, 7)));
            Int_Glob = Int_Loc;
            return true;
        } else return false;
    }
}

fn Func_3(Enum_Par_Val: Enumeration) bool {
    return (Enum_Par_Val == .Ident_3);
}

fn Proc_1(Ptr_Val_Par: Rec_Pointer) void {
    const Next_Record: Rec_Pointer = Ptr_Val_Par.*.Ptr_Comp;
    Ptr_Val_Par.*.Ptr_Comp.* = Ptr_Glob.*;
    Ptr_Val_Par.*.variant.var_1.Int_Comp = 5;
    Next_Record.*.variant.var_1.Int_Comp = Ptr_Val_Par.*.variant.var_1.Int_Comp;
    Next_Record.*.Ptr_Comp = Ptr_Val_Par.*.Ptr_Comp;
    Proc_3(&Next_Record.*.Ptr_Comp);
    if (Next_Record.*.Discr == .Ident_1) {
        Next_Record.*.variant.var_1.Int_Comp = 6;
        Proc_6(Ptr_Val_Par.*.variant.var_1.Enum_Comp, &Next_Record.*.variant.var_1.Enum_Comp);
        Next_Record.*.Ptr_Comp = Ptr_Glob.*.Ptr_Comp;
        Proc_7(Next_Record.*.variant.var_1.Int_Comp, @as(i32, 10), &Next_Record.*.variant.var_1.Int_Comp);
    } else {
        Ptr_Val_Par.* = Ptr_Val_Par.*.Ptr_Comp.*;
    }
}

fn Proc_2(Int_Par_Ref: *One_Fifty) void {
    var Int_Loc: One_Fifty = Int_Par_Ref.* + @as(i32, 10);
    var Enum_Loc: Enumeration = undefined;
    while (true) {
        if (@as(i32, @bitCast(@as(u32, Ch_1_Glob))) == @as(i32, 'A')) {
            Int_Loc -= @as(One_Fifty, @bitCast(@as(i32, 1)));
            Int_Par_Ref.* = Int_Loc - Int_Glob;
            Enum_Loc = .Ident_1;
        }
        if (!(Enum_Loc != .Ident_1)) break;
    }
}

fn Proc_3(Ptr_Ref_Par: [*c]Rec_Pointer) void {
    if (Ptr_Glob != null) {
        Ptr_Ref_Par.* = Ptr_Glob.*.Ptr_Comp;
    }
    Proc_7(@as(i32, 10), Int_Glob, &Ptr_Glob.*.variant.var_1.Int_Comp);
}

fn Proc_4() void {
    const Bool_Loc: bool = Ch_1_Glob == 'A';
    Bool_Glob = Bool_Loc | Bool_Glob;
    Ch_2_Glob = 'B';
}

fn Proc_5() void {
    Ch_1_Glob = 'A';
    Bool_Glob = false;
}

fn Proc_6(Enum_Val_Par: Enumeration, Enum_Ref_Par: *Enumeration) void {
    Enum_Ref_Par.* = Enum_Val_Par;
    if (!Func_3(Enum_Val_Par)) {
        Enum_Ref_Par.* = .Ident_4;
    }
    switch (Enum_Val_Par) {
        .Ident_1 => {
            Enum_Ref_Par.* = .Ident_1;
        },
        .Ident_2 => {
            if (Int_Glob > @as(i32, 100)) {
                Enum_Ref_Par.* = .Ident_1;
            } else {
                Enum_Ref_Par.* = .Ident_4;
            }
        },
        .Ident_3 => {
            Enum_Ref_Par.* = .Ident_2;
        },
        .Ident_4 => {},
        .Ident_5 => {
            Enum_Ref_Par.* = .Ident_3;
        },
    }
}

fn Proc_7(Int_1_Par_Val: One_Fifty, Int_2_Par_Val: One_Fifty, Int_Par_Ref: *One_Fifty) void {
    const Int_Loc: One_Fifty = Int_1_Par_Val + @as(i32, 2);
    Int_Par_Ref.* = Int_2_Par_Val + Int_Loc;
}
var Int_Index: One_Fifty = undefined;

fn Proc_8(Arr_1_Par_Ref: [*c]i32, Arr_2_Par_Ref: [*c][50]i32, Int_1_Par_Val: i32, Int_2_Par_Val: i32) void {
    const Int_Loc: One_Fifty = Int_1_Par_Val + @as(i32, 5);
    (blk: {
        const tmp = Int_Loc;
        if (tmp >= 0) break :blk Arr_1_Par_Ref + @as(usize, @intCast(tmp)) else break :blk Arr_1_Par_Ref - ~@as(usize, @bitCast(@as(isize, @intCast(tmp)) +% -1));
    }).* = Int_2_Par_Val;
    (blk: {
        const tmp = Int_Loc + @as(i32, 1);
        if (tmp >= 0) break :blk Arr_1_Par_Ref + @as(usize, @intCast(tmp)) else break :blk Arr_1_Par_Ref - ~@as(usize, @bitCast(@as(isize, @intCast(tmp)) +% -1));
    }).* = (blk: {
        const tmp = Int_Loc;
        if (tmp >= 0) break :blk Arr_1_Par_Ref + @as(usize, @intCast(tmp)) else break :blk Arr_1_Par_Ref - ~@as(usize, @bitCast(@as(isize, @intCast(tmp)) +% -1));
    }).*;
    (blk: {
        const tmp = Int_Loc + @as(i32, 30);
        if (tmp >= 0) break :blk Arr_1_Par_Ref + @as(usize, @intCast(tmp)) else break :blk Arr_1_Par_Ref - ~@as(usize, @bitCast(@as(isize, @intCast(tmp)) +% -1));
    }).* = Int_Loc;
    {
        Int_Index = Int_Loc;
        while (Int_Index <= (Int_Loc + @as(i32, 1))) : (Int_Index += 1) {
            Arr_2_Par_Ref[@as(u32, @intCast(Int_Loc))][@as(u32, @intCast(Int_Index))] = Int_Loc;
            // (blk: {
            //     const tmp = Int_Loc;
            //     if (tmp >= 0) break :blk Arr_2_Par_Ref + @as(usize, @intCast(tmp)) else break :blk Arr_2_Par_Ref - ~@as(usize, @bitCast(@as(isize, @intCast(tmp)) +% -1));
            // }).*[@as(c_uint, @intCast(Int_Index))] = Int_Loc;
        }
    }
    Arr_2_Par_Ref[@as(u32, @intCast(Int_Loc))][@as(u32, @intCast(Int_Loc - 1))] += 1;
    // (blk: {
    //     const tmp = Int_Loc;
    //     if (tmp >= 0) break :blk Arr_2_Par_Ref + @as(usize, @intCast(tmp)) else break :blk Arr_2_Par_Ref - ~@as(usize, @bitCast(@as(isize, @intCast(tmp)) +% -1));
    // }).*[@as(c_uint, @intCast(Int_Loc - @as(c_int, 1)))] += @as(c_int, 1);
    Arr_2_Par_Ref[@as(u32, @intCast(Int_Loc + 20))][@as(u32, @intCast(Int_Loc))] = Arr_1_Par_Ref[@intCast(Int_Loc)];
    // (blk: {
    //     const tmp = Int_Loc + @as(c_int, 20);
    //     if (tmp >= 0) break :blk Arr_2_Par_Ref + @as(usize, @intCast(tmp)) else break :blk Arr_2_Par_Ref - ~@as(usize, @bitCast(@as(isize, @intCast(tmp)) +% -1));
    // }).*[@as(c_uint, @intCast(Int_Loc))] = (blk: {
    //     const tmp = Int_Loc;
    //     if (tmp >= 0) break :blk Arr_1_Par_Ref + @as(usize, @intCast(tmp)) else break :blk Arr_1_Par_Ref - ~@as(usize, @bitCast(@as(isize, @intCast(tmp)) +% -1));
    // }).*;
    Int_Glob = 5;
}

// Was originally this, but exporting the function bypasses all the Zig
// startup code, so std.os.argv wouldn't be initialised.
// pub export fn main(arg_argc: c_int, arg_argv: [*c][*c]u8) u8 {
//     var argc = arg_argc;
//     _ = &argc;
//     var argv = arg_argv;
//     _ = &argv;
pub fn main() u8 {
    const argc = std.os.argv.len;
    const argv = std.os.argv;
    var Int_1_Loc: One_Fifty = std.mem.zeroes(One_Fifty);
    var Int_2_Loc: One_Fifty = std.mem.zeroes(One_Fifty);
    var Int_3_Loc: One_Fifty = std.mem.zeroes(One_Fifty);
    var Ch_Index: u8 = std.mem.zeroes(u8);
    var Enum_Loc: Enumeration = undefined;
    var Str_1_Loc: Str_30 = std.mem.zeroes(Str_30);
    var Str_2_Loc: Str_30 = std.mem.zeroes(Str_30);
    var Run_Index: i32 = std.mem.zeroes(i32);
    var Number_Of_Runs: i32 = std.mem.zeroes(i32);

    if (argc > @as(i32, 2)) {
        std.debug.print("Usage: {s} [number of loops]\n", .{argv[0]});
        return 1;
    }
    if (argc == @as(i32, 2)) {
        Number_Of_Runs = std.fmt.parseInt(i32, std.mem.span(argv[1]), 10) catch 0;
    } else {
        Number_Of_Runs = 50000;
    }
    if (Number_Of_Runs <= @as(i32, 0)) {
        Number_Of_Runs = 50000;
    }
    @memcpy(@as([*c]u8, &Ptr_Glob.*.variant.var_1.Str_Comp), "DHRYSTONE PROGRAM, SOME STRING");
    @memcpy(@as([*c]u8, &Str_1_Loc), "DHRYSTONE PROGRAM, 1'ST STRING");
    Arr_2_Glob[8][7] = 10;
    std.debug.print("\n", .{});
    std.debug.print("Dhrystone Benchmark, Version Zig, Version 2.2\n", .{});
    std.debug.print("Using std.time.milliTimestamp()\n", .{});
    std.debug.print("\n", .{});
    var User_Time: i64 = undefined;
    var Done: bool = false;
    while (!Done) {
        std.debug.print("Trying {d} runs through Dhrystone:\n", .{Number_Of_Runs});
        const Begin_Time = std.time.milliTimestamp();
        {
            Run_Index = 1;
            while (Run_Index <= Number_Of_Runs) : (Run_Index += 1) {
                Proc_5();
                Proc_4();
                Int_1_Loc = 2;
                Int_2_Loc = 3;
                @memcpy(@as([*c]u8, &Str_2_Loc), "DHRYSTONE PROGRAM, 2'ND STRING");
                Enum_Loc = .Ident_2;
                Bool_Glob = !(Func_2(@as([*c]u8, @ptrCast(@alignCast(&Str_1_Loc[@as(usize, @intCast(0))]))), @as([*c]u8, @ptrCast(@alignCast(&Str_2_Loc[@as(usize, @intCast(0))])))));
                while (Int_1_Loc < Int_2_Loc) {
                    Int_3_Loc = (@as(i32, 5) * Int_1_Loc) - Int_2_Loc;
                    Proc_7(Int_1_Loc, Int_2_Loc, &Int_3_Loc);
                    Int_1_Loc += @as(One_Fifty, @bitCast(@as(i32, 1)));
                }
                Proc_8(@as([*c]i32, &Arr_1_Glob[0]), @as([*c][50]i32, @ptrCast(@alignCast(&Arr_2_Glob[@as(usize, @intCast(0))]))), Int_1_Loc, Int_3_Loc);
                Proc_1(Ptr_Glob);
                {
                    Ch_Index = 'A';
                    while (@as(i32, @bitCast(@as(u32, Ch_Index))) <= @as(i32, @bitCast(@as(u32, Ch_2_Glob)))) : (Ch_Index +%= 1) {
                        if (Enum_Loc == Func_1(Ch_Index, @as(Capital_Letter, @bitCast(@as(i8, @truncate(@as(i32, 'C'))))))) {
                            Proc_6(.Ident_1, &Enum_Loc);
                            @memcpy(@as([*c]u8, @ptrCast(@alignCast(&Str_2_Loc[@as(usize, @intCast(0))]))), "DHRYSTONE PROGRAM, 3'RD STRING");
                            Int_2_Loc = Run_Index;
                            Int_Glob = Run_Index;
                        }
                    }
                }
                Int_2_Loc = Int_2_Loc * Int_1_Loc;
                Int_1_Loc = @divTrunc(Int_2_Loc, Int_3_Loc);
                Int_2_Loc = (@as(i32, 7) * (Int_2_Loc - Int_3_Loc)) - Int_1_Loc;
                Proc_2(&Int_1_Loc);
            }
        }
        const End_Time = std.time.milliTimestamp();
        User_Time = End_Time - Begin_Time;
        if (User_Time < 2000) {
            std.debug.print("Measured time too small to obtain meaningful results\n", .{});
            Number_Of_Runs *= 10;
            std.debug.print("\n", .{});
        } else {
            Done = true;
        }
    }
    log.info("Final values of the variables used in the benchmark:\n", .{});
    log.info("", .{});
    log.info("Int_Glob:            {d}", .{Int_Glob});
    log.info("        should be:   {d}", .{@as(i32, 5)});
    log.info("Bool_Glob:           {any}", .{Bool_Glob});
    log.info("        should be:   {any}", .{true});
    log.info("Ch_1_Glob:           {c}", .{Ch_1_Glob});
    log.info("        should be:   {c}", .{'A'});
    log.info("Ch_2_Glob:           {c}", .{Ch_2_Glob});
    log.info("        should be:   {c}", .{'B'});
    log.info("Arr_1_Glob[8]:       {d}", .{Arr_1_Glob[@as(u32, @intCast(@as(i32, 8)))]});
    log.info("        should be:   {d}", .{@as(i32, 7)});
    log.info("Arr_2_Glob[8][7]:    {d}", .{Arr_2_Glob[@as(u32, @intCast(@as(i32, 8)))][@as(u32, @intCast(@as(i32, 7)))]});
    log.info("        should be:   Number_Of_Runs + 10", .{});
    log.info("Ptr_Glob->", .{});
    log.info("  Ptr_Comp:          {d}", .{@as(i32, @intCast(@intFromPtr(Ptr_Glob.*.Ptr_Comp)))});
    log.info("        should be:   (implementation-dependent)", .{});
    log.info("  Discr:             {d}", .{Ptr_Glob.*.Discr});
    log.info("        should be:   {d}", .{@as(i32, 0)});
    log.info("  Enum_Comp:         {d}", .{Ptr_Glob.*.variant.var_1.Enum_Comp});
    log.info("        should be:   {d}", .{@as(i32, 2)});
    log.info("  Int_Comp:          {d}", .{Ptr_Glob.*.variant.var_1.Int_Comp});
    log.info("        should be:   {d}", .{@as(i32, 17)});
    log.info("  Str_Comp:          {s}", .{@as([*c]u8, @ptrCast(@alignCast(&Ptr_Glob.*.variant.var_1.Str_Comp[@as(usize, @intCast(0))])))});
    log.info("        should be:   DHRYSTONE PROGRAM, SOME STRING", .{});
    log.info("Next_Ptr_Glob->", .{});
    log.info("  Ptr_Comp:          {d}", .{@as(i32, @intCast(@intFromPtr(Next_Ptr_Glob.*.Ptr_Comp)))});
    log.info("        should be:   (implementation-dependent), same as above", .{});
    log.info("  Discr:             {d}", .{Next_Ptr_Glob.*.Discr});
    log.info("        should be:   {d}", .{@as(i32, 0)});
    log.info("  Enum_Comp:         {d}", .{Next_Ptr_Glob.*.variant.var_1.Enum_Comp});
    log.info("        should be:   {d}", .{@as(i32, 1)});
    log.info("  Int_Comp:          {d}", .{Next_Ptr_Glob.*.variant.var_1.Int_Comp});
    log.info("        should be:   {d}", .{@as(i32, 18)});
    log.info("  Str_Comp:          {s}", .{@as([*c]u8, @ptrCast(@alignCast(&Next_Ptr_Glob.*.variant.var_1.Str_Comp[@as(usize, @intCast(0))])))});
    log.info("        should be:   DHRYSTONE PROGRAM, SOME STRING", .{});
    log.info("Int_1_Loc:           {d}", .{Int_1_Loc});
    log.info("        should be:   {d}", .{@as(i32, 5)});
    log.info("Int_2_Loc:           {d}", .{Int_2_Loc});
    log.info("        should be:   {d}", .{@as(i32, 13)});
    log.info("Int_3_Loc:           {d}", .{Int_3_Loc});
    log.info("        should be:   {d}", .{@as(i32, 7)});
    log.info("Enum_Loc:            {d}", .{Enum_Loc});
    log.info("        should be:   {d}", .{@as(i32, 1)});
    log.info("Str_1_Loc:           {s}", .{@as([*c]u8, @ptrCast(@alignCast(&Str_1_Loc[@as(usize, @intCast(0))])))});
    log.info("        should be:   DHRYSTONE PROGRAM, 1'ST STRING", .{});
    log.info("Str_2_Loc:           {s}", .{@as([*c]u8, @ptrCast(@alignCast(&Str_2_Loc[@as(usize, @intCast(0))])))});
    log.info("        should be:   DHRYSTONE PROGRAM, 2'ND STRING", .{});
    log.info("", .{});
    const Microseconds: f32 = (@as(f32, @floatFromInt(User_Time)) * 1000.0) / @as(f32, @floatFromInt(Number_Of_Runs));
    const Dhrystones_Per_Second: f32 = 1000000.0 / Microseconds;
    std.debug.print("Microseconds for one run through Dhrystone: ", .{});
    std.debug.print("{e:10.3} \n", .{@as(f64, @floatCast(Microseconds))});
    std.debug.print("Dhrystones per Second:                      ", .{});
    std.debug.print("{e:10.3} \n", .{@as(f64, @floatCast(Dhrystones_Per_Second))});
    std.debug.print("\n", .{});
    return 0;
}

Just to clarify since it hasn’t been mentioned so far: higher Dhrystones is better, right?

Also, FYI translate-c is only really meant for auto-translating C header files, not for fully converting C to Zig.

Running the version with this change and the version before this change with strace -c will likely give you the answer for the slowdown (printf/fprintf does buffering behind the scenes, Zig’s std.debug.print does not)

See here for info on the current way to get buffered IO with Zig (you’d then pass around the *std.Io.Writer you got from const stdout = &stdout_writer.interface; to the functions that need it).

This problem has been mentioned here, but it didn’t change the result much.


Just to complete the tale on this one because I put some more effort in to try to work it out.

I ran perf on the various versions of the code and discovered that the whole benchmark seemed to be bottlenecked by two structure assignments, with them taking about 60% of all cycles. A good run tended to have much better cache hit rates. The structure assignment itself was only 4 AVX instructions (2 reads and 2 writes, 32 bytes each), but the first one seemed to be taking a high penalty.

The 56-byte structures were allocated in the .data region of the executable (i.e. they’re static globals), and the performance depended on how the compiler decided to align them. Good alignment meant they fit in one cache line, but at 56 bytes (with a natural 8-byte alignment) it was much more likely that they’d cross two. Being static, their alignment is determined by their location in the ELF file: add or remove code, and the alignment changes. With those assignments being so critical to the overall number, crossing two cache lines had a big impact.

It’s a bit of a working theory, but it seems to fit …and also shows how terrible the benchmark is.
