Bitsize riddle of little array

// this thing is 6 bits
pub const Letter = packed struct
{
    machinechar: u5 = 0,
    is_blank: bool = false,
}
// a little alias
type Letters7 = [7]Letter;

When I execute:

std.debug.print("letter 7 array bytes={}, bits={}\n", .{@sizeOf(Letters7), @bitSizeOf(Letters7)});

I get the baffling result:
letter 7 array bytes=7, bits=54

54 bits… can anyone explain that?? Instead of 56 I would expect from an array.

Second thing: I am looking for the fastest way to insert at index #0 a new letter. (only at index #0 is required).
(I am keeping a separate ‘len’ variable inside my Move struct).

Is there something faster than this? Some bitwise trick?
I am ‘worried’ about the copyBackwards.
(with the inexplicable 54 bits I guess we have a problem anyway)

fn insert(self: *Move, letter: Letter) void
{
    if (self.len > 0)
        std.mem.copyBackwards(Letter, self.letters[1..1 + self.len], self.letters[0..self.len]);
    self.letters[0] = letter;
    self.len += 1;
}

A little explanation for the structure. I needed to do this because of size-reasons. It was the only way to keep my move inside 128 bits.

In regards to the first thing: @bitSizeOf() returns the amount of bits a type would take up in a packed container.
Since arrays aren’t packed, the individual elements in it get padded to a C ABI compatible size, in this case 8 bits. Hence @sizeOf(Letters7) == 7.
So the amount of bits Letters7 would take up in memory is 7 * 8 = 56, as you’ve correctly assumed. However memory is not packed by default, individual bits cannot be addressed directly.
If you would put Letter7 into a packed container, the last two bits of the last element inside the array could be freely used since they are only padding and will not be counted by @bitSizeOf(). The rest of the padding between the array elements will be counted however since they are part of the array type (which has a well-defined memory layout) (not when inside of a packed container).

To avoid this you could pad Letter to 8 bits:

pub const Letter = packed struct(u8)
{
    machinechar: u5 = 0,
    is_blank: bool = false,
    _: u2 = undefined,
}

but you really don’t need to worry about it unless you plan to use Letters7 inside of a packed container.

This array is not allowed by the compiler to be put in a packed struct anyway!

1 Like

aha! the padding could be an idea!!

I tried this, but no speed gained. The compiler is smarter.

    fn insert_no_speedgain(self: *Move, letter: Letter) void
    {
        const a: *u56 = @alignCast(@ptrCast(&self.letters));
        a.* <<= 8;
        self.letters[0] = letter;
        self.len += 1;
    }

Yeah, you’re right I forgot about that. I guess it is a bit of an inconsistency that @bitSizeOf() doesn’t consider padding for arrays.
For example it will return @sizeOf(T) * 8 for struct and union(enum) but you can’t put those inside a packed container either.

True. A small kinda inconsistency in the language i guess

I didn’t do any performance tests, but you might want to try this (it generates the simplest assembly):

const temp = letters[0..6].*;
letters[1..].* = temp;
letters[0] = letter;

You can compare the original technique, the bitshift technique, and my technique here: https://godbolt.org/z/z68vob4n7

2 Likes

The original approach with copyBackwards actually compiles to the exact same assembly as yours if you pad Letter to 8 bits: godbolt
Seems like LLVM has trouble generating optimal code for looping over non-C-ABI-sized types.

1 Like

Oh yeah that makes sense, I’ve ran into many performance issues like this. It’s unfortunate, as it pressures you use larger byte-aligned int types when smaller ones would be more expressive (and in theory could be just as performant).

1 Like

True!! if Letter is a packed struct(u8) with an added _reserved u2, there is no difference.
Interesting.

Packing is generally anti-performance. It’s useful for reducing size, but it comes at an execution cost on existing computer architectures to unpack to machine-addressable units.

The exception can be if packing gets you to fit into a cache line when it doesn’t otherwise.

True. In most cases I noticed speedup when packing, generating some data.
I am planning to try everything not-packed as well to check.
(Altough profiling is not yet in my knowledge base. Just some time measurements)

My node-trie contains a packed struct. This is around 6-10 MB. There is certainly pays off to pack.

I tried something similar to your shift technique again, but this time with the 2 bits of padding and with 8 letters instead of 7, and I was able to get the ideal assembly of a single shift and byte move. (I don’t know how efficient it is on archs other than x86_64, though).

Even if you don’t use this function, I still think you should store your letters in a 8-length array instead of a 7-length array, as that would allow them to be converted to a u64 easier, and the byte you ‘save’ with the 7-length one will be padding in a struct anyway.

fn insert_shift(letters: *[8]Letter, letter: Letter) void {
    const letters_bytes = std.mem.asBytes(letters);
    const letters_shifted: u64 = std.mem.readInt(u64, letters_bytes, .little) << 8;
    std.mem.writeInt(u64, letters_bytes, letters_shifted, .little);
    letters[0] = letter;
}

I’m using the read/write functions rather than bitcasting or pointer reinterpretation to ensure consistent output on big endian.

https://godbolt.org/z/Gx39hxjM7

Aha. also interesting code that and the assembly code as well. It looks very short!
I will write that down in my notes.
I could contemplate sacrificing some of the _reserved bits with a u4 for the length and use 8 bytes for the letters.

Because of space reasons i put my move inside a 128 bit structure.
(I need transpositiontables, movelists etc .etc.).
btw: i did not yet notice any performance difference using a 24 or 32 bit struct.
I guess space / code trade eachother off in speed.

pub const Move = struct
{
    pub const Data = packed struct(u64)
    {
        ori: Orientation = .Horizontal, // 1 bit
        mode: GenMode = .Normal, // 2 bit
        movetype: MoveType = .Normal, // 2 bit
        lane: Lane = 0, // 15 bit
        score: MoveScore = 0, // 12 bit
        anchor: Square = 0, // 8 bit
        bow: Square = 0, // 8 bit
        eow: Square = 0, // 8 bit
        _reserved: u9 = 0, // 9 spare
    };

    data: Data = .{}, // 8 bytes
    letters: [7]Letter = @splat(Letter.empty), // 7 bytes,
    len: u8 = 0, // 1 byte
}

Ah, I forgot that the array would have an alignment of 1 and so wouldn’t need padding for the last byte.

I’m not sure what the rest of your system is like and in what context length is used so this may not be efficient, but one thing you could do is make letters be 8-long and null-terminated. This would work nicely with insertion, because you wouldn’t have to manually increment the length. The null value could be 255 bitcasted to a letter, since this would make the machinechar value be 31, which is outside of the range of 26 letters. If you need all 32 machinechar states, you could just make machinechar 6 bits long since you have 2 free bits.

Also, another plus for the null-terminated technique is, if you guarantee that all letters after the first null letter are also null, you could compute the length by making one of the reserved bits always be 0 if not null, then apply a mask + popcount to the u64 representing the letters.

Yeah somewhere in the near future the first version (after my first scrabble trial learning Zig) will be on github.
That is when i have a primitive thinking engine…

MachineChar will be 5 bits for now. No Hungarian support (which breaks down all my rules). And no support yet for languages with more than 31 letters. (value 0 is used for different purposes in different situations).

len is only used to mark the used amount of letters.
where they are put on the board is calculated by some magic bitmask in Data.

Move is quite tricky and working closely together with the move generator etc. hence its strange structure.

I am using popcount on data.lane.

Just as a note mostly for myself in here… I think I found the fastest solution.
And no need for nested extra struct.

pub const Move = packed struct(u128)
{
    // fake array of Letter
    letters: u56 = 0,
    len: u3 = 0,
    // more stuff...

    fn slice_mut(self: *Move) [*]Letter
    {
        const ptr: [*]Letter = @ptrCast(@constCast(&self.letters));
        return ptr;
    }

    pub fn insert(self: *Move, letter: Letter) void
    {
        assert(self.len < 7);
        self.letters <<= 8;
        self.slice_mut()[0] = letter;
        self.len += 1;
    }
}