Is there any plan to add an ord function for T and []T at std library?

lianxmfor · November 21, 2024, 8:37am

There is two kinds of ord function in std library of zig 0.13.0:

std.math.ord for T
std.mem.ord for []T

Is can provide other ord function for both T and []T? The reason for that is some scenario need a comparable field being key when implement some structure like ord Tree:

pub fn Tree(comptime K: type, comptime V: type) type {
    return struct {
        root: ?*Node,
        allocator: std.mem.Allocator,
   
        const Self = @This();

        const Node = struct {
            key: K,  // the key must be comparable, but now can't use either std.mem.order or std.math.order directly.
            value: V,
            left: ?*Node,
            right: ?*Node,
            allocator: std.mem.Allocator,

            fn insert(node: *Node) (older_value: ?V, root: *Node) {
                const order = // an order function for both T and []T;
                switch (order) {
                  .eq => // emit
                  .lt => // emit
                  .gt => // emit
                }

            }
        }

    }
}

It is great if std library have an implementation like this:

fn Ord(comptime K: type) type {
    return struct {
        pub fn ord(a: K, b: K) std.math.Order {
            if (comptime is_array_or_slice(K)) {
                const typeinfo = @typeInfo(K);
                const child_type = if (typeinfo == .Array) typeinfo.Array.child else if (typeinfo == .Pointer) typeinfo.Pointer.child else unreachable;
                return std.mem.order(child_type, a, b);
            } else {
                return std.math.order(a, b);
            }
        }

        fn is_array_or_slice(comptime T: type) bool {
            const typeinfo = @typeInfo(T);
            if (typeinfo == .Array) {
                return true;
            } else if (typeinfo == .Pointer) {
                const pointer = typeinfo.Pointer;
                return pointer.size == .Slice or pointer.size == .Many;
            }
            return false;
        }
    };
}

test Ord {
    const ord = Ord([]const u8);
    assert(ord.ord("A", "A") == .eq);
    assert(ord.ord("A", "B") == .lt);

    const ord2 = Ord(i32);
    assert(ord2.ord(1, 1) == .eq);
    assert(ord2.ord(1, 2) == .lt);
}

LucasSantos91 · November 21, 2024, 10:36am

That’s a very specific need, I don’t think the std library should bother with this. Perhaps it would be better to receive the ordering function as a parameter:

pub fn Tree(comptime K: type, comptime V: type, comptime order: fn(K, K) std.math.Order)

lianxmfor · November 21, 2024, 11:23am

A great idea and make sense for me.

mnemnion · November 22, 2024, 2:15am

Take a look at how std.PriorityQueue implements comparison, it should help.

The nice thing about Order it it’s just an enum, you can reuse it however you’d like.

lianxmfor · November 22, 2024, 3:32am

Thanks! I learned the idiotic way how zig handle the ord by std.PriorityQueue you referred.
The reason of I raise the question is that the zig is not had an ord function can handle both string and primitive type, that is confused for some new zig learner from other modern language like rust or go.
Again, thanks!

pachde · November 23, 2024, 11:49pm

Ordering strings is complex. While many languages offer a default ordering, which is good enough for implementing data structures and simple lookups, but useless for dozens of other applications of ordering text.

buzmeg · November 24, 2024, 3:53am

There is no such thing as a universal ordering of “string”. The sooner you come to grips with that, the better.

The ordering of “string” is dependent upon things like use case, language and location. For example, collation of strings can differ even for the same language depending upon which country is doing it (see: Aachen).

mnemnion · November 24, 2024, 6:04pm

This really depends on what you’re doing, and how you define ‘universal’. If by universal you mean one single collation appropriate to all cases, then sure, no such thing.

Unicode defines a standard collation, for Unicode, which simply compares the values of the codepoints. UTF-8, and the other official Unicode encodings, make this equivalent to directly comparing the codeunits.

This is universal in some sense. Since Unicode made a point of ordering scripts in a way which makes it easy to transcode to Unicode from various language-specific encodings which predate it, and that included the codepoint orderings of the original encodings, then sometimes the Unicode collation is identical or fairly close to the natural collation.

That doesn’t work for a lot of languages, in a way which is particularly acute for European languages which use the Latin alphabet with accented marks.

In other cases it works better. For example, the Soviet ordering of Cyrillic was chosen for Unicode, so Russian collates fairly well using the naïve approach, as well as it did for the system which Russians designed for their language. Armenian will collate entirely correctly if normalized to decomposed form, otherwise ligatures will interfere with this.

English doesn’t collate correctly either, for the record, we would want ‘naïve’ to come before ‘natal’ instead of after.

Unicode, and hence Unicode collation, was carefully designed to work as well as it can under the circumstances, but is necessarily a compromise. I consider that a reasonable definition of universal! The standard also defines a proliferation of locale-specific collations, for when those are needed.

pachde · November 24, 2024, 7:39pm

All of which is well beyond the scope of Zig’s std lib.

mnemnion · November 24, 2024, 7:45pm

True… sort of. A user library providing ICU-compliant locale collations could implement that using Order and a context object for standard library sorting algorithms. That would be very nice to have actually.

Still, in general ‘advanced’ Unicode handling is up to libraries. That’s the correct division of labor, if only because Zig shouldn’t need to update each time Unicode releases a new standard.

jibal · November 26, 2024, 7:37am

Yes, Unicode explicitly rejects ordering strings by numerically comparing code points:

FAQ - Collation

pachde · November 26, 2024, 8:22pm

yes, but that doesn’t stop hordes of programmers and libraries from doing so.

mnemnion · November 27, 2024, 12:23am

It just depends on what you’re doing.

If you’re building a trie, or organizing strings for binary search, any order will do in principle, and the natural codepoint order is far and away the most efficient. Most uses for string ordering a programmer will have are this sort of use. An efficient index on a database column doesn’t care about nationality.

If you’re building an index for a book or something of that nature, then yes, you can’t do that without a locale-specific collation algorithm.