Memory layout suggestions for Tcl interpreter

Hi! I recently started porting a Tcl interpreter to Zig. I'm basing my overall design on Jimtcl, which I've worked with a lot before. As I've started porting over the main data structures, though, I've been thinking a lot about how I could make them more cache-friendly.

Just a quick description of how objects work in Tcl: everything is a string in Tcl, or at least appears to be a string. In actuality, Tcl stores an internal representation of that string (integer, boolean, list, etc.) and lazily generates strings when needed.
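
Here's a tiny Zig sketch of that dual representation (the names are mine, just for illustration):

const std = @import("std");

/// Sketch of Tcl's dual representation: the internal rep does the work,
/// and the string form is only generated (and then cached) when asked for.
const Value = struct {
    int_rep: i64,
    string_rep: ?[]u8 = null,

    fn getString(self: *Value, allocator: std.mem.Allocator) ![]const u8 {
        if (self.string_rep == null) {
            self.string_rep = try std.fmt.allocPrint(allocator, "{d}", .{self.int_rep});
        }
        return self.string_rep.?;
    }
};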

Now, to how Tcl is implemented in Jim. Jim is very much a sea of pointers. Take a list, for example: you have the head of the list (a Jim_Obj), which has a pointer to an array. That array then has pointers to each of the elements. So you have two levels of indirection already, and items spread out wherever the allocator felt like putting them. Each Jim_Obj is also fairly large (clocking in at 72 bytes on a 64-bit machine, it's already straddling two cache lines). I'd really like to reduce both the size and the indirection.
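
To make the indirection concrete, here's roughly the shape in Zig (field names are illustrative; the real Jim_Obj is a C struct with more fields, which is how it reaches 72 bytes):

/// Rough shape of a Jim-style object (illustrative, not Jim's real struct).
const JimObj = struct {
    ref_count: i32,
    type_ptr: ?*const anyopaque, // vtable-style type descriptor
    bytes: ?[*]u8, // cached string representation
    length: i32,
    internal: union {
        int_value: i64,
        list: struct {
            elements: [*]*JimObj, // one pointer to an array of pointers:
            len: i32, // two hops before you reach any element
            capacity: i32,
        },
    },
};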

I just finished watching Andrew Kelley’s Practical Data Oriented Design talk, and it has gotten the wheels turning. Some ideas I’ve come up with so far are:

  • Allow objects to be stack allocated. Whenever a piece of code needs to start reference counting a stack allocated object (for example, putting it in a list), it’ll be copied and then ref counted (all objects are immutable in Tcl so this is valid). I think this is a pretty easy case, though it could lead to a lot of copying. I’ve done this by implementing a more general “borrow” concept, instead of just incrementing the ref count. Example
  • Store objects directly in the list, instead of object pointers. Also not bad, but again I’m worried it’ll lead to a lot of copying, because the individual items can’t be reference counted. Example
  • Have a reference object type. Small objects would be stored inline, but large objects would have a level of indirection (see the sketch just after this list). The only downside I can think of is the extra logic to handle this edge case.
  • Using indexes instead of pointers. This was something that really stood out to me in Andrew's talk. My biggest issue with this is how one would allocate a contiguous block of objects and avoid fragmentation over time. I don't know how else I would be able to implement list elements anyway. Should I have a buddy allocator sitting on top of an array of objects?
  • Using a tag for common object types, and only using a type pointer for custom types. This should potentially help with indirection too? Example
  • If all list items are of the same type, storing only the bodies of the objects, and keeping one tag for all of them. I tried that here, but it ended up being such a pain that I don’t know if it’ll be worth it in the long run.
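
Here's a minimal sketch of the reference-object idea (the third bullet), with hypothetical names:

/// Idea 3 sketched: values that fit in 8 bytes live inline; anything bigger
/// pays one level of indirection through a ref-counted box.
const ValueSketch = union(enum) {
    int: i64,
    boolean: bool,
    boxed: *Boxed, // large strings, lists, dicts, ...
};

const Boxed = struct {
    ref_count: u32,
    bytes: []u8,
};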

So, what makes interpreters fast and cache-friendly? I know that this is not the most actionable post, but I wanted to get a sense of what the state of the art is with interpreters! The only examples I've seen so far (Python and Tcl) have been a sea of pointers, and I'm not really sure where else to look.

2 Likes

Welcome to Ziggit @smj-edison!

I converted your links to actual links; the link limit only applies to really fresh users.

I think that would be similar to what is called an archetype-style ECS (Entity Component System) in game dev, so reading about those might be interesting to you, if you want to explore that direction further.


I am a bit confused about your body union; with DOD I would expect something more group oriented, which organizes values into arrays of concrete types instead of arrays of heterogeneous values.

Alternatively, you could devise something more similar to a bytecode interpreter format with variable-length encoding, if you really want heterogeneous values. I guess you would have to try out both approaches, and it would probably depend on the use case which one is better.

Here is a rough sketch of one approach I find useful for interpreters that shouldn’t waste a bunch of memory space on small values. With this, what was an Object before is now a Handle, which is basically 5 bytes and can have an arbitrary amount of extra bytes if you indirect it to a tag specific array.

const std = @import("std");

// NOTE this is a rough sketch
// NOTE there are a bunch of things I don't know
// NOTE so take it with a grain of salt

pub const Kind = packed struct {
    pub const Flags = packed struct {
        /// Whether this object can be ref counted (else it needs to be cloned)
        ref_counted: bool,
        /// If shared across threads
        cross_thread: bool = false,
        /// Whether the body can be mutated (else it needs to be cloned)
        mutable: bool,

        pub const local_immutable: Flags = .{ .ref_counted = false, .mutable = false };
    };
    pub const Tag = enum(u5) { // NOTE only u4 needed
        none,
        index,

        // NOTE if return_code is really just a bool its tag could be packed together with
        // none something like code: enum {none, return_true, return_false} with three
        // interned instances in the handles array
        //
        // that would save one value for this enum, allowing it to fit in a u3
        // making the Tag smaller could allow for more Flags or alternatively we
        // could pack the .kind and the .index in the Handle together into a single u32
        //
        // u3 tag + u29 index
        // then Interpreter becomes a FlagGroup (Data that shares the same flags)
        // and you can create 8 of them for all the permutations of the 3 flags
        return_code,
        number,
        float,
        // string,
        // dictionary,
        // /// Non-compact list
        // list,
        // custom_type,
    };
    flags: Flags,
    tag: Tag,

    pub fn li(tag: Tag) Kind {
        return .{ .flags = .local_immutable, .tag = tag };
    }
};

pub const Object = struct {
    pub const Body = packed union {
        nothing: void,

        /// Array index
        index: u32,
        return_code: bool,
        number: i64,
        float: f64,
        // string: packed struct {
        //     /// bytes should always equal Head.bytes (why deduplicate? in case
        //     /// there's a compact list of strings--the Head is not stored)
        //     bytes: ?[*:0]u8,
        //     byte_length: u32,
        //     /// If = maxInt(u32), it means the length has not been determined
        //     utf8_length: u32,
        // },
        // list: List,
        /// Compact list stores only the body
        compact_list: CompactList,
        // dictionary: packed struct {
        //     /// Every object must have a pregenerated string representation in the hash map
        //     hash_map: *HashMap,
        // },
        // custom_type: packed struct {
        //     type_ptr: *anyopaque,
        //     value: *anyopaque,
        // },

        const List = packed struct {
            elements: [*]Handle, // NOTE also could potentially be an offset
            capacity: u32,
            length: u32,
        };
        const CompactList = packed struct {
            elements: [*]Handle.Index,
            capacity: u32,
            length: u32,
        };

        pub const HashMap = std.ArrayHashMapUnmanaged(Handle, Handle, Handle.HashContext, true);
    };

    kind: Kind,
    body: Body,
};

pub const Handle = struct {
    pub const Index = enum(u32) { _ };

    kind: Kind,

    /// NOTE if kind == .index  stores directly the target index
    /// NOTE if kind == .return_code  stores directly the return_code
    /// NOTE otherwise it is an index into a type specific array
    index: u32,

    pub const HashContext = struct {
        i: *Interpreter,

        pub fn hash(self: HashContext, obj: Handle) u32 {
            var hasher = std.hash.Wyhash.init(0);
            std.hash.autoHash(&hasher, obj.kind);
            switch (obj.kind.tag) {
                .none => {},
                .index, .return_code => std.hash.autoHash(&hasher, obj.index),
                .number => std.hash.autoHash(&hasher, self.i.numbers.items[obj.index]),
                .float => std.hash.autoHash(&hasher, self.i.floats.items[obj.index]),
            }
            return @truncate(hasher.final()); // Wyhash yields u64; the map wants u32
        }

        pub fn eql(self: HashContext, lhs: Handle, rhs: Handle, b_index: usize) bool {
            _ = b_index; // required by the array hash map context interface
            if (@as(u8, @bitCast(lhs.kind)) != @as(u8, @bitCast(rhs.kind))) return false;

            return switch (lhs.kind.tag) {
                .none => true,
                .index, .return_code => lhs.index == rhs.index,
                .number => self.i.numbers.items[lhs.index] == self.i.numbers.items[rhs.index],
                .float => self.i.floats.items[lhs.index] == self.i.floats.items[rhs.index],
            };
        }
    };
};

const Interpreter = struct {
    const Handles = std.MultiArrayList(Handle);

    allocator: std.mem.Allocator,
    handles: Handles,

    numbers: std.ArrayList(i64),
    floats: std.ArrayList(f64),
    // ... more type specific arrays

    pub fn init(allocator: std.mem.Allocator) Interpreter {
        return .{
            .allocator = allocator,
            .handles = .empty,

            .numbers = .empty,
            .floats = .empty,
        };
    }

    const interned = [_]Handle{
        .{ .kind = .li(.return_code), .index = 0 },
        .{ .kind = .li(.return_code), .index = 1 },
        .{ .kind = .li(.none), .index = 2 },
    };
    /// creates interned instances
    pub fn setup(self: *Interpreter) !void {
        for (interned) |i| {
            try self.handles.append(self.allocator, i);
        }
    }

    pub fn deinit(self: *Interpreter) void {
        self.handles.deinit(self.allocator);
        self.numbers.deinit(self.allocator);
        self.floats.deinit(self.allocator);
    }

    pub fn add(self: *Interpreter, object: Object) !Handle.Index {
        try self.handles.ensureUnusedCapacity(self.allocator, 1);
        const new: Handle.Index = @enumFromInt(self.handles.len);
        const index: u32 = res: switch (object.kind.tag) {
            // NOTE if none is not mutable we can instead return the index
            // to the one constant interned instance which has a hardcoded index
            .return_code => return if (object.body.return_code) @enumFromInt(0) else @enumFromInt(1),
            .none => return @enumFromInt(2),
            .index => object.body.index,
            .number => {
                const i: u32 = @intCast(self.numbers.items.len);
                try self.numbers.append(self.allocator, object.body.number);
                break :res i;
            },
            .float => {
                const i: u32 = @intCast(self.floats.items.len);
                try self.floats.append(self.allocator, object.body.float);
                break :res i;
            },
        };
        self.handles.appendAssumeCapacity(.{
            .kind = object.kind,
            .index = index,
        });
        return new;
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();
    defer _ = gpa.deinit();

    var interpreter: Interpreter = .init(allocator);
    defer interpreter.deinit();
    try interpreter.setup();

    _ = try interpreter.add(.{ .kind = .li(.none), .body = .{ .nothing = {} } });
    _ = try interpreter.add(.{ .kind = .li(.index), .body = .{ .index = 5 } });
    _ = try interpreter.add(.{ .kind = .li(.float), .body = .{ .float = 1244.5 } });
}

There are a bunch of things about this that would require more experimentation, but hopefully this gives you some ideas to play with. I haven’t done anything tcl related, so I am not very familiar with it, but it always seemed like an interesting language to me with its “everything is a string” shenanigans.

3 Likes

Wow, thank you for the incredibly thorough analysis!

That’s fair. Those are the vestiges of Jim’s design carrying over to this implementation!

Tcl does have a fair number of heterogeneous lists, since all data structures have to be built from a combination of primitive types. Here’s an example:

set bouncingBall [dict create \
      position [list 100 100] \
      enabled false]

This would be the JSON equivalent of:

{
    "position": [100, 100],
    "enabled": false
}

Lists intermingled with other types are a pretty common idiom in Tcl, so unfortunately heterogeneous lists are the norm, with a compact representation as the exception.

The other thing that can be tricky in Tcl is that changing the internal representation is fairly common, for example:

set foo [list a 10 b] ;# starts as a list
lappend foo 20
set x [dict get $foo b] ;# converts to a dictionary internally before returning

This sounds really intriguing! I’m trying to think of what I know about bytecode interpreters, but I’m not quite understanding how that would apply to value packing. Would you mind elaborating?

Okay, now to the code!

pub const Kind = packed struct {

I really liked how you separated out this header information from the other parts of the object.

pub const local_immutable: Flags = .{ .ref_counted = false, .mutable = false };

This is a neat trick to avoid repetitive flag typing and mistakes!

// then Interpreter becomes a FlagGroup (Data that shares the same flags)
// and you can create 8 of them for all the permutations of the 3 flags

Wait, so you’d have 8 interpreters, one for each of the permutations? Or 8 enum values?

pub const Object = struct {

I noticed you removed the cached string representation—is that because you also removed the heterogeneous list type? You’re right that caching a string representation of a number is a bit ridiculous, but it can come in handy if you have a large nested list, as the string can become quite long. I suppose you could always store those strings in an array list in the interpreter!

const CompactList = packed struct {
    elements: [*]Handle.Index,
    capacity: u32,
    length: u32,
};

Okay, now I’m starting to see how this list type would work. You’d still have some indirection, but you also wouldn’t need to lay out the objects contiguously.

pub const Handle = struct {

I like this design a lot better than what I had with my Ref attempt, both naming-wise and design-wise.

pub const Index = enum(u32) { _ };

This is another neat trick, essentially creating an opaque u32!
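
For example, the non-exhaustive enum gives a distinct type that still round-trips to a u32 (using the Handle from the sketch above):

test "Handle.Index is a distinct type" {
    const idx: Handle.Index = @enumFromInt(42); // from a raw u32
    try std.testing.expectEqual(@as(u32, 42), @intFromEnum(idx)); // and back out
    // const bad: Handle.Index = 42; // compile error: it is not a bare u32
}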

const interned = [_]Handle{
    .{ .kind = .li(.return_code), .index = 0 },
    .{ .kind = .li(.return_code), .index = 1 },
    .{ .kind = .li(.none), .index = 2 },
};

It makes sense to intern the stateless values, so you don’t need to waste values in the tag.

Some closing thoughts:

  • I just realized that indexes might not play nice with multithreading. Is there a reasonable lock-free way to grow an array, or should I have all the object arrays statically allocated?
  • I think heterogeneous lists will show up quite often, so I’ll need to think some more about the best way to represent them.

Thank you again for taking the time to look over the code, and for sharing your own experience with interpreters.

1 Like

I am thinking about something along the lines of Efficient instantiation of union of pointers, but instead of having bytecode operations you would have things like:

const Kind = enum {
    return_false,
    return_true,
    none,
    index,
    number,
    float,
    // [snip]
};
const Index = u32;

const Data = union(Kind) {
    return_false: void,
    return_true: void,
    none: void,
    index: *align(1) const Index,
    number: *align(1) const i64,
    float: *align(1) const f64,
    // [snip]
};

That way you would end up with a data stream like this:

03 in de x_ __ 02 00 05 fl oa tv al ue __ __ __ ...

Data would be used to provide a view into the [Kind Data] ... stream (instead of into an instruction stream); for details see the link. I think something like this could make sense for serializing/deserializing data, or for data that is pretty much read-only. It also could be switched between this representation and the one I wrote about in my post, based on how the data is used/accessed (and also based on which representation turns out to be more compact and efficient for different kinds of data and usage patterns).
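
For illustration, a view function over that stream could look something like this (using the Kind/Data/Index types above; the real details are in the linked post):

/// Decode one [kind, payload] record starting at `offset` (rough sketch).
fn view(stream: []const u8, offset: usize) Data {
    const kind: Kind = @enumFromInt(stream[offset]);
    return switch (kind) {
        .return_false => .return_false,
        .return_true => .return_true,
        .none => .none, // no payload bytes follow
        .index => .{ .index = @ptrCast(&stream[offset + 1]) },
        .number => .{ .number = @ptrCast(&stream[offset + 1]) },
        .float => .{ .float = @ptrCast(&stream[offset + 1]) },
        // [snip] further kinds would follow the same pattern
    };
}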

That said, this is mostly an idea I had because I still had the linked post in the back of my mind; I would have to spend more time exploring it and testing usage scenarios to find out whether it is a good fit for use in an interpreter.


From a user point of view I would see the interpreter as the thing that combines the whole machinery, so I would think of the interpreter as the API surface of your project, and it would use those 8 groups.

If you then add multi-threading into the mix, I think you potentially could have something like this where shared_groups could be shared memory for some groups and use some kind of merge/unification mechanism for other groups:

pub const Interpreter = struct {
    pub const Thread = struct {
        // (Group = the per-flag-permutation storage discussed above)
        local_groups: [4]*Group,
        // other thread-local stuff

        shared_groups: [4]*Group,
        // because your flags already categorize what can move across threads,
        // it would make sense that only the groups which have cross_thread = true
        // have associated synchronization mechanisms here
        // further, those synchronization mechanisms could be tailored to the group:
        // for mutable ones you could have some kind of fine-grained locking or optimistic thing?
        // for immutable ones you could have some kind of log-structured merge?
    };
    threads: std.ArrayList(Thread),
    // other thread synchronization and global stuff
    // mechanisms for cross-thread synchronization
};

For threading I still have a lot of practical work and experimentation ahead of me, so I definitely need to collect more experience there myself.

So I am not super sure about how to structure multi-threading things. From a theoretical standpoint (working on a scripting language) I would want to focus on things working in an Erlang-ish way: less shared state used by multiple threads, and more each green thread having its own thing to work on and then handing results to other threads.

Regarding this high-level organization, I can’t really say a lot more without starting to write a Tcl interpreter myself (and finding out what works well, but I am working on something else). I hope it makes sense; let me know if it doesn’t.

(But who knows what remains of my aspirations once I actually implement, use and stress test them, in my own projects)


I mostly removed the cached string representation because I am not deep enough down the rabbit hole to have a good intuition about how to implement it properly and where this cache should sit exactly. Also, I wanted a small single-file example that runs.

I currently suspect that those threads should collect some kind of metrics about where and how those string representations are used, and that those measurements could then, at runtime, switch policy flags on and off that decide whether certain types’ string representations get cached or not.

So basically my approach would be to remove caching, see where sh*t hits the fan and then reintroduce caching, while in parallel building tools for introspecting performance and measuring what is going on.

Also I think it would be important to add caching in a way where you can easily enable/disable it entirely. So I think I would want it to be in the form of a parallel array, or some kind of sparse data structure (Bloom filter, hash map / HAMT, bitset, sparse set) parallel to the uncached core data, that adds the cached representation when it is needed, instead of always.

I think you also could have a variant that is basically just an array of one type; here I just changed the type you had to what it would become with the Handle/Index change.
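
Something along these lines, reusing Kind and Handle.Index from my sketch above (field names hypothetical):

/// Homogeneous list: the Kind is stored once for the whole list; the elements
/// are just indices into that kind's type-specific array (e.g. `numbers`).
const HomogeneousList = struct {
    element_kind: Kind,
    elements: std.ArrayList(Handle.Index) = .empty,
};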

I think you would need to check in practice which representations are good.

I think this could be good for data with chaotic locality (which is kind of what I expect from the heterogeneous list usage you described as typical in Tcl), but I also could see a type that requires contiguous memory being useful, if somebody wants to write Tcl in a way that doesn’t mix types, to basically opt in to better cache locality.

So it might be nice if the interpreter could detect and separate between different usage scenarios; I think this would basically be profile-guided optimization.

I prefer handles much more than pointers, because with handles you save space and you just have a lot more ways to deal with problems; it also can become way easier to save the state to disk and read it in again (which is nice if you want to implement live editing environments). This blog post by @floooh, Handles are the better pointers, is a good read and describes a lot of the good things about using handles.


I think for that it is useful to start thinking from a systems perspective, where some central system is in control of what happens to a big bunch of data (this is described a bit in @floooh’s blog post), for example many parallel arrays of data that are all related.

That way the system knows the data, actually knows what level of granularity and parallelism is possible, and can decide whether it can, for example, process all elements in batches by handing them to a number of threads.

The system also can coordinate whether some handle can be written to by a particular thread (for example, if you are batching, each thread can only write to handles that are part of its batch, and writes from outside would be delayed: some other system could prepare data that then gets written once the batching is done).

Basically my questions would be:

  • What does it mean to grow an array? What kind of array?
  • A heterogeneous list with a capacity? → get a new one with bigger capacity → copy the old data → release the old one (all of that could happen hidden behind the handle)
  • Who grows the list, and when? Maybe the system should do the growing.

Regarding lock-free, I don’t really know; I think in a lot of cases it is better to not share unnecessarily to begin with, so maybe just share safe handles.

Wanting lock-free makes me wonder whether you want to keep the “sea of pointers” approach, where everything chaotically writes to everything else. I would heavily advise against this, mostly because I find it extremely hard to reason about; I also have no idea how to even begin to optimize or analyze such a thing. With the systems approach it seems more like you keep things organized and grouped, so that you can measure things, see which groups are slow, and then optimize those specifically.

But if multiple people want to write one thing, then I think there either needs to be some sort of synchronization, or atomic operations, or accumulating/monotonic operations?

I think with systems you can do that through ordering data dependencies between systems. (I still haven’t hit big scales yet, so that is an area where I still have a lot to learn in practice)

Towards the end I mix Handle / ECS / Interpreter aspects a bit, mostly because that is what I experiment with and find it somewhat difficult to keep them clearly separated. I think the Interpreter could possibly be seen as a system.

Oops, sorry for the long answer :sweat_smile:

4 Likes

I’d love to write more about this and discuss it in depth, but for now I can only provide a link to the heap.zig file of a Lisp interpreter I wrote years ago, which I think has quite an interesting & novel memory layout, thoroughly inspired by that very talk on data-oriented design: wisp/core/heap.zig at master · mbrock/wisp · GitHub

The idea is that we use 32 bit tagged pointers for all values. Half of those are just immediate 31 bit numbers. Some other tags are also used for smaller immediate values:

pub const Tag = enum(u5) {
    int = 0x00, // 31-bit fixnum

    sys = 0x11, // static constant value
    chr = 0x12, // unicode codepoint
    jet = 0x13, // builtin operator
    pin = 0x1f, // gc-pinned value

    duo = 0x15, // cons pair pointer
    sym = 0x16, // symbol pointer
    fun = 0x17, // closure pointer
    mac = 0x18, // macro pointer
    v32 = 0x19, // word vector pointer
    v08 = 0x1a, // byte vector pointer
    pkg = 0x1b, // package pointer

    run = 0x1c, // evaluation state
    ktx = 0x1d, // evaluation context

    ext = 0x1e, // external object
};

The first five types are immediates; the rest are pointers.

pub const Ptr = packed struct {
    pub const Idx = u26;
    /// One-bit semispace flag for the copying GC (see below); the exact
    /// declaration is paraphrased here so the snippet stands alone.
    pub const Era = enum(u1) { e0, e1 };

    era: Era,
    idx: Idx,
    tag: Tag,

    pub fn make(tag: Tag, idx: Idx, era: Era) Ptr {
        return .{ .era = era, .idx = idx, .tag = tag };
    }

    pub fn from(x: u32) Ptr {
        return @as(Ptr, @bitCast(x));
    }

    pub fn word(ptr: Ptr) u32 {
        return @as(u32, @bitCast(ptr));
    }
};

Each pointer type has its own storage: cons cells in one bucket, closures in another, and so on. This layout lets me have ~64 million entries for each pointer type—which ought to be enough for anyone, and if not, the whole scheme can be upgraded to use 64 bit words without much difficulty.

Now those buckets could easily just be ArrayLists, so

conses: ArrayList(struct { car: u32, cdr: u32 })

and so on, but instead, each object type has a MultiArrayList.

This means that the fields of each object type are contiguous arrays. So every cdr in the heap is in one array. The object type columns are like this:

/// Each pointer type has a set of columns.
pub fn ColEnum(comptime t: Tag) type {
    return switch (t) {
        .int, .sys, .chr, .jet, .pin => void,
        .duo => enum { car, cdr },
        .sym => enum { str, pkg, val, fun, dyn },
        .fun => enum { env, par, exp, sym, cnt },
        .mac => enum { env, par, exp, sym, cnt },
        .v08 => enum { idx, len },
        .v32 => enum { idx, len },
        .pkg => enum { nam, sym, use },
        .ktx => enum { hop, env, fun, acc, arg },
        .run => enum { exp, val, err, env, way },
        .ext => enum { idx, val },
    };
}

All the fields are just 32 bit words. So the heap ultimately consists of 36 arrays of 32-bit tagged pointer words. Zig’s SoA MultiArrayList in fact keeps the field arrays contiguous with each other, so it’s really only 10 allocations. That’s for the structures; a separate ArrayList(u8) keeps string bytes, and an ArrayList(u32) keeps arbitrary-length vector data, as backing stores for the v08 and v32 values.

Then there’s the garbage collector keeping the heap tidy. wisp/core/tidy.zig at master · mbrock/wisp · GitHub

It’s a Cheney-style copying two-space collector that essentially runs a breadth-first scan of the live values, moving everything into a fresh set of MultiArrayLists. This discards garbage in time proportional only to the live data—and simultaneously compacts the data. Right after the GC is done, the cdr array will be packed with only live pointers—in traversal order, so if the root set references a certain long linked list, the whole tail of that list will likely be a contiguous sequence.
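
In sketch form, the core of the algorithm looks something like this (a self-contained toy over a single cons column, not the actual tidy.zig code):

const std = @import("std");

// Toy two-space Cheney collector over cons cells only. NIL doubles as
// "no forwarding address yet", since maxInt(u32) is never a valid index.
const NIL: u32 = std.math.maxInt(u32);
const Cell = struct { car: u32, cdr: u32 }; // indices into the cell column, or NIL

fn collect(allocator: std.mem.Allocator, from: []const Cell, roots: []u32) ![]Cell {
    var to: std.ArrayList(Cell) = .empty;
    errdefer to.deinit(allocator);
    const forward = try allocator.alloc(u32, from.len);
    defer allocator.free(forward);
    @memset(forward, NIL);

    // Evacuate the roots, then scan the new space breadth-first: every copied
    // cell appends its children at the end, so `scan` chases a growing queue.
    for (roots) |*r| r.* = try copy(allocator, &to, from, forward, r.*);
    var scan: usize = 0;
    while (scan < to.items.len) : (scan += 1) {
        to.items[scan].car = try copy(allocator, &to, from, forward, to.items[scan].car);
        to.items[scan].cdr = try copy(allocator, &to, from, forward, to.items[scan].cdr);
    }
    return to.toOwnedSlice(allocator); // packed, in traversal order
}

fn copy(allocator: std.mem.Allocator, to: *std.ArrayList(Cell), from: []const Cell, forward: []u32, ref: u32) !u32 {
    if (ref == NIL) return ref; // immediate/nil: nothing to move
    if (forward[ref] != NIL) return forward[ref]; // already evacuated
    const new_index: u32 = @intCast(to.items.len);
    try to.append(allocator, from[ref]); // copy the live cell; garbage is never touched
    forward[ref] = new_index;
    return new_index;
}

(The real collector uses the era bit to tell old pointers from new ones instead of a side table, but the traversal is the same idea.)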

A major benefit of this heap structure is that there are no pointers anywhere. The whole thing can be trivially serialized to disk with a single syscall—just gather up the slices into iovecs and hand them all to a writev. Loading the heap from disk is two syscalls: first read the header to know how much to allocate per slice, then do a single readv. This also works in WebAssembly, etc—the Lisp system can instantly save itself to local storage in the middle of an execution.
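
The dump side, schematically (a hypothetical helper, not the actual wisp code):

/// Gather every column slice into iovecs and hand them to one writev call.
/// (A robust version would loop on short writes.)
fn dump(fd: std.posix.fd_t, columns: []const []const u8) !void {
    var iovecs: [64]std.posix.iovec_const = undefined; // enough for all columns
    for (columns, 0..) |col, i| {
        iovecs[i] = .{ .base = col.ptr, .len = col.len };
    }
    _ = try std.posix.writev(fd, iovecs[0..columns.len]);
}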

(The one-bit era field of the packed pointer structure is needed for the two-space collector logic to know, as it’s traversing, whether any given pointer is already in the new arena or not. It flips between 0 and 1 with every collection cycle. Since this metadata is also serialized, we can dump and load at any state, though usually you’d want to dump right after a GC to avoid saving dead data.)

Is it the best heap layout in general? Probably not! Have I benchmarked it to verify the performance benefits? Not at all! Is it kind of cool? I really think so!

Of course this doesn’t even really begin to address your actual questions! Your interpreter, I guess, would have most of its data in what I call the v32 blob, and in my scheme that means you’d also have separate handle values allocated in the v32 object table. If my Lisp interpreter is used in a way that relies heavily on such “vectors” or “structs”, it’d also entail a lot of indirection.

I think one way to cope with that would be to design a way for the heap to have a bunch of different vector blobs, corresponding to the lengths of small vectors—like, say, each vector size between 3 and 8 gets its own pointer tag and its own blob in the heap—so the index is immediately in the handle word, not indirectly in an idx+len table.

6 Likes

Don’t apologize about it being too long—this was exactly the information I needed!


With regards to bytecode encoding, that does make sense with serialization, but I’m thinking it wouldn’t work well for arbitrary indexing. Definitely a cool idea though!


With regards to having different object buckets based on the various flags, it really got the wheels turning. I really like the idea of having one bucket for the shared objects, and another bucket for each of the thread’s local objects. That way when an object gets passed between threads, it gets moved to the shared bucket. This could cause issues with large objects being copied, but at least for what I’m using Tcl for it shouldn’t be an issue.

The ref_counted flag would still need to be included in the object header though, since that marks whether an object is stack allocated or not (so I can allocate an object on the stack, pass said object to any function, and that function won’t free it). Well, maybe it doesn’t need to be in the object itself. But it would need to be passed in as the object’s context. (Maybe something like RefWithCtx? Or honestly Handle is still a great name.)


With respect to the cached string representation, I’ve thought it over some more, and I’ve realized that the string representation is vital. My reasoning is that a string is often the source of truth, and when it’s initially passed in, you can’t infer what type it is. Is “Hello World!” a user string, a two-item list, or a dictionary? It all depends on how it’s used later in the code. Tcl really does have just one type, and that’s a string.

Having string ranges in a single ArrayList makes a lot of sense though, especially since you wouldn’t need a 64-bit pointer any more (@mbrock’s representation of strings looks interesting with the v08 field). You could even have small strings inside of the 8-byte object body.
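
For instance, a string handle could look something like this (hypothetical; a real version would pack the tag into spare bits to hit exactly 8 bytes):

/// Small-string optimization: short strings live inline in the handle,
/// longer ones are a range into one shared byte buffer (like v08).
const Str = union(enum) {
    small: struct { len: u8, bytes: [6]u8 }, // stored right in the handle
    range: struct { offset: u32, len: u32 }, // into a shared ArrayList(u8)
};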

That’s a good point; I’ll definitely have to run some profiling on this!


With respect to multithreaded handles and arrays, I’ve been reading a bit about how Linux handles memory, and it looks like I could just allocate 8 GB of virtual address space, and it won’t actually take any physical memory until I write to it. Honestly, it makes the most sense to just allocate all multithreaded memory up front, so I never have to deal with the headaches of synchronizing a moving array. It seems this is how WASM runtimes work already. For thread-local storage it makes sense to have a growable array though (especially if I’m planning to compile the single-threaded version to embedded systems at some point).
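
A sketch of the up-front reservation I have in mind (Linux-specific, untested):

const std = @import("std");

/// Reserve a large virtual range up front; demand paging means pages only
/// consume physical memory once they are first written to.
fn reserveBytes(len: usize) ![]u8 {
    return std.posix.mmap(
        null,
        len,
        std.posix.PROT.READ | std.posix.PROT.WRITE,
        .{ .TYPE = .PRIVATE, .ANONYMOUS = true },
        -1,
        0,
    );
}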

This is a great point and I’ll try to remember to keep things centralized.

Since values are immutable in Tcl already, it’s not too bad to do this, as it’s just copy-on-write. Jim uses an optimization where, if there’s only one reference, it’ll mutate in place; otherwise it’ll copy. Recursive structures are also not allowed, so atomic reference counting works pretty well.
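
That optimization looks roughly like this in Zig (hypothetical names):

const std = @import("std");

const Obj = struct { ref_count: u32, value: i64 };

/// Jim-style copy-on-write: the sole owner mutates in place, while a shared
/// object is cloned first so other holders never observe the change.
fn getMutable(allocator: std.mem.Allocator, obj: *Obj) !*Obj {
    if (obj.ref_count == 1) return obj;
    obj.ref_count -= 1; // caller gives up its share of the original
    const copy = try allocator.create(Obj);
    copy.* = .{ .ref_count = 1, .value = obj.value };
    return copy;
}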


Okay, now to @mbrock’s post:

That is insanely packed! I’m quite impressed you were able to squeeze so much out of a single u32.


This is interesting: did you ever get cache-hit numbers on this? I could see it both working really well because everything’s so packed, and also choking on a larger program’s data.


How did you handle freeing and fragmentation? Or is that part of the copying collector? I’ve looked through the code you linked, but I’m afraid I’m still pretty unfamiliar with MultiArrayList and comptime patterns :sweat_smile:.


Now this is a really cool consequence of your GC design—and it’s something that would really help with the dynamic tree structures that I see a lot in Tcl. I’m still somewhat hesitant to use GC over ref counting, mainly because I’m working with a soft realtime system (I’m using Tcl for live scripting), but if the GC doesn’t take too long that would be really promising, especially for fragmentation reduction.


Yeah. I’d love to have an actual allocator sitting on top of my objects bucket—even something as simple as a buddy allocator—so I could request variable-length ranges of objects. For example, if I had a 5-element list, I could allocate 6 contiguous objects (head object + 5 element objects). That way I could keep indirection low, and also not store a list index.
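
As a placeholder until a real allocator exists, bump-allocating runs from the bucket would look something like this (hypothetical; no reclamation yet, which is exactly where the buddy allocator would come in):

const std = @import("std");

const Object = struct { kind: u8, body: u64 }; // stand-in object type

/// Allocate `n` contiguous objects from the bucket and return the index of
/// the first one (a list head at `first` has its elements at first+1..first+n-1).
fn allocRun(allocator: std.mem.Allocator, bucket: *std.ArrayList(Object), n: usize) !u32 {
    const first: u32 = @intCast(bucket.items.len);
    try bucket.appendNTimes(allocator, undefined, n);
    return first;
}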


With all your wonderful insights in mind, I’m starting to think of an object layout something like this. (I like to draw out my data structures on paper, hope my handwriting is legible :sweat_smile:.)

Each object would be 16 bytes long. The first 8 bytes encode the string location and type tag. The second 8 bytes are the body (broken down after the next picture). The first 8 bytes can be of two types: short string and long string. A short string is an index and length pointing into a string array (similar to @mbrock’s v08). A long string is a tagged pointer, with its lowest 6 bits used to encode the tag type and the fact that it is a long string. A long string is also atomically reference counted, so it isn’t copied when moving between threads.

This is the body representation:


Some thoughts about the body layout:

  • If I can get an allocator working on top of the object buckets, I won’t even need to store the index in the List field.
  • I’m not settled on the dictionary representation, but I also don’t want too much indirection. Maybe store the dictionary fields in the same way as a list (directly after), but have the dictionary index separate?
  • I also forgot to include something like TinyString, where if the string is 8 characters or less, it could be stored right in the body

These are all still tentative ideas, so I’m open to any problems you see. And thank you for taking time out of your day to take a look at all of this!

3 Likes

By that logic, what if you didn’t have a flat string representing the object, but instead just used whatever structures are most convenient to build a rope or tree data structure that, when traversed in its logical order, results in the equivalent of the flat string representation?

That way you could do things like intern a bunch of strings that show up regularly, share fragments between different objects, or even create some data structure that finds a good tradeoff between string fragment length and deduplication. It would also make it easy to concatenate strings if you had a rope-like data structure. I guess it would require some analysis of what kinds of operations are most common in Tcl.
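
Roughly what I mean, as a sketch:

/// Rope-ish string rep: leaves can be interned/shared fragments, and
/// concatenation is O(1); the flat string only exists when flattened.
const Rope = union(enum) {
    leaf: []const u8,
    concat: struct {
        left: *const Rope,
        right: *const Rope,
        len: usize, // total length, so length queries stay cheap
    },
};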

Just an idea; I don’t really know enough about Tcl to know whether it could make sense. Maybe you can figure that out :wink:

1 Like

Yeah, that’s a great point! Especially with this design, it would be easy to extend with profiling, to get an idea of what sort of things need to be optimized.

1 Like

Just a single little note for the time being!

One thing I really like about these compacting garbage collection algorithms is that even when implemented, as I do, as a completely blocking stop-the-world process, the GC—unlike most sweepers in common use—only looks at the live data, so the runtime is exactly proportional to the number of reachable pointers. In fact the algorithm is precisely a breadth-first traversal of the pointer graph.

That means, for example, if your program generates a bunch of garbage constantly while working on some relatively small amount of live reachable data, it doesn’t even matter how often you run it—it always takes the same amount of time, it never even looks at any of the garbage at all. In this way it’s similar to using arena allocators! But you get to keep the stuff you want instead of throwing everything away!