I am thinking about something along the lines of *Efficient instantiation of union of pointers*, but instead of bytecode operations you would have things like:
```zig
const Kind = enum {
    return_false,
    return_true,
    none,
    index,
    number,
    float,
    // [snip]
};

const Index = u32;

const Data = union(Kind) {
    return_false: void,
    return_true: void,
    none: void,
    index: *align(1) const Index,
    number: *align(1) const i64,
    float: *align(1) const f64,
    // [snip]
};
```
That way you would end up with a data stream like this:
```
03 in de x_ __ 02 00 05 fl oa tv al ue __ __ __ ...
```
`Data` would be used to provide a view into the `[Kind Data] ...` stream (instead of into an instruction stream); for details see the link. I think something like this could make sense for serializing/deserializing data, or for data that is pretty much read-only. It could possibly even be switched between this representation and the one I wrote about in my post, based on how the data is used/accessed (and also based on which representation turns out to be more compact and efficient for different kinds of data / usage patterns).
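To make the idea more concrete, here is a hedged sketch of what such a view function could look like; `view` is a name I made up, and the exact stream layout is described in the linked post, so treat this as an illustration rather than that scheme:

```zig
const Kind = enum(u8) { return_false, return_true, none, index, number, float };
const Index = u32;

const Data = union(Kind) {
    return_false: void,
    return_true: void,
    none: void,
    index: *align(1) const Index,
    number: *align(1) const i64,
    float: *align(1) const f64,
};

/// Decode one [Kind Data] entry starting at `pos`; the pointer members
/// point directly into the (possibly unaligned) byte stream, which is
/// why the union uses `*align(1)` pointers.
fn view(stream: []const u8, pos: usize) Data {
    const kind: Kind = @enumFromInt(stream[pos]);
    const payload = stream.ptr + pos + 1;
    return switch (kind) {
        .return_false => .return_false,
        .return_true => .return_true,
        .none => .none,
        .index => .{ .index = @ptrCast(payload) },
        .number => .{ .number = @ptrCast(payload) },
        .float => .{ .float = @ptrCast(payload) },
    };
}
```

The void tags carry no payload at all, which is where the compactness comes from: common values like `none` cost a single byte in the stream.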
That said, this is mostly an idea I had because I still had the linked post in the back of my mind; I would have to spend more time exploring it and testing usage scenarios to find out whether it is a good fit for use in an interpreter.
> // then Interpreter becomes a FlagGroup (Data that shares the same flags)
> // and you can create 8 of them for all the permutations of the 3 flags
>
> Wait, so you’d have 8 interpreters, one for each of the permutations? Or 8 enum values?
From a user's point of view I would see the interpreter as the thing that combines the whole machinery, so I would think of the interpreter as the API surface of your project, and it would use those 8 groups.
If you then add multi-threading into the mix, I think you could potentially have something like the following, where `shared_groups` could be shared memory for some groups, with some kind of merge/unification mechanism for other groups:
```zig
pub const Interpreter = struct {
    pub const Thread = struct {
        local_groups: [4]*Group,
        // other thread-local stuff
        shared_groups: [4]*Group,
        // because your flags already categorize what can move across threads,
        // it would make sense that only the groups which have cross_thread = true
        // have associated synchronization mechanisms here.
        // further, those synchronization mechanisms could be tailored to the group:
        // for mutable ones you could have some kind of fine-grained locking or optimistic scheme?
        // for immutable ones you could have some kind of log-structured merge?
    };

    threads: std.ArrayList(Thread),
    // other thread synchronization and global stuff
    // mechanisms for cross-thread synchronization
};
```
For threading I still have a lot of practical work and experimentation ahead of me, so I definitely need to collect more experience there myself.
So I am not super sure about how to structure multi-threading things. From a theoretical standpoint (working on a script language) I would want to focus on things working in an Erlang-ish way: less shared state used by multiple threads, and more each green thread having its own thing to work on and then handing results to other threads.
Regarding this high-level organization, I don't think I can say much more without starting to write a Tcl interpreter myself (and finding out what works well, but I am working on something else). I hope it makes sense; let me know if it doesn't.
(But who knows what remains of my aspirations once I actually implement, use, and stress-test them in my own projects.)
> I noticed you removed the cached string representation—is that because you also removed the heterogeneous list type? You’re right that caching a string representation of a number is a bit ridiculous, but it can come in handy if you have a large nested list, as the string can become quite long. I suppose you could always store those strings in an array list in the interpreter!
I mostly removed the cached string representation because I am not deep enough down the rabbit hole to have a good intuition about how to implement it properly and where exactly this cache should sit. Also, I wanted a small single-file example that runs.
I currently suspect that those threads should collect some kind of metrics about where and how those string representations are used, and that those measurements could then, at runtime, switch policy flags on and off that decide whether certain types of string representations get cached or not.
So basically my approach would be to remove caching, see where sh*t hits the fan and then reintroduce caching, while in parallel building tools for introspecting performance and measuring what is going on.
Also, I think it would be important to add caching in a way where you can easily enable/disable it entirely. So I would want it to be in the form of a parallel array, or some kind of sparse data structure parallel to the uncached core data (bloom filter, hash map / HAMT, bitset, sparse set) that adds the cached representation only when it is needed, instead of always.
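A minimal sketch of what I mean, with made-up names (`StringCache`, `Handle`): the cache lives beside the core data, so flipping `enabled` turns the whole thing off without touching the core representation.

```zig
const std = @import("std");

const Handle = u32; // hypothetical handle into the core data arrays

/// Sparse cache parallel to the uncached core data: only handles that
/// actually needed a string representation ever get an entry.
const StringCache = struct {
    enabled: bool = true,
    map: std.AutoHashMap(Handle, []const u8),

    fn init(allocator: std.mem.Allocator) StringCache {
        return .{ .map = std.AutoHashMap(Handle, []const u8).init(allocator) };
    }

    fn get(self: *StringCache, h: Handle) ?[]const u8 {
        if (!self.enabled) return null; // disabled: behaves as a permanent miss
        return self.map.get(h);
    }

    fn put(self: *StringCache, h: Handle, s: []const u8) !void {
        if (!self.enabled) return;
        try self.map.put(h, s);
    }
};
```

A hash map is just one option here; a bitset plus parallel array would trade memory for lookup speed, which is exactly the kind of thing the metrics should decide.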
> Okay, now I’m starting to see how this list type would work. You’d still have some indirection, but you also wouldn’t need to lay out the objects contiguously.
I think you could also have a variant that is basically just an array of one type; here I just changed the type you had to what it would become with the Handle/Index change.
I think you would need to check in practice which representations are good.
I think this could be good for data with chaotic locality (which is kind of what I expect from the heterogeneous list usage you described as typical in tcl), but I could also see a type that requires contiguous memory being useful, if somebody writes tcl in a way that doesn't mix types, to basically opt into better cache locality.
So it might be nice if the interpreter could detect and separate different usage scenarios; I think this would basically be profile-guided optimization.
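As a sketch, the two variants could sit side by side in one union; the variant names here are assumptions, not something from your code:

```zig
const Handle = u32; // hypothetical handle into the interpreter's object pools

const List = union(enum) {
    // heterogeneous: each element is a handle to an object stored elsewhere
    // (chaotic locality, but any mix of types)
    heterogeneous: []Handle,
    // mono-typed: values stored contiguously, opt-in cache locality
    ints: []i64,
    floats: []f64,
};
```

The interpreter could start a list as mono-typed and transparently promote it to the heterogeneous variant the first time a mixed-type element is appended.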
I like this design a lot better than what I had with my Ref attempt, both naming-wise and design-wise.
I much prefer handles over pointers, because with handles you save space and you just have a lot more ways to deal with problems; it can also become much easier to save the state to disk and read it back in (which is nice if you want to implement live-editing environments). The blog post *Handles are the better pointers* by @floooh is a good read and describes a lot of the good things about using handles.
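The core trick from that post, as a rough sketch (sizes and names are my assumptions): pack an index and a generation counter into a small value, so a stale handle is detected instead of dangling the way a raw pointer would.

```zig
// 4 bytes instead of an 8-byte pointer: saves space and survives
// serialization, since the index is stable across runs.
const Handle = packed struct {
    index: u24,
    generation: u8,
};

const Pool = struct {
    // bumped every time a slot is freed and reused
    generations: [1024]u8 = [_]u8{0} ** 1024,
    items: [1024]i64 = undefined,

    fn get(self: *Pool, h: Handle) ?*i64 {
        // a handle to a freed-and-reused slot fails this check
        if (self.generations[h.index] != h.generation) return null;
        return &self.items[h.index];
    }
};
```

Returning `null` for a stale handle is one of those "more ways to deal with problems": the caller can recover, which a dangling pointer never allows.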
> I just realized that indexes might not play nice with being multithreaded. Is there a reasonable lock-free way to grow an array, or should I have all the object arrays statically allocated?
I think for that it is useful to start thinking from a systems perspective, where some central system is in control of what happens (this is described a bit in @floooh's blog post) to a big bunch of data (for example, many parallel arrays of related data).
That way the system knows the data, knows what level of granularity and parallelism is possible, and can decide whether it can, for example, process all elements in batches by handing them to a number of threads.
The system can also coordinate whether some handle can be written to by a particular thread (for example, if you are batching, each thread can only write to handles that are part of its batch, and external writes would be delayed; some other system could prepare data that then gets written once the batching is done).
Basically my questions would be:

- What does it mean to grow an array? What kind of array?
- A heterogeneous list with a capacity? → get a new one with bigger capacity → copy the old data → release the old one (all of that could happen hidden behind the handle)
- Who grows the list and when? Maybe the system should do the growing.
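The "hidden behind the handle" part could look roughly like this (all names made up): callers keep a stable handle while the system swaps the backing storage underneath, which is exactly what a raw pointer can't survive.

```zig
const std = @import("std");

const Handle = u32;

const ListSystem = struct {
    allocator: std.mem.Allocator,
    // one slot per handle; the slice can be replaced without
    // invalidating any handle that refers to the list
    backing: std.ArrayList([]i64),

    fn grow(self: *ListSystem, h: Handle, new_capacity: usize) !void {
        const old = self.backing.items[h];
        // get a new buffer with bigger capacity
        const new_buf = try self.allocator.alloc(i64, new_capacity);
        // copy the old data
        @memcpy(new_buf[0..old.len], old);
        // swap it in, then release the old one
        self.backing.items[h] = new_buf;
        self.allocator.free(old);
    }
};
```

Because only the system touches `backing`, the question of "lock-free growth" turns into the simpler question of when the system is allowed to run `grow` relative to the batches.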
Regarding lock-free, I don't really know; I think in a lot of cases it is better not to share unnecessarily to begin with, so maybe just share safe handles.
Wanting lock-free makes me wonder whether you want to keep

> Jim is very much a sea of pointers.

where everything chaotically writes to everything else. I would heavily advise against this, mostly because I find it extremely hard to reason about; I also have no idea how to even begin to optimize or analyze such a thing. With the systems approach it seems more like you keep things organized and grouped, so that you can measure things, see which groups are slow, and then optimize those specifically.
But if multiple writers want to write to one thing, then I think there either needs to be some sort of synchronization, atomic operations, or accumulating / monotonic operations?
I think with systems you can do that through ordering data dependencies between systems. (I haven't hit big scales yet, so that is an area where I still have a lot to learn in practice.)
Towards the end I mixed Handle / ECS / Interpreter aspects a bit, mostly because that is what I experiment with, and I find it somewhat difficult to keep them clearly separated. I think the Interpreter could possibly be seen as a system.
Oops, sorry for the long answer!