I’d like to open a new thread on this topic, as it’s a quite specific question (I hope that is alright for everyone), and look at the specific case of deinit functions. This is somewhat special as we don’t mutate but usually invalidate a struct or handle. What type for self do we use here?
Let’s look at ArrayList or StringHashMap in the standard library. What kind of self argument does their deinit method take?
Why the distinction? As far as I understand the documentation, the data structure is “invalidated” in each case.
And what should I do if I create my own struct that might contain one of those data structures (now, or possibly in the future)?
I wonder if the self: Self argument of ArrayList is actually a mistake in std, and it should be self: *Self? Also, I’ve seen the common idiom of doing self.* = undefined in a struct’s deinit method. That doesn’t work if you don’t take a pointer.
On the other hand, “invalidation” also happens in other methods that do not take pointers, for example std.Thread.detach.
So what is the way to go?
Some deinit methods also require the allocator to be passed to it, e.g. ArrayListUnmanaged.deinit, so there is no “unique” signature to follow. Yet, I feel like there should (could?) be some consensus whether the self argument should be always a pointer or not. The standard library seems to be inconsistent here, but maybe there is some “rule” that I don’t know yet.
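To make the two signature styles under discussion concrete, here is a minimal sketch. `ManagedBuf` and `UnmanagedBuf` are hypothetical types invented for illustration, not the std containers. Both take a pointer receiver so that `self.* = undefined` is possible, and the unmanaged variant additionally takes the allocator, as described above.

```zig
const std = @import("std");

// Hypothetical types for illustration only; they are not the std containers.
const ManagedBuf = struct {
    allocator: std.mem.Allocator,
    bytes: []u8,

    pub fn init(allocator: std.mem.Allocator, len: usize) !ManagedBuf {
        return ManagedBuf{ .allocator = allocator, .bytes = try allocator.alloc(u8, len) };
    }

    // Pointer receiver: required for the `self.* = undefined` idiom.
    pub fn deinit(self: *ManagedBuf) void {
        self.allocator.free(self.bytes);
        self.* = undefined;
    }
};

const UnmanagedBuf = struct {
    bytes: []u8,

    pub fn init(allocator: std.mem.Allocator, len: usize) !UnmanagedBuf {
        return UnmanagedBuf{ .bytes = try allocator.alloc(u8, len) };
    }

    // The allocator is not stored, so the caller has to pass it again here.
    pub fn deinit(self: *UnmanagedBuf, allocator: std.mem.Allocator) void {
        allocator.free(self.bytes);
        self.* = undefined;
    }
};

pub fn main() !void {
    const gpa = std.heap.page_allocator;

    // Both bindings must be `var`, because deinit takes a mutable pointer.
    var a = try ManagedBuf.init(gpa, 16);
    defer a.deinit();

    var b = try UnmanagedBuf.init(gpa, 16);
    defer b.deinit(gpa);

    std.debug.print("{d} {d}\n", .{ a.bytes.len, b.bytes.len });
}
```

With a by-value receiver (`self: ManagedBuf`) the free would still work, but the final assignment to `self.*` would not compile, since parameters are immutable.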
I wonder: Maybe the distinction between deinit(self: Self) and deinit(self: *Self) depends on whether during the procedure of “deinitializing”, it is helpful to temporarily modify the data structure throughout the process?
But isn’t that an implementation detail that shouldn’t be reflected in the interface visible to the caller of the function? And then again, it has implications for self.* = undefined (which might be unnecessary anyway?).
Maybe I’m overthinking, but as I read earlier, Zig’s philosophy (Zen) suggests there should be:
Only one obvious way to do things.
So it would be helpful to have some sort of rule or indication when to use which for deinit (and other functions that invalidate a data structure).
Now the following is just a thought-experiment / brainstorming:
Would it be (in theory) justified to have a language feature for arguments that are deinitialized? Something like deinit(deinitialize x: Self). The compiler would then know that the passed argument will be considered unusable after the function returned. Or, from the point of view of the caller, already be unusable right after the function is called (i.e. when it starts executing). The compiler could then decide on its own whether to pass the argument by value or by reference. But perhaps we have something like this in the language already? What about:
Due to the noalias keyword, the caller would have to guarantee to not use the memory from the moment when the deinit function is called (which seems reasonable), and the compiler could then, in the spirit of Parameter Reference Optimization (PRO), decide on its own whether it will pass self by value or by reference.
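A rough sketch of what such a signature could look like follows. `Resource` and its `released` field are hypothetical names used only for illustration. `noalias` on a pointer parameter is existing Zig syntax, but whether the compiler would use it to pass `self` by value is purely an assumption of this thought experiment, not documented behavior.

```zig
const std = @import("std");

// Hypothetical type for illustration only.
const Resource = struct {
    released: *bool,

    // The caller promises not to touch the pointed-to memory through any
    // other pointer while deinit runs. Any optimization based on that promise
    // is an assumption here, not something the language guarantees.
    pub fn deinit(noalias self: *Resource) void {
        self.released.* = true;
        self.* = undefined;
    }
};

pub fn main() void {
    var flag = false;
    var r = Resource{ .released = &flag };
    r.deinit();
    std.debug.print("released = {}\n", .{flag});
}
```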
Currently, this doesn’t seem idiomatic at all though. But I’m curious if my thoughts are just nonsense, or if they make sense to some extent?
Also, these hypothetical considerations aside, it still would be helpful to have some rule-of-thumb at hand, which way to go. (Maybe such advice could change in future though, e.g. due to #1108).
Update:
I did some more research, and found two issues on GitHub in that matter:
This proposal suggested that “by convention, deinit never invalidates memory and always takes self as a value”, i.e. the opposite of what I proposed/assumed above.
Issue #9814: Constness inconsistencies in the standard library (still open).
Here, inconsistencies in Zig’s standard library are brought up, specifically regarding functions that invalidate values.
As of January 2022, @andrewrk states that he is “willing to walk back the current convention, which is leaning towards mutable self pointer for deinit.” (I assume this means that there was a leaning towards using self: *Self for deinitialization, but openness to shift towards using self: Self instead.)
However, if I understand right, turning down PRO happened about two and a half years later, which leaves me wondering if shifting towards self: Self is still a good idea.
So it looks like the question is still unresolved. I hope any inconsistencies in std will get resolved, and that there will be an easy rule I can follow as a programmer when creating my own deinit functions.
I personally feel like “noalias self: *Self” is the right solution, because it allows the implementer of deinit to modify the value during the process of invalidation, which might come in handy (and, as a bonus, be able to set it to undefined in the end).
But I would really like to hear other people’s opinion on that, and maybe arguments/implications that I’m overlooking.
P.S.: I think requiring a mutable pointer is also semantically correct because it forces the user to use var rather than const, which is (in my opinion) correct, as invalidation is some sort of mutation the programmer should be aware of.
Setting self.* to undefined has no effect, except to make it easier to catch invalid use of the data structure after deinit. Even that isn’t guaranteed: in practice it will help in some build modes and won’t in others.
TL;DR you’re overthinking this. The contract for a deinit isn’t how the member parameter is passed; the contract is that after deinit, code must not use it. Passing by pointer allows self.* = undefined; at the end, and that makes it a little easier to catch violations of that contract. Most useful deinit functions will need the ability to mutate the struct in some fashion, so practically speaking, pass-by-pointer is the most useful way to write one.
I would hope it has no effect in ReleaseFast, as that would be unnecessary overhead.
Yes, I understand that.
Exactly.
Yes, I also understand that. But my point is:
Why provide a different interface depending on the “internal needs” of the particular deinit implementation? This has implications for the caller.
In particular, if we move from self: T to self: *T in the function definition, this is a backward-incompatible change. Thus we have to consider in advance what our interface should look like if we don’t want to break things later.
And for the caller, it doesn’t (or shouldn’t) really matter how deinit does its work. Yet the choice of how we pass self will impact the interface. For example, the following code won’t work:
    const std = @import("std");

    const T = struct {
        // Let's assume this was `self: T` but we later changed it:
        pub fn deinit(self: *T) void {
            _ = self;
        }
    };

    pub fn main() void {
        const x = T{};
        // Compile error: `x` is const, so `&x` is a `*const T`, not a `*T`.
        x.deinit();
    }
This is why I believe there should be some convention regarding how we pass to-be-invalidated arguments. This actually has nothing to do with whether we use self.* = undefined; or not (except that it’s a bonus that if we agree on using self: *T, we may use it).
deinit isn’t an interface. It’s a collection of functions. There’s an implicit contract to that collection but they are not formally related.
Yes, I suppose it would be a breaking change. I don’t anticipate data structures frequently switching from needing to be constant to having to be mutable though. It’s generally fairly clear which one is going to apply.
Well, that works both ways. If there were some convention, as you say, then anything with deinit would have to be mutable, right? What would that get us?
I still think you’re overthinking this. If you can deinitialize something which is constant, then pass by value or *const. Use the latter unless the struct is small; if you need a threshold for ‘small’, 3 usize.
If it needs to mutate, pass by *T. If you like setting it to undefined at the end, more power to you. I admit I haven’t bothered usually but maybe I should.
Just don’t require try. That isn’t deinit, it’s something else.
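As a sketch of that rule of thumb, under the assumption that the two types below stand in for real ones (`Handle` and `Table` are hypothetical names): a small, pointer-like handle takes `self` by value, while a larger struct whose deinit does not need to modify it takes `*const`. Both can be stored in `const` bindings, and both deinit functions return `void`, so neither requires `try`.

```zig
const std = @import("std");

// Hypothetical pointer-like handle: one pointer wide, cheap to copy.
const Handle = struct {
    closed_count: *u32,

    // By value: works on `const` bindings and cannot modify the handle itself.
    pub fn deinit(self: Handle) void {
        self.closed_count.* += 1;
    }
};

// Hypothetical larger struct: the inline array makes copies comparatively costly.
const Table = struct {
    allocator: std.mem.Allocator,
    rows: []u64,
    stats: [32]u64 = [_]u64{0} ** 32,

    // `*const`: no copy is requested, and the struct's own fields stay untouched.
    // The price is that `self.* = undefined` is not available here.
    pub fn deinit(self: *const Table) void {
        self.allocator.free(self.rows);
    }
};

pub fn main() !void {
    var closed: u32 = 0;
    const handle = Handle{ .closed_count = &closed };
    handle.deinit();

    const gpa = std.heap.page_allocator;
    const table = Table{ .allocator = gpa, .rows = try gpa.alloc(u64, 4) };
    defer table.deinit();

    std.debug.print("closed = {d}\n", .{closed});
}
```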
Sorry if I was imprecise. What I meant was: The deinit function of a specific data type consists of an interface and an implementation. If we change the type of self in the function declaration, then we change the interface. (E.g. if we only look at one type like ArrayList, then we have one interface: ArrayList.deinit).
Maybe not, but this means if std (or any other module) stabilizes, this will be difficult to change later. So it might be worth spending some time thinking on it. Even if the probability of the “need” to change is unlikely. (And even then, maybe a different solution may also turn out to be more elegant or consistent.)
Yes, that would be the implication, I guess.
But what would it get us?
Avoiding footguns by forcing users of our (invalidatable) types to store them in a var rather than const, which helps them notice that there is some sort of “state”, insofar as the thing is unusable after deinit.
Allowing self.* = undefined;
Consistency and an easy to follow guideline for programmers.
Please note that my argumentation only applies to deinit and other invalidating functions. For ordinary functions, there are different arguments regarding why using self: T, self: *const T, self: *T, and so on is the right choice. But in case of deinit, the caller (or anyone else) likely will never use the data structure while deinit runs. (This is why I proposed noalias self: *T.)
For deinit (and other invalidating functions), the only reason for different passing-styles of self is the implementation in the function. And that should not be relevant for the interface (in my opinion).
(I hope I sort of made my point clear, but I find it difficult to explain. Please feel free to ask if some point isn’t clear.)
P.S.: And using noalias self: *T also would come with no overhead (in theory).
But it is relevant to the interface. If deinit needs to mutate, then you have to use that type as var. If it doesn’t, you don’t. It’s not an interior detail at all, and it has implications about what the type is actually for.
But deinit doesn’t mean that the type is mutable. It means it controls resources which need to be finalized in an error-free manner. It doesn’t make sense to “force” anyone to do this, and, there’s no mechanism with which to do so.
Technically yes. But there’s something special about deinit (and other invalidating functions), which is that the function “consumes” the value (semantically, not technically). Speaking of “contracts”, that means that the function’s argument (or the pointed-by value, in case of self: *T) shouldn’t be touched anymore when deinit is called.
So nothing stops us from doing something like this:
    const std = @import("std");

    const T = struct {
        some_state: i32 = 5,
        // Let's imagine there are many more fields in addition
        // to `some_state`.

        pub fn deinit(self: T) void {
            // Let's imagine we must mutate `self`.
            // We can do that simply by copying it (not efficient,
            // but works).
            var this = self;
            while (this.some_state > 0) {
                this.some_state -= 1;
            }
        }
    };

    pub fn main() void {
        const x = T{};
        x.deinit();
    }
My point is that regardless of whether we declare deinit to take a self: T or self: *T, we can always get our work done. But one of those is (in practice) more efficient than the other.
Now my argument in the second post of this thread is that if we always use noalias self: *T, we can always achieve maximum efficiency while simply not needing to expose the implementation’s needs/internals to the caller.
“Forcing” the programmer to use var instead const (consistently) and allowing self.* = undefined; (consistently) is just a bonus. Of course you may argue if each of those bonuses is a pro or con. In my opinion they are advantages, but some people might consider requiring var consistently being a disadvantage (and prefer to be able to use const at least in some cases, even if that’s just due to implementation details of ArrayList.deinit, Thread.detach, etc).
Note that std.StringHashMap.deinit currently does not allow the programmer to use const, even if it could by following the scheme I demonstrated in the code above.
Now I propose consistency by always forbidding it, while #6322 proposed consistency by always allowing it. And, if I get you right, you propose that whether it’s allowed or not should depend on the deinit implementation (which I don’t think is a good idea, but maybe there are arguments that I overlook).
I don’t think types (structs) are mutable/immutable, but bindings (and maybe pointers) are?
But since the original isn’t used (and must not be used), mutating the copy effectively serves the same purpose (and has the same side effects). So it’s practically (i.e. effectively) identical, except being differently implemented and being (likely) more inefficient.
If the point is to mutate the original, like when setting self.* = undefined for better debug detection, then mutating a copy just doesn’t work.
Also when the struct only has fields which are already pointers (and is small enough) or the struct represents a sort of handle, then using self: T may make more sense.
I don’t think only using one or the other everywhere, is a good idea.
I guess nobody does so far. Maybe I need to give it a few days rest or write a more coherent article/post on it. I think I do have a point, but maybe I fail to explain it properly.
I don’t think so, and I tried to explain why. (Note that with “deinit” functions, I mean functions that “consume” an argument, e.g. also std.Thread.detach. This does not just apply to functions that are called deinit or which release memory/resources.)
I don’t understand that sentence. Who should get which hint and why?
Exactly. That’s (edit: one of the reasons) why I would prefer to consistently use noalias self: *T rather than self: T.
More sense in which way? Being semantically correct? Being more efficient? Leading to less programming errors? I.e. what is the objective function here?
Well, I tried to outline in this thread, beginning with my second post, why I think it may be a good idea (in terms of consistency, semantics, and achievable efficiency). But it seems like I failed to come up with a convincing explanation.
If someone else understands my point here, maybe they can rephrase my idea in different words.
P.S.: I acknowledge that, following this discussion, consistently using noalias self: *T is apparently not idiomatic (at least as of now), regardless of what I feel (and tried to argue) would be right.
When you use a type thinking it can be defined as:

    const instance: T = .init(...);
    defer instance.deinit();

You will get a compile error if that type requires a *T for its deinit function (or others). It tells you whether the struct is mutated in that function. It doesn’t tell you whether something one of its fields points to gets mutated, but I think it is still useful information.
When you use self: T the compiler has more choice how to optimize it and the reader knows that self won’t be mutated directly.
With *T you are telling the compiler to always use a pointer, but if your T is pointer-like or a small value, that isn’t efficient because small values can be passed as values more efficiently. If it is pointer-like you could end up with pointer to pointer instead of just the pointer-like being copied, that would mean the code uses an indirection that isn’t needed and creates additional overhead.
Cases where this could happen:
when creating a struct that represents a tagged pointer or handles
small value types like a color or position
Basically if a type doesn’t have a discernible identity apart from its value and it is small enough then it can be passed around as value more efficiently.
But if you explicitly tell it to use *T, the compiler may not be able to recognize that the struct’s position/address/identity in memory doesn’t matter for the way it is used.
Hmmm I understand now what you meant. But do I really need to know if deinit mutates the state after I decide to give up ownership, promise to not touch the struct anymore, and request cleanup?
I think the more relevant information (for the caller!) would be whether something happens to my struct. Here, for example ArrayList.deinit and StringHashMap.deinit are the same: In both cases I can’t use the struct afterwards anymore. That is the relevant information (if any) for the caller in my opinion.
Well, I didn’t want to compare self: T with self: *T, but self: T with noalias self: *T. I (naively) assumed that by passing noalias self: *T and setting self.* = undefined; at the end of the function, the compiler should have enough information to optimize the call in such a way that it may pass the argument by value or reference internally, whichever is faster.
Also not when using noalias in combination with a mutable pointer?
I’m not really versed in the details of compiler optimization. Yet, say the compiler (currently) can’t always optimize either case; then it would be sad that we, as programmers, have to make a choice in our interface declaration that entirely depends on internal implementation/optimization issues of the function. And it’s also unfortunate that these details dictate whether we can use const or must use var. As a user of an interface, I’m normally not interested in those details. I’d rather want to know stuff like:
If the function requires a reference, do I need to ensure that this reference is exclusive (noalias)?
Will the function mutate my value (*T versus T or *const T)?
Will the function work on a copy of the value (T) or could memory writes interfere with what the function works on while it’s running (*const T)?
But in case of “consumed” (i.e. to-be-deinitialized) values, these questions really do not matter to me (edit: as a caller of the function) as I give up ownership anyway.
Maybe that clarifies a bit where my idea was coming from.
As cited above, my initial idea actually was that “deinitialization” or “passing ownership” is sort-of a kind of its own that might even deserve its own language construct. (Syntactically could be x: ~T or something like that. Just hypothetically; I don’t want to make a real proposal here.)
I then thought that perhaps noalias x: *T with x.* = undefined; gets close to that idea (without having to modify the language).
I care about the opposite, I want to have lots of things that don’t mutate and declare that with self: T, so that I don’t have to bother about what it is doing / have to read less code to verify what it is doing.
For the compiler deinit is just another function, so you may have promised something to the programmer reading the code, but not to the compiler.
I don’t know what the exact definition of noalias would be in Zig (there currently isn’t documentation for it). But even without it, the fact that self.* is set to undefined at the end doesn’t mean that it being a pointer is irrelevant before that. Recognizing whether it could be a value requires analysis of all the code before that point, which could be many functions deep. With self: T it is immediately clear, without any analysis, that it doesn’t matter: you know that the pointer isn’t stored away anywhere from which it will later be retrieved (and used in a useful/working way), or similar shenanigans.
I don’t really want to distinguish between invalidating functions vs other functions, instead I want to write the code in such a way that it is easy to understand and doesn’t require a bunch of different rules/logical thinking frameworks based on what kind of function it is.
And also so that the compiler doesn’t have to be too clever and analyze too much to figure out what it is allowed to do to optimize it, because the compiler should be fast, but if it has to do too much work to figure out things, it probably won’t be fast.
Basically I think const and var and its effects are useful to make it clearer to the programmer what can and can’t change.
Zig doesn’t deal with ownership, the programmer does.
I think if you wanted to deal with ownership and lifetimes in the language, that would quickly turn Zig into a very different language than what it currently aims at. It might be an interesting thing to explore, but I don’t think that is within what is being considered for Zig itself.
Regarding such things, you might be interested in this topic:
First of all, thanks to you both for having had the patience of discussing this issue with me/us so far. Maybe it’s not just me but also other users of Zig who struggle with the concept of invalidation and how it’s idiomatically handled in Zig. At least I still struggle here.
Let me respond to your last response first, and then I’ll try to summarize my p.o.v. on the topic, as I suggested yesterday to myself.
But in the case where a function consumes a value for deinitialization/invalidation (I’ll get to the word “ownership” below), you do have to care: You must be aware that after
ArrayList.deinit
StringHashMap.deinit
Thread.detach
File.close (now this one is interesting as it supposedly can be undone according to the docs, but I don’t think that is possible anymore?)
your struct (or slice) is effectively invalid and its state (whatever it is in) can not be used anymore.
But I admit (and just realized): That’s not completely true. After invalidation, you might still inspect a slice’s length and work with it. Maybe you could still compare a closed File’s handle if you know that you haven’t opened any new files after closing several files. So my argument could be invalidated (no pun intended) here.
However, as a counter example, ArrayListUnmanaged doesn’t have any usable fields after deinitialization (compared to ArrayList, where you could still make use of the stored allocator, in theory). Now interestingly ArrayListUnmanaged.deinit takes a *Self and ArrayList.deinit takes a Self. And note that the distinction I just pointed out isn’t on how these two deinit functions are internally implemented, but regarding whether they completely render self unusable or not.
So concluding here, I do see that sometimes taking Self has an advantage (beyond optimization questions): It gives a guarantee to the caller that all fields of the struct stay unmodified. In cases where “invalidation” means that the thing cannot be used as a whole but maybe some parts of it are still usable, this is a slight advantage. (Examples: Using a slice’s length after deallocation, comparing file handles after closing, etc.)
Normally yes, but if I would use noalias (or, in a hypothetical world, a custom ~Self language extension for indicating destructive consumption), then I actually do make a promise to the compiler.
As far as I understand, noalias works like restrict in C, i.e. we promise that for the lifetime of the pointer (usually the runtime of the function), no other pointer that refers to the same memory will be used. This is the case when we call functions like deinit or Thread.detach. So, even if not idiomatic, it seems to be fine to use deinit(noalias self: *Self) in places where we currently use deinit(self: *Self). (Again, this is not idiomatic, and I’m not sure how much optimization can really be done, but I would like to point that out.)
For an analysis with no false negatives, I would agree. But if the function never passes that pointer around, then it should be possible, I guess? (But again, I don’t really have a lot of knowledge about optimization, so I may be totally wrong.)
Well, I made my OP because the language and its associated idioms force me to distinguish between using self: Self and self: *Self when writing my own deinitialization functions. As shown above, we can always implement deinitialization in either way, but our choice has implications for the caller of our deinitialization function. I would like to have a clear rule (that doesn’t just sound nice but also makes sense) when it comes to picking one of those (or a different pattern, like noalias self: *Self).
I.e. my post is to aid me in thinking less in the future by understanding the subject and the reasons for choosing one or the other way.
I understand that for non-invalidating functions there are some existing rules. And while they may be fuzzy, I can understand their motivation. But for invalidating functions, I challenge(d) the existing rules because here mutation isn’t relevant for the caller (with some possible exceptions as outlined in the beginning of this post).
So I think we’re on the same side here as in “I don’t want to think a lot”, because:
But for me, this isn’t true in case of functions that consume/destroy a value. (Which I will get back to in my summary below).
Now I used the term “ownership” again, so let me first answer this:
I totally agree. And when I said, “a function that assumes ownership”, then I meant this: The function’s documentation (or semantics) will clarify that passing Self (whether by-value or by-reference) will make it the function’s responsibility to invalidate the structure.
Now passing ownership isn’t the same as invalidating, but in case of invalidating functions, the invalidation happens immediately (at least when the function returns).
But: Even if Zig doesn’t deal with ownership, it deals with uninitialized values (to some extent and only in Debug mode) as well as pointer aliasing (well, at least it should care). Now what about these two things when the programmer(!) passes ownership with the intent to deinitialize? First of all, the value could be considered uninitialized afterwards (which can be indicated by the programmer with self.* = undefined; but as we know, this doesn’t work when we pass self: Self). Secondly, if we pass by reference, then the compiler may assume that there are no aliases. (It’s currently not idiomatic to declare that with noalias though, and maybe there are reasons not to do it, I don’t know. At least it’s syntactically more verbose if we would declare it.)
So what is my point here? The intent to deinitialize immediately has potential implications for the compiler. Maybe referring to this as “passing ownership” has been misleading, as there are other reasons to pass ownership (where there is no immediate deinitialization). So sorry about any confusion I created.
Okay, that is something I didn’t consider yet. If we want to keep the compiler’s job easy, then we obviously must distinguish between deinit(self: T) and deinit(self: *T) depending on our implementation. Still, I find it unfortunate that implementation details here leak to the caller-side (for no other reason than optimization).
For the record: I don’t want to track lifetimes and ownership. But I think aliasing rules need to be clear (#1108, #1521, #5973).
Now let me try to summarize my p.o.v. as I said in the beginning of this post:
As a user of Zig, I currently have a problem when writing functions that deinitialize a struct. On the one hand, it’s a common idiom to set the struct to undefined after it has been invalidated. On the other hand, it doesn’t seem to be idiomatic to always use self: *Self when “assuming ownership” and “deinitializing” a structure.
This violates
Only one obvious way to do things.
because I have to weigh different paradigms here. And I don’t know how.
I’m in pursuit of consistency (and easy-to-apply rules). Thus I wondered if there could be a unique way to go.
I pointed out that deinitialization functions are somewhat different from other functions, because the caller promises to not use the value anymore. This may (or may not) lead to different rules when to use self: Self and self: *Self. In particular it means:
If we pass self by value, then it is still possible to mutate it by creating a copy (maybe inefficient, maybe not). (demonstrated in this post above)
If we pass self by reference, then we can assume (and the compiler could too, if we annotate that somehow) that there are no aliases. (argued in my second post here).
So there may (or may not) be different rules how to write idiomatic Zig when our function renders a deinitialized value unusable. And I’m trying to understand these rules and their motivation.
In my opinion it is a suboptimal choice to pick between self: Self and self: *Self purely based on how the function is implemented internally. Maybe that’s a utopia though, as our compilers aren’t clever enough (and if they were, they would get bloated and complex). Nevertheless, for non-deinitializing functions, the question whether we pass
x: T
x: *T
x: *const T
matters because there are implications for the caller:
x: T ensures that the value x is copied, thus that accesses to the original x while the function is running will have no impact on the copy.
x: *T allows mutation.
x: *const T disallows mutation, but does not ensure that accesses to the original (pointed-to value) won’t affect x.* while the function is running.
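A small sketch of the pointer cases in this list, using a hypothetical `Counter` type and hypothetical function names. It only illustrates the two points that are uncontroversial: `*T` permits mutation, and a `*const T` view can observe writes made through another pointer to the same memory. It makes no claim about how by-value parameters are passed under the hood.

```zig
const std = @import("std");

// Hypothetical type for illustration only.
const Counter = struct { n: i32 };

// x: T -- the parameter is immutable inside the function.
fn readByValue(x: Counter) i32 {
    return x.n;
}

// x: *T -- the function may mutate the caller's value.
fn bump(x: *Counter) void {
    x.n += 1;
}

// x: *const T -- no mutation through `view`, but a write through another
// pointer to the same memory is visible when reading through `view`.
fn readAfterWrite(view: *const Counter, writer: *Counter) i32 {
    writer.n = 99;
    return view.n;
}

pub fn main() void {
    var c = Counter{ .n = 1 };
    bump(&c);
    const seen = readAfterWrite(&c, &c);
    std.debug.print("{d} {d}\n", .{ seen, readByValue(c) });
}
```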
These implications don’t matter for functions that (by contract) pass ownership for the purpose of immediate deinitialization. This means the choice between deinit(self: Self) and deinit(noalias self: *Self) depends only on the internal implementation of the particular deinit function. That seems unsatisfactory.
Now what is the bottom line of this?
Maybe:
Idiomatic Zig requires us to pick between deinit(self: Self) and deinit(self: *Self) depending on the internal implementation of the function. Whether we (or I) like it or not.
We should think about whether passing a value with the intent of deinitializing it (not tracking ownership!) could deserve its own language feature, something like deinit(self: ~Self), which basically means: “pass as value or reference, I don’t care, because I’m not going to use this anymore”. This is similar, but not identical, to what noalias does.
Even if the language isn’t changed, this thought experiment may give us insight regarding the problem we’re tackling here.
If the language is not changed, we should give careful thought to how std exposes its deinit functions because changing it later may break code.
It means the function is written with the syntax/semantics as if it is copied, but the compiler can decide whether to copy it or pass it as a reference.
When these types are passed as parameters, Zig may choose to copy and pass by value, or pass by reference, whichever way Zig decides will be faster. This is made possible, in part, by the fact that parameters are immutable.