Hello again, I hope everyone is doing well!
I’m aiming to do a series of posts where we build well-established machine learning models and utilities. These could include anything from semantic search to transformers (I’m particularly excited about building MAMBA). I’d like to kick it off today with a simple one: Neural Networks.
Instead of making this a detailed tutorial on how they work, I’d like to show you how easy one is to build using Metaphor. If you’d like to follow along, you can find the full example file in my library: https://github.com/andrewCodeDev/Metaphor/blob/main/src/examples/feedforward.zig
Feed-Forward
Neural networks are a series of matrix multiplications, each followed by a vector addition and then run through an activation function. Activation functions introduce non-linearity into the calculation, which lets us model things that can’t be captured by merely stacking straight lines. For instance, think of the parabola created by f(x) = x^2 - the line bends upwards at both ends, so we need that kind of behaviour if we want to capture that shape.
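To make the “straight lines” point concrete, here’s a tiny scalar sketch in plain Zig (nothing Metaphor-specific, just toy numbers I made up): stacking linear functions only ever produces another linear function, which is exactly why we need an activation in between.

    const std = @import("std");

    // Two arbitrary "layers" with no activation: y = a*x + b.
    fn layer1(x: f64) f64 { return 2.0 * x + 1.0; }
    fn layer2(x: f64) f64 { return 3.0 * x - 2.0; }

    pub fn main() void {
        // layer2(layer1(x)) = 3*(2x + 1) - 2 = 6x + 1 -- still just a straight line,
        // no matter how many purely linear layers we stack on top of each other.
        for (0..5) |i| {
            const x: f64 = @floatFromInt(i);
            std.debug.print("x = {d}: stacked = {d}, single line 6x+1 = {d}\n",
                .{ x, layer2(layer1(x)), 6.0 * x + 1.0 });
        }
    }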
Basically, we need something that goes like: y = f(A·x + b)
where we can repeat this operation continually. Each time we repeat the operation, we’ll call that a layer. Here’s how that looks:
pub fn FeedForward(comptime Tag: mp.scalar.Tag) type {
    return struct {
        const Self = @This();
        const T = mp.scalar.Tag.asType(Tag);

        W: mp.types.LeafTensor(T, .wgt), // weight matrix, shape (m, n)
        b: mp.types.LeafTensor(T, .wgt), // bias vector, shape (m)
        y: mp.types.NodeTensor(T) = undefined, // cached output for the reverse pass
        alpha: f16, // scaling factor, 1/n

        pub fn init(G: *mp.Graph, m: usize, n: usize) Self {
            return .{
                .W = G.tensor(.wgt, Tag, mp.Rank(2){ m, n }),
                .b = G.tensor(.wgt, Tag, mp.Rank(1){ m }),
                .alpha = 1.0 / @as(f16, @floatFromInt(n)),
            };
        }

        // y = selu(alpha * (W x) + b); "ij,j->i" is a matrix-vector product.
        pub fn forward(self: *Self, x: anytype) mp.types.NodeTensor(T) {
            self.y = mp.ops.selu(mp.ops.linearScaled(self.W, x, self.alpha, self.b, "ij,j->i"));
            self.y.detach(); // marks the end of this layer's sub-graph
            return self.y;
        }

        pub fn reverse(self: *Self, cleanup: bool) void {
            self.y.reverse(if (cleanup) .free else .keep);
        }
    };
}
That’ll do it! Not too shabby.
You may notice that I’m setting alpha to 1/n - that’s a common trick used to trim down the size of the resulting values so they don’t skyrocket. I’m using selu (Scaled Exponential Linear Unit) as the activation function, but this is optional. Selu has good reversal properties, doesn’t risk causing an over/underflow, and is cheap to compute.
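For reference, here’s roughly what a SELU looks like element-wise - a plain-Zig sketch of the standard definition (Klambauer et al., 2017), not Metaphor’s actual kernel. Negative inputs get squashed towards a floor of about -1.76, so the exponential is only ever evaluated on non-positive values and can’t blow up, while positive inputs pass straight through with a small scale:

    const std = @import("std");

    // Standard SELU constants (Klambauer et al., 2017).
    const selu_alpha: f64 = 1.6732632423543772;
    const selu_lambda: f64 = 1.0507009873554805;

    // Element-wise SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise.
    fn selu(x: f64) f64 {
        return if (x > 0.0)
            selu_lambda * x
        else
            selu_lambda * selu_alpha * (std.math.exp(x) - 1.0);
    }

    pub fn main() void {
        for ([_]f64{ -3.0, -1.0, 0.0, 1.0, 3.0 }) |x| {
            std.debug.print("selu({d}) = {d:.4}\n", .{ x, selu(x) });
        }
    }

The 1/n factor plays a similar role on the linear side: a dot product over n terms tends to grow with n, so dividing by n keeps the pre-activation values in a sensible range no matter how wide the layer is.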
Building the Network
To add some more information to our network, we can repeat multiple feed-forwards one after another, passing the output of one as the input to the next. Sounds like an array (if you ask me, lol). Sure enough, an array of these will do the job. I’ve also added two additional ones called head and tail (the array is body). The head block first projects the incoming vector to a bigger one (if I multiply a matrix with dimensions (N, M) against a vector of size (M), the resulting vector has size N, which could be larger). The tail does the opposite - it returns us back to the vector size we originally started with. Here’s the code for that:
pub fn NeuralNet(
    comptime Tag: mp.scalar.Tag,
    comptime layers: usize,
) type {
    if (comptime layers == 0) {
        @compileError("NeuralNet needs at least 1 layer.");
    }
    return struct {
        const Self = @This();
        const T = mp.scalar.Tag.asType(Tag);

        head: FeedForward(Tag), // projects the input up: n -> m
        body: [layers]FeedForward(Tag), // hidden layers: m -> m
        tail: FeedForward(Tag), // projects back down: m -> n
        cleanup: bool,

        pub fn init(G: *mp.Graph, m: usize, n: usize, cleanup: bool) Self {
            var body: [layers]FeedForward(Tag) = undefined;
            for (0..layers) |i| {
                body[i] = FeedForward(Tag).init(G, m, m);
            }
            return .{
                .head = FeedForward(Tag).init(G, m, n),
                .body = body,
                .tail = FeedForward(Tag).init(G, n, m),
                .cleanup = cleanup,
            };
        }

        pub fn forward(self: *Self, x: mp.types.LeafTensor(T, .inp)) mp.types.NodeTensor(T) {
            var h = self.head.forward(x);
            for (0..layers) |i| {
                h = self.body[i].forward(h);
            }
            return self.tail.forward(h);
        }

        // Walk the layers in reverse order: tail, then body back-to-front, then head.
        pub fn reverse(self: *Self) void {
            self.tail.reverse(self.cleanup);
            var i: usize = layers;
            while (i > 0) {
                i -= 1;
                self.body[i].reverse(self.cleanup);
            }
            self.head.reverse(self.cleanup);
        }
    };
}
Again, nothing to it. Believe it or not, that’s all we need!
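If you want to see the shapes without any of the Metaphor machinery, here’s a plain-Zig sketch of what one forward pass through head → body → tail computes (made-up toy sizes, ordinary arrays, no graph or autodiff). The "ij,j->i" contraction is just a matrix-vector product, so the head takes an n-vector to an m-vector, each body layer maps m to m, and the tail maps m back to n:

    const std = @import("std");

    // Same SELU as in the earlier sketch.
    fn selu(x: f64) f64 {
        const a = 1.6732632423543772;
        const l = 1.0507009873554805;
        return if (x > 0.0) l * x else l * a * (std.math.exp(x) - 1.0);
    }

    // One feed-forward layer: y[i] = selu(alpha * sum_j W[i][j] * x[j] + b[i]).
    // Rows of W against the input vector -- exactly the "ij,j->i" contraction.
    fn layerForward(
        comptime rows: usize,
        comptime cols: usize,
        W: [rows][cols]f64,
        b: [rows]f64,
        x: [cols]f64,
    ) [rows]f64 {
        const alpha = 1.0 / @as(f64, @floatFromInt(cols));
        var y: [rows]f64 = undefined;
        for (0..rows) |i| {
            var acc: f64 = 0.0;
            for (0..cols) |j| acc += W[i][j] * x[j];
            y[i] = selu(alpha * acc + b[i]);
        }
        return y;
    }

    pub fn main() void {
        const n = 3; // input/output width
        const m = 5; // hidden width

        // Toy parameters filled with small constants -- just here to trace the shapes.
        const Wh = [_][n]f64{[_]f64{0.1} ** n} ** m; // head weights: (m, n)
        const Wb = [_][m]f64{[_]f64{0.1} ** m} ** m; // body weights: (m, m)
        const Wt = [_][m]f64{[_]f64{0.1} ** m} ** n; // tail weights: (n, m)
        const bm = [_]f64{0.0} ** m;
        const bn = [_]f64{0.0} ** n;

        const x = [_]f64{ 1.0, 2.0, 3.0 };        // input: length n
        const h1 = layerForward(m, n, Wh, bm, x); // head: n -> m
        const h2 = layerForward(m, m, Wb, bm, h1); // one body layer: m -> m (the real net loops over these)
        const y = layerForward(n, m, Wt, bn, h2); // tail: m -> n

        std.debug.print("tail output ({} values): {any}\n", .{ y.len, y });
    }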
Let me explain one thing you may have picked up on above: detach… what’s up with that? Detach helps us form sub-graphs. Each sub-graph can be independently freed, allowing us to clean up as we go along. These freed values are caught by the caching allocator, meaning we can reuse them as we go backwards (we can also offload values to the CPU and copy them back over going forward, giving us an additional massive reduction in memory usage). To see more about sub-graphs, check out: https://github.com/andrewCodeDev/Metaphor/blob/main/src/examples/subgraphs.zig
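In plain Zig terms, the memory pattern looks something like this (a toy sketch with an ordinary allocator, not Metaphor’s caching allocator): each detached sub-graph owns just the buffers its own reverse step needs, and we hand them back the moment that step finishes, so the reverse pass never keeps every layer’s intermediates alive at once.

    const std = @import("std");

    pub fn main() !void {
        const ally = std.heap.page_allocator;

        const layers = 4;
        const activation_size = 1024;

        // Forward: each "sub-graph" hangs on to only the buffer its own reverse step needs.
        var saved: [layers][]f64 = undefined;
        for (0..layers) |i| {
            saved[i] = try ally.alloc(f64, activation_size);
            @memset(saved[i], 0.5); // stand-in for a layer's detached output
        }

        // Reverse: walk the layers backwards and release each sub-graph's memory as soon
        // as its gradients have been consumed, instead of holding everything to the end.
        var i: usize = layers;
        while (i > 0) {
            i -= 1;
            // ... use saved[i] to compute this layer's gradients here ...
            ally.free(saved[i]); // a caching allocator could hand this block right back out
        }
    }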
In the main file, I did some testing with toy data to see how it all turns out - the results look good.
First couple runs…
info: score - 3.8343
info: score - 3.7344
info: score - 3.6351
info: score - 3.5362
info: score - 3.4379
Towards the end…
info: score - 0.1938
info: score - 0.1906
info: score - 0.1875
info: score - 0.1845
info: score - 0.1815
Anyhow, if this kind of thing interests people here, I’ll make a note of picking out the best networks (or even taking suggestions) and make publicly available versions of them using Metaphor that you can use in the comfort of your own Zig projects.
Thanks! Let me know if you want me to build any models in particular!