zr - Simple, batteries-included hot reloading for Zig

This is my first relatively serious Zig project. It is a very easy-to-use library that helps you implement hot reloading for your application in just a couple of minutes. The dynamically loaded symbols are statically typed thanks to some metaprogramming, so it’s very safe to use, and it supports a “turn-off” switch that disables the dynamic library shenanigans and just links the symbols statically.
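As a rough illustration of what “statically typed” can mean here, the sketch below derives a function pointer type from a module’s declaration at comptime and applies it to an untyped symbol address. This is not zr’s actual implementation: `plugin`, `FnPtr` and `typed` are hypothetical names, and the raw address is faked locally rather than obtained from a dynamic library.

```zig
const std = @import("std");

// Hypothetical stand-in for the plugin module whose declarations define the API.
const plugin = struct {
    pub fn add(a: i32, b: i32) callconv(.c) i32 {
        return a + b;
    }
};

/// Pointer type for the declaration `name` in `Module`, derived at comptime.
fn FnPtr(comptime Module: type, comptime name: []const u8) type {
    return *const @TypeOf(@field(Module, name));
}

/// Turns an untyped symbol address into a pointer with the declaration's exact type.
fn typed(comptime Module: type, comptime name: []const u8, sym: *const anyopaque) FnPtr(Module, name) {
    return @ptrCast(@alignCast(sym));
}

pub fn main() void {
    // In a real loader this address would come from dlsym or std.DynLib.lookup.
    const raw: *const anyopaque = @ptrCast(&plugin.add);
    const add = typed(plugin, "add", raw);
    // Calling with the wrong argument types would now be a compile error.
    std.debug.print("{d}\n", .{add(2, 3)});
}
```

The point of the design is that the host never writes the function signature by hand, so the signature cannot drift away from the plugin’s source.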

Additionally, it solves two of the four biggest caveats of hot reloading with dynamic libraries: static/global variables and function pointers. With zr you can use them relatively seamlessly.

I would like to make this the best hot-reloading solution for Zig until the compiler gets support for hot-patching binaries on its own with eldritch magic. I think it already is that, but I would like some help if anyone’s interested. I’ve tested that this fully works on Linux, and it cross-compiles to Windows just fine, but I’m not sure whether it actually runs there. I would appreciate it if someone with a Windows computer could help me test whether anything’s wrong. macOS should work fine, but it’s the same story.

I also think this can solve caveat 3 (threading) with some kind of thread tracker, and caveat 4 (struct size/layout changes) could potentially be something zr detects on its own with meta-programming shenanigans, to either crash or allow for custom handling. But that’s in the future for now.
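One way the layout check for caveat 4 could look, purely as a sketch and not something zr does today: hash each field’s name, offset and size, and compare the value from the old library with the value from the new one. `layoutHash`, `StateV1` and `StateV2` are hypothetical names.

```zig
const std = @import("std");

/// Fingerprint of a struct's layout: field names, offsets, sizes and total size.
fn layoutHash(comptime T: type) u64 {
    var h = std.hash.Fnv1a_64.init();
    inline for (@typeInfo(T).@"struct".fields) |field| {
        const meta = [2]u64{ @offsetOf(T, field.name), @sizeOf(field.type) };
        h.update(field.name);
        h.update(std.mem.asBytes(&meta));
    }
    const total: u64 = @sizeOf(T);
    h.update(std.mem.asBytes(&total));
    return h.final();
}

// Hypothetical state struct before and after an edit to the plugin.
const StateV1 = struct { x: f32, y: f32 };
const StateV2 = struct { x: f32, y: f32, hp: u32 };

pub fn main() void {
    const old = layoutHash(StateV1);
    const new = layoutHash(StateV2);
    if (old != new) {
        std.debug.print("layout changed: state must be migrated or reset\n", .{});
    }
}
```

In a real setup the plugin would have to export its fingerprint so the host can compare it across reloads; how that export would be wired up is left open here.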

https://codeberg.org/zwynd/zr

Awesome! I was looking for something like this to try out hot reloading in one of my projects! And the source code is really nice to read.
Thanks :)

I appreciate it! Do tell me how it goes if you decide to integrate it into your project.

Building and running the example on Debug x86_64-windows-gnu (win11) gives me the following panic:

C:\zigtest\clones\zr>zig build run_example_print
Exit with Ctrl-C
initialized
thread 321300 panic: load of misaligned address 0x7ffb89c311ea for type 'DWORD' (aka 'unsigned long'), which requires 4 byte alignment
???:?:?: 0x7ffb89af485e in ??? (print.dll.tmp)
???:?:?: 0x7ffb89af419a in ??? (print.dll.tmp)
???:?:?: 0x7ffb89af3308 in ??? (print.dll.tmp)
???:?:?: 0x7ffb89af6e1e in ??? (print.dll.tmp)
???:?:?: 0x7ffb89af5ddc in ??? (print.dll.tmp)
???:?:?: 0x7ffb89af51a5 in ??? (print.dll.tmp)
???:?:?: 0x7ffb89af4e6e in ??? (print.dll.tmp)
C:\zigtest\clones\zr\examples\print\host.zig:51:9: 0x7ff788241f64 in main (print_zcu.obj)
    init(&plugin.registry);
        ^
C:\zig\lib\std\start.zig:602:28: 0x7ff78824360d in main (print_zcu.obj)
    return callMainWithArgs(@as(usize, @intCast(c_argc)), @as([*][*:0]u8, @ptrCast(c_argv)), envp);
                           ^
C:\zig\lib\libc\mingw\crt\crtexe.c:259:0: 0x7ff78825693b in __tmainCRTStartup (crt2.obj)
    mainret = _tmain (argc, argv, envp);

C:\zig\lib\libc\mingw\crt\crtexe.c:179:0: 0x7ff78825699b in mainCRTStartup (crt2.obj)
  ret = __tmainCRTStartup ();

???:?:?: 0x7ffc2e6de8d6 in ??? (KERNEL32.DLL)
???:?:?: 0x7ffc3074c53b in ??? (ntdll.dll)
run_example_print
└─ run exe print failure

Building for ReleaseSafe gives me a different but clearly related panic:

C:\zigtest\clones\zr>zig build run_example_print --release=safe
Exit with Ctrl-C
initialized
Illegal instruction at address 0x7ffbc0de229b
C:\zigtest\clones\zr\src\zr.zig:256:50: 0x7ffbc0de26a4 in track_fnptr__anon_1873 (print_zcu.obj)
        const name = try get_symbol_name_from_ptr(@ptrCast(@constCast(fnptr.*)));
                                                 ^
C:\zigtest\clones\zr\examples\print\plugin.zig:33:20: 0x7ffbc0de23c6 in init (print_zcu.obj)
    reg.track_fnptr(&funcs.bleep) catch {};
                   ^
???:?:?: 0x7ff7aa6f33f4 in ??? (print.exe)
???:?:?: 0x7ff7aa71f28b in ??? (print.exe)
???:?:?: 0x7ff7aa71f2eb in ??? (print.exe)
???:?:?: 0x7ffc2e6de8d6 in ??? (KERNEL32.DLL)
???:?:?: 0x7ffc3074c53b in ??? (ntdll.dll)
run_example_print
└─ run exe print failure

Building for ReleaseFast or ReleaseSmall makes it bleep once and then stop printing output forever.

Thanks for trying it out. I got a private message about the same error happening on Windows. The problem is on dlfcn-win32’s side. This is the stack trace with location info:

dlfcn.c:670:0
dlfcn.c:837:0
zr.zig:117:19
zr.zig:574:29
zr.zig:256:50
plugin.zig:33:20

Concretely, this is the line that’s panicking:

for( i = 0; i < ied->NumberOfFunctions; i++ )
    {
        if( (void *) ( base + functionAddressesOffsets[i] ) > addr || candidateAddr >= (void *) ( base + functionAddressesOffsets[i] ) )
            continue;

My only guess is that it’s trying to add DWORD values to void* and causing some issues. Or perhaps not passing RTLD_NOW on Windows is causing some issues?

Sadly, I cannot debug this myself as I lack a Windows machine. I will try to set up a VM during the weekend.

For now, I updated the dlfcn-win32 used in the repository from the latest release to the latest commit on the master branch. Could you check whether that helps?

Didn’t help, sadly.
From the specific message that it’s returning, it seems like Windows expects function pointers to be aligned to exact 4-byte boundaries.
Specifically, it’s likely that it’s altering the function pointer to match that alignment, which causes it to execute the machine code with an offset and never return from the function pointer it calls.
I tried tinkering with a few things (e.g. changing export linkage to strong, changing export section) to try to make it export pointers that Windows likes, but I also couldn’t get it to work.

That is quite frustrating. From skimming the source code, it seems like dlfcn-win32 just parses the PE headers of the DLL file and looks up the address in the export table… It doesn’t sound too hard to replicate in Zig. I may give it a try when I set up the Windows VM.

Alternatively, just having to specify the export name in reg.track_fnptr would remove the need for this altogether, at the cost of a little more inconvenience.

… What weirds me out though, is that, in this specific line that is causing the panic:

if( (void *) ( base + functionAddressesOffsets[i] ) > addr || candidateAddr >= (void *) ( base + functionAddressesOffsets[i] ) )

I assume addr is not causing the misaligned load, as that is the actual function pointer and isn’t modified in the code. candidateAddr gets offset, but only by the values in the IMAGE_EXPORT_DIRECTORY:

candidateAddr = (void *) ( base + functionAddressesOffsets[i] );
DWORD *functionAddressesOffsets = (DWORD *) (base + (DWORD) ied->AddressOfFunctions);
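The panic message names a load of type DWORD, so the 32-bit reads from the export table are the candidates rather than addr itself. If the lookup were reimplemented in Zig, one way to avoid depending on alignment at all is to read each 32-bit value byte-wise. This is only a sketch of that reading technique, not a diagnosis of the dlfcn-win32 bug; `readDword` and the `image` buffer are hypothetical.

```zig
const std = @import("std");

/// Reads a little-endian u32 at `offset` without assuming 4-byte alignment.
fn readDword(bytes: []const u8, offset: usize) u32 {
    return std.mem.readInt(u32, bytes[offset..][0..4], .little);
}

pub fn main() void {
    // Hypothetical buffer standing in for mapped PE data; offset 1 is deliberately odd.
    const image = [_]u8{ 0xff, 0x78, 0x56, 0x34, 0x12, 0x00 };
    std.debug.print("0x{x}\n", .{readDword(&image, 1)});
}
```

Because `std.mem.readInt` takes a byte array pointer, no alignment requirement is ever attached to the read, which sidesteps the kind of check that fired in the Debug build.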

I tried running the print example on macOS but got a bunch of compilation errors. I get the same with both Zig 0.15.2 and 0.15.1. Hope this helps; I can open an issue on Codeberg if you prefer.

13:00:29 ~/Developer/zr (main) % zig build run_example_print
run_example_print
└─ run exe print
└─ install
└─ install print
└─ compile lib print Debug native 1 errors
src/zr.zig:257:51: error: expected type ‘**const anyopaque’, found ‘**const fn () callconv(.c) void’
try self.fnptrs.append(self.allocator, .{.ptr = fnptr, .name = name});
~^~~~~~~~~~~
src/zr.zig:257:51: note: pointer type child ‘*const fn () callconv(.c) void’ cannot cast into pointer type child ‘*const anyopaque’
src/zr.zig:257:51: note: pointer alignment ‘4’ cannot cast into pointer alignment ‘1’
referenced by:
init: examples/print/plugin.zig:33:20
comptime: examples/print/plugin.zig:71:14
4 reference(s) hidden; use ‘-freference-trace=6’ to see all references
error: the following command failed with 1 compilation errors:
/Users/me/.zvm/0.15.1/zig build-lib -ODebug --dep zr -Mroot=/Users/me/Developer/zr/examples/print/plugin.zig -ODebug -Mzr=/Users/me/Developer/zr/src/zr.zig -lc --cache-dir .zig-cache --global-cache-dir /Users/me/.cache/zig --name print -dynamic -install_name @rpath/libprint.dylib --zig-lib-dir /Users/me/.zvm/0.15.1/lib/ --listen=-
run_example_print
└─ run exe print
└─ install
└─ install print
└─ compile exe print Debug native 3 errors
src/zr.zig:257:51: error: expected type ‘**const anyopaque’, found ‘**const fn () callconv(.c) void’
try self.fnptrs.append(self.allocator, .{.ptr = fnptr, .name = name});
~^~~~~~~~~~~
src/zr.zig:257:51: note: pointer type child ‘*const fn () callconv(.c) void’ cannot cast into pointer type child ‘*const anyopaque’
src/zr.zig:257:51: note: pointer alignment ‘4’ cannot cast into pointer alignment ‘1’
referenced by:
init: examples/print/plugin.zig:33:20
comptime: examples/print/plugin.zig:71:14
6 reference(s) hidden; use ‘-freference-trace=8’ to see all references
src/zr.zig:393:42: error: @ptrCast increases pointer alignment
@field(syms, name) = @ptrCast(sym);
^~~~~~~~~~~~~
src/zr.zig:393:51: note: ‘*const anyopaque’ has alignment ‘1’
@field(syms, name) = @ptrCast(sym);
^~~
src/zr.zig:393:42: note: ‘*const fn (*zr.Registry) callconv(.c) void’ has alignment ‘4’
src/zr.zig:393:42: note: use @alignCast to assert pointer alignment
src/zr.zig:490:43: error: @ptrCast increases pointer alignment
@field(self.syms, name) = @ptrCast(sym);
^~~~~~~~~~~~~
src/zr.zig:490:52: note: ‘*const anyopaque’ has alignment ‘1’
@field(self.syms, name) = @ptrCast(sym);
^~~
src/zr.zig:490:43: note: ‘*const fn (*zr.Registry) callconv(.c) void’ has alignment ‘4’
src/zr.zig:490:43: note: use @alignCast to assert pointer alignment
error: the following command failed with 3 compilation errors:
/Users/me/.zvm/0.15.1/zig build-exe -ODebug --dep zr --dep plugin -Mroot=/Users/me/Developer/zr/examples/print/host.zig -ODebug -Mzr=/Users/me/Developer/zr/src/zr.zig -ODebug --dep zr -Mplugin=/Users/me/Developer/zr/examples/print/plugin.zig -lc --cache-dir .zig-cache --global-cache-dir /Users/me/.cache/zig --name print --zig-lib-dir /Users/me/.zvm/0.15.1/lib/ --listen=-

Build Summary: 0/7 steps succeeded; 2 failed
run_example_print transitive failure
└─ run exe print transitive failure
├─ compile exe print Debug native 3 errors
└─ install transitive failure
├─ install print transitive failure
│  └─ compile lib print Debug native 1 errors
└─ install print transitive failure
└─ compile exe print Debug native (reused)

error: the following build command failed with exit code 1:
.zig-cache/o/6cb943ea53ba5f5daeb106df0b7204ef/build /Users/me/.zvm/0.15.1/zig /Users/me/.zvm/0.15.1/lib /Users/me/Developer/zr .zig-cache /Users/me/.cache/zig --seed 0x7ef9eaaa -Z861376baa313f9bc run_example_print

Thank you for the help.

Apparently function pointers on macOS have a different alignment than on Linux and Windows. I just tried doing @ptrCast(@alignCast(fnptr)). Could you check if that works?

Yeah, adding @alignCast to the @ptrCasts on lines 257, 393 and 490 allows it to build and run. I haven’t had the chance to figure out how to trigger a reload so I don’t know if it’s fully working yet.
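For readers following along, here is a minimal sketch of the pattern those casts serve: keeping a type-erased pointer to a function pointer variable so that it can be overwritten after a reload. It is modelled on the `.{.ptr = fnptr, .name = name}` line from the error output, but `Tracked`, `track` and `patch` are hypothetical names and this is not zr’s actual code.

```zig
const std = @import("std");

const BleepFn = *const fn () callconv(.c) i32;

fn bleepV1() callconv(.c) i32 {
    return 1;
}

fn bleepV2() callconv(.c) i32 {
    return 2;
}

/// Type-erased record of where a function pointer lives, so it can be patched later.
const Tracked = struct {
    slot: **const anyopaque,
    name: []const u8,
};

/// `fnptr` is the address of a function pointer variable (e.g. `&bleep`).
fn track(fnptr: anytype, name: []const u8) Tracked {
    // @alignCast asserts the alignment; @ptrCast erases the concrete function type.
    return .{ .slot = @ptrCast(@alignCast(fnptr)), .name = name };
}

/// Overwrites the tracked variable with the address from the freshly loaded library.
fn patch(t: Tracked, new_addr: *const anyopaque) void {
    t.slot.* = new_addr;
}

pub fn main() void {
    var bleep: BleepFn = &bleepV1;
    const t = track(&bleep, "bleep");
    std.debug.print("{d}\n", .{bleep()}); // still the old function

    // Stand-in for looking the symbol up again after a reload.
    patch(t, @ptrCast(&bleepV2));
    std.debug.print("{d}\n", .{bleep()}); // now the new function
}
```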

I’d appreciate it if you could PR that. Otherwise I’ll get it working in a bit.

Err, to reload all you have to do is get it to build the dynamic library again.

On macOS, with commit e6443c86e8, I had to add @alignCast on lines 393 and 490 for the example to work.
But now that it’s working, when I modify the plugin (print.zig), I see in the console that it has been reloaded (the number even gets incremented) but the new function’s body isn’t loaded.

For example, I modify bleep to print bleeEEEeep but I still see bleep in stdout.

Any ideas?

Just pushed a new commit with the added @alignCasts, thank you for the help!

As for your issue, does it happen just with function pointers (like bleep), or is no change ever registered? If it’s the latter, it sounds like some kind of dynamic linker issue, where it’s caching the .dylib and not really unloading and reloading it. I’d have to read up on dynamic libraries on macOS to really know, but I do know glibc’s dynamic linker is known for that kind of stuff.

I tested again and your latest commit fixed the issues I described; the main branch builds and runs on macOS now.

I can also confirm what @Zonion said: it increments the number but no changes are loaded. I tried changing the output of the two debug prints as well as the number that test_number is incremented by; neither was reflected in the output until I stopped and re-ran the zig build run_example_print command. I just used zig build to trigger the reload. Is that the correct approach?

Normally I would say that maybe I did the build.zig wrong and it’s not triggering a reload. But what you’re describing is plugin.reload() returning true with no errors, and yet, the dynamic library is not being reloaded…

That is a really weird bug. Once again, my only guess is that macOS’s dynamic linker needs some extra setup before it properly closes a dynamic library, and otherwise simply keeps it open and returns some sort of cached version. There’s an article about Linux’s dynamic linker doing that because of thread-local storage.

When I patch this up for Windows I will try to research whether that’s the case. But I can’t really dive deep into it, as I do not have, nor ever plan to have, a Mac, sadly.

All right, I can help test solutions. I’ll also try to look deeper into this on my own when I get a chance.

When I implemented hot reloading in my project I don’t recall anything platform specific required other than having to load a copy of the DLL on Windows. That was built using the Zig std library though, not libc.
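A minimal sketch of the “load a copy” approach using only the Zig standard library, written against the Zig 0.15 std API. The paths and the `bleep` symbol name are assumptions for illustration, and `openCopy` is a hypothetical helper, not part of zr or of the project mentioned above.

```zig
const std = @import("std");

/// Copies `src` to `tmp` and opens the copy, so `src` stays free to be
/// overwritten by the next build while the copy is loaded.
fn openCopy(src: []const u8, tmp: []const u8) !std.DynLib {
    try std.fs.cwd().copyFile(src, std.fs.cwd(), tmp, .{});
    return std.DynLib.open(tmp);
}

pub fn main() !void {
    // Hypothetical paths; adjust to wherever the build installs the library.
    var lib = try openCopy("zig-out/lib/libprint.so", "zig-out/lib/libprint.so.tmp");
    defer lib.close();

    // `lookup` returns null when the symbol is missing.
    const bleep = lib.lookup(*const fn () callconv(.c) void, "bleep") orelse
        return error.SymbolNotFound;
    bleep();
}
```

On a reload the host would close the old handle, copy the rebuilt library again, and repeat the lookups.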

std.DynLib resolves to DlDynLib on macOS, which uses the libc symbols, so it should be the same…

I’m very tempted to do away with dladdr altogether and simply require the export name of function pointers in reg.track_fnptr. It further discourages using function pointers unless totally necessary, and it allows zr to use std.DynLib directly… potentially avoiding a libc dependency altogether.

zr 0.2 is out!

Changelog

  • Removed dependency on dlfcn-win32 and dladdr as a whole: zr should now work on Windows and macOS without issues, hopefully. This does mean, however, that zr.Registry.track_fnptr has to take an extra parameter: the export name of the function that the pointer points to.
  • Added support for multiple dynamic library APIs: zr can now use libc symbols directly as it did before, OR use std.DynLib (which unlocks Windows support, as well as building without libc on Windows and Linux), OR a completely custom backend provided by the host! This is controlled by the .dl_library field in zr.PluginCfg. It now defaults to Zig’s std.DynLib backend, though keep in mind that this backend cannot load null symbols without erroring out.
  • Added support for different link modes: zr now has an option to still link dynamically but skip the hot-reloading logic, by setting .link_mode = .dynamic_no_reload in PluginCfg. This is useful when the host still wants to use dynamic libraries for its plugins in release mode and only wants to disable the live-reloading code. You can still link fully statically with .link_mode = .static.
  • Unified error tracking: dynamic library APIs have very inconsistent errors since they come from the OS. Most of them are unrecoverable and come from programmer errors. zr now mimics libc’s approach of a global error variable (though atomic, so thread-safe) that describes what went wrong. Most functions return error.Unexpected, and you are supposed to get more information via zr.err.load(.seq_cst).
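To make the error-tracking pattern concrete, here is a small self-contained sketch of a global atomic error slot in the same spirit. It is not zr’s definition of zr.err: `last_error` and `openLibrary` are hypothetical, and the failure is simulated.

```zig
const std = @import("std");

/// Hypothetical global error slot holding a pointer to a static message.
var last_error = std.atomic.Value([*:0]const u8).init("no error");

/// Simulated backend call: always fails, recording the reason before returning.
fn openLibrary(path: [:0]const u8) error{Unexpected}!void {
    _ = path;
    last_error.store("library not found", .seq_cst);
    return error.Unexpected;
}

pub fn main() void {
    openLibrary("missing.so") catch {
        // The error value itself is generic; the details live in the global slot.
        const msg = std.mem.span(last_error.load(.seq_cst));
        std.debug.print("open failed: {s}\n", .{msg});
    };
}
```

Storing only a pointer keeps the slot a single machine word, which is what lets it be updated atomically.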

This should hopefully fix the Windows and macOS issues, since zr no longer depends on dlfcn-win32. Please try it out and tell me if it works now!

I wanted to add a new example for a custom backend that uses SDL’s shared library API. However, I couldn’t find a way to add an SDL dependency to the project that is only built when running an example. It’s very simple though; here’s how I do it in my own game:

const GamePlugin = zr.Plugin(@import("game"), .{
    // ...
    .dl_library = .{.custom = .{ 
        .open = sdl_dl_open,
        .sym = sdl_dl_sym,
        .close = sdl_dl_close,
    }},
    // ...
});

fn sdl_dl_open(path: [:0]const u8) error{Unexpected}!*anyopaque {
    const so = sdl.SharedObject.load(path) catch {
        if (sdl.errors.get()) |err| {
            zr.err.store(err.ptr, .seq_cst);
        }
        return error.Unexpected;
    };
    return so.value;
}

fn sdl_dl_sym(handle: *anyopaque, name: [:0]const u8) error{Unexpected}!?*const anyopaque {
    var so = sdl.SharedObject { .value = @ptrCast(handle) };
    const sym = so.loadFunction(name) catch {
        if (sdl.errors.get()) |err| {
            zr.err.store(err.ptr, .seq_cst);
        }
        return error.Unexpected;
    };
    return sym;
}

fn sdl_dl_close(handle: *anyopaque) void {
    var so = sdl.SharedObject { .value = @ptrCast(handle) };
    so.unload();
}

Hopefully this addition makes zr usable basically anywhere.

Nice! I’ll try again on macOS tonight or tomorrow :)
Thanks
