Runtime linking not working on Linux and Windows but works on macOS

Before I provide more details, I just want to clarify some terminology:

  • Dynamic Linking: The library is specified at compile time via module.linkSystemLibrary.
  • Runtime Linking: The library and its functions are resolved at runtime via std.c.dlopen and std.c.dlsym (or similar).
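
For readers unfamiliar with the second approach, here is a minimal sketch of runtime linking using std.DynLib, the standard library wrapper around dlopen/dlsym. It is not taken from the project: the library name libm.so.6 and the helper callCos are assumptions chosen only for illustration, and the sketch assumes a glibc-based Linux system with libc linked.

```zig
const std = @import("std");

// Signature of the C function we expect to find in the library.
const CosFn = *const fn (f64) callconv(.c) f64;

// Hypothetical helper: open `path`, resolve "cos", call it once.
fn callCos(path: []const u8, x: f64) !f64 {
    var lib = try std.DynLib.open(path);
    // The resolved pointer is only valid while the library stays open.
    defer lib.close();

    const cos = lib.lookup(CosFn, "cos") orelse return error.SymbolNotFound;
    return cos(x);
}

pub fn main() void {
    // "libm.so.6" is an assumed library name for glibc-based Linux.
    const result = callCos("libm.so.6", 0.0) catch |err| {
        std.debug.print("could not use library: {s}\n", .{@errorName(err)});
        return;
    };
    std.debug.print("cos(0) = {d}\n", .{result});
}
```

Nothing about the library is known to the compiler here, so a wrong function signature or a pointer used after close() only shows up at runtime.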

Introduction

I’m trying to embed a standalone build of CPython into my project by linking the libpython library at runtime. The goal is to make PyTorch run in a Python runtime controlled by my Zig program.

Source

Project source code: https://github.com/LmanTW/dekun-new

Note that it currently only works on the Linux target (x86 tested, aarch64 should work). It should also work on the aarch64 macOS target, but I’m not sure.

  • src/core/Python.zig Loads the libpython library, looks up its symbols, and interacts with it.
  • src/core/Bridge.zig Abstraction layer between the Python functions and Zig.

Problem

The main idea is that my Zig program will import a Python module then hold on to it. It will also use PyModule_GetDict and PyDict_GetItemString to get the functions from the module. It will borrow the reference to those functions and store them, using them later on.

The problem is that sometimes I get a segmentation fault when calling Py_DecRef on the object returned by PyObject_CallObject. Other times I get a general protection exception when calling PyObject_CallObject for the second time.

Attempts

I started by checking whether I was releasing object references or finalizing the interpreter too early, but both happen when expected. Then I thought maybe I was miscasting some function pointers, but that isn’t the case either.

I’ve also tried removing the PyTorch part from the Python code, thinking it might be causing some sort of memory corruption. But that isn’t the case either.

Eventually I tried dynamically linking the same standalone build with identical code (just replacing the runtime linked functions with the imported C ones) and it worked with no problem.

You can check the dynamic branch to see the dynamically linked version.

Versions

Zig: 0.15.1 ~ 0.15.2
CPython: 3.13.9

Some folks in the Python Discord server have been able to reproduce this issue, so it should be pretty universal.

I tested it on x86_64 macOS (with the same code, just removed some parts to make it run on macOS) and it works as expected. I don’t know if this is a Zig bug or I’m using dlopen and dlsym wrong on Linux.

Do you link to a C library on Linux (musl or GNU)? On macOS, Zig always links to the C library because it is the only valid way to make system calls. On Linux you must specify -lc or set root_module.link_libc = true.
If you link to the C library, do you use the same C library that python links to? (e.g. target native-linux-gnu).
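
As a rough sketch of what that looks like in a build.zig for Zig 0.15: the executable name and source path below are taken from the stack traces in this thread, and everything else is a generic template rather than the project's actual build script.

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    // Pass -Dtarget=native-linux-gnu to match the libc Python was built against.
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "dekun",
        .root_module = b.createModule(.{
            .root_source_file = b.path("src/main.zig"),
            .target = target,
            .optimize = optimize,
            // Required for std.c.dlopen / std.c.dlsym on Linux.
            .link_libc = true,
        }),
    });
    b.installArtifact(exe);
}
```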

The ABI of the Zig program and the C library are both GNU. And yes I’ve linked libc.

I’ve been using the install_only_stripped build, and was thinking they might have different build configuration. But after trying to use the pgo+lto-full build on both Linux and macOS the problem still persists on Linux.

I’ve also tested it on x86_64 Windows, where it also crashes with a segmentation fault.

Segmentation fault at address 0x0
C:\Users\LmanTW\Downloads\dekun\src\core\Python.zig:250:52: 0x7ff71babb5f8 in callObject (dekun_zcu.obj)
        if (self.python.binding.PyObject_CallObject(self.internal, if (arguments) |object| object.internal else null)) |object| {
                                                   ^
C:\Users\LmanTW\Downloads\dekun\src\core\Bridge.zig:89:61: 0x7ff71babbd6d in initializeMarker (dekun_zcu.obj)
    const object = try self.functions.init_marker.callObject(arguments);
                                                            ^
C:\Users\LmanTW\Downloads\dekun\src\core\Dekun.zig:1088:36: 0x7ff71ba71c59 in init_command_callback (dekun_zcu.obj)
        try bridge.initializeMarker(backend, width, height, depth);
                                   ^
C:\Users\LmanTW\Downloads\dekun\src\core\Console.zig:298:37: 0x7ff71ba82694 in use__anon_31264 (dekun_zcu.obj)
        try current_command.callback(context, current_command, &parsed_arguments);
                                    ^
C:\Users\LmanTW\Downloads\dekun\src\main.zig:27:20: 0x7ff71ba84629 in main (dekun_zcu.obj)
    try console.use(Dekun, &dekun, dekun.root_command);
                   ^
C:\Users\LmanTW\.zvm\0.15.1\lib\std\start.zig:602:28: 0x7ff71ba84bed in main (dekun_zcu.obj)
    return callMainWithArgs(@as(usize, @intCast(c_argc)), @as([*][*:0]u8, @ptrCast(c_argv)), envp);
                           ^
C:\Users\LmanTW\.zvm\0.15.1\lib\libc\mingw\crt\crtexe.c:259:0: 0x7ff71bdf26fb in __tmainCRTStartup (crt2.obj)
    mainret = _tmain (argc, argv, envp);

C:\Users\LmanTW\.zvm\0.15.1\lib\libc\mingw\crt\crtexe.c:179:0: 0x7ff71bdf275b in mainCRTStartup (crt2.obj)
  ret = __tmainCRTStartup ();

???:?:?: 0x7ffb7e257373 in ??? (KERNEL32.DLL)
???:?:?: 0x7ffb800bcc90 in ??? (ntdll.dll)

Again, it’s all the same code, so there should be no logical error here since it works on macOS.

Forgot to provide the stack traces on Linux.

Segmentation fault at address 0x708239158210
???:?:?: 0x708239158210 in ??? (???)
Unwind information for `???:0x708239158210` was not available, trace may be incomplete

/root/dekun/src/core/Bridge.zig:85:35: 0x116f813 in initializeMarker (dekun)
    defer object.decreaseReference();
                                  ^
/root/dekun/src/core/Dekun.zig:1068:36: 0x112b6de in init_command_callback (dekun)
        try bridge.initializeMarker(backend, width, height, depth);
                                   ^
/root/dekun/src/core/Console.zig:298:37: 0x113af15 in use__anon_23730 (dekun)
        try current_command.callback(context, current_command, &parsed_arguments);
                                    ^
/root/dekun/src/main.zig:27:20: 0x113cf47 in main (dekun)
    try console.use(Dekun, &dekun, dekun.root_command);
                   ^
/root/.zvm/0.15.1/lib/std/start.zig:627:37: 0x113d645 in main (dekun)
            const result = root.main() catch |err| {
                                    ^
run
└─ run exe dekun failure
error: the following command terminated unexpectedly:
./.zig-cache/o/2278f6c654f1954374279686505d4a91/dekun marker init a

Build Summary: 2/4 steps succeeded; 1 failed
run transitive failure
└─ run exe dekun failure

error: the following build command failed with exit code 1:
.zig-cache/o/1d1713ec263b76d00246040b3052ef58/build /root/.zvm/0.15.1/zig /root/.zvm/0.15.1/lib /root/dekun .zig-cache /root/.cache/zig --seed 0x7aa375c5 -Z757eb274fff67ad3 run -- marker init a
General protection exception (no address available)
/root/dekun/src/core/Python.zig:263:52: 0x116f09d in callObject (dekun)
        if (self.python.binding.PyObject_CallObject(self.internal, if (arguments) |object| object.internal else null)) |object| {
                                                   ^
/root/dekun/src/core/Bridge.zig:95:50: 0x1170425 in saveMarker (dekun)
    _ = try self.functions.save_marker.callObject(arguments);
                                                 ^
/root/dekun/src/core/Dekun.zig:1085:30: 0x112bbd6 in init_command_callback (dekun)
        try bridge.saveMarker(model_path);
                             ^
/root/dekun/src/core/Console.zig:298:37: 0x113af15 in use__anon_23730 (dekun)
        try current_command.callback(context, current_command, &parsed_arguments);
                                    ^
/root/dekun/src/main.zig:27:20: 0x113cf47 in main (dekun)
    try console.use(Dekun, &dekun, dekun.root_command);
                   ^
/root/.zvm/0.15.1/lib/std/start.zig:627:37: 0x113d645 in main (dekun)
            const result = root.main() catch |err| {
                                    ^
../sysdeps/nptl/libc_start_call_main.h:58:16: 0x792988829d8f in __libc_start_call_main (../sysdeps/x86/libc-start.c)
../csu/libc-start.c:392:3: 0x792988829e3f in __libc_start_main_impl (../sysdeps/x86/libc-start.c)
???:?:?: 0x108c5e4 in ??? (???)
???:?:?: 0x0 in ??? (???)
run
└─ run exe dekun failure
error: the following command terminated unexpectedly:
./.zig-cache/o/7dd4899d31a734a49fc37224776f0489/dekun marker init a

Build Summary: 2/4 steps succeeded; 1 failed
run transitive failure
└─ run exe dekun failure

error: the following build command failed with exit code 1:
.zig-cache/o/f3f036138f28b916ac86c1a4a911ec4f/build /root/.zvm/0.15.1/zig /root/.zvm/0.15.1/lib /root/dekun .zig-cache /root/.cache/zig --seed 0xd322bd36 -Z7192d63333f16f07 run -- marker init a

Digging into the memory using gdb, it seems like it is trying to jump to an invalid address (I’m assuming this is happening in the CPython interpreter):

(gdb) x/20i $pc-20
   0x1174329 <initial+145169>:  add    %al,(%rax)
   0x117432b <initial+145171>:  add    %bl,-0x3d(%rbp)
   0x117432e <initial+145174>:  mov    -0x58(%rbp),%rax
   0x1174332 <initial+145178>:  mov    -0x50(%rbp),%rdi
   0x1174336 <initial+145182>:  mov    -0x90(%rbp),%rsi
=> 0x117433d <initial+145189>:  call   *%rax
   0x117433f <initial+145191>:  mov    %rax,-0x88(%rbp)
   0x1174346 <initial+145198>:  cmp    $0x0,%rax
   0x117434a <initial+145202>:  jne    0x117436a <initial+145234>
   0x117434c <initial+145204>:  jmp    0x117438b <initial+145267>
   0x117434e <initial+145206>:  mov    -0x80(%rbp),%rax
   0x1174352 <initial+145210>:  mov    0x8(%rax),%rax
   0x1174356 <initial+145214>:  mov    %rax,-0x90(%rbp)
   0x117435d <initial+145221>:  jmp    0x117432e <initial+145174>
   0x117435f <initial+145223>:  xor    %eax,%eax
   0x1174361 <initial+145225>:  mov    %rax,-0x90(%rbp)
   0x1174368 <initial+145232>:  jmp    0x117432e <initial+145174>
   0x117436a <initial+145234>:  mov    -0x88(%rbp),%rax
   0x1174371 <initial+145241>:  mov    -0x78(%rbp),%rcx
   0x1174375 <initial+145245>:  mov    %rax,-0x48(%rbp)
(gdb) info register rax
rax            0xaaaaaaaaaaaaaaaa  -6148914691236517206

Turns out the function pointer I’m storing somehow gets overwritten with 0xaaaaaaaaaaaaaaaa somewhere between the first and the second call to the function.

0xaa is the byte pattern that Zig writes to undefined memory in Debug builds. This can happen, for example, when freeing memory.
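
To see that pattern in isolation, here is a small sketch that is unrelated to the project's code. It uses a FixedBufferAllocator over a local array so the freed bytes can be inspected legally afterwards; the helper name firstByteAfterFree is made up for this example, and the 0xaa result is only expected in Debug builds.

```zig
const std = @import("std");

// Hypothetical helper: allocate, write, free, then look at the raw storage.
fn firstByteAfterFree() !u8 {
    var backing = [_]u8{0} ** 16;
    var fba = std.heap.FixedBufferAllocator.init(&backing);
    const allocator = fba.allocator();

    const buf = try allocator.alloc(u8, 8);
    @memset(buf, 0x11); // pretend this is live data
    allocator.free(buf); // Allocator.free marks the bytes as undefined

    // `backing` is still our own array, so reading it here is fine.
    return backing[0];
}

pub fn main() !void {
    const byte = try firstByteAfterFree();
    // In a Debug build this is expected to show aa rather than 11.
    std.debug.print("first byte after free: 0x{x}\n", .{byte});
}
```

Seeing 0xaaaaaaaaaaaaaaaa in a pointer therefore usually means the memory holding that pointer was freed, went out of scope, or was never initialized.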

Does allocator.free() work correctly with null-terminated slices like [:0]? Or do I need to call std.mem.span() before passing the slice?

Hmm nvm, I guess it should work since [:0] slices still hold a length field.
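
A quick sketch to confirm this, with a made-up helper named roundTrip: a [:0]u8 slice from dupeZ can be handed straight to free, and std.mem.span is only needed when starting from a [*:0] pointer that carries no length.

```zig
const std = @import("std");

// Hypothetical helper: allocate a sentinel-terminated copy and free it again.
fn roundTrip(allocator: std.mem.Allocator) !usize {
    const name: [:0]u8 = try allocator.dupeZ(u8, "libpython");
    // free accounts for the terminator itself; no std.mem.span needed.
    defer allocator.free(name);
    // The slice length does not include the terminating 0.
    return name.len;
}

pub fn main() !void {
    const len = try roundTrip(std.heap.page_allocator);
    std.debug.print("len = {d}\n", .{len});
}
```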

Logging the function pointer table before the PyObject_CallObject call:

  1. The first call:
.{ .Py_Initialize = fn () callconv(.c) void@7d3c23431a40, .Py_InitializeFromConfig = fn ([*c]const cimport.struct_PyConfig) callconv(.c) cimport.PyStatus@7d3c23770a00, .Py_IncRef = fn ([*c]cimport.struct__object) callconv(.c) void@7d3c233bc5e0, .Py_D
ecRef = fn ([*c]cimport.struct__object) callconv(.c) void@7d3c233bc5f0, .Py_Finalize = fn () callconv(.c) void@7d3c23431ac0, .PyConfig_InitPythonConfig = fn ([*c]cimport.struct_PyConfig) callconv(.c) void@7d3c23763170, .PyConfig_InitIsolatedConfig = 
fn ([*c]cimport.struct_PyConfig) callconv(.c) void@7d3c2341bd80, .PyConfig_Clear = fn ([*c]cimport.struct_PyConfig) callconv(.c) void@7d3c23762c40, .PyConfig_SetBytesString = fn ([*c]cimport.struct_PyConfig, [*c][*c]c_int, [*c]const u8) callconv(.c) 
cimport.PyStatus@7d3c2341bdd0, .PyStatus_IsExit = fn (cimport.PyStatus) callconv(.c) c_int@7d3c2341bae0, .PyStatus_IsError = fn (cimport.PyStatus) callconv(.c) c_int@7d3c2341bad0, .PyModule_GetDict = fn ([*c]cimport.struct__object) callconv(.c) [*c]c
import.struct__object@7d3c236a3ff0, .PyDict_GetItemString = fn ([*c]cimport.struct__object, [*c]const u8) callconv(.c) [*c]cimport.struct__object@7d3c233b6cd0, .PyObject_CallObject = fn ([*c]cimport.struct__object, [*c]cimport.struct__object) callcon
v(.c) [*c]cimport.struct__object@7d3c2366f250, .PyLong_FromLong = fn (c_long) callconv(.c) [*c]cimport.struct__object@7d3c23525eb0, .PyLong_AsLong = fn ([*c]cimport.struct__object) callconv(.c) c_long@7d3c235267e0, .PyFloat_FromDouble = fn (f64) call
conv(.c) [*c]cimport.struct__object@7d3c23519780, .PyFloat_AsDouble = fn ([*c]cimport.struct__object) callconv(.c) f64@7d3c23519900, .PyUnicode_FromString = fn ([*c]const u8) callconv(.c) [*c]cimport.struct__object@7d3c23575900, .PyUnicode_AsUTF8AndS
ize = fn ([*c]cimport.struct__object, [*c]isize) callconv(.c) [*c]const u8@7d3c2357ac20, .PyTuple_New = fn (isize) callconv(.c) [*c]cimport.struct__object@7d3c23560eb0, .PyTuple_GetItem = fn ([*c]cimport.struct__object, isize) callconv(.c) [*c]cimpor
t.struct__object@7d3c23561170, .PyTuple_SetItem = fn ([*c]cimport.struct__object, isize, [*c]cimport.struct__object) callconv(.c) c_int@7d3c236b59d0, .PyErr_Clear = fn () callconv(.c) void@7d3c237552e0, .PyErr_Occurred = fn () callconv(.c) [*c]cimpor
t.struct__object@7d3c235fab90, .PyRun_SimpleString = fn ([*c]const u8) callconv(.c) c_int@7d3c23435ca0, .PyImport_ImportModule = fn ([*c]const u8) callconv(.c) [*c]cimport.struct__object@7d3c2375ed60 }
  2. The second call:
.{ .Py_Initialize = fn () callconv(.c) void@ffffffff00022fe6, .Py_InitializeFromConfig = fn ([*c]const cimport.struct_PyConfig) callconv(.c) cimport.PyStatus@7ffc9f13d2e8, .Py_IncRef = fn ([*c]cimport.struct__object) callconv(.c) void@7ffc9f13a5c0, .
Py_DecRef = fn ([*c]cimport.struct__object) callconv(.c) void@7ffc9f13df70, .Py_Finalize = fn () callconv(.c) void@8, .PyConfig_InitPythonConfig = fn ([*c]cimport.struct_PyConfig) callconv(.c) void@7ffc9f13a160, .PyConfig_InitIsolatedConfig = fn ([*c
]cimport.struct_PyConfig) callconv(.c) void@0, .PyConfig_Clear = fn ([*c]cimport.struct_PyConfig) callconv(.c) void@7ffc9f13a5c0, .PyConfig_SetBytesString = fn ([*c]cimport.struct_PyConfig, [*c][*c]c_int, [*c]const u8) callconv(.c) cimport.PyStatus@7
ffc9f13d2e8, .PyStatus_IsExit = fn (cimport.PyStatus) callconv(.c) c_int@7ffc9f13a160, .PyStatus_IsError = fn (cimport.PyStatus) callconv(.c) c_int@8, .PyModule_GetDict = fn ([*c]cimport.struct__object) callconv(.c) [*c]cimport.struct__object@7ffc9f1
3a5c0, .PyDict_GetItemString = fn ([*c]cimport.struct__object, [*c]const u8) callconv(.c) [*c]cimport.struct__object@aaaaaaaaaaaaaaaa, .PyObject_CallObject = fn ([*c]cimport.struct__object, [*c]cimport.struct__object) callconv(.c) [*c]cimport.struct_
_object@22fe600022fe6, .PyLong_FromLong = fn (c_long) callconv(.c) [*c]cimport.struct__object@8, .PyLong_AsLong = fn ([*c]cimport.struct__object) callconv(.c) c_long@aaaaaaaaaaaaaaaa, .PyFloat_FromDouble = fn (f64) callconv(.c) [*c]cimport.struct__ob
ject@aaaaaaaaaaaaaaaa, .PyFloat_AsDouble = fn ([*c]cimport.struct__object) callconv(.c) f64@aaaaaaaaaaaaaaaa, .PyUnicode_FromString = fn ([*c]const u8) callconv(.c) [*c]cimport.struct__object@aaaaaaaaaaaaaaaa, .PyUnicode_AsUTF8AndSize = fn ([*c]cimpo
rt.struct__object, [*c]isize) callconv(.c) [*c]const u8@7ffc9f13a160, .PyTuple_New = fn (isize) callconv(.c) [*c]cimport.struct__object@8, .PyTuple_GetItem = fn ([*c]cimport.struct__object, isize) callconv(.c) [*c]cimport.struct__object@7ffc9f13d2e8,
 .PyTuple_SetItem = fn ([*c]cimport.struct__object, isize, [*c]cimport.struct__object) callconv(.c) c_int@8, .PyErr_Clear = fn () callconv(.c) void@aaaaaaaaaaaaaaaa, .PyErr_Occurred = fn () callconv(.c) [*c]cimport.struct__object@aaaaaaaaaaaaaaaa, .P
yRun_SimpleString = fn ([*c]const u8) callconv(.c) c_int@7ffc9f13a160, .PyImport_ImportModule = fn ([*c]const u8) callconv(.c) [*c]cimport.struct__object@8 }

Hello, I read your code and found a pointer to a stack variable being stored. The problem mainly occurs here:

The module here is constructed by Object.init, which stores a pointer to a python value that lives on the stack, so the pointer becomes invalid once this function returns.
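
A reduced sketch of this class of bug and one possible fix follows. The types here are hypothetical stand-ins, not the project's real Python and Object definitions, and the buggy variant is left in a comment so the example stays safe to run.

```zig
const std = @import("std");

// Hypothetical stand-in for the table of functions resolved at runtime.
const Python = struct {
    call_fn: *const fn (u32) u32,
};

// Hypothetical stand-in for a wrapped Python object.
const Object = struct {
    python: *const Python, // borrowed: must not outlive the Python value
    handle: u32,

    fn init(python: *const Python, handle: u32) Object {
        return .{ .python = python, .handle = handle };
    }

    fn call(self: Object) u32 {
        return self.python.call_fn(self.handle);
    }
};

fn doubleIt(x: u32) u32 {
    return x * 2;
}

// BUG (kept as a comment): `python` lives in this function's stack frame,
// so the returned Object points at dead memory after the return.
//
// fn loadModuleBroken() Object {
//     var python = Python{ .call_fn = &doubleIt };
//     return Object.init(&python, 21);
// }

// One possible fix: the caller owns the Python value and passes a pointer
// that stays valid for as long as any Object uses it.
fn loadModule(python: *const Python) Object {
    return Object.init(python, 21);
}

pub fn main() void {
    const python = Python{ .call_fn = &doubleIt }; // outlives `module`
    const module = loadModule(&python);
    std.debug.print("{d}\n", .{module.call()}); // prints 42
}
```

In the broken variant the first call can appear to work because the old stack frame has not been overwritten yet, which matches the symptom of the function table turning into garbage between the first and second call.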

This is it! This is it!
Again it’s such a simple and stupid bug, took me 5 days :skull:
Thank you so much man!

No problem. This might be another example in favor of removing the syntactic sugar that lets you write stack_obj.method() instead of (&stack_obj).method(). The implicit address-of makes it easy to forget that the method is operating on an object on the stack.
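
To illustrate the sugar being discussed, here is a small sketch with made-up types Counter and Handle. Both call forms compile to the same thing, and the second method quietly hands out a pointer to the stack variable.

```zig
const std = @import("std");

const Handle = struct {
    owner: *Counter,
};

const Counter = struct {
    value: u32 = 0,

    // Takes a pointer, yet callers may write `c.increment()`.
    fn increment(self: *Counter) void {
        self.value += 1;
    }

    // Stores `self`, so the Handle borrows whatever storage `c` lives in.
    fn handle(self: *Counter) Handle {
        return .{ .owner = self };
    }
};

pub fn main() void {
    var c = Counter{};
    c.increment(); // sugar: the address of `c` is taken implicitly
    (&c).increment(); // explicit form of the same call

    const h = c.handle(); // nothing at the call site shows that &c escapes
    std.debug.print("value = {d}, borrows c = {}\n", .{ c.value, h.owner == &c });
}
```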