7 Apr 2025 |
Robert Hensing (roberth) | maybe it's just eval.cc ? | 20:25:14 |
Robert Hensing (roberth) | can't glean much from this, but maybe this happens due to a missing garbage collection root (erroneous decref / missing incref ?) or any other kind of corruption | 20:26:35 |
Robert Hensing (roberth) | a change to code layout may well have knock-on effects that cause a corruption to happen (all else equal). That's regardless of whether your overrides produce a valid binary, which I do not know. | 20:28:39 |
Robert Hensing (roberth) | It's like overrideAttrs but applies to all the components | 20:29:11 |
8 Apr 2025 |
Leonardo Santiago | me neither hahaha | 12:18:49 |
Leonardo Santiago | I don't think so, I'm relying on Drop from Rust and am not really messing with the GC of nix. | 12:19:47 |
Leonardo Santiago | This only seems to happen when I use it through the pyo3 adapter, from the Python side, which hints that the problem is somewhere there. The part I'm most worried about is that pyo3 requires all public structs to be Send, and mine clearly aren't, as they hold raw pointers, so I wrapped everything public in an Arc<Mutex<T>> and it seemed to work fine until now. | 12:21:43 |
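For context on the Send constraint mentioned above, here is a minimal sketch in plain Rust (no pyo3). `RawHandle` and `read_on_other_thread` are hypothetical names invented for illustration and are not taken from nix-forall. The sketch relies on one standard-library fact: `Arc<Mutex<T>>` is only Send when `T` is Send, so a struct holding a raw pointer still needs an explicit `unsafe impl Send`, whose soundness is an assumption the author has to justify.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical handle around a raw pointer, standing in for a pointer
/// owned by a C library. Not taken from any real binding.
struct RawHandle {
    ptr: *mut u8,
}

impl RawHandle {
    fn new(value: u8) -> Self {
        // Allocate on the heap and keep only the raw pointer.
        RawHandle { ptr: Box::into_raw(Box::new(value)) }
    }
}

impl Drop for RawHandle {
    fn drop(&mut self) {
        // Reclaim the allocation made in `new`.
        unsafe { drop(Box::from_raw(self.ptr)) };
    }
}

// Raw pointers are not Send, so RawHandle is not Send by default, and
// wrapping it in Arc<Mutex<_>> does not change that on its own.
// SAFETY (assumption for this sketch): the pointer is only dereferenced
// while the Mutex is held, and the pointee is not tied to one thread.
unsafe impl Send for RawHandle {}

/// Compile-time check that a type is both Send and Sync.
fn assert_send_sync<T: Send + Sync>() {}

/// Read the pointee from a different thread, going through the Mutex.
fn read_on_other_thread(handle: Arc<Mutex<RawHandle>>) -> u8 {
    thread::spawn(move || {
        let guard = handle.lock().unwrap();
        unsafe { *guard.ptr }
    })
    .join()
    .unwrap()
}

fn main() {
    // Compiles only because of the `unsafe impl Send` above.
    assert_send_sync::<Arc<Mutex<RawHandle>>>();
    let handle = Arc::new(Mutex::new(RawHandle::new(7)));
    println!("{}", read_on_other_thread(Arc::clone(&handle)));
}
```

The Mutex serializes access, but it cannot make a thread-affine resource safe to use from another thread, which is the concern raised later in this conversation about the GC.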
Robert Hensing (roberth) | Ohh, Rust wasn't enough? ^^' | 12:22:27 |
Leonardo Santiago | Wdym? If I try to evaluate the exact same expression through the Rust side it works fine. | 12:22:49 |
Robert Hensing (roberth) | Didn't mean anything specific with that | 12:23:11 |
Robert Hensing (roberth) | Did you publish your Rust bindings btw? Might be interesting to merge efforts | 12:24:19 |
Leonardo Santiago | Yes, github.com/o-santi/nix-forall | 12:24:38 |
Leonardo Santiago | I took some inspiration from yours in some places hahaha, especially in the read_string/read_hashmap callbacks | 12:25:26 |
Leonardo Santiago | And I think yours is much more careful when handling thread locking and GC. I don't really understand the constraints there so maybe there's something wrong I did that caused this. | 12:26:45 |
Robert Hensing (roberth) | If everything is on the main thread you're fine, and nix_value registers/deregisters itself with the GC just fine (if all is well), but the GC may not be happy if it needs to operate from a thread it doesn't know about. That includes allocation, so for all intents and purposes that's the whole of nix-expr that should be called from registered threads (or the main thread) only | 12:29:59 |
Robert Hensing (roberth) | You'd get an error message along the lines of "trying to GC from unknown thread". I don't think it can cause corruption necessarily | 12:30:37 |
Robert Hensing (roberth) | but maybe my assumption about GC roots is wrong, and my code does rely on stack scanning regardless | 12:31:07 |
Leonardo Santiago | I think the problem may lie there: pyo3 requires that all your structs be freely movable between threads, as Python is not really a single-threaded interpreter, it just relies heavily on the GIL. I'm not doing anything multithreaded from the Python side, quite the contrary, I'm just calling state.eval_file('path').get(attr), but I can't rule out that it's being moved to another thread either | 12:32:48 |
Robert Hensing (roberth) | Thing is, you might get away with coincidentally not triggering GC in your other threads / stacks | 12:32:58 |
Leonardo Santiago | One more weird detail is that the segfault is really consistent, it happens every time I run the program, but it doesn't always happen at the same place. | 12:33:59 |
Leonardo Santiago | Sometimes it happens at nix::ExprSelect::eval, sometimes at nix::ExprAttr::eval; in the original message I think it happened at nix::ExprVar::eval | 12:34:32 |
Leonardo Santiago | So indeed this may be related to something that is shared/passed to all of them, most likely the EvalState itself, as I may be doing something incorrectly with it. | 12:35:21 |
Leonardo Santiago | But I don't understand what it is yet, I'll dig further | 12:35:50 |
Leonardo Santiago | => if this was a race condition I don't think it would be this reproducible, it would sometimes fail and sometimes not. | 12:36:47 |
Robert Hensing (roberth) | yeah | 12:37:53 |
Leonardo Santiago | Guess what? Setting ulimit -s unlimited made it work. | 16:29:35 |
Leonardo Santiago | It was an uncaught stack overflow. | 16:29:45 |
Leonardo Santiago | Didn't even occur to me until now. | 16:29:57 |
10 Apr 2025 |
Leonardo Santiago | @roberth how does nix circumvent this issue in its main binary? I see I can try leveraging ld's -z stack_size=X, but it only seems to work if you set it in the entry-point ELF binary, which I can't do as it's python! I didn't want to push this problem onto users, like forcing people to set ulimit -s unlimited, but I don't see many other ways around it, and surely nix has had to deal with this; though there it is the ELF entry point. Any tips or hints? | 13:21:26 |
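One generic option that does not depend on the entry-point binary is to run the deep recursion on a thread created with an explicit stack size. The sketch below shows only that technique using the Rust standard library; it is not how nix itself handles the problem. `depth` is a hypothetical stand-in for a deeply recursive evaluator, and `run_with_stack` is a helper name invented here. As noted earlier in this conversation, the GC expects to be called from the main thread or a registered thread, so applying this to nix-expr would presumably also require registering the new thread with the GC; that step is not shown.

```rust
use std::thread;

/// Hypothetical stand-in for a deeply recursive evaluator:
/// recurses `n` levels and returns the depth reached.
fn depth(n: u64) -> u64 {
    if n == 0 {
        0
    } else {
        1 + depth(n - 1)
    }
}

/// Run `f` on a freshly spawned thread whose stack is `stack_bytes`
/// large, and return its result to the caller.
fn run_with_stack<T, F>(stack_bytes: usize, f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(f)
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}

fn main() {
    // 64 MiB is an arbitrary illustrative size, not a recommendation.
    let result = run_with_stack(64 * 1024 * 1024, || depth(100_000));
    println!("{result}");
}
```

The stack size is chosen by the library at the point where the work is started, so callers would not need to change ulimit or linker flags; the trade-off is that the work no longer runs on the caller's thread.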