Sender | Message | Time |
---|---|---|
8 Apr 2025 | ||
And I think yours is much more careful when handling thread locking and GC. I don't really understand the constraints there so maybe there's something wrong I did that caused this. | 12:26:45 | |
If everything is on the main thread you're fine, and nix_value registers/deregisters itself with the GC just fine (if all is well), but the GC may not be happy if it needs to operate from a thread it doesn't know about. That includes allocation, so for all intents and purposes that's the whole of nix-expr that should be called from registered threads (or the main thread) only | 12:29:59 | |
You'd get an error message along the lines of "trying to GC from unknown thread". I don't think it can cause corruption necessarily | 12:30:37 | |
but maybe my assumption about GC roots is wrong, and my code does rely on stack scanning regardless | 12:31:07 | |
I think the problem may lie there. pyo3 requires that all your structs be freely movable between threads, as Python is not really a single-threaded interpreter; it just relies heavily on the GIL. I'm not doing anything multithreaded from the Python side, quite the contrary: I'm just calling state.eval_file('path').get(attr). But I can't rule out that it gets moved to another thread either | 12:32:48 | |
Thing is, you might get away with coincidentally not triggering GC in your other threads / stacks | 12:32:58 | |
One more weird detail is that the segfault is really consistent: it happens every time I run the program, but not always at the same place. | 12:33:59 | |
Sometimes it happens at nix::ExprSelect::eval, sometimes at nix::ExprAttr::eval; in the original message I think it happened at nix::ExprVar::eval | 12:34:32 | |
So indeed this may be related to something that is shared/passed to all of them, most likely the EvalState itself, as I may be doing something incorrectly with it. | 12:35:21 | |
But I don't understand what it is yet, I'll dig further | 12:35:50 | |
=> if this was a race condition I don't think it would be this reproducible, it would sometimes fail and sometimes not. | 12:36:47 | |
yeah | 12:37:53 | |
Guess what? Setting ulimit -s unlimited made it work. | 16:29:35 | |
It was an uncaught stack overflow. | 16:29:45 | |
Didn't even occur to me until now. | 16:29:57 | |
10 Apr 2025 | ||
@roberth how does nix circumvent this issue in their main binary? I see I can try leveraging ld's -z stack-size=X, but it only seems to work if you set it on the entry-point ELF binary, which I can't do, as that's python and I'm merely offering a .so extension library. I didn't want to bleed this problem elsewhere, like forcing people to set ulimit -s unlimited, but I don't see many other ways around it, and surely nix has had to deal with this; though it is the ELF entry point. Any tips or hints? | 13:21:26 | |
There's the possibility of spawning a new thread with an increased stack size, but that adds context-switching overhead to every nix evaluation, which I'd like to avoid if possible. | 13:22:44 | |
There's also the possibility of using setrlimit to increase the process's own stack size; that seems to me like the most graceful solution | 14:09:58 | |
And indeed, that seems to be the solution that nix's binary uses: src/nix/main.cc calls nix::setStackSize(64MB), which internally calls setrlimit with that value. Awesome to know; most likely I'll try going down this route. | 14:12:10 | |
15 Apr 2025 | ||
I notice I may be the one bringing all the stupid problems and ideas to the chat, but would it be possible to statically link against the nix C libraries? | 15:55:05 | |
I'm trying to optimize the performance of a custom eval cache using the C API I wrote. A cache hit takes around 20ms, which seems very good, but knowing what it does, it's not that impressive: it only hashes some 20 files and queries a sqlite file, so 20ms is actually quite poor. | 15:57:49 | |
I tried profiling with perf record, and most of the time (~15ms) seems to be spent in do_lookup_x, which seems to be a libc function related to finding the dynamic libraries. There are ~66 linked libraries, most of them nix-related, like libaws-c-sdkutils.so.1.0.0, which shouldn't even be used in this case but are loaded anyway before the program starts. | 15:59:04 | |
If I remove most of the C API usage and just compile a simple binary to query the sqlite file, it goes down to 4 linked libraries and returns in ~6ms in --release mode, which seems to hint that most of the time is indeed spent finding the libraries | 16:03:47 | |
I'm using
| 16:06:07 | |
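
The two stack-size workarounds discussed on 10 Apr (raising the process's own limit with setrlimit, or evaluating on a thread with a bigger stack) could be sketched from the Python side roughly as below. This is only a sketch under assumptions: the helper names `raise_stack_limit` and `run_with_big_stack` are made up for illustration, the 64 MiB figure is simply the value the chat reports nix using, and whether raising the soft limit at runtime actually lets the main thread's stack grow depends on the platform (it is expected to on Linux, but that is not verified here).

```python
import resource
import threading

# 64 MiB: the value the chat reports nix passing to setStackSize (assumption).
STACK_TARGET = 64 * 1024 * 1024


def raise_stack_limit(target=STACK_TARGET):
    """Raise the soft RLIMIT_STACK towards `target`, never past the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if wanted > soft:
        resource.setrlimit(resource.RLIMIT_STACK, (wanted, hard))
        soft = wanted
    return soft


def run_with_big_stack(fn, *args, stack_size=STACK_TARGET):
    """Run fn(*args) on a fresh thread whose stack is `stack_size` bytes."""
    result = {}

    def worker():
        try:
            result["value"] = fn(*args)
        except BaseException as exc:  # hand the error back to the caller
            result["error"] = exc

    # threading.stack_size applies to threads created after the call,
    # so set it, start the thread, then restore the previous setting.
    previous = threading.stack_size(stack_size)
    try:
        thread = threading.Thread(target=worker)
        thread.start()
    finally:
        threading.stack_size(previous)
    thread.join()
    if "error" in result:
        raise result["error"]
    return result["value"]
```

Calling `raise_stack_limit()` once, before the first evaluation, avoids the per-evaluation thread hop that the second helper incurs; `run_with_big_stack(evaluate, path)` (with `evaluate` standing in for whatever performs the nix evaluation) is the fallback when the limit cannot be raised.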