| 21 Mar 2025 |
hexa | so what is the best way to trawl through these many coredumps and group them? | 13:58:50 |
Vladimír Čunát | The backtraces look all the same to me at a quick glance.
for pid in $(cat /tmp/core-pids); do echo bt | coredumpctl gdb "$pid" | sed '1,/Program terminated .*/d'; done
| 14:16:20 |
Vladimír Čunát | Random one:
warning: core file may not match specified executable file.
#0 0x0000ffffa27ca5b4 in __pthread_kill_implementation () from /nix/store/g4agqy9fxhnzmq31idkhp2kqk4sgp3i0-glibc-2.40-66/lib/libc.so.6
[Current thread is 1 (Thread 0xffffa059e440 (LWP 2803641))]
(gdb) #0 0x0000ffffa27ca5b4 in __pthread_kill_implementation () from /nix/store/g4agqy9fxhnzmq31idkhp2kqk4sgp3i0-glibc-2.40-66/lib/libc.so.6
#1 0x0000ffffa277b15c in raise () from /nix/store/g4agqy9fxhnzmq31idkhp2kqk4sgp3i0-glibc-2.40-66/lib/libc.so.6
#2 0x0000ffffa2765a00 in abort () from /nix/store/g4agqy9fxhnzmq31idkhp2kqk4sgp3i0-glibc-2.40-66/lib/libc.so.6
#3 0x0000ffffa27bca78 in __libc_message_impl () from /nix/store/g4agqy9fxhnzmq31idkhp2kqk4sgp3i0-glibc-2.40-66/lib/libc.so.6
#4 0x0000ffffa27bcae8 in __libc_fatal () from /nix/store/g4agqy9fxhnzmq31idkhp2kqk4sgp3i0-glibc-2.40-66/lib/libc.so.6
#5 0x0000ffffa27d1f64 in unwind_cleanup () from /nix/store/g4agqy9fxhnzmq31idkhp2kqk4sgp3i0-glibc-2.40-66/lib/libc.so.6
#6 0x0000ffffa2dd8970 in nix::unix::triggerInterrupt() () from /nix/store/qdcg0g77rbj8bipd8xy3k6kw32yh17vr-nix-2.24.12/lib/libnixutil.so
#7 0x0000ffffa2f62490 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<nix::MonitorFdHup::MonitorFdHup(int)::{lambda()#1}> > >::_M_run() ()
from /nix/store/qdcg0g77rbj8bipd8xy3k6kw32yh17vr-nix-2.24.12/lib/libnixstore.so
#8 0x0000ffffa2add33c in execute_native_thread_routine () from /nix/store/9df8irigdgxl3cnyfwir2xw4fs2q9my7-gcc-13.3.0-lib/lib/libstdc++.so.6
#9 0x0000ffffa27c87cc in start_thread () from /nix/store/g4agqy9fxhnzmq31idkhp2kqk4sgp3i0-glibc-2.40-66/lib/libc.so.6
#10 0x0000ffffa2834e4c in thread_start () from /nix/store/g4agqy9fxhnzmq31idkhp2kqk4sgp3i0-glibc-2.40-66/lib/libc.so.6
| 14:16:47 |
Vladimír Čunát | That was aarch64. But the last x86_64 stack-trace looks exactly the same. | 14:22:36 |
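A minimal sketch of one way to do the grouping asked about above, assuming the `bt` output for each core has already been saved as text. It reduces every backtrace to its sequence of function names (dropping addresses, arguments, and library paths) and buckets the cores by that sequence. The names `signature` and `group_by_signature`, the sample PIDs, and the second and third sample traces are hypothetical and only illustrate that differing addresses do not split a group.

```python
import re
from collections import defaultdict

# A gdb frame line looks like one of:
#   #1 0x0000ffffa277b15c in raise () from <path>/libc.so.6
#   #2 0x00007f73f850d576 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
# Capture the frame number and the function name; the address part is optional.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(.+?) \(")


def signature(backtrace: str) -> tuple:
    """Return the function names of a gdb backtrace, innermost frame first."""
    names = []
    for line in backtrace.splitlines():
        line = line.strip()
        # The first frame printed by `bt` may be glued to the gdb prompt.
        if line.startswith("(gdb) "):
            line = line[len("(gdb) "):]
        match = FRAME_RE.match(line)
        if not match:
            continue  # warnings, thread info, wrapped "from ..." lines
        if match.group(1) == "0":
            # gdb prints frame #0 once on startup and again for `bt`;
            # restart so the frame is not counted twice.
            names = []
        names.append(match.group(2))
    return tuple(names)


def group_by_signature(backtraces: dict) -> dict:
    """Map each distinct signature to the list of core IDs that share it."""
    groups = defaultdict(list)
    for core_id, text in backtraces.items():
        groups[signature(text)].append(core_id)
    return dict(groups)


if __name__ == "__main__":
    # Frames #0-#2 of the aarch64 trace above, with library paths shortened.
    first = (
        "#0 0x0000ffffa27ca5b4 in __pthread_kill_implementation () from libc.so.6\n"
        "(gdb) #0 0x0000ffffa27ca5b4 in __pthread_kill_implementation () from libc.so.6\n"
        "#1 0x0000ffffa277b15c in raise () from libc.so.6\n"
        "#2 0x0000ffffa2765a00 in abort () from libc.so.6\n"
    )
    # Hypothetical second core: same functions, different addresses.
    second = (
        "#0 0x0000ffff90000001 in __pthread_kill_implementation () from libc.so.6\n"
        "#1 0x0000ffff90000002 in raise () from libc.so.6\n"
        "#2 0x0000ffff90000003 in abort () from libc.so.6\n"
    )
    # Hypothetical unrelated crash.
    third = "#0 0x0000ffff90000004 in main () from some-binary\n"

    groups = group_by_signature({"pid-a": first, "pid-b": second, "pid-c": third})
    for sig, cores in groups.items():
        print(len(cores), "core(s):", ", ".join(cores), "->", " < ".join(sig))
```

Traces taken with and without debug info print function names slightly differently (for example `raise` versus `__GI_raise`), so this only groups reliably when all cores are inspected in the same setup.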
Vladimír Čunát | So I copied one of the x86_64 cores to a local machine with services.nixseparatedebuginfod.enable = true; and got
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
[Current thread is 1 (Thread 0x7f73f68d96c0 (LWP 1994564))]
(gdb)
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007f73f855f8f3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2 0x00007f73f850d576 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007f73f84f5935 in __GI_abort () at abort.c:79
#4 0x00007f73f84f67f3 in __libc_message_impl (fmt=fmt@entry=0x7f73f8679ffc "%s") at ../sysdeps/posix/libc_fatal.c:134
#5 0x00007f73f85528e9 in __GI___libc_fatal (message=<optimized out>) at ../sysdeps/posix/libc_fatal.c:143
#6 0x00007f73f8566474 in unwind_cleanup (reason=<optimized out>, exc=<optimized out>) at unwind.c:114
#7 0x00007f73f8b2e78c in nix::unix::triggerInterrupt () at src/libutil/unix/signals.cc:94
#8 0x00007f73f8d5b8d5 in nix::MonitorFdHup::MonitorFdHup(int)::{lambda()#1}::operator()() const (__closure=0x561807cd7598) at src/libutil/unix/monitor-fd.hh:55
#9 std::__invoke_impl<void, nix::MonitorFdHup::MonitorFdHup(int)::{lambda()#1}>(std::__invoke_other, nix::MonitorFdHup::MonitorFdHup(int)::{lambda()#1}&&) (__f=...)
at /nix/store/yg4ahy7gahx91nq80achmzilrjyv0scj-gcc-13.3.0/include/c++/13.3.0/bits/invoke.h:61
#10 std::__invoke<nix::MonitorFdHup::MonitorFdHup(int)::{lambda()#1}>(nix::MonitorFdHup::MonitorFdHup(int)::{lambda()#1}&&) (__fn=...)
at /nix/store/yg4ahy7gahx91nq80achmzilrjyv0scj-gcc-13.3.0/include/c++/13.3.0/bits/invoke.h:96
#11 std::thread::_Invoker<std::tuple<nix::MonitorFdHup::MonitorFdHup(int)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x561807cd7598)
at /nix/store/yg4ahy7gahx91nq80achmzilrjyv0scj-gcc-13.3.0/include/c++/13.3.0/bits/std_thread.h:292
#12 std::thread::_Invoker<std::tuple<nix::MonitorFdHup::MonitorFdHup(int)::{lambda()#1}> >::operator()() (this=0x561807cd7598)
at /nix/store/yg4ahy7gahx91nq80achmzilrjyv0scj-gcc-13.3.0/include/c++/13.3.0/bits/std_thread.h:299
#13 std::thread::_State_impl<std::thread::_Invoker<std::tuple<nix::MonitorFdHup::MonitorFdHup(int)::{lambda()#1}> > >::_M_run() (this=0x561807cd7590)
at /nix/store/yg4ahy7gahx91nq80achmzilrjyv0scj-gcc-13.3.0/include/c++/13.3.0/bits/std_thread.h:244
#14 0x00007f73f88bb6d3 in execute_native_thread_routine () from /nix/store/mhd0rk497xm0xnip7262xdw9bylvzh99-gcc-13.3.0-lib/lib/libstdc++.so.6
#15 0x00007f73f855daf3 in start_thread (arg=<optimized out>) at pthread_create.c:447
#16 0x00007f73f85dcf4c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
| 14:47:20 |
Vladimír Čunát | So we continue by creating a nix ticket with nix devs taking the lead on this? | 14:48:05 |
hexa | and tag it with infra, as proposed | 14:48:27 |
hexa | do you want to file the issue? | 14:48:43 |
Vladimír Čunát | I can. (But I don't care who does it.) | 14:54:18 |
hexa | Please do 🙂 | 14:54:37 |
hexa | similar to https://github.com/NixOS/nix/issues/8946? | 15:00:31 |
Vladimír Čunát | Yes, I've been looking at that for some time now. | 15:08:42 |
Vladimír Čunát | The stack traces are almost the same, so let's not open a new one. | 15:08:56 |
Mic92 | vcunat: let me know when you create one. | 15:10:04 |
Mic92 | I will bring this up in the next meeting. | 15:10:18 |
Vladimír Čunát | I put the extra info into the existing issue and labeled it. Feel free to ask for more (I do know C debugging at least). | 15:12:48 |
Mic92 | This looks like exceptions. nix-daemon doesn't list anything, right? | 15:13:34 |
Mic92 | Ah. I think it's the "unreachable()" instance in this function | 15:14:22 |
Vladimír Čunát | You mean journalctl logs? | 15:14:23 |
hexa | hard to say, given that every build causes an "accepted connection" line 😄 | 15:14:33 |
hexa | Mar 21 15:13:47 elated-minsky nix-daemon[1647693]: FATAL: exception not rethrown
| 15:14:46 |
hexa | there are these intermittently | 15:14:50 |
Vladimír Čunát | Yes, timestamp fits exactly. | 15:17:24 |
Vladimír Čunát | (in another instance: coredumpctl crash stamp matches the log stamp with this line) | 15:18:20 |
Mic92 | I am inclined to backport https://github.com/NixOS/nix/pull/12636/files and cherry-pick it into the nix version that we use in NixOS infra to get more insight | 15:18:44 |
Mic92 | Probably not needed for this bug but in general | 15:22:48 |
Vladimír Čunát | Switching versions shouldn't be a problem either, if that makes something easier, as long as the new version isn't too unstable. | 15:22:51 |
Vladimír Čunát | We generally simply use defaults on the stable NixOS. | 15:23:45 |
hexa | ok, so bump to 2.26 and try to repro? | 15:27:29 |