| 3 Sep 2023 |
hexa | because this is really weird | 08:56:20 |
hexa | why is it sufficient on my builders, but not on hydra's when both essentially start a naked nixos in qemu | 08:56:42 |
vcunat | (I have no idea how this happens.) | 09:01:02 |
hexa | OK, will dig into it a bit once i can sit down | 09:19:00 |
vcunat | For now unblocked. Fixed t4b and got lucky scheduling it there on the first retry. But it will be nice if you address the fragility anyway, as restarts are needed often. | 11:54:57 |
hexa | for me locally the resource limit for nofile on the test script is 1048576 | 20:32:46 |
hexa | both soft and hard | 20:33:48 |
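[editor's note] The soft and hard `nofile` values quoted above can be read from inside any process, including a test script. A minimal sketch using Python's standard `resource` module; the helper name `nofile_limits` is made up for illustration and is not part of the NixOS test driver.

```python
import resource

def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE pair for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

soft, hard = nofile_limits()
print(f"nofile soft={soft} hard={hard}")
```

Running the same snippet locally and on a Hydra builder would show whether the two environments really start with different limits.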
| 4 Sep 2023 |
nbp | Is this an error on the emulated system or on the host? Maybe hydra has too many concurrent jobs. | 09:32:39 |
hexa | machine: must succeed: sleep 2
(finished: must succeed: sleep 2, in 2.02 seconds)
machine # base64: error while loading shared libraries: libpthread.so.0: cannot open shared object file: Error 24
machine # tail: error while loading shared libraries: libcrypto.so.3: cannot open shared object file: Error 24
machine: must succeed: stat -c '%s' /tmp/last
machine # bash: line 1: /run/current-system/sw/bin/stat: Too many open files
machine: output:
Test "Check whether Firefox can play sound" failed with error: "command `stat -c '%s' /tmp/last` failed (exit code 126)"
cleanup
kill machine (pid 6)
machine # qemu-kvm: terminating on signal 15 from pid 4 (/nix/store/pkj7cgmz66assy7l18zc7j992npb41nx-python3-3.10.12/bin/python3.10)
(finished: cleanup, in 0.05 seconds)
kill vlan (pid 5)
| 13:07:44 |
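[editor's note] The `Error 24` printed by the dynamic loader and the `Too many open files` message from bash are the same condition. A small sketch that decodes the number with Python's standard library, assuming a Linux errno table:

```python
import errno
import os

code = 24  # the number printed by the dynamic loader in the log above
# errno.errorcode maps the numeric value to its symbolic name;
# os.strerror gives the human-readable message for the same value.
print(errno.errorcode[code], "-", os.strerror(code))
```

On Linux, 24 is `EMFILE`, the per-process descriptor limit. It is distinct from `ENFILE` (23), which signals that the system-wide file table is full.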
hexa | could be the host, or the test runner itself | 13:08:24 |
nbp | maybe lsof would help tell them apart. | 13:10:25 |
nbp | change the test case to include the output of lsof command. | 13:11:06 |
hexa | a quick sampling with psutil reveals that qemu_kvm holds too many fds | 13:34:50 |
hexa | vm-test-run-firefox-unwrapped> (finished: waiting for the X11 server, in 17.94 seconds)
vm-test-run-firefox-unwrapped> machine: bash=4
vm-test-run-firefox-unwrapped> machine: .nixos-test-dri=13
vm-test-run-firefox-unwrapped> machine: vde_switch=6
vm-test-run-firefox-unwrapped> machine: qemu-kvm=551
| 13:34:55 |
hexa | vm-test-run-firefox-unwrapped> machine: bash=4
vm-test-run-firefox-unwrapped> machine: .nixos-test-dri=13
vm-test-run-firefox-unwrapped> machine: vde_switch=6
vm-test-run-firefox-unwrapped> machine: qemu-kvm=2006
vm-test-run-firefox-unwrapped> subtest: Check whether Firefox can play sound
| 13:35:07 |
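[editor's note] The sampling script itself was not posted. A rough sketch of one way to produce `name=count` lines like the ones above follows. It is an assumption, not hexa's code: it walks `/proc` directly instead of using psutil so that it has no dependencies, and the function name `fd_counts_by_name` is hypothetical.

```python
import os
from collections import Counter

def fd_counts_by_name(proc_root="/proc"):
    """Sum open file descriptors per process name by walking procfs (Linux only)."""
    counts = Counter()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return counts  # no procfs available on this system
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        pid_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(pid_dir, "comm")) as f:
                name = f.read().strip()
            # Each entry in /proc/<pid>/fd is one open descriptor.
            counts[name] += len(os.listdir(os.path.join(pid_dir, "fd")))
        except OSError:
            # The process exited, or we lack permission to inspect it.
            continue
    return counts

for name, count in sorted(fd_counts_by_name().items()):
    print(f"{name}={count}")
```

The kernel truncates `comm` to 15 characters, which may be why the driver shows up as `.nixos-test-dri` in the samples. Calling this at several points in a test would show whether one process keeps growing, as `qemu-kvm` does between the two samples (551, then 2006).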
hexa | to me that makes it hydra's fault for constraining build jobs like that | 13:36:03 |
K900 | But why would it do that on Hydra and not on other systems | 13:37:01 |
hexa | yeah, the open question | 13:37:26 |
hexa | ajs124: maybe something hydra does? | 13:38:01 |
ajs124 | don't think that's a hydra thing. more like some strange config on the hydra build nodes. | 13:38:56 |
hexa | yeah, trying to find that config as we speak | 13:39:19 |
hexa | I think we're using https://github.com/DeterminateSystems/nix-netboot-serve to serve netboot images | 13:40:15 |
hexa | runs on eris apparently | 13:40:46 |
hexa | wondering if our runner configs are private? | 14:03:42 |
hexa | or state on eris even | 14:03:45 |
hexa | the nix-netboot-serve configuration is too minimal | 14:05:49 |
hexa | https://github.com/NixOS/equinix-metal-builders/blob/main/modules/nix.nix#L34 | 14:22:06 |
hexa | there is a hard fdlimit on the nix-daemon | 14:22:18 |
vcunat | A million (per process) sounds quite a lot. | 14:42:54 |
vcunat | Unless some bad leak happens. Maybe it's more likely that it's stuck on a low soft limit or that it doesn't propagate as we'd expect. | 14:43:44 |
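[editor's note] If the soft limit is the culprit, a process can raise its own soft `nofile` limit up to the hard limit without extra privileges. A minimal sketch under that assumption; `raise_nofile_soft_limit` is a hypothetical helper, and nothing in this log confirms that the test driver or the builders do this.

```python
import resource

def raise_nofile_soft_limit():
    """Try to raise the soft RLIMIT_NOFILE to the hard limit; return the resulting pair."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # some platforms reject the request; keep the old limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)

before = resource.getrlimit(resource.RLIMIT_NOFILE)
after = raise_nofile_soft_limit()
print(f"before={before} after={after}")
```

Resource limits are inherited by child processes, so a wrapper that does this before spawning qemu would pass the raised soft limit on. Comparing `before` on a local builder and on a Hydra builder would help separate a low soft limit from a genuine descriptor leak.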