| 3 Sep 2023 |
hexa | both soft and hard | 20:33:48 |
| 4 Sep 2023 |
nbp | Is this an error on the emulated system or on the host? Maybe hydra has too many concurrent jobs. | 09:32:39 |
hexa | machine: must succeed: sleep 2
(finished: must succeed: sleep 2, in 2.02 seconds)
machine # base64: error while loading shared libraries: libpthread.so.0: cannot open shared object file: Error 24
machine # tail: error while loading shared libraries: libcrypto.so.3: cannot open shared object file: Error 24
machine: must succeed: stat -c '%s' /tmp/last
machine # bash: line 1: /run/current-system/sw/bin/stat: Too many open files
machine: output:
Test "Check whether Firefox can play sound" failed with error: "command `stat -c '%s' /tmp/last` failed (exit code 126)"
cleanup
kill machine (pid 6)
machine # qemu-kvm: terminating on signal 15 from pid 4 (/nix/store/pkj7cgmz66assy7l18zc7j992npb41nx-python3-3.10.12/bin/python3.10)
(finished: cleanup, in 0.05 seconds)
kill vlan (pid 5)
| 13:07:44 |
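For context: the "Error 24" in the shared-library failures above is EMFILE, the same "Too many open files" condition that stat then hits, so the whole paste points at a file-descriptor limit. Quick confirmation from Python on Linux:

    import errno, os

    print(errno.EMFILE)                # 24
    print(os.strerror(errno.EMFILE))   # 'Too many open files'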
hexa | could be the host, or the test runner itself | 13:08:24 |
nbp | maybe lsof would help tell them apart. | 13:10:25 |
nbp | change the test case to include the output of the lsof command. | 13:11:06 |
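A minimal sketch of nbp's suggestion inside the test script; machine.succeed is the usual NixOS test-driver call, but the placement and having lsof available in the guest image are assumptions:

    # Hypothetical debugging lines before the failing subtest: compare the
    # guest's view (fd counts and rlimits inside the VM) with the host's.
    print(machine.succeed("lsof -n | wc -l"))         # open fds lsof can see in the guest
    print(machine.succeed("ulimit -Sn; ulimit -Hn"))  # soft/hard RLIMIT_NOFILE in the guest shell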
hexa | a quick sampling with psutil reveals that qemu-kvm holds too many fds | 13:34:50 |
hexa | vm-test-run-firefox-unwrapped> (finished: waiting for the X11 server, in 17.94 seconds)
vm-test-run-firefox-unwrapped> machine: bash=4
vm-test-run-firefox-unwrapped> machine: .nixos-test-dri=13
vm-test-run-firefox-unwrapped> machine: vde_switch=6
vm-test-run-firefox-unwrapped> machine: qemu-kvm=551
| 13:34:55 |
hexa | vm-test-run-firefox-unwrapped> machine: bash=4
vm-test-run-firefox-unwrapped> machine: .nixos-test-dri=13
vm-test-run-firefox-unwrapped> machine: vde_switch=6
vm-test-run-firefox-unwrapped> machine: qemu-kvm=2006
vm-test-run-firefox-unwrapped> subtest: Check whether Firefox can play sound
| 13:35:07 |
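Roughly how a psutil sampling like the one above can be done from the driver side; the helper and where it gets called are made up, only psutil's Process/children/num_fds API is assumed:

    import psutil

    def dump_fd_counts() -> None:
        # Walk the test driver's process tree (vde_switch, qemu-kvm, ...) and
        # print how many file descriptors each process currently holds.
        me = psutil.Process()
        for proc in [me, *me.children(recursive=True)]:
            try:
                print(f"{proc.name()}={proc.num_fds()}")
            except psutil.NoSuchProcess:
                pass  # the process exited between listing and querying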
hexa | to me that makes it hydra's fault for constraining build jobs like that | 13:36:03 |
K900 ⚡️ | But why would it do that on Hydra and not on other systems | 13:37:01 |
hexa | yeah, the open question | 13:37:26 |
hexa | ajs124: maybe something hydra does? | 13:38:01 |
ajs124 | don't think that's a hydra thing. more like some strange config on the hydra build nodes. | 13:38:56 |
hexa | yeah, trying to find that config as we speak | 13:39:19 |
hexa | I think we're using https://github.com/DeterminateSystems/nix-netboot-serve to serve netboot images | 13:40:15 |
hexa | runs on eris apparently | 13:40:46 |
hexa | wondering if our runner configs are private? | 14:03:42 |
hexa | or state on eris even | 14:03:45 |
hexa | the nix-netboot-serve config is too minimal | 14:05:49 |
hexa | https://github.com/NixOS/equinix-metal-builders/blob/main/modules/nix.nix#L34 | 14:22:06 |
hexa | there is a hard fd limit on the nix-daemon | 14:22:18 |
vcunat | A million (per process) sounds quite a lot. | 14:42:54 |
vcunat | Unless some bad leak happens. Maybe it's more likely that it's stuck on a low soft limit or that it doesn't propagate as we'd expect. | 14:43:44 |
nbp | I wish we could have a wireguard-boot, where one image would connect using wireguard to download its latest image. This way we could make it work without having to redo the DHCP of the network. | 15:11:38 |
hexa |
Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults. Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, i.e. do not use select(2).
| 15:12:14 |
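The standard-library resource module makes both halves of that visible: what a process actually inherited, and the "raise the soft limit to the hard limit" step the quoted guidance expects fd-hungry programs to do for themselves (whether qemu-kvm does this here is exactly the open question):

    import resource

    # What did this process inherit from the service / build sandbox?
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")

    # Opt in to the full hard limit, fine as long as nothing uses select(2):
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))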
hexa | I think knowing how many open fds we have on the builders would be an easy first step | 15:21:40 |
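One cheap way to get that number on a builder, assuming nothing beyond /proc: fs/file-nr carries the system-wide count, while per-process counts are just the entries under /proc/<pid>/fd:

    # /proc/sys/fs/file-nr is "<allocated> <unused> <system-wide maximum>".
    with open("/proc/sys/fs/file-nr") as f:
        allocated, unused, maximum = f.read().split()
    print(f"open file handles: {allocated} allocated, fs.file-max {maximum}")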
K900 ⚡️ | In reply to @vcunat:matrix.org: "A million (per process) sounds quite a lot." It's not per process though | 15:22:43 |
K900 ⚡️ | It's per cgroup | 15:22:46 |
K900 ⚡️ | And everything is in the cgroup | 15:22:53 |