14 Dec 2023 |
SomeoneSerge (UTC+U[-12,12]) | (also FindMPI begins by trying to locate mpiexec; this all looks so odd) | 18:22:16 |
SomeoneSerge (UTC+U[-12,12]) | 🤔 Neither mpich nor openmpi in Nixpkgs ships these .pc files: https://gitlab.kitware.com/cmake/cmake/-/blob/2744f14db1e87f4b7f6eb6f30f7c84ea52ce4a7a/Modules/FindMPI.cmake#L1584-1592 | 19:49:50 |
SomeoneSerge (UTC+U[-12,12]) | Should we do something like https://github.com/NixOS/nixpkgs/blob/5c1d2b0d06241279e6a70e276397a5e40e867499/pkgs/build-support/alternatives/blas/default.nix? Or should we ask upstream why they don't generate those?
For context, openmpi more or less maps onto cmake's expectations, whereas mpich just merges everything into a single file:
❯ ls /nix/store/nf8fqx39w8ib34hp24lrz48287xdbxd8-openmpi-4.1.6/lib/pkgconfig/
ompi-c.pc ompi-cxx.pc ompi-f77.pc ompi-f90.pc ompi-fort.pc ompi.pc orte.pc
❯ ls /nix/store/24sv27w3j1j3p7lxyh689bzbhmixxf35-mpich-4.1.2/lib/pkgconfig
mpich.pc
❯ cat /nix/store/24sv27w3j1j3p7lxyh689bzbhmixxf35-mpich-4.1.2/lib/pkgconfig/mpich.pc
...
Cflags: -I${includedir}
...
# pkg-config does not understand Cxxflags, etc. So we allow users to
# query them using the --variable option
cxxflags= -I${includedir}
fflags=-fallow-argument-mismatch -I${includedir}
fcflags=-fallow-argument-mismatch -I${includedir}
| 21:01:10 |
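(A side note on the merged mpich.pc above; a sketch using plain pkg-config, matching the comment embedded in the file: the per-language flags are only reachable as variables, which a generic probe like FindMPI's presumably never queries.)
❯ pkg-config --cflags mpich            # C flags only
❯ pkg-config --variable=cxxflags mpich # C++ flags, opt-in
❯ pkg-config --variable=fcflags mpich  # Fortran flags, opt-in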
3 Jan 2024 |
SomeoneSerge (UTC+U[-12,12]) | Ah damn, running nixpkgs-built singularity images on NixOS with nixpkgs' apptainer is broken (again): https://github.com/apptainer/apptainer/blob/3c5a579e51f57b66a92266a0f45504d55bcb6553/internal/pkg/util/gpu/nvidia.go#L96C2-L103 | 21:53:59 |
SomeoneSerge (UTC+U[-12,12]) | singularityce adopted nvidia-container-cli too -> --nv broken there as well | 22:14:31 |
4 Jan 2024 |
SomeoneSerge (UTC+U[-12,12]) | ok so the libnvidia-container patch is still functional (it does ignore ldconfig and scan /run/opengl-driver/lib) | 04:40:05 |
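(For reference, /run/opengl-driver/lib is the fixed path where NixOS exposes the vendor driver's userspace libraries; a quick sketch to confirm what a scan of it would find:)
❯ ls /run/opengl-driver/lib/ | rg 'libcuda|libnvidia'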
SomeoneSerge (UTC+U[-12,12]) | But apptainer doesn't seem to even run it:
❯ NVIDIA_VISIBLE_DEVICES=all strace ./result/bin/singularity exec --nv --nvccli writable.img python -c "" |& rg nvidia
futex(0x55fed1c4c888, FUTEX_WAIT_PRIVATE, 0, NULLINFO: Setting --writable-tmpfs (required by nvidia-container-cli)
| 04:40:29 |
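(One caveat about the trace above, as a sketch rather than part of the original log: strace without -f only follows the initial process, and apptainer forks a starter before exec'ing anything, so following forks and filtering on execve is a more reliable way to see whether nvidia-container-cli is ever spawned:)
❯ NVIDIA_VISIBLE_DEVICES=all strace -f -e trace=execve \
      ./result/bin/singularity exec --nv --nvccli writable.img python -c "" \
      |& rg nvidia-container-cli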
SomeoneSerge (UTC+U[-12,12]) | Although it claims it does:
❯ APPTAINER_MESSAGELEVEL=100000 NVIDIA_VISIBLE_DEVICES=all ./result/bin/singularity exec --nv --nvccli writable.img python -c ""
...
DEBUG [U=1001,P=888790] create() nvidia-container-cli
DEBUG [U=1001,P=888822] findOnPath() Found "nvidia-container-cli" at "/run/current-system/sw/bin/nvidia-container-cli"
DEBUG [U=1001,P=888822] findOnPath() Found "ldconfig" at "/run/current-system/sw/bin/ldconfig"
DEBUG [U=1001,P=888822] NVCLIConfigure() nvidia-container-cli binary: "/run/current-system/sw/bin/nvidia-container-cli" args: ["--user" "configure" "--no-cgroups" "--device=all" "--compute" "--utility" "--ldconfig=@/run/current-system/sw/bin/ldconfig" "/nix/store/rzycmg66zpap6gjb5ylmvd8ymlfb7fag-apptainer-1.2.5/var/lib/apptainer/mnt/session/final"]
DEBUG [U=1001,P=888822] NVCLIConfigure() Running nvidia-container-cli in user namespace
DEBUG [U=1001,P=888790] create() Chroot into /nix/store/rzycmg66zpap6gjb5ylmvd8ymlfb7fag-apptainer-1.2.5/var/lib/apptainer/mnt/session/final
...
| 04:41:45 |
SomeoneSerge (UTC+U[-12,12]) | (result refers to an apptainer build) | 04:42:19 |
SomeoneSerge (UTC+U[-12,12]) | Btw
❯ strace /run/current-system/sw/bin/nvidia-container-cli "--user" "configure" "--no-cgroups" "--device=all" "--compute" "--utility" "--ldconfig=@/run/current-system/sw/bin/ldconfig" "/nix/store/rzycmg66zpap6gjb5ylmvd8ymlfb7fag-apptainer-1.2.5/var/lib/apptainer/mnt/session/final"
...
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, NULL) = 0
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=1<<CAP_WAKE_ALARM}) = 0
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, NULL) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_CHOWN|1<<CAP_DAC_OVERRIDE|1<<CAP_DAC_READ_SEARCH|1<<CAP_FOWNER|1<<CAP_KILL|1<<CAP_SETGID|1<<CAP_SETUID|1<<CAP_SETPCAP|1<<CAP_NET_ADMIN|1<<CAP_SYS_CHROOT|1<<CAP_SYS_PTRACE|1<<CAP_SYS_ADMIN|1<<CAP_MKNOD, inheritable=1<<CAP_WAKE_ALARM}) = -1 EPERM (Operation not permitted)
write(2, "nvidia-container-cli: ", 22nvidia-container-cli: ) = 22
write(2, "permission error: capability cha"..., 67permission error: capability change failed: operation not permitted) = 67
write(2, "\n", 1
) = 1
exit_group(1) = ?
+++ exited with 1 +++
nvidia-container-cli: permission error: capability change failed: operation not permitted
Was this even supposed to work without root? | 04:50:21 |
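(The EPERM from capset() is consistent with the process expecting more privileges than a plain user shell grants; whether upstream relies on setuid, file capabilities, or being launched by root is an assumption here. Two quick checks, a sketch using the standard libcap tools:)
❯ getcap "$(readlink -f /run/current-system/sw/bin/nvidia-container-cli)"   # empty output = no file caps
❯ capsh --print | rg '^Current'                                             # caps of the current shell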
SomeoneSerge (UTC+U[-12,12]) | Also, is it ok that $out/var/lib/apptainer/mnt/session/final is read-only? | 04:50:54 |
5 Jan 2024 |
SomeoneSerge (UTC+U[-12,12]) | In reply to @ss:someonex.net ("Although it claims it does: ..." above)
Propagating --debug to nvidia-container-cli: https://gist.github.com/SomeoneSerge/a4317ccec07e33324c588eb6f7c6f04a#file-gistfile0-txt-L310
Which is the same as if you manually run the strace above: https://matrix.to/#/%23hpc%3Anixos.org/%24aCLdJvRqyXSNc0_LfuTb7tFxL3hBhMfXUzc13whct0U?via=someonex.net&via=matrix.org&via=kde.org&via=dodsorf.as
Like, did it even require CAP_SYS_ADMIN before? | 21:59:19 |
7 Jan 2024 |
SomeoneSerge (UTC+U[-12,12]) | Filed the issues upstream finally (apptainer and libnvidia-container). Thought I'd never get around to doing that, I feel exhausted smh | 01:04:08 |
8 Jan 2024 |
| SomeoneSerge (UTC+U[-12,12]) changed their display name from SomeoneSerge (UTC+2) to SomeoneSerge (hash-versioned python modules when). | 04:50:14 |
9 Jan 2024 |
| David Guibert joined the room. | 14:58:17 |
10 Jan 2024 |
ShamrockLee (Yueh-Shun Li) | I'll be terribly busy for the following weeks, and probably won't have time until the end of January. | 16:21:58 |
11 Jan 2024 |
SomeoneSerge (UTC+U[-12,12]) | Merged the apptainer --nv patch. Still no idea what on earth could've broken docker run --gpus all. Going to look into the MPI situation again; as far as I can tell it's totally broken, but maybe I just don't get it | 01:04:11 |
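(For the docker side, the usual smoke test; a sketch, and the CUDA image tag is an arbitrary pick:)
❯ docker run --rm --gpus all nvidia/cuda:12.2.2-base-ubuntu22.04 nvidia-smi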
17 Jan 2024 |
SomeoneSerge (UTC+U[-12,12]) | A very typical line from Nixpkgs' SLURM build logs: -g -O2 ... -ggdb3 -Wall -g -O1 | 17:25:26 |
SomeoneSerge (UTC+U[-12,12]) | What's there not to love about autotools? | 17:25:41 |
@connorbaker:matrix.org | Thanks, I hate it | 21:03:22 |
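(Why that line is suspect, as a sketch of standard gcc behavior rather than anything SLURM-specific: when -O flags repeat, the last one wins, so the earlier -O2 is silently downgraded to -O1. One way to see the net effect:)
❯ diff <(gcc -Q --help=optimizers -O1) \
       <(gcc -Q --help=optimizers -g -O2 -ggdb3 -Wall -g -O1)
# empty (or nearly empty) diff: the net optimization level of the full flag soup is -O1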
18 Jan 2024 |
SomeoneSerge (UTC+U[-12,12]) | ❯ ag eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee result-lib/ --search-binary
result-lib/lib/security/pam_slurm_adopt.la
41:libdir='/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-slurm-23.11.1.1/lib/security'
result-lib/lib/perl5/5.38.2/x86_64-linux-thread-multi/perllocal.pod
7:C<installed into: /nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-slurm-23.11.1.1/lib/perl5/site_perl/5.38.2>
29:C<installed into: /nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-slurm-23.11.1.1/lib/perl5/site_perl/5.38.2>
Binary file result-lib/lib/libslurm.so.40.0.0 matches.
result-lib/lib/security/pam_slurm.la
41:libdir='/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-slurm-23.11.1.1/lib/security'
Binary file result-lib/lib/slurm/libslurmfull.so matches.
Binary file result-lib/lib/slurm/mpi_pmi2.so matches.
Binary file result-lib/lib/slurm/libslurm_pmi.so matches.
❯ strings result-lib/lib/slurm/mpi_pmi2.so | rg eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-slurm-23.11.1.1/bin/srun
arghhghhhghghghghghggh why | 15:32:31 |
SomeoneSerge (UTC+U[-12,12]) | Context: error: cycle detected in build of '/nix/store/391cjl6zqqsaz33disfcn3nzv87bygc1-slurm-23.11.1.1.drv' in the references of output 'bin' from output 'lib' | 15:34:41 |
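(Incidentally, the eeeeeeee… hash above is the placeholder that nixpkgs' reference-nuking tools, nuke-refs and remove-references-to, write in place of a real hash. A sketch of the idiomatic escape hatch for such an output cycle, as a hypothetical postFixup snippet in the slurm derivation, bash as it would run in the builder; note it would also strip the srun path that mpi_pmi2.so apparently wants at runtime, so it may trade a build failure for a runtime one:)
# hypothetical postFixup, stripping $bin references from the lib output:
find "$lib/lib" -type f -name '*.so*' -exec remove-references-to -t "$bin" '{}' +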