#hpc:nixos.org

14 Dec 2023
@ss:someonex.net SomeoneSerge (utc+3) *

Should we do something like https://github.com/NixOS/nixpkgs/blob/5c1d2b0d06241279e6a70e276397a5e40e867499/pkgs/build-support/alternatives/blas/default.nix? Or should we ask upstream why they don't generate those?

For context, openmpi more or less maps onto cmake's expectations, whereas mpich just merges everything into a single file:

❯ ls /nix/store/nf8fqx39w8ib34hp24lrz48287xdbxd8-openmpi-4.1.6/lib/pkgconfig/
ompi-c.pc  ompi-cxx.pc  ompi-f77.pc  ompi-f90.pc  ompi-fort.pc  ompi.pc  orte.pc
❯ ls /nix/store/24sv27w3j1j3p7lxyh689bzbhmixxf35-mpich-4.1.2/lib/pkgconfig
mpich.pc
❯ cat /nix/store/24sv27w3j1j3p7lxyh689bzbhmixxf35-mpich-4.1.2/lib/pkgconfig/mpich.pc
...
Cflags:   -I${includedir}
...
# pkg-config does not understand Cxxflags, etc. So we allow users to
# query them using the --variable option

cxxflags=  -I${includedir}
fflags=-fallow-argument-mismatch -I${includedir}
fcflags=-fallow-argument-mismatch -I${includedir}
21:01:10
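(For reference, the practical upshot of that mpich.pc comment, as a minimal sketch: plain pkg-config queries only return the Cflags/Libs fields, so consumers have to request mpich's per-language flags as variables:)

❯ pkg-config --cflags mpich             # only returns the Cflags field
❯ pkg-config --variable=cxxflags mpich  # per-language flags need --variable
❯ pkg-config --variable=fcflags mpich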
3 Jan 2024
@ss:someonex.net SomeoneSerge (utc+3) Ah damn, running nixpkgs-built singularity images on nixos with nixpkgs' apptainer is broken (again): https://github.com/apptainer/apptainer/blob/3c5a579e51f57b66a92266a0f45504d55bcb6553/internal/pkg/util/gpu/nvidia.go#L96C2-L103 21:53:59
@ss:someonex.net SomeoneSerge (utc+3) singularityce adopted nvidia-container-cli too -> they broke --nv 22:14:03
@ss:someonex.net SomeoneSerge (utc+3) * singularityce adopted nvidia-container-cli too -> --nv broken there as well 22:14:31
4 Jan 2024
@ss:someonex.net SomeoneSerge (utc+3) ok so the libnvidia-container patch is still functional (it does ignore ldconfig and scan /run/opengl-driver/lib) 04:40:05
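(On NixOS the userspace GPU driver libraries are exposed under /run/opengl-driver/lib; a quick way to confirm the patched scan has something to find, as a sketch:)

❯ ls /run/opengl-driver/lib | rg 'libcuda|libnvidia'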
@ss:someonex.net SomeoneSerge (utc+3)

But apptainer doesn't seem to even run it:

❯ NVIDIA_VISIBLE_DEVICES=all strace ./result/bin/singularity exec --nv --nvccli writable.img python -c "" |& rg nvidia
futex(0x55fed1c4c888, FUTEX_WAIT_PRIVATE, 0, NULLINFO:    Setting --writable-tmpfs (required by nvidia-container-cli)
04:40:29
@ss:someonex.net SomeoneSerge (utc+3)

Although it claims it does:

❯ APPTAINER_MESSAGELEVEL=100000 NVIDIA_VISIBLE_DEVICES=all ./result/bin/singularity exec --nv --nvccli writable.img python -c ""
...
DEBUG   [U=1001,P=888790]  create()                      nvidia-container-cli
DEBUG   [U=1001,P=888822]  findOnPath()                  Found "nvidia-container-cli" at "/run/current-system/sw/bin/nvidia-container-cli"
DEBUG   [U=1001,P=888822]  findOnPath()                  Found "ldconfig" at "/run/current-system/sw/bin/ldconfig"
DEBUG   [U=1001,P=888822]  NVCLIConfigure()              nvidia-container-cli binary: "/run/current-system/sw/bin/nvidia-container-cli" args: ["--user" "configure" "--no-cgroups" "--device=all" "--compute" "--utility" "--ldconfig=@/run/current-system/sw/bin/ldconfig" "/nix/store/rzycmg66zpap6gjb5ylmvd8ymlfb7fag-apptainer-1.2.5/var/lib/apptainer/mnt/session/final"]
DEBUG   [U=1001,P=888822]  NVCLIConfigure()              Running nvidia-container-cli in user namespace
DEBUG   [U=1001,P=888790]  create()                      Chroot into /nix/store/rzycmg66zpap6gjb5ylmvd8ymlfb7fag-apptainer-1.2.5/var/lib/apptainer/mnt/session/final
...
04:41:45
@ss:someonex.net SomeoneSerge (utc+3) (result refers to an apptainer build) 04:42:19
@ss:someonex.net SomeoneSerge (utc+3)

Btw

❯ strace /run/current-system/sw/bin/nvidia-container-cli "--user" "configure" "--no-cgroups" "--device=all" "--compute" "--utility" "--ldconfig=@/run/current-system/sw/bin/ldconfig" "/nix/store/rzycmg66zpap6gjb5ylmvd8ymlfb7fag-apptainer-1.2.5/var/lib/apptainer/mnt/session/final"
...
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_CHOWN|1<<CAP_DAC_OVERRIDE|1<<CAP_DAC_READ_SEARCH|1<<CAP_FOWNER|1<<CAP_KILL|1<<CAP_SETGID|1<<CAP_SETUID|1<<CAP_SETPCAP|1<<CAP_NET_ADMIN|1<<CAP_SYS_CHROOT|1<<CAP_SYS_PTRACE|1<<CAP_SYS_ADMIN|1<<CAP_MKNOD, inheritable=1<<CAP_WAKE_ALARM}) = -1 EPERM (Operation not permitted)
...
nvidia-container-cli: permission error: capability change failed: operation not permitted

Was this even supposed to work without root?

04:50:21
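(Some context on the EPERM above: capset(2) can never add capabilities to a process's permitted set, only drop them, so the call is bound to fail unless the binary was granted those capabilities at exec time, via file capabilities or setuid root. A way to check what the installed binary actually carries, as a diagnostic sketch:)

❯ getcap $(readlink -f /run/current-system/sw/bin/nvidia-container-cli)
❯ ls -l $(readlink -f /run/current-system/sw/bin/nvidia-container-cli)  # setuid root?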
@ss:someonex.net SomeoneSerge (utc+3) Also, is it ok that $out/var/lib/apptainer/mnt/session/final is read-only? 04:50:54
5 Jan 2024
@ss:someonex.net SomeoneSerge (utc+3)
In reply to @ss:someonex.net ("Although it claims it does: ...", quoted in full above)

Propagating --debug to nvidia-container-cli: https://gist.github.com/SomeoneSerge/a4317ccec07e33324c588eb6f7c6f04a#file-gistfile0-txt-L310
20:47:01
@ss:someonex.net SomeoneSerge (utc+3) *

Btw

❯ strace /run/current-system/sw/bin/nvidia-container-cli "--user" "configure" "--no-cgroups" "--device=all" "--compute" "--utility" "--ldconfig=@/run/current-system/sw/bin/ldconfig" "/nix/store/rzycmg66zpap6gjb5ylmvd8ymlfb7fag-apptainer-1.2.5/var/lib/apptainer/mnt/session/final"
...
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, NULL) = 0
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=1<<CAP_WAKE_ALARM}) = 0
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, NULL) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_CHOWN|1<<CAP_DAC_OVERRIDE|1<<CAP_DAC_READ_SEARCH|1<<CAP_FOWNER|1<<CAP_KILL|1<<CAP_SETGID|1<<CAP_SETUID|1<<CAP_SETPCAP|1<<CAP_NET_ADMIN|1<<CAP_SYS_CHROOT|1<<CAP_SYS_PTRACE|1<<CAP_SYS_ADMIN|1<<CAP_MKNOD, inheritable=1<<CAP_WAKE_ALARM}) = -1 EPERM (Operation not permitted)
write(2, "nvidia-container-cli: ", 22nvidia-container-cli: )  = 22
write(2, "permission error: capability cha"..., 67permission error: capability change failed: operation not permitted) = 67
write(2, "\n", 1
)                       = 1
exit_group(1)                           = ?
+++ exited with 1 +++
...
nvidia-container-cli: permission error: capability change failed: operation not permitted

Was this even supposed to work without root?

21:04:36
@ss:someonex.net SomeoneSerge (utc+3) *

Propagating --debug to nvidia-container-cli: https://gist.github.com/SomeoneSerge/a4317ccec07e33324c588eb6f7c6f04a#file-gistfile0-txt-L310

Which is the same as if you manually run https://matrix.to/#/%23hpc%3Anixos.org/%24aCLdJvRqyXSNc0_LfuTb7tFxL3hBhMfXUzc13whct0U?via=someonex.net&via=matrix.org&via=kde.org&via=dodsorf.as

Like, did it even require CAP_SYS_ADMIN before?

21:59:19
7 Jan 2024
@ss:someonex.net SomeoneSerge (utc+3) Filed the issues upstream finally (apptainer and libnvidia-docker). Thought I'd never get around to doing that, I feel exhausted smh 01:04:08
8 Jan 2024
@ss:someonex.net SomeoneSerge (utc+3) changed their display name from SomeoneSerge (UTC+2) to SomeoneSerge (hash-versioned python modules when). 04:50:14
9 Jan 2024
@dguibert:matrix.org David Guibert joined the room. 14:58:17
10 Jan 2024
@shamrocklee:matrix.org ShamrockLee (Yueh-Shun Li) I'm terribly busy the following weeks, and probably don't have time until the end of January. 16:21:32
@shamrocklee:matrix.org ShamrockLee (Yueh-Shun Li) * I'll be terribly busy the following weeks, and probably won't have time until the end of January. 16:21:58
11 Jan 2024
@ss:someonex.net SomeoneSerge (utc+3) Merged the apptainer --nv patch. Still no idea what on earth could've broken docker run --gpus all. Going to look into the mpi situation again; as far as I'm concerned it's totally broken, but maybe I just don't get it 01:04:11
17 Jan 2024
@ss:someonex.net SomeoneSerge (utc+3) A very typical line from Nixpkgs' SLURM's build logs: -g -O2 ... -ggdb3 -Wall -g -O1 17:25:26
@ss:someonex.net SomeoneSerge (utc+3) What's there not to love about autotools 17:25:41
@connorbaker:matrix.org connor (he/him) (UTC-7) Thanks, I hate it 21:03:22
18 Jan 2024
@ss:someonex.net SomeoneSerge (utc+3)
❯ ag eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee result-lib/ --search-binary
result-lib/lib/security/pam_slurm_adopt.la
41:libdir='/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-slurm-23.11.1.1/lib/security'

result-lib/lib/perl5/5.38.2/x86_64-linux-thread-multi/perllocal.pod
7:C<installed into: /nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-slurm-23.11.1.1/lib/perl5/site_perl/5.38.2>
29:C<installed into: /nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-slurm-23.11.1.1/lib/perl5/site_perl/5.38.2>

Binary file result-lib/lib/libslurm.so.40.0.0 matches.

result-lib/lib/security/pam_slurm.la
41:libdir='/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-slurm-23.11.1.1/lib/security'

Binary file result-lib/lib/slurm/libslurmfull.so matches.

Binary file result-lib/lib/slurm/mpi_pmi2.so matches.

Binary file result-lib/lib/slurm/libslurm_pmi.so matches.
❯ strings result-lib/lib/slurm/mpi_pmi2.so | rg eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-slurm-23.11.1.1/bin/srun

arghhghhhghghghghghggh why

15:32:31
@ss:someonex.net SomeoneSerge (utc+3) Context: error: cycle detected in build of '/nix/store/391cjl6zqqsaz33disfcn3nzv87bygc1-slurm-23.11.1.1.drv' in the references of output 'bin' from output 'lib' 15:34:41
@ss:someonex.net SomeoneSerge (utc+3) Just as mpich and openmpi aren't amenable to splitting their outputs (can't just link the library but must keep the executables in the runtime closure for no good reason), neither is slurm, apparently 15:35:29
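(The usual nixpkgs workaround for a bin<->lib output cycle like this is to strip the self-reference after installation; a hypothetical sketch for the mpi_pmi2.so case above, with removeReferencesTo in nativeBuildInputs, leaving open whether SLURM can still locate srun afterwards:)

# e.g. in postFixup: drop the $bin self-reference (the embedded srun path)
# from the libraries in $lib, so $lib no longer pulls $bin into its closure
find $lib/lib -name '*.so*' -type f -exec remove-references-to -t $bin {} +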
19 Jan 2024
@markuskowa:matrix.org markuskowa SomeoneSerge (hash-versioned python modules when): I have managed to split the dev outputs of the mpi implementations. I will open a PR soon. 09:34:16
@ss:someonex.net SomeoneSerge (utc+3) WOW! What did you do to the config.h? 10:15:59
@ss:someonex.net SomeoneSerge (utc+3) I managed to make slurm build libpmi2.so and to split it out into a separate output last night 10:16:18
