3 Dec 2023 |
SomeoneSerge (utc+3) | (nccl and mpi tests on the way) | 17:33:53 |
4 Dec 2023 |
SomeoneSerge (utc+3) | ShamrockLee (Yueh-Shun Li) btw the squashfs compression is behaving oddly: I went over some threshold maybe and now a 5.7GiB buildEnv maps into an 11GiB sif | 10:10:34 |
SomeoneSerge (utc+3) | AJAJAJA all you need is ram: I built the same image with memSize = 20 * 1024 instead of 4 * 1024 , and now the final result is like 2.8G... | 11:08:56 |
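For readers following along: the knob mentioned above is presumably the `memSize` argument of nixpkgs' `singularity-tools.buildImage`, which sets the RAM (in MiB) of the VM that assembles the image. A minimal sketch under that assumption; the image name, contents, and `diskSize` value are placeholders, not taken from the discussion:

```nix
# Hypothetical example: name, contents and diskSize are placeholders.
{ pkgs ? import <nixpkgs> { } }:

pkgs.singularity-tools.buildImage {
  name = "example-env";
  contents = [ pkgs.hello ];

  # Disk size of the build VM in MiB (assumed value; size it to the contents).
  diskSize = 16 * 1024;

  # RAM of the build VM in MiB. The message above reports that going from
  # 4 * 1024 to 20 * 1024 shrank the resulting image considerably.
  memSize = 20 * 1024;
}
```

Whether the size difference comes from squashfs compression running short of memory is the poster's guess; the sketch only shows where the value would be set.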
5 Dec 2023 |
jbedo | In reply to @ss:someonex.net: "ShamrockLee (Yueh-Shun Li), jbedo, do you know any downsides to this VM-free approach to assembling singularity images? https://github.com/NixOS/nixpkgs/issues/177908#issuecomment-1495625986"
Not aware of any downsides, it's a pretty nice approach. | 01:02:55 |
11 Dec 2023 |
markuskowa | SomeoneSerge (UTC+2): I'm pretty busy at the moment. I will look at the slurm PR on the weekend. | 08:48:56 |
13 Dec 2023 |
SomeoneSerge (utc+3) | https://gist.github.com/SomeoneSerge/3f894ffb5f97e55a0a5cfc10dfbc66e1#file-slurm-nix-L45-L61
Does it make sense that this consistently blocks after the following message?
vm-test-run-slurm> submit # WARNING: Open MPI accepted a TCP connection from what appears to be a
vm-test-run-slurm> submit # another Open MPI process but cannot find a corresponding process
vm-test-run-slurm> submit # entry for that peer.
vm-test-run-slurm> submit #
vm-test-run-slurm> submit # This attempted connection will be ignored; your MPI job may or may not
vm-test-run-slurm> submit # continue properly.
vm-test-run-slurm> submit #
vm-test-run-slurm> submit # Local host: node1
vm-test-run-slurm> submit # PID: 831
Also, the same test with the overlay final: prev: { mpi = final.mpich; } reports the wrong world size of 1 (three times) | 02:29:50 |
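For context, the overlay quoted above can be applied when importing nixpkgs. A minimal sketch, assuming a plain `<nixpkgs>` import rather than the actual test setup from the gist:

```nix
# Sketch: apply the overlay from the message when importing nixpkgs.
let
  pkgs = import <nixpkgs> {
    overlays = [
      # Make the generic `mpi` attribute point at MPICH.
      (final: prev: { mpi = final.mpich; })
    ];
  };
in
# Packages that receive `mpi` through callPackage are now built against MPICH.
pkgs.mpi
```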
SomeoneSerge (utc+3) | (trying to build mpich with pmix support)
/nix/store/p58l5qmzifl20qmjs3xfpl01f0mqlza2-binutils-2.40/bin/ld: lib/.libs/libmpi.so: undefined reference to `PMIx_Resolve_nodes'
...
/nix/store/p58l5qmzifl20qmjs3xfpl01f0mqlza2-binutils-2.40/bin/ld: lib/.libs/libmpi.so: undefined reference to `PMIx_Abort'
/nix/store/p58l5qmzifl20qmjs3xfpl01f0mqlza2-binutils-2.40/bin/ld: lib/.libs/libmpi.so: undefined reference to `PMIx_Commit'
collect2: error: ld returned 1 exit status
why won't people just use cmake | 18:54:34 |
SomeoneSerge (utc+3) | I mean, the official guide literally suggests using LD_LIBRARY_PATH to communicate the location of host libraries 🤦:
user@testbox:~/mpich-4.0.2/build$ LD_LIBRARY_PATH=~/slurm/22.05/inst/lib/ \
> ../configure --prefix=/home/user/bin/mpich/ --with-pmilib=slurm \
> --with-pmi=pmi2 --with-slurm=/home/lipi/slurm/master/inst
| 18:56:07 |
14 Dec 2023 |
SomeoneSerge (utc+3) | /nix/store/kzwmscs89rbmv9nycpvj02qz36lsgf9h-openpmix-4.2.7/include/pmix/src/include/pmix_config.h:#define PMIX_CONFIGURE_CLI ... \'--bindir=/nix/store/4c64iac91ja5r5llzs525kgl90996h09-openpmix-4.2.7-bin/bin ...
Beautiful, it's literally impossible to put all of this pmixcc stuff into a separate output so that gcc wouldn't be part of the runtime closure | 03:55:54 |
SomeoneSerge (utc+3) | mpich+pmix outside the VM:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==122066==ERROR: AddressSanitizer: SEGV on unknown address 0x601efafd3868 (pc 0x2b4e42acd6c7 bp 0x000000000001 sp 0x7ffc5f4344a0 T0)
==122066==The signal is caused by a READ memory access.
#0 0x2b4e42acd6c7 in MPIR_pmi_init (/nix/store/f40za7vl7n0r4awd8b0jz1xla5srkmnw-mpich-4.1.2/lib/libmpi.so.12+0x4716c7)
#1 0x2b4e42a497f8 in MPII_Init_thread (/nix/store/f40za7vl7n0r4awd8b0jz1xla5srkmnw-mpich-4.1.2/lib/libmpi.so.12+0x3ed7f8)
#2 0x2b4e42a4a574 in MPIR_Init_impl (/nix/store/f40za7vl7n0r4awd8b0jz1xla5srkmnw-mpich-4.1.2/lib/libmpi.so.12+0x3ee574)
#3 0x2b4e42784d5b in PMPI_Init (/nix/store/f40za7vl7n0r4awd8b0jz1xla5srkmnw-mpich-4.1.2/lib/libmpi.so.12+0x128d5b)
#4 0x4027f0 in main (/nix/store/3raslv01lvsk5f5vx30wcivx28fwsh92-pps-samples-0.0.0/bin/roundtrip+0x4027f0)
#5 0x2b4e454dffcd in __libc_start_call_main (/nix/store/gqghjch4p1s69sv4mcjksb2kb65rwqjy-glibc-2.38-23/lib/libc.so.6+0x27fcd)
#6 0x2b4e454e0088 in __libc_start_main_alias_1 (/nix/store/gqghjch4p1s69sv4mcjksb2kb65rwqjy-glibc-2.38-23/lib/libc.so.6+0x28088)
#7 0x403d14 in _start (/nix/store/3raslv01lvsk5f5vx30wcivx28fwsh92-pps-samples-0.0.0/bin/roundtrip+0x403d14)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/nix/store/f40za7vl7n0r4awd8b0jz1xla5srkmnw-mpich-4.1.2/lib/libmpi.so.12+0x4716c7) in MPIR_pmi_init
==122066==ABORTING
Works fine in the VM though | 11:03:20 |
SomeoneSerge (utc+3) | I'm desperate: whenever I build mpich or ompi with -fsanitize=address, FindMPI.cmake stops working | 18:21:22 |
SomeoneSerge (utc+3) | (also FindMPI begins by trying to locate mpiexec, this all looks so odd) | 18:22:16 |
SomeoneSerge (utc+3) | 🤔 Neither mpich nor openmpi in Nixpkgs ships these .pc files: https://gitlab.kitware.com/cmake/cmake/-/blob/2744f14db1e87f4b7f6eb6f30f7c84ea52ce4a7a/Modules/FindMPI.cmake#L1584-1592 | 19:49:50 |
SomeoneSerge (utc+3) | Should we do something like https://github.com/NixOS/nixpkgs/blob/5c1d2b0d06241279e6a70e276397a5e40e867499/pkgs/build-support/alternatives/blas/default.nix? Or should we ask upstream why they don't generate those?
For context, openmpi more or less maps onto cmake's expectations, whereas mpich just merges everything into a single file:
❯ ls /nix/store/nf8fqx39w8ib34hp24lrz48287xdbxd8-openmpi-4.1.6/lib/pkgconfig/
ompi-c.pc ompi-cxx.pc ompi-f77.pc ompi-f90.pc ompi-fort.pc ompi.pc orte.pc
❯ ls /nix/store/24sv27w3j1j3p7lxyh689bzbhmixxf35-mpich-4.1.2/lib/pkgconfig
mpich.pc
❯ cat /nix/store/24sv27w3j1j3p7lxyh689bzbhmixxf35-mpich-4.1.2/lib/pkgconfig/mpich.pc
...
Cflags: -I${includedir}
...
# pkg-config does not understand Cxxflags, etc. So we allow users to
# query them using the --variable option
cxxflags= -I${includedir}
fflags=-fallow-argument-mismatch -I${includedir}
fcflags=-fallow-argument-mismatch -I${includedir}
| 21:01:10 |
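One way the "alternatives"-style idea could look for openmpi is a small derivation that exposes the existing ompi-*.pc files under the names FindMPI asks pkg-config for. This is only a sketch: the mpi-c / mpi-cxx / mpi-fort names are an assumption about the linked FindMPI.cmake lines, the derivation name is made up, and the paths assume the lib/pkgconfig layout shown in the listing above:

```nix
# Sketch only: the mpi-*.pc target names are an assumption about what
# FindMPI queries; the ompi-*.pc source names come from the listing above.
{ pkgs ? import <nixpkgs> { } }:

pkgs.runCommand "openmpi-pkgconfig-aliases" { } ''
  mkdir -p $out/lib/pkgconfig
  for lang in c cxx fort; do
    # Alias ompi-<lang>.pc as mpi-<lang>.pc without modifying openmpi itself.
    ln -s ${pkgs.openmpi}/lib/pkgconfig/ompi-$lang.pc \
      $out/lib/pkgconfig/mpi-$lang.pc
  done
''
```

mpich would need more than symlinks, since its single mpich.pc keeps the C++ and Fortran flags in custom variables rather than in separate modules.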
3 Jan 2024 |
SomeoneSerge (utc+3) | Aj damn, running nixpkgs-built singularity images on nixos with nixpkgs' apptainer is broken (again): https://github.com/apptainer/apptainer/blob/3c5a579e51f57b66a92266a0f45504d55bcb6553/internal/pkg/util/gpu/nvidia.go#L96C2-L103 | 21:53:59 |