27 Oct 2023 |
| @federicodschonborn:matrix.org changed their profile picture. | 01:24:47 |
29 Oct 2023 |
| SomeoneSerge (utc+3) changed their display name from SomeoneSerge (UTC+1) to SomeoneSerge (UTC+2). | 22:42:17 |
3 Nov 2023 |
| @fizihcyst:matrix.org joined the room. | 11:50:01 |
5 Nov 2023 |
| kotatsuyaki joined the room. | 03:52:10 |
9 Nov 2023 |
| Ido Samuelson changed their display name from snick to Ido Samuelson. | 06:33:43 |
10 Nov 2023 |
| globin joined the room. | 00:49:30 |
15 Nov 2023 |
| @grahamc:nixos.org changed room power levels. | 16:08:35 |
| @grahamc:nixos.org left the room. | 16:08:36 |
| NixOS Moderation Bot changed room power levels. | 18:12:37 |
| NixOS Moderation Bot changed room power levels. | 18:12:37 |
19 Nov 2023 |
| pbsds changed their display name from pbsds to pbsds (federation borken, may not see reply). | 03:36:14 |
| ZXGU joined the room. | 10:59:25 |
| pbsds changed their display name from pbsds (federation borken, may not see reply) to pbsds. | 20:39:13 |
21 Nov 2023 |
| hdzki ⚡️ joined the room. | 18:23:55 |
29 Nov 2023 |
pie_ | I'm slightly procrastinating. What's up in HPC these days? | 01:34:33 |
3 Dec 2023 |
SomeoneSerge (utc+3) | ShamrockLee (Yueh-Shun Li), jbedo: do you know of any downsides to this VM-free approach to assembling singularity images? https://github.com/NixOS/nixpkgs/issues/177908#issuecomment-1495625986 | 14:16:57 |
SomeoneSerge (utc+3) | Finally (lol) got around to trying Nix-built singularity images on the cluster:
❯ nom build .#pkgsCuda.some-pkgs-py.edm.image -L
❯ du -hs $(readlink ./result)
3.4G /nix/store/axhpdc96qgzk720yciwfach93f3xrqby-singularity-image-edm.img
❯ du -hs /scratch/cs/graphics/singularity-images/images/edm.sif # Baseline based off NGC
7.6G /scratch/cs/graphics/singularity-images/images/edm.sif
❯ rsync -LP ./result triton:
❯ ssh triton srun --mem=8G --time=0:05:00 --gres=gpu:a100:1 singularity exec -B /m:/m -B /scratch:/scratch -B /l:/l --nv ./result nixglhost -- python -m edm.example
...
Saving image grid to "imagenet-64x64.png"...
Done
| 17:33:47 |
SomeoneSerge (utc+3) | (nccl and mpi tests on the way) | 17:33:53 |
4 Dec 2023 |
SomeoneSerge (utc+3) | ShamrockLee (Yueh-Shun Li) btw the squashfs compression is behaving oddly: I went over some threshold maybe and now a 5.7GiB buildEnv maps into an 11GiB sif | 10:10:34 |
SomeoneSerge (utc+3) | AJAJAJA all you need is ram: I built the same image with memSize = 20 * 1024 instead of 4 * 1024 , and now the final result is like 2.8G... | 11:08:56 |
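A minimal sketch of how the size comparison in the two messages above could be scripted, so that a jump like the 5.7GiB-to-11GiB one is noticed right after a build. The function name `size_ratio_percent` is hypothetical, and the sketch assumes GNU `du` (`-b` for apparent size in bytes, `-L` to follow symlinks such as `./result`); it is not part of nixpkgs or singularity.

```shell
# Print the size of IMAGE as an integer percentage of the size of INPUT.
# Hypothetical helper; assumes GNU du (-s summary, -b bytes, -L follow symlinks).
size_ratio_percent() {
  _in_bytes=$(du -sbL -- "$1" | cut -f1)   # e.g. the buildEnv closure
  _out_bytes=$(du -sbL -- "$2" | cut -f1)  # e.g. the resulting .sif/.img
  if [ "$_in_bytes" -eq 0 ]; then
    echo "size_ratio_percent: input is empty" >&2
    return 1
  fi
  # Integer arithmetic: values above 100 mean the image is larger than its input.
  echo $(( _out_bytes * 100 / _in_bytes ))
}
```

Called as `size_ratio_percent <buildEnv path> <image path>`, a result well above 100 would correspond to the 4 * 1024 `memSize` case described above, and a result near 50 to the 20 * 1024 case.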
5 Dec 2023 |
| @federicodschonborn:matrix.org changed their profile picture. | 00:38:43 |
jbedo | In reply to @ss:someonex.net ("ShamrockLee (Yueh-Shun Li), jbedo: do you know of any downsides to this VM-free approach to assembling singularity images? https://github.com/NixOS/nixpkgs/issues/177908#issuecomment-1495625986"): not aware of any downsides, it's a pretty nice approach | 01:02:55 |
11 Dec 2023 |
markuskowa | SomeoneSerge (UTC+2): I'm pretty busy at the moment. I will look at the slurm PR on the weekend. | 08:48:56 |
13 Dec 2023 |
SomeoneSerge (utc+3) | * https://gist.github.com/SomeoneSerge/3f894ffb5f97e55a0a5cfc10dfbc66e1#file-slurm-nix-L45-L61
Does it make sense that this consistently blocks after the following message?
vm-test-run-slurm> submit # WARNING: Open MPI accepted a TCP connection from what appears to be a
vm-test-run-slurm> submit # another Open MPI process but cannot find a corresponding process
vm-test-run-slurm> submit # entry for that peer.
vm-test-run-slurm> submit #
vm-test-run-slurm> submit # This attempted connection will be ignored; your MPI job may or may not
vm-test-run-slurm> submit # continue properly.
vm-test-run-slurm> submit #
vm-test-run-slurm> submit # Local host: node1
vm-test-run-slurm> submit # PID: 831
Also the same test with final: prev: { mpi = final.mpich; } reports the wrong world size of 1 (thrice) | 02:29:50 |
| connor (he/him) (UTC-7) joined the room. | 14:47:19 |
SomeoneSerge (utc+3) | (trying to build mpich with pmix support)
/nix/store/p58l5qmzifl20qmjs3xfpl01f0mqlza2-binutils-2.40/bin/ld: lib/.libs/libmpi.so: undefined reference to `PMIx_Resolve_nodes'
...
/nix/store/p58l5qmzifl20qmjs3xfpl01f0mqlza2-binutils-2.40/bin/ld: lib/.libs/libmpi.so: undefined reference to `PMIx_Abort'
/nix/store/p58l5qmzifl20qmjs3xfpl01f0mqlza2-binutils-2.40/bin/ld: lib/.libs/libmpi.so: undefined reference to `PMIx_Commit'
collect2: error: ld returned 1 exit status
why won't people just use cmake | 18:54:34 |
SomeoneSerge (utc+3) | * I mean, the official guide literally suggests using LD_LIBRARY_PATH to communicate the location of host libraries 🤦:
user@testbox:~/mpich-4.0.2/build$ LD_LIBRARY_PATH=~/slurm/22.05/inst/lib/ \
> ../configure --prefix=/home/user/bin/mpich/ --with-pmilib=slurm \
> --with-pmi=pmi2 --with-slurm=/home/lipi/slurm/master/inst
| 18:56:07 |
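For contrast with the LD_LIBRARY_PATH approach quoted above, here is a minimal sketch of the conventional autoconf alternative: passing the library location as linker flags, so that it is recorded on the link line and in the binary's rpath. The helper name `link_flags_for_prefix` and any prefix passed to it are hypothetical, and whether this alone resolves the undefined PMIx references in the mpich build is an untested assumption.

```shell
# Emit linker flags for a library installed under PREFIX:
#   -L<prefix>/lib          lets ld find the library at link time
#   -Wl,-rpath,<prefix>/lib records the directory for the runtime loader
# Hypothetical helper, not part of mpich or slurm.
link_flags_for_prefix() {
  _prefix=$1
  printf '%s\n' "-L$_prefix/lib -Wl,-rpath,$_prefix/lib"
}
```

An autoconf-generated configure script accepts variables as arguments, so the output could be passed as `../configure ... LDFLAGS="$(link_flags_for_prefix /path/to/prefix)"`, with `/path/to/prefix` standing in for wherever the host library actually lives.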