Matrix room !kFJOpVCFYFzxqjpJxm:nixos.org

60 members, 19 servers
27 Oct 2023
[01:24:47] @federicodschonborn:matrix.org changed their profile picture.
29 Oct 2023
[22:42:17] SomeoneSerge (utc+3) <@ss:someonex.net> changed their display name from SomeoneSerge (UTC+1) to SomeoneSerge (UTC+2).
3 Nov 2023
[11:50:01] @fizihcyst:matrix.org joined the room.
5 Nov 2023
[03:52:10] kotatsuyaki <@kotatsuyaki:matrix.kotatsu.dev> joined the room.
9 Nov 2023
[06:33:43] Ido Samuelson <@bootstrapper:matrix.org> changed their display name from snick to Ido Samuelson.
10 Nov 2023
[00:49:30] globin <@globin:toznenetl.chat> joined the room.
15 Nov 2023
[16:08:35] @grahamc:nixos.org changed room power levels.
[16:08:36] @grahamc:nixos.org left the room.
[18:12:37] NixOS Moderation Bot <@mjolnir:nixos.org> changed room power levels.
[18:12:37] NixOS Moderation Bot <@mjolnir:nixos.org> changed room power levels.
19 Nov 2023
[03:36:14] pbsds <@pederbs:pvv.ntnu.no> changed their display name from pbsds to pbsds (federation borken, may not see reply).
[10:59:25] ZXGU <@zxgu:matrix.org> joined the room.
[20:39:13] pbsds <@pederbs:pvv.ntnu.no> changed their display name from pbsds (federation borken, may not see reply) to pbsds.
21 Nov 2023
[18:23:55] hdzki ⚡️ <@hdzki:hdzki.kozow.com> joined the room.
29 Nov 2023
[01:34:33] pie_ <@jcie74:matrix.org>: I'm slightly procrastinating. What's up in HPC these days?
3 Dec 2023
[14:16:57] SomeoneSerge (utc+3) <@ss:someonex.net>: ShamrockLee (Yueh-Shun Li), jbedo: do you know any downsides to this VM-free approach to assembling singularity images? https://github.com/NixOS/nixpkgs/issues/177908#issuecomment-1495625986
[17:33:47] SomeoneSerge (utc+3) <@ss:someonex.net>:

Finally (lol) got around to trying Nix-built singularity images on the cluster:

❯ nom build .#pkgsCuda.some-pkgs-py.edm.image -L
❯ du -hs $(readlink ./result)
3.4G    /nix/store/axhpdc96qgzk720yciwfach93f3xrqby-singularity-image-edm.img
❯ du -hs /scratch/cs/graphics/singularity-images/images/edm.sif  # Baseline based off NGC
7.6G    /scratch/cs/graphics/singularity-images/images/edm.sif
❯ rsync -LP ./result triton:
❯ ssh triton srun --mem=8G --time=0:05:00 --gres=gpu:a100:1 singularity exec -B /m:/m -B /scratch:/scratch -B /l:/l --nv ./result nixglhost -- python -m edm.example
...
Saving image grid to "imagenet-64x64.png"...
Done
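Editor's note for context: the nixpkgs helper under discussion is `singularity-tools.buildImage`. A minimal sketch of how such an image is typically declared follows; the attribute names are from memory and may differ across nixpkgs revisions, and `edm` and the contents are stand-ins, not the author's actual flake:

```nix
# Hedged sketch: build a Singularity/Apptainer image from a Nix closure
# with nixpkgs' singularity-tools helper.
{ pkgs ? import <nixpkgs> { } }:
pkgs.singularity-tools.buildImage {
  name = "edm";                 # stand-in image name
  contents = [ pkgs.python3 ];  # packages to copy into the image
  diskSize = 8 * 1024;          # MiB of scratch disk for the build
  memSize = 4 * 1024;           # MiB of memory for the image-assembly step
}
```

The resulting store path is the `.img`/`.sif` that gets `rsync`'d to the cluster above.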
[17:33:53] SomeoneSerge (utc+3) <@ss:someonex.net>: (nccl and mpi tests on the way)
4 Dec 2023
[10:10:34] SomeoneSerge (utc+3) <@ss:someonex.net>: ShamrockLee (Yueh-Shun Li): btw the squashfs compression is behaving oddly: I went over some threshold maybe and now a 5.7GiB buildEnv maps into an 11GiB sif
[11:08:56] SomeoneSerge (utc+3) <@ss:someonex.net>: AJAJAJA all you need is RAM: I built the same image with memSize = 20 * 1024 instead of 4 * 1024, and now the final result is like 2.8G...
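A sketch of the fix being described, assuming the image comes from `singularity-tools.buildImage`: the `memSize` argument sizes the memory available to the image-assembly step, and too little of it apparently degrades the squashfs compression rather than failing outright:

```nix
# Hedged sketch: same image declaration, more memory for the build.
pkgs.singularity-tools.buildImage {
  name = "edm";            # stand-in
  contents = [ /* ... */ ];
  memSize = 20 * 1024;     # was 4 * 1024; per the message above, this
                           # shrank the resulting .sif from ~11G to ~2.8G
}
```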
5 Dec 2023
[00:38:43] @federicodschonborn:matrix.org changed their profile picture.
[01:02:55] jbedo <@jb:vk3.wtf>:
> In reply to @ss:someonex.net: ShamrockLee (Yueh-Shun Li), jbedo: do you know any downsides to this VM-free approach to assembling singularity images? https://github.com/NixOS/nixpkgs/issues/177908#issuecomment-1495625986
not aware of any downsides, it's a pretty nice approach
11 Dec 2023
[08:48:56] markuskowa <@markuskowa:matrix.org>: SomeoneSerge (UTC+2): I'm pretty busy at the moment. I will look at the slurm PR on the weekend.
13 Dec 2023
[02:29:50] SomeoneSerge (utc+3) <@ss:someonex.net> (edited; first posted at 01:12:13):

https://gist.github.com/SomeoneSerge/3f894ffb5f97e55a0a5cfc10dfbc66e1#file-slurm-nix-L45-L61

Does it make sense that this consistently blocks after the following message?

vm-test-run-slurm> submit # WARNING: Open MPI accepted a TCP connection from what appears to be a
vm-test-run-slurm> submit # another Open MPI process but cannot find a corresponding process
vm-test-run-slurm> submit # entry for that peer.
vm-test-run-slurm> submit #
vm-test-run-slurm> submit # This attempted connection will be ignored; your MPI job may or may not
vm-test-run-slurm> submit # continue properly.
vm-test-run-slurm> submit #
vm-test-run-slurm> submit #   Local host: node1
vm-test-run-slurm> submit #   PID:        831

Also the same test with final: prev: { mpi = final.mpich; } reports the wrong world size of 1 (thrice)
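For readers unfamiliar with the `final: prev: { mpi = final.mpich; }` snippet: it is a nixpkgs overlay swapping the generic `mpi` attribute (openmpi by default) for mpich, so everything in the test payload that depends on `mpi` is rebuilt against mpich. Roughly (a sketch; the exact test wiring is in the linked gist):

```nix
# Hedged sketch: select mpich as the package-set-wide MPI implementation
# via an overlay, instead of the default openmpi.
import <nixpkgs> {
  overlays = [
    (final: prev: { mpi = final.mpich; })
  ];
}
```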
[14:47:19] connor (he/him) (UTC-7) <@connorbaker:matrix.org> joined the room.
[18:54:34] SomeoneSerge (utc+3) <@ss:someonex.net>:

(trying to build mpich with pmix support)

/nix/store/p58l5qmzifl20qmjs3xfpl01f0mqlza2-binutils-2.40/bin/ld: lib/.libs/libmpi.so: undefined reference to `PMIx_Resolve_nodes'
...
/nix/store/p58l5qmzifl20qmjs3xfpl01f0mqlza2-binutils-2.40/bin/ld: lib/.libs/libmpi.so: undefined reference to `PMIx_Abort'
/nix/store/p58l5qmzifl20qmjs3xfpl01f0mqlza2-binutils-2.40/bin/ld: lib/.libs/libmpi.so: undefined reference to `PMIx_Commit'
collect2: error: ld returned 1 exit status

why won't people just use cmake
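The undefined `PMIx_*` symbols suggest mpich's link line never picks up libpmix. A hedged sketch of the kind of override that usually addresses this; the configure switch name is an assumption, so check `./configure --help` for the mpich version in question:

```nix
# Hedged sketch: give mpich the pmix library and point configure at it.
final: prev: {
  mpich = prev.mpich.overrideAttrs (old: {
    buildInputs = (old.buildInputs or [ ]) ++ [ final.pmix ];
    configureFlags = (old.configureFlags or [ ]) ++ [
      "--with-pmix=${final.pmix}"  # assumed switch name
    ];
  });
}
```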
[18:56:07] SomeoneSerge (utc+3) <@ss:someonex.net> (edited; first posted at 18:55:41):

I mean, the official guide literally suggests to use LD_LIBRARY_PATH to communicate the location of host libraries 🤦:

user@testbox:~/mpich-4.0.2/build$ LD_LIBRARY_PATH=~/slurm/22.05/inst/lib/ \
> ../configure --prefix=/home/user/bin/mpich/ --with-pmilib=slurm \
> --with-pmi=pmi2 --with-slurm=/home/lipi/slurm/master/inst



Room version: 9