| 18 Nov 2024 |
connor (he/him) | Unrelated -- if anyone has experience with NixOS VM tests and getting multiple nodes to talk to each other, I'd appreciate pointers. ping can resolve hostnames but curl can't for some reason (https://github.com/ConnorBaker/nix-eval-graph/commit/c5a1e2268ead6ff6ffaab672762c1eedee53f403). | 08:43:02 |
SomeoneSerge (back on matrix) | In reply to @connorbaker:matrix.org Currently I don't know how I'd even model it... but I've been told that job scheduling is a well-researched problem in HPC communities ;) I started to write something about how I think of high-level tradeoffs between choosing where to build to build moar fast, reduce the number of rebuilds (if they are at all permitted), reduce network traffic, etc. and then thought "well what if the machines aren't homogeneous" and I've decided it's time for bed. True. I'm still yet to read up on how SLURM and friends do this. Shameless plug: https://github.com/sinanmohd/evanix (slides) | 12:20:00 |
SomeoneSerge (back on matrix) | You should chat with picnoir too | 12:20:44 |
SomeoneSerge (back on matrix) | In reply to @connorbaker:matrix.org Unrelated -- if anyone has experience with NixOS VM tests and getting multiple nodes to talk to each other, I'd appreciate pointers. ping can resolve hostnames but curl can't for some reason (https://github.com/ConnorBaker/nix-eval-graph/commit/c5a1e2268ead6ff6ffaab672762c1eedee53f403). Should just work, what is the error? | 12:22:30 |
connor (he/him) | In reply to @ss:someonex.net True. I'm still yet to read up on how SLURM and friends do this. Shameless plug: https://github.com/sinanmohd/evanix (slides) Woah! Thanks for the links, I wasn't aware of these | 20:17:47 |
| 19 Nov 2024 |
hexa | python-updates with numpy 2.1 has landed in staging | 00:31:36 |
hexa | sowwy | 00:31:40 |
connor (he/him) | In reply to @ss:someonex.net Should just work, what is the error? Curl threw connection refused or something similar; I’ll try to get the log tomorrow | 06:34:11 |
| 20 Nov 2024 |
| Conroy joined the room. | 04:47:44 |
connor (he/him) | I did not get a chance; rip | 07:22:37 |
| Daniel joined the room. | 18:53:01 |
| 22 Nov 2024 |
| deng23fdsafgea joined the room. | 06:27:37 |
| Morgan (@numinit) joined the room. | 17:52:10 |
| 24 Nov 2024 |
sielicki | https://negativo17.org/nvidia-driver/ pretty good read | 21:49:05 |
sielicki | most of this is stuff that nixos gets right, but it's a nice collection of gotchas and solutions | 22:01:49 |
sielicki | anyone have strong opinions on moving nccl and nccl-tests out of cudaModules? Rationale on moving them out: neither one is distributed as a part of the cuda toolkit and they release on an entirely separate cadence, so there's no real reason for it to be in there. It's no different than eg: torch in terms of the cuda dependency. | 22:16:05 |
SomeoneSerge (back on matrix) | In reply to @sielicki:matrix.org anyone have strong opinions on moving nccl and nccl-tests out of cudaModules? Rationale on moving them out: neither one is distributed as a part of the cuda toolkit and they release on an entirely separate cadence, so there's no real reason for it to be in there. It's no different than eg: torch in terms of the cuda dependency. iirc we put it in there because if you set tensorflow = ...callPackage ... { cudaPackages = cudaPackages_XX_y; } you'll need to also pass a compatible nccl | 22:17:33 |
SomeoneSerge (back on matrix) | so it's just easier to instantiate each cudaPackages variant with its own nccl and pass it along | 22:17:55 |
sielicki | I guess that's fair, and there is a pretty strong coupling of cuda versions and nccl versions... eg: https://github.com/pytorch/pytorch/pull/133593 has been stalled for some time due to nvidia dropping the pypi cu11 package for nccl, so there's reason to keep them consistent even if they technically release separately. | 22:20:12 |
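[Editor's note] The coupling discussed above (each cudaPackages variant carrying its own nccl, so a consumer only has to be handed one package set) can be sketched as a toy Nix expression. This is a minimal illustration and not the nixpkgs implementation: plain attribute sets stand in for real packages, and `mkCudaPackages`, `cudaPackages_A`, `cudaPackages_B`, `tensorflowLike`, and all version strings are hypothetical names made up for the example.

```nix
# Toy model only: plain attribute sets stand in for real nixpkgs package sets.
# Every name and version string below is hypothetical.
let
  # Build one "cudaPackages" variant that carries its own matching nccl.
  mkCudaPackages = cudaVersion: ncclVersion: {
    cudatoolkit = "cudatoolkit-${cudaVersion}";
    nccl = "nccl-${ncclVersion}-cuda${cudaVersion}";
  };

  cudaPackages_A = mkCudaPackages "11.8" "2.20.5";
  cudaPackages_B = mkCudaPackages "12.4" "2.21.5";

  # A consumer takes the whole package set, so it never receives nccl separately
  # and cannot end up with an nccl built for a different CUDA version.
  tensorflowLike = { cudaPackages }: {
    buildInputs = [ cudaPackages.cudatoolkit cudaPackages.nccl ];
  };
in
{
  withA = tensorflowLike { cudaPackages = cudaPackages_A; };
  withB = tensorflowLike { cudaPackages = cudaPackages_B; };
}
```

Saved to a file, this should evaluate with `nix-instantiate --eval --strict` on that file. Swapping the `cudaPackages` argument changes the toolkit and nccl together, which is the convenience described in the messages above. If nccl lived outside the set, each override would need a second, separately chosen argument.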
SomeoneSerge (back on matrix) | In reply to @sielicki:matrix.org https://negativo17.org/nvidia-driver/ pretty good read Any highlights, what we might be missing? | 22:22:09 |
sielicki | honestly I am not sure there's anything, I just like the thought that went into it | 22:27:21 |
sielicki | the special softdep for nvidia-uvm etc | 22:27:48 |
SomeoneSerge (back on matrix) | In reply to @sielicki:matrix.org the special softdep for nvidia-uvm etc yeah we have that, and iirc a special-case for the datacenter driver where it's not a softdep anymore (not sure what the exact situation is) | 22:29:12 |
| 25 Nov 2024 |
sielicki | is this useful? https://gist.github.com/sielicki/2601de3ad8d8c732af80b12e36d326aa | 04:31:08 |
sielicki | example of its output: https://gist.github.com/sielicki/2601de3ad8d8c732af80b12e36d326aa/24c08bb29f1397c7d006b01f7afddd5cb06e90a5 | 04:31:38 |
connor (he/him) | You can see what I eventually hope to move in-tree here: https://github.com/ConnorBaker/cuda-packages
Here’s the update script I’ve made for the different redists: https://github.com/ConnorBaker/cuda-packages/tree/main/scripts/cuda-redist | 07:01:12 |
connor (he/him) | Ugh we should write an update for the post Tom made on discourse (https://discourse.nixos.org/t/community-team-updates/56458)
@someoneserge anything we should mention in particular?
I think I started a draft for an update earlier this year so I’ll see if I can find it :/ | 07:03:39 |
SomeoneSerge (back on matrix) | In reply to @connorbaker:matrix.org Ugh we should write an update for the post Tom made on discourse (https://discourse.nixos.org/t/community-team-updates/56458)
@someoneserge anything we should mention in particular?
I think I started a draft for an update earlier this year so I’ll see if I can find it :/ Let's make a shared pad for the draft? | 14:18:46 |
SomeoneSerge (back on matrix) | Also maybe we've already reached the point where a room-wide voice call could be a better way to list the "challenges" | 14:41:51 |