| 18 Nov 2024 |
SomeoneSerge (back on matrix) | You should chat with picnoir too | 12:20:44 |
SomeoneSerge (back on matrix) | In reply to @connorbaker:matrix.org Unrelated -- if anyone has experience with NixOS VM tests and getting multiple nodes to talk to each other, I'd appreciate pointers. ping can resolve hostnames but curl can't for some reason (https://github.com/ConnorBaker/nix-eval-graph/commit/c5a1e2268ead6ff6ffaab672762c1eedee53f403). Should just work, what is the error? | 12:22:30 |
connor (he/him) | In reply to @ss:someonex.net True. I'm still yet to read up on how SLURM and friends do this. Shameless plug: https://github.com/sinanmohd/evanix (slides) Woah! Thanks for the links, I wasn't aware of these | 20:17:47 |
| 19 Nov 2024 |
hexa | python-updates with numpy 2.1 has landed in staging | 00:31:36 |
hexa | sowwy | 00:31:40 |
connor (he/him) | In reply to @ss:someonex.net Should just work, what is the error? Curl threw connection refused or something similar; I’ll try to get the log tomorrow | 06:34:11 |
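A minimal sketch of a two-node NixOS VM test where one node curls the other by hostname; this is not the test from the linked commit, and the node names, the web server module, and the port are assumptions. A "connection refused" while ping works usually means the target service isn't listening yet or the firewall port isn't open, so the sketch opens the firewall and waits for the port before curling.

```nix
# Hypothetical two-node VM test sketch; the served content and port are assumptions.
{ pkgs, ... }:
pkgs.testers.runNixOSTest {
  name = "two-node-curl";

  nodes.server = { ... }: {
    # Serve something on port 8080; ping works without any of this, but curl
    # needs a listening service and an open firewall port.
    services.static-web-server = {
      enable = true;
      listen = "0.0.0.0:8080";
      root = pkgs.writeTextDir "index.html" "hello";
    };
    networking.firewall.allowedTCPPorts = [ 8080 ];
  };

  nodes.client = { ... }: {
    environment.systemPackages = [ pkgs.curl ];
  };

  testScript = ''
    start_all()
    server.wait_for_unit("static-web-server.service")
    server.wait_for_open_port(8080)
    # Test nodes can reach each other by node name via the generated /etc/hosts.
    client.succeed("curl --fail http://server:8080/")
  '';
}
```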
| 20 Nov 2024 |
| Conroy joined the room. | 04:47:44 |
connor (he/him) | I did not get a chance; rip | 07:22:37 |
| Daniel joined the room. | 18:53:01 |
| 22 Nov 2024 |
| deng23fdsafgea joined the room. | 06:27:37 |
| Morgan (@numinit) joined the room. | 17:52:10 |
| 24 Nov 2024 |
sielicki | https://negativo17.org/nvidia-driver/ pretty good read | 21:49:05 |
sielicki | most of this is stuff that nixos gets right, but it's a nice collection of gotchas and solutions | 22:01:49 |
sielicki | anyone have strong opinions on moving nccl and nccl-tests out of cudaModules? Rationale for moving them out: neither one is distributed as a part of the cuda toolkit and they release on an entirely separate cadence, so there's no real reason for them to be in there. They're no different than eg: torch in terms of the cuda dependency. | 22:16:05 |
SomeoneSerge (back on matrix) | In reply to @sielicki:matrix.org anyone have strong opinions on moving nccl and nccl-tests out of cudaModules? Rationale for moving them out: neither one is distributed as a part of the cuda toolkit and they release on an entirely separate cadence, so there's no real reason for them to be in there. They're no different than eg: torch in terms of the cuda dependency. iirc we put it in there because if you set tensorflow = ...callPackage ... { cudaPackages = cudaPackages_XX_y; } you'll need to also pass a compatible nccl | 22:17:33 |
SomeoneSerge (back on matrix) | so it's just easier to instantiate each cudaPackages variant with its own nccl and pass it along | 22:17:55 |
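A sketch of the pattern described above: because each cudaPackages set carries its own nccl, overriding the CUDA version for one consumer also swaps in a compatible nccl for free. The attribute names here are illustrative and may not match current nixpkgs exactly.

```nix
# Illustrative only; exact attribute and override names are assumptions.
{ pkgs }:
{
  # Swapping the cudaPackages set is enough: cudaPackages_11_8.nccl is
  # already built against CUDA 11.8, so no separate nccl override is needed.
  tensorflow-cuda11 = pkgs.python3Packages.tensorflow.override {
    cudaPackages = pkgs.cudaPackages_11_8;
  };

  # If nccl lived outside the cudaPackages sets, every such override would
  # also have to pass a matching nccl explicitly, e.g. (hypothetical knob):
  #
  #   tensorflow-cuda11 = pkgs.python3Packages.tensorflow.override {
  #     cudaPackages = pkgs.cudaPackages_11_8;
  #     nccl = ncclForCuda11;
  #   };
}
```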
sielicki | I guess that's fair, and there is a pretty strong coupling of cuda versions and nccl versions... eg: https://github.com/pytorch/pytorch/pull/133593 has been stalled for some time due to nvidia dropping the pypi cu11 package for nccl, so there's reason to keep them consistent even if they technically release separately. | 22:20:12 |