!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

282 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda
58 Servers

Sender | Message | Time
18 Nov 2024
@connorbaker:matrix.org connor (he/him)
In reply to @ss:someonex.net
True. I'm still yet to read up on how SLURM and friends do this. Shameless plug: https://github.com/sinanmohd/evanix (slides)
Woah! Thanks for the links, I wasn't aware of these
20:17:47
19 Nov 2024
@hexa:lossy.network hexa
python-updates with numpy 2.1 has landed in staging
00:31:36
@hexa:lossy.network hexa
sowwy
00:31:40
@connorbaker:matrix.org connor (he/him)
In reply to @ss:someonex.net
Should just work, what is the error?
Curl threw connection refused or something similar; I’ll try to get the log tomorrow
06:34:11
20 Nov 2024
@conroy:corncheese.org Conroy joined the room. 04:47:44
@connorbaker:matrix.org connor (he/him)
I did not get a chance; rip
07:22:37
@damesberger:matrix.org Daniel joined the room. 18:53:01
22 Nov 2024
@deng23fdsafgea:matrix.org deng23fdsafgea joined the room. 06:27:37
@numinit:matrix.org Morgan (@numinit) joined the room. 17:52:10
24 Nov 2024
@sielicki:matrix.org sielicki
https://negativo17.org/nvidia-driver/ pretty good read
21:49:05
@sielicki:matrix.org sielicki
most of this is stuff that nixos gets right, but it's a nice collection of gotchas and solutions
22:01:49
@sielicki:matrix.org sielicki
anyone have strong opinions on moving nccl and nccl-tests out of cudaModules? Rationale on moving them out: neither one is distributed as a part of the cuda toolkit and they release on an entirely separate cadence, so there's no real reason for it to be in there. It's no different than eg: torch in terms of the cuda dependency.
22:16:05
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @sielicki:matrix.org
anyone have strong opinions on moving nccl and nccl-tests out of cudaModules? Rationale on moving them out: neither one is distributed as a part of the cuda toolkit and they release on an entirely separate cadence, so there's no real reason for it to be in there. It's no different than eg: torch in terms of the cuda dependency.
iirc we put it in there because if you set tensorflow = ...callPackage ... { cudaPackages = cudaPackages_XX_y; } you'll need to also pass a compatible nccl
22:17:33
@ss:someonex.net SomeoneSerge (back on matrix)
so it's just easier to instantiate each cudaPackages variant with its own nccl and pass it along
22:17:55
@sielicki:matrix.org sielicki
I guess that's fair, and there is a pretty strong coupling of cuda versions and nccl versions... eg: https://github.com/pytorch/pytorch/pull/133593 has been stalled for some time due to nvidia dropping the pypi cu11 package for nccl, so there's reason to keep them consistent even if they technically release separately.
22:20:12
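For readers following the nccl thread above, here is a minimal Nix sketch of the pattern SomeoneSerge describes: each cudaPackages variant carries its own nccl, and the whole set is handed to a consumer in one step. It assumes a nixpkgs revision in which `cudaPackages_12` exists and `python3Packages.tensorflow` accepts a `cudaPackages` argument through `.override`; neither is confirmed in the chat, and the elided `cudaPackages_XX_y` from the message is deliberately left unfilled.

```nix
# Sketch only: attribute names below are assumptions and may differ
# between nixpkgs revisions.
let
  pkgs = import <nixpkgs> {
    config = {
      allowUnfree = true; # CUDA packages are unfree
      cudaSupport = true;
    };
  };

  # Choose one package set. The nccl attribute inside it is instantiated
  # against the same CUDA release as everything else in the set.
  cudaPackages = pkgs.cudaPackages_12; # assumed attribute name
in
{
  # nccl is taken from the chosen set, not from the top level,
  # so it always matches the CUDA version in use.
  inherit (cudaPackages) nccl;

  # Assumed consumer: swapping the whole set in one place means the
  # consumer also picks up the matching nccl without a separate override.
  tensorflow = pkgs.python3Packages.tensorflow.override {
    inherit cudaPackages;
  };
}
```

If nccl lived outside the package set, the consumer would presumably need a second, manually matched `nccl` override alongside `cudaPackages`, which is the inconvenience the message points to.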
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @sielicki:matrix.org
https://negativo17.org/nvidia-driver/ pretty good read
Any highlights, what we might be missing?
22:22:09
@sielicki:matrix.org sielicki
honestly I am not sure there's anything, I just like the thought that went into it
22:27:21
@sielicki:matrix.org sielicki
the special softdep for nvidia-uvm etc
22:27:48
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @sielicki:matrix.org
the special softdep for nvidia-uvm etc
yeah we have that, and iirc a special-case for the datacenter driver where it's not a softdep anymore
22:28:24
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @sielicki:matrix.org
the special softdep for nvidia-uvm etc
* yeah we have that, and iirc a special-case for the datacenter driver where it's not a softdep anymore (not sure what the exact situation is)
22:29:12
25 Nov 2024
@sielicki:matrix.org sielicki
is this useful? https://gist.github.com/sielicki/2601de3ad8d8c732af80b12e36d326aa
04:31:08
@sielicki:matrix.org sielicki
example of its output: https://gist.github.com/sielicki/2601de3ad8d8c732af80b12e36d326aa/24c08bb29f1397c7d006b01f7afddd5cb06e90a5
04:31:38
@connorbaker:matrix.org connor (he/him)
You can see what I eventually hope to move in-tree here: https://github.com/ConnorBaker/cuda-packages
Here’s the update script I’ve made for the different redists: https://github.com/ConnorBaker/cuda-packages/tree/main/scripts/cuda-redist
07:01:12
@connorbaker:matrix.org connor (he/him)
Ugh we should write an update for the post Tom made on discourse (https://discourse.nixos.org/t/community-team-updates/56458)
@someoneserge anything we should mention in particular?
I think I started a draft for an update earlier this year so I’ll see if I can find it :/
07:03:39
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @connorbaker:matrix.org
Ugh we should write an update for the post Tom made on discourse (https://discourse.nixos.org/t/community-team-updates/56458)
@someoneserge anything we should mention in particular?
I think I started a draft for an update earlier this year so I’ll see if I can find it :/
Let's make a shared pad for the draft?
14:18:46
@ss:someonex.net SomeoneSerge (back on matrix)
Also maybe we've already reached the point where a room-wide voice call could be a better way to list the "challenges"
14:41:51
@hexa:lossy.network hexa
is anyone here aware why tensorflow 2.13.0 on 24.11 now requires AVX CPU instructions and the same version on 24.05 did not? https://github.com/NixOS/nixpkgs/issues/358973
17:49:22
@hexa:lossy.network hexa
yes, the answer is dependencies, likely protobuf from the call trace
17:49:39
@hexa:lossy.network hexa
* the likely answer is dependencies, should be protobuf from the call trace
17:49:58
@ss:someonex.net SomeoneSerge (back on matrix)
I was about to post "you sure it's not the pypi garbage" and then thought "I'm surely just being biased"
18:07:06

Room Version: 9