
NixOS CUDA

290 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



18 Nov 2024
@connorbaker:matrix.org connor (he/him)
Unrelated to closure woes, I tried to package https://github.com/NVIDIA/MatX and https://github.com/NVIDIA/nvbench and nearly pulled my hair out. If anyone has suggestions for doing so without creating a patched and vendored copy of https://github.com/rapidsai/rapids-cmake or writing my own CMake for everything, I’d love to hear!
05:23:26
@connorbaker:matrix.org connor (he/him)
Also, anyone know how the ROCm maintainers are doing?
05:26:35
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @connorbaker:matrix.org

Anyway in the interest of splitting my attention ever more thinly I decided to start trying to work on some approach toward evaluation of derivations and building them
The idea being to have

  1. a service which is given a flake ref and an attribute path and efficiently produces a list of attribute paths to derivations existing under the given attribute path and stores the eval time somewhere
  2. a service which is given a flake ref and an attribute path to a derivation and produces the JSON representation of the closure of derivations required to realize the derivation, again storing eval time somewhere
  3. a service which functions as a job scheduler, using historical data about costs (space, time, memory, CPU usage, etc.) and information about locality (existing store paths on different builders) to realize a derivation, which is updated upon realization of a derivation
Awesome! I've been bracing myself to look into that too. What's your current idea regarding costs and locality?
07:09:42
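[A hedged sketch of what service (1) in the quoted list might look like at its core, as a plain Nix expression. `drvNamesUnder` is a hypothetical helper, not an existing nixpkgs function; the real service would recurse into nested sets and record timings, which is omitted here.]

```nix
# Enumerate the names under an attribute set that evaluate to derivations,
# guarding each probe with tryEval so one broken attribute doesn't abort
# the whole enumeration.
let
  pkgs = import <nixpkgs> { };
  inherit (pkgs) lib;
  drvNamesUnder = attrs:
    builtins.filter
      (name:
        let r = builtins.tryEval (lib.isDerivation attrs.${name});
        in r.success && r.value)
      (builtins.attrNames attrs);
in
drvNamesUnder pkgs.cudaPackages
```

[Note that `builtins.tryEval` only catches `throw`/`assert` failures forced at this level, so a production service would likely need `--apply` tricks or deep forcing on top of this.]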
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @connorbaker:matrix.org
Unrelated to closure woes, I tried to package https://github.com/NVIDIA/MatX and https://github.com/NVIDIA/nvbench and nearly pulled my hair out. If anyone has suggestions for doing so without creating a patched and vendored copy of https://github.com/rapidsai/rapids-cmake or writing my own CMake for everything, I’d love to hear!
we'd need to do that if we were to package rapids itself too, wouldn't we?
07:11:11
@connorbaker:matrix.org connor (he/him)
In reply to @ss:someonex.net
Awesome! I've been bracing myself to look into that too. What's your current idea regarding costs and locality?
Currently I don't know how I'd even model it... but I've been told that job scheduling is a well-researched problem in HPC communities ;)
I started to write something about how I think of high-level tradeoffs between choosing where to build to build moar fast, reduce the number of rebuilds (if they are at all permitted), reduce network traffic, etc. and then thought "well what if the machines aren't homogeneous" and I've decided it's time for bed.
08:40:34
@connorbaker:matrix.org connor (he/him)
In reply to @ss:someonex.net
we'd need to do that if we were to package rapids itself too, wouldn't we?
I have been avoiding rapids so hard lmao 🙅‍♂️
08:40:49
@connorbaker:matrix.org connor (he/him)
Unrelated -- if anyone has experience with NixOS VM tests and getting multiple nodes to talk to each other, I'd appreciate pointers. ping can resolve hostnames but curl can't for some reason (https://github.com/ConnorBaker/nix-eval-graph/commit/c5a1e2268ead6ff6ffaab672762c1eedee53f403).
08:43:02
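[For context, a hedged sketch of a minimal two-node NixOS VM test; the node and test names are made up for illustration. The test framework puts all nodes on a shared virtual network and populates /etc/hosts, so hostnames resolve for every tool; a classic cause of "ping works but curl doesn't" is the per-node firewall, which is enabled by default and must be opened for the service port.]

```nix
# Minimal two-node VM test: an nginx server and a client that curls it.
import <nixpkgs/nixos/tests/make-test-python.nix> ({ pkgs, ... }: {
  name = "two-node-curl";
  nodes = {
    server = { ... }: {
      services.nginx.enable = true;
      # Without this, ping succeeds (ICMP is allowed) but TCP connections
      # are refused/dropped by the default NixOS firewall.
      networking.firewall.allowedTCPPorts = [ 80 ];
    };
    client = { ... }: {
      environment.systemPackages = [ pkgs.curl ];
    };
  };
  testScript = ''
    start_all()
    server.wait_for_unit("nginx.service")
    server.wait_for_open_port(80)
    client.succeed("ping -c 1 server")
    client.succeed("curl --fail http://server/")
  '';
})
```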
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @connorbaker:matrix.org
Currently I don't know how I'd even model it... but I've been told that job scheduling is a well-researched problem in HPC communities ;)
I started to write something about how I think of high-level tradeoffs between choosing where to build to build moar fast, reduce the number of rebuilds (if they are at all permitted), reduce network traffic, etc. and then thought "well what if the machines aren't homogeneous" and I've decided it's time for bed.
True. I've still yet to read up on how SLURM and friends do this. Shameless plug: https://github.com/sinanmohd/evanix (slides)
12:20:00
@ss:someonex.net SomeoneSerge (back on matrix)
You should chat with picnoir too
12:20:44
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @connorbaker:matrix.org
Unrelated -- if anyone has experience with NixOS VM tests and getting multiple nodes to talk to each other, I'd appreciate pointers. ping can resolve hostnames but curl can't for some reason (https://github.com/ConnorBaker/nix-eval-graph/commit/c5a1e2268ead6ff6ffaab672762c1eedee53f403).
Should just work, what is the error?
12:22:30
@connorbaker:matrix.org connor (he/him)
In reply to @ss:someonex.net
True. I've still yet to read up on how SLURM and friends do this. Shameless plug: https://github.com/sinanmohd/evanix (slides)
Woah! Thanks for the links, I wasn't aware of these
20:17:47
19 Nov 2024
@hexa:lossy.network hexa
python-updates with numpy 2.1 has landed in staging
00:31:36
@hexa:lossy.network hexa
sowwy
00:31:40
@connorbaker:matrix.org connor (he/him)
In reply to @ss:someonex.net
Should just work, what is the error?
Curl threw connection refused or something similar; I’ll try to get the log tomorrow
06:34:11
20 Nov 2024
@conroy:corncheese.org Conroy joined the room. 04:47:44
@connorbaker:matrix.org connor (he/him)
I did not get a chance; rip
07:22:37
@damesberger:matrix.org Daniel joined the room. 18:53:01
22 Nov 2024
@deng23fdsafgea:matrix.org deng23fdsafgea joined the room. 06:27:37
@numinit:matrix.org Morgan (@numinit) joined the room. 17:52:10
24 Nov 2024
@sielicki:matrix.org sielicki
https://negativo17.org/nvidia-driver/ pretty good read
21:49:05
@sielicki:matrix.org sielicki
most of this is stuff that nixos gets right, but it's a nice collection of gotchas and solutions
22:01:49
@sielicki:matrix.org sielicki
anyone have strong opinions on moving nccl and nccl-tests out of cudaModules? Rationale for moving them out: neither one is distributed as part of the cuda toolkit and they release on an entirely separate cadence, so there's no real reason for them to be in there. It's no different than e.g. torch in terms of the cuda dependency.
22:16:05
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @sielicki:matrix.org
anyone have strong opinions on moving nccl and nccl-tests out of cudaModules? Rationale for moving them out: neither one is distributed as part of the cuda toolkit and they release on an entirely separate cadence, so there's no real reason for them to be in there. It's no different than e.g. torch in terms of the cuda dependency.
iirc we put it in there because if you set tensorflow = ...callPackage ... { cudaPackages = cudaPackages_XX_y; } you'll need to also pass a compatible nccl
22:17:33
@ss:someonex.net SomeoneSerge (back on matrix)
so it's just easier to instantiate each cudaPackages variant with its own nccl and pass it along
22:17:55
@sielicki:matrix.org sielicki
I guess that's fair, and there is a pretty strong coupling of cuda versions and nccl versions... e.g. https://github.com/pytorch/pytorch/pull/133593 has been stalled for some time due to nvidia dropping the pypi cu11 package for nccl, so there's reason to keep them consistent even if they technically release separately.
22:20:12
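[A hedged sketch of the pattern SomeoneSerge describes: because nccl lives inside each cudaPackages scope, overriding the scope carries a matching nccl along for free. The attribute names below follow current nixpkgs conventions, but the exact override arguments are illustrative, not a definitive recipe.]

```nix
{ pkgs }:
{
  # One override swaps the entire CUDA-coupled set, nccl included,
  # because nccl is an attribute of the cudaPackages scope itself:
  tensorflow-cuda11 = pkgs.python3Packages.tensorflow.override {
    cudaPackages = pkgs.cudaPackages_11;
  };
  # If nccl were a top-level package instead, every override site like the
  # one above would also need something like (hypothetical attribute):
  #   nccl = pkgs.nccl_for_cuda11;
}
```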
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @sielicki:matrix.org
https://negativo17.org/nvidia-driver/ pretty good read
Any highlights, what we might be missing?
22:22:09
@sielicki:matrix.org sielicki
honestly I am not sure there's anything, I just like the thought that went into it
22:27:21
@sielicki:matrix.org sielicki
the special softdep for nvidia-uvm etc
22:27:48
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @sielicki:matrix.org
the special softdep for nvidia-uvm etc
yeah we have that, and iirc a special-case for the datacenter driver where it's not a softdep anymore
22:28:24
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @sielicki:matrix.org
the special softdep for nvidia-uvm etc
* yeah we have that, and iirc a special-case for the datacenter driver where it's not a softdep anymore (not sure what the exact situation is)
22:29:12
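[For readers unfamiliar with the softdep being discussed, a hedged sketch of the mechanism as NixOS module config. The exact directive the negativo17 packages or the NixOS driver module ship may differ; this only shows the modprobe.d `softdep` syntax for pulling in nvidia-uvm after the main module.]

```nix
{
  # Ask modprobe to load nvidia-uvm after nvidia (a soft dependency:
  # failure to load it does not fail the nvidia module itself).
  boot.extraModprobeConfig = ''
    softdep nvidia post: nvidia-uvm
  '';
}
```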


