NixOS CUDA | 289 Members | |
| CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda | 57 Servers |
| Sender | Message | Time |
|---|---|---|
| 18 Nov 2024 | ||
| Anyway, in the interest of splitting my attention ever more thinly, I decided to start trying to work on some approach toward evaluation of derivations and building them | 05:18:41 | |
| Because why have one project when you can have many? | 05:18:55 | |
| https://github.com/ConnorBaker/nix-eval-graph And I've decided to write it in Rust, which I am teaching myself. And I'll probably use a graph database, because why not. And I'll use NixOS tests for integration testing, because also why not. | 05:20:02 | |
| All this is to say I am deeply irritated when I see my builders copying around gigantic CUDA libraries constantly. | 05:20:31 | |
| Unrelated to closure woes, I tried to package https://github.com/NVIDIA/MatX and https://github.com/NVIDIA/nvbench and nearly pulled my hair out. If anyone has suggestions for doing so without creating a patched and vendored copy of https://github.com/rapidsai/rapids-cmake or writing my own CMake for everything, I’d love to hear! | 05:23:26 | |
| Also, anyone know how the ROCm maintainers are doing? | 05:26:35 | |
| In reply to @connorbaker:matrix.org: Awesome! I've been bracing myself to look into that too. What's your current idea regarding costs and locality? | 07:09:42 | |
| In reply to @connorbaker:matrix.org: We'd need to do that if we were to package rapids itself too, wouldn't we? | 07:11:11 | |
| In reply to @ss:someonex.net: Currently I don't know how I'd even model it... but I've been told that job scheduling is a well-researched problem in HPC communities ;) I started to write something about how I think of the high-level tradeoffs between choosing where to build to build moar fast, reducing the number of rebuilds (if they are at all permitted), reducing network traffic, etc., and then thought "well, what if the machines aren't homogeneous?" and decided it was time for bed. | 08:40:34 | |
| In reply to @ss:someonex.net: I have been avoiding rapids so hard lmao 🙅♂️ | 08:40:49 | |
| Unrelated -- if anyone has experience with NixOS VM tests and getting multiple nodes to talk to each other, I'd appreciate pointers. ping can resolve hostnames but curl can't for some reason (https://github.com/ConnorBaker/nix-eval-graph/commit/c5a1e2268ead6ff6ffaab672762c1eedee53f403). | 08:43:02 | |
| In reply to @connorbaker:matrix.org: True. I still have yet to read up on how SLURM and friends do this. Shameless plug: https://github.com/sinanmohd/evanix (slides) | 12:20:00 | |
| You should chat with picnoir too | 12:20:44 | |
| In reply to @connorbaker:matrix.org: Should just work -- what is the error? | 12:22:30 | |
| In reply to @ss:someonex.net: Woah! Thanks for the links, I wasn't aware of these | 20:17:47 | |
| 19 Nov 2024 | ||
| python-updates with numpy 2.1 has landed in staging | 00:31:36 | |
| sowwy | 00:31:40 | |
| In reply to @ss:someonex.net: Curl threw "connection refused" or something similar; I'll try to get the log tomorrow | 06:34:11 | |
| 20 Nov 2024 | ||
| I did not get a chance; rip | 07:22:37 | |
| 24 Nov 2024 | ||
| https://negativo17.org/nvidia-driver/ -- pretty good read | 21:49:05 | |
| most of this is stuff that NixOS gets right, but it's a nice collection of gotchas and solutions | 22:01:49 | |
| Anyone have strong opinions on moving nccl and nccl-tests out of cudaPackages? Rationale for moving them out: neither one is distributed as part of the CUDA toolkit, and they release on an entirely separate cadence, so there's no real reason for them to be in there. They're no different from, e.g., torch in terms of the CUDA dependency. | 22:16:05 | |
| In reply to @sielicki:matrix.org: iirc we put it in there because if you set tensorflow = ...callPackage ... { cudaPackages = cudaPackages_XX_y; } you'll need to also pass a compatible nccl | 22:17:33 | |
| so it's just easier to instantiate each cudaPackages variant with its own nccl and pass it along | 22:17:55 | |
| I guess that's fair, and there is a pretty strong coupling of CUDA versions and nccl versions... e.g. https://github.com/pytorch/pytorch/pull/133593 has been stalled for some time due to NVIDIA dropping the PyPI cu11 package for nccl, so there's reason to keep them consistent even if they technically release separately. | 22:20:12 | |
| In reply to @sielicki:matrix.org: Any highlights -- what might we be missing? | 22:22:09 | |
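The cost/locality tradeoff raised on 18 Nov (build on the fastest machine vs. avoid copying gigantic CUDA closures around) can be sketched as a toy greedy placement heuristic. Everything below -- the names, the cost model, the numbers -- is a hypothetical illustration, not how nix-eval-graph or any real Nix scheduler works:

```python
# Toy model: pick a builder by minimizing estimated build time plus the
# time to transfer whichever dependencies that builder is still missing.
from dataclasses import dataclass, field

@dataclass
class Builder:
    name: str
    speed: float                  # relative CPU speed (1.0 = baseline)
    bandwidth_mb_s: float         # link speed for copying store paths
    store: set = field(default_factory=set)  # store paths already local

def pick_builder(cpu_cost_s, deps, dep_sizes_mb, builders):
    """Greedy placement for a single derivation."""
    def est_cost(b):
        missing_mb = sum(dep_sizes_mb[d] for d in deps if d not in b.store)
        return cpu_cost_s / b.speed + missing_mb / b.bandwidth_mb_s
    best = min(builders, key=est_cost)
    best.store.update(deps)       # the copied deps now live on that builder
    return best.name
```

With a 4 GB CUDA closure already present on a slower machine, locality can beat raw speed -- which is exactly why constantly re-copying big CUDA libraries hurts. Heterogeneous machines show up here as differing `speed`/`bandwidth` values; the hard part the message alludes to is doing this for a whole graph, not one derivation.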
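On the VM-test question ("ping can resolve hostnames but curl can't"): a minimal two-node sketch, assuming the usual culprits -- the per-node firewall (on by default, blocks TCP while ICMP ping still works) or a service bound only to 127.0.0.1. All names here are illustrative, not the actual nix-eval-graph test:

```nix
# Minimal two-node NixOS VM test. The test driver writes every node's
# attribute name into /etc/hosts, which is why ping already resolves
# "server" without any extra configuration.
{ pkgs, ... }:
pkgs.nixosTest {
  name = "two-node-http";
  nodes = {
    server = { ... }: {
      services.nginx.enable = true;
      # Classic "ping works, curl doesn't" cause: the default firewall
      # drops unopened TCP ports. Open the port explicitly.
      networking.firewall.allowedTCPPorts = [ 80 ];
    };
    client = { pkgs, ... }: {
      environment.systemPackages = [ pkgs.curl ];
    };
  };
  testScript = ''
    start_all()
    server.wait_for_unit("nginx.service")
    server.wait_for_open_port(80)
    client.succeed("curl --fail http://server/")
  '';
}
```

If the port is open and curl still reports "connection refused", check that the service listens on `0.0.0.0` (or the VLAN interface) rather than loopback only.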
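The nccl scoping argument above (keep nccl inside each cudaPackages set so it always matches the chosen CUDA release) can be sketched like this; the file name and the exact version attribute are hypothetical:

```nix
# Sketch of the scoping pattern described in the messages above.
# ./tensorflow.nix stands in for any expression taking cudaPackages.
{
  tensorflow-cu11 = pkgs.callPackage ./tensorflow.nix {
    cudaPackages = pkgs.cudaPackages_11_8;
    # Inside tensorflow.nix, cudaPackages.nccl is the nccl built for
    # CUDA 11.8 -- nothing extra to pass or keep in sync by hand.
  };

  # If nccl lived at the top level instead, every such override would
  # also need a matching nccl passed explicitly:
  #
  #   tensorflow-cu11 = pkgs.callPackage ./tensorflow.nix {
  #     cudaPackages = pkgs.cudaPackages_11_8;
  #     nccl = pkgs.ncclForCuda11;  # easy to forget or mismatch
  #   };
}
```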