!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

290 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda

58 Servers



13 Feb 2025
@glepage:matrix.org Gaétan Lepage

Sure haha

15:46:34
14 Feb 2025
@connorbaker:matrix.org connor (burnt/out) (UTC-8)

As of a few days ago Onnxruntime requires CUDA separable compilation… so I guess I gotta fix that now 🙃

01:50:24
@ss:someonex.net SomeoneSerge (back on matrix)

RE: CI infra/yesterday's meeting
CC connor (he/him) (UTC-8):

By the way, while on my side I'm advertising both options for provisioning hardware, the spot instances and the owned hardware, I think we might want to incentivize companies to commit to supporting the latter path. While it's obviously more work, organisational and engineering, it is a much better long-term promise for the community. With the rented hardware, if two or three companies simultaneously decide to withdraw, we basically have to immediately scale down the CI. If we buy hardware for a non-profit and a few years later some companies decide they're not interested anymore, we maybe lose a retainer covering the maintenance work. With our own hardware we can also be more flexible and maybe dedicate some machines to be used as community builders/devboxes for ad hoc experimentation.

11:15:16
@zopieux:matrix.zopi.eu zopieux

It's me again :-) This time I have a genuinely surprising behavior from the community cache (the substituters are correctly configured): nccl was successfully built (derivation mv02…), the narinfo is available, but upon nix-shell -p cudaPackages_12.nccl I get

this derivation will be built:
  /nix/store/mv02rgvrhw9n1682dw7vs8w3pssc24lr-nccl-2.21.5-1.drv
(lots of compiling)

Others, like cudaPackages.cudnn, are successfully retrieved from the cache.

17:58:45
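One way to double-check a report like the one above is to work out by hand which narinfo URL the cache would need to serve. A minimal Python sketch, assuming the usual binary cache layout in which `<cache>/<hash>.narinfo` is keyed by the hash part of the output store path rather than the `.drv` path. The function name and the cache URL in the usage note are placeholders, not values from this conversation.

```python
import re

# Nix store paths look like /nix/store/<32-char hash>-<name>.
# The character class below is Nix's base32 alphabet (no e, o, t, u).
STORE_PATH_RE = re.compile(r"^/nix/store/([0-9a-df-np-sv-z]{32})-(.+)$")


def narinfo_url(cache_url: str, store_path: str) -> str:
    """Return the URL a binary cache would serve the narinfo from (hypothetical helper)."""
    match = STORE_PATH_RE.match(store_path)
    if match is None:
        raise ValueError(f"not a Nix store path: {store_path!r}")
    digest, name = match.groups()
    if name.endswith(".drv"):
        # Caches are queried by output path; a derivation path has a different hash.
        raise ValueError("expected an output path, got a derivation (.drv) path")
    return f"{cache_url.rstrip('/')}/{digest}.narinfo"
```

Calling `narinfo_url("https://cache.example.org", out_path)` with the output path of the package (not the `.drv` printed in the "will be built" message) gives the URL to fetch. If that URL is missing while the build is known to have succeeded, the locally evaluated output path probably differs from the one that was built and uploaded.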
@ruroruro:matrix.org ruro

So, uh... I just noticed that CUDA versions prior to 11.4 don't have the individual redistributables (for example, there is no cudaPackages_11_3.cuda_cudart).

Unfortunately, I only noticed this after refactoring cuda-samples to use the individual packages instead of cudatoolkit. sigh

21:12:48
15 Feb 2025
@zowoq:matrix.org zowoq joined the room. 00:48:50
@zowoq:matrix.org zowoq

we can probably first bring up the alignment questions with nix-community just in their chat

We could do it here if you like. I think that between Jonas Chevalier and me we can represent nix-community, and the discussion is probably of more interest to the people in this room. We can post a summary in the nix-community matrix.

00:49:34
@zowoq:matrix.org zowoq

https://github.com/NixOS/rfcs/pull/185 I discovered this RFC a day ago, I don't think it has been mentioned here yet?

00:49:48
@niederman:matrix.org Max Niederman joined the room. 03:10:37
@justbrowsing:matrix.org Kevin Mittman (UTC-8)
In reply to @ruroruro:matrix.org

How far back are you looking for?
04:03:27
@connorbaker:matrix.org connor (burnt/out) (UTC-8)

Apparently both Hydra and Nix support dynamic machine lists: https://github.com/NixOS/nix/issues/523#issuecomment-559516338
Here’s the code for Hydra: https://github.com/NixOS/hydra/blob/51944a5fa5696cf78043ad2d08934a91fb89e986/src/hydra-queue-runner/hydra-queue-runner.cc#L178
I assume you could have a script which provisions new machines and adds them to the list of remote builders, assuming you store the list of machines somewhere you can mutate it
09:08:45
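The provisioning idea above could be sketched roughly as follows. This is a minimal Python sketch, assuming the list of builders lives in a plain Nix machines file (one builder per line: URI, systems, SSH key, max jobs, speed factor, supported features, mandatory features, with `-` for omitted fields) and that whatever consumes the file re-reads it after it changes. The `Builder` and `add_builder` names, the host name, and the `cuda` feature in the usage note are illustrative assumptions, not something taken from the linked code.

```python
import os
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Builder:
    """One remote builder entry (hypothetical helper type)."""
    uri: str
    systems: tuple[str, ...] = ("x86_64-linux",)
    ssh_key: str = "-"
    max_jobs: int = 1
    speed_factor: int = 1
    supported_features: tuple[str, ...] = ()
    mandatory_features: tuple[str, ...] = ()

    def to_line(self) -> str:
        # Field order follows the Nix machines file format.
        return " ".join([
            self.uri,
            ",".join(self.systems),
            self.ssh_key,
            str(self.max_jobs),
            str(self.speed_factor),
            ",".join(self.supported_features) or "-",
            ",".join(self.mandatory_features) or "-",
        ])


def add_builder(machines_file: Path, builder: Builder) -> bool:
    """Append a builder unless its URI is already listed; return True if added."""
    existing = machines_file.read_text().splitlines() if machines_file.exists() else []
    if any(line.split()[:1] == [builder.uri] for line in existing):
        return False
    existing.append(builder.to_line())
    # Write to a temp file and rename, so readers never see a half-written list.
    fd, tmp = tempfile.mkstemp(dir=machines_file.parent, prefix=".machines-")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(existing) + "\n")
    os.chmod(tmp, 0o644)  # mkstemp creates the file 0600; make it readable again
    os.replace(tmp, machines_file)
    return True
```

A provisioning script would call something like `add_builder(Path("/etc/nix/machines"), Builder("ssh://builder@gpu-1.example.org", max_jobs=4, supported_features=("cuda",)))` once a new instance is reachable, and a matching removal step would be needed when the instance goes away.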
@connorbaker:matrix.org connor (burnt/out) (UTC-8)

I forget if Azure’s placement groups allow adding more machines after the initial group, but if they do, that makes NFS over RDMA available at 200 to 400 Gbps depending on instance type (precious HBv3/4 instances)

09:12:56
@ss:someonex.net SomeoneSerge (back on matrix)

Yes, I didn't read the code yet, but I think this is just the normal Nix remote builder protocol (unaware of any locality), and I suspect we still have to conjure something up to avoid the cold store issue, which must be more prominent with ephemeral builders than with permanent ones

09:24:58
@ss:someonex.net SomeoneSerge (back on matrix)

Oh... I think I saw the Discourse email, but was busy at the time and then completely forgot about this RFC

09:26:22
