
NixOS CUDA

289 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda

58 Servers



13 Feb 2025
@ss:someonex.netSomeoneSerge (back on matrix)

Regarding scheduling the future meetings,

  • we should probably aim to meet in 2-4 weeks to follow up on the patchelf exception and for a report on the ephemeral builders situation;
  • we can probably first bring up the alignment questions with nix-community just in their chat, without video because async is faster;
  • additionally, I think I should have hours this and next week to sort the backlog as mentioned in the notes; I think it'd still be useful, for onboarding new people, to do that with the audio and the screenshare, but it's not worth synchronizing people's schedules for this; maybe it'll be just a pop-in format?
15:38:04
@ss:someonex.netSomeoneSerge (back on matrix)(jaja, maybe we do this in Gaetan's twitch?)15:38:26
@glepage:matrix.orgGaétan LepageSure haha15:46:34
14 Feb 2025
@connorbaker:matrix.orgconnor (he/him)As of a few days ago Onnxruntime requires CUDA separable compilation… so I guess I gotta fix that now 🙃01:50:24
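For reference, separable compilation means device code is compiled as relocatable and joined in an extra device-link step, which the packaging has to accommodate. A minimal sketch of the flow outside Nix, with hypothetical file names:

nvcc -dc kernel_a.cu -o kernel_a.o    # -dc emits relocatable device code
nvcc -dc kernel_b.cu -o kernel_b.o
nvcc -dlink kernel_a.o kernel_b.o -o device_link.o    # the extra device-link step
g++ main.cc kernel_a.o kernel_b.o device_link.o -lcudadevrt -lcudart -o app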
@ss:someonex.netSomeoneSerge (back on matrix)

RE: CI infra/yesterday's meeting
CC connor (he/him) (UTC-8):

By the way, while on my side I'm advertising both options for provisioning hardware, the spot instances and the owned hardware, I think we might want to incentivize companies to commit to supporting the latter path. While it's obviously more work, both organisational and engineering, it is a much better long-term promise for the community. With rented hardware, if two or three companies simultaneously decide to withdraw, we basically have to scale down the CI immediately. If we buy hardware for a non-profit and a few years later some companies decide they're not interested anymore, we maybe lose a retainer covering the maintenance work. With our own hardware we can also be more flexible and maybe dedicate some machines to be used as community builders/devboxes for ad hoc experimentation.

11:15:16
@zopieux:matrix.zopi.euzopieux

It's me again :-) This time I have a genuinely surprising behavior from the community cache (the substituters are correctly configured): nccl was successfully built (derivation mv02…), the narinfo is available, but upon nix-shell -p cudaPackages_12.nccl I get

this derivation will be built:
  /nix/store/mv02rgvrhw9n1682dw7vs8w3pssc24lr-nccl-2.21.5-1.drv
(lots of compiling)

Others, like cudaPackages.cudnn, are successfully retrieved from the cache.

17:58:45
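One way to debug this kind of miss, assuming the community cache is the cuda-maintainers Cachix instance (an assumption, it isn't named above): the narinfo is keyed by the output path's hash, not the .drv's, and a locally evaluated .drv that differs from CI's (e.g. a different nixpkgs rev) yields different output paths. A sketch:

drv=/nix/store/mv02rgvrhw9n1682dw7vs8w3pssc24lr-nccl-2.21.5-1.drv
for out in $(nix-store --query --outputs "$drv"); do
  # ask the cache whether it actually serves this output path:
  nix path-info --store https://cuda-maintainers.cachix.org "$out"
done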
@ruroruro:matrix.orgruro

So, uh... I just noticed that CUDA versions prior to 11.4 don't have the individual redistributables (for example, there is no cudaPackages_11_3.cuda_cudart).

Unfortunately, I only noticed this after refactoring cuda-samples to use the individual packages instead of cudatoolkit. sigh

21:12:48
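A quick way to confirm which package sets carry the split redistributables (a sketch; the attribute check itself doesn't force a build):

nix eval --impure --expr \
  'let pkgs = import <nixpkgs> { config.allowUnfree = true; }; in
   pkgs.cudaPackages_11_3 ? cuda_cudart'
# prints false: pre-11.4 package sets only expose the monolithic cudatoolkit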
15 Feb 2025
@zowoq:matrix.orgzowoq joined the room.00:48:50
@zowoq:matrix.orgzowoq

we can probably first bring up the alignment questions with nix-community just in their chat

We could do it here if you like. I think that between Jonas Chevalier and me we can represent nix-community, and the discussion is probably of more interest to the people in this room; we can post a summary in the nix-community Matrix.

00:49:34
@zowoq:matrix.orgzowoqhttps://github.com/NixOS/rfcs/pull/185 I discovered this RFC a day ago; I don't think it has been mentioned here yet?00:49:48
@niederman:matrix.orgMax Niederman joined the room.03:10:37
@justbrowsing:matrix.orgKevin Mittman (EOY sleep)
In reply to @ruroruro:matrix.org

So, uh... I just noticed that CUDA versions prior to 11.4 don't have the individual redistributables (for example, there is no cudaPackages_11_3.cuda_cudart).

Unfortunately, I only noticed this after refactoring cuda-samples to use the individual packages instead of cudatoolkit. sigh

How far back are you looking?
04:03:27
@connorbaker:matrix.orgconnor (he/him) Apparently both Hydra and Nix support dynamic machine lists: https://github.com/NixOS/nix/issues/523#issuecomment-559516338
Here’s the code for Hydra: https://github.com/NixOS/hydra/blob/51944a5fa5696cf78043ad2d08934a91fb89e986/src/hydra-queue-runner/hydra-queue-runner.cc#L178
I assume you could have a script which provisions new machines and adds them to the list of remote builders, assuming you store the list of machines somewhere you can mutate it
09:08:45
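For reference, the machines file that the queue runner re-reads has one builder per line, so a provisioning script can simply rewrite it as spot instances come and go. A sketch with made-up hosts and key paths:

# Fields: URI system ssh-key max-jobs speed-factor supported-features mandatory-features host-key
cat > /etc/nix/machines <<'EOF'
ssh://builder@gpu-1.example.org x86_64-linux /etc/nix/builder_ed25519 16 1 big-parallel,cuda - -
ssh://builder@gpu-2.example.org x86_64-linux /etc/nix/builder_ed25519 16 1 big-parallel,cuda - -
EOF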
@connorbaker:matrix.orgconnor (he/him)I forget if Azure's placement groups allow adding more machines after the initial group, but if they do, that makes NFS over RDMA available at 200 to 400 Gbps depending on instance type (precious HBv3/4 instances)09:11:31
@ss:someonex.netSomeoneSerge (back on matrix) Yes, I haven't read the code yet, but I think this is just the normal Nix remote builder protocol (unaware of any locality), and I suspect we still have to conjure something up to avoid the cold-store issue, which must be more prominent with ephemeral builders than with permanent ones 09:24:58
@ss:someonex.netSomeoneSerge (back on matrix) Oh... I think I saw the discourse email, but was busy at the time and then completely forgot about this RFC 09:26:22
@ss:someonex.netSomeoneSerge (back on matrix)

Great, thanks! So, the question essentially is: we (I think I say this with the cuda team hat on) can and want to scale up the CI for testing CUDA-enabled packages, both by increasing the number of builders and by adding GPU instances. We want to build many more variants of nixpkgs for different architectures and, ideally, run tests across a matrix of co-processor devices. For obvious reasons, we want the infra to be owned by a transparent, community-aligned entity with diversified funding, like nix-community. If this were to be done in nix-community, we'd have to do some work upfront, like ensuring sufficiently smart scheduling so as not to jam other jobsets hosted by the organization. This would also probably increase the maintenance workload. This also raises questions about the scope of nix-community: how niche and how large of a project is acceptable? E.g. if nix-community does some GPU hardware stuff, why not also mobile, IoT, FPGA? Etc. If we decide that buying physical hardware is in scope, we need to figure out how to manage the inventory and how to manage trust.

Despite all that, I do like the notion of doing this through nix-community, because it's already up and running, it has a compatible structure, and it's already a recognized name.

09:41:36
@zowoq:matrix.orgzowoqIs this only for testing or is serving a cache also a goal?10:11:13
@ss:someonex.netSomeoneSerge (back on matrix) Well, from whose perspective? From the PoV of the community, definitely a goal. As far as selling this idea to commercial entities goes, they couldn't care less, but we should advertise it as a prerequisite, because we need a cache to make development/maintenance reasonably efficient, and it might as well be a public cache 10:50:21
@zowoq:matrix.orgzowoqI imagine that the amount and size of builds would make cachix or other cloud storage unfeasible. If it was only a dev cache we could probably get away with just serving it off the CI master; if it was a proper public cache with a non-trivial number of users we'd probably want a dedicated machine (or more than one if you want to keep the cache around for a while).11:05:52
@glepage:matrix.orgGaétan LepageLet's fill a rack with compute and storage!11:07:01
@glepage:matrix.orgGaétan Lepage
In reply to @zowoq:matrix.org
I imagine that the amount and size of builds would make cachix or other cloud storage unfeasible. If it was only a dev cache we could probably get away with just serving it off the CI master; if it was a proper public cache with a non-trivial number of users we'd probably want a dedicated machine (or more than one if you want to keep the cache around for a while).
I guess it would be more of a dev cache.
11:07:18
@zowoq:matrix.orgzowoq

I think we're probably close to users having problems with the cachix cache expiring too quickly. Our setup doesn't allow us to selectively push to the cache, everything that goes through CI gets pushed. Would need to move these jobs to another hydra/machine but that would also avoid needing to deal with this:

ensuring sufficiently smart scheduling to not jam other jobsets

11:13:25
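Serving a dev cache off the CI master is indeed cheap to start with; a sketch using nix-serve (host name and key paths are made up):

# one-time: generate a signing key pair
nix-store --generate-binary-cache-key ci.example.org-1 cache-priv-key.pem cache-pub-key.pem
# serve the local store over HTTP
NIX_SECRET_KEY_FILE=$PWD/cache-priv-key.pem nix-serve --port 5000
# clients then set:
#   substituters = http://ci.example.org:5000
#   trusted-public-keys = <contents of cache-pub-key.pem>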
@glepage:matrix.orgGaétan Lepage One thing that SomeoneSerge (UTC+U[-12,12]) touched on was to encourage companies to contribute (financially) to nix-community. 11:14:27
@zowoq:matrix.orgzowoqIt'll be less initial setup if a proper public cache isn't needed.11:15:28
@ss:someonex.netSomeoneSerge (back on matrix)

I imagine that the amount and size of builds would make cachix or other cloud storage unfeasible

True. Maybe we should make setting up a tvix nar-bridge as a substituter a separate task, so that a public cache can still be a thing xD

11:29:04
@ss:someonex.netSomeoneSerge (back on matrix)

Would need to move these jobs to another hydra/machine but that would also avoid needing to deal with this:

Indeed, that's one way to get started at least

11:32:00


