!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

289 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda
58 Servers



15 Feb 2025
@justbrowsing:matrix.org Kevin Mittman (EOY sleep)
In reply to @ruroruro:matrix.org

So, uh... I just noticed that CUDA versions prior to 11.4 don't have the individual redistributables (for example, there is no cudaPackages_11_3.cuda_cudart).

Unfortunately, I only noticed this after refactoring cuda-samples to use the individual packages instead of cudatoolkit. sigh

How far back are you looking?
04:03:27
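(For context, the refactor being discussed can be sketched as below. This is a hedged illustration, not the actual cuda-samples expression: the `before`/`after` attribute names and the derivation are made up, though `cuda_nvcc`, `cuda_cudart`, and `libcublas` are real `cudaPackages` attributes. The `after` form only works for CUDA >= 11.4, since 11.0-11.3 ship no per-component redistributables.)

```nix
# Hedged sketch: moving a derivation off the monolithic cudatoolkit onto
# individual redistributables. The derivation itself is invented for
# illustration; only the cudaPackages attribute names are real.
{
  # Before: everything comes from the single cudatoolkit output.
  before = { stdenv, cudatoolkit }: stdenv.mkDerivation {
    pname = "cuda-samples";
    version = "0.0.0";
    buildInputs = [ cudatoolkit ];
  };

  # After: only the pieces actually needed. Works for cudaPackages >= 11.4;
  # 11.0-11.3 have no split redistributables, hence the breakage noted above.
  after = { stdenv, cudaPackages }: stdenv.mkDerivation {
    pname = "cuda-samples";
    version = "0.0.0";
    nativeBuildInputs = [ cudaPackages.cuda_nvcc ];
    buildInputs = [ cudaPackages.cuda_cudart cudaPackages.libcublas ];
  };
}
```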
@connorbaker:matrix.org connor (he/him) Apparently both Hydra and Nix support dynamic machine lists: https://github.com/NixOS/nix/issues/523#issuecomment-559516338
Here’s the code for Hydra: https://github.com/NixOS/hydra/blob/51944a5fa5696cf78043ad2d08934a91fb89e986/src/hydra-queue-runner/hydra-queue-runner.cc#L178
I assume you could have a script which provisions new machines and adds them to the list of remote builders, assuming you store the list of machines somewhere you can mutate it
09:08:45
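(A minimal sketch of what the mutable builder list could look like on the NixOS side, assuming the standard `nix.buildMachines` options; hostnames, key paths, and features below are placeholders:)

```nix
# Hedged sketch: declaring remote builders on NixOS. A provisioning script
# could regenerate this list (or the /etc/nix/machines file it produces)
# as ephemeral instances come and go. All concrete values are placeholders.
{
  nix.distributedBuilds = true;
  nix.buildMachines = [
    {
      hostName = "builder1.example.org";
      system = "x86_64-linux";
      sshUser = "nix";
      sshKey = "/etc/nix/builder-key";
      maxJobs = 8;
      speedFactor = 2;
      supportedFeatures = [ "big-parallel" "cuda" ];
    }
  ];
  # Let the builders fetch dependencies from substituters themselves
  # instead of copying everything from the coordinator.
  nix.settings.builders-use-substitutes = true;
}
```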
@connorbaker:matrix.org connor (he/him) I forget if azure's placement groups allow adding more machines after the initial group, but if they do, that makes NFS over RDMA available at 200 to 400 Gbps depending on instance type (precious HBv3/4 instances) 09:12:56
@ss:someonex.net SomeoneSerge (back on matrix) Yes, I didn't read the code yet, but I think this is just the normal Nix remote builder protocol (unaware of any locality), and I suspect we still have to conjure something up to avoid the cold-store issue, which must be more prominent with ephemeral builders than with permanent ones. 09:24:58
@ss:someonex.net SomeoneSerge (back on matrix) Oh... I think I saw the discourse email, but was busy at the time and then completely forgot about this RFC 09:26:22
@ss:someonex.net SomeoneSerge (back on matrix)

Great, thanks! So, the question essentially is: we (I think I say this with the CUDA team hat on) can and want to scale up the CI for testing CUDA-enabled packages, both by increasing the number of builders and by adding GPU instances. We want to build many more variants of nixpkgs for different architectures and, ideally, run tests across a matrix of co-processor devices. For obvious reasons, we want the infra to be owned by a transparent, community-aligned entity with diversified funding - like nix-community. If this were to be done in nix-community, we'd have to do some work upfront, like ensuring sufficiently smart scheduling so as not to jam other jobsets hosted by the organization. This would also probably increase the maintenance workload. It also raises questions about the scope of nix-community: how niche and how large a project is acceptable? E.g. if nix-community does some GPU hardware stuff, why not also mobile, IoT, FPGA? Etc. If we decide that buying physical hardware is in scope, we need to figure out how to manage the inventory and how to manage trust.

Despite all that, I do like the notion of doing this through nix-community, because it's already up and running, it has a compatible structure, and it's already a recognized name.

09:41:36
@zowoq:matrix.org zowoq Is this only for testing, or is serving a cache also a goal? 10:11:13
@ss:someonex.net SomeoneSerge (back on matrix) Well, from whose perspective? From the PoV of the community, definitely a goal. As far as selling this idea to commercial entities goes, they couldn't care less, but we should advertise it as a prerequisite, because we need a cache to make development/maintenance reasonably efficient, and it might as well be a public cache 10:50:21
@zowoq:matrix.org zowoq I imagine that the amount and size of builds would make cachix or other cloud storage unfeasible. If it was only a dev cache, we could probably get away with just serving it off the CI master; if it was a proper public cache with a non-trivial number of users, we'd probably want a dedicated machine (or more than one if you want to keep the cache around for a while). 11:05:52
@glepage:matrix.org Gaétan Lepage Let's fill a rack with compute and storage! 11:07:01
@glepage:matrix.org Gaétan Lepage
In reply to @zowoq:matrix.org
I imagine that the amount and size of builds would make cachix or other cloud storage unfeasible. If it was only a dev cache, we could probably get away with just serving it off the CI master; if it was a proper public cache with a non-trivial number of users, we'd probably want a dedicated machine (or more than one if you want to keep the cache around for a while).
I guess it would be more of a dev cache.
11:07:21
@zowoq:matrix.org zowoq

I think we're probably close to users having problems with the cachix cache expiring too quickly. Our setup doesn't allow us to selectively push to the cache; everything that goes through CI gets pushed. We would need to move these jobs to another hydra/machine, but that would also avoid needing to deal with this:

ensuring sufficiently smart scheduling to not jam other jobsets

11:13:25
@glepage:matrix.org Gaétan Lepage One thing that SomeoneSerge (UTC+U[-12,12]) touched on was to encourage companies to contribute (financially) to nix-community. 11:15:12
@zowoq:matrix.org zowoq It'll be less initial setup if a proper public cache isn't needed. 11:15:28
@ss:someonex.net SomeoneSerge (back on matrix)

I imagine that the amount and size of builds would make cachix or other cloud storage unfeasible

True. Maybe we should allocate setting up a tvix nar-bridge as a substituter as a separate task, so that a public cache can still be a thing xD

11:29:04
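(Whatever backend serves the cache, consuming it client-side is just substituter configuration. A minimal sketch, assuming standard `nix.settings` options; the cache URL and public key below are invented placeholders, not a real nix-community CUDA cache:)

```nix
# Hedged sketch: pointing NixOS clients at an extra binary cache.
# The URL and public key are placeholders for illustration only.
{
  nix.settings = {
    extra-substituters = [ "https://cuda-cache.example.org" ];
    extra-trusted-public-keys = [
      "cuda-cache.example.org-1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="
    ];
  };
}
```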
@ss:someonex.net SomeoneSerge (back on matrix)

Would need to move these jobs to another hydra/machine but that would also avoid needing to deal with this:

Indeed, that's one way to get started at least

11:32:00
@ss:someonex.net SomeoneSerge (back on matrix) But with the intention of building a scalable, persistent-ish cache later 11:32:56
@zowoq:matrix.org zowoq Running another dedicated machine (e.g. a cheapish Hetzner box) with just hydra and harmonia for the cache isn't a problem, and spot instances for builders wouldn't be much maintenance overhead. Scope, funding, etc. are questions that I'll leave for @zimbatm. 12:27:05
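(A rough sketch of that "hydra plus harmonia on one box" shape; option names are from memory and may differ between nixpkgs releases, and all hostnames and paths are placeholders:)

```nix
# Hedged sketch: one machine running Hydra for CI with harmonia serving the
# local store as a binary cache. Hostnames and key paths are placeholders,
# and exact option names may vary across nixpkgs versions.
{
  services.hydra = {
    enable = true;
    hydraURL = "https://hydra.example.org";
    notificationSender = "hydra@example.org";
    useSubstitutes = true;
  };

  services.harmonia = {
    enable = true;
    # Sign served paths; a key pair can be generated with
    # `nix-store --generate-binary-cache-key`.
    signKeyPaths = [ "/var/lib/secrets/harmonia.secret" ];
  };
}
```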
@ruroruro:matrix.org ruro
Not 100% sure what you mean? The problematic CUDA versions are 11.0-11.3.
11.4 and later have individual redistributables. 10.x are already deprecated/removed from nixpkgs, so no need to worry about those.
18:22:29
@connorbaker:matrix.org connor (he/him)
In reply to @ruroruro:matrix.org
Not 100% sure what you mean? The problematic CUDA versions are 11.0-11.3. 11.4 and later have individual redistributables. 10.x are already deprecated/removed from nixpkgs, so no need to worry about those.
For what it's worth, 11.x will be removed prior to 25.05, from what I remember
19:34:06
@indoor_squirrel:matrix.org indoor_squirrel joined the room. 19:48:53
@indoor_squirrel:matrix.org indoor_squirrel
In reply to @ss:someonex.net

Great, thanks! So, the question essentially is: we (I think I say this with the CUDA team hat on) can and want to scale up the CI for testing CUDA-enabled packages, both by increasing the number of builders and by adding GPU instances. We want to build many more variants of nixpkgs for different architectures and, ideally, run tests across a matrix of co-processor devices. For obvious reasons, we want the infra to be owned by a transparent, community-aligned entity with diversified funding - like nix-community. If this were to be done in nix-community, we'd have to do some work upfront, like ensuring sufficiently smart scheduling so as not to jam other jobsets hosted by the organization. This would also probably increase the maintenance workload. It also raises questions about the scope of nix-community: how niche and how large a project is acceptable? E.g. if nix-community does some GPU hardware stuff, why not also mobile, IoT, FPGA? Etc. If we decide that buying physical hardware is in scope, we need to figure out how to manage the inventory and how to manage trust.

Despite all that, I do like the notion of doing this through nix-community, because it's already up and running, it has a compatible structure, and it's already a recognized name.

To this end, would public financial support for this project allow for anonymous contributions?
19:52:24
@ss:someonex.net SomeoneSerge (back on matrix) I think OpenCollective allows anonymous donations (in the sense of hiding the source from the public, but not from the project owners) 19:53:21
@indoor_squirrel:matrix.org indoor_squirrel
In reply to @zowoq:matrix.org
I imagine that the amount and size of builds would make cachix or other cloud storage unfeasible. If it was only a dev cache, we could probably get away with just serving it off the CI master; if it was a proper public cache with a non-trivial number of users, we'd probably want a dedicated machine (or more than one if you want to keep the cache around for a while).
All substituter implementations today are centralized, right? It'd be neat to build one on top of IPFS, or similar, for example. In my head, no one host would control any one store path, and something like Shamir secret sharing could support k of n hosts being able to sign a store path.
19:55:02
@indoor_squirrel:matrix.org indoor_squirrel
In reply to @ss:someonex.net
I think OpenCollective allows anonymous donations (in the sense of hiding the source from the public, but not from the project owners)
This is potentially dangerous. Is there any effort that you know of, for nix-community or you guys, to work toward a funding solution which obscures this information even from OpenCollective, much less the project owners?
19:56:34
