| 13 Feb 2025 |
Gaétan Lepage | Sure haha | 15:46:34 |
| 14 Feb 2025 |
connor (burnt/out) (UTC-8) | As of a few days ago Onnxruntime requires CUDA separable compilation… so I guess I gotta fix that now 🙃 | 01:50:24 |
SomeoneSerge (back on matrix) | RE: CI infra/yesterday's meeting CC connor (he/him) (UTC-8):
By the way, while on my side I'm advertising both options for provisioning hardware (spot instances and owned hardware), I think we might want to incentivize companies to commit to supporting the latter path. While it's obviously more work, both organisational and engineering, it is a much better long-term promise for the community. With rented hardware, if two or three companies simultaneously decide to withdraw, we basically have to scale down the CI immediately. If we buy hardware for a non-profit and a few years later some companies decide they're not interested anymore, we maybe lose a retainer covering the maintenance work. With our own hardware we can also be more flexible and maybe dedicate some machines to be used as community builders/devboxes for ad hoc experimentation.
| 11:15:16 |
zopieux | It's me again :-) This time I have a genuinely surprising behavior from the community cache (the substituters are correctly configured): nccl was successfully built (derivation mv02…), the narinfo is available, but upon nix-shell -p cudaPackages_12.nccl I get
this derivation will be built:
/nix/store/mv02rgvrhw9n1682dw7vs8w3pssc24lr-nccl-2.21.5-1.drv
(lots of compiling)
Others, like cudaPackages.cudnn, are successfully retrieved from the cache.
| 17:58:45 |
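One thing that may be worth checking for the cache miss above: the path Nix prints under "will be built" is the `.drv` path, while substitution looks up narinfo files keyed by the hash of each output path (which can be listed with, for example, `nix-store --query --outputs` on the `.drv`). A minimal sketch of that lookup key follows; the cache URL and the store path are placeholders, not the real community cache or a real nccl output.

```python
import os


def narinfo_url(cache_url, store_path):
    """Return the URL a binary cache would serve the narinfo for `store_path` at.

    The key is the 32-character hash part of the store path's base name,
    i.e. <cache_url>/<hash>.narinfo.
    """
    name = os.path.basename(store_path.rstrip("/"))
    hash_part, sep, _rest = name.partition("-")
    if not sep or len(hash_part) != 32:
        raise ValueError(f"not a store path: {store_path!r}")
    return f"{cache_url.rstrip('/')}/{hash_part}.narinfo"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example_output = "/nix/store/" + "0" * 32 + "-example-1.0"
    print(narinfo_url("https://cache.example.org", example_output))
```

For a package with several outputs, each output has its own hash and therefore its own narinfo, so this check would need to be repeated per output.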
ruro | So, uh... I just noticed that CUDA versions prior to 11.4 don't have the individual redistributables (for example, there is no cudaPackages_11_3.cuda_cudart).
Unfortunately, I only noticed this after refactoring cuda-samples to use the individual packages instead of cudatoolkit. sigh
| 21:12:48 |
| 15 Feb 2025 |
| zowoq joined the room. | 00:48:50 |
zowoq |
we can probably first bring up the alignment questions with nix-community just in their chat
We could do it here if you like, I think that between Jonas Chevalier and me we can represent nix-community and discussion is probably of more interest to the people in this room, we can post a summary in the nix-community matrix.
| 00:49:34 |
zowoq | https://github.com/NixOS/rfcs/pull/185 I discovered this RFC a day ago, I don't think it has been mentioned here yet? | 00:49:48 |
| Max Niederman joined the room. | 03:10:37 |
Kevin Mittman (UTC-8) | In reply to @ruroruro:matrix.org
So, uh... I just noticed that CUDA versions prior to 11.4 don't have the individual redistributables (for example, there is no cudaPackages_11_3.cuda_cudart).
Unfortunately, I only noticed this after refactoring cuda-samples to use the individual packages instead of cudatoolkit. sigh
How far back are you looking for? | 04:03:27 |
connor (burnt/out) (UTC-8) | Apparently both Hydra and Nix support dynamic machine lists: https://github.com/NixOS/nix/issues/523#issuecomment-559516338
Here’s the code for Hydra: https://github.com/NixOS/hydra/blob/51944a5fa5696cf78043ad2d08934a91fb89e986/src/hydra-queue-runner/hydra-queue-runner.cc#L178
I assume you could have a script which provisions new machines and adds them to the list of remote builders, assuming you store the list of machines somewhere you can mutate it | 09:08:45 |
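A rough sketch of the "script which adds machines to the list" idea, assuming the list is a plain Nix machines file (one builder per line: URI, systems, SSH key, max jobs, speed factor, supported features) that Nix or Hydra picks up when it changes. The function names, host name, key path, and feature names are hypothetical; provisioning itself is out of scope here.

```python
import os
import tempfile


def format_machine_line(uri, systems, ssh_key="-", max_jobs=1,
                        speed_factor=1, features=()):
    """Render one entry in the whitespace-separated machines-file format."""
    return " ".join([
        uri,
        ",".join(systems),
        ssh_key,
        str(max_jobs),
        str(speed_factor),
        ",".join(features) or "-",  # "-" marks an omitted field
    ])


def add_builder(machines_path, line):
    """Add `line` to the machines file if absent; return True if it was added."""
    try:
        with open(machines_path) as f:
            lines = [l.rstrip("\n") for l in f if l.strip()]
    except FileNotFoundError:
        lines = []
    if line in lines:
        return False
    lines.append(line)
    # Write to a temp file in the same directory, then rename over the
    # original, so a concurrent reader never sees a half-written list.
    directory = os.path.dirname(os.path.abspath(machines_path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.replace(tmp, machines_path)
    return True


if __name__ == "__main__":
    # Hypothetical builder entry, written to a scratch directory.
    path = os.path.join(tempfile.mkdtemp(), "machines")
    entry = format_machine_line(
        "ssh://builder@gpu-1.example.org", ["x86_64-linux"],
        "/etc/nix/id_builder", max_jobs=8, speed_factor=2,
        features=["cuda", "big-parallel"],
    )
    print(add_builder(path, entry), entry)
```

Removing a builder when an ephemeral instance goes away would be the mirror image: filter the line out and rewrite the file the same way.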
connor (burnt/out) (UTC-8) | I forget if Azure's placement groups allow adding more machines after the initial group, but if they do, that makes NFS over RDMA available at 200 to 400 Gbps depending on instance type (precious HBv3/4 instances) | 09:11:31 |
SomeoneSerge (back on matrix) | Yes, I haven't read the code yet, but I think this is just the normal Nix remote builder protocol (unaware of any locality), and I suspect we still have to conjure something up to avoid the cold store issue, which must be more prominent with ephemeral builders than with permanent ones | 09:24:58 |
SomeoneSerge (back on matrix) | Oh... I think I saw the discourse email, but was busy at the time and then completely forgot about this RFC | 09:26:22 |