NixOS CUDA | 290 Members | 57 Servers
| CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda |
| Sender | Message | Time |
|---|---|---|
| 10 Oct 2024 | ||
| Iterating on triton with ccache is so much faster lmao | 16:12:34 | |
| 11 Oct 2024 | ||
Hey folks! I tried to update libnvidia-container, as it was lagging quite a few versions (including security releases) behind. We use it in a work scenario for GPU containers in legacy mode, where we tested it to "work" generally. The only thing that doesn't work is binary resolution (e.g. nvidia-smi, nvidia-persistenced, ...). I just adapted the patches from the old version so that they apply to the new one. I dropped the patch that replaces PATH-based binary lookup with the fixed /run/nvidia-docker directory, as this seems to be an artifact of older times, I believe? At least, the path doesn't exist in a legacy mode container nor on the host. I think the binaries should really be looked up through PATH, which should be set accordingly when calling nvidia-container-cli? What do the experts think? CDI containers work, as the binary paths are resolved correctly through the CDI config generated at boot. Find my draft PR here: https://github.com/NixOS/nixpkgs/pull/347867 | 07:49:12 | |
* Hey folks! I tried to update libnvidia-container, as it was lagging quite a few versions (including security releases) behind. We use it in a work scenario for GPU containers in legacy mode, where we tested it to "work" generally. The only thing that doesn't work is binary resolution (e.g. nvidia-smi, nvidia-persistenced, ...). I just adapted the patches from the old version so that they apply to the new one. I tried dropping the patch that replaces PATH-based binary lookup with the fixed /run/nvidia-docker directory, as this seems to be an artifact of older times, I believe? At least, the path doesn't exist in a legacy mode container nor on the host. I think the binaries should really be looked up through PATH, which should be set accordingly when calling nvidia-container-cli? What do the experts think? CDI containers work, as the binary paths are resolved correctly through the CDI config generated at boot. Find my draft PR here: https://github.com/NixOS/nixpkgs/pull/347867 | 07:53:31 | |
| * Iterating on triton with ccache is so much faster lmao EDIT: triton+torch in half an hour on a single node, this is not perfect but is an improvement | 11:41:55 | |
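(For anyone curious how a ccache setup like the one above can look: the nixpkgs manual documents a ccacheStdenv flow roughly like this NixOS sketch; the "triton" entry is an illustrative assumption, not the actual config used here.)

```nix
{ config, ... }:
{
  # Keep a persistent compiler cache that survives rebuilds.
  programs.ccache.enable = true;
  # Rebuild the listed packages with ccacheStdenv; "triton" is illustrative.
  programs.ccache.packageNames = [ "triton" ];
  # The Nix build sandbox sees no host paths by default, so the
  # cache directory has to be bind-mounted in explicitly.
  nix.settings.extra-sandbox-paths = [ config.programs.ccache.cacheDir ];
}
```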
In reply to @msanft:matrix.org: What'd be a reasonable way to test this, now that our docker/podman flows have all migrated to CDI and our singularity IIRC uses a plain text file with the library paths? | 11:44:30 | |
| I tested it with an "OCI Hook", like so: https://github.com/confidential-containers/cloud-api-adaptor/blob/191ec51f6245a1a475c15312d354efaf07ff64de/src/cloud-api-adaptor/podvm/addons/nvidia_gpu/setup.sh#L11C1-L17C4 Getting that to work was the reason I came to update this package in the first place. | 12:21:24 | |
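(For reference, the CDI flow mentioned here is what current NixOS exposes as a module option; a minimal sketch, assuming a NixOS recent enough to have hardware.nvidia-container-toolkit:)

```nix
{
  # Generates a CDI specification for the installed NVIDIA driver at boot.
  hardware.nvidia-container-toolkit.enable = true;
  virtualisation.podman.enable = true;
}
```

With that in place, something like `podman run --device nvidia.com/gpu=all ... nvidia-smi` exercises the CDI path rather than the legacy hook.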
The update is necessary to fix legacy library lookup for containers with GPU access, as newer drivers won't have libnvidia-pkcs11.so (which corresponds to OpenSSL 1.1), but only the *.openssl3.so alternatives for OpenSSL 3. Just to give this some context: legacy binary lookup doesn't work with either 1.9.0 or 1.16.2 as of now. I think we might even want to get the update itself merged without fixing that, as it's security-relevant and the binary availability is not a regression, but I'm also happy to hear your stance on that. | 12:24:09 | |
Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I can successfully retrieve cudaPackages.stuff from the cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there. Is it broken? | 17:03:50 | |
* Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I can successfully retrieve cudaPackages.(things) from the cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there. Is it broken? | 17:04:05 | |
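(A minimal sketch of such a pin, in case someone wants to reproduce; the config flags are an assumption about how the CUDA package set was evaluated:)

```nix
# pinned.nix -- evaluate nixpkgs at the exact revision mentioned above.
import (builtins.fetchTarball {
  url = "https://github.com/NixOS/nixpkgs/archive/9357f4f23713673f310988025d9dc261c20e70c6.tar.gz";
}) {
  config = {
    allowUnfree = true; # CUDA packages are unfree
    cudaSupport = true;
  };
}
```

Substitution from the cache then needs https://cuda-maintainers.cachix.org configured as an extra substituter along with its public key.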
| 14 Oct 2024 | ||
It looks like python312Packages.onnx does not build when cudaSupport = true. | 08:11:25 | |
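(A minimal repro expression for this, assuming the failure is triggered purely by the global cudaSupport flag:)

```nix
# onnx-repro.nix -- build the package that reportedly breaks with CUDA on.
let
  pkgs = import <nixpkgs> {
    config = {
      allowUnfree = true;
      cudaSupport = true; # the flag reported to break the build
    };
  };
in
pkgs.python312Packages.onnx
```

`nix-build onnx-repro.nix` should then reproduce the failure.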
Gaétan Lepage: could you give https://github.com/NixOS/nixpkgs/pull/328247 another look? I just picked up where the author left off, I didn't try questioning whether e.g. adding a separate triton-llvm is the right way or whatever, and my brain is not in the place to think high-level rn | 18:43:40 | |
In reply to @zopieux:matrix.zopi.eu: Seems like dependencies failed to build: https://hercules-ci.com/accounts/github/SomeoneSerge/derivations/%2Fnix%2Fstore%2Fn3lww4jsfan66wyryh3ip3ryarn874q5-onnxruntime-1.18.1.drv?via-job=e51bf1d4-6191-4763-8780-dd317be0b70b Rather than debugging this, I'd advise you to look into https://hydra.nix-community.org/job/nixpkgs/cuda/onnxruntime.x86_64-linux | 18:50:31 | |
In reply to @zopieux:matrix.zopi.eu: * Seems like dependencies failed to build: https://hercules-ci.com/accounts/github/SomeoneSerge/derivations/%2Fnix%2Fstore%2Fn3lww4jsfan66wyryh3ip3ryarn874q5-onnxruntime-1.18.1.drv?via-job=e51bf1d4-6191-4763-8780-dd317be0b70b Rather than debugging this, I'd advise you to look into https://hydra.nix-community.org/job/nixpkgs/cuda/onnxruntime.x86_64-linux. There haven't been any official announcements from nix-community's infra team to the best of my knowledge -> no "promises", but the hope is that this will become the supported and long-term maintained solution | 18:51:50 | |
| https://nix-community.org/cache/ | 18:52:36 | |
In reply to @ss:someonex.net: Indeed, it seems to fail currently | 19:02:58 | |
In reply to @ss:someonex.net: This is building the CUDA version of onnx? | 19:03:19 | |
| Yes but also the hydra history is all green 🤷 | 19:08:54 | |
| Yes, weird... | 19:13:19 | |
| Noticed https://github.com/SomeoneSerge/nixpkgs-cuda-ci/issues/31#issuecomment-2412043822 only now, published a response | 19:22:08 | |
I can't get onnx to build... Here are the logs in case someone knows what is happening: https://paste.glepage.com/upload/eel-falcon-sloth | 20:08:13 | |
lol | 20:19:08 | |
In reply to @ss:someonex.net: Maybe that just came in from staging | 20:19:30 | |
| 15 Oct 2024 | ||
In reply to @glepage:matrix.org: Onnx's CMake isn't detecting at least one dependency, so it tries to download them all in order, starting with abseil. Since there's no networking in the sandbox, it fails. | 00:06:48 | |
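(The usual nixpkgs-style countermeasure, sketched as an overlay; whether onnx's CMake honors it is exactly what would need checking, and the abseil-cpp input is inferred from the message above:)

```nix
# Overlay sketch: make FetchContent prefer find_package (CMake >= 3.24),
# so the sandboxed build picks up the Nix-provided abseil instead of
# attempting a (doomed) network download.
final: prev: {
  onnx = prev.onnx.overrideAttrs (old: {
    buildInputs = (old.buildInputs or [ ]) ++ [ final.abseil-cpp ];
    cmakeFlags = (old.cmakeFlags or [ ]) ++ [
      "-DFETCHCONTENT_TRY_FIND_PACKAGE_MODE=ALWAYS"
    ];
  });
}
```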
| I'm currently working on Onnx packaging for a thing, and you can see what I've got going on here: https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnx.nix (It's a combination C++/Python install so it's gnarly. But better than having two separate derivations with libraries built with different flags, I guess.) | 00:09:04 | |
| Ok interesting, thanks for sharing | 05:46:57 | |
| Is your plan to upstream this to nixpkgs? | 05:47:13 | |
[triton update] triton-llvm fails during the test phase. Logs: https://paste.glepage.com/upload/fish-jaguar-pig | 08:48:05 | |
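(If the failing tests block iteration, a stopgap that skips the check phase while the actual failure is investigated; a sketch, not something to upstream:)

```nix
# Overlay sketch: temporarily disable triton-llvm's test phase.
final: prev: {
  triton-llvm = prev.triton-llvm.overrideAttrs (_: {
    doCheck = false;
  });
}
```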
In reply to @glepage:matrix.org: Can't reproduce, builds for me | 12:35:31 |