NixOS CUDA | 289 Members | 57 Servers
| CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda |
| Message | Time |
|---|---|
| 8 Oct 2024 | ||
| From what I’ve seen in the Python ecosystem, compiling kernels at runtime is becoming more commonplace because it reduces the size of the binaries you ship and allows optimizing for the hardware you’re specifically running on. For example, JAX (via XLA) supports auto-tuning via Triton by compiling and running a number of different kernels. | 15:46:17 |
| Yes, compiling on the fly is the core spirit of tinygrad. | 15:47:06 |
| Trying to compose backendStdenv with ccacheStdenv 🙃 | 17:07:51 |
| In reply to @ss:someonex.net callPackage is a blessing and a curse | 17:50:29 |
| It works with a bit of copypaste though | 17:50:43 | |
| But has anyone run into weird PermissionDenied errors with ccache? The directory is visible in the sandbox and owned by the nixbld group, and the id seems to match... | 17:57:47 |
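For context, composing the two stdenvs could take roughly the following shape. This is a hypothetical, untested sketch: it assumes `ccacheStdenv` accepts a `stdenv` override argument (as suggested by the nixpkgs ccache docs) and that `cudaPackages` supports `overrideScope`; the ccache directory must also be exposed to the build sandbox (e.g. via `nix.settings.extra-sandbox-paths`), or you get exactly the kind of PermissionDenied errors mentioned above.

```nix
# Hypothetical overlay sketch (untested): rebuild cudaPackages on top of a
# ccache-wrapped backendStdenv.
final: prev: {
  cudaPackages = prev.cudaPackages.overrideScope (cudaFinal: cudaPrev: {
    # Assumption: ccacheStdenv exposes `stdenv` as an override parameter.
    backendStdenv = prev.ccacheStdenv.override {
      stdenv = cudaPrev.backendStdenv;
    };
  });
}
```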
| 9 Oct 2024 | ||
| 10 Oct 2024 | ||
| Iterating on triton with ccache is so much faster lmao | 16:12:34 | |
| 11 Oct 2024 | ||
| Hey folks! I tried to update libnvidia-container, as it was lagging quite a few versions (including security releases) behind. We use it in a work scenario for GPU containers in legacy mode, where we tested it to "work" generally. The only thing that doesn't work is binary resolution (e.g. nvidia-smi, nvidia-persistenced, ...). I just adapted the patches from the old version so that they apply to the new one. I dropped the replacement of PATH-based binary lookup with a fixed /run/nvidia-docker directory, as this seems to be an artifact of older times, I believe? At least, the path doesn't exist in a legacy-mode container nor on the host. I think the binaries should really be looked up through PATH, which should be set accordingly when calling nvidia-container-cli? What do the experts think? CDI containers work, as the binary paths are resolved correctly through the CDI config generated at boot. Find my draft PR here: https://github.com/NixOS/nixpkgs/pull/347867 | 07:49:12 |
| * Hey folks! I tried to update libnvidia-container, as it was lagging quite a few versions (including security releases) behind. We use it in a work scenario for GPU containers in legacy mode, where we tested it to "work" generally. The only thing that doesn't work is binary resolution (e.g. nvidia-smi, nvidia-persistenced, ...). I just adapted the patches from the old version so that they apply to the new one. I tried dropping the replacement of PATH-based binary lookup with a fixed /run/nvidia-docker directory, as this seems to be an artifact of older times, I believe? At least, the path doesn't exist in a legacy-mode container nor on the host. I think the binaries should really be looked up through PATH, which should be set accordingly when calling nvidia-container-cli? What do the experts think? CDI containers work, as the binary paths are resolved correctly through the CDI config generated at boot. Find my draft PR here: https://github.com/NixOS/nixpkgs/pull/347867 | 07:53:31 |
| * Iterating on triton with ccache is so much faster lmao EDIT: triton+torch in half an hour on a single node, this is not perfect but it is an improvement | 11:41:55 |
| In reply to @msanft:matrix.org What'd be a reasonable way to test this, now that our docker/podman flows all migrated to CDI and our singularity IIRC uses a plain text file with the library paths? | 11:44:30 |
| I tested it with an "OCI Hook", like so: https://github.com/confidential-containers/cloud-api-adaptor/blob/191ec51f6245a1a475c15312d354efaf07ff64de/src/cloud-api-adaptor/podvm/addons/nvidia_gpu/setup.sh#L11C1-L17C4 Getting that to work was the reason I came to update this package in the first place. | 12:21:24 |
| The update is necessary to fix legacy library lookup for containers with GPU access, as newer drivers won't ship libnvidia-pkcs11.so (which corresponds to OpenSSL 1.1), but only the *.openssl3.so alternatives for OpenSSL 3. Just to give this some context. Legacy binary lookup doesn't work with 1.9.0 nor 1.16.2 as of now. I think we might even want to get the update itself merged without fixing that, as it's security-relevant and the binary availability is not a regression, but I'm also happy to hear your stance on that. | 12:24:09 |
| Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I successfully managed to retrieve cudaPackages.stuff from cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there, is it broken? | 17:03:50 |
| * Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I successfully managed to retrieve cudaPackages.(things) from cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there, is it broken? | 17:04:05 |
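For reference, pinning nixpkgs to that revision can look like the sketch below. The sha256 is deliberately left as a placeholder rather than guessed, and whether a given package is actually present in the cache is a separate question from the pin itself.

```nix
# Sketch: pin nixpkgs to the exact revision the cuda-maintainers cache was
# populated from, so the resulting store paths can be substituted.
let
  pkgs = import (fetchTarball {
    url = "https://github.com/NixOS/nixpkgs/archive/9357f4f23713673f310988025d9dc261c20e70c6.tar.gz";
    # sha256 = "..."; # obtain with `nix-prefetch-url --unpack <url>`
  }) {
    config.allowUnfree = true;
    config.cudaSupport = true;
  };
in
pkgs.onnxruntime
```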
| 12 Oct 2024 | ||
| 14 Oct 2024 | ||
| It looks like python312Packages.onnx does not build when cudaSupport = true. | 08:11:25 |
| Gaétan Lepage: could you give https://github.com/NixOS/nixpkgs/pull/328247 another look? I just picked up where the author left off; I didn't try questioning whether e.g. adding a separate triton-llvm is the right way or whatever, and my brain is not in the place to think high-level rn | 18:43:40 |
| In reply to @zopieux:matrix.zopi.eu Seems like dependencies failed to build: https://hercules-ci.com/accounts/github/SomeoneSerge/derivations/%2Fnix%2Fstore%2Fn3lww4jsfan66wyryh3ip3ryarn874q5-onnxruntime-1.18.1.drv?via-job=e51bf1d4-6191-4763-8780-dd317be0b70b Rather than debugging this, I'd advise you look into https://hydra.nix-community.org/job/nixpkgs/cuda/onnxruntime.x86_64-linux | 18:50:31 |
| * In reply to @zopieux:matrix.zopi.eu Seems like dependencies failed to build: https://hercules-ci.com/accounts/github/SomeoneSerge/derivations/%2Fnix%2Fstore%2Fn3lww4jsfan66wyryh3ip3ryarn874q5-onnxruntime-1.18.1.drv?via-job=e51bf1d4-6191-4763-8780-dd317be0b70b Rather than debugging this, I'd advise you look into https://hydra.nix-community.org/job/nixpkgs/cuda/onnxruntime.x86_64-linux. There haven't been any official announcements from nix-community's infra team to the best of my knowledge -> no "promises", but the hope is that this will become the supported and long-term maintained solution | 18:51:50 |
| https://nix-community.org/cache/ | 18:52:36 | |
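Wiring up the nix-community cache on NixOS might look like the following sketch. The substituter URL and public key below are written from memory of the cachix listing, not taken from this conversation; verify both against the page linked above before trusting them.

```nix
# Hedged sketch of a NixOS module enabling the nix-community binary cache.
{
  nix.settings = {
    extra-substituters = [ "https://nix-community.cachix.org" ];
    # Assumption: this is the currently published key; confirm it at
    # https://nix-community.org/cache/ before use.
    extra-trusted-public-keys = [
      "nix-community.cachix.org-1:mB9FSh9qf2dCimDSUo8Zy7bkq5CX+/rkCWyvRCYg3Fs="
    ];
  };
}
```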
| In reply to @ss:someonex.net Indeed, it seems to fail currently | 19:02:58 |
| In reply to @ss:someonex.net This is building the cuda version of onnx? | 19:03:19 |
| Yes but also the hydra history is all green 🤷 | 19:08:54 | |
| Yes, weird... | 19:13:19 | |
| Noticed https://github.com/SomeoneSerge/nixpkgs-cuda-ci/issues/31#issuecomment-2412043822 only now, published a response | 19:22:08 | |
| I can't get onnx to build... Here are the logs in case someone knows what is happening: https://paste.glepage.com/upload/eel-falcon-sloth | 20:08:13 |