
NixOS CUDA

289 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda

57 Servers



Sender | Message | Time
8 Oct 2024
@ss:someonex.netSomeoneSerge (back on matrix) But has anyone run into weird PermissionDenied errors with ccache? The directory is visible in the sandbox and owned by the nixbld group, and the id seems to match... 17:57:47
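For reference, the ccache-in-the-sandbox setup being debugged here usually looks roughly like the sketch below. The option names are those of the stock NixOS programs.ccache module and the extra-sandbox-paths nix.conf setting, assumed for illustration rather than quoted from this conversation.

      { config, ... }:
      {
        # Enable the ccache NixOS module; programs.ccache.packageNames can then
        # opt individual packages into building with ccacheStdenv.
        programs.ccache.enable = true;

        # Bind-mount the cache directory into the build sandbox. If this path is
        # missing from the sandbox, or its ownership/permissions do not allow the
        # nixbld build users to write to it, builds fail with permission errors.
        nix.settings.extra-sandbox-paths = [ config.programs.ccache.cacheDir ];
      }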
@kaya:catnip.eekaya 𖤐 changed their profile picture.19:36:06
9 Oct 2024
@john:friendsgiv.ingjohn joined the room.01:20:41
10 Oct 2024
@ss:someonex.netSomeoneSerge (back on matrix)Iterating on triton with ccache is so much faster lmao16:12:34
11 Oct 2024
@msanft:matrix.orgMoritz Sanft Hey folks! I tried to update libnvidia-container, as it was lagging quite a few versions (including security releases) behind. We use it in a work scenario for GPU containers in legacy mode, where we tested it to generally "work". The only thing that doesn't work is the binary resolution (e.g. nvidia-smi, nvidia-persistenced, ...). I just adapted the patches from the old version so that they apply to the new one. I dropped the patch that replaces PATH-based binary lookup with a hard-coded /run/nvidia-docker directory, as this seems to be an artifact of older times, I believe? At least, the path doesn't exist in a legacy-mode container nor on the host. I think the binaries should really be looked up through PATH, which should be set accordingly when calling nvidia-container-cli? What do the experts think?

CDI containers work, as the binary paths are resolved correctly through the CDI config generated at boot.

Find my draft PR here: https://github.com/NixOS/nixpkgs/pull/347867
07:49:12
@msanft:matrix.orgMoritz Sanft Hey folks! I tried to update libnvidia-container, as it was lagging quite a few versions (including security releases) behind. We use it in a work scenario for GPU containers in legacy mode, where we tested it to generally "work". The only thing that doesn't work is the binary resolution (e.g. nvidia-smi, nvidia-persistenced, ...). I just adapted the patches from the old version so that they apply to the new one. I tried dropping the patch that replaces PATH-based binary lookup with a hard-coded /run/nvidia-docker directory, as this seems to be an artifact of older times, I believe? At least, the path doesn't exist in a legacy-mode container nor on the host. I think the binaries should really be looked up through PATH, which should be set accordingly when calling nvidia-container-cli? What do the experts think?

CDI containers work, as the binary paths are resolved correctly through the CDI config generated at boot.

Find my draft PR here: https://github.com/NixOS/nixpkgs/pull/347867
07:53:31
@ss:someonex.netSomeoneSerge (back on matrix) * Iterating on triton with ccache is so much faster lmao EDIT: triton+torch in half an hour on a single node, this is not perfect but it is an improvement11:41:55
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @msanft:matrix.org
Hey folks! I tried to update libnvidia-container, as it was lagging quite a few versions (including security releases) behind. We use it in a work scenario for GPU containers in legacy mode, where we tested it to generally "work". The only thing that doesn't work is the binary resolution (e.g. nvidia-smi, nvidia-persistenced, ...). I just adapted the patches from the old version so that they apply to the new one. I tried dropping the patch that replaces PATH-based binary lookup with a hard-coded /run/nvidia-docker directory, as this seems to be an artifact of older times, I believe? At least, the path doesn't exist in a legacy-mode container nor on the host. I think the binaries should really be looked up through PATH, which should be set accordingly when calling nvidia-container-cli? What do the experts think?

CDI containers work, as the binary paths are resolved correctly through the CDI config generated at boot.

Find my draft PR here: https://github.com/NixOS/nixpkgs/pull/347867
What'd be a reasonable way to test this, now that our docker/podman flows have all migrated to CDI and our singularity IIRC uses a plain text file with the library paths?
11:44:30
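For comparison, the CDI flow referred to here is, on current NixOS, enabled with something like the sketch below; the option names are the upstream NixOS modules, assumed for illustration, and testing the legacy libnvidia-container path has no equivalently simple switch, which is what makes the question non-trivial.

      {
        # Generate a CDI specification for the installed NVIDIA driver at boot.
        hardware.nvidia-container-toolkit.enable = true;

        # Any CDI-aware runtime can then request the GPU by device name, e.g.
        #   podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi
        virtualisation.podman.enable = true;
      }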
@msanft:matrix.orgMoritz SanftI tested it with an "OCI Hook", like so: https://github.com/confidential-containers/cloud-api-adaptor/blob/191ec51f6245a1a475c15312d354efaf07ff64de/src/cloud-api-adaptor/podvm/addons/nvidia_gpu/setup.sh#L11C1-L17C4 Getting that to work was also the particular reason why I came to update this package in the first place.12:21:24
@msanft:matrix.orgMoritz Sanft The update is necessary to fix legacy library lookup for containers with GPU access, as newer drivers no longer ship libnvidia-pkcs11.so (which corresponds to OpenSSL 1.1), but only the *.openssl3.so alternatives for OpenSSL 3 (just to give this some context). Legacy binary lookup doesn't work with either 1.9.0 or 1.16.2 as of now. I think we might even want to get the update itself merged without fixing that, as it's security-relevant and the binary availability is not a regression, but I'm also happy to hear your stance on that. 12:24:09
@zopieux:matrix.zopi.euzopieux Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I successfully managed to retrieve cudaPackages.stuff from the cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there. Is it broken? 17:03:50
@zopieux:matrix.zopi.euzopieux * Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I successfully managed to retrieve cudaPackages.(things) from the cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there. Is it broken? 17:04:05
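The cache in question is the cuda-maintainers Cachix instance; wiring it up usually looks like the NixOS-style sketch below. The public key is reproduced from the nixpkgs CUDA documentation, not from this conversation, and should be verified against the Cachix page; cache hits additionally require evaluating the same nixpkgs revision the CI built, hence the pinning mentioned above.

      {
        nix.settings = {
          substituters = [ "https://cuda-maintainers.cachix.org" ];
          trusted-public-keys = [
            # Verify against https://app.cachix.org/cache/cuda-maintainers
            "cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E="
          ];
        };
      }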
12 Oct 2024
@mabau:matrix.org@mabau:matrix.org joined the room.07:39:38
14 Oct 2024
@glepage:matrix.orgGaétan Lepage It looks like python312Packages.onnx does not build when cudaSupport = true. 08:11:25
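A minimal way to reproduce this kind of failure locally is an expression along the following lines, assuming a channel or pinned nixpkgs checkout; allowUnfree is needed because of the CUDA toolkit.

      # onnx-cuda.nix -- build with: nix-build onnx-cuda.nix
      let
        pkgs = import <nixpkgs> {
          config = {
            allowUnfree = true;
            cudaSupport = true;
          };
        };
      in
      pkgs.python312Packages.onnx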
@ss:someonex.netSomeoneSerge (back on matrix) Gaétan Lepage: could you give https://github.com/NixOS/nixpkgs/pull/328247 another look? I just picked up where the author left off, I didn't try questioning whether e.g. adding a separate triton-llvm is the right way or whatever, and my brain is not in the place to think high-level rn 18:43:40
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @zopieux:matrix.zopi.eu
Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I successfully managed to retrieve cudaPackages.(things) from the cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there. Is it broken?

Seems like dependencies failed to build: https://hercules-ci.com/accounts/github/SomeoneSerge/derivations/%2Fnix%2Fstore%2Fn3lww4jsfan66wyryh3ip3ryarn874q5-onnxruntime-1.18.1.drv?via-job=e51bf1d4-6191-4763-8780-dd317be0b70b

Rather than debugging this, I'd advise you to look into https://hydra.nix-community.org/job/nixpkgs/cuda/onnxruntime.x86_64-linux

18:50:31
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @zopieux:matrix.zopi.eu
Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I successfully managed to retrieve cudaPackages.(things) from the cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there. Is it broken?
*

Seems like dependencies failed to build: https://hercules-ci.com/accounts/github/SomeoneSerge/derivations/%2Fnix%2Fstore%2Fn3lww4jsfan66wyryh3ip3ryarn874q5-onnxruntime-1.18.1.drv?via-job=e51bf1d4-6191-4763-8780-dd317be0b70b

Rather than debugging this, I'd advise you to look into https://hydra.nix-community.org/job/nixpkgs/cuda/onnxruntime.x86_64-linux. There haven't been any official announcements from nix-community's infra team to the best of my knowledge -> no "promises", but the hope is that this will become the supported and long-term maintained solution

18:51:50
@ss:someonex.netSomeoneSerge (back on matrix)https://nix-community.org/cache/18:52:36
@glepage:matrix.orgGaétan Lepage
In reply to @ss:someonex.net

Seems like dependencies failed to build: https://hercules-ci.com/accounts/github/SomeoneSerge/derivations/%2Fnix%2Fstore%2Fn3lww4jsfan66wyryh3ip3ryarn874q5-onnxruntime-1.18.1.drv?via-job=e51bf1d4-6191-4763-8780-dd317be0b70b

Rather than debugging this, I'd advise you to look into https://hydra.nix-community.org/job/nixpkgs/cuda/onnxruntime.x86_64-linux. There haven't been any official announcements from nix-community's infra team to the best of my knowledge -> no "promises", but the hope is that this will become the supported and long-term maintained solution

Indeed, it seems to fail currently
19:02:58
@glepage:matrix.orgGaétan Lepage
In reply to @ss:someonex.net

Seems like dependencies failed to build: https://hercules-ci.com/accounts/github/SomeoneSerge/derivations/%2Fnix%2Fstore%2Fn3lww4jsfan66wyryh3ip3ryarn874q5-onnxruntime-1.18.1.drv?via-job=e51bf1d4-6191-4763-8780-dd317be0b70b

Rather than debugging this, I'd advise you to look into https://hydra.nix-community.org/job/nixpkgs/cuda/onnxruntime.x86_64-linux. There haven't been any official announcements from nix-community's infra team to the best of my knowledge -> no "promises", but the hope is that this will become the supported and long-term maintained solution

This is building the CUDA version of onnx?
19:03:19
@ss:someonex.netSomeoneSerge (back on matrix)Yes but also the hydra history is all green 🤷19:08:54
@glepage:matrix.orgGaétan LepageYes, weird...19:13:19
@ss:someonex.netSomeoneSerge (back on matrix)Noticed https://github.com/SomeoneSerge/nixpkgs-cuda-ci/issues/31#issuecomment-2412043822 only now, published a response19:22:08
@glepage:matrix.orgGaétan Lepage I can't get onnx to build...
Here are the logs in case someone knows what is happening: https://paste.glepage.com/upload/eel-falcon-sloth
20:08:13
@ss:someonex.netSomeoneSerge (back on matrix)
      error: downloading 'https://github.com/abseil/abseil-cpp/archive/refs/tags/20230125.3.tar.gz' failed

lol

20:19:08
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @ss:someonex.net
Yes but also the hydra history is all green 🤷
Maybe that just came in from staging
20:19:30
15 Oct 2024
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)
In reply to @glepage:matrix.org
I can't get onnx to build...
Here are the logs in case someone knows what is happening: https://paste.glepage.com/upload/eel-falcon-sloth
Onnx's CMake isn't detecting at least one dependency, so it tries to download them all in order, starting with abseil. Since there's no networking in the sandbox, it fails.
00:06:48
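The generic nixpkgs countermeasure for this failure mode is to hand CMake the dependency from nixpkgs and forbid downloads outright; a hypothetical override to that effect is sketched below. FETCHCONTENT_FULLY_DISCONNECTED is a standard CMake variable and lib.cmakeBool is the nixpkgs helper, but whether onnx's setuptools-driven build honours cmakeFlags this way has not been verified here, so `somePkg` is just a placeholder for a CMake-based derivation.

      # Hypothetical sketch with placeholder names, not the actual onnx fix.
      { lib, abseil-cpp, somePkg }:
      somePkg.overrideAttrs (old: {
        # Provide the dependency CMake would otherwise try to fetch.
        buildInputs = (old.buildInputs or [ ]) ++ [ abseil-cpp ];
        # Refuse all FetchContent downloads so a misdetected dependency
        # fails loudly instead of reaching for the network.
        cmakeFlags = (old.cmakeFlags or [ ]) ++ [
          (lib.cmakeBool "FETCHCONTENT_FULLY_DISCONNECTED" true)
        ];
      })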
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)I'm currently working on Onnx packaging for a thing, and you can see what I've got going on here: https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnx.nix (It's a combination C++/Python install so it's gnarly. But better than having two separate derivations with libraries built with different flags, I guess.)00:09:04
@glepage:matrix.orgGaétan LepageOk interesting, thanks for sharing05:46:57
@glepage:matrix.orgGaétan LepageIs your plan to upstream this to nixpkgs?05:47:13


