NixOS CUDA | 290 Members | 57 Servers
| CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda |
| Sender | Message | Time |
|---|---|---|
| 10 Oct 2024 | ||
| Iterating on triton with ccache is so much faster lmao | 16:12:34 | |
| 11 Oct 2024 | ||
Hey folks! I tried to update libnvidia-container, as it was lagging quite a few versions (including security releases) behind. We use it in a work scenario for GPU containers in legacy mode, where we tested it to "work" generally. The only thing that doesn't work is binary resolution (e.g. nvidia-smi, nvidia-persistenced, ...). I just adapted the patches from the old version so that they apply to the new one. I dropped the patch that replaces PATH-based binary lookup with the fixed /run/nvidia-docker directory, as this seems to be an artifact of older times, I believe? At least, the path doesn't exist in a legacy mode container nor on the host. I think the binaries should really be looked up through PATH, which should be set accordingly when calling nvidia-container-cli? What do the experts think? CDI containers work, as the binary paths are resolved correctly through the CDI config generated at boot. Find my draft PR here: https://github.com/NixOS/nixpkgs/pull/347867 | 07:49:12 | |
* Hey folks! I tried to update libnvidia-container, as it was lagging quite a few versions (including security releases) behind. We use it in a work scenario for GPU containers in legacy mode, where we tested it to "work" generally. The only thing that doesn't work is binary resolution (e.g. nvidia-smi, nvidia-persistenced, ...). I just adapted the patches from the old version so that they apply to the new one. I tried dropping the patch that replaces PATH-based binary lookup with the fixed /run/nvidia-docker directory, as this seems to be an artifact of older times, I believe? At least, the path doesn't exist in a legacy mode container nor on the host. I think the binaries should really be looked up through PATH, which should be set accordingly when calling nvidia-container-cli? What do the experts think? CDI containers work, as the binary paths are resolved correctly through the CDI config generated at boot. Find my draft PR here: https://github.com/NixOS/nixpkgs/pull/347867 | 07:53:31 | |
| * Iterating on triton with ccache is so much faster lmao EDIT: triton+torch in half an hour on a single node, this is not perfect but is an improvement | 11:41:55 | |
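(For anyone curious how a ccache setup like the one above can look: the nixpkgs manual documents a ccacheStdenv flow roughly like this NixOS sketch; the "triton" entry is an illustrative assumption, not the actual config used here.)

```nix
{ config, ... }:
{
  # Keep a persistent compiler cache that survives rebuilds.
  programs.ccache.enable = true;
  # Rebuild the listed packages with ccacheStdenv; "triton" is illustrative.
  programs.ccache.packageNames = [ "triton" ];
  # The Nix build sandbox sees no host paths by default, so the
  # cache directory has to be bind-mounted in explicitly.
  nix.settings.extra-sandbox-paths = [ config.programs.ccache.cacheDir ];
}
```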
In reply to @msanft:matrix.org: What'd be a reasonable way to test this, now that our docker/podman flows have all migrated to CDI and our singularity IIRC uses a plain text file with the library paths? | 11:44:30 | |
| I tested it with an "OCI Hook", like so: https://github.com/confidential-containers/cloud-api-adaptor/blob/191ec51f6245a1a475c15312d354efaf07ff64de/src/cloud-api-adaptor/podvm/addons/nvidia_gpu/setup.sh#L11C1-L17C4 Getting that to work was the reason I came to update this package in the first place. | 12:21:24 | |
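(For reference, the CDI flow mentioned here is what current NixOS exposes as a module option; a minimal sketch, assuming a NixOS recent enough to have hardware.nvidia-container-toolkit:)

```nix
{
  # Generates a CDI specification for the installed NVIDIA driver at boot.
  hardware.nvidia-container-toolkit.enable = true;
  virtualisation.podman.enable = true;
}
```

With that in place, something like `podman run --device nvidia.com/gpu=all ... nvidia-smi` exercises the CDI path rather than the legacy hook.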
The update is necessary to fix legacy library lookup for containers with GPU access, as newer drivers won't have libnvidia-pkcs11.so (which corresponds to OpenSSL 1.1), but only the *.openssl3.so alternatives for OpenSSL 3. Just to give this some context: legacy binary lookup doesn't work with either 1.9.0 or 1.16.2 as of now. I think we might even want to get the update itself merged without fixing that, as it's security-relevant and the binary availability is not a regression, but I'm also happy to hear your stance on that. | 12:24:09 | |
Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I can successfully retrieve cudaPackages.stuff from the cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there. Is it broken? | 17:03:50 | |
* Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I can successfully retrieve cudaPackages.(things) from the cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there. Is it broken? | 17:04:05 | |
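(A minimal sketch of such a pin, in case someone wants to reproduce; the config flags are an assumption about how the CUDA package set was evaluated:)

```nix
# pinned.nix -- evaluate nixpkgs at the exact revision mentioned above.
import (builtins.fetchTarball {
  url = "https://github.com/NixOS/nixpkgs/archive/9357f4f23713673f310988025d9dc261c20e70c6.tar.gz";
}) {
  config = {
    allowUnfree = true; # CUDA packages are unfree
    cudaSupport = true;
  };
}
```

Substitution from the cache then needs https://cuda-maintainers.cachix.org configured as an extra substituter along with its public key.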
| 14 Oct 2024 | ||
It looks like python312Packages.onnx does not build when cudaSupport = true. | 08:11:25 | |
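(A minimal repro expression for this, assuming the failure is triggered purely by the global cudaSupport flag:)

```nix
# onnx-repro.nix -- build the package that reportedly breaks with CUDA on.
let
  pkgs = import <nixpkgs> {
    config = {
      allowUnfree = true;
      cudaSupport = true; # the flag reported to break the build
    };
  };
in
pkgs.python312Packages.onnx
```

`nix-build onnx-repro.nix` should then reproduce the failure.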
Gaétan Lepage: could you give https://github.com/NixOS/nixpkgs/pull/328247 another look? I just picked up where the author left off, I didn't try questioning whether e.g. adding a separate triton-llvm is the right way or whatever, and my brain is not in the place to think high-level rn | 18:43:40 | |
In reply to @zopieux:matrix.zopi.eu: Seems like dependencies failed to build: https://hercules-ci.com/accounts/github/SomeoneSerge/derivations/%2Fnix%2Fstore%2Fn3lww4jsfan66wyryh3ip3ryarn874q5-onnxruntime-1.18.1.drv?via-job=e51bf1d4-6191-4763-8780-dd317be0b70b Rather than debugging this, I'd advise you to look into https://hydra.nix-community.org/job/nixpkgs/cuda/onnxruntime.x86_64-linux | 18:50:31 | |
In reply to @zopieux:matrix.zopi.eu: * Seems like dependencies failed to build: https://hercules-ci.com/accounts/github/SomeoneSerge/derivations/%2Fnix%2Fstore%2Fn3lww4jsfan66wyryh3ip3ryarn874q5-onnxruntime-1.18.1.drv?via-job=e51bf1d4-6191-4763-8780-dd317be0b70b Rather than debugging this, I'd advise you to look into https://hydra.nix-community.org/job/nixpkgs/cuda/onnxruntime.x86_64-linux. There haven't been any official announcements from nix-community's infra team to the best of my knowledge -> no "promises", but the hope is that this will become the supported and long-term maintained solution | 18:51:50 | |
| https://nix-community.org/cache/ | 18:52:36 | |
In reply to @ss:someonex.net: Indeed, it seems to fail currently | 19:02:58 | |
In reply to @ss:someonex.net: This is building the CUDA version of onnx? | 19:03:19 | |
| Yes but also the hydra history is all green 🤷 | 19:08:54 | |
| Yes, weird... | 19:13:19 | |
| Noticed https://github.com/SomeoneSerge/nixpkgs-cuda-ci/issues/31#issuecomment-2412043822 only now, published a response | 19:22:08 | |
I can't get onnx to build... Here are the logs in case someone knows what is happening: https://paste.glepage.com/upload/eel-falcon-sloth | 20:08:13 | |
lol | 20:19:08 | |
In reply to @ss:someonex.net: Maybe that just came in from staging | 20:19:30 | |
| 15 Oct 2024 | ||
In reply to @glepage:matrix.org: Onnx's CMake isn't detecting at least one dependency, so it tries to download them all in order, starting with abseil. Since there's no networking in the sandbox, it fails. | 00:06:48 | |
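(The usual nixpkgs-style countermeasure, sketched as an overlay; whether onnx's CMake honors it is exactly what would need checking, and the abseil-cpp input is inferred from the message above:)

```nix
# Overlay sketch: make FetchContent prefer find_package (CMake >= 3.24),
# so the sandboxed build picks up the Nix-provided abseil instead of
# attempting a (doomed) network download.
final: prev: {
  onnx = prev.onnx.overrideAttrs (old: {
    buildInputs = (old.buildInputs or [ ]) ++ [ final.abseil-cpp ];
    cmakeFlags = (old.cmakeFlags or [ ]) ++ [
      "-DFETCHCONTENT_TRY_FIND_PACKAGE_MODE=ALWAYS"
    ];
  });
}
```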
| I'm currently working on Onnx packaging for a thing, and you can see what I've got going on here: https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnx.nix (It's a combination C++/Python install so it's gnarly. But better than having two separate derivations with libraries built with different flags, I guess.) | 00:09:04 | |
| Ok interesting, thanks for sharing | 05:46:57 | |
| Is your plan to upstream this to nixpkgs? | 05:47:13 | |
[triton update] triton-llvm fails during the test phase. Logs: https://paste.glepage.com/upload/fish-jaguar-pig | 08:48:05 | |
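(If the failing tests block iteration, a stopgap that skips the check phase while the actual failure is investigated; a sketch, not something to upstream:)

```nix
# Overlay sketch: temporarily disable triton-llvm's test phase.
final: prev: {
  triton-llvm = prev.triton-llvm.overrideAttrs (_: {
    doCheck = false;
  });
}
```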
In reply to @glepage:matrix.org: Can't reproduce, builds for me | 12:35:31 |