NixOS CUDA - Public Room Timeline

	NixOS CUDA	290 Members
	CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda	57 Servers

Sender	Message	Time
13 Jun 2024
	shekhinah removed their display name yaldabaoth.	02:43:30
SomeoneSerge (back on matrix)	Ehhh tfw `/proc/sys/fs/file-max` is 20 characters long but nix build fails with "too many open files"	16:17:07
SomeoneSerge (back on matrix)	`$ sudo lsof | wc -l` gives `1191414`. That's not much, is it?	16:21:12
SomeoneSerge (back on matrix)	[In reply to @gjvnq:matrix.org: Yeah, I had already figured it out, but the big issue is that I don't know what is the "right" way to include the definition of the half type. To make matters worse, I've tried to compile AliceVision in a docker container using the official compilation scripts, and yet the thing keeps failing. This means I can't even look at how the thing is supposed to compile. I'm at a bit of a loss as for how to proceed, but I suspect that I'll have to either ask the original authors for help or carefully read the cmake compilation scripts in order to look for potential sources of the error. Theoretically AliceVision has a nice CI pipeline, but I can't see their build history, so I don't even know how useful their CI scripts are.] Could it be this https://github.com/AcademySoftwareFoundation/Imath/blob/2fc9d89ec52003350fcfd20f337bb3d0b870ff5a/src/Imath/half.h#L180-L182	16:25:14
Mir	[In reply to @ss:someonex.net: Could it be this https://github.com/AcademySoftwareFoundation/Imath/blob/2fc9d89ec52003350fcfd20f337bb3d0b870ff5a/src/Imath/half.h#L180-L182] Possibly, but I'm afraid of just patching source code to force the inclusion of CUDA's half without first exhausting config flags. I feel like something in CMake is misconfigured or bugged, and that I should patch CMakeLists.txt before patching the source code directly.	16:32:00
aidalgol	[In reply to @ss:someonex.net: I suppose we need some kind of a fixpoint 🤷 Btw, I just had a look at the mangohud derivation, and we're still doing `inherit (linuxPackages.nvidia_x11) libXNVCtrl`, which I think is a mauvais ton (referencing a concrete version of linuxPackages from nixpkgs). I still think it belongs in the top-level, maybe as an attrset, `libXNVCtrlVersions`. Probably a very bad idea, but the nvidia nixos module could add an overlay setting the respective default version of libXNVCtrl. With all packages taking `libXNVCtrl` from the top-level, we'd ensure only one version is in use in any given closure (probably at the cost of rebuilding reverse dependencies). Passing a python package (mako) rather than python3Packages. Honestly, not sure if this is worth the effort 🙃] I did have an attempt at moving it to the top-level, but decided to do it in a separate PR. I need to come back to that. I'm not entirely sure how best to approach that so that `nvidia_x11` can override it.	19:59:27
aidalgol	I'm not convinced that having a single fixed version of `libXNVCtrl` is worth the trouble, but I do want to move it to the top level.	20:00:03
15 Jun 2024
	shekhinah set their display name to shekhinah.	08:46:32
matthewcroughan	Did you guys know that `python311Packages.tensorrt` is broken because someone updated `pkgs/development/cuda-modules/tensorrt/releases.nix` without checking that it broke any derivations?	15:14:46
matthewcroughan	`[astraluser@edward:~/Downloads/f/TensorRT-8.6.1.6]$ ls python/tensorrt-8.6.1-cp3 tensorrt-8.6.1-cp310-none-linux_x86_64.whl tensorrt-8.6.1-cp36-none-linux_x86_64.whl tensorrt-8.6.1-cp38-none-linux_x86_64.whl tensorrt-8.6.1-cp311-none-linux_x86_64.whl tensorrt-8.6.1-cp37-none-linux_x86_64.whl tensorrt-8.6.1-cp39-none-linux_x86_64.whl`	15:15:11
matthewcroughan	python3.11-tensorrt> /nix/store/d3dzfy4amjl826fb8j00qp1d9887h7hm-stdenv-linux/setup: line 131: pop_var_context: head of shell_variables not a function context error: builder for '/nix/store/8pw2fjq86vbkdd6s1bl6axfkhbnm18lr-python3.11-tensorrt-8.6.1.6.drv' failed with exit code 2; last 10 log lines: > Using pythonImportsCheckPhase > Sourcing python-namespaces-hook > Sourcing python-catch-conflicts-hook.sh > Sourcing auto-add-driver-runpath-hook > Using autoAddDriverRunpath > Sourcing fix-elf-files.sh > Running phase: unpackPhase > tar: TensorRT-8.6.1.6/python/tensorrt-8.6.1.6-cp311-none-linux_x86_64.whl: Not found in archive > tar: Exiting with failure status due to previous errors > /nix/store/d3dzfy4amjl826fb8j00qp1d9887h7hm-stdenv-linux/setup: line 131: pop_var_context: head of shell_variables not a function context For full logs, run 'nix log /nix/store/8pw2fjq86vbkdd6s1bl6axfkhbnm18lr-python3.11-tensorrt-8.6.1.6.drv'.	15:15:34
matthewcroughan	they removed the `.6` from the release	15:15:47
matthewcroughan	`TensorRT-8.6.1.6/python/tensorrt-8.6.1.6-cp311-none-linux_x86_64.whl` is wrong; `TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp311-none-linux_x86_64.whl` is correct	15:16:11
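The mismatch between the two paths above can be reproduced in a few lines of Nix. This is only a sketch, not how `releases.nix` or the tensorrt derivation actually computes the path: the names `wheelVersion` and `pythonTag` are made up for illustration, and the only facts taken from the log are the two quoted paths. It assumes `<nixpkgs>` is on `NIX_PATH`.

```nix
# Sketch: the tarball directory keeps the four-component version,
# while the wheel file name uses only the first three components.
let
  lib = import <nixpkgs/lib>;

  version = "8.6.1.6";   # release version, as in the failing derivation name
  pythonTag = "cp311";   # hypothetical name for the CPython tag in the wheel

  # "8.6.1.6" -> [ "8" "6" "1" "6" ] -> [ "8" "6" "1" ] -> "8.6.1"
  wheelVersion = lib.concatStringsSep "." (lib.take 3 (lib.splitVersion version));
in
"TensorRT-${version}/python/tensorrt-${wheelVersion}-${pythonTag}-none-linux_x86_64.whl"
```

Saved to a file and evaluated with `nix-instantiate --eval`, this should produce the path marked as correct above.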
SomeoneSerge (back on matrix)	[In reply to @matthewcroughan:defenestrate.it: Did you guys know that `python311Packages.tensorrt` is broken because someone updated `pkgs/development/cuda-modules/tensorrt/releases.nix` without checking that it broke any derivations?] Nvidia prevents unattended downloads, of course it broke	16:08:17
matthewcroughan	God we need archive-org-pkgs	16:22:55
teto	[In reply to @connorbaker:matrix.org: What revision of nixpkgs are you on? `master` fails to build (`go-stable-diffusion` errors during CMake configure)] Right, sorry, I had disabled diffusion in an overlay. I've checked that it works on master now (following the local-ai 2.16 bump today). I've opened https://github.com/NixOS/nixpkgs/issues/320145 to help myself collect the info.	22:26:22
17 Jun 2024
	grw00 joined the room.	12:25:16
grw00	hey all, has anyone had success using cuda libraries inside a docker container built with nix? i don't mean running a cuda container on a nixos host but the opposite: running a nix-built container containing a cuda program on another host. i build a container with nix and pytorch etc. and run it on runpod; it doesn't see the nvidia drivers/device though, so i guess i am missing something. currently i have: `dockerImages.default = pkgs.dockerTools.streamLayeredImage { name = "ghcr.io/my-image"; tag = "latest"; contents = [ pkgs.bash pkgs.uutils-coreutils-noprefix pkgs.cacert pkgs.libnvidia-container pythonEnv ]; config = { Cmd = [ "${pkgs.bash}/bin/bash" ]; Env = [ "CUDA_PATH=${pkgs.cudatoolkit}" "LD_LIBRARY_PATH=${pkgs.linuxPackages_5_4.nvidia_x11}/lib" ]; }; };`	12:30:49
SomeoneSerge (back on matrix)	grw00: are you using CDI or the runtime wrappers? Either way you need to have the drivers exposed in `LD_LIBRARY_PATH` or mounted under `/run/opengl-driver/lib`	12:31:13
grw00	not sure what CDI is, i understand i need the /run/opengl-driver but i'm not sure how to achieve that in docker container	12:32:09
SomeoneSerge (back on matrix)	[In reply to @grw00:matrix.org: hey all, has anyone had success using cuda libraries inside a docker container built with nix? i don't mean running a cuda container on a nixos host but the opposite: running a nix-built container containing a cuda program on another host. i build a container with nix and pytorch etc. and run it on runpod; it doesn't see the nvidia drivers/device though, so i guess i am missing something. currently i have: `dockerImages.default = pkgs.dockerTools.streamLayeredImage { name = "ghcr.io/my-image"; tag = "latest"; contents = [ pkgs.bash pkgs.uutils-coreutils-noprefix pkgs.cacert pkgs.libnvidia-container pythonEnv ]; config = { Cmd = [ "${pkgs.bash}/bin/bash" ]; Env = [ "CUDA_PATH=${pkgs.cudatoolkit}" "LD_LIBRARY_PATH=${pkgs.linuxPackages_5_4.nvidia_x11}/lib" ]; }; };`] Hard coding linuxPackages in the image is a bad idea. With cuda you normally don't want drivers in the image, you want the host's drivers mounted in the container	12:33:17
SomeoneSerge (back on matrix)	No need for libnvidia-container in the image either, I think	12:34:17
grw00	[In reply to @ss:someonex.net: Hard coding linuxPackages in the image is a bad idea. With cuda you normally don't want drivers in the image, you want the host's drivers mounted in the container] ah kk, got it. i'm specifically trying to use this on runpod.io, i don't think they offer this as a possibility. it seems like the images they offer all have cuda installed in the image	12:35:10
SomeoneSerge (back on matrix)	[In reply to @grw00:matrix.org: not sure what CDI is, i understand i need the /run/opengl-driver but i'm not sure how to achieve that in docker container] CDI (Container Device Interface) is the new thing where you specify in a JSON file what to mount where in the containers	12:36:28
SomeoneSerge (back on matrix)	[In reply to @grw00:matrix.org: ah kk, got it. i'm specifically trying to use this on runpod.io, i don't think they offer this as a possibility. it seems like the images they offer all have cuda installed in the image] They have to have a driver on the host, it's separate from the cuda toolkit	12:37:40
SomeoneSerge (back on matrix)	[In reply to @grw00:matrix.org: not sure what CDI is, i understand i need the /run/opengl-driver but i'm not sure how to achieve that in docker container] Can you bind mount it using CLI flags maybe?	12:38:09
SomeoneSerge (back on matrix)	Bottom line is: this is not about the image, it's about the host configuration	12:38:39
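Putting the advice above together, the image from the question might look roughly like the sketch below: no `nvidia_x11`, no `libnvidia-container`, and `LD_LIBRARY_PATH` pointing at the location where the host is expected to mount its driver libraries. Several things here are assumptions rather than anything stated in the chat: the `pythonEnv` definition is a stand-in (the original was not shown), `cudaSupport` and `allowUnfree` are enabled so that a CUDA-enabled torch can be built at all, and whether a given host actually mounts drivers at `/run/opengl-driver/lib` depends entirely on that host.

```nix
# Sketch only: the image from the question, minus the driver bits.
{ pkgs ? import <nixpkgs> {
    config.allowUnfree = true;   # assumption: needed for the CUDA packages
    config.cudaSupport = true;   # assumption: build torch with CUDA enabled
  }
}:
let
  # Hypothetical stand-in for the `pythonEnv` used in the question.
  pythonEnv = pkgs.python3.withPackages (ps: [ ps.torch ]);
in
pkgs.dockerTools.streamLayeredImage {
  name = "ghcr.io/my-image";
  tag = "latest";
  contents = [
    pkgs.bash
    pkgs.uutils-coreutils-noprefix
    pkgs.cacert
    pythonEnv
    # No pkgs.libnvidia-container and no nvidia_x11 here.
  ];
  config = {
    Cmd = [ "${pkgs.bash}/bin/bash" ];
    Env = [
      # Kept from the question.
      "CUDA_PATH=${pkgs.cudatoolkit}"
      # Assumes the host mounts its own driver libraries at this path.
      "LD_LIBRARY_PATH=/run/opengl-driver/lib"
    ];
  };
}
```

The image itself stays driver-agnostic this way; making the host's `libcuda` visible inside the container is still up to the host configuration.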
grw00	ok good info thx. i will check running one of the containers they offer (that does work) and see if there are any external mounts for cuda drivers, i think not though	12:40:49
grw00	`❯ ssh 6r0gwnq7twsots-644110b1@ssh.runpod.io -- RUNPOD.IO -- Enjoy your Pod #6r0gwnq7twsots ^_^ bash-5.2# nvidia-smi bash: /usr/bin/nvidia-smi: cannot execute: required file not found`	12:41:31
SomeoneSerge (back on matrix)	There's one thing you could do at the image level: anticipating that the host configuration assumes FHS (= is broken and not cross-platform), you could wrap your entrypoint with numtide/nix-gl-host (`nixglhost`), which will separate the meat from the flies and put libcuda (probably mounted in `/usr/lib`) in `LD_LIBRARY_PATH` without any extra breakages	12:43:33
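A rough sketch of what wrapping the entrypoint could look like is below. It is heavily hedged: the `nixglhost` argument is a derivation the caller has to supply (how to obtain it is not shown here), and both the `bin/nixglhost` path and the "wrapper followed by the program to run" invocation are assumptions that should be checked against the nix-gl-host README before relying on them.

```nix
# Sketch only: run the container's shell through the nixglhost wrapper.
# `nixglhost` is a caller-supplied derivation (assumption), not a nixpkgs
# attribute referenced by name.
{ pkgs ? import <nixpkgs> { }, nixglhost }:
pkgs.dockerTools.streamLayeredImage {
  name = "ghcr.io/my-image";
  tag = "latest";
  contents = [ pkgs.bash ];
  config.Cmd = [
    # Assumed binary location and calling convention: wrapper, then program.
    "${nixglhost}/bin/nixglhost"
    "${pkgs.bash}/bin/bash"
  ];
}
```

The point of the design is that the wrapper looks for the host's driver libraries at container start, so nothing driver-specific needs to be baked into the image.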
