| 13 Apr 2025 |
| ereslibre joined the room. | 11:43:29 |
ereslibre | Hi everyone! I am looking at a bug we have with CDI (Container Device Interface, for forwarding GPUs to containers): https://github.com/NixOS/nixpkgs/issues/397065
I think the user has a correct configuration (unless there are settings that were not mentioned in the issue). My main question is: when using the datacenter driver, why is nvidia-container-toolkit reporting:
ERRO[0000] failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND
Do you have any idea why NVML would not be present in this environment?
| 11:45:34 |
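For context, NVML is libnvidia-ml.so and is shipped with the NVIDIA driver itself, so ERROR_LIBRARY_NOT_FOUND usually means the CDI generator cannot see the driver's libraries at all. A minimal sketch of the options involved, assuming current nixpkgs option names (this is not the reporter's actual configuration):

hardware.nvidia.datacenter.enable = true;         # datacenter (Tesla) driver branch; provides libnvidia-ml.so
hardware.nvidia-container-toolkit.enable = true;  # generates the CDI spec from the installed driver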
SomeoneSerge (back on matrix) | Hi! I have a small announcement to make.
I've been failing badly to keep up with the backlog as a maintainer, even though I've recently been able to spend some more time on Nixpkgs &c. Working in occasional 1:1 meetings, otoh, has always felt comparatively productive. We've just had another call with Gaétan Lepage and I found it was nice, so I now want to try the following: https://md.someonex.net/s/9S4E00sIb#
This is not exactly "official", I'm not posting this e.g. on Discourse until I'm more confident, but as such it's an open invitation.
| 14:52:27 |
Gaétan Lepage | Indeed, it was great! We were able to finally finish fixing mistral-rs's cuda support! | 15:24:02 |
| 15 Apr 2025 |
ereslibre | BTW folks, if you have a moment, I'd love to get this one merged: https://github.com/NixOS/nixpkgs/pull/367769 | 06:26:28 |
SomeoneSerge (back on matrix) | connor (he/him) (UTC-7): did you use something like josh for cuda-legacy? I suspect this produced at least a few pings 😅 | 13:35:27 |
connor (he/him) | I used https://github.com/newren/git-filter-repo — what would have pinged people? | 13:37:06 |
SomeoneSerge (back on matrix) | User handles in commit messages xD | 13:37:28 |
ereslibre | Hi! Given that https://github.com/NixOS/nixpkgs/pull/362197 recently had conflicts due to the treewide formatting, I closed it and reopened it as https://github.com/NixOS/nixpkgs/pull/398993. I think we can merge this one too | 21:24:34 |
ereslibre | We have been going back and forth with the author for a while, and I thought it would be good to go ahead on our side | 21:25:10 |
ereslibre | Thanks! | 21:29:32 |
| 17 Apr 2025 |
| luke-skywalker joined the room. | 09:38:30 |
luke-skywalker | Is this the right place to ask questions / get pointers on how to properly set up the CUDA container toolkit?
For Docker it seems to work when enabling the deprecated enableNvidia = true; flag. However, adding nvidia-container-toolkit to systemPackages, with or without hardware.nvidia-container-toolkit.enable = true;, doesn't seem to get it to run...
| 11:01:34 |
luke-skywalker | had no luck at all with containerd for k3s | 11:02:09 |
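For the Docker case, a minimal sketch of the CDI-based setup, assuming a reasonably recent nixpkgs and Docker (option names may differ on older channels):

hardware.nvidia-container-toolkit.enable = true;  # generate a CDI spec for the installed driver
virtualisation.docker.enable = true;
# virtualisation.docker.enableNvidia = true;      # deprecated wrapper-based path; not needed with CDI

With that in place, recent Docker releases can be handed a GPU via a fully-qualified CDI device name, e.g. --device nvidia.com/gpu=all; depending on the Docker version, its CDI feature may also need to be enabled.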
luke-skywalker | For anybody stumbling over this: I'm pretty sure I'm on the right track using CDI, having it work with docker (& compose).
Should have read the docs properly. The relevant section from the NixOS CUDA docs that got me here was all the way at the bottom: https://nixos.wiki/wiki/Nvidia#NVIDIA%20Docker%20not%20Working
| 14:38:50 |
luke-skywalker | from all I understand this gives a lot more flexibility to pass accelerators of different vendors to containerized workloads 🥳 | 14:39:36 |
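As an illustration of that flexibility, a hedged sketch of a declarative container requesting the GPU through a CDI device name (assumes the oci-containers module with a CDI-capable runtime behind it; the image tag is just an example):

virtualisation.oci-containers = {
  backend = "podman";  # or "docker"; both pass --device through to the runtime
  containers.cuda-smoke-test = {
    image = "nvidia/cuda:12.4.1-base-ubuntu22.04";  # example image
    devices = [ "nvidia.com/gpu=all" ];             # fully-qualified CDI device name
    cmd = [ "nvidia-smi" ];
  };
};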
SomeoneSerge (back on matrix) | Yes, CDI is the supported way (and has received a lot of care from @ereslibre), enableNvidia relies on end-of-life runtime wrappers | 16:18:38 |
SomeoneSerge (back on matrix) |
Should have read the docs properly. The relevant section from
Did you manage to get containerd to work?
| 16:20:27 |
ereslibre | +1, let us know if you run into any issues when enabling CDI :) | 19:31:30 |
| 18 Apr 2025 |
connor (he/him) | SomeoneSerge (UTC+U[-12,12]) I removed all the module system stuff from https://github.com/connorbaker/cuda-packages | 11:24:48 |
luke-skywalker | ereslibre: I got it to run with docker but I'm still struggling to get it to run with containerd and k8s-device-plugin. | 20:46:45 |
ereslibre | In reply to @luke-skywalker:matrix.org ereslibre: I got it to run with docker but I'm still struggling to get it to run with containerd and k8s-device-plugin. Interesting. If you feel like it, please open an issue and we can follow up. I did not try to run CDI with either of those | 20:48:38 |
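For standalone containerd (not the copy bundled with k3s/rke2, which is configured through config.toml.tmpl), a hedged sketch of enabling CDI in the CRI plugin, assuming containerd 1.7+ where enable_cdi and cdi_spec_dirs are understood:

virtualisation.containerd = {
  enable = true;
  settings.plugins."io.containerd.grpc.v1.cri" = {
    enable_cdi = true;                              # let the CRI plugin resolve CDI device names
    cdi_spec_dirs = [ "/etc/cdi" "/var/run/cdi" ];  # where the generated specs are expected to live
  };
};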
| 20 Apr 2025 |
SomeoneSerge (back on matrix) | Updated https://github.com/NVIDIA/build-system-archive-import-examples/issues/5 to reflect preference for the .note.dlopen section over eager-loading | 09:34:53 |
| @techyporcupine:matrix.org left the room. | 18:15:53 |
| 21 Apr 2025 |
luke-skywalker | Redacted or Malformed Event | 13:54:54 |
SomeoneSerge (back on matrix) | @luke-skywalker:matrix.org: the moderation bot is configured to drop all media in nixos spaces because there was a spam campaign disseminating csam matrix-wide, it's an unfortunate situation but the mods don't really have any other tools at their disposal | 19:48:15 |
| 22 Apr 2025 |
| jaredmontoya joined the room. | 09:32:38 |
luke-skywalker | ereslibre: Where exactly should I open an issue? On the nixpkgs GitHub? If so, how do I indicate that it's about the nvidia-container-toolkit?
So far, the attached config got me to:
- the docker (+ compose) runtime working with CUDA workloads
- the containerd runtime directly running CUDA containers
- rke2 running with a config.toml that points to all the needed runtimes in the Nix store
hardware.nvidia-container-toolkit = {
  enable = true;
  # package = pkgs.nvidia-container-toolkit;
  # Use UUID for device naming - better for multi-GPU setups
  device-name-strategy = "uuid"; # one of "index", "uuid", "type-index"
  # Mount additional directories for compatibility
  mount-nvidia-docker-1-directories = true;
  # Mount NVIDIA executables into container
  mount-nvidia-executables = true;
};
services.rke2 = {
  enable = true;
  role = "server";
  nodeName = "workstation-0";
  cni = "canal"; # | canal
  # Set the node IP directly
  nodeIP = "${systemProfile.network.staticIP}";
  debug = true;
  # Set cluster CIDR ranges properly
  extraFlags = [
    "--kubelet-arg=cgroup-driver=systemd"
    "--cluster-cidr=10.42.0.0/16"
    "--service-cidr=10.43.0.0/16"
    "--disable-cloud-controller" # Disable cloud controller for bare metal
    # "--kubelet-arg=feature-gates=DevicePlugins=true" # Add this for device plugins
  ];
  disable = [ "traefik" ]; # "servicelb"
  # environmentVars = {
  #   NVIDIA_VISIBLE_DEVICES = "all";
  #   NVIDIA_DRIVER_CAPABILITIES = "all";
  #   # Set NVIDIA driver root to the standard location
  #   # NVIDIA_DRIVER_ROOT = "/usr/lib/nvidia";
  #   # Home directory for RKE2
  #   HOME = "/root";
  # };
};
/var/lib/rancher/rke2/agent/etc/containerd/config.toml
# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 3
root = "/var/lib/rancher/rke2/agent/containerd"
state = "/run/k3s/containerd"
[grpc]
  address = "/run/k3s/containerd/containerd.sock"
[plugins.'io.containerd.internal.v1.opt']
  path = "/var/lib/rancher/rke2/agent/containerd"
[plugins.'io.containerd.grpc.v1.cri']
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
[plugins.'io.containerd.cri.v1.runtime']
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  device_ownership_from_security_context = false
[plugins.'io.containerd.cri.v1.images']
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
[plugins.'io.containerd.cri.v1.images'.pinned_images]
  sandbox = "index.docker.io/rancher/mirrored-pause:3.6"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
  SystemdCgroup = true
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runhcs-wcow-process]
  runtime_type = "io.containerd.runhcs.v1"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
  runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
  BinaryName = "/var/lib/rancher/rke2/data/v1.31.7-rke2r1-7f85e977b85d/bin/nvidia-container-runtime"
  SystemdCgroup = true
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
  runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
  BinaryName = "/var/lib/rancher/rke2/data/v1.31.7-rke2r1-7f85e977b85d/bin/nvidia-container-runtime.cdi"
  SystemdCgroup = true
[plugins.'io.containerd.cri.v1.images'.registry]
  config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"
However, when trying to deploy the NVIDIA device plugin (with the rke2 operator, as a plain DaemonSet, or as the Helm chart from the nvidia-device-plugin repo), it fails to detect the CUDA environment, for example by complaining about the auto strategy.
| 10:59:26 |