
NixOS CUDA

CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda

22 Apr 2025
@luke-skywalker:matrix.orgluke-skywalker * My suspicion at this point is that something is looking for a specific path resolution of either the nvidia driver / library or the container runtime. 🤷‍♂️ 11:00:58
@luke-skywalker:matrix.orgluke-skywalker *

ereslibre: where exactly should I file an issue? On the nixpkgs GitHub? If so, how do I indicate that it's about the nvidia-container-toolkit?

So far, the attached config got me to:

  1. docker (+compose) runtime working with CUDA workloads:

❯ docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Tue Apr 22 11:08:39 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   50C    P8             42W /  350W |     773MiB /  12288MiB |     20%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

  2. containerd runtime directly running CUDA containers
  3. rke2 running with a config.toml that points to all needed runtimes in the Nix store.
   hardware.nvidia-container-toolkit = {
      enable = true;
      # package = pkgs.nvidia-container-toolkit;
      # # Use UUID for device naming - better for multi-GPU setups
      device-name-strategy = "uuid"; # one of "index", "uuid", "type-index"
      # Mount additional directories for compatibility
      mount-nvidia-docker-1-directories = true;

      # Mount NVIDIA executables into container
      mount-nvidia-executables = true;
    };
    hardware.nvidia = {
      modesetting.enable = true;
      nvidiaPersistenced = true;
    };
    services.rke2 = {
      enable = true;
      role = "server";
      nodeName = "workstation-0";
      cni = "canal"; # | canal

      # Set the node IP directly
      nodeIP = "${systemProfile.network.staticIP}";
      debug = true;

      # Set cluster CIDR ranges properly
      extraFlags = [
        "--kubelet-arg=cgroup-driver=systemd"
        "--cluster-cidr=10.42.0.0/16"
        "--service-cidr=10.43.0.0/16"
        "--disable-cloud-controller" # Disable cloud controller for bare metal
        # "--kubelet-arg=feature-gates=DevicePlugins=true" # Add this for device plugins
      ];
      disable = ["traefik"]; # "servicelb"
      # environmentVars = {
      #   NVIDIA_VISIBLE_DEVICES = "all";
      #   NVIDIA_DRIVER_CAPABILITIES = "all";

      #   # Set NVIDIA driver root to the standard location
      #   # NVIDIA_DRIVER_ROOT = "/usr/lib/nvidia";

      #   # Home directory for RKE2
      #   HOME = "/root";
      # };
    };
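
For reference, a quick way to sanity-check that the toolkit module actually generated CDI specs on the host. nvidia-ctk ships with nvidia-container-toolkit; the /var/run/cdi location is the toolkit's usual default spec directory and is an assumption here:

# List the CDI devices the toolkit knows about
nvidia-ctk cdi list

# Inspect the generated spec file(s) directly (default spec directory, assumption)
ls -l /var/run/cdi/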

/var/lib/rancher/rke2/agent/etc/containerd/config.toml

# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 3
root = "/var/lib/rancher/rke2/agent/containerd"
state = "/run/k3s/containerd"

[grpc]
  address = "/run/k3s/containerd/containerd.sock"

[plugins.'io.containerd.internal.v1.opt']
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins.'io.containerd.grpc.v1.cri']
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"

[plugins.'io.containerd.cri.v1.runtime']
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  device_ownership_from_security_context = false

[plugins.'io.containerd.cri.v1.images']
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins.'io.containerd.cri.v1.images'.pinned_images]
  sandbox = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runhcs-wcow-process]
  runtime_type = "io.containerd.runhcs.v1"

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
  runtime_type = "io.containerd.runc.v2"

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
  BinaryName = "/var/lib/rancher/rke2/data/v1.31.7-rke2r1-7f85e977b85d/bin/nvidia-container-runtime"
  SystemdCgroup = true

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
  runtime_type = "io.containerd.runc.v2"

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
  BinaryName = "/var/lib/rancher/rke2/data/v1.31.7-rke2r1-7f85e977b85d/bin/nvidia-container-runtime.cdi"
  SystemdCgroup = true

[plugins.'io.containerd.cri.v1.images'.registry]
  config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"

lsmod | grep nvidia
nvidia_drm            139264  81
nvidia_modeset       1830912  26 nvidia_drm
nvidia_uvm           3817472  2
nvidia              97120256  533 nvidia_uvm,nvidia_modeset
video                  81920  2 asus_wmi,nvidia_modeset
drm_ttm_helper         20480  2 nvidia_drm

However, when trying to deploy the NVIDIA device plugin, whether via the rke2 operator, as a plain DaemonSet, or as the Helm chart from the nvidia-device-plugin repo, it fails to detect the CUDA environment, for example by complaining about the "auto" strategy.

kube-system/nvidia-device-plugin-daemonset-j8rmc:

I0422 11:10:39.192906       1 main.go:235] "Starting NVIDIA Device Plugin" version=<
    3c378193
    commit: 3c378193fcebf6e955f0d65bd6f2aeed099ad8ea
 >
I0422 11:10:39.193038       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I0422 11:10:39.193372       1 main.go:245] Starting OS watcher.
I0422 11:10:39.193730       1 main.go:260] Starting Plugins.
I0422 11:10:39.193772       1 main.go:317] Loading configuration.
I0422 11:10:39.194842       1 main.go:342] Updating config with default resource matching patterns.
I0422 11:10:39.195036       1 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0422 11:10:39.195045       1 main.go:356] Retrieving plugins.
E0422 11:10:39.195368       1 factory.go:112] Incompatible strategy detected auto
E0422 11:10:39.195374       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0422 11:10:39.195378       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0422 11:10:39.195382       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0422 11:10:39.195385       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0422 11:10:39.195390       1 main.go:381] No devices found. Waiting indefinitely.
11:14:24
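
For reference, deploying the plugin as a plain DaemonSet usually means applying the static manifest from the upstream repo; the path below follows the NVIDIA/k8s-device-plugin repo layout and the version tag is illustrative, so treat both as assumptions:

# Hypothetical example; adjust the version tag to the release you actually want
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml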
@luke-skywalker:matrix.orgluke-skywalker

ok, I think setting

  "--default-runtime=nvidia"
  "--node-label=nvidia.com/gpu.present=true"

let the rke2 server find two nvidia runtimes.

Now I get a completely different error about an undefined symbol in the used glibc.

Feels like getting closer though 😅 🤏

13:39:48
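
In NixOS terms those flags would presumably live next to the ones already set in the rke2 module above; a minimal sketch:

    services.rke2.extraFlags = [
      "--default-runtime=nvidia"
      "--node-label=nvidia.com/gpu.present=true"
    ];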
@ereslibre:ereslibre.socialereslibre

luke-skywalker: you can open an issue in nixpkgs with the following template: https://github.com/NixOS/nixpkgs/issues/new?template=03_bug_report_nixos.yml

You can use something like "nixos/nvidia-container-toolkit: containerd does not honor CDI specs" instead of "nixos/MODULENAME: BUG TITLE"

19:22:57
@ereslibre:ereslibre.socialereslibre Looks like you will need something along the lines of https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#containerd-configuration 19:38:49
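
The linked section boils down to enabling CDI in containerd's CRI plugin. For a classic (config version 2) config.toml the README shows roughly the following; note that section names differ in containerd 2.x / config version 3 files like the rke2-generated one above:

[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]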
@ereslibre:ereslibre.socialereslibre I have to reproduce the issue myself and verify the fix. Once done, I am positive that we can also automate this on the NixOS module side 19:39:32
@luke-skywalker:matrix.orgluke-skywalker ok, finally, after 4 days of falling from one rabbit hole into the next deeper one, I was able to deploy the nvidia device plugin onto my initial cluster node and run CUDA workloads 🥳 Will now proceed to replicate the setup on a second machine with another NVIDIA GPU, join the cluster, and see if I can do pipeline-parallelised vLLM inference 🤞 20:44:24
@luke-skywalker:matrix.orgluke-skywalker yes, that was indeed the last missing piece of the puzzle! 20:45:13
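
A quick check that the node now actually advertises the GPU resource might look like this (node name taken from the config above):

kubectl describe node workstation-0 | grep -i 'nvidia.com/gpu'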
@luke-skywalker:matrix.orgluke-skywalker ereslibre: interestingly enough, I was so far only able to successfully deploy the daemonset v14 and v15. Using the latest v17 results in a glibc error. 22:12:36
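
One way to pin down such a mismatch is to compare the host's glibc with the GLIBC symbol versions required by the driver libraries that get mounted into the plugin container. The library path below is the usual NixOS location for the proprietary driver and is an assumption:

# Host glibc version
ldd --version | head -1

# Highest GLIBC symbol versions the host's libcuda requires
objdump -T /run/opengl-driver/lib/libcuda.so | grep -o 'GLIBC_[0-9.]*' | sort -uV | tail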
@luke-skywalker:matrix.orgluke-skywalker * 🙏🙏🙏 thx for pointing me to this. I have been scratching my head on what the right channel and format is to give feedback to the NixOS project. 22:13:47
@ss:someonex.netSomeoneSerge (back on matrix) connor (he/him) (UTC-7): look familiar? https://mastodon.social/@effinbirds/114383881424822335 23:06:37
23 Apr 2025
@ereslibre:ereslibre.socialereslibre Glad it worked! :) 05:48:55
@ss:someonex.netSomeoneSerge (back on matrix) luke-skywalker: looking forward to reading the blog post xD 12:03:57
@luke-skywalker:matrix.orgluke-skywalker blog post? Shouldn't everyone have the joy of fighting through those dungeons of rabbit holes and coming out the other end with some awesome loot? 😊 Will do when I find the time to write it down as a guide / article, or make a PR to either rke2 or nvidia-container-toolkit. Might even wrap it into its own system module. But the main thing is available time, since this is just one of the stepping stones to a system to federate distributed "AI" capabilities. Don't actually want to be too public before I have a working "kernel" of the envisioned system. 15:22:21
@ereslibre:ereslibre.socialereslibre I might be able to open a PR to enable CDI on containerd this weekend 21:14:48
@luke-skywalker:matrix.orgluke-skywalker FYI, with the virtualisation.containerd module (not the one used by rke2) it already works out of the box. 21:16:15
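
For comparison, that standalone setup is just the two modules combined; a minimal sketch, assuming nothing beyond the defaults is needed (as the message above suggests):

    {
      hardware.nvidia-container-toolkit.enable = true;
      virtualisation.containerd.enable = true;
    }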
24 Apr 2025
@ereslibre:ereslibre.socialereslibre luke-skywalker: unless I'm missing something, nothing is setting https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#containerd-configuration, right? You had to do this manually, right? 06:06:26
@luke-skywalker:matrix.orgluke-skywalker *

funnily, from all the detours I took to make it work, I thought rke2 was doing that, but I think I must have done that by hand and forgot about it.

So yes, you need to provide a config.toml.tmpl with nvidia-cdi defined pointing to the runtime binary, and set [plugins."io.containerd.grpc.v1.cri".cdi]

Could you give me the TLDR on why using image: nvcr.io/nvidia/k8s-device-plugin:v0.17.x

fails with a glibc issue?

My understanding is that it was built with a newer version of glibc than the one on my system (2.40)? Any way to solve this, or should I simply stick to v0.16.x until the glibc version on the NixOS unstable channel is compatible again?

12:20:36
@luke-skywalker:matrix.orgluke-skywalker ui, interesting. How does it compare to vLLM? I see it supports device maps. Is that for pipeline parallelism, so GPU devices on different nodes / machines as well? Is it somehow affiliated with mistral-ai, or what's the reason for the name of the library? ;) 14:12:59
@luke-skywalker:matrix.orgluke-skywalker also does that work on k8s clusters? 🤔 14:13:43
