| 22 Apr 2025 |
luke-skywalker | ereslibre: where exactly should I open an issue? On the nixpkgs GitHub? If so, how do I indicate that it is about the nvidia-container-toolkit?
So far, the attached config got me to:
1. docker (+compose) runtime working with CUDA workloads:
```console
❯ docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Tue Apr 22 11:08:39 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3080 Ti On | 00000000:01:00.0 On | N/A |
| 0% 50C P8 42W / 350W | 773MiB / 12288MiB | 20% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
```
2. containerd runtime directly running CUDA containers
3. rke2 running with a `config.toml` that points to all needed runtimes in the Nix store:
```nix
hardware.nvidia-container-toolkit = {
  enable = true;
  # package = pkgs.nvidia-container-toolkit;
  # Use UUID for device naming - better for multi-GPU setups
  device-name-strategy = "uuid"; # one of "index", "uuid", "type-index"
  # Mount additional directories for compatibility
  mount-nvidia-docker-1-directories = true;
  # Mount NVIDIA executables into container
  mount-nvidia-executables = true;
};

hardware.nvidia = {
  modesetting.enable = true;
  nvidiaPersistenced = true;
};

services.rke2 = {
  enable = true;
  role = "server";
  nodeName = "workstation-0";
  cni = "canal"; # | canal
  # Set the node IP directly
  nodeIP = "${systemProfile.network.staticIP}";
  debug = true;
  # Set cluster CIDR ranges properly
  extraFlags = [
    "--kubelet-arg=cgroup-driver=systemd"
    "--cluster-cidr=10.42.0.0/16"
    "--service-cidr=10.43.0.0/16"
    "--disable-cloud-controller" # Disable cloud controller for bare metal
    # "--kubelet-arg=feature-gates=DevicePlugins=true" # Add this for device plugins
  ];
  disable = [ "traefik" ]; # "servicelb"
  # environmentVars = {
  #   NVIDIA_VISIBLE_DEVICES = "all";
  #   NVIDIA_DRIVER_CAPABILITIES = "all";
  #   # Set NVIDIA driver root to the standard location
  #   # NVIDIA_DRIVER_ROOT = "/usr/lib/nvidia";
  #   # Home directory for RKE2
  #   HOME = "/root";
  # };
};
```
`/var/lib/rancher/rke2/agent/etc/containerd/config.toml`:
```toml
# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 3
root = "/var/lib/rancher/rke2/agent/containerd"
state = "/run/k3s/containerd"
[grpc]
address = "/run/k3s/containerd/containerd.sock"
[plugins.'io.containerd.internal.v1.opt']
path = "/var/lib/rancher/rke2/agent/containerd"
[plugins.'io.containerd.grpc.v1.cri']
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
[plugins.'io.containerd.cri.v1.runtime']
enable_selinux = false
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
device_ownership_from_security_context = false
[plugins.'io.containerd.cri.v1.images']
snapshotter = "overlayfs"
disable_snapshot_annotations = true
[plugins.'io.containerd.cri.v1.images'.pinned_images]
sandbox = "index.docker.io/rancher/mirrored-pause:3.6"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runhcs-wcow-process]
runtime_type = "io.containerd.runhcs.v1"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
BinaryName = "/var/lib/rancher/rke2/data/v1.31.7-rke2r1-7f85e977b85d/bin/nvidia-container-runtime"
SystemdCgroup = true
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
BinaryName = "/var/lib/rancher/rke2/data/v1.31.7-rke2r1-7f85e977b85d/bin/nvidia-container-runtime.cdi"
SystemdCgroup = true
[plugins.'io.containerd.cri.v1.images'.registry]
config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"
```

```console
❯ lsmod | grep nvidia
nvidia_drm 139264 81
nvidia_modeset 1830912 26 nvidia_drm
nvidia_uvm 3817472 2
nvidia 97120256 533 nvidia_uvm,nvidia_modeset
video 81920 2 asus_wmi,nvidia_modeset
drm_ttm_helper 20480 2 nvidia_drm
```
However, when trying to deploy the NVIDIA device plugin (via the rke2 operator, as a plain DaemonSet, or as the Helm chart from the nvidia-device-plugin repo), it fails to detect the CUDA environment, for example by complaining about the "auto" strategy.
| 11:08:46 |
luke-skywalker | The device plugin pod kube-system/nvidia-device-plugin-daemonset-j8rmc logs:
```
I0422 11:10:39.192906 1 main.go:235] "Starting NVIDIA Device Plugin" version=<
3c378193
commit: 3c378193fcebf6e955f0d65bd6f2aeed099ad8ea
>
I0422 11:10:39.193038 1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I0422 11:10:39.193372 1 main.go:245] Starting OS watcher.
I0422 11:10:39.193730 1 main.go:260] Starting Plugins.
I0422 11:10:39.193772 1 main.go:317] Loading configuration.
I0422 11:10:39.194842 1 main.go:342] Updating config with default resource matching patterns.
I0422 11:10:39.195036 1 main.go:353]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"mpsRoot": "",
"nvidiaDriverRoot": "/",
"nvidiaDevRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"useNodeFeatureAPI": null,
"deviceDiscoveryStrategy": "auto",
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
},
"imex": {}
}
I0422 11:10:39.195045 1 main.go:356] Retrieving plugins.
E0422 11:10:39.195368 1 factory.go:112] Incompatible strategy detected auto
E0422 11:10:39.195374 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0422 11:10:39.195378 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0422 11:10:39.195382 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0422 11:10:39.195385 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0422 11:10:39.195390 1 main.go:381] No devices found. Waiting indefinitely.
```
| 11:14:24 |
luke-skywalker | OK, I think setting
"--default-runtime=nvidia"
"--node-label=nvidia.com/gpu.present=true"
let the rke2 server find two NVIDIA runtimes.
Now I get a completely different error, about an undefined symbol in the glibc being used.
Feels like getting closer though 🙂
| 13:39:48 |
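For reference, those two flags presumably just extend the `services.rke2.extraFlags` list from the config above; a minimal, untested sketch (the flag values are taken verbatim from the message):

```nix
services.rke2.extraFlags = [
  "--kubelet-arg=cgroup-driver=systemd"
  "--cluster-cidr=10.42.0.0/16"
  "--service-cidr=10.43.0.0/16"
  "--disable-cloud-controller"
  # Make containerd's 'nvidia' runtime the default and label the node as a GPU node,
  # so device-plugin deployments selecting on nvidia.com/gpu.present land here.
  "--default-runtime=nvidia"
  "--node-label=nvidia.com/gpu.present=true"
];
```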
ereslibre | luke-skywalker: you can open an issue in nixpkgs with the following template: https://github.com/NixOS/nixpkgs/issues/new?template=03_bug_report_nixos.yml
You can use something like “nixos/nvidia-container-toolkit: containerd does not honor CDI specs” instead of “nixos/MODULENAME: BUG TITLE”. | 19:22:57 |
ereslibre | Looks like you will need something along the lines of https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#containerd-configuration | 19:38:49 |
ereslibre | I have to reproduce the issue myself and verify the fix. When done, I am positive that we can also automate this on the NixOS module side. | 19:39:32 |
luke-skywalker | OK, finally, after 4 days of falling from one rabbit hole into the next deeper one, I was able to deploy the NVIDIA device plugin onto my initial cluster node and run CUDA workloads 🥳
I will now proceed to replicate the setup on a second machine with another NVIDIA GPU, join it to the cluster, and see if I can do pipeline-parallelised vLLM inference. | 20:44:24 |
luke-skywalker | yes that was indeed the last missing piece of the puzzle! | 20:45:13 |
luke-skywalker | ereslibre: interestingly enough, I was so far only able to successfully deploy the DaemonSet with v0.14 and v0.15. Using the latest v0.17 results in a glibc error. | 22:12:36 |
luke-skywalker | 🙏 Thx for pointing me to this. I have been scratching my head about what the right channel and format is to give feedback to the NixOS project. | 22:13:27 |
SomeoneSerge (back on matrix) | connor (he/him) (UTC-7): look familiar? https://mastodon.social/@effinbirds/114383881424822335 | 23:06:37 |
| 23 Apr 2025 |
ereslibre | Glad it worked! :) | 05:48:55 |
SomeoneSerge (back on matrix) | luke-skywalker: looking forward to reading the blog post xD | 12:03:57 |
luke-skywalker | Blog post?
Shouldn't everyone have the joy of fighting through those dungeons of rabbit holes and coming out the other end with some awesome loot?
Will do when I find the time to write it down as a guide / article or make a PR to either rke2 or nvidia-container-toolkit. Might even wrap it into its own system module. But the main constraint is available time, since this is just one of the stepping stones to a system to federate distributed "AI" capabilities.
Don't actually want to be too public before I have a working "kernel" of the envisioned system. | 15:22:21 |
ereslibre | I might be able to open a PR to enable CDI on containerd this weekend | 21:14:48 |
luke-skywalker | FYI: with the virtualisation.containerd module (not the one used by rke2), it already works out of the box. | 21:16:15 |
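A minimal sketch of that standalone setup as described, using the same toolkit option as the config above and the NixOS-managed containerd rather than the one bundled with rke2:

```nix
{
  # hardware.nvidia-container-toolkit generates CDI specs for the GPUs;
  # per the report above, the NixOS virtualisation.containerd instance
  # picks them up without further configuration.
  hardware.nvidia-container-toolkit.enable = true;
  virtualisation.containerd.enable = true;
}
```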
| 24 Apr 2025 |
ereslibre | luke-skywalker: unless I’m missing something, nothing is setting https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#containerd-configuration, right? You had to do this manually, right? | 06:06:26 |
luke-skywalker | Funnily, from all the detours I took to make it work, I thought rke2 was doing that, but I think I must have done it by hand and forgotten about it.
So yes, you need to provide a config.toml.tmpl with an nvidia-cdi runtime defined, pointing to the runtime binary, and set [plugins."io.containerd.grpc.v1.cri".cdi].
Could you give me the TL;DR on why using image: nvcr.io/nvidia/k8s-device-plugin:v0.17.x fails with a glibc issue?
My understanding is that it was built with a newer version of glibc than the one on my system (2.40)? Any way to solve this, or should I simply stick to v0.16.x until the glibc version on the NixOS unstable channel is compatible again?
| 12:20:36 |
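A rough sketch of the kind of config.toml.tmpl additions being described, combining the CDI switch from the container-device-interface README linked earlier with the nvidia-cdi runtime entry that rke2 already generates. Table names follow the README's layout and will differ for the version = 3 file shown above, so treat this purely as an illustration, not a drop-in file:

```toml
# Sketch only: enable CDI in the CRI plugin and define a CDI-aware NVIDIA runtime
# pointing at rke2's bundled binary (path copied from the generated config above).
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia-cdi"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia-cdi".options]
  BinaryName = "/var/lib/rancher/rke2/data/v1.31.7-rke2r1-7f85e977b85d/bin/nvidia-container-runtime.cdi"
  SystemdCgroup = true
```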
luke-skywalker | Ooh, interesting. How does it compare to vLLM?
I see it supports device maps. Is that for pipeline parallelism, so GPU devices on different nodes / machines as well?
Is it somehow affiliated with Mistral AI, or what's the reason for the name of the library? ;) | 14:12:59 |
luke-skywalker | Also, does that work on k8s clusters? 🤔 | 14:13:43 |
Gaétan Lepage | I haven't used it much myself, as I don't own a big enough GPU.
As far as I know, it is not affiliated with Mistral (the company). I guess it's the same as with "ollama" and Llama (Meta). | 15:43:10 |
luke-skywalker | It keeps getting better, though. I have now switched from the DaemonSet deployment of the device plugin to a Helm deployment with custom values. This made it possible to also enable time slicing of the available GPU 🥳 | 16:10:02 |
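For context, the time-slicing knob lives in the device plugin's own config file, the same structure that shows up as `sharing.timeSlicing` in the pod log further up. A hedged sketch of such a config, to be passed to the Helm chart as a custom config (field names are from memory of the upstream nvidia-device-plugin docs, so verify them there):

```yaml
# Sketch: advertise several schedulable nvidia.com/gpu slots per physical GPU.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```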
luke-skywalker | Thx for the info; yeah, the same as with ollama was my assumption.
Guess I'll stick to the vLLM deployment with Helm on k8s. | 16:37:54 |