!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

286 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



15 Apr 2025
@ss:someonex.netSomeoneSerge (back on matrix) connor (he/him) (UTC-7): did you use something like josh for cuda-legacy? I suspect this produced at least a few pings 😅 13:35:27
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)I used https://github.com/newren/git-filter-repo — what would have pinged people?13:37:06
@ss:someonex.netSomeoneSerge (back on matrix)User handles in commit messages xD13:37:28
@ereslibre:ereslibre.socialereslibreHi! Given https://github.com/NixOS/nixpkgs/pull/362197 had conflicts recently due to the treewide formatting I closed it, and reopened it at https://github.com/NixOS/nixpkgs/pull/398993. I think we can merge this one too21:24:34
@ereslibre:ereslibre.socialereslibreWe have been going back and forth with the author for a while, and I thought it would be good to go ahead on our side21:25:10
@ereslibre:ereslibre.socialereslibreThanks!21:29:32
17 Apr 2025
@luke-skywalker:matrix.orgluke-skywalker joined the room.09:38:30
@luke-skywalker:matrix.orgluke-skywalker

is this the right place to ask questions / get pointers on how to properly set up the CUDA container toolkit?

For Docker it seems to work when enabling the deprecated enableNvidia = true; flag. However, I cannot seem to get it to run with nvidia-container-toolkit in systemPackages, either with or without hardware.nvidia-container-toolkit.enable = true;...

11:01:34
@luke-skywalker:matrix.orgluke-skywalkerhad no luck at all with containerd for k3s11:02:09
@luke-skywalker:matrix.orgluke-skywalkerfor anybody stumbling over this: I'm pretty sure I'm on the right track using CDI, having it work with docker (& compose). Should have read the docs properly. The relevant section from the NixOS CUDA docs that got me here was all the way at the bottom: https://nixos.wiki/wiki/Nvidia#NVIDIA%20Docker%20not%20Working 14:38:50
@luke-skywalker:matrix.orgluke-skywalkerfrom all I understand this gives a lot more flexibility to pass accelerators of different vendors to containerized workloads 🥳14:39:36
@ss:someonex.netSomeoneSerge (back on matrix) Yes, CDI is the supported way (and has received a lot of care from @ereslibre), enableNvidia relies on end-of-life runtime wrappers 16:18:38
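For readers following along, a minimal sketch of the CDI route described above. The option names are the ones used elsewhere in this thread; treat the exact shape as an assumption for a recent nixpkgs-unstable, not a definitive recipe:

```nix
{
  # Load the NVIDIA kernel driver and userspace.
  services.xserver.videoDrivers = [ "nvidia" ];

  # Generate a CDI specification for the installed GPUs instead of relying
  # on the end-of-life runtime wrappers behind virtualisation.docker.enableNvidia.
  hardware.nvidia-container-toolkit.enable = true;

  virtualisation.docker.enable = true;
}
```

Containers then request GPUs by CDI device name, e.g. `docker run --rm --device=nvidia.com/gpu=all … nvidia-smi`, as shown later in the thread.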
@ss:someonex.netSomeoneSerge (back on matrix)

Should have read the docs properly. The relevant section from

Did you manage to get containerd to work?

16:20:27
@ereslibre:ereslibre.socialereslibre+1, let us know if you run into any issues when enabling CDI :)19:31:30
18 Apr 2025
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) SomeoneSerge (UTC+U[-12,12]) I removed all the module system stuff from https://github.com/connorbaker/cuda-packages 11:24:48
@luke-skywalker:matrix.orgluke-skywalker ereslibre: I got it to run with docker but am still struggling to get it to run with containerd and k8s-device-plugin. 20:46:45
@ereslibre:ereslibre.socialereslibre
In reply to @luke-skywalker:matrix.org
ereslibre: I got it to run with docker but am still struggling to get it to run with containerd and k8s-device-plugin.
Interesting. If you feel like it, please open an issue and we can follow up. I did not try to run CDI with either of those
20:48:38
20 Apr 2025
@ss:someonex.netSomeoneSerge (back on matrix) Updated https://github.com/NVIDIA/build-system-archive-import-examples/issues/5 to reflect the preference for the `.note.dlopen` section over eager loading 09:34:53
@techyporcupine:matrix.org@techyporcupine:matrix.org left the room.18:15:53
21 Apr 2025
@luke-skywalker:matrix.orgluke-skywalkerRedacted or Malformed Event13:54:54
@ss:someonex.netSomeoneSerge (back on matrix) @luke-skywalker:matrix.org: the moderation bot is configured to drop all media in nixos spaces because there was a spam campaign disseminating csam matrix-wide, it's an unfortunate situation but the mods don't really have any other tools at their disposal 19:48:15
22 Apr 2025
@jaredmontoya:matrix.orgjaredmontoya joined the room.09:32:38
@luke-skywalker:matrix.orgluke-skywalker* My suspicion at this point is that something is looking for a specific path to either the NVIDIA driver / libraries or the container runtime. 🤷‍♂️ 11:00:58
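On that suspicion: NixOS deliberately avoids FHS paths, so the driver userspace lands under /run/opengl-driver/lib rather than /usr/lib/nvidia, which trips up tools that probe fixed locations. The toolkit module's mount options (the same ones that appear in the config in this thread) bridge this for containers. A sketch, assuming a recent nixpkgs:

```nix
{
  hardware.nvidia-container-toolkit = {
    enable = true;
    # NixOS keeps driver libraries under /run/opengl-driver/lib, not
    # /usr/lib/nvidia; these options mount the NVIDIA executables and the
    # legacy nvidia-docker directory layout into containers so software
    # expecting FHS paths can find them.
    mount-nvidia-executables = true;
    mount-nvidia-docker-1-directories = true;
  };
}
```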
@luke-skywalker:matrix.orgluke-skywalker *

ereslibre: where exactly should I open an issue? On the nixpkgs GitHub? If so, how do I indicate that it's about nvidia-container-toolkit?

So far, the attached config got me to:

  1. docker (+compose) runtime working with CUDA workloads:

```shell
❯ docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Tue Apr 22 11:08:39 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   50C    P8             42W /  350W |     773MiB /  12288MiB |     20%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
```

  2. containerd runtime directly running CUDA containers
  3. rke2 running with a `config.toml` that points to all needed runtimes in the nix store:

```nix
    hardware.nvidia-container-toolkit = {
      enable = true;
      # package = pkgs.nvidia-container-toolkit;
      # Use UUID for device naming - better for multi-GPU setups
      device-name-strategy = "uuid"; # one of "index", "uuid", "type-index"
      # Mount additional directories for compatibility
      mount-nvidia-docker-1-directories = true;

      # Mount NVIDIA executables into container
      mount-nvidia-executables = true;
    };
    hardware.nvidia = {
      modesetting.enable = true;
      nvidiaPersistenced = true;
    };
    services.rke2 = {
      enable = true;
      role = "server";
      nodeName = "workstation-0";
      cni = "canal";

      # Set the node IP directly
      nodeIP = "${systemProfile.network.staticIP}";
      debug = true;

      # Set cluster CIDR ranges properly
      extraFlags = [
        "--kubelet-arg=cgroup-driver=systemd"
        "--cluster-cidr=10.42.0.0/16"
        "--service-cidr=10.43.0.0/16"
        "--disable-cloud-controller" # Disable cloud controller for bare metal
        # "--kubelet-arg=feature-gates=DevicePlugins=true" # Add this for device plugins
      ];
      disable = [ "traefik" ]; # "servicelb"
      # environmentVars = {
      #   NVIDIA_VISIBLE_DEVICES = "all";
      #   NVIDIA_DRIVER_CAPABILITIES = "all";

      #   # Set NVIDIA driver root to the standard location
      #   # NVIDIA_DRIVER_ROOT = "/usr/lib/nvidia";

      #   # Home directory for RKE2
      #   HOME = "/root";
      # };
    };
```

`/var/lib/rancher/rke2/agent/etc/containerd/config.toml`:

```toml
# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 3
root = "/var/lib/rancher/rke2/agent/containerd"
state = "/run/k3s/containerd"

[grpc]
  address = "/run/k3s/containerd/containerd.sock"

[plugins.'io.containerd.internal.v1.opt']
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins.'io.containerd.grpc.v1.cri']
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"

[plugins.'io.containerd.cri.v1.runtime']
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  device_ownership_from_security_context = false

[plugins.'io.containerd.cri.v1.images']
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins.'io.containerd.cri.v1.images'.pinned_images]
  sandbox = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runhcs-wcow-process]
  runtime_type = "io.containerd.runhcs.v1"

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
  runtime_type = "io.containerd.runc.v2"

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
  BinaryName = "/var/lib/rancher/rke2/data/v1.31.7-rke2r1-7f85e977b85d/bin/nvidia-container-runtime"
  SystemdCgroup = true

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
  runtime_type = "io.containerd.runc.v2"

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
  BinaryName = "/var/lib/rancher/rke2/data/v1.31.7-rke2r1-7f85e977b85d/bin/nvidia-container-runtime.cdi"
  SystemdCgroup = true

[plugins.'io.containerd.cri.v1.images'.registry]
  config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"
```

```shell
lsmod | grep nvidia
nvidia_drm            139264  81
nvidia_modeset       1830912  26 nvidia_drm
nvidia_uvm           3817472  2
nvidia              97120256  533 nvidia_uvm,nvidia_modeset
video                  81920  2 asus_wmi,nvidia_modeset
drm_ttm_helper         20480  2 nvidia_drm
```

However, when trying to deploy the NVIDIA device plugin (via the rke2 operator, as a plain DaemonSet, or as the Helm chart from the nvidia-device-plugin repo), it fails to detect the CUDA environment, e.g. by complaining about the "auto" strategy.

11:08:46
@luke-skywalker:matrix.orgluke-skywalker *

Device plugin logs from kube-system/nvidia-device-plugin-daemonset-j8rmc:

I0422 11:10:39.192906       1 main.go:235] "Starting NVIDIA Device Plugin" version=<
    3c378193
    commit: 3c378193fcebf6e955f0d65bd6f2aeed099ad8ea
 >
I0422 11:10:39.193038       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I0422 11:10:39.193372       1 main.go:245] Starting OS watcher.
I0422 11:10:39.193730       1 main.go:260] Starting Plugins.
I0422 11:10:39.193772       1 main.go:317] Loading configuration.
I0422 11:10:39.194842       1 main.go:342] Updating config with default resource matching patterns.
I0422 11:10:39.195036       1 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0422 11:10:39.195045       1 main.go:356] Retrieving plugins.
E0422 11:10:39.195368       1 factory.go:112] Incompatible strategy detected auto
E0422 11:10:39.195374       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0422 11:10:39.195378       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0422 11:10:39.195382       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0422 11:10:39.195385       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0422 11:10:39.195390       1 main.go:381] No devices found. Waiting indefinitely.

11:12:55


