
NixOS CUDA

CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



12 Apr 2025
@oak:universumi.fi oak 🏳️‍🌈♥️ changed their display name from oak - mikatammi.fi ÄÄNESTÄ to oak - mikatammi.fi. 12:56:11
13 Apr 2025
@ereslibre:ereslibre.social ereslibre joined the room. 11:43:29
@ereslibre:ereslibre.social ereslibre

Hi everyone! I am looking at a bug we have with CDI (Container Device Interface, for forwarding GPUs to containers): https://github.com/NixOS/nixpkgs/issues/397065

I think the user has a correct configuration (unless there are settings that were not mentioned in the issue); my main question is why, when using the datacenter driver, nvidia-container-toolkit reports:

ERRO[0000] failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND

Do you have any idea why NVML would not be present in this environment?

11:45:34
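Editorial note: ERROR_LIBRARY_NOT_FOUND here is NVML failing to dlopen libnvidia-ml.so, which ships with the driver itself, so the question reduces to why the CDI generator cannot find the driver's libraries. A minimal sketch of the kind of configuration under discussion, assuming the standard NixOS options (hardware.nvidia.datacenter.enable and hardware.nvidia-container-toolkit.enable); illustrative only, not the reporter's exact config:

  # Sketch: datacenter driver (no X server) plus CDI spec generation.
  hardware.nvidia.datacenter.enable = true;        # installs the driver, including libnvidia-ml.so (NVML)
  hardware.nvidia-container-toolkit.enable = true; # generates the CDI spec against that driver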
@ss:someonex.net SomeoneSerge (back on matrix)

Hi! I've a small announcement to make.

I've been failing badly to keep up with the backlog as a maintainer, even though I've recently been able to spend more time on Nixpkgs and related projects. Working in occasional 1:1 meetings, on the other hand, has always felt comparatively productive. We just had another call with Gaétan Lepage and I thought it went well, so I now want to try the following: https://md.someonex.net/s/9S4E00sIb#

This is not exactly "official" (I'm not posting it on Discourse, for example, until I'm more confident), but it is an open invitation.

14:52:27
@glepage:matrix.org Gaétan Lepage Indeed, it was great! We were able to finally finish fixing mistral-rs's CUDA support! 15:24:02
15 Apr 2025
@ereslibre:ereslibre.social ereslibre BTW folks, if you have a moment, I'd love to get this one merged: https://github.com/NixOS/nixpkgs/pull/367769 06:26:28
@ss:someonex.net SomeoneSerge (back on matrix) connor (he/him) (UTC-7): did you use something like josh for cuda-legacy? I suspect this produced at least a few pings 😅 13:35:27
@connorbaker:matrix.org connor (he/him) I used https://github.com/newren/git-filter-repo — what would have pinged people? 13:37:06
@ss:someonex.net SomeoneSerge (back on matrix) User handles in commit messages xD 13:37:28
@ereslibre:ereslibre.social ereslibre Hi! Given that https://github.com/NixOS/nixpkgs/pull/362197 recently had conflicts due to the treewide formatting, I closed it and reopened it as https://github.com/NixOS/nixpkgs/pull/398993. I think we can merge this one too. 21:24:34
@ereslibre:ereslibre.social ereslibre We have been going back and forth with the author for a while, and I thought it would be good to go ahead on our side. 21:25:10
@ereslibre:ereslibre.social ereslibre Thanks! 21:29:32
17 Apr 2025
@luke-skywalker:matrix.org luke-skywalker joined the room. 09:38:30
@luke-skywalker:matrix.org luke-skywalker

Is this the right place to ask questions / get pointers on how to properly set up the CUDA container toolkit?

For Docker it seems to work when enabling the deprecated enableNvidia = true; flag. However, I cannot seem to get it to run with nvidia-container-toolkit in systemPackages, whether or not hardware.nvidia-container-toolkit.enable = true; is set...

11:01:34
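Editorial note: a minimal sketch of the CDI-based setup being asked about, assuming Docker 25 or newer, where CDI support sits behind a daemon feature flag; the option names are the standard NixOS ones, and the image tag in the usage line below is illustrative:

  hardware.nvidia-container-toolkit.enable = true;  # generates the CDI spec (devices named nvidia.com/gpu=...)
  virtualisation.docker = {
    enable = true;
    daemon.settings.features.cdi = true;  # Docker 25+: opt in to the CDI feature
  };
  # Note: no virtualisation.docker.enableNvidia; that flag is deprecated.

Containers then request devices by CDI name, for example: docker run --device nvidia.com/gpu=all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi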
@luke-skywalker:matrix.org luke-skywalker Was not lucky at all with containerd for k3s. 11:02:09
@luke-skywalker:matrix.org luke-skywalker For anybody stumbling over this: I'm pretty sure I'm on the right track using CDI, having it work with docker (& compose). Should have read the docs properly. The relevant section from the NixOS CUDA docs that got me here was all the way at the bottom: https://nixos.wiki/wiki/Nvidia#NVIDIA%20Docker%20not%20Working 14:38:50
@luke-skywalker:matrix.org luke-skywalker From all I understand, this gives a lot more flexibility to pass accelerators of different vendors to containerized workloads 🥳 14:39:36
@ss:someonex.net SomeoneSerge (back on matrix) Yes, CDI is the supported way (and has received a lot of care from @ereslibre); enableNvidia relies on end-of-life runtime wrappers. 16:18:38
@ss:someonex.net SomeoneSerge (back on matrix)

Should have read the docs properly. The relevant section from

Did you manage to get containerd to work?

16:20:27
@ereslibre:ereslibre.social ereslibre +1, let us know if you run into any issues when enabling CDI :) 19:31:30
18 Apr 2025
@connorbaker:matrix.org connor (he/him) SomeoneSerge (UTC+U[-12,12]) I removed all the module system stuff from https://github.com/connorbaker/cuda-packages 11:24:48
@luke-skywalker:matrix.org luke-skywalker ereslibre: I got it to run with docker but am still struggling to get it to run with containerd and k8s-device-plugin. 20:46:45
@ereslibre:ereslibre.social ereslibre
In reply to @luke-skywalker:matrix.org
ereslibre: I got it to run with docker but am still struggling to get it to run with containerd and k8s-device-plugin.
Interesting. If you feel like it, please open an issue and we can follow up. I did not try to run CDI with either of those
20:48:38
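Editorial note: for a standalone containerd (not the containerd that rke2/k3s embed and configure themselves), CDI can be enabled in the CRI plugin settings; a sketch assuming containerd 1.7, where CDI is still opt-in (2.0 turns it on by default), using the NixOS virtualisation.containerd module:

  virtualisation.containerd = {
    enable = true;
    settings.plugins."io.containerd.grpc.v1.cri" = {
      enable_cdi = true;                              # let CRI resolve CDI device names such as nvidia.com/gpu
      cdi_spec_dirs = [ "/etc/cdi" "/var/run/cdi" ];  # default directories searched for generated CDI specs
    };
  };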
20 Apr 2025
@ss:someonex.net SomeoneSerge (back on matrix) Updated https://github.com/NVIDIA/build-system-archive-import-examples/issues/5 to reflect preference for the .note.dlopen section over eager-loading 09:34:53
@techyporcupine:matrix.org left the room. 18:15:53
21 Apr 2025
@luke-skywalker:matrix.org luke-skywalker Redacted or Malformed Event 13:54:54
@ss:someonex.net SomeoneSerge (back on matrix) @luke-skywalker:matrix.org: the moderation bot is configured to drop all media in NixOS spaces because there was a spam campaign disseminating CSAM Matrix-wide; it's an unfortunate situation, but the mods don't really have any other tools at their disposal. 19:48:15
22 Apr 2025
@jaredmontoya:matrix.org jaredmontoya joined the room. 09:32:38
@luke-skywalker:matrix.org luke-skywalker

ereslibre: where exactly should I open an issue? On the nixpkgs GitHub? If so, how do I indicate that it's about the nvidia-container-toolkit?

So far, the attached config got me to:

  1. docker (+compose) runtime working with CUDA workloads
  2. containerd runtime directly running cuda containers
  3. rke2 running with a config.toml that points to all needed runtimes in the Nix store.
hardware.nvidia-container-toolkit = {
  enable = true;
  # package = pkgs.nvidia-container-toolkit;

  # Use UUID for device naming; better for multi-GPU setups
  device-name-strategy = "uuid"; # one of "index", "uuid", "type-index"

  # Mount additional directories for compatibility
  mount-nvidia-docker-1-directories = true;

  # Mount NVIDIA executables into containers
  mount-nvidia-executables = true;
};

services.rke2 = {
  enable = true;
  role = "server";
  nodeName = "workstation-0";
  cni = "canal";

  # Set the node IP directly
  nodeIP = "${systemProfile.network.staticIP}";
  debug = true;

  # Set cluster CIDR ranges properly
  extraFlags = [
    "--kubelet-arg=cgroup-driver=systemd"
    "--cluster-cidr=10.42.0.0/16"
    "--service-cidr=10.43.0.0/16"
    "--disable-cloud-controller" # Disable cloud controller for bare metal
    # "--kubelet-arg=feature-gates=DevicePlugins=true" # Add this for device plugins
  ];
  disable = [ "traefik" ]; # "servicelb"
  # environmentVars = {
  #   NVIDIA_VISIBLE_DEVICES = "all";
  #   NVIDIA_DRIVER_CAPABILITIES = "all";
  #
  #   # Set the NVIDIA driver root to the standard location
  #   # NVIDIA_DRIVER_ROOT = "/usr/lib/nvidia";
  #
  #   # Home directory for RKE2
  #   HOME = "/root";
  # };
};

/var/lib/rancher/rke2/agent/etc/containerd/config.toml

10:56:26


