
NixOS CUDA

290 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda

58 Servers



22 Apr 2025
@ereslibre:ereslibre.socialereslibre I have to reproduce the issue and check the fix myself. When done, I am positive that we can also automate this on the NixOS module side 19:39:32
@luke-skywalker:matrix.orgluke-skywalkerok, finally: after 4 days of falling from one rabbit hole into the next, deeper one, I was able to deploy the NVIDIA device plugin onto my initial cluster node and run CUDA workloads 🥳 Will now proceed to replicate the setup on a second machine with another NVIDIA GPU, join the cluster, and see if I can do pipeline-parallelised vLLM inference 🤞20:44:24
@luke-skywalker:matrix.orgluke-skywalkeryes that was indeed the last missing piece of the puzzle!20:45:13
@luke-skywalker:matrix.orgluke-skywalker ereslibre: interestingly enough, I was so far only able to successfully deploy the daemonset v14 and v15. Using the latest v17 results in a glibc error. 22:12:36
@luke-skywalker:matrix.orgluke-skywalkerthx for pointing me to this. I have been scratching my head on what the right channel and format is to give feedback to the NixOS project 🙏🙏🙏22:13:27
@ss:someonex.netSomeoneSerge (back on matrix) connor (he/him) (UTC-7): look familiar? https://mastodon.social/@effinbirds/114383881424822335 23:06:37
23 Apr 2025
@ereslibre:ereslibre.socialereslibreGlad it worked! :)05:48:55
@ss:someonex.netSomeoneSerge (back on matrix) luke-skywalker: looking forward to reading the blog post xD 12:03:57
@luke-skywalker:matrix.orgluke-skywalkerblog post? Shouldn't everyone have the joy of fighting through those dungeons of rabbit holes and coming out the other end with some awesome loot? 😊 Will do when I find the time to write it down as a guide / article or make a PR to either rke2 or nvidia-container-toolkit. Might even wrap it into its own system module. But the main thing is available time, since this is just one of the stepping stones to a system to federate distributed "AI" capabilities. Don't actually want to be too public before I have a working "kernel" of the envisioned system.15:22:21
@ereslibre:ereslibre.socialereslibreI might be able to open a PR to enable CDI on containerd this weekend21:14:48
@luke-skywalker:matrix.orgluke-skywalkerFYI, with the virtualisation.containerd module (not the one used by rke2) it already works out of the box.21:16:15
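A minimal sketch of that out-of-the-box path on NixOS unstable (both options exist in current nixpkgs, but treating this pairing as sufficient is an assumption, not a verified setup):

  {
    # Generates CDI specs for the installed NVIDIA GPUs (under /var/run/cdi)
    hardware.nvidia-container-toolkit.enable = true;

    # The stock NixOS containerd module, as opposed to rke2's bundled containerd
    virtualisation.containerd.enable = true;
  }

Containers can then request a device by its CDI name, e.g. nvidia.com/gpu=all.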
24 Apr 2025
@ereslibre:ereslibre.socialereslibre luke-skywalker: unless I’m missing something, nothing is setting https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#containerd-configuration, right? You had to do this manually, right? 06:06:26
@luke-skywalker:matrix.orgluke-skywalker

Funnily, from all the detours I took to make it work, I thought rke2 was doing that, but I think I must have done it by hand and forgot about it.

So yes, you need to provide a config.toml.tmpl with an nvidia-cdi runtime defined, pointing to the runtime binary, and set [plugins."io.containerd.grpc.v1.cri".cdi]

Could you give me the TL;DR on why using image: nvcr.io/nvidia/k8s-device-plugin:v0.17.x fails with a glibc issue?

My understanding is it was built with a newer version of glibc than the one on my system (2.40)? Any way to solve this, or should I simply stick to 16.x until the glibc version on the NixOS unstable channel is compatible again?

12:19:15
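For the NixOS-managed containerd no template file is needed; the same keys can go through the module's freeform settings. A hedged sketch (enable_cdi and cdi_spec_dirs are the keys documented in the CNCF CDI README linked above; whether this exactly matches what rke2 expects in its config.toml.tmpl is an assumption):

  {
    virtualisation.containerd.settings.plugins."io.containerd.grpc.v1.cri" = {
      # Let the CRI plugin resolve CDI device names such as nvidia.com/gpu=all
      enable_cdi = true;
      # Directories scanned for generated CDI specs
      cdi_spec_dirs = [ "/etc/cdi" "/var/run/cdi" ];
    };
  }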
@luke-skywalker:matrix.orgluke-skywalker ooh, interesting. How does it compare to vLLM? I see it supports device maps. Is that for pipeline parallelism, so GPU devices on different nodes / machines as well? Is it somehow affiliated with mistral-ai, or what's the reason for the name of the library? ;)14:12:59
@luke-skywalker:matrix.orgluke-skywalkeralso does that work on k8s clusters? 🤔14:13:43
@glepage:matrix.orgGaétan Lepage I haven't used it much myself, as I don't own a big enough GPU.
As far as I know, it is not affiliated with Mistral (the company). I guess it's the same as with "ollama" and Llama (Meta).
15:43:10
@luke-skywalker:matrix.orgluke-skywalkerIt's getting even better though. Now switched from a DaemonSet deployment of the device plugin to a helm deployment with custom values. This made it possible to also enable time slicing for the available GPUs 🥳 16:10:02
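A sketch of the time-slicing piece of such custom values, written here as a Nix string on the host (the sharing/timeSlicing structure follows NVIDIA's documented device-plugin config format; the file path, the replica count, and the exact chart key to feed it through are assumptions):

  {
    # Hypothetical drop location; feed the file to the chart, e.g. with helm --set-file
    environment.etc."rke2/dp-time-slicing.yaml".text = ''
      version: v1
      sharing:
        timeSlicing:
          resources:
            - name: nvidia.com/gpu
              replicas: 4  # advertise each physical GPU as 4 schedulable GPUs
    '';
  }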
@luke-skywalker:matrix.orgluke-skywalkerthx for the info, yeah the same as ollama was my assumption. Guess I'll stick to a vLLM deployment with helm on k8s. 16:37:54
25 Apr 2025
@glepage:matrix.orgGaétan LepageNot getting lighter by the release...06:52:16
@glepage:matrix.orgGaétan Lepage
prefetching https://download.pytorch.org/whl/cu128/torch-2.7.0%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl...
[929.8/1046.8 MiB DL] downloading 'https://download.pytorch.org/whl/cu128/torch-2.7.0%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl'
06:52:33
@ereslibre:ereslibre.socialereslibre

luke-skywalker: yes, glibc is not forward compatible, only backwards compatible: a binary built against a newer glibc will not load with an older one, while the reverse works.

You can check https://github.com/NixOS/nixpkgs/issues/338511#issuecomment-2341496949 and the previous comments, since this is basically the issue you are hitting.

12:24:21
@luke-skywalker:matrix.orgluke-skywalkerthx 🙏 So just to get it right: is NixOS (unstable) running a glibc version that is too new for the 17.x images, or has the image been built with a glibc version that is too new for NixOS (unstable)? 🤔 Because the NVIDIA device plugin image versions 14.x/15.x/16.x all work. Do you see any critical issue running clusters on 16.2? It works like a beauty; I am currently testing GPU workload autoscaling and I would hate to let that go 😅12:29:21


