
NixOS CUDA

CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



25 Apr 2025
[12:30:04] luke-skywalker: thx 🙏 So just to get it right: is it NixOS (unstable) running a glibc version that is too new for the 17.x images, or has the image been built with a glibc version that is too new for NixOS (unstable)? 🤔 Because nvidia device plugin image versions 14.x/15.x/16.x all work. Do you see any critical issue running clusters on 16.2? It works like a beauty; currently testing GPU workload autoscaling and I would hate to let that go 😅
[12:52:45] ereslibre: hi! good news, I was able to reproduce and have a fix; this is very related to an issue reported to nvidia-container-toolkit. Let me explain
[12:53:11] luke-skywalker: I'm all ears 🤩
[13:00:36] ereslibre: https://gist.github.com/ereslibre/483fec3217ffca38b3244df42a477db2
[13:04:11] ereslibre: this is related to upstream issue https://github.com/NVIDIA/nvidia-container-toolkit/issues/944 somehow. We need to figure out the best way to handle this, but at least you have two workarounds for the time being; neither of them is ideal...
[13:08:17] luke-skywalker: I see. If I have to, I would probably opt for editing /var/run/cdi/nvidia-container-toolkit.json, but at this point I don't see a reason not to stick with 16.2 and update once the upstream issue is resolved.
[13:10:27] ereslibre: yeah, updating /var/run/cdi/nvidia-container-toolkit.json is flaky, as I laid out: it expects ldconfig to be present within the container at the specified path
[13:13:16] luke-skywalker: good to know. Well, until I see a good reason not to and everything works as needed, I will stick with 16.2 for the time being then.
[13:14:35] luke-skywalker: currently just testing out different cluster setups in my homelab (4x machines, 2x with nvidia GPU), so it will be a bit until any real deployment...
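(Note: the fragility ereslibre mentions comes from the generated CDI spec naming a path at which ldconfig must exist inside the container. As a rough, hedged sketch of how one might check this, the jq filter below assumes the hook entries live under containerEdits.hooks in the generated spec, the image tag is only illustrative, and a docker-compatible runtime is assumed to be available:

❯ jq '.containerEdits.hooks' /var/run/cdi/nvidia-container-toolkit.json
❯ docker run --rm --entrypoint ls nvcr.io/nvidia/k8s-device-plugin:v0.17.0 -l /sbin/ldconfig

The first command lists the hooks recorded in the spec; the second checks whether the device-plugin image actually ships ldconfig at the location the spec expects, /sbin/ldconfig being just an example path.)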
[23:33:24] connor (he/him) (UTC-7): Kevin Mittman: I noticed the TensorRT binary archive for x86_64-linux (and only x86_64-linux) includes libnvinfer_builder_resource.so.10.9.0 and libnvinfer_builder_resource_win.so.10.9.0. Both are ~1.9 GB, and I'm wondering if libnvinfer_builder_resource_win.so.10.9.0 is relevant for x86_64-linux systems, and if so, what it does compared to libnvinfer_builder_resource.so.10.9.0.
[23:48:23] Kevin Mittman (in reply to connor): Checking. Also that tarball doesn't conform to the "binary archive" format ... and is 6.4 GB
[23:55:01] Kevin Mittman: As the name implies, seems to be for cross compilation, Linux -> Windows
26 Apr 2025
[00:01:28] connor (he/him) (UTC-7): Shouldn't it be in a different targets directory if it's for cross to another system?
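(Note: a quick way to tell what the _win variant actually is would be to run file(1) on both libraries after unpacking the archive; an ELF shared object versus a PE image settles whether it is meant to be loaded on Linux at all. The lib/ paths below are only an assumption about the archive layout:

❯ file lib/libnvinfer_builder_resource.so.10.9.0 lib/libnvinfer_builder_resource_win.so.10.9.0
❯ du -h lib/libnvinfer_builder_resource*.so.10.9.0)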
[19:52:11] hexa (UTC+1): heads up
[19:52:34] hexa (UTC+1): current onnxruntime on unstable requires w+x, while the version on release-24.11 does not
[19:52:55] hexa (UTC+1): (onnxruntime from unstable)
❯ objdump -x result/lib/libonnxruntime.so | grep -A1 "STACK off"
   STACK off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4
         filesz 0x0000000000000000 memsz 0x0000000000000000 flags rwx
[19:53:01] hexa (UTC+1): (onnxruntime from release-24.11)
❯ objdump -x result/lib/libonnxruntime.so | grep -A1 "STACK off"
   STACK off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4
         filesz 0x0000000000000000 memsz 0x0000000000000000 flags rw-
[19:53:33] hexa (UTC+1): implies systemd units that depend on onnxruntime and have MemoryDenyWriteExecute set need to be updated to allow it
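(Note: for a unit that starts failing once the w+x onnxruntime lands, the fix is exactly what hexa describes: relax MemoryDenyWriteExecute on that unit. A minimal sketch using a systemd drop-in, with a placeholder unit name:

❯ systemctl edit some-onnx-consumer.service
(and in the drop-in that opens:)
[Service]
MemoryDenyWriteExecute=no

On NixOS the same thing can be expressed declaratively by setting serviceConfig.MemoryDenyWriteExecute = false on the corresponding systemd.services.<name> entry.)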
[19:58:30] connor (he/him) (UTC-7): I don't know if anyone else uses torchmetrics, but if you're wondering why using DISTS is so freaking slow, it's because they create a new instance of the model every time you call it: https://github.com/Lightning-AI/torchmetrics/blob/60e7686c97c14a4286825ec23187b8629f825d15/src/torchmetrics/functional/image/dists.py#L176
I tried just creating the model once and using it directly, and it is much faster, but something about doing that causes a memory leak which makes training OOM eventually :(
At any rate, it's not the packaging's fault, woohoo
29 Apr 2025
[04:50:56] connor (he/him) (UTC-7): finally started writing more docs (https://github.com/ConnorBaker/cuda-packages/blob/main/doc/language-frameworks/cuda.section.md) and moving some new package expressions (cuda-python, cutlass, flash-attn, modelopt, pyglove, schedulefree, transformer-engine) to my public repo (https://github.com/ConnorBaker/cuda-packages/tree/main/pkgs/development/python-modules)
[23:42:49] @ygt:matrix.org left the room.
1 May 2025
[19:07:13] connor (he/him) (UTC-7): God I need to finish arrayUtilities so I can start landing CUDA setup hooks
[23:18:34] oak changed their display name from "oak - mikatammi.fi" to "oak 🫱⭕🫲".
[23:27:12] connor (he/him) (UTC-7): Kevin Mittman: is it intentional that the CUDA 12.9 docs (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id7) say they require a driver version >=575.51.03 for 12.9, but the latest release is 575.51.02 (https://download.nvidia.com/XFree86/Linux-x86_64/575.51.02/)?
[23:29:15] Kevin Mittman (in reply to connor): CUDA 12.9.0 ships with driver 575.51.03; what you are seeing is a separate release from the GeForce BU
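(Note: to compare the driver branch actually installed on a host against what the 12.9 release notes require, nvidia-smi can report the driver version directly; this assumes nvidia-smi is on PATH on the host:

❯ nvidia-smi --query-gpu=driver_version --format=csv,noheader)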


