
NixOS CUDA

CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



25 Apr 2025
[12:30:04] luke-skywalker: thx 🙏 So just to get it right: is it NixOS (unstable) running a glibc version that is too new for the 17.x images, or has the image been built with a glibc version that is too new for NixOS (unstable)? 🤔 Because nvidia device plugin image versions 14.x/15.x/16.x all work. Do you see any critical issue running clusters on 16.2? It works like a beauty; currently testing GPU workload autoscaling and I would hate to let that go 😅
[12:52:45] ereslibre: hi! good news, I was able to reproduce and have a fix; this is very related to an issue reported to nvidia-container-toolkit. Let me explain
[12:53:11] luke-skywalker: I'm all ears 🤩
[13:00:36] ereslibre: https://gist.github.com/ereslibre/483fec3217ffca38b3244df42a477db2
[13:04:11] ereslibre: this is related to upstream issue https://github.com/NVIDIA/nvidia-container-toolkit/issues/944 somehow. We need to figure out the best way to handle this, but at least you have two workarounds for the time being; neither of them is ideal...
[13:08:17] luke-skywalker: I see. If I have to, I would probably opt for editing /var/run/cdi/nvidia-container-toolkit.json, but at this point I don't see a reason not to stick with 16.2 and update once the upstream issue is resolved.
[13:10:27] ereslibre: yeah, updating /var/run/cdi/nvidia-container-toolkit.json is flaky, as I laid out: it expects ldconfig to be present within the container at the specified path
[13:13:16] luke-skywalker: good to know. Well, until I see a good reason not to and everything works as needed, I will stick with 16.2 for the time being then.
[13:14:35] luke-skywalker: currently just testing out different cluster setups in my homelab (4x machines, 2x with nvidia GPU), so it will be a bit until any real deployment...
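(Note: the fragility ereslibre mentions comes from the generated CDI spec naming a path at which ldconfig must exist inside the container. As a rough, hedged sketch of how one might check this, the jq filter below assumes the hook entries live under containerEdits.hooks in the generated spec, the image tag is only illustrative, and a docker-compatible runtime is assumed to be available:

❯ jq '.containerEdits.hooks' /var/run/cdi/nvidia-container-toolkit.json
❯ docker run --rm --entrypoint ls nvcr.io/nvidia/k8s-device-plugin:v0.17.0 -l /sbin/ldconfig

The first command lists the hooks recorded in the spec; the second checks whether the device-plugin image actually ships ldconfig at the location the spec expects, /sbin/ldconfig being just an example path.)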
[23:33:24] connor (he/him) (UTC-7): Kevin Mittman: I noticed the TensorRT binary archive for x86_64-linux (and only x86_64-linux) includes libnvinfer_builder_resource.so.10.9.0 and libnvinfer_builder_resource_win.so.10.9.0. Both are ~1.9 GB, and I'm wondering if libnvinfer_builder_resource_win.so.10.9.0 is relevant for x86_64-linux systems, and if so, what it does compared to libnvinfer_builder_resource.so.10.9.0.
[23:48:23] Kevin Mittman (in reply to connor): Checking. Also that tarball doesn't conform to the "binary archive" format ... and is 6.4 GB
[23:55:01] Kevin Mittman: As the name implies, seems to be for cross compilation, Linux -> Windows
26 Apr 2025
[00:01:28] connor (he/him) (UTC-7): Shouldn't it be in a different targets directory if it's for cross to another system?
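(Note: a quick way to tell what the _win variant actually is would be to run file(1) on both libraries after unpacking the archive; an ELF shared object versus a PE image settles whether it is meant to be loaded on Linux at all. The lib/ paths below are only an assumption about the archive layout:

❯ file lib/libnvinfer_builder_resource.so.10.9.0 lib/libnvinfer_builder_resource_win.so.10.9.0
❯ du -h lib/libnvinfer_builder_resource*.so.10.9.0)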
[19:52:11] hexa (UTC+1): heads up
[19:52:34] hexa (UTC+1): current onnxruntime on unstable requires w+x, while the version on release-24.11 does not
[19:52:55] hexa (UTC+1): (onnxruntime from unstable)
❯ objdump -x result/lib/libonnxruntime.so | grep -A1 "STACK off"
   STACK off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4
         filesz 0x0000000000000000 memsz 0x0000000000000000 flags rwx
[19:53:01] hexa (UTC+1): (onnxruntime from release-24.11)
❯ objdump -x result/lib/libonnxruntime.so | grep -A1 "STACK off"
   STACK off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4
         filesz 0x0000000000000000 memsz 0x0000000000000000 flags rw-
[19:53:33] hexa (UTC+1): implies systemd units that depend on onnxruntime and have MemoryDenyWriteExecute set need to be updated to allow it
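(Note: for a unit that starts failing once the w+x onnxruntime lands, the fix is exactly what hexa describes: relax MemoryDenyWriteExecute on that unit. A minimal sketch using a systemd drop-in, with a placeholder unit name:

❯ systemctl edit some-onnx-consumer.service
(and in the drop-in that opens:)
[Service]
MemoryDenyWriteExecute=no

On NixOS the same thing can be expressed declaratively by setting serviceConfig.MemoryDenyWriteExecute = false on the corresponding systemd.services.<name> entry.)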
[19:58:30] connor (he/him) (UTC-7): I don't know if anyone else uses torchmetrics, but if you're wondering why using DISTS is so freaking slow, it's because they create a new instance of the model every time you call it: https://github.com/Lightning-AI/torchmetrics/blob/60e7686c97c14a4286825ec23187b8629f825d15/src/torchmetrics/functional/image/dists.py#L176
I tried just creating the model once and using it directly, and it is much faster, but something about doing that causes a memory leak which makes training OOM eventually :(
At any rate, it's not the packaging's fault, woohoo
29 Apr 2025
[04:50:56] connor (he/him) (UTC-7): finally started writing more docs (https://github.com/ConnorBaker/cuda-packages/blob/main/doc/language-frameworks/cuda.section.md) and moving some new package expressions (cuda-python, cutlass, flash-attn, modelopt, pyglove, schedulefree, transformer-engine) to my public repo (https://github.com/ConnorBaker/cuda-packages/tree/main/pkgs/development/python-modules)
[23:42:49] @ygt:matrix.org left the room.
1 May 2025
[19:07:13] connor (he/him) (UTC-7): God I need to finish arrayUtilities so I can start landing CUDA setup hooks
[23:18:34] oak changed their display name from "oak - mikatammi.fi" to "oak 🫱⭕🫲".
[23:27:12] connor (he/him) (UTC-7): Kevin Mittman: is it intentional that the CUDA 12.9 docs (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id7) say they require a driver version >=575.51.03 for 12.9, but the latest release is 575.51.02 (https://download.nvidia.com/XFree86/Linux-x86_64/575.51.02/)?
[23:29:15] Kevin Mittman (in reply to connor): CUDA 12.9.0 ships with driver 575.51.03; what you are seeing is a separate release from the GeForce BU
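(Note: to compare the driver branch actually installed on a host against what the 12.9 release notes require, nvidia-smi can report the driver version directly; this assumes nvidia-smi is on PATH on the host:

❯ nvidia-smi --query-gpu=driver_version --format=csv,noheader)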


