NixOS CUDA - Public Room Timeline

	NixOS CUDA	306 Members
	CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda	61 Servers

Sender	Message	Time
24 Apr 2025
luke-skywalker	It's still getting better, though. I've now switched from a DaemonSet deployment of the device plugin to a Helm deployment with custom values. This also made it possible to enable time slicing of the available GPU 🥳	16:10:02
luke-skywalker	thx for the info, yeah, same as ollama was my assumption. Guess I'll stick to the vllm deployment with helm on k8s.	16:37:54
25 Apr 2025
Gaétan Lepage	Not getting lighter by the release...	06:52:16
Gaétan Lepage	`prefetching https://download.pytorch.org/whl/cu128/torch-2.7.0%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl... [929.8/1046.8 MiB DL] downloading 'https://download.pytorch.org/whl/cu128/torch-2.7.0%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl'`	06:52:33
ereslibre	luke-skywalker: yes, glibc is not forward compatible, only backwards compatible. You can check https://github.com/NixOS/nixpkgs/issues/338511#issuecomment-2341496949 and the previous comments, since this is basically the issue you are hitting	12:24:30
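
The compatibility rule above can be sketched in a few lines of Python. This is only an illustration: the helper names `parse_glibc_version` and `can_load` and the version numbers in the usage note are hypothetical, and the sketch does not say which side of the container boundary carries the newer glibc in this particular case. It only encodes the general rule that a binary built against a newer glibc cannot be loaded by an older one, while the reverse works.

```python
import re


def parse_glibc_version(text: str) -> tuple:
    """Turn a string such as 'GLIBC_2.38' or '2.40' into a comparable tuple."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def can_load(required: str, provided: str) -> bool:
    """Return True if a binary that needs symbols up to `required`
    can be loaded by a glibc of version `provided`.

    Backwards compatible: a newer glibc still serves older binaries.
    Not forward compatible: an older glibc lacks the newer symbol versions.
    """
    return parse_glibc_version(provided) >= parse_glibc_version(required)
```

With made-up versions, `can_load("GLIBC_2.35", "2.40")` is true and `can_load("GLIBC_2.40", "2.35")` is false. Comparing tuples of integers rather than strings matters here, since `2.4` must sort before `2.38`.
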
luke-skywalker	thx 🙏 So just to get it right: is it NixOS (unstable) running a glibc version that is too new for the 17.x images, or has the image been built with a glibc version that is too new for NixOS (unstable)? 🤔 Because nvidia device plugin image versions 14.x/15.x/16.x all work. Do you see any critical issue running clusters on 16.2? It works like a beauty, I'm currently testing GPU workload autoscaling and I would hate to let that go 😅	12:30:04
ereslibre	hi! good news, I was able to reproduce and have a fix; this is very related to an issue reported to nvidia-container-toolkit. Let me explain	12:52:45
luke-skywalker	I'm all ears 🤩	12:53:11
ereslibre	https://gist.github.com/ereslibre/483fec3217ffca38b3244df42a477db2	13:00:36
ereslibre	this is related to upstream issue https://github.com/NVIDIA/nvidia-container-toolkit/issues/944 somehow. We need to figure out the best way to handle this, but at least you have two workarounds for the time being, none of them is ideal...	13:04:11
luke-skywalker	I see. If I have to, I would probably opt for editing `/var/run/cdi/nvidia-container-toolkit.json`, but at this point I don't see a reason not to stick with 16.2 and update once the upstream issue is resolved.	13:08:17
ereslibre	yeah, updating `/var/run/cdi/nvidia-container-toolkit.json` the way I described it is flaky: it expects ldconfig to be present within the container at the specified path	13:10:27
luke-skywalker	good to know. well until I see a good reason not to and everything works as needed, I will stick with 16.2 for the time being then.	13:13:16
luke-skywalker	currently just testing out different cluster setups in my homelab (4x machines, 2x with nvidia GPU) so will be a bit until any real deployment...	13:14:35
connor (burnt/out) (UTC-8)	Kevin Mittman: I noticed the TensorRT binary archive for x86_64-linux (and only x86_64-linux) includes `libnvinfer_builder_resource.so.10.9.0` and `libnvinfer_builder_resource_win.so.10.9.0`. Both are ~1.9 GB, and I'm wondering if `libnvinfer_builder_resource_win.so.10.9.0` is relevant for x86_64-linux systems, and if so, what it does compared to `libnvinfer_builder_resource.so.10.9.0`.	23:33:24
Kevin Mittman (jetlagged/UTC-7)	In reply to @connorbaker:matrix.org: Checking. Also that tarball doesn't conform to the "binary archive" format ... and 6.4GB	23:48:23
Kevin Mittman (jetlagged/UTC-7)	As the name implies, seems to be for cross compilation, Linux -> Windows	23:55:01
26 Apr 2025
connor (burnt/out) (UTC-8)	Shouldn't it be in a different `targets` directory if it's for cross to another system?	00:01:28
hexa (UTC+1)	heads up	19:52:11
hexa (UTC+1)	current onnxruntime on unstable requires w+x, while the version on release-24.11 does not	19:52:34
hexa (UTC+1)	`❯ objdump -x result/lib/libonnxruntime.so | grep -A1 "STACK off" STACK off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4 filesz 0x0000000000000000 memsz 0x0000000000000000 flags rwx`	19:52:55
hexa (UTC+1)	`❯ objdump -x result/lib/libonnxruntime.so | grep -A1 "STACK off" STACK off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**4 filesz 0x0000000000000000 memsz 0x0000000000000000 flags rw-`	19:53:01
hexa (UTC+1)	implies systemd units that depend on onnxruntime and have `MemoryDenyWriteExecute` need to be updated to allow it	19:53:33
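
The manual `objdump` check above could be scripted roughly as follows. This is a sketch, not an established tool: the function names `gnu_stack_flags` and `needs_executable_stack` are made up for illustration, and the code assumes the `STACK` program header and its `flags` field appear either on one line (as pasted in the chat) or on two consecutive lines (as `objdump -x` normally prints them).

```python
import re
from typing import Optional


def gnu_stack_flags(objdump_output: str) -> Optional[str]:
    """Return the flags of the STACK program header (e.g. 'rwx' or 'rw-'),
    or None if no STACK header with flags is found in the given text."""
    lines = objdump_output.splitlines()
    for index, line in enumerate(lines):
        if line.split()[:1] == ["STACK"]:
            # The flags follow the STACK header on the same or the next line.
            window = " ".join(lines[index:index + 2])
            match = re.search(r"\bflags\s+([rwx-]{3})", window)
            return match.group(1) if match else None
    return None


def needs_executable_stack(objdump_output: str) -> bool:
    """True if the library asks for an executable stack, which is what
    clashes with units that set MemoryDenyWriteExecute."""
    flags = gnu_stack_flags(objdump_output)
    return flags is not None and "x" in flags
```

Feeding it the text of `objdump -x result/lib/libonnxruntime.so` for each of the two builds would flag the one that reports `flags rwx` and pass the one that reports `flags rw-`.
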
connor (burnt/out) (UTC-8)	I don’t know if anyone else uses torchmetrics, but if you’re wondering why using DISTS is so freaking slow, it’s because they create a new instance of the model every time you call it: https://github.com/Lightning-AI/torchmetrics/blob/60e7686c97c14a4286825ec23187b8629f825d15/src/torchmetrics/functional/image/dists.py#L176 I tried just creating the model once and using it directly, and it is much faster, but something about doing that causes a memory leak which makes training OOM eventually :( At any rate, it’s not the packaging’s fault, woohoo	19:58:30
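
The difference between rebuilding the model on every call and creating it once can be shown with a small, self-contained sketch. Nothing here uses the torchmetrics API: `ExpensiveModel`, `metric_rebuilding`, `shared_model` and `metric_cached` are hypothetical stand-ins, and the sketch does not reproduce or address the memory leak mentioned above.

```python
from functools import lru_cache


class ExpensiveModel:
    """Hypothetical stand-in for a network that is slow to construct."""

    instances_created = 0  # counts how often the constructor has run

    def __init__(self) -> None:
        ExpensiveModel.instances_created += 1

    def score(self, prediction: float, target: float) -> float:
        # Placeholder computation; a real metric would run the network here.
        return abs(prediction - target)


def metric_rebuilding(prediction: float, target: float) -> float:
    """Builds a fresh model on every call (the slow pattern)."""
    return ExpensiveModel().score(prediction, target)


@lru_cache(maxsize=1)
def shared_model() -> ExpensiveModel:
    """Builds the model on first use and returns the same instance afterwards."""
    return ExpensiveModel()


def metric_cached(prediction: float, target: float) -> float:
    """Reuses the single cached model instance."""
    return shared_model().score(prediction, target)
```

Calling `metric_rebuilding` three times runs the constructor three times, while three calls to `metric_cached` run it once. Holding a long-lived instance also keeps whatever state it accumulates alive, so memory use during training is worth watching with this pattern.
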
