Message | Time | |
---|---|---|
25 Apr 2025 | ||
thx 🙏 So just to get it right: is it NixOS (unstable) running a glibc version that is too new for the 17.x images, or has the image been built with a glibc version that is too new for NixOS (unstable)? 🤔 Because the nvidia device plugin image versions 14.x/15.x/16.x all work. Do you see any critical issue running clusters on 16.2? It works like a beauty; I'm currently testing GPU workload autoscaling and I would hate to let that go 😅 | 12:29:21 | |
hi! good news, I was able to reproduce and have a fix; this is very related to an issue reported to nvidia-container-toolkit. Let me explain | 12:52:45 | |
im all ears 🤩 | 12:53:11 | |
https://gist.github.com/ereslibre/483fec3217ffca38b3244df42a477db2 | 13:00:36 | |
this is related to upstream issue https://github.com/NVIDIA/nvidia-container-toolkit/issues/944 somehow. We need to figure out the best way to handle this, but at least you have two workarounds for the time being; neither of them is ideal... | 13:04:11 | |
I see. If I have to, I would probably opt for editing | 13:07:54 | |
yeah, updating /var/run/cdi/nvidia-container-toolkit.json is flaky, as I explained: it expects ldconfig to be present within the container at the specified path | 13:10:27 | |
good to know. Well, until I see a good reason not to, and as long as everything works as needed, I will stick with 16.2 for the time being. | 13:13:16 | |
currently just testing out different cluster setups in my homelab (4x machines, 2x with an nvidia GPU), so it will be a while until any real deployment... | 13:14:35 | |
Kevin Mittman: I noticed the TensorRT binary archive for x86_64-linux (and only x86_64-linux) includes libnvinfer_builder_resource.so.10.9.0 and libnvinfer_builder_resource_win.so.10.9.0. Both are ~1.9 GB, and I'm wondering if libnvinfer_builder_resource_win.so.10.9.0 is relevant for x86_64-linux systems, and if so, what it does compared to libnvinfer_builder_resource.so.10.9.0. | 23:33:24 | |
In reply to @connorbaker:matrix.org: Checking. Also, that tarball doesn't conform to the "binary archive" format ... and it's 6.4 GB | 23:48:23 | |
As the name implies, seems to be for cross compilation, Linux -> Windows | 23:55:01 | |
26 Apr 2025 | ||
Shouldn't it be in a different targets directory if it's for cross-compilation to another system? | 00:01:28 | |
heads up | 19:52:11 | |
the current onnxruntime on unstable requires W+X (writable and executable) memory mappings, while the version on release-24.11 does not | 19:52:34 | |
this implies that systemd units which depend on onnxruntime and set MemoryDenyWriteExecute need to be updated to allow it | 19:53:33 | |
I don’t know if anyone else uses torchmetrics, but if you’re wondering why using DISTS is so freaking slow, it’s because they create a new instance of the model every time you call it: https://github.com/Lightning-AI/torchmetrics/blob/60e7686c97c14a4286825ec23187b8629f825d15/src/torchmetrics/functional/image/dists.py#L176 I tried just creating the model once and using it directly, and it is much faster, but something about doing that causes a memory leak which makes training OOM eventually :( At any rate, it’s not the packaging’s fault, woohoo | 19:58:30 | |
29 Apr 2025 | ||
finally started writing more docs (https://github.com/ConnorBaker/cuda-packages/blob/main/doc/language-frameworks/cuda.section.md) and moving some new package expressions (cuda-python, cutlass, flash-attn, modelopt, pyglove, schedulefree, transformer-engine) to my public repo (https://github.com/ConnorBaker/cuda-packages/tree/main/pkgs/development/python-modules) | 04:50:56 | |
1 May 2025 | ||
God I need to finish arrayUtilities so I can start landing CUDA setup hooks | 19:07:13 | |
Kevin Mittman: is it intentional that the CUDA 12.9 docs (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id7) say they require a driver version >=575.51.03 for 12.9, but the latest release is 575.51.02 (https://download.nvidia.com/XFree86/Linux-x86_64/575.51.02/)? | 23:27:12 | |
In reply to @connorbaker:matrix.org: CUDA 12.9.0 ships with driver 575.51.03; what you are seeing is a separate release from the GeForce BU | 23:29:15 | |
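
Regarding the 26 Apr note that systemd units with MemoryDenyWriteExecute need updating for the newer onnxruntime: a minimal NixOS sketch of that relaxation, assuming a hypothetical service named `onnx-inference`; the actual unit name, and whether `lib.mkForce` is needed, depend on the deployment.

```nix
{ lib, ... }:

{
  # Hypothetical unit name; substitute the service whose closure includes
  # the onnxruntime from unstable.
  systemd.services.onnx-inference.serviceConfig = {
    # The newer onnxruntime needs writable+executable memory mappings,
    # so this hardening option has to be switched off for that unit only.
    MemoryDenyWriteExecute = lib.mkForce false;
  };
}
```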