
NixOS CUDA

CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



9 Sep 2024
@ss:someonex.netSomeoneSerge (utc+3)
In reply to @connorbaker:matrix.org
SomeoneSerge (nix.camp): at the top of https://github.com/NixOS/nixpkgs/pull/339619 I have a list of packages I found which have environments with mixed versions of CUDA packages. Any ideas on how best to test for cases where code loads arbitrary / incorrect versions of CUDA libraries? As an example, I’d hope OpenCV would load the CUDA libraries it was built with, and the other packages would load the CUDA libraries from their expressions (not the OpenCV one).

Off the top of my head, I'd say we prepare a function under pkgs.testers for running an arbitrary command with LD_DEBUG=libs, parsing its outputs, and running asserts of the form:

  • soname was/was not searched for
  • soname was/was not loaded
  • soname was/was not loaded from a path matching a pattern
17:27:04
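
A minimal sketch of what such a pkgs.testers helper could look like, assuming only lib, runCommand and gnugrep from nixpkgs; the name ldDebugAsserts, its arguments and the grep patterns are illustrative, not an existing nixpkgs API:

{ lib, runCommand, gnugrep }:

# ldDebugAsserts: run `command` under LD_DEBUG=libs and assert on which sonames get loaded.
{ name, command, mustLoad ? [ ], mustNotLoad ? [ ] }:

runCommand "ld-debug-asserts-${name}" { nativeBuildInputs = [ gnugrep ]; } ''
  # Capture the dynamic loader's library-resolution trace for the command under test.
  LD_DEBUG=libs ${command} 2> ld-debug.log || true

  # "calling init: <path>" lines mark libraries that were actually loaded.
  ${lib.concatMapStringsSep "\n" (soname: ''
    grep -q 'calling init: .*${soname}' ld-debug.log \
      || { echo "FAIL: expected ${soname} to be loaded"; exit 1; }
  '') mustLoad}
  ${lib.concatMapStringsSep "\n" (soname: ''
    ! grep -q 'calling init: .*${soname}' ld-debug.log \
      || { echo "FAIL: ${soname} should not have been loaded"; exit 1; }
  '') mustNotLoad}

  touch $out
''

A caller would then pass e.g. mustLoad = [ "libcudart.so.12" ] together with mustNotLoad entries for the sonames a package must not pick up from another package's closure; the "loaded from a path matching a pattern" assertion would grep the same log for the full store paths.
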
@ss:someonex.netSomeoneSerge (utc+3)

As an example, I’d hope OpenCV would load the CUDA libraries it was built with, and the other packages would load the CUDA libraries from their expressions (not the OpenCV one).

in most of our practical cases (native extensions in python) it's "who gets there first"

17:27:52
@connorbaker:matrix.orgconnor (he/him) (UTC-7)

So... PyTorch does something similar, in that packages with extensions are supposed to use the same version of CUDA libraries the PyTorch package was built with (and so we have cudaPackages in torch.passthru for downstream consumers).

Assuming OpenCV, PyTorch, and the other packages using CUDA libraries don't play nicely with each other, are there any solutions you can think of?
Best I can imagine is some sort of logic to compute the maximum supported version of cudaPackages, or else package consumers being responsible for handling that themselves. Outside of that, unless there's a way to ensure each package has a unique namespace for the libraries it looks for, and they can co-exist at runtime, I don't know how to resolve such an issue.

17:54:42
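
A hedged illustration of the pattern described above: a downstream native extension reusing the exact CUDA package set its torch was built with, via the passthru attribute mentioned in the message; the extension itself (my-torch-ext) and its inputs are hypothetical.

{ python3Packages }:

let
  inherit (python3Packages) torch;
  # The same cudaPackages instance torch was built against, exposed via torch.passthru.
  cudaPackages = torch.cudaPackages;
in
python3Packages.buildPythonPackage {
  pname = "my-torch-ext";   # hypothetical downstream extension
  version = "0.1.0";
  format = "setuptools";
  src = ./.;

  nativeBuildInputs = [ cudaPackages.cuda_nvcc ];
  buildInputs = [ cudaPackages.cuda_cudart ];
  propagatedBuildInputs = [ torch ];
}
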
@ss:someonex.netSomeoneSerge (utc+3)Hypothesis: it should probably be OK to build with one version of cudart and execute with a newer one, otherwise all other distributions would have been permanently broken. So we should try to do the same thing that we should start doing wrt libc: build against a "compatible" version, but exclude it from the closure in favour of linking the newest in the package set18:39:26
@ss:someonex.netSomeoneSerge (utc+3)Implicit evidence: at the end of the day users do put packages like tensorflow and pytorch and opencv into the same fixpoints (venvs), so it doesn't matter that our fixpoint is larger because the corner cases are the same18:42:47
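
One hedged way to spot-check that hypothesis: run a binary built against the 12.3 cudart with the 12.4 cudart on the loader path instead. The sketch below leans on cudaPackages' saxpy sample purely as a convenient CUDA binary; it needs a real GPU at run time, so it is illustrative rather than something Hydra could build as-is.

{ lib, runCommand, cudaPackages_12_3, cudaPackages_12_4 }:

runCommand "cudart-forward-compat-check" { } ''
  # Nix-built binaries carry DT_RUNPATH, so LD_LIBRARY_PATH takes precedence and the
  # newer libcudart.so.12 is picked up instead of the one saxpy was built against.
  export LD_LIBRARY_PATH=${lib.getLib cudaPackages_12_4.cuda_cudart}/lib
  ${cudaPackages_12_3.saxpy}/bin/saxpy
  touch $out
''
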
@connorbaker:matrix.orgconnor (he/him) (UTC-7)Working on rebasing https://github.com/NixOS/nixpkgs/pull/306172 and... wow, NVIDIA just keeps pushing out updates, don't they? CUDA 12.6.1? TensorRT 10.4?? I really gotta clean up the update scripts21:26:08
@connorbaker:matrix.orgconnor (he/him) (UTC-7) *

Also, I need to do something to split up how I handle fetching packages, to avoid a single massive derivation containing every tarball NVIDIA has:

$ nix path-info -Sh --impure .#cuda-redist-index
warning: Nix search path entry '/nix/var/nix/profiles/per-user/root/channels' does not exist, ignoring
/nix/store/412899ispzymkv5fgvav37j7v6sk5i7m-mk-index-of-package-info	 610.2 GiB
21:28:31
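
One hedged way to do that split: one fetchurl per manifest entry, joined with linkFarm, so no single derivation has to hold every tarball. The manifest filename and its attribute layout below are assumptions, not NVIDIA's actual redistrib schema.

{ lib, fetchurl, linkFarm }:

let
  # Hypothetical local copy of an NVIDIA redistrib manifest.
  manifest = lib.importJSON ./redistrib_12.6.1.json;

  # One small fixed-output derivation per tarball instead of one giant index derivation.
  entries = lib.mapAttrsToList (pname: pkg: {
    name = "${pname}-${pkg.version}.tar.xz";
    path = fetchurl {
      url = "https://developer.download.nvidia.com/compute/cuda/redist/${pkg.relative_path}";
      sha256 = pkg.sha256;
    };
  }) manifest.packages;
in
# linkFarm only creates symlinks, so the aggregate stays tiny even when the tarballs are huge.
linkFarm "cuda-redist-index" entries
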
10 Sep 2024
@matthewcroughan:defenestrate.itmatthewcroughan changed their display name from matthewcroughan - going to nix.camp to matthewcroughan.15:52:11
@connorbaker:matrix.orgconnor (he/him) (UTC-7) So... I really don't want to have to figure out testing and stuff for OpenCV for https://github.com/NixOS/nixpkgs/pull/339619.
OpenCV 4.10 (we have 4.9) supports CUDA 12.4+. Maybe just updating it to punt the issue down the road is fine? (Our latest CUDA version right now is 12.4.)
23:05:06
@connorbaker:matrix.orgconnor (he/him) (UTC-7)
In reply to @ss:someonex.net
Hypothesis: it should probably be OK to build with one version of cudart and execute with a newer one, otherwise all other distributions would have been permanently broken. So we should try to do the same thing that we should start doing wrt libc: build against a "compatible" version, but exclude it from the closure in favour of linking the newest in the package set
* wouldn't things like API changes between versions cause breakage?
EDIT: I guess they would cause build failures... my primary concern was that it would cause failures at runtime, but I suppose that's not really a problem for compiled targets. Relative to libc, NVIDIA's libraries change way, way more between releases (even minor versions!).
23:31:05
@adam:robins.wtfadamcstephens
In reply to @adam:robins.wtf

hmm, ollama is failing for me on unstable

Sep 07 15:59:47 sink1 ollama[1314]: time=2024-09-07T15:59:47.680-04:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/srv/fast/ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 gpu=GPU-c2c9209f-9632-bb03-ca95-d903c8664a1a parallel=4 available=12396331008 required="11.1 GiB"
Sep 07 15:59:47 sink1 ollama[1314]: time=2024-09-07T15:59:47.681-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=28 layers.offload=28 layers.split="" memory.available="[11.5 GiB]" memory.required.full="11.1 GiB" memory.required.partial="11.1 GiB" memory.required.kv="2.1 GiB" memory.required.allocations="[11.1 GiB]" memory.weights.total="10.1 GiB" memory.weights.repeating="10.0 GiB" memory.weights.nonrepeating="164.1 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="391.4 MiB"
Sep 07 15:59:47 sink1 ollama[1314]: time=2024-09-07T15:59:47.695-04:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama1289771407/runners/cuda_v12/ollama_llama_server --model /srv/fast/ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 28 --parallel 4 --port 35991"
Sep 07 15:59:47 sink1 ollama[1314]: time=2024-09-07T15:59:47.696-04:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 07 15:59:47 sink1 ollama[1314]: time=2024-09-07T15:59:47.696-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 07 15:59:47 sink1 ollama[1314]: time=2024-09-07T15:59:47.696-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 07 15:59:47 sink1 ollama[1314]: /tmp/ollama1289771407/runners/cuda_v12/ollama_llama_server: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory
Sep 07 15:59:47 sink1 ollama[1314]: time=2024-09-07T15:59:47.947-04:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: exit status 127"

* I just spent some time looking into this again, and it appears the issue is cudaPackages. When trying the larger config.cudaSupport change, I had to downgrade cudaPackages to 12.3 to build successfully. Leaving this downgrade in place allows ollama to work even without using config.cudaSupport
23:41:28
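
A hedged sketch of that downgrade as an overlay, pinning only ollama to the CUDA 12.3 package set; this assumes ollama's derivation accepts cudaPackages and acceleration arguments, which should be checked against the actual expression.

final: prev: {
  # Build ollama against CUDA 12.3 while the rest of the system stays on the default set.
  ollama = prev.ollama.override {
    acceleration = "cuda";
    cudaPackages = final.cudaPackages_12_3;
  };
}
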
@connorbaker:matrix.orgconnor (he/him) (UTC-7)Any idea if it's just CUDA 12.4, or if it also had to do with the version bump https://github.com/NixOS/nixpkgs/pull/331585?23:45:24
@connorbaker:matrix.orgconnor (he/him) (UTC-7)Although it looks like they didn't add CUDA 12 support until 0.3.7 (https://github.com/ollama/ollama/releases/tag/v0.3.7)23:45:58
@connorbaker:matrix.orgconnor (he/him) (UTC-7)What driver version are you using?23:47:24
@adam:robins.wtfadamcstephensi can try and downgrade ollama and see23:47:27
@connorbaker:matrix.orgconnor (he/him) (UTC-7)Can you try upgrading it as well? Looks like 0.3.10 is out now23:47:59
@adam:robins.wtfadamcstephens560.35.0323:48:19
@adam:robins.wtfadamcstephens
In reply to @connorbaker:matrix.org
Can you try upgrading it as well? Looks like 0.3.10 is out now
yeah i'll try that first
23:48:32
@connorbaker:matrix.orgconnor (he/him) (UTC-7)Is this a NixOS system, and what GPU?23:50:19
@adam:robins.wtfadamcstephens * yes, NixOS. 306023:52:13
@adam:robins.wtfadamcstephens06:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] (rev a1)23:52:25
11 Sep 2024
@adam:robins.wtfadamcstephens
results of my ollama testing are:

  • 0.3.5 - works with cudaPackages 12_3 and 12_4
  • 0.3.9 - works on 12_3, broken on 12_4
  • 0.3.10 - works on 12_3, broken on 12_4
01:13:46
@connorbaker:matrix.orgconnor (he/him) (UTC-7)It is surprising to me that 0.3.5 works with CUDA 12 at all; I guess there were no breaking API changes on stuff they relied on?18:05:26
12 Sep 2024
@connorbaker:matrix.orgconnor (he/him) (UTC-7)
In reply to @connorbaker:matrix.org
So... I really don't want to have to figure out testing and stuff for OpenCV for https://github.com/NixOS/nixpkgs/pull/339619.
OpenCV 4.10 (we have 4.9) supports CUDA 12.4+. Maybe just updating it to punt the issue down the road is fine? (Our latest CUDA version right now is 12.4.)
I started writing a pkgs.testers implementation for what Serge suggested here: https://matrix.to/#/!eWOErHSaiddIbsUNsJ:nixos.org/$phSCjT-mxTap-ccF98Z7hZakHk3_-jjkPw2fIvzBhjA?via=nixos.org&via=matrix.org&via=nixos.dev
00:32:04
@connorbaker:matrix.orgconnor (he/him) (UTC-7) SomeoneSerge (nix.camp): as a short-term thing, are you okay with me patching out OpenCV's requirement that the CUDA versions match, so we can merge the CUDA fix? 23:30:01


