| 19 May 2024 |
SomeoneSerge (matrix works sometimes) | CUDA not ready for CA derivations xD | 16:42:46 |
evax | SomeoneSerge (UTC+3): some more details: I could make jaxlib-bin work by overriding the src package to point to the cuda11 version instead of cuda12. For some reason the jaxlibWithCuda package seems to be missing the cuda folder (flake setup, nixos-23.11 nixpkgs, cuda-maintainers cache) | 17:40:26 |
SomeoneSerge (matrix works sometimes) | In reply to @evax:matrix.org (message quoted above) Jaxlib-bin is prebuilt against a concrete version of cuda. Nixpkgs manually pins that version. If an override helped you, it's a bug in nixpkgs (pinning the wrong cuda).
Jaxlib (without bin) otoh should work with any cuda. We'd still need LD_DEBUG=libs to derive any more conclusions | 17:55:19 |
evax | I can't really move any data to/from that system, could you tell me more about what to look for in the output? | 18:12:15 |
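The LD_DEBUG=libs trace mentioned above makes the dynamic loader print which files it resolves; the line to look for is `calling init: <path>`, which names the file that was actually loaded. A minimal sketch of filtering such a trace for libcuda (the sample lines are an illustrative assumption, not evax's real output; on NixOS the driver library typically resolves under /run/opengl-driver/lib):

```python
import re

# Illustrative LD_DEBUG=libs output, not a real capture.
SAMPLE_TRACE = """\
    123:	find library=libcuda.so.1 [0]; searching
    123:	  trying file=/run/opengl-driver/lib/libcuda.so.1
    123:	calling init: /run/opengl-driver/lib/libcuda.so.1
"""

def resolved_libraries(trace, name="libcuda"):
    # The loader prints "calling init: <path>" once per library it
    # actually initialised; grep those lines for the one you care about.
    return re.findall(r"calling init:\s*(\S*%s\S*)" % re.escape(name), trace)

print(resolved_libraries(SAMPLE_TRACE))
```

In the real session you would run something like `LD_DEBUG=libs python your_script.py 2>&1 | grep libcuda` and read the `calling init:` line.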
aidalgol | In reply to @connorbaker:matrix.org
aidalgol: running nix-cuda-test I see it on my nvidia-smi
$ nvidia-smi
Sun May 19 15:11:11 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 Off | Off |
| 45% 56C P2 347W / 500W | 8187MiB / 24564MiB | 96% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3656630 C ...y88kh-python3-3.11.9/bin/python3.11 8180MiB |
+-----------------------------------------------------------------------------------------+
Sorry, what's nix-cuda-test? | 18:30:56 |
connor (he/him) | Ah my bad, https://github.com/ConnorBaker/nix-cuda-test/tree/main | 18:31:16 |
SomeoneSerge (matrix works sometimes) | Depends on the exact error message, but firstly: where is libcuda.so loaded from | 18:56:19 |
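To answer "where is libcuda.so loaded from" from inside a live Python process, you can also read the process's own memory maps instead of re-running under the loader trace. A Linux-only sketch (in the failing session you would pass "libcuda" as the filter; here it lists every mapped shared object):

```python
def loaded_libraries(substring=""):
    """List the files backing this process's mapped shared objects.

    Reads /proc/self/maps (Linux-only). Filter with e.g. "libcuda" to
    see which driver library, if any, the process actually loaded.
    """
    paths = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            fields = line.split()
            path = fields[-1] if fields else ""
            if ".so" in path and substring in path:
                paths.add(path)
    return sorted(paths)

if __name__ == "__main__":
    for path in loaded_libraries():
        print(path)
```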
aidalgol | In reply to @connorbaker:matrix.org Ah my bad, https://github.com/ConnorBaker/nix-cuda-test/tree/main And what on earth is https://cantcache.me/ ? 🧐 | 19:00:20 |
connor (he/him) | It's a binary cache I made for myself using Attic (https://github.com/zhaofengli/attic) | 19:00:53 |
connor (he/him) | I thought the domain name was funny and bought it so I've been using it on and off for various things | 19:01:04 |
evax | the error message is that "a CUDA enabled jaxlib is not installed" - and it's actually not trying to load libcuda | 20:07:27 |
evax | now if I look at the path jaxlib is loaded from in python, there's no cuda folder in there (but there's one when I use jaxlib-bin) | 20:10:11 |
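That check can be scripted: per the conversation, a CUDA-enabled jaxlib install carries a `cuda` subfolder next to its Python modules, so its presence is a quick heuristic. A sketch (the folder name comes from the discussion above; the rest of the layout is assumed):

```python
from pathlib import Path

def jaxlib_has_cuda_dir(package_dir):
    # Heuristic from the discussion: a CUDA-enabled jaxlib install
    # ships a `cuda` subfolder alongside its modules.
    return (Path(package_dir) / "cuda").is_dir()

# In the real session you would check the installed package itself:
#   import jaxlib
#   jaxlib_has_cuda_dir(Path(jaxlib.__file__).parent)
```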
aidalgol | In reply to @connorbaker:matrix.org Ah my bad, https://github.com/ConnorBaker/nix-cuda-test/tree/main I get an error trying to run that.
❯ nix run github:ConnorBaker/nix-cuda-test#nix-cuda-test
do you want to allow configuration setting 'extra-substituters' to be set to 'https://cantcache.me/cuda https://cuda-maintainers.cachix.org' (y/N)? y
do you want to permanently mark this value as trusted (y/N)?
do you want to allow configuration setting 'extra-trusted-public-keys' to be set to 'cuda:NtbpAU7XGYlttrhCduqvpYKottCPdWVITWT+3nFVTBY= cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E=' (y/N)? y
do you want to permanently mark this value as trusted (y/N)?
do you want to allow configuration setting 'extra-trusted-substituters' to be set to 'https://cantcache.me/cuda https://cuda-maintainers.cachix.org' (y/N)? y
do you want to permanently mark this value as trusted (y/N)?
error: builder for '/nix/store/vhh1jmqaf9pn9sfkygi8kn1l8lp8m322-python3.11-nix-cuda-test-0.1.0.drv' failed with exit code 1;
last 25 log lines:
> Using pypaInstallPhase
> Sourcing python-imports-check-hook.sh
> Using pythonImportsCheckPhase
> Sourcing python-namespaces-hook
> Sourcing python-catch-conflicts-hook.sh
> Running phase: unpackPhase
> unpacking source archive /nix/store/hb7ifp4m6n79cfgpc7ipnwp7cam9x71w-source
> source root is source
> setting SOURCE_DATE_EPOCH to timestamp 315619200 of file source/pyproject.toml
> Running phase: patchPhase
> Running phase: updateAutotoolsGnuConfigScriptsPhase
> Running phase: configurePhase
> no configure script, doing nothing
> Running phase: buildPhase
> Executing pypaBuildPhase
> Creating a wheel...
> * Getting build dependencies for wheel...
> * Building wheel...
> Successfully built nix_cuda_test-0.1.0-py3-none-any.whl
> Finished creating a wheel...
> Finished executing pypaBuildPhase
> Running phase: pythonRuntimeDepsCheckHook
> Executing pythonRuntimeDepsCheck
> Checking runtime dependencies for nix_cuda_test-0.1.0-py3-none-any.whl
> - torchvision>=0.15.0 not satisfied by version 0.18.0a0
| 20:19:05 |
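The failing check at the end of that log is worth unpacking: `0.18.0a0` is a PEP 440 pre-release, and pip-style specifier matching excludes pre-releases by default, so it does not satisfy `>=0.15.0` even though 0.18 > 0.15. A rough hand-rolled sketch of that behaviour (real tools use the `packaging` library; this parser is a deliberate simplification):

```python
def is_prerelease(version):
    # Crude PEP 440 heuristic: any letter marks an a/b/rc/dev pre-release.
    return any(ch.isalpha() for ch in version)

def satisfies_minimum(version, minimum, allow_prereleases=False):
    if is_prerelease(version) and not allow_prereleases:
        # Default behaviour of pip-style specifier matching:
        # pre-releases never satisfy a plain ">=" constraint.
        return False
    release = tuple(int(p) for p in version.split(".") if p.isdigit())
    floor = tuple(int(p) for p in minimum.split(".") if p.isdigit())
    return release >= floor
```

This is why a torchvision built as `0.18.0a0` trips nixpkgs' runtime-dependency check even though it is newer than the required 0.15.0.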
evax | I think I'm exactly in the situation described here: https://github.com/NixOS/nixpkgs/issues/282184 | 20:24:52 |
evax | and I went through pretty much the same steps | 20:25:11 |
evax | I can get jaxlib-bin to work for me, but jaxlibWithCuda doesn't seem to ship cuda support | 20:25:53 |
evax | wait, the fix probably was never backported to 23.11 | 20:32:27 |
evax | https://github.com/NixOS/nixpkgs/pull/288857 | 20:33:33 |
connor (he/him) | That’s what the whole conversation with Hexa earlier was about. For the fix see https://matrix.to/#/!eWOErHSaiddIbsUNsJ%3Anixos.org/%24AqWIeELQaHl8v5LeT6F4ZPGscJfBAJp1ZvL8vtDHrGY | 22:17:26 |
aidalgol | Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/nix/store/h3wgj19ywjsd5ha976k8kajmg1jz7lpw-python3.11-torch-2.3.0/lib/python3.11/site-packages/torch/cuda/__init__.py:184: UserWarning:
Found GPU0 NVIDIA GeForce RTX 3080 which is of cuda capability 8.6.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 8.9.
warnings.warn(
Files already downloaded and verified
Files already downloaded and verified
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0
1 | model | ViT | 86.3 M
-----------------------------------------------
86.3 M Trainable params
0 Non-trainable params
86.3 M Total params
345.317 Total estimated model params size (MB)
Sanity Checking: | | 0/? [00:00<?, ?it/s]
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
.nix-cuda-test-wrapped 9 <module>
sys.exit(main())
__main__.py 128 main
trainer.fit(
trainer.py 544 fit
call._call_and_handle_interrupt(
call.py 44 _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
trainer.py 580 _fit_impl
self._run(model, ckpt_path=ckpt_path)
trainer.py 987 _run
results = self._run_stage()
trainer.py 1031 _run_stage
self._run_sanity_check()
trainer.py 1060 _run_sanity_check
val_loop.run()
utilities.py 182 _decorator
return loop_run(self, *args, **kwargs)
evaluation_loop.py 110 run
self.setup_data()
evaluation_loop.py 192 setup_data
length = len(dl) if has_len_all_ranks(dl, trainer.strategy, allow_zero_length) else float("inf")
data.py 103 has_len_all_ranks
if total_length == 0:
RuntimeError:
CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
| 23:01:00 |
aidalgol | An RTX 3080 is too old?! | 23:01:15 |
SomeoneSerge (matrix works sometimes) | In reply to @aidalgol:matrix.org (traceback quoted above) Did you build with cudaCapabilities = [ ... "8.6" ... ]? | 23:49:24 |
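The `no kernel image is available` error above follows from how CUDA fatbinaries work: cubin kernels run on the capability they were compiled for, while embedded PTX can be JIT-compiled for any newer device. A toy model of that rule, with capabilities as (major, minor) tuples (the exact-match cubin behaviour here is a simplification; the session above built torch for 8.9 only, but the RTX 3080 is 8.6):

```python
def has_kernel_image(device, cubin_caps, ptx_caps=()):
    # Simplified model: cubin only matches the capability it targets;
    # PTX for capability X can be JIT-compiled on any device >= X.
    return device in cubin_caps or any(device >= p for p in ptx_caps)
```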
SomeoneSerge (matrix works sometimes) | * Oh... | 23:51:03 |
SomeoneSerge (matrix works sometimes) | (Did I just edit a message instead of sending a new one again?) | 23:51:45 |
aidalgol | I just cloned https://github.com/ConnorBaker/nix-cuda-test.git, applied the fix described earlier, then ran nix run .#nix-cuda-test | 23:52:36 |
SomeoneSerge (matrix works sometimes) | In reply to @aidalgol:matrix.org An RTX 3080 is too old?! So yeah there was a question about cudaCapabilities = [ ... "8.6" ] | 23:52:41 |
aidalgol | I saw it. 👍️ | 23:52:53 |
SomeoneSerge (matrix works sometimes) | In reply to @aidalgol:matrix.org I just cloned https://github.com/ConnorBaker/nix-cuda-test.git, applied the fix described earlier, then ran nix run .#nix-cuda-test Our culprit: https://github.com/ConnorBaker/nix-cuda-test/blob/182c2148e6df0932fe19f9cb7180173ee2f9cb2d/flake.nix#L66 | 23:53:08 |
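The linked flake.nix hardcodes the capability list; a hedged sketch of the kind of fix under discussion, using nixpkgs' `config.cudaCapabilities` option (the surrounding flake wiring is an assumption, only the config attributes follow nixpkgs conventions):

```nix
# Sketch: instantiate nixpkgs with the target GPU's capability included.
pkgs = import nixpkgs {
  inherit system;
  config = {
    allowUnfree = true;
    cudaSupport = true;
    # RTX 3080 is compute capability 8.6; build kernels for it too.
    cudaCapabilities = [ "8.6" "8.9" ];
  };
};
```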