
NixOS CUDA

315 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



19 May 2024
@evax:matrix.orgevaxthe error message is that "a CUDA enabled jaxlib is not installed" - and it's actually not trying to load libcuda 20:07:34
@evax:matrix.orgevaxnow if I look at the path jaxlib is loaded from in python, there's no cuda folder in there (but there's one when I use jaxlib-bin)20:10:11
@aidalgol:matrix.orgaidalgol
In reply to @connorbaker:matrix.org
Ah my bad, https://github.com/ConnorBaker/nix-cuda-test/tree/main

I get an error trying to run that.

❯ nix run github:ConnorBaker/nix-cuda-test#nix-cuda-test
do you want to allow configuration setting 'extra-substituters' to be set to 'https://cantcache.me/cuda https://cuda-maintainers.cachix.org' (y/N)? y
do you want to permanently mark this value as trusted (y/N)? 
do you want to allow configuration setting 'extra-trusted-public-keys' to be set to 'cuda:NtbpAU7XGYlttrhCduqvpYKottCPdWVITWT+3nFVTBY= cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E=' (y/N)? y
do you want to permanently mark this value as trusted (y/N)? 
do you want to allow configuration setting 'extra-trusted-substituters' to be set to 'https://cantcache.me/cuda https://cuda-maintainers.cachix.org' (y/N)? y
do you want to permanently mark this value as trusted (y/N)? 
error: builder for '/nix/store/vhh1jmqaf9pn9sfkygi8kn1l8lp8m322-python3.11-nix-cuda-test-0.1.0.drv' failed with exit code 1;
       last 25 log lines:
       > Using pypaInstallPhase
       > Sourcing python-imports-check-hook.sh
       > Using pythonImportsCheckPhase
       > Sourcing python-namespaces-hook
       > Sourcing python-catch-conflicts-hook.sh
       > Running phase: unpackPhase
       > unpacking source archive /nix/store/hb7ifp4m6n79cfgpc7ipnwp7cam9x71w-source
       > source root is source
       > setting SOURCE_DATE_EPOCH to timestamp 315619200 of file source/pyproject.toml
       > Running phase: patchPhase
       > Running phase: updateAutotoolsGnuConfigScriptsPhase
       > Running phase: configurePhase
       > no configure script, doing nothing
       > Running phase: buildPhase
       > Executing pypaBuildPhase
       > Creating a wheel...
       > * Getting build dependencies for wheel...
       > * Building wheel...
       > Successfully built nix_cuda_test-0.1.0-py3-none-any.whl
       > Finished creating a wheel...
       > Finished executing pypaBuildPhase
       > Running phase: pythonRuntimeDepsCheckHook
       > Executing pythonRuntimeDepsCheck
       > Checking runtime dependencies for nix_cuda_test-0.1.0-py3-none-any.whl
       >   - torchvision>=0.15.0 not satisfied by version 0.18.0a0

20:19:05
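The failing last log line above is worth decoding: 0.18.0a0 is a PEP 440 pre-release, and specifiers like torchvision>=0.15.0 exclude pre-releases by default, so "not satisfied by version 0.18.0a0" is expected PEP 440 behaviour rather than the version being too low. A minimal sketch with the packaging library (assuming, plausibly but not confirmed here, that the runtime-deps hook applies the same PEP 440 semantics):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet(">=0.15.0")
candidate = Version("0.18.0a0")  # an alpha pre-release

# By default, PEP 440 specifiers skip pre-release versions entirely.
print(candidate in spec)                           # False
# Opting in to pre-releases makes the comparison succeed.
print(spec.contains(candidate, prereleases=True))  # True
```

This is why a locally built 0.18.0a0 torchvision can fail a check that a released 0.18.0 would pass.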
@evax:matrix.orgevaxI think I'm exactly in the situation described here: https://github.com/NixOS/nixpkgs/issues/282184 20:24:52
@evax:matrix.orgevaxand I went through pretty much the same steps20:25:11
@evax:matrix.orgevaxI can get jaxlib-bin to work for me, but jaxlibWithCuda doesn't seem to ship cuda support20:25:53
@evax:matrix.orgevaxwait, the fix probably was never backported to 23.11 20:32:27
@evax:matrix.orgevaxhttps://github.com/NixOS/nixpkgs/pull/288857 20:33:33
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)That’s what the whole conversation with Hexa earlier was about. For the fix see https://matrix.to/#/!eWOErHSaiddIbsUNsJ%3Anixos.org/%24AqWIeELQaHl8v5LeT6F4ZPGscJfBAJp1ZvL8vtDHrGY 22:17:26
@aidalgol:matrix.orgaidalgol
Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/nix/store/h3wgj19ywjsd5ha976k8kajmg1jz7lpw-python3.11-torch-2.3.0/lib/python3.11/site-packages/torch/cuda/__init__.py:184: UserWarning: 
    Found GPU0 NVIDIA GeForce RTX 3080 which is of cuda capability 8.6.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is 8.9.
    
  warnings.warn(
Files already downloaded and verified
Files already downloaded and verified
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | model     | ViT              | 86.3 M
-----------------------------------------------
86.3 M    Trainable params
0         Non-trainable params
86.3 M    Total params
345.317   Total estimated model params size (MB)
Sanity Checking: |          | 0/? [00:00<?, ?it/s]
------------------------------------------------------------
.nix-cuda-test-wrapped 9 <module>
sys.exit(main())

__main__.py 128 main
trainer.fit(

trainer.py 544 fit
call._call_and_handle_interrupt(

call.py 44 _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)

trainer.py 580 _fit_impl
self._run(model, ckpt_path=ckpt_path)

trainer.py 987 _run
results = self._run_stage()

trainer.py 1031 _run_stage
self._run_sanity_check()

trainer.py 1060 _run_sanity_check
val_loop.run()

utilities.py 182 _decorator
return loop_run(self, *args, **kwargs)

evaluation_loop.py 110 run
self.setup_data()

evaluation_loop.py 192 setup_data
length = len(dl) if has_len_all_ranks(dl, trainer.strategy, allow_zero_length) else float("inf")

data.py 103 has_len_all_ranks
if total_length == 0:

RuntimeError:
CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
23:01:00
@aidalgol:matrix.orgaidalgol An RTX 3080 is too old?! 23:01:15
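For context on the "no kernel image" error: it generally means the torch binaries contain no SASS or PTX usable on the device's compute capability. A toy model of that check (a hypothetical helper, not PyTorch's actual logic; it follows the CUDA rule that SASS runs on the same major architecture at an equal or higher device minor revision, while embedded PTX can be JIT-compiled for any equal-or-newer capability):

```python
def kernel_image_available(device_cap, arch_list):
    """device_cap like (8, 6); arch_list like ["sm_89", "compute_80"]."""
    major, minor = device_cap
    for arch in arch_list:
        kind, _, num = arch.partition("_")
        a_major, a_minor = divmod(int(num), 10)
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True  # compatible SASS: same major arch, not a newer minor
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True  # PTX can be JIT-compiled for a newer device
    return False

# An RTX 3080 (capability 8.6) against a torch built only for 8.9:
print(kernel_image_available((8, 6), ["sm_89"]))           # False
# The same GPU against a build that targeted 8.6:
print(kernel_image_available((8, 6), ["sm_86", "sm_89"]))  # True
```

So the GPU is not "too old" in absolute terms; this particular build simply did not target capability 8.6.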
@ss:someonex.netSomeoneSerge (matrix works sometimes)
In reply to @aidalgol:matrix.org
RuntimeError: CUDA error: no kernel image is available for execution on the device [full log quoted above]
Did you build with cudaCapabilities = [ ... "8.6" ... ]?
23:49:24
@ss:someonex.netSomeoneSerge (matrix works sometimes)* Oh...23:51:03
@ss:someonex.netSomeoneSerge (matrix works sometimes)(Did I just edit a message instead of sending a new one again?)23:51:45
@aidalgol:matrix.orgaidalgol I just cloned https://github.com/ConnorBaker/nix-cuda-test.git, applied the fix described earlier, then ran nix run .#nix-cuda-test 23:52:36
@ss:someonex.netSomeoneSerge (matrix works sometimes)
In reply to @aidalgol:matrix.org
An RTX 3080 is too old?!
So yeah there was a question about cudaCapabilities = [ ... "8.6" ]
23:52:41
@aidalgol:matrix.orgaidalgolI saw it. 👍️23:52:53
@ss:someonex.netSomeoneSerge (matrix works sometimes)
In reply to @aidalgol:matrix.org
I just cloned https://github.com/ConnorBaker/nix-cuda-test.git, applied the fix described earlier, then ran nix run .#nix-cuda-test
Our culprit: https://github.com/ConnorBaker/nix-cuda-test/blob/182c2148e6df0932fe19f9cb7180173ee2f9cb2d/flake.nix#L66
23:53:08
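For reference, the capability list being discussed is an ordinary nixpkgs config option. A minimal sketch of pinning it when importing nixpkgs (assuming the standard config.cudaCapabilities mechanism; "8.6" is the RTX 3080's compute capability):

```nix
import nixpkgs {
  config = {
    allowUnfree = true;       # CUDA packages are unfree
    cudaSupport = true;
    # Build GPU kernels for Ampere consumer cards (RTX 30xx).
    cudaCapabilities = [ "8.6" ];
  };
}
```

A flake that hardcodes someone else's capability (as in the linked flake.nix line) will build fine but fail at runtime on other GPUs.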
20 May 2024
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)Ah yeah sorry I usually just have it set to my GPU to speed up compile since I use it to test PRs00:22:23
@aidalgol:matrix.orgaidalgol Sooo... while that was building, KDE crashed to the display manager, and now the GPU usage is showing a non-zero value as I expected earlier. I have no idea what could have changed. 01:12:24
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)Maybe it forced the driver to reload lol03:46:33
@ss:someonex.netSomeoneSerge (matrix works sometimes)

connor (he/him) (UTC-5): Samuel Ainsworth: Madoura:

1. I'd like us to add a more generic alias to this room that would encompass a wider range of topics, rocm at the least. The reasoning is the same: I don't think we need a special room for cuda... Madoura: what do rocm-maintainers think about their presence on matrix?
2. I notice we never brought up the subject of onboarding new people anywhere in public, not even discourse. Wdyt should be done about that?

14:20:05
@trexd:matrix.orgtrexdCould call the room nixos-gpu since that covers rocm too.14:21:23
@ss:someonex.netSomeoneSerge (matrix works sometimes)
In reply to @trexd:matrix.org
Could call the room nixos-gpu since that covers rocm too.
Yes, gpu/coprocessors/accelerators/scicomp/ai even/anything in that direction. There already is nixos hpc and nixos data science, but I don't see much conversation there; what has to be changed to spark conversations? There's activity in matthewcroughan's flake room, and sometimes here, but not in nixos ds.
14:34:08
@ss:someonex.netSomeoneSerge (matrix works sometimes)* Yes, gpu/coprocessors/accelerators/scicomp/ai even/anything in that direction. Well there already is nixos hpc and nixos data science but I don't see much conversation there, what has to be changed to spark conversations? There's activity in matthewcroughan's flake room, and sometimes here, but not in nixos ds.14:34:25
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)

To tackle diamond dependencies (among other things), I started making https://github.com/ConnorBaker/dep-tracker

Specify a flake attribute for a package and it’ll grab a copy of all the attributes on the package containing a list of dependencies (the attributes it looks for are here https://github.com/ConnorBaker/dep-tracker/blob/cd8e927c561f3f1ed5c904609654c946d85cf954/packages/dep-tracker/dep_tracker/types.py#L15). It’ll look through those arrays and populate a SQLite database with libraries it finds in those dependencies.

Now, a question: besides recursing and doing the same for every dependency I find (that is, harvesting attributes and updating the database), is there an easier way to get the closure of dependencies without building the package? IIRC nix path-info requires the package be built.

A different question: with that hardcoded list of attributes I inspect, is it possible I’d miss dependencies (and therefore libraries) which are present in the closure?

@someoneserge you have good ideas about finding dependencies — any suggestions? Currently finding what a dependency provides is limited to listing the names of libraries present under lib (https://github.com/ConnorBaker/dep-tracker/blob/cd8e927c561f3f1ed5c904609654c946d85cf954/packages/dep-tracker/dep_tracker/deps.py#L42) and finding out what a library needs is accomplished through patchelf (https://github.com/ConnorBaker/dep-tracker/blob/cd8e927c561f3f1ed5c904609654c946d85cf954/packages/dep-tracker/dep_tracker/deps.py#L56)

17:57:11
@tpw_rules:matrix.orgtpw_rulesis there some reason we've got an out of date tensorflow build from source?18:07:41
@tpw_rules:matrix.orgtpw_rulesoh maybe the bin has cuda support now?18:08:17
@ss:someonex.netSomeoneSerge (matrix works sometimes)

IIRC nix path-info requires the package be built.

For deciding which dependencies to retain a runtime reference to

18:11:59
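On the closure-without-building question: one route (my assumption here, not something established in the thread) is `nix derivation show -r <installable>`, which prints JSON for the whole derivation closure without realising any outputs. A sketch of walking that JSON's inputDrvs edges, over a hypothetical miniature of the structure:

```python
import json

def drv_closure(drvs, root):
    """Collect the transitive inputDrvs closure from
    `nix derivation show -r` style JSON (drv path -> attrs)."""
    seen, stack = set(), [root]
    while stack:
        path = stack.pop()
        if path in seen:
            continue
        seen.add(path)
        # inputDrvs maps dependency .drv paths to the outputs used.
        stack.extend(drvs.get(path, {}).get("inputDrvs", {}))
    return seen

# Hypothetical miniature of the JSON structure:
drvs = json.loads("""
{
  "/nix/store/aaa-app.drv": {"inputDrvs": {"/nix/store/bbb-lib.drv": ["out"]}},
  "/nix/store/bbb-lib.drv": {"inputDrvs": {}}
}
""")
print(sorted(drv_closure(drvs, "/nix/store/aaa-app.drv")))
# ['/nix/store/aaa-app.drv', '/nix/store/bbb-lib.drv']
```

Note this is the build-time derivation closure; the runtime closure is narrower and is only known after a build decides which references to retain.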
