| 19 May 2024 |
connor (he/him) | That’s what the whole conversation with Hexa earlier was about. For the fix see https://matrix.to/#/!eWOErHSaiddIbsUNsJ%3Anixos.org/%24AqWIeELQaHl8v5LeT6F4ZPGscJfBAJp1ZvL8vtDHrGY | 22:17:26 |
aidalgol | Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/nix/store/h3wgj19ywjsd5ha976k8kajmg1jz7lpw-python3.11-torch-2.3.0/lib/python3.11/site-packages/torch/cuda/__init__.py:184: UserWarning:
Found GPU0 NVIDIA GeForce RTX 3080 which is of cuda capability 8.6.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 8.9.
warnings.warn(
Files already downloaded and verified
Files already downloaded and verified
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0
1 | model | ViT | 86.3 M
-----------------------------------------------
86.3 M Trainable params
0 Non-trainable params
86.3 M Total params
345.317 Total estimated model params size (MB)
Sanity Checking: | | 0/? [00:00<?, ?it/s]
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
.nix-cuda-test-wrapped 9 <module>
sys.exit(main())
__main__.py 128 main
trainer.fit(
trainer.py 544 fit
call._call_and_handle_interrupt(
call.py 44 _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
trainer.py 580 _fit_impl
self._run(model, ckpt_path=ckpt_path)
trainer.py 987 _run
results = self._run_stage()
trainer.py 1031 _run_stage
self._run_sanity_check()
trainer.py 1060 _run_sanity_check
val_loop.run()
utilities.py 182 _decorator
return loop_run(self, *args, **kwargs)
evaluation_loop.py 110 run
self.setup_data()
evaluation_loop.py 192 setup_data
length = len(dl) if has_len_all_ranks(dl, trainer.strategy, allow_zero_length) else float("inf")
data.py 103 has_len_all_ranks
if total_length == 0:
RuntimeError:
CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
| 23:01:00 |
aidalgol | An RTX 3080 is too old?! | 23:01:15 |
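[Editor's note: the "no kernel image is available" error comes from CUDA's fat-binary model: binary (SASS) code runs only on devices with the same major compute capability and an equal-or-higher minor version, while embedded PTX can be JIT-compiled for any equal-or-newer capability. A build shipping only `sm_89` code therefore cannot run on an 8.6 device. A self-contained sketch of that rule, with illustrative helper names (not PyTorch internals):]

```python
def parse_arch(arch: str) -> tuple[int, int]:
    """Parse an arch string like 'sm_89' or 'compute_86' into (major, minor)."""
    digits = arch.split("_")[1]
    return int(digits[:-1]), int(digits[-1])

def has_kernel_image(device_capability: tuple[int, int], compiled_archs: list[str]) -> bool:
    """Approximate the runnability check: SASS ('sm_XY') needs the same major
    version and a device minor >= the compiled minor; PTX ('compute_XY') can be
    JIT-compiled on any device with capability >= the compiled one."""
    for arch in compiled_archs:
        kind = arch.split("_")[0]
        major, minor = parse_arch(arch)
        if kind == "sm" and major == device_capability[0] and minor <= device_capability[1]:
            return True
        if kind == "compute" and (major, minor) <= device_capability:
            return True
    return False

# An RTX 3080 is capability (8, 6); a build targeting only sm_89 won't run on it.
print(has_kernel_image((8, 6), ["sm_89"]))           # False
print(has_kernel_image((8, 6), ["sm_86", "sm_89"]))  # True
```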
SomeoneSerge (matrix works sometimes) | In reply to @aidalgol:matrix.org (quoting the traceback above) Did you build with cudaCapabilities = [ ... "8.6" ... ]? | 23:49:24 |
SomeoneSerge (matrix works sometimes) | * Oh... | 23:51:03 |
SomeoneSerge (matrix works sometimes) | (Did I just edit a message instead of sending a new one again?) | 23:51:45 |
aidalgol | I just cloned https://github.com/ConnorBaker/nix-cuda-test.git, applied the fix described earlier, then ran nix run .#nix-cuda-test | 23:52:36 |
SomeoneSerge (matrix works sometimes) | In reply to @aidalgol:matrix.org An RTX 3080 is too old?! So yeah there was a question about cudaCapabilities = [ ... "8.6" ] | 23:52:41 |
aidalgol | I saw it. 👍️ | 23:52:53 |
SomeoneSerge (matrix works sometimes) | In reply to @aidalgol:matrix.org I just cloned https://github.com/ConnorBaker/nix-cuda-test.git, applied the fix described earlier, then ran nix run .#nix-cuda-test Our culprit: https://github.com/ConnorBaker/nix-cuda-test/blob/182c2148e6df0932fe19f9cb7180173ee2f9cb2d/flake.nix#L66 | 23:53:08 |
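[Editor's note: the flake line in question pins which capabilities the whole closure is built for. A sketch of an 8.6-friendly nixpkgs configuration, using the attribute names nixpkgs' CUDA setup accepts (exact placement depends on how the flake imports nixpkgs):]

```nix
{
  nixpkgs.config = {
    allowUnfree = true;
    cudaSupport = true;
    # Include the device's own capability; otherwise the closure ships
    # kernels only for the listed architectures (e.g. only sm_89).
    cudaCapabilities = [ "8.6" ];
  };
}
```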
| 20 May 2024 |
connor (he/him) | Ah yeah sorry I usually just have it set to my GPU to speed up compile since I use it to test PRs | 00:22:23 |
aidalgol | Sooo... while that was building, KDE crashed to the display manager, and now the GPU usage is showing a non-zero value as I expected earlier. I have no idea what could have changed. | 01:12:24 |
connor (he/him) | Maybe it forced the driver to reload lol | 03:46:33 |
SomeoneSerge (matrix works sometimes) | connor (he/him) (UTC-5): Samuel Ainsworth: Madoura:
1. I'd like us to add a more generic alias to this room that would encompass a wider range of topics, rocm at the least. The reasoning is the same: I don't think we need a special room for cuda... Madoura: what do rocm-maintainers think about their presence on matrix?
2. I notice we never brought up the subject of onboarding new people anywhere in public, not even discourse. Wdyt should be done about that? | 14:20:05 |
trexd | Could call the room nixos-gpu since that covers rocm too. | 14:21:23 |
SomeoneSerge (matrix works sometimes) | In reply to @trexd:matrix.org Could call the room nixos-gpu since that covers rocm too. Yes, gpu/coprocessors/accelerators/scicomp/ai even/anything in that direction. Well, there already are nixos hpc and nixos data science, but I don't see much conversation there; what has to be changed to spark conversations? There's activity in matthewcroughan's flake room, and sometimes here, but not in nixos ds. | 14:34:08 |
connor (he/him) | To tackle diamond dependencies (among other things), I started making https://github.com/ConnorBaker/dep-tracker
Specify a flake attribute for a package and it’ll grab a copy of all the attributes on the package containing a list of dependencies (the attributes it looks for are here https://github.com/ConnorBaker/dep-tracker/blob/cd8e927c561f3f1ed5c904609654c946d85cf954/packages/dep-tracker/dep_tracker/types.py#L15). It’ll look through those arrays and populate a SQLite database with libraries it finds in those dependencies.
Now, a question: besides recursing and doing the same for every dependency I find (that is, harvesting attributes and updating the database), is there an easier way to get the closure of dependencies without building the package? IIRC nix path-info requires the package be built.
A different question: with that hardcoded list of attributes I inspect, is it possible I’d miss dependencies (and therefore libraries) which are present in the closure?
@someoneserge you have good ideas about finding dependencies — any suggestions? Currently finding what a dependency provides is limited to listing the names of libraries present under lib (https://github.com/ConnorBaker/dep-tracker/blob/cd8e927c561f3f1ed5c904609654c946d85cf954/packages/dep-tracker/dep_tracker/deps.py#L42) and finding out what a library needs is accomplished through patchelf (https://github.com/ConnorBaker/dep-tracker/blob/cd8e927c561f3f1ed5c904609654c946d85cf954/packages/dep-tracker/dep_tracker/deps.py#L56) | 17:57:11 |
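[Editor's note: on getting the closure without building: `nix path-info` does need realized outputs, but the derivation graph itself exists as soon as evaluation finishes, and `nix-store --query --requisites` on the `.drv` path walks the build-time closure without building anything. A sketch, with the flake attribute path being illustrative:]

```shell
# Evaluate only; nothing is built.
drv=$(nix eval --raw .#nix-cuda-test.drvPath)

# Walk the build-time closure of the .drv: every dependency's
# derivation and fixed-output source, recursively.
nix-store --query --requisites "$drv"
```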
tpw_rules | is there some reason we've got an out of date tensorflow build from source? | 18:07:41 |
tpw_rules | oh maybe the bin has cuda support now? | 18:08:17 |
SomeoneSerge (matrix works sometimes) |
IIRC nix path-info requires the package be built.
For deciding which dependencies to retain a runtime reference to | 18:11:59 |
SomeoneSerge (matrix works sometimes) | Have you seen https://fzakaria.com/2023/09/11/quick-insights-using-sqlelf.html? | 18:17:01 |
Gaétan Lepage | In reply to @tpw_rules:matrix.org is there some reason we've got an out of date tensorflow build from source? Because a >200 IQ is necessary to grasp this derivation 😅 | 18:19:34 |
trexd | In reply to @glepage:matrix.org Because a >200 IQ is necessary to grasp this derivation 😅 What's the TLDR on why tensorflow is so difficult to package if you don't mind me asking? Maybe this is another example of "packaging is a hard problem" that I can add to my Nix pitch slides. | 18:27:20 |
Gaétan Lepage | Well, the [tensorflow derivation](https://github.com/NixOS/nixpkgs/blob/master/pkgs/development/python-modules/tensorflow/default.nix) is ~600 lines of hacking around the bazel build system, + doing a bunch of hacks to inject our own dependencies + CUDA stuff...
All of this requires a lot of expertise (that I personally lack).
It is surely one of the hardest packages that I am aware of in the python package set. | 18:36:24 |
Gaétan Lepage | Now, if you want to take this challenge and update tensorflow, please go ahead.
For context, there is a stale PR for updating tensorflow to 2.14: https://github.com/NixOS/nixpkgs/pull/272838 | 18:38:02 |
trexd | In reply to @glepage:matrix.org Now, if you want to take this challenge and update tensorflow, please go ahead.
For context, there is a stale PR for updating tensorflow to 2.14: https://github.com/NixOS/nixpkgs/pull/272838 I'm already full up with packaging hasktorch haha! Yeah I've had a look at the derivation before and it seems nuts. | 18:49:58 |
aidalgol | In reply to @connorbaker:matrix.org Maybe it forced the driver to reload lol I think that's what it is, yeah. This seems to break upon resuming the machine from suspend. | 20:06:08 |
connor (he/him) | In reply to @ss:someonex.net Have you seen https://fzakaria.com/2023/09/11/quick-insights-using-sqlelf.html? I have! I saw Farid give a presentation on it at NixCon NA and that was neat; but it’s not packaged in Nixpkgs and I don’t want to do it :/ | 20:58:45 |