| 16 Sep 2022 |
SomeoneSerge (back on matrix) | cudatoolkit.{out,lib} brings in a lot (4-5 GiB) of baggage; if you'd like to get rid of it later, maybe you could start with https://github.com/NixOS/nixpkgs/blob/befe56a1ee1d383fafaf9db41e3f4fc506578da1/pkgs/development/python-modules/pytorch/default.nix#L57 | 21:14:47 |
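(The linked line appears to be where the torch build pulls in both cudatoolkit outputs; here is a minimal sketch of a symlinkJoin over those outputs, assuming that is roughly what the expression does. The derivation name below is illustrative, not the one nixpkgs uses.)
{ symlinkJoin, cudatoolkit }:

# Sketch only: join the two cudatoolkit outputs so the build sees a single prefix.
# Trimming the closure would mean narrowing what ends up in `paths` here.
symlinkJoin {
  name = "${cudatoolkit.name}-merged";  # illustrative name
  paths = [ cudatoolkit.out cudatoolkit.lib ];
}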
| 17 Sep 2022 |
aidalgol | In a flake shell with config.allowUnfree = true; and config.cudaSupport = true;, the python torch module is throwing an unknown CUDA error. Is there something more I need to do to get the package's CUDA support enabled?
File "/nix/store/bf48f3zny7q08lg4hc4279fn3jw1lkpl-python3-3.10.6-env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
| 05:54:58 |
tpw_rules | are you running on nixos? | 05:55:22 |
aidalgol | Yes, sorry, this is on NixOS. | 05:55:35 |
tpw_rules | and you have the nvidia drivers set up and nvidia-smi works and stuff? | 05:57:16 |
aidalgol | Yep, nvidia-smi output still looks good. | 05:58:14 |
tpw_rules | is this the torch-bin module or did you compile it yourself? | 05:58:52 |
aidalgol | I was referencing torch, not torch-bin. Should I try that one? | 06:00:26 |
aidalgol | I'm also using the cuda-maintainers cachix cache, if that makes a difference. | 06:01:19 |
tpw_rules | you can try torch-bin, it's precompiled by upstream with cuda support. | 06:01:41 |
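(A minimal shell.nix sketch for trying the prebuilt wheel, assuming the Python attribute is torch-bin as named above; allowUnfree is still needed because the upstream binaries are unfree.)
{ pkgs ? import <nixpkgs> { config.allowUnfree = true; } }:

pkgs.mkShell {
  packages = with pkgs; [
    # torch-bin is the upstream-built wheel, which already ships with CUDA,
    # so it should not need config.cudaSupport to be set.
    (python3.withPackages (ps: [ ps.torch-bin ]))
  ];
}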
tpw_rules | do you have an excessively recent nvidia card? | 06:02:01 |
aidalgol | RTX3080 | 06:02:07 |
aidalgol | Driver Version: 515.48.07 CUDA Version: 11.7
| 06:02:23 |
tpw_rules | give torch-bin a try, it doesn't sound like you're doing anything wrong with regular torch though, something might be broken with cuda 11.7 or so | 06:02:54 |
tpw_rules | actually i think nixpkgs only has cuda 11.6, so that shouldn't even be it. i reviewed the pr and tested it, i thought... | 06:03:26 |
aidalgol | Some days it feels like GPU programming has invented a new kind of dependency hell. | 06:03:49 |
tpw_rules | what nixpkgs commit are you on | 06:04:41 |
tpw_rules | and are you trying to run any particular code | 06:05:15 |
tpw_rules | i might be able to debug next week. i have a 3060Ti at work | 06:06:18 |
aidalgol | I'm trying to run this script for some video upscaling I'm doing with VapourSynth and arcane plugins.
https://github.com/styler00dollar/VSGAN-tensorrt-docker/blob/main/convert_esrgan_to_onnx.py | 06:06:50 |
aidalgol | (Modifying that script to use a different input file) | 06:07:21 |
tpw_rules | maybe tensorrt is the problem. don't think that's in nixpkgs | 06:08:03 |
tpw_rules | anyway it is excessively my bedtime. good luck | 06:08:16 |
aidalgol | Welp, that made no difference. | 06:37:31 |
aidalgol | With this shell.nix,
{ pkgs ? import <nixpkgs> {
    config.allowUnfree = true;
    config.cudaSupport = true;
  } }:
pkgs.mkShell {
  packages = with pkgs; [
    (python3.withPackages (ps: [
      ps.torch
    ]))
  ];
}
Just a basic "is CUDA available" check fails.
$ nix-shell --run 'python'
Python 3.10.6 (main, Aug 1 2022, 20:38:21) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> assert torch.cuda.is_available()
/nix/store/bf48f3zny7q08lg4hc4279fn3jw1lkpl-python3-3.10.6-env/lib/python3.10/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /build/source/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
| 06:41:20 |
aidalgol | Uh, false alarm I guess, because after a NixOS update and reboot, the assert passes. | 09:09:37 |
tpw_rules | the ways of cuda are strange. glad you got it working. maybe you updated your kernel recently and needed to reboot | 13:44:11 |
SomeoneSerge (back on matrix) | (too late, but I'll still chime in with a comment on how I understand the landscape so far)
That shell.nix is all that's needed from nixpkgs. The other requirements are imposed on the running system, and probably amount to having /run/opengl-driver/lib/libcuda.so present and the right kernel module loaded. Both are deployed on NixOS when hardware.opengl.enable = true and the driver is nvidia.
| 14:19:31 |
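(A minimal configuration.nix sketch of the two pieces described above, with option names as they were around NixOS 22.05.)
{ config, pkgs, ... }:

{
  # Populates /run/opengl-driver/lib, where libcuda.so is looked up at runtime.
  hardware.opengl.enable = true;

  # Selects the proprietary NVIDIA driver and loads its kernel module.
  services.xserver.videoDrivers = [ "nvidia" ];
}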
aidalgol | In reply to @ss:someonex.net: I did not have hardware.opengl.enable = true; in my system config, so I'm not sure how OpenGL ever worked on my system. It's there now, though. 👍️ | 18:39:18 |