!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

290 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda

57 Servers



13 Jun 2024
@aidalgol:matrix.orgaidalgol I'm not convinced that having a single fixed version of libXNVCtrl is worth the trouble, but I do want to move it to the top level. 20:00:03
15 Jun 2024
@shekhinah:she.khinah.xyzshekhinah set their display name to shekhinah. 08:46:32
@matthewcroughan:defenestrate.itmatthewcroughan Did you guys know that python311Packages.tensorrt is broken because someone updated pkgs/development/cuda-modules/tensorrt/releases.nix without checking that it broke any derivations? 15:14:46
@matthewcroughan:defenestrate.itmatthewcroughan
[astraluser@edward:~/Downloads/f/TensorRT-8.6.1.6]$ ls python/tensorrt-8.6.1-cp3
tensorrt-8.6.1-cp310-none-linux_x86_64.whl  tensorrt-8.6.1-cp36-none-linux_x86_64.whl   tensorrt-8.6.1-cp38-none-linux_x86_64.whl   
tensorrt-8.6.1-cp311-none-linux_x86_64.whl  tensorrt-8.6.1-cp37-none-linux_x86_64.whl   tensorrt-8.6.1-cp39-none-linux_x86_64.whl
15:15:11
@matthewcroughan:defenestrate.itmatthewcroughan
python3.11-tensorrt> /nix/store/d3dzfy4amjl826fb8j00qp1d9887h7hm-stdenv-linux/setup: line 131: pop_var_context: head of shell_variables not a function context
error: builder for '/nix/store/8pw2fjq86vbkdd6s1bl6axfkhbnm18lr-python3.11-tensorrt-8.6.1.6.drv' failed with exit code 2;
       last 10 log lines:
       > Using pythonImportsCheckPhase
       > Sourcing python-namespaces-hook
       > Sourcing python-catch-conflicts-hook.sh
       > Sourcing auto-add-driver-runpath-hook
       > Using autoAddDriverRunpath
       > Sourcing fix-elf-files.sh
       > Running phase: unpackPhase
       > tar: TensorRT-8.6.1.6/python/tensorrt-8.6.1.6-cp311-none-linux_x86_64.whl: Not found in archive
       > tar: Exiting with failure status due to previous errors
       > /nix/store/d3dzfy4amjl826fb8j00qp1d9887h7hm-stdenv-linux/setup: line 131: pop_var_context: head of shell_variables not a function context
       For full logs, run 'nix log /nix/store/8pw2fjq86vbkdd6s1bl6axfkhbnm18lr-python3.11-tensorrt-8.6.1.6.drv'.
15:15:34
@matthewcroughan:defenestrate.itmatthewcroughan they removed the .6 from the release 15:15:47
@matthewcroughan:defenestrate.itmatthewcroughan TensorRT-8.6.1.6/python/tensorrt-8.6.1.6-cp311-none-linux_x86_64.whl is wrong
TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp311-none-linux_x86_64.whl is correct
15:16:11
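For context, the mismatch being described is that the tarball directory carries the full four-component version (8.6.1.6) while the bundled wheels only carry the first three components (8.6.1), so a derivation that builds the wheel path directly from the release version asks tar for a file that does not exist. A minimal Nix sketch of deriving the wheel path under that assumption (hypothetical attribute names, not the actual nixpkgs expression):

{ lib, python, version ? "8.6.1.6" }:

let
  # "8.6.1.6" -> "8.6.1": the wheel filename drops the fourth component
  wheelVersion = lib.concatStringsSep "." (lib.take 3 (lib.splitString "." version));
  # "3.11" -> "cp311"
  pyTag = "cp" + lib.replaceStrings [ "." ] [ "" ] python.pythonVersion;
in
{
  # TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp311-none-linux_x86_64.whl
  wheelPath = "TensorRT-${version}/python/tensorrt-${wheelVersion}-${pyTag}-none-linux_x86_64.whl";
}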
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @matthewcroughan:defenestrate.it
Did you guys know that python311Packages.tensorrt is broken because someone updated pkgs/development/cuda-modules/tensorrt/releases.nix without checking that it broke any derivations?
Nvidia prevents unattended downloads, of course it broke
16:08:17
@matthewcroughan:defenestrate.itmatthewcroughan God we need archive-org-pkgs 16:22:55
@keiichi:matrix.orgteto
In reply to @connorbaker:matrix.org
What revision of nixpkgs are you on? master fails to build (go-stable-diffusion errors during CMake configure)
right, sry, I had disabled diffusion in an overlay. I've checked that it works on master now (following the local-ai 2.16 bump today). I've opened https://github.com/NixOS/nixpkgs/issues/320145 to help myself collect the info
22:26:22
17 Jun 2024
@grw00:matrix.orggrw00 joined the room.12:25:16
@grw00:matrix.orggrw00

hey all, has anyone had success using cuda libraries inside a docker container built with nix? i don't mean running a cuda container on nixos host but the opposite, running a nix container containing cuda program on another host
i build a container with nix and pytorch etc. and run it on runpod, but it doesn't see the nvidia drivers/device, so i guess i am missing something. currently i have:

        dockerImages.default = pkgs.dockerTools.streamLayeredImage {
          name = "ghcr.io/my-image";
          tag = "latest";

          contents = [
            pkgs.bash
            pkgs.uutils-coreutils-noprefix
            pkgs.cacert
            pkgs.libnvidia-container

            pythonEnv
          ];

          config = {
            Cmd = [ "${pkgs.bash}/bin/bash" ];
            Env = [
              "CUDA_PATH=${pkgs.cudatoolkit}"
              "LD_LIBRARY_PATH=${pkgs.linuxPackages_5_4.nvidia_x11}/lib"
            ];
          };
        };
12:30:49
@ss:someonex.netSomeoneSerge (back on matrix) grw00: are you using CDI or the runtime wrappers? Either way you need to have the drivers exposed in ld_library_path or mounted under /run/opengl-driver/lib 12:31:13
@grw00:matrix.orggrw00 not sure what CDI is, i understand i need the /run/opengl-driver but i'm not sure how to achieve that in docker container 12:32:09
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @grw00:matrix.org

hey all, has anyone had success using cuda libraries inside a docker container built with nix? […]
Hard coding linuxPackages in the image is a bad idea. With CUDA you normally don't want drivers in the image; you want the host's drivers mounted in the container
12:33:17
@ss:someonex.netSomeoneSerge (back on matrix) No need for libnvidia-container in the image either, I think 12:34:17
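Putting those two points together, a minimal sketch of the slimmed-down image (same hypothetical flake context as grw00's snippet above, with pythonEnv defined elsewhere); the driver libraries are deliberately left out and are expected to come from the host at runtime:

        dockerImages.default = pkgs.dockerTools.streamLayeredImage {
          name = "ghcr.io/my-image";
          tag = "latest";

          # Only the application closure; no nvidia_x11 and no libnvidia-container.
          # libcuda.so.1 and friends are mounted in by the host's container runtime.
          contents = [
            pkgs.bash
            pkgs.uutils-coreutils-noprefix
            pkgs.cacert
            pythonEnv
          ];

          config = {
            Cmd = [ "${pkgs.bash}/bin/bash" ];
            # nixpkgs' CUDA packages already look under /run/opengl-driver/lib via
            # their RUNPATH; exporting it here as well mainly helps any non-Nix
            # binaries that end up in the image.
            Env = [ "LD_LIBRARY_PATH=/run/opengl-driver/lib" ];
          };
        };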
@grw00:matrix.orggrw00
In reply to @ss:someonex.net
Hard coding linuxPackages in the image is a bad idea. With cuda you normally don't want drivers in the image, you want the host's drivers mounted in the containet
ah kk, got it. i'm specifically trying to use this on runpod.io, i don't think they offer this as a possibility. it seems like the images they offer all have cuda installed in image
12:35:10
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @grw00:matrix.org
not sure what CDI is, i understand i need the /run/opengl-driver but i'm not sure how to achieve that in docker container
CDI is the new thing where you can specify where to mount things in the containers in a json file
12:36:28
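As a rough illustration of the CDI flow on a host one controls (probably not applicable to runpod's managed hosts; the commands come from the NVIDIA Container Toolkit and the exact flags and paths should be checked against its documentation):

# Generate a CDI spec describing the host's driver files and device nodes
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Ask a CDI-aware runtime to inject those devices into the container
# (podman accepts the same --device nvidia.com/gpu=all syntax)
docker run --rm --device nvidia.com/gpu=all ghcr.io/my-image nvidia-smi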
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @grw00:matrix.org
ah kk, got it. i'm specifically trying to use this on runpod.io, i don't think they offer this as a possibility. it seems like the images they offer all have cuda installed in image
They have to have a driver on the host, it's separate from the cuda toolkit
12:37:40
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @grw00:matrix.org
not sure what CDI is, i understand i need the /run/opengl-driver but i'm not sure how to achieve that in docker container
Can you bind mount it using CLI flags maybe?
12:38:09
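A crude sketch of what that bind mount could look like; the host-side library path is a guess for a typical FHS/Ubuntu host and the device-node list is the usual minimum, both of which would need checking on the actual machine:

docker run --rm \
  --device /dev/nvidiactl --device /dev/nvidia0 --device /dev/nvidia-uvm \
  -v /usr/lib/x86_64-linux-gnu:/run/opengl-driver/lib:ro \
  ghcr.io/my-image python -c 'import torch; print(torch.cuda.is_available())'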
@ss:someonex.netSomeoneSerge (back on matrix) Bottom line is: this is not about the image, it's about the host configuration 12:38:39
@grw00:matrix.orggrw00 ok good info thx. i will check running one of the containers they offer (that does work) and see if there are any external mounts for cuda drivers, i think not though 12:40:49
@grw00:matrix.orggrw00
❯ ssh 6r0gwnq7twsots-644110b1@ssh.runpod.io

-- RUNPOD.IO --
Enjoy your Pod #6r0gwnq7twsots ^_^

bash-5.2# nvidia-smi
bash: /usr/bin/nvidia-smi: cannot execute: required file not found
12:41:31
@ss:someonex.netSomeoneSerge (back on matrix) There's one thing you could do at the image level: anticipating that the host configuration assumes FHS (= is broken and not cross-platform), you could wrap your entrypoint with numtide/nixglhost, which will separate the meat from the flies and put libcuda (probably mounted in /usr/lib) in LD_LIBRARY_PATH without any extra breakages 12:43:33
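A sketch of that wrapping, assuming the nixglhost executable from numtide's nix-gl-host project is in scope as a package called nix-gl-host; the exact package name and invocation are assumptions to be checked against the project's README:

          config = {
            # nixglhost locates the host's NVIDIA userspace driver in the usual
            # FHS locations, isolates it in a private directory and exports it
            # via LD_LIBRARY_PATH before exec'ing the wrapped program.
            Cmd = [ "${nix-gl-host}/bin/nixglhost" "${pythonEnv}/bin/python" ];
          };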
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @grw00:matrix.org
❯ ssh 6r0gwnq7twsots-644110b1@ssh.runpod.io

-- RUNPOD.IO --
Enjoy your Pod #6r0gwnq7twsots ^_^

bash-5.2# nvidia-smi
bash: /usr/bin/nvidia-smi: cannot execute: required file not found
Is this nvidia-smi from your hard coded linuxPackages?
12:44:00
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @grw00:matrix.org
ok good info thx. i will check running one of the containers they offer (that does work) and see if there are any external mounts for cuda drivers, i think not though
When you specify --gpus=all or the equivalent cdi thing it mounts extra stuff
12:44:34
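For reference, the legacy (pre-CDI) path looks roughly like this, assuming the host has the NVIDIA Container Toolkit configured as Docker's runtime:

# --gpus all makes the nvidia runtime mount libcuda, nvidia-smi,
# the /dev/nvidia* device nodes, etc. into the container
docker run --rm --gpus all ghcr.io/my-image nvidia-smi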
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @ss:someonex.net
Is this nvidia-smi from your hard coded linuxPackages?
Or is it the one mounted from the host, expecting that there would be a /lib/ld-linux*.so?
12:45:51
@grw00:matrix.orggrw00 yes it is 12:46:43
@grw00:matrix.orggrw00
bash-5.2# df
Filesystem     1K-blocks     Used Available Use% Mounted on
overlay         10485760    64056  10421704   1% /
tmpfs              65536        0     65536   0% /dev
tmpfs          132014436        0 132014436   0% /sys/fs/cgroup
shm             15728640        0  15728640   0% /dev/shm
/dev/nvme0n1p2  65478188 24385240  37721124  40% /sbin/docker-init
/dev/nvme0n1p4  52428800        0  52428800   0% /cache
udev           131923756        0 131923756   0% /dev/null
udev           131923756        0 131923756   0% /dev/tty
tmpfs          132014436       12 132014424   1% /proc/driver/nvidia
tmpfs          132014436        4 132014432   1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs           26402888    18148  26384740   1% /run/nvidia-persistenced/socket
tmpfs          132014436        0 132014436   0% /proc/asound
tmpfs          132014436        0 132014436   0% /proc/acpi
tmpfs          132014436        0 132014436   0% /proc/scsi
tmpfs          132014436        0 132014436   0% /sys/firmware
12:46:52
@grw00:matrix.orggrw00

ok, checked their ubuntu-based image and mounts look like this:

root@cc04a766e493:~# df
Filesystem     1K-blocks     Used Available Use% Mounted on
overlay         20971520    64224  20907296   1% /
tmpfs              65536        0     65536   0% /dev
tmpfs          132014448        0 132014448   0% /sys/fs/cgroup
shm             15728640        0  15728640   0% /dev/shm
/dev/nvme0n1p2  65478188 18995924  43110440  31% /usr/bin/nvidia-smi
/dev/nvme0n1p4  20971520        0  20971520   0% /workspace
tmpfs          132014448       12 132014436   1% /proc/driver/nvidia
tmpfs          132014448        4 132014444   1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs           26402892     8832  26394060   1% /run/nvidia-persistenced/socket
tmpfs          132014448        0 132014448   0% /proc/asound
tmpfs          132014448        0 132014448   0% /proc/acpi
tmpfs          132014448        0 132014448   0% /proc/scsi
tmpfs          132014448        0 132014448   0% /sys/firmware
12:50:05


