| 13 Jun 2024 |
aidalgol | In reply to @ss:someonex.net
I suppose we need some kind of a fixpoint 🤷
Btw, I just had a look at the mangohud derivation, and
- we're still doing inherit (linuxPackages.nvidia_x11) libXNVCtrl, which I think is bad form (referencing a concrete version of linuxPackages from nixpkgs)
- I still think it belongs in the top level, maybe as an attrset, libXNVCtrlVersions
- probably a very bad idea, but the nvidia nixos module could add an overlay setting the respective default version of libXNVCtrl
- with all packages taking libXNVCtrl from the top level, we'd ensure only one version is in use in any given closure (probably at the cost of rebuilding reverse dependencies)
- we're passing a concrete python package (mako) rather than taking it from python3Packages
Honestly, not sure if this is worth the effort 🙃 I did make an attempt at moving it to the top level, but decided to do that in a separate PR. I need to come back to it. I'm not entirely sure how best to approach it so that nvidia_x11 can still override it. | 19:59:27 |
aidalgol | I'm not convinced that having a single fixed version of libXNVCtrl is worth the trouble, but I do want to move it to the top level. | 20:00:03 |
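To make the top-level idea concrete, here is a rough sketch (mine, not from the thread) of what an all-packages.nix entry could look like; the attribute names, source path, and driver versions are illustrative assumptions, not the real nixpkgs code:

# Illustrative sketch only; not the actual nixpkgs attributes.
libXNVCtrlVersions = {
  # each version built standalone from the nvidia-settings sources,
  # with no reference to any concrete linuxPackages set
  "535" = callPackage ../development/libraries/libXNVCtrl { version = "535.154.05"; };
  "550" = callPackage ../development/libraries/libXNVCtrl { version = "550.78"; };
};

# the single default that consumers such as mangohud would inherit from the
# top level instead of from linuxPackages.nvidia_x11
libXNVCtrl = libXNVCtrlVersions."550";

# the "probably a very bad idea" variant: the nvidia NixOS module could pin
# the default to the host's driver generation via an overlay, e.g.
#   nixpkgs.overlays = [ (final: prev: { libXNVCtrl = prev.libXNVCtrlVersions."535"; }) ];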
| 15 Jun 2024 |
| shekhinah set their display name to shekhinah. | 08:46:32 |
matthewcroughan | Did you guys know that python311Packages.tensorrt is broken because someone updated pkgs/development/cuda-modules/tensorrt/releases.nix without checking whether it broke any derivations? | 15:14:46 |
matthewcroughan | [astraluser@edward:~/Downloads/f/TensorRT-8.6.1.6]$ ls python/tensorrt-8.6.1-cp3
tensorrt-8.6.1-cp310-none-linux_x86_64.whl tensorrt-8.6.1-cp36-none-linux_x86_64.whl tensorrt-8.6.1-cp38-none-linux_x86_64.whl
tensorrt-8.6.1-cp311-none-linux_x86_64.whl tensorrt-8.6.1-cp37-none-linux_x86_64.whl tensorrt-8.6.1-cp39-none-linux_x86_64.whl
| 15:15:11 |
matthewcroughan | python3.11-tensorrt> /nix/store/d3dzfy4amjl826fb8j00qp1d9887h7hm-stdenv-linux/setup: line 131: pop_var_context: head of shell_variables not a function context
error: builder for '/nix/store/8pw2fjq86vbkdd6s1bl6axfkhbnm18lr-python3.11-tensorrt-8.6.1.6.drv' failed with exit code 2;
last 10 log lines:
> Using pythonImportsCheckPhase
> Sourcing python-namespaces-hook
> Sourcing python-catch-conflicts-hook.sh
> Sourcing auto-add-driver-runpath-hook
> Using autoAddDriverRunpath
> Sourcing fix-elf-files.sh
> Running phase: unpackPhase
> tar: TensorRT-8.6.1.6/python/tensorrt-8.6.1.6-cp311-none-linux_x86_64.whl: Not found in archive
> tar: Exiting with failure status due to previous errors
> /nix/store/d3dzfy4amjl826fb8j00qp1d9887h7hm-stdenv-linux/setup: line 131: pop_var_context: head of shell_variables not a function context
For full logs, run 'nix log /nix/store/8pw2fjq86vbkdd6s1bl6axfkhbnm18lr-python3.11-tensorrt-8.6.1.6.drv'.
| 15:15:34 |
matthewcroughan | they dropped the trailing .6 from the wheel filenames in the release | 15:15:47 |
matthewcroughan | TensorRT-8.6.1.6/python/tensorrt-8.6.1.6-cp311-none-linux_x86_64.whl is wrong
TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp311-none-linux_x86_64.whl is correct | 15:16:11 |
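For illustration only (this is not the actual cuda-modules/tensorrt code), one way to avoid that mismatch is to derive the wheel filename from the first three components of the release version instead of interpolating the full 8.6.1.6 string:

# Sketch: four-component tarball version, three-component wheel version.
let
  inherit (lib.versions) major minor patch;
  version = "8.6.1.6";                                                   # release/tarball version
  wheelVersion = "${major version}.${minor version}.${patch version}";   # "8.6.1"
  pythonTag = "cp311";                                                   # illustrative
in
"TensorRT-${version}/python/tensorrt-${wheelVersion}-${pythonTag}-none-linux_x86_64.whl"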
SomeoneSerge (back on matrix) | In reply to @matthewcroughan:defenestrate.it Did you guys know that python311Packages.tensorrt is broken because someone updated pkgs/development/cuda-modules/tensorrt/releases.nix without checking that it broke any derivations? Nvidia prevents unattended downloads, of course it broke | 16:08:17 |
matthewcroughan | God we need archive-org-pkgs | 16:22:55 |
teto | In reply to @connorbaker:matrix.org What revision of nixpkgs are you on? master fails to build (go-stable-diffusion errors during CMake configure) right, sorry, I had disabled diffusion in an overlay. I've checked that it works on master now (following the local-ai 2.16 bump today). I've opened https://github.com/NixOS/nixpkgs/issues/320145 to help myself collect the info | 22:26:22 |
| 17 Jun 2024 |
| grw00 joined the room. | 12:25:16 |
grw00 | hey all, has anyone had success using cuda libraries inside a docker container built with nix? i don't mean running a cuda container on a nixos host, but the opposite: running a nix-built container with a cuda program on another host. i build a container with nix, pytorch etc. and run it on runpod, but it doesn't see the nvidia drivers/device, so i guess i am missing something. currently i have:
dockerImages.default = pkgs.dockerTools.streamLayeredImage {
name = "ghcr.io/my-image";
tag = "latest";
contents = [
pkgs.bash
pkgs.uutils-coreutils-noprefix
pkgs.cacert
pkgs.libnvidia-container
pythonEnv
];
config = {
Cmd = [ "${pkgs.bash}/bin/bash" ];
Env = [
"CUDA_PATH=${pkgs.cudatoolkit}"
"LD_LIBRARY_PATH=${pkgs.linuxPackages_5_4.nvidia_x11}/lib"
];
};
};
| 12:30:49 |
SomeoneSerge (back on matrix) | grw00: are you using CDI or the runtime wrappers? Either way you need to have the drivers exposed in ld_library_path or mounted under /run/opengl-driver/lib | 12:31:13 |
grw00 | not sure what CDI is, i understand i need the /run/opengl-driver but i'm not sure how to achieve that in docker container | 12:32:09 |
SomeoneSerge (back on matrix) | In reply to @grw00:matrix.org
hey all, has anyone had success using cuda libraries inside a docker container built with nix? […]
Hardcoding linuxPackages in the image is a bad idea. With cuda you normally don't want drivers in the image, you want the host's drivers mounted in the container | 12:33:17 |
SomeoneSerge (back on matrix) | No need for libnvidia-container in the image either, I think | 12:34:17 |
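Applying that advice to the image above gives roughly the following (an untested sketch: it assumes the runtime or host mounts the driver under /run/opengl-driver/lib, as a NixOS-style CDI setup would; for an FHS host see the nixglhost suggestion further down):

dockerImages.default = pkgs.dockerTools.streamLayeredImage {
  name = "ghcr.io/my-image";
  tag = "latest";
  contents = [
    pkgs.bash
    pkgs.uutils-coreutils-noprefix
    pkgs.cacert
    pythonEnv
    # no nvidia_x11 and no libnvidia-container: the driver comes from the host
  ];
  config = {
    Cmd = [ "${pkgs.bash}/bin/bash" ];
    Env = [
      "CUDA_PATH=${pkgs.cudatoolkit}"
      # where the host is expected to expose libcuda.so and friends at run time
      "LD_LIBRARY_PATH=/run/opengl-driver/lib"
    ];
  };
};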
grw00 | In reply to @ss:someonex.net Hardcoding linuxPackages in the image is a bad idea. With cuda you normally don't want drivers in the image, you want the host's drivers mounted in the container ah kk, got it. i'm specifically trying to use this on runpod.io, i don't think they offer this as a possibility. it seems like the images they offer all have cuda installed in the image | 12:35:10 |
SomeoneSerge (back on matrix) | In reply to @grw00:matrix.org not sure what CDI is, i understand i need the /run/opengl-driver but i'm not sure how to achieve that in docker container CDI is the new thing where you can specify where to mount things in the containers in a json file | 12:36:28 |
SomeoneSerge (back on matrix) | In reply to @grw00:matrix.org ah kk, got it. i'm specifically trying to use this on runpod.io, i don't think they offer this as a possibility. it seems like the images they offer all have cuda installed in image They have to have a driver on the host, it's separate from the cuda toolkit | 12:37:40 |
SomeoneSerge (back on matrix) | In reply to @grw00:matrix.org not sure what CDI is, i understand i need the /run/opengl-driver but i'm not sure how to achieve that in docker container Can you bind mount it using CLI flags maybe? | 12:38:09 |
SomeoneSerge (back on matrix) | Bottom line is: this is not about the image, it's about the host configuration | 12:38:39 |
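For reference, the host side of this on a NixOS machine would look roughly like the sketch below. This is hypothetical here (runpod hosts are not NixOS), and the option names are from memory and may differ between NixOS releases:

# NixOS host configuration sketch (assumption: 24.05-era option names)
{
  virtualisation.docker.enable = true;
  # generates a CDI spec for the installed NVIDIA driver so containers can
  # request the GPU, e.g. with `docker run --device nvidia.com/gpu=all ...`
  hardware.nvidia-container-toolkit.enable = true;
}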
grw00 | ok good info thx. i will check running one of the containers they offer (that does work) and see if there are any external mounts for cuda drivers, i think not though | 12:40:49 |
grw00 | ❯ ssh 6r0gwnq7twsots-644110b1@ssh.runpod.io
-- RUNPOD.IO --
Enjoy your Pod #6r0gwnq7twsots ^_^
bash-5.2# nvidia-smi
bash: /usr/bin/nvidia-smi: cannot execute: required file not found
| 12:41:31 |
SomeoneSerge (back on matrix) | There's one thing you could do at the image level: anticipating that the host configuration assumes FHS (i.e. is broken and not cross-platform), you could wrap your entrypoint with numtide/nixglhost, which will separate the wheat from the chaff and put libcuda (probably mounted somewhere under /usr/lib) on LD_LIBRARY_PATH without any extra breakage | 12:43:33 |
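A sketch of that wrapping, assuming nix-gl-host is pulled in from github:numtide/nix-gl-host and exposed here as nixglhost (the exact attribute depends on how you import it), with a hypothetical python command as the wrapped entrypoint:

config = {
  # nixglhost locates the host's FHS-installed libcuda/driver libraries and
  # puts them on LD_LIBRARY_PATH before running the wrapped program
  Cmd = [
    "${nixglhost}/bin/nixglhost"
    "${pythonEnv}/bin/python"
    "-c" "import torch; print(torch.cuda.is_available())"  # illustrative command
  ];
};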
SomeoneSerge (back on matrix) | In reply to @grw00:matrix.org
❯ ssh 6r0gwnq7twsots-644110b1@ssh.runpod.io
-- RUNPOD.IO --
Enjoy your Pod #6r0gwnq7twsots ^_^
bash-5.2# nvidia-smi
bash: /usr/bin/nvidia-smi: cannot execute: required file not found
Is this nvidia-smi from your hard coded linuxPackages? | 12:44:00 |
SomeoneSerge (back on matrix) | In reply to @grw00:matrix.org ok good info thx. i will check running one of the containers they offer (that does work) and see if there are any external mounts for cuda drivers, i think not though When you specify --gpus=all or the equivalent cdi thing it mounts extra stuff | 12:44:34 |
SomeoneSerge (back on matrix) | In reply to @ss:someonex.net Is this nvidia-smi from your hard coded linuxPackages? Or is it the one mounted from the host, expecting that there would be a /lib/ld-linux*.so? | 12:45:51 |
grw00 | yes it is | 12:46:43 |
grw00 | bash-5.2# df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 10485760 64056 10421704 1% /
tmpfs 65536 0 65536 0% /dev
tmpfs 132014436 0 132014436 0% /sys/fs/cgroup
shm 15728640 0 15728640 0% /dev/shm
/dev/nvme0n1p2 65478188 24385240 37721124 40% /sbin/docker-init
/dev/nvme0n1p4 52428800 0 52428800 0% /cache
udev 131923756 0 131923756 0% /dev/null
udev 131923756 0 131923756 0% /dev/tty
tmpfs 132014436 12 132014424 1% /proc/driver/nvidia
tmpfs 132014436 4 132014432 1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs 26402888 18148 26384740 1% /run/nvidia-persistenced/socket
tmpfs 132014436 0 132014436 0% /proc/asound
tmpfs 132014436 0 132014436 0% /proc/acpi
tmpfs 132014436 0 132014436 0% /proc/scsi
tmpfs 132014436 0 132014436 0% /sys/firmware
| 12:46:52 |