| 17 Jun 2024 |
SomeoneSerge (back on matrix) | In reply to @grw00:matrix.org
not sure what CDI is, i understand i need the /run/opengl-driver but i'm not sure how to achieve that in docker container
Can you bind mount it using CLI flags maybe? | 12:38:09 |
SomeoneSerge (back on matrix) | Bottom line is: this is not about the image, it's about the host configuration | 12:38:39 |
grw00 | ok good info thx. i will check running one of the containers they offer (that does work) and see if there are any external mounts for cuda drivers, i think not though | 12:40:49 |
grw00 | ❯ ssh 6r0gwnq7twsots-644110b1@ssh.runpod.io
-- RUNPOD.IO --
Enjoy your Pod #6r0gwnq7twsots ^_^
bash-5.2# nvidia-smi
bash: /usr/bin/nvidia-smi: cannot execute: required file not found
| 12:41:31 |
SomeoneSerge (back on matrix) | There's one thing you could do at the image level: anticipating that the host configuration assumes FHS (i.e. is broken and not cross-platform), you could wrap your entrypoint with numtide/nix-gl-host (nixglhost), which will separate the meat from the flies and put libcuda (probably mounted in /usr/lib) on LD_LIBRARY_PATH without any extra breakage | 12:43:33 |
SomeoneSerge (back on matrix) | In reply to @grw00:matrix.org
❯ ssh 6r0gwnq7twsots-644110b1@ssh.runpod.io
-- RUNPOD.IO --
Enjoy your Pod #6r0gwnq7twsots ^_^
bash-5.2# nvidia-smi
bash: /usr/bin/nvidia-smi: cannot execute: required file not found
Is this nvidia-smi from your hard coded linuxPackages? | 12:44:00 |
SomeoneSerge (back on matrix) | In reply to @grw00:matrix.org
ok good info thx. i will check running one of the containers they offer (that does work) and see if there are any external mounts for cuda drivers, i think not though
When you specify --gpus=all or the equivalent CDI thing it mounts extra stuff | 12:44:34 |
SomeoneSerge (back on matrix) | In reply to @ss:someonex.net
Is this nvidia-smi from your hard coded linuxPackages?
Or is it the one mounted from the host and expecting that there would be a /lib/ld-linux*.so? | 12:45:51 |
grw00 | yes it is | 12:46:43 |
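A note on the error above: bash typically prints "cannot execute: required file not found" when the binary itself exists but the ELF interpreter it names does not, which would fit a host-provided nvidia-smi expecting an FHS loader path inside a Nix-built image. The following is a minimal sketch, not a tool mentioned in this conversation, that reads the PT_INTERP entry of an ELF file so the requested loader can be checked. The function names `parse_interp`, `elf_interpreter`, and `interpreter_missing` are hypothetical.

```python
import os
import struct

PT_INTERP = 3  # program header type that holds the interpreter path


def parse_interp(data):
    """Return the interpreter path named in ELF bytes, or None if absent."""
    if data[:4] != b"\x7fELF":
        return None
    is_64 = data[4] == 2                    # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little endian
    if is_64:
        (e_phoff,) = struct.unpack_from(endian + "Q", data, 0x20)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x36)
    else:
        (e_phoff,) = struct.unpack_from(endian + "I", data, 0x1C)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x2A)
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        (p_type,) = struct.unpack_from(endian + "I", data, off)
        if p_type != PT_INTERP:
            continue
        if is_64:
            (p_offset,) = struct.unpack_from(endian + "Q", data, off + 8)
            (p_filesz,) = struct.unpack_from(endian + "Q", data, off + 32)
        else:
            (p_offset,) = struct.unpack_from(endian + "I", data, off + 4)
            (p_filesz,) = struct.unpack_from(endian + "I", data, off + 16)
        raw = data[p_offset:p_offset + p_filesz]
        return raw.rstrip(b"\0").decode()
    return None


def elf_interpreter(path):
    """Read a file from disk and return its requested ELF interpreter."""
    with open(path, "rb") as f:
        return parse_interp(f.read())


def interpreter_missing(path):
    """True if the file names an interpreter that does not exist here."""
    interp = elf_interpreter(path)
    return interp is not None and not os.path.exists(interp)
```

Inside the container, `elf_interpreter("/usr/bin/nvidia-smi")` would show which loader the binary asks for, and `interpreter_missing` would tell whether that path is absent from the image.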
grw00 | bash-5.2# df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 10485760 64056 10421704 1% /
tmpfs 65536 0 65536 0% /dev
tmpfs 132014436 0 132014436 0% /sys/fs/cgroup
shm 15728640 0 15728640 0% /dev/shm
/dev/nvme0n1p2 65478188 24385240 37721124 40% /sbin/docker-init
/dev/nvme0n1p4 52428800 0 52428800 0% /cache
udev 131923756 0 131923756 0% /dev/null
udev 131923756 0 131923756 0% /dev/tty
tmpfs 132014436 12 132014424 1% /proc/driver/nvidia
tmpfs 132014436 4 132014432 1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs 26402888 18148 26384740 1% /run/nvidia-persistenced/socket
tmpfs 132014436 0 132014436 0% /proc/asound
tmpfs 132014436 0 132014436 0% /proc/acpi
tmpfs 132014436 0 132014436 0% /proc/scsi
tmpfs 132014436 0 132014436 0% /sys/firmware
| 12:46:52 |
grw00 | ok, checked their ubuntu-based image and mounts look like this:
root@cc04a766e493:~# df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 20971520 64224 20907296 1% /
tmpfs 65536 0 65536 0% /dev
tmpfs 132014448 0 132014448 0% /sys/fs/cgroup
shm 15728640 0 15728640 0% /dev/shm
/dev/nvme0n1p2 65478188 18995924 43110440 31% /usr/bin/nvidia-smi
/dev/nvme0n1p4 20971520 0 20971520 0% /workspace
tmpfs 132014448 12 132014436 1% /proc/driver/nvidia
tmpfs 132014448 4 132014444 1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs 26402892 8832 26394060 1% /run/nvidia-persistenced/socket
tmpfs 132014448 0 132014448 0% /proc/asound
tmpfs 132014448 0 132014448 0% /proc/acpi
tmpfs 132014448 0 132014448 0% /proc/scsi
tmpfs 132014448 0 132014448 0% /sys/firmware
| 12:50:05 |
grw00 | interesting they have nvidia-smi mount when my nix container does not 🤔 | 12:50:32 |
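One caveat when comparing the two listings: GNU df by default tends to show only one mount point per backing device, so a bind-mounted /usr/bin/nvidia-smi may be hidden behind another entry from the same partition (here possibly /sbin/docker-init). A minimal sketch, assuming the usual whitespace-separated layout of /proc/self/mounts, lists every NVIDIA- or CUDA-related mount point instead. The function names are hypothetical.

```python
def parse_mounts(text):
    """Parse /proc/self/mounts-style text into (device, mount_point, fstype)."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fstype = fields[:3]
        entries.append((device, mount_point, fstype))
    return entries


def nvidia_mount_points(text):
    """Return mount points whose path mentions nvidia or cuda."""
    hits = []
    for _device, mount_point, _fstype in parse_mounts(text):
        lowered = mount_point.lower()
        if "nvidia" in lowered or "cuda" in lowered:
            hits.append(mount_point)
    return hits


def current_nvidia_mounts(path="/proc/self/mounts"):
    """Read the mount table of the current process and filter it."""
    with open(path) as f:
        return nvidia_mount_points(f.read())
```

Running `current_nvidia_mounts()` in both containers and comparing the two lists would show whether the Nix-based container really lacks the nvidia-smi and libcuda mounts.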
SomeoneSerge (back on matrix) | H'm. Maybe they really don't mount the userspace driver o_0. I suppose images derived from NGC do contain a compat driver, but it's kind of weird of them to expect that | 14:14:23 |
SomeoneSerge (back on matrix) | You still could use NixGL then | 14:14:39 |
SomeoneSerge (back on matrix) | NixGL will look at the /proc (I think) and choose the correct linuxPackages | 14:15:22 |
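To illustrate the idea of picking a driver from /proc, here is a minimal sketch that extracts the kernel module version from /proc/driver/nvidia/version (that path is mounted in both containers above). It is not NixGL's implementation; the line layout matched by the regular expression is an assumption about the usual "NVRM version: ... Kernel Module <version> ..." format, and the function names are hypothetical.

```python
import os
import re

# Assumed layout: "NVRM version: NVIDIA UNIX x86_64 Kernel Module  <version>  <date>"
# The open kernel module variant may read "... Kernel Module for x86_64  <version> ...".
_VERSION_RE = re.compile(r"Kernel Module(?: for \S+)?\s+(\d+(?:\.\d+)+)")


def parse_nvidia_kernel_module_version(text):
    """Return the driver version found on the NVRM line, or None."""
    for line in text.splitlines():
        if not line.startswith("NVRM version:"):
            continue
        match = _VERSION_RE.search(line)
        if match:
            return match.group(1)
    return None


def host_driver_version(path="/proc/driver/nvidia/version"):
    """Read the version file if present; None when no NVIDIA module is loaded."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return parse_nvidia_kernel_module_version(f.read())
```

An entrypoint could use the returned version to select a matching userspace driver among several shipped in one image, which is the single-image approach suggested later in the conversation.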
SomeoneSerge (back on matrix) | I'd suggest getting an MWE based on that and also reaching out to runpod's support to ask why they won't mount a driver compatible with the host's kernel | 14:16:36 |
SomeoneSerge (back on matrix) | (this conversation has happened here before: neither putting drivers into an image nor mounting the host's drivers is "correct": the driver in the image might not be compatible with the kernel running on the host, and the driver from the host might not be compatible e.g. with the libc in the image, et cetera) | 14:18:14 |
grw00 | great thanks, will try this and get back to you. it's
Cmd = [ "${inputs.nix-gl-host.defaultPackage.x86_64-linux}/bin/nixglhost" "${my-bin}/bin/executor" ];
?
| 14:37:50 |
grw00 | i guess i need to build some matrix of images with compat versions and choose which one based on cuda/kernel version in instance metadata | 14:39:17 |
SomeoneSerge (back on matrix) | In reply to @grw00:matrix.org
i guess i need to build some matrix of images with compat versions and choose which one based on cuda/kernel version in instance metadata
You can build a single image with NixGL | 14:39:55 |
SomeoneSerge (back on matrix) | Note: NixGL and nixglhost are different tools 🙃 | 14:40:03 |
SomeoneSerge (back on matrix) | * You can build a single image with NixGL (and multiple drivers) | 14:40:37 |
grw00 | ah 😓 | 14:41:32 |
| 19 Jun 2024 |
hexa (UTC+1) | python312 default migration has started | 14:04:34 |
SomeoneSerge (back on matrix) | stupid new faiss not building with cuda =\ | 14:19:56 |
| 21 Jun 2024 |
search-sense | Hello, NixOS community, I want to install python311Packages.tensorrt
TensorRT> command, and try building this derivation again.
TensorRT> $ nix-store --add-fixed sha256 TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz
TensorRT> ***
error: builder for '/nix/store/140c5c8lpa30r3jrxxbw74631831prrw-TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz.drv' failed with exit code 1;
but CUDA on my system is 12.2; is it compatible?
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
| 04:53:46 |
search-sense | Is anyone interested in adding the latest tensorrt-10.1.0 to NixOS?
searching for dependencies of /nix/store/gknr686xg6ggafkdfy5323bc7f1m5yf7-tensorrt-10.1.0.27-lib/lib/stubs/libnvinfer_vc_plugin.so
libstdc++.so.6 -> found: /nix/store/bn7pnigb0f8874m6riiw6dngsmdyic1g-gcc-13.3.0-lib/lib
libgcc_s.so.1 -> found: /nix/store/pd8xxiyn2xi21fgg9qm7r0qghsk8715k-gcc-13.3.0-libgcc/lib
setting RPATH to: /nix/store/bn7pnigb0f8874m6riiw6dngsmdyic1g-gcc-13.3.0-lib/lib:/nix/store/pd8xxiyn2xi21fgg9qm7r0qghsk8715k-gcc-13.3.0-libgcc/lib:$ORIGIN
auto-patchelf: 1 dependencies could not be satisfied
error: auto-patchelf could not satisfy dependency libcudart.so.12 wanted by /nix/store/799sv915xqi5b8n14hdkbbp6h06rrjz7-tensorrt-10.1.0.27-bin/bin/trtexec
auto-patchelf failed to find all the required dependencies.
Add the missing dependencies to --libs or use `--ignore-missing="foo.so.1 bar.so etc.so"`.
error: builder for '/nix/store/7rqkwg91vnk5d3p4vaym0z0pskkmj4r8-tensorrt-10.1.0.27.drv' failed with exit code 1;
last 10 log lines:
> libgcc_s.so.1 -> found: /nix/store/pd8xxiyn2xi21fgg9qm7r0qghsk8715k-gcc-13.3.0-libgcc/lib
> setting RPATH to: /nix/store/bn7pnigb0f8874m6riiw6dngsmdyic1g-gcc-13.3.0-lib/lib:/nix/store/pd8xxiyn2xi21fgg9qm7r0qghsk8715k-gcc-13.3.0-libgcc/lib:$ORIGIN
> searching for dependencies of /nix/store/gknr686xg6ggafkdfy5323bc7f1m5yf7-tensorrt-10.1.0.27-lib/lib/stubs/libnvinfer_vc_plugin.so
> libstdc++.so.6 -> found: /nix/store/bn7pnigb0f8874m6riiw6dngsmdyic1g-gcc-13.3.0-lib/lib
> libgcc_s.so.1 -> found: /nix/store/pd8xxiyn2xi21fgg9qm7r0qghsk8715k-gcc-13.3.0-libgcc/lib
> setting RPATH to: /nix/store/bn7pnigb0f8874m6riiw6dngsmdyic1g-gcc-13.3.0-lib/lib:/nix/store/pd8xxiyn2xi21fgg9qm7r0qghsk8715k-gcc-13.3.0-libgcc/lib:$ORIGIN
> auto-patchelf: 1 dependencies could not be satisfied
> error: auto-patchelf could not satisfy dependency libcudart.so.12 wanted by /nix/store/799sv915xqi5b8n14hdkbbp6h06rrjz7-tensorrt-10.1.0.27-bin/bin/trtexec
> auto-patchelf failed to find all the required dependencies.
> Add the missing dependencies to --libs or use `--ignore-missing="foo.so.1 bar.so etc.so"`.
For full logs, run 'nix log /nix/store/7rqkwg91vnk5d3p4vaym0z0pskkmj4r8-tensorrt-10.1.0.27.drv'.
| 07:59:22 |
search-sense | export NIXPKGS_ALLOW_UNFREE=1 && nix-build -A cudaPackages.tensorrt
> setting RPATH to: /nix/store/bn7pnigb0f8874m6riiw6dngsmdyic1g-gcc-13.3.0-lib/lib:/nix/store/pd8xxiyn2xi21fgg9qm7r0qghsk8715k-gcc-13.3.0-libgcc/lib:$ORIGIN
> auto-patchelf: 1 dependencies could not be satisfied
> error: auto-patchelf could not satisfy dependency libcudart.so.12 wanted by /nix/store/799sv915xqi5b8n14hdkbbp6h06rrjz7-tensorrt-10.1.0.27-bin/bin/trtexec
> auto-patchelf failed to find all the required dependencies.
> Add the missing dependencies to --libs or use `--ignore-missing="foo.so.1 bar.so etc.so"`.
| 11:03:14 |