
NixOS CUDA

288 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



7 Oct 2024
@glepage:matrix.orgGaétan Lepage

Hi,
I think that tinygrad is missing some libraries because I can get it to crash at runtime with:

Error processing prompt: Nvrtc Error 6, NVRTC_ERROR_COMPILATION
<null>(3): catastrophic error: cannot open source file "cuda_fp16.h"
  #include <cuda_fp16.h>
20:24:15
@glepage:matrix.orgGaétan Lepage Currently, we already patch the path to libnvrtc.so and libcuda.so, but maybe we should make the headers available too. 20:26:24
@aidalgol:matrix.orgaidalgol What is it doing such that a missing header is a runtime error? 20:44:05
@glepage:matrix.orgGaétan Lepage I think that tinygrad is compiling cuda kernels at runtime. 20:46:22
@glepage:matrix.orgGaétan Lepage That's why this missing header causes a crash when using the library.
tinygrad is entirely written in Python and is thus itself not compiled at all.
20:46:55
@glepage:matrix.orgGaétan Lepage This is for sure quite unusual, and this is why I am not sure how to make this header available "at runtime"... 20:48:31
@aidalgol:matrix.orgaidalgol I figured it must be something like that. I think any library derivation should be providing the headers in the derivation's dev output. Given that the error message shows an #include line with system-header brackets, it seems we need to pass a path to the tinygrad build somehow. 21:14:39
@aidalgol:matrix.orgaidalgol Does whatever nvidia compiler it's using have an equivalent to -isystem for gcc? 21:16:18
@glepage:matrix.orgGaétan Lepage Yes, you are right. In the meantime, I found out that cuda_fp16.h is provided by cudaPackages.cuda_cudart.dev 21:18:16
@glepage:matrix.orgGaétan Lepage The issue is that the way they invoke the compiler is a bit obscure: https://github.com/search?q=repo%3Atinygrad%2Ftinygrad%20nvrtcGetCUBIN&type=code 21:19:30
@glepage:matrix.orgGaétan Lepage I think that the closest examples within nixpkgs are cupy and numba.
But both of them handle this a bit differently.
21:20:06
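[For context: NVRTC does have an analogue of gcc's -isystem; nvrtcCompileProgram accepts -I/--include-path options, so the header search path can be supplied when the kernel is compiled at runtime. One hypothetical fix in nixpkgs would be to patch whatever option list tinygrad hands to NVRTC, in the same spirit as the existing libnvrtc.so/libcuda.so path patches. A sketch, where the patched file name and the substituted string are illustrative, not taken from tinygrad's actual source:

```nix
# Sketch only: assumes tinygrad passes a literal include-path option to
# nvrtcCompileProgram somewhere that can be rewritten during the build.
pkgs.python3Packages.tinygrad.overridePythonAttrs (old: {
  postPatch = (old.postPatch or "") + ''
    substituteInPlace tinygrad/runtime/ops_cuda.py \
      --replace-fail '-I/usr/local/cuda/include' \
        '-I${pkgs.cudaPackages.cuda_cudart.dev}/include'
  '';
})
```

The real NVRTC invocation in tinygrad would need to be located first; the point is only that cuda_cudart.dev's include directory can be injected as a compile option rather than made visible "system-wide" at runtime.]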
8 Oct 2024
@pascal.grosmann:scs.ems.host@pascal.grosmann:scs.ems.host left the room.10:55:06
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) From what I’ve seen in the Python ecosystem, compiling kernels at runtime is becoming more commonplace because it reduces the size of the binaries you ship and allows optimizing for the hardware you’re specifically running on. For example, JAX (via XLA) supports autotuning via Triton by compiling and running a number of different kernels. 15:46:17
@glepage:matrix.orgGaétan Lepage Yes, compiling on the fly is the core spirit of tinygrad. 15:47:06
@ss:someonex.netSomeoneSerge (back on matrix) Trying to compose backendStdenv with ccacheStdenv 🙃 17:07:51
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @ss:someonex.net
Trying to compose backendStdenv with ccacheStdenv 🙃
callPackage is a blessing and a curse
17:50:29
@ss:someonex.netSomeoneSerge (back on matrix) It works with a bit of copy-paste though 17:50:43
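[One way the composition can be spelled, as an untested sketch that assumes ccacheStdenv's makeOverridable interface accepts a stdenv argument:

```nix
# Re-base the ccache-wrapping stdenv on the CUDA backendStdenv, so builds
# get both the pinned CUDA-compatible compiler and ccache caching.
let
  cudaCcacheStdenv = pkgs.ccacheStdenv.override {
    stdenv = pkgs.cudaPackages.backendStdenv;
  };
in
# Then use cudaCcacheStdenv wherever backendStdenv would be passed,
# e.g. someCudaPackage.override { stdenv = cudaCcacheStdenv; }
cudaCcacheStdenv
```

Whether a given package accepts a stdenv override depends on how it is called; with callPackage-provided arguments this usually works, which may be the "blessing and a curse" referred to above.]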
@ss:someonex.netSomeoneSerge (back on matrix) But has anyone run into weird PermissionDenied errors with ccache? The directory is visible in the sandbox and owned by the nixbld group, and the id seems to match... 17:57:47
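[Per the ccache section of the nixpkgs manual, the cache directory must be bind-mounted into the build sandbox and group-writable by nixbld; a umask mismatch is a common cause of PermissionDenied even when ownership looks right. On NixOS the setup is roughly (paths illustrative):

```nix
{ config, ... }:
{
  programs.ccache.enable = true;
  # Defaults to /var/cache/ccache; should be owned root:nixbld
  # and group-writable so sandboxed builders can use it.
  programs.ccache.cacheDir = "/var/cache/ccache";
  # Expose the cache directory inside the build sandbox.
  nix.settings.extra-sandbox-paths = [ config.programs.ccache.cacheDir ];
}
```

If the directory is already visible and owned correctly, setting CCACHE_UMASK=007 via the ccacheWrapper's extraConfig is worth trying, so files created by one builder stay writable by the group.]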
9 Oct 2024
@john:friendsgiv.ingjohn joined the room.01:20:41
10 Oct 2024
@ss:someonex.netSomeoneSerge (back on matrix) Iterating on triton with ccache is so much faster lmao 16:12:34
11 Oct 2024
@msanft:matrix.orgMoritz Sanft Hey folks! I tried to update libnvidia-container, as it was lagging quite a few versions (including security releases) behind. We use it in a work scenario for GPU containers in legacy mode, where we tested it to "work" generally. The only thing that doesn't work is binary resolution (e.g. nvidia-smi, nvidia-persistenced, ...). I just adapted the patches from the old version so that they apply to the new one. I tried dropping the patch that replaces PATH-based binary lookup with a fixed /run/nvidia-docker directory, as this seems to be an artifact of older times, I believe? At least, the path doesn't exist in a legacy-mode container nor on the host. I think the binaries should really be looked up through PATH, which should be set accordingly when calling nvidia-container-cli? What do the experts think?

CDI containers work, as the binary paths are resolved correctly through the CDI config generated at boot.

Find my draft PR here: https://github.com/NixOS/nixpkgs/pull/347867
07:53:31
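[For comparison, the CDI path works because the spec generated at boot spells out the container edits explicitly, binaries included, instead of relying on lookup logic inside libnvidia-container. A minimal illustrative CDI document (host paths are made up for the example):

```json
{
  "cdiVersion": "0.5.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "all",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidia0" },
          { "path": "/dev/nvidiactl" }
        ],
        "mounts": [
          {
            "hostPath": "/run/current-system/sw/bin/nvidia-smi",
            "containerPath": "/usr/bin/nvidia-smi",
            "options": [ "ro", "nosuid", "nodev", "bind" ]
          }
        ]
      }
    }
  ]
}
```

Since every mount is an explicit hostPath/containerPath pair, binary resolution never depends on PATH or on a hard-coded /run/nvidia-docker directory.]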
@ss:someonex.netSomeoneSerge (back on matrix) * Iterating on triton with ccache is so much faster lmao EDIT: triton+torch in half an hour on a single node; this is not perfect but is an improvement 11:41:55
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @msanft:matrix.org
Hey folks! I tried to update libnvidia-container, as it was lacking quite some versions (including security releases) behind. […]
What'd be a reasonable way to test this, now that our docker/podman flows have all migrated to CDI and our singularity IIRC uses a plain text file with the library paths?
11:44:30
@msanft:matrix.orgMoritz Sanft I tested it with an "OCI hook", like so: https://github.com/confidential-containers/cloud-api-adaptor/blob/191ec51f6245a1a475c15312d354efaf07ff64de/src/cloud-api-adaptor/podvm/addons/nvidia_gpu/setup.sh#L11C1-L17C4 Getting that to work was also the reason why I came to update this package in the first place. 12:21:24
@msanft:matrix.orgMoritz Sanft The update is necessary to fix legacy library lookup for containers with GPU access, as newer drivers no longer ship libnvidia-pkcs11.so (which corresponds to OpenSSL 1.1), but only the *.openssl3.so alternatives for OpenSSL 3. Just to give this some context. Legacy binary lookup works with neither 1.9.0 nor 1.16.2 as of now. I think we might even want to get the update itself merged without fixing that, as it's security-relevant and the binary availability is not a regression, but I'm also happy to hear your stance on that. 12:24:09
@zopieux:matrix.zopi.euzopieux Pinning nixpkgs to 9357f4f23713673f310988025d9dc261c20e70c6 per this commit, I successfully manage to retrieve cudaPackages.(things) from the cuda-maintainers cachix; however, onnxruntime doesn't seem to be in there. Is it broken? 17:04:05


