| 15 May 2024 |
| evax joined the room. | 21:57:42 |
| 16 May 2024 |
evax | Hi I'm trying to get jax with cuda to work in WSL using a flake, but the GPU is never recognized. Torch in the same flake recognizes the device. I've tried setting nixpkgs to both nixos-23.11 and nixos-unstable. | 19:11:22 |
trexd | In reply to @evax:matrix.org ("Hi I'm trying to get jax with cuda to work in WSL using a flake, but the GPU is never recognized. Torch in the same flake recognizes the device. I've tried setting nixpkgs to both nixos-23.11 and nixos-unstable."): Can you post your Nix code? | 19:18:24 |
evax | {
  description = "Jax+cuda shell";
  nixConfig = {
    extra-substituters = [
      "https://cuda-maintainers.cachix.org"
    ];
    extra-trusted-public-keys = [
      "cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E="
    ];
  };
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-23.11";
    flake-utils.url = "github:numtide/flake-utils";
  };
  outputs = { self, nixpkgs, flake-utils }:
    flake-utils.lib.eachDefaultSystem (system:
      let
        config = {
          allowUnfree = true;
          cudaSupport = true;
        };
        pkgs = (import nixpkgs { inherit system config; }).pkgs;
        python3 = pkgs.python311;
        deps = ps: with ps; [
          jax
          jaxlib
        ];
        devPython = python3.withPackages(ps: with ps; (deps(ps) ++ [
          ipython
        ]));
      in rec {
        inherit pkgs;
        devShell = pkgs.stdenv.mkDerivation {
          name = "jax-shell";
          buildInputs = [
            devPython
          ];
          shellHook = ''
            export CUDA_PATH=${pkgs.cudatoolkit}
            export LD_LIBRARY_PATH=/usr/lib/wsl/lib:${pkgs.linuxPackages.nvidia_x11}/lib:${pkgs.ncurses5}/lib
            export EXTRA_LDFLAGS="-L/lib -L${pkgs.linuxPackages.nvidia_x11}/lib"
            export EXTRA_CCFLAGS="-I/usr/include"
          '';
        };
        defaultPackage = devShell;
      }
    );
}
| 21:02:04 |
| 17 May 2024 |
evax | This was originally with nix on top of alma linux in WSL, I switched to NixOS-WSL and have the same issue | 07:07:30 |
SomeoneSerge (matrix works sometimes) |
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:${pkgs.linuxPackages.nvidia_x11}/lib:${pkgs.ncurses5}/lib
You don't want to reference nvidia_x11 in WSL environments. Even on Linux we don't reference it directly, cf. various posts about "impure drivers" on GitHub and Discourse | 08:38:07 |
SomeoneSerge (matrix works sometimes) |
/usr/lib/wsl/lib:
I forget now, is libcuda.so placed directly under this path, or in a subdirectory? | 08:38:38 |
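One way to answer the question above is to list where `libcuda.so*` actually sits under each candidate directory. The following is a minimal sketch using only the Python standard library; the function name `find_libcuda` and the depth limit are illustrative assumptions, and the two default directories are simply the ones named in this conversation.

```python
import os
from pathlib import Path

# Directories mentioned in this conversation; adjust for your own system.
CANDIDATE_DIRS = ["/usr/lib/wsl/lib", "/run/opengl-driver/lib"]


def find_libcuda(dirs=CANDIDATE_DIRS, max_depth=2):
    """Map each existing directory to the libcuda.so* files found under it.

    Paths are reported relative to the directory, so a bare "libcuda.so.1"
    means the library sits directly in it, while "sub/libcuda.so.1" means
    it is in a subdirectory. Directories that do not exist are skipped.
    """
    found = {}
    for d in dirs:
        root = Path(d)
        if not root.is_dir():
            continue
        hits = []
        for dirpath, dirnames, filenames in os.walk(root):
            depth = len(Path(dirpath).relative_to(root).parts)
            if depth >= max_depth:
                dirnames[:] = []  # stop descending below max_depth levels
            for name in filenames:
                if name.startswith("libcuda.so"):
                    hits.append(str(Path(dirpath, name).relative_to(root)))
        found[d] = sorted(hits)
    return found


if __name__ == "__main__":
    for directory, hits in find_libcuda().items():
        print(directory, "->", hits or "no libcuda.so found")
```

An empty list for a directory means it exists but contains no `libcuda.so*` within the depth limit. A directory missing from the output does not exist on the machine.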
SomeoneSerge (matrix works sometimes) | Could you also please gist the errors, and logs for LD_DEBUG=libs python -c "import torch; torch.cuda.is_available()" and LD_DEBUG=libs python -c "..." (some minimal code to make jax attempt loading libcuda) | 08:39:58 |
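For the "minimal code to make jax attempt loading libcuda" part, a small probe script along these lines may be enough. It is a sketch, not a confirmed reproduction: the function name `probe_jax` and the file name `probe_jax.py` are made up for illustration, and it assumes that asking JAX for its devices is what triggers backend initialization (and with it the driver lookup).

```python
import importlib


def probe_jax():
    """Report which backend and devices jax sees, tolerating a missing jax."""
    try:
        jax = importlib.import_module("jax")
    except Exception as exc:  # jax missing or failing at import time
        return {"installed": False, "error": str(exc)}
    try:
        # jax.devices() initializes the backends; a CUDA-enabled jaxlib
        # is expected to look for the driver library at this point.
        devices = [str(d) for d in jax.devices()]
        return {
            "installed": True,
            "backend": jax.default_backend(),
            "devices": devices,
        }
    except Exception as exc:
        return {"installed": True, "error": str(exc)}


if __name__ == "__main__":
    print(probe_jax())
```

Saved as `probe_jax.py`, it can be run as `LD_DEBUG=libs python probe_jax.py 2> jax-ld-debug.log`. The dynamic loader writes its trace to stderr, so the log file captures the library search while the script's own result stays on stdout.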
evax | thanks, let me try these things | 08:45:37 |
evax | libcuda.so is under /usr/lib/wsl/lib | 08:56:55 |
evax | some finding, under NixOS-WSL, with the option to use windows drivers, a wsl-lib package is created in the nix store linking the contents of /usr/lib/wsl/lib | 09:11:06 |
evax | it's exposed under /run/opengl-driver/lib | 09:11:57 |
evax | it might just be that jax is expecting cuda12 but the actual version in the system is cuda11 | 09:12:26 |
SomeoneSerge (matrix works sometimes) | In reply to @evax:matrix.org ("some finding, under NixOS-WSL, with the option to use windows drivers, a wsl-lib package is created in the nix store linking the contents of /usr/lib/wsl/lib"): Good, this sounds much safer than putting /usr/lib/wsl in LD_LIBRARY_PATH | 09:12:40 |
SomeoneSerge (matrix works sometimes) | In reply to @evax:matrix.org ("it might just be that jax is expecting cuda12 but the actual version in the system is cuda11"): It links its cuda libraries directly, and the driver is likely compatible with both releases | 09:13:07 |
evax | another finding, using jaxlibWithCuda (the nix compiled version) jax complains there's no CUDA enabled jaxlib, while using jaxlib-bin there's an error message related to loading CUDA | 09:14:53 |
evax | (I can't cut/paste/gist from that system, sorry) | 09:16:40 |
evax | the jaxlib-bin error (with TF_CPP_MIN_LOG_LEVEL=0) is external/tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used. | 09:22:09 |
evax | I tried to LD_PRELOAD libcuda.so and it doesn't help | 09:22:29 |
evax | with jaxlibWithCuda, the error is An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu. | 09:24:35 |
evax | torch finds the GPU with LD_LIBRARY_PATH pointing to either /usr/lib/wsl/lib or /run/opengl-driver/lib, but not without it. For jaxWithCuda, none of these options work | 09:46:50 |
connor (he/him) | Okay, tired of machines restarting
I just bought three different kits of RAM to replace the existing kits in my builders. And two 10GbE NICs to try to increase builder performance since they’re all networked together and the 2.5GbE on two of the machines was a bottleneck. | 17:09:15 |
connor (he/him) | God I hate hardware 🫠 | 17:09:20 |
Gaétan Lepage | How many systems do you have as builders? | 20:27:05 |
connor (he/him) | I have three desktops I use as builders; I also pay for an aarch64-linux Hetzner server which I use for aarch64-linux builds for CI | 23:13:40 |
| 18 May 2024 |
Gaétan Lepage | Ok cool!
I am starting to think about building a workstation for nix builds.
Would you mind sharing the specs of your machines? | 11:53:53 |
connor (he/him) | Sure! Although keep in mind I've had a very difficult time managing consumer-grade hardware (especially given I use ASUS motherboards, and the stupid default voltage levels which trigger instability in games also trigger very hard-to-reproduce segfaults during Nix builds) | 12:16:19 |
connor (he/him) | My main machine: https://pcpartpicker.com/user/connorbaker/saved/pxtbkL
A builder: https://pcpartpicker.com/user/connorbaker/saved/h6mvZL
A builder/storage: https://pcpartpicker.com/user/connorbaker/saved/Pyy7CJ | 12:51:29 |
connor (he/him) | FWIW, it takes magma-cuda-static with the default set of capabilities ~19m30s to build on nixos-desktop and ~21m12s to build on nixos-build01 or nixos-ext. | 12:52:26 |