| 11 Jul 2024 |
SomeoneSerge (back on matrix) | Madoura https://discourse.nixos.org/t/testing-gpu-compute-on-amd-apu-nixos/47060/2 this is falling apart 😹 | 23:16:45 |
| 12 Jul 2024 |
| @valconius:matrix.org left the room. | 01:16:15 |
| 13 Jul 2024 |
mcwitt | This might just be me being dumb, but I'm surprised that I'm unable to build jax with doCheck = false (use case: I want to override jaxlib = jaxlibWithCuda and don't want to run the tests). Repro:
nix build --impure --expr 'let nixpkgs = builtins.getFlake("github:nixos/nixpkgs/nixpkgs-unstable"); pkgs = import nixpkgs { system = "x86_64-linux"; config.allowUnfree = true; }; in pkgs.python3Packages.jax.overridePythonAttrs { doCheck = false; }'
fails with ModuleNotFoundError: jax requires jaxlib to be installed
| 01:58:54 |
mcwitt | Something else that's puzzling me: overriding with jaxlib = jaxlibWithCuda doesn't seem to work for numpyro (which has a pretty simple derivation):
Sanity check: a Python env with just jax and jaxlibWithCuda is GPU-enabled, as expected:
nix shell --refresh --impure --expr 'let nixpkgs = builtins.getFlake "github:nixos/nixpkgs/nixpkgs-unstable"; pkgs = import nixpkgs { config.allowUnfree = true; }; in (pkgs.python3.withPackages (ps: [ ps.jax ps.jaxlibWithCuda ]))' --command python -c "import jax; print(jax.devices())"
yields [cuda(id=0), cuda(id=1)].
But when numpyro is overridden to use jaxlibWithCuda, for some reason the propagated jaxlib is still the CPU version:
nix shell --refresh --impure --expr 'let nixpkgs = builtins.getFlake "github:nixos/nixpkgs/nixpkgs-unstable"; pkgs = import nixpkgs { config.allowUnfree = true; }; in (pkgs.python3.withPackages (ps: [ ((ps.numpyro.overridePythonAttrs (_: { doCheck = false; })).override { jaxlib = ps.jaxlibWithCuda; }) ]))' --command python -c "import jax; print(jax.devices())"
yields [CpuDevice(id=0)]. (Furthermore, if we try to add jaxlibWithCuda to the withPackages call, we get a collision error, so clearly something is propagating the CPU jaxlib 🤔)
Has anyone seen this, or have a better way to use the GPU-enabled jaxlib as a dependency?
| 04:30:05 |
mcwitt | Ack, the second one was just me being dumb. I'd mixed up the proper ordering of `override` and `overridePythonAttrs` 🤦 sorry for the noise | 05:03:40 |
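A minimal sketch of the ordering fix mcwitt describes (untested, assuming the numpyro expression takes a jaxlib argument as in the snippets above): swap the dependency with `override` first, then tweak derivation attributes with `overridePythonAttrs` on the result, so the attribute override isn't discarded when the package function is re-called:

```nix
# Sketch: .override replaces the callPackage-level argument (jaxlib),
# then .overridePythonAttrs adjusts mkDerivation attrs on the result.
ps: [
  ((ps.numpyro.override { jaxlib = ps.jaxlibWithCuda; }).overridePythonAttrs (_: {
    doCheck = false;
  }))
]
```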
SomeoneSerge (back on matrix) | In reply to @mcwitt:matrix.org
Something else that's puzzling me: overriding with jaxlib = jaxlibWithCuda doesn't seem to work for numpyro (which has a pretty simple derivation): …
Wait, why is numpyro propagating jaxlib? I thought we had a convention not to propagate jaxlib | 08:10:45 |
| ocharles set a profile picture. | 10:21:04 |
| 15 Jul 2024 |
| oak 🏳️🌈♥️ changed their profile picture. | 03:16:33 |
connor (burnt/out) (UTC-8) | Were there any changes to Python packages over the last month or so that may have caused breakages? Like default version or anything? I rebased a PR and I’m seeing more build failures than I was previously | 21:25:25 |
SomeoneSerge (back on matrix) | In reply to @connorbaker:matrix.org Were there any changes to Python packages over the last month or so that may have caused breakages? Like default version or anything? I rebased a PR and I’m seeing more build failures than I was previously Like 3.12? | 21:25:53 |
connor (burnt/out) (UTC-8) | Oooooh that would do it! | 21:26:10 |
| 16 Jul 2024 |
connor (burnt/out) (UTC-8) | Trying to post the results of a nixpkgs-review run and:
There was a problem saving your comment. Your comment is too long (maximum is 65536 characters). Please try again.
| 00:29:53 |
| will joined the room. | 04:51:50 |
hexa (UTC+1) | that happened to me … never? | 11:32:43 |
connor (burnt/out) (UTC-8) | What’s the mechanism which allows us to build the tests in passthru.tests? | 13:27:48 |
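For context on that question: passthru.tests is, by convention, just an ordinary attribute set of derivations attached to a package; nothing builds them as part of the package itself, so they are built by addressing them directly (CI tooling such as ofborg and nixpkgs-review also discovers them there). A hypothetical sketch, with all names invented for illustration:

```nix
# Hypothetical sketch: passthru.tests is a plain attrset of derivations.
stdenv.mkDerivation (finalAttrs: {
  pname = "example";
  version = "1.0";
  # ...
  passthru.tests.smoke = runCommand "example-smoke-test" { } ''
    ${finalAttrs.finalPackage}/bin/example --version > $out
  '';
})
# Built explicitly, e.g.:
#   nix-build -A example.tests.smoke
```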
connor (burnt/out) (UTC-8) | SomeoneSerge (UTC+3): I'm having a bit of trouble with GPU-access in the sandbox. In particular, I've enabled it in my NixOS config with
{
programs.nix-required-mounts = {
enable = true;
presets.nvidia-gpu.enable = true;
# TODO: Fix merging of presets
# error: The option `programs.nix-required-mounts.allowedPatterns.nvidia-gpu.unsafeFollowSymlinks' has conflicting definition values:
# - In `/nix/store/dk2rpyb6ndvfbf19bkb2plcz5y3k8i5v-source/nixos/modules/programs/nix-required-mounts.nix': false
# - In `/nix/store/2ja2h1nd0z2bw56cl4bn37cb9d18hnzr-source/devices/nixos-desktop/hardware.nix': true
# Use `lib.mkForce value` or `lib.mkDefault value` to change the priority on any of these definitions.
# TODO: After enabling running into this error when trying to build a derivation which has requiredSystemFeatures = [ "cuda" ];
# error:
# … while setting up the build environment
#
# error: getting attributes of path '/nix/store/0p1qsszik7hwjddzmyhikq9ywr2ki69l-systemd-minimal-255.6/sbin/bin': No such file or directory
# Perhaps this is due to the systemd directory in /run/opengl-driver/lib?
allowedPatterns.nvidia-gpu.unsafeFollowSymlinks = lib.mkForce true;
};
}
I'm trying to enable the GPU portion of CMake's CUDA test suite (https://github.com/ConnorBaker/nixpkgs/commit/543cf7d2ec330286ba566e6e1187e531d155c5d0), but failing. I thought following the symlinks when mounting would help because before that I was unable to access the ${addDriverRunpath.driverLink}/lib directory (multiple symlinks), but it now fails with
$ nix build --impure -L .#cudaPackages.cmake-cuda-tests.tests.withGpu
warning: Nix search path entry '/nix/var/nix/profiles/per-user/root/channels' does not exist, ignoring
warning: killing stray builder process 5264 ()...
error:
… while setting up the build environment
error: getting attributes of path '/nix/store/0p1qsszik7hwjddzmyhikq9ywr2ki69l-systemd-minimal-255.6/sbin/bin': No such file or directory
Any ideas?
| 17:48:11 |
connor (burnt/out) (UTC-8) | Ah, okay. The thing addDriverRunpath.driverLink links to is /run/opengl-driver. That is in turn a symlink, created by this: https://github.com/NixOS/nixpkgs/blob/c82d9d313d5107c6ad3a92fc7d20343f45fa5ace/nixos/modules/hardware/graphics.nix#L5-L8 That derivation isn't exposed except as a path, used here: https://github.com/NixOS/nixpkgs/blob/c82d9d313d5107c6ad3a92fc7d20343f45fa5ace/nixos/modules/hardware/graphics.nix#L112-L121 I updated my NixOS config as follows, and it seems to work.
{
programs.nix-required-mounts = {
enable = true;
presets.nvidia-gpu.enable = true;
allowedPatterns.nvidia-gpu = {
onFeatures = [
"gpu"
"nvidia-gpu"
"opengl"
"cuda"
];
# It exposes these paths in the sandbox:
paths =
let
inherit (pkgs.addOpenGLRunpath) driverLink;
thingDriverLinkLinksTo =
config.systemd.tmpfiles.settings.graphics-driver."/run/opengl-driver"."L+".argument;
in
[
driverLink
thingDriverLinkLinksTo
"/dev/dri"
"/dev/nvidia*"
];
};
};
}
Of course, that same process would need to be repeated for anything in there which is in turn a symlink (which is the purpose of unsafeFollowSymlinks, I suppose), but I'm not getting that odd systemd bin error any more.
| 18:11:02 |
mkiefel | Hi! I'm trying to get an application to work with libGL on a Jetson Orin AGX (with Ubuntu as the host Linux). For context, I am trying to get a camera image from a device with libargus (which requires GL). I'm not on the latest from unstable; maybe that is the issue. I've already tried pre-loading various GL libs from the base image of Jetpack but to no avail. Does anybody have some pointers for me, please? | 19:43:27 |
mkiefel | In reply to @mkiefel:matrix.org Hi! I'm trying to get an application to work with libGL on a Jetson Orin AGX … Man, I got it. Somehow the wrong libEGL_nvidia.so got picked up. With the right one it works. This kept me busy this afternoon. :) In any case, thanks so much for the great work on the cuda packages! I really appreciate all the work that you folks put into this. | 20:04:26 |
SomeoneSerge (back on matrix) | In reply to @connorbaker:matrix.org
Ah, okay. The thing addDriverRunpath.driverLink links to is /run/opengl-driver. … I updated my NixOS config as follows, and it seems to work. …
Answering from a phone, so I'll be curt. That's the reason the module mounts the closure of hardware.opengl.package by default. If you used mkForce somewhere you could've overridden that accidentally. The symlink branch is for non-NixOS, but I don't trust it. I was thinking maybe a runtime closure computation (nix-store --query --requisites) might be a reasonable future alternative. We'll have to come up with something stable anyway, for CDI | 20:17:07 |
SomeoneSerge (back on matrix) | The datacenter driver is also merged into hardware.opengl.package isn't it? | 20:18:10 |
SomeoneSerge (back on matrix) | To be clear: the intention is that on NixOS the user should never have to manually list all packages in the driver's closure. If you find that you need to, that's either a bug or an edge case I failed to handle | 20:19:42 |
SomeoneSerge (back on matrix) | In reply to @mkiefel:matrix.org Man, I got it. Somehow the wrong libEGL_nvidia.so got picked up. With the right one it works. This kept me busy this afternoon. :) In any case, thanks so much for the great work on the cuda packages! I really appreciate all the work that you folks put into this. Thanks. Could you still tell us which libEGL was the wrong one and which one is right? | 20:21:58 |
mkiefel | In reply to @ss:someonex.net Thanks. Could you still tell us which libEGL was the wrong one and which one is right? Sure. It went for /nix/store/cg66ia01r8226nr478rv2b7fffvrl4gg-xgcc-12.3.0-libgcc/lib/libEGL_nvidia.so.0 but should have picked the one in /usr/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0. I think I need to do something like nixGL and set these libraries up when calling the executable. I'm still a bit confused why setting export __EGL_VENDOR_LIBRARY_FILENAMES=/usr/lib/aarch64-linux-gnu/tegra-egl/nvidia.json didn't do the trick. | 20:31:42 |
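For reference on the __EGL_VENDOR_LIBRARY_FILENAMES variable: it points glvnd at an EGL ICD JSON file describing which vendor library to load. A sketch of the usual format (the actual contents of the tegra-egl nvidia.json may differ):

```json
{
  "file_format_version": "1.0.0",
  "ICD": {
    "library_path": "/usr/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0"
  }
}
```

One hedged guess at the confusion above: when library_path is a relative name (just "libEGL_nvidia.so.0"), glvnd resolves it through the dynamic loader's ordinary search path, so a same-named library earlier on that path can still win even though the JSON itself was found.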
| @alex3829:matrix.org left the room. | 23:17:07 |