!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

280 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda
58 Servers

6 Oct 2025
@connorbaker:matrix.org connor (burnt/out) (UTC-8) I made a faster diffing thing (but it requires a fair amount of memory): https://github.com/ConnorBaker/nix-nixpkgs-review 14:33:23
@connorbaker:matrix.org connor (burnt/out) (UTC-8)

As an example:

 nix build -L .#diffs.x86_64-linux.pkgs-pre-pkgs-cuda-pre --build-dir /run/temp-ramdisk --builders '' --override-input nixpkgs-pre github:NixOS/nixpkgs

will evaluate a copy of nixpkgs using the nixpkgs-pre input without CUDA enabled and with CUDA enabled, and then diff the results (each step happens in a separate derivation so there's caching)

It's IO and memory hungry though (IO because it's instantiating ~1.5 GB worth of derivations) and memory hungry because it's evaluating all of Nixpkgs in a single pass

I've written it so it uses DetSys' parallel eval as well

14:39:10
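A minimal sketch of the diff step described above, assuming each evaluation has been reduced to a mapping from attribute path to `drvPath`. The function name `diff_evals`, the sample attribute paths, and the store paths are hypothetical; this is not taken from nix-nixpkgs-review, which may structure its reports differently.

```python
def diff_evals(pre: dict[str, str], post: dict[str, str]) -> dict[str, list[str]]:
    """Classify attribute paths by how their drvPath differs between two evaluations."""
    # Present only in the second evaluation (e.g. only evaluates with CUDA enabled).
    added = sorted(post.keys() - pre.keys())
    # Present only in the first evaluation.
    removed = sorted(pre.keys() - post.keys())
    # Present in both, but the derivation itself differs.
    changed = sorted(a for a in pre.keys() & post.keys() if pre[a] != post[a])
    return {"added": added, "changed": changed, "removed": removed}


# Hypothetical sample data: attribute path -> drvPath.
pre = {
    "hello": "/nix/store/aaa-hello.drv",
    "opencv": "/nix/store/bbb-opencv.drv",
}
post = {
    "hello": "/nix/store/aaa-hello.drv",
    "opencv": "/nix/store/ccc-opencv.drv",
    "cudaPackages.nccl": "/nix/store/ddd-nccl.drv",
}

report = diff_evals(pre, post)
```

Only the `added` and `changed` entries would need to be built; unchanged attributes such as `hello` drop out of the report entirely.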
@connorbaker:matrix.org connor (burnt/out) (UTC-8) Here's the result of that command: https://gist.github.com/ConnorBaker/b1bbb3547d6c15921843ba0e048f94fd 14:41:08
@connorbaker:matrix.org connor (burnt/out) (UTC-8) When the evaluations of Nixpkgs instantiations are done in the derivations, the --eval-store argument is set to the evalStore output so we can keep the derivations around. The entries in the packages output of the flake are small wrapper scripts which run a nix build using the added and changed derivations -- the evalStore outputs are used as extra substituters so derivations are copied as needed into the store and we avoid doing evaluation again 14:44:52
@connorbaker:matrix.org connor (burnt/out) (UTC-8) Anyway, I built that because I didn't have a way to run nixpkgs-review with content-addressed derivations and got irritated that it kept evaluating the base commit of PRs that hadn't changed (all it needed to do was re-evaluate the head of the PR). 14:48:28
@connorbaker:matrix.org connor (burnt/out) (UTC-8) (Using the scripts in packages does require the read-only-local-store feature be enabled, since the evalStore outputs from the reports are just small instances of Nix stores which are inside the Nix store, so they do need to be mounted as read-only.) 14:51:17
@connorbaker:matrix.org connor (burnt/out) (UTC-8) But as an example, just with that diff output, if we change python312Packages and versioned package sets to their default, sort and deduplicate, that gives us massively better coverage of the packages which require CUDA that we can build (since allowBroken and allowInsecure are both still false) 14:54:32
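The normalize, sort, and deduplicate step mentioned above could look roughly like the following sketch. The rewrite rules are assumptions for illustration (only `python312Packages` is named in the message; the `cudaPackages_<version>` rule and the sample attribute paths are hypothetical), and the actual set of versioned package sets to collapse would need to be chosen by hand.

```python
import re

# Hypothetical rewrite rules: versioned package set prefix -> default set prefix.
_RULES = [
    (re.compile(r"^python3\d+Packages\."), "python3Packages."),
    (re.compile(r"^cudaPackages_\d+(?:_\d+)*\."), "cudaPackages."),
]


def normalize(attr: str) -> str:
    """Rewrite a versioned package-set prefix to its default package set."""
    for pattern, replacement in _RULES:
        attr, count = pattern.subn(replacement, attr, count=1)
        if count:
            break  # at most one prefix rule applies to a given attribute path
    return attr


def canonical_attrs(attrs: list[str]) -> list[str]:
    """Normalize every attribute path, then deduplicate and sort."""
    return sorted({normalize(a) for a in attrs})


sample = [
    "python312Packages.torch",
    "python3Packages.torch",
    "python313Packages.jax",
    "cudaPackages_12_8.nccl",
    "opencv",
]
result = canonical_attrs(sample)
```

Collapsing versioned sets this way trades precision for a much shorter build list: a package that only differs in one versioned set is still represented, but only through the default set.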
@ss:someonex.net SomeoneSerge (back on matrix)

~1.5 GB worth of derivations)

LoL can we have evals run on a separate machine from the rest of hydra, and just do tmpfs store there xD

16:49:17
@ss:someonex.net SomeoneSerge (back on matrix)

When the evaluations of Nixpkgs instantiations are done in the derivations, the --eval-store argument is set to the evalStore output so we can keep the derivations around

Cursed

16:51:37
@connorbaker:matrix.org connor (burnt/out) (UTC-8) When I did something like this a year ago I had to use recursive Nix, so this is an “improvement” in that it only requires the read-only store experimental feature, which is much more limited 18:43:17
@connorbaker:matrix.org connor (burnt/out) (UTC-8) I guess if you don’t mind re-evaluating everything there’s no need for evalStore since we could use an in-memory dummy store (probably wouldn’t need a tmpfs build directory then either) 18:44:47
@connorbaker:matrix.org connor (burnt/out) (UTC-8) * 18:44:52
@ss:someonex.net SomeoneSerge (back on matrix) Well the real issue here is that lol why physical realizations of aterm drvs in the first place 19:44:03
7 Oct 2025
@connorbaker:matrix.org connor (burnt/out) (UTC-8) Using toJSON calls addToStore on paths before serializing so they’re valid; I guess I could try using unsafeDiscardWhatever before serializing to JSON to see if that’s enough to prevent realization 01:32:05
@connorbaker:matrix.org connor (burnt/out) (UTC-8) There was a --read-only flag I had forgotten about 03:47:47
@connorbaker:matrix.org connor (burnt/out) (UTC-8)

Okay, it's way faster now and doesn't need a ramdisk.
Whereas previously it would take about 1m40s on my i9-13900K, now it's taking about 20s to do:

time nix build -L ".#reports.x86_64-linux.pkgs-cuda-post^*" --builders '' --rebuild
03:53:40
@connorbaker:matrix.org connor (burnt/out) (UTC-8) For generating a single report htop shows it was taking about 18% of my RAM (less than 20GB) 03:57:11
@connorbaker:matrix.org connor (burnt/out) (UTC-8) To clarify, it wasn’t enough to use unsafeDiscardStringContext because the derivation was instantiated as soon as drvPath was evaluated, even before toJSON. Using the read-only argument (which is different than the store URI query parameter of the same name lmao) avoids the instantiation. 06:54:57
@connorbaker:matrix.org connor (burnt/out) (UTC-8) https://cuda-index.someonex.net 15:40:22
@connorbaker:matrix.org connor (burnt/out) (UTC-8) https://forge.someonex.net/else/sidx 15:40:26
@ss:someonex.net SomeoneSerge (back on matrix) connor (he/him) (UTC-7): ping about azure stuff again 19:30:18
@ss:someonex.net SomeoneSerge (back on matrix) Hey, we've been getting more experience with infra and hydra now, and I think like ephemeral builders are becoming more and more relevant. May I ask you to elaborate? 19:34:01
@ss:someonex.net SomeoneSerge (back on matrix) * connor (he/him) (UTC-7): ping about azure images again 19:34:34
@ss:someonex.net SomeoneSerge (back on matrix) MI25s to test ROCm? CC Lun
https://docs.azure.cn/en-us/virtual-machines/sizes/gpu-accelerated/nv-family
19:39:24
@lt1379:matrix.org Lun Too old 19:39:46
@ss:someonex.net SomeoneSerge (back on matrix) * Hey, we've been getting more experience with infra and hydra now, and I feel like ephemeral builders are becoming more and more relevant. May I ask you to elaborate? 19:39:48
@ss:someonex.net SomeoneSerge (back on matrix) mx250 only? 19:40:33
@lt1379:matrix.org Lun gfx906 is barely cared for by upstream so not sure they'd be great for automated testing. If we have credits that would otherwise go to waste and that's the only option then maybe worth it? 19:41:00
@lt1379:matrix.org Lun gfx90a (MI210/250) is the oldest Instinct option upstream seems to actually be paying attention to 19:41:26
@ss:someonex.net SomeoneSerge (back on matrix) Relatable: azure offers the right hw for us, but I'm not confident we can utilize it efficiently enough yet 19:44:47
