
NixOS CUDA

282 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



Sender | Message | Time
23 Sep 2025
@hugo:okeso.euHugoEspecially since a working build (previous release) finished in 14 minutes.12:08:09
@gregorburger:matrix.orgGregor BurgerHi, quick question: is there an equivalent of cudaPackages.backendStdenv for clang? 12:11:37
@connorbaker:matrix.orgconnor (he/him)It looks like you’re out of memory and swapping hard; try lowering the number of cores given to the job or the number of parallel instances NVCC runs with (per what Albert said above). I’ve had to enable ZRAM (which has been highly effective) for some builds even on my desktops with 96GB of RAM.12:53:57
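The advice above (fewer cores per build, fewer parallel builds, ZRAM) can be sketched as NixOS configuration; the values below are illustrative, not recommendations:

```nix
# Sketch of a NixOS module capping build parallelism and enabling ZRAM.
# Tune the numbers to your hardware; these are placeholders.
{
  # Cores handed to each build (NIX_BUILD_CORES) and how many
  # derivations build in parallel.
  nix.settings.cores = 8;
  nix.settings.max-jobs = 2;

  # Compressed swap in RAM, helpful for memory-hungry CUDA builds.
  zramSwap = {
    enable = true;
    memoryPercent = 50;
  };
}
```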
@hugo:okeso.euHugo

I have this policy on my server, and a similar one on my desktop. Should that not prevent the OS from swapping?

  systemd.services.nix-daemon.serviceConfig = {
    CPUAccounting = true;
    AllowedCPUs = "2-15";
    MemoryAccounting = true;
    MemoryHigh = "48G";
    MemoryMax = "56G";
  };
12:55:49
@connorbaker:matrix.orgconnor (he/him)Currently no; I’m working on making the setup hooks and everything generic for Clang in https://github.com/NixOS/nixpkgs/pull/437723 but I ran into issues doing that and it’s not a high priority for me in the scope of that PR. Any particular use case?12:56:18
@connorbaker:matrix.orgconnor (he/him)I don’t know enough about systemd to answer that, but I know some of the flash attention kernel builds consume at least a hundred GB of RAM and if you’re seeing the build stall that reminds me of the swapping to disk behavior I’d seen previously. (I may also have misinterpreted the BTOP screen shot.)12:58:25
@hugo:okeso.euHugo Apparently there is an extra setting MemorySwapMax = "0"; that can disallow swap for a systemd unit. 13:37:42
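Combined with the serviceConfig shown earlier, that setting would look roughly like this (a sketch; MemorySwapMax is a standard systemd resource-control option on cgroup v2):

```nix
# Sketch: extend the nix-daemon unit above so its processes
# cannot use swap at all.
systemd.services.nix-daemon.serviceConfig = {
  MemoryAccounting = true;
  MemoryHigh = "48G";
  MemoryMax = "56G";
  MemorySwapMax = "0";  # forbid swap for this unit
};
```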
@gregorburger:matrix.orgGregor BurgerWe would like to compile our codebase both in gcc and clang to get a broader coverage of warnings and errors.14:33:43
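For a plain (non-CUDA) project, building the same derivation against both compilers for warning coverage might be sketched like this; `myProject` is a placeholder, and note this does not wire up the CUDA setup hooks:

```nix
# Hypothetical sketch: one package built with gcc and with clang.
{ pkgs ? import <nixpkgs> { } }:
let
  myProject = stdenv: stdenv.mkDerivation {
    pname = "my-project";
    version = "0.1";
    src = ./.;
    # nativeBuildInputs, buildPhase, etc. elided
  };
in {
  withGcc = myProject pkgs.stdenv;         # default gcc-based stdenv
  withClang = myProject pkgs.clangStdenv;  # clang-based stdenv
}
```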
@winter:catgirl.cloudWinter joined the room.19:57:49
@georgyo:nycr.chatgeorgyo joined the room.23:08:45
24 Sep 2025
@apache8080:matrix.orgapache8080

I'm running into a weird issue with TensorRT and a sandboxed environment. I'm trying to run TensorRT models within the Nix sandbox by leveraging the Nix extra-sandbox-paths setting, which gives the sandbox access to hardware and drivers (e.g. the NVIDIA drivers). I'm able to successfully run trtexec this way to generate TensorRT engines from an ONNX file, but when I try to run inference on those engines in the sandbox, it hangs forever. I verified that all of the correct libraries are loaded in the sandbox. What is weird is that the model loads onto the GPU just fine; it only hangs on inference calls. This happens only in the sandbox, so I think I may be missing some paths/settings that our app requires, or that trtexec brings in on its own. Outside of the sandbox I can run our app just fine.

Pretty stuck on this one at the moment

01:55:44
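The setup described above might look roughly like this as NixOS configuration; the device nodes and driver path are assumptions and vary by host and driver version:

```nix
# Hypothetical sketch: expose GPU device nodes and the host driver
# to the Nix build sandbox. Adjust paths for your machine.
{
  nix.settings.extra-sandbox-paths = [
    "/dev/nvidia0"
    "/dev/nvidiactl"
    "/dev/nvidia-uvm"
    "/run/opengl-driver"  # userspace driver libraries on NixOS
  ];
}
```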
@apache8080:matrix.orgapache8080Looks like the issue is on the application side and not a driver/NVIDIA library issue; extra-sandbox-paths seems to be working fine03:06:29
@connorbaker:matrix.orgconnor (he/him)What HW/host OS/driver/CUDA & TensorRT version? Generating inference engines with TensorRT in the sandbox is something I want to look into so I’d love to hear more about pain points06:01:46
@winter:catgirl.cloudWinter
ImportError: /nix/store/d2b95k4ysi7822hnxq72np5vvfq7wbbp-python3.12-tensorflow-gpu-2.19.0/lib/python3.12/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN3tsl8profiler8internal21g_trace_filter_bitmapE
anyone know what could be going wrong here? i'm just using bog-standard pythonPackages.tensorflow with cudaSupport = true
15:58:45
@winter:catgirl.cloudWinter(though maybe unrelated)15:58:49
@winter:catgirl.cloudWinter i find it weird that this is even happening given this is all built by us/backendStdenv 16:00:57
@winter:catgirl.cloudWinter (occurs during the _pywrap_cpu_feature_guard) 16:03:00
@winter:catgirl.cloudWinter[maybe the wrong channel, lmk if i should move :)]16:04:33
@winter:catgirl.cloudWinterdisregard16:06:29
@winter:catgirl.cloudWintercomputers are downright evil16:06:34
@winter:catgirl.cloudWinter(the library it's pointing to isn't actually the one it's loading!)16:06:50
@apyh:matrix.orgapyhis there a server for pytorch stuff specifically, or is this as close as it gets? really struggling to get torch.compile working :/16:38:53
@sporeray:matrix.orgRobbie BuxtonWhat error are you running into apyh?16:42:44
@gammieduncan:matrix.orgDuncan Gammie apyh: you'll probably get the fastest answer to that here if you provide specific error messages here: https://discuss.pytorch.org/c/compile/41 18:30:20
@apyh:matrix.orgapyh
In reply to @sporeray:matrix.org
What error are you running into apyh?
well, torch's .compile functionality requires a bunch of stuff that isn't provided in its Nix derivation: it needs gcc at runtime, it reads /etc/passwd to pick a cache directory, etc., so it doesn't work out of the box through its nixpkgs packaging
18:50:26
@apyh:matrix.orgapyhwas just wondering if there was like a torch-nix chat outside here 18:51:40
@sporeray:matrix.orgRobbie BuxtonAh, I've recently fixed the gcc issue locally; I was planning to put a PR in upstream this week.18:59:05
@apyh:matrix.orgapyhyou will, for CUDA, also need to set TRITON_LIBCUDA_PATH - it normally tries to find it with ldconfig20:09:52
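Putting the two workarounds mentioned above together, a dev shell for torch.compile might be sketched like this; the libcuda path is an assumption for a NixOS host, and depending on the Triton version the variable may need to point at the directory or the .so itself:

```nix
# Hypothetical dev-shell sketch: give torch.compile a runtime C
# compiler and point Triton at the host's libcuda.
{ pkgs ? import <nixpkgs> { config.cudaSupport = true; } }:
pkgs.mkShell {
  packages = [
    (pkgs.python3.withPackages (ps: [ ps.torch ]))
    pkgs.gcc  # torch.compile invokes a C compiler at runtime
  ];
  # Assumed driver library location on NixOS; verify on your system.
  TRITON_LIBCUDA_PATH = "/run/opengl-driver/lib";
}
```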


