| 15 Jul 2025 |
mcwitt | * if your goal is just to get a python env running with CUDA-enabled pytorch (versus wanting to compile CUDA code), I'd recommend starting with a more minimal flake (like the one I posted above) | 21:22:05 |
connor (he/him) | Not sure about segfaults (I had them regularly if my RAM was clocked too high or voltage was unstable etc), but make sure you’re enabling cudaSupport and specifying your GPU’s compute capability for faster builds. | 21:22:40 |
farmerd | That's where the hardware thing comes in. I was seeing issues about hash mismatches and then I tried to verify and repair my nix-store and it's got a bunch of corrupted files (it couldn't repair them). | 21:23:11 |
farmerd | Yeah I think I've got a dimm going bad on me. I've been having random crashes throughout the system and I hadn't put it together until I spent a bunch of time on this yesterday and realized how many random things were corrupted. | 21:24:02 |
farmerd | I've got a new pair of dimms coming tomorrow so I'll swap them in (and probably reinstall nix since my nix-store is apparently corrupted beyond repair :-/ ) and try again. | 21:24:59 |
farmerd | Oh, although may I ask how to specify the compute capability? I did notice it was passing a bunch of them to NVCC but I didn't see how to specify it. | 21:25:45 |
mcwitt | regardless of hardware issues, if you're just starting out I don't think you should need to build anything from source.
The reason you're seeing this is the flake template you linked is pinned to an old revision of nixpkgs-unstable, and the build artifacts have likely expired from cache.nixos.org. I'll often update the nixpkgs pin as a first step when starting with a new template for this reason | 23:57:02 |
| 16 Jul 2025 |
farmerd | Ok, that makes sense. | 01:09:44 |
connor (he/him) | See the end of the first section https://github.com/NixOS/nixpkgs/blob/master/doc/languages-frameworks/cuda.section.md#cuda-cuda | 07:02:27 |
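For reference, the `cudaSupport` and compute-capability settings discussed above are nixpkgs config options. A minimal flake sketch, under stated assumptions: the `nixos-unstable` branch, the `x86_64-linux` system, and the `"8.9"` capability are placeholders to replace with your own values, and this is not the flake posted earlier in the room.

```nix
{
  # Assumption: tracking nixos-unstable; run a flake update to refresh the pin.
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let
      system = "x86_64-linux"; # assumption: adjust to your platform
      pkgs = import nixpkgs {
        inherit system;
        config = {
          allowUnfree = true;           # CUDA packages are unfree
          cudaSupport = true;           # build CUDA-enabled variants
          cudaCapabilities = [ "8.9" ]; # assumption: replace with your GPU's compute capability
        };
      };
    in {
      # Dev shell with a Python interpreter that has torch available.
      devShells.${system}.default = pkgs.mkShell {
        packages = [ (pkgs.python3.withPackages (ps: [ ps.torch ])) ];
      };
    };
}
```

Note that narrowing `cudaCapabilities` changes the derivations, so binary caches will most likely not have them and packages such as torch get built locally; leaving the option unset keeps the default capability list.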
| 18 Jul 2025 |
connor (he/him) | Could I get a review on https://github.com/NixOS/nixpkgs/pull/426280? | 19:20:16 |
| 21 Jul 2025 |
connor (he/him) | Went ahead and merged it | 17:20:26 |
| 23 Jul 2025 |
apyh | oof the nccl version in nixpkgs is quite old now | 16:30:38 |
apyh | (quite old in the ml world, lol. only a month old) | 16:31:25 |
apyh | torchtitan needs torch 2.8, torch 2.8 requires nccl 2.27, gotta update nccl myself | 16:31:49 |
apyh | guess I'll pr to nixpkgs lol | 16:31:56 |
apyh | pr opened 😁 | 16:59:39 |
Gaétan Lepage | Can you share the link apyh? | 22:56:02 |
apyh | ah sure! https://github.com/NixOS/nixpkgs/pull/427804 | 23:00:23 |
apyh | they added a bunch of new stuff so i have to patch the shebang in a second python script. surprisingly didn't cause a build failure without it, just didn't export some of the new symbols | 23:01:02 |
Gaétan Lepage | Thanks! | 23:03:49 |
| 24 Jul 2025 |
apyh | huh. thanks for the nixpkgs-review. very strange to me that it fails to build pytorch as a result, but that the python 3.13 failure is just a bunch of .. warnings inside torch? i'll compile again locally to see.. | 14:56:24 |
apyh | can't repro the build failure locally for python312Packages.torchWithCuda, Gaétan Lepage 🤔 left a comment to that effect here: https://github.com/NixOS/nixpkgs/pull/427804#issuecomment-3114819745 | 20:26:13 |
apyh | can't repro any of the build failures in fact, only took 3.5 hours per torch to test 😭 | 23:51:03 |
| 25 Jul 2025 |
Gaétan Lepage | It probably failed because of flakiness | 10:57:16 |
apyh | rebased it btw :) | 17:29:53 |
apyh | both builds worked fine on my machine.. does nixpkgs-review have a timeout? lol | 17:30:06 |
apyh | i have a 7800x3d and it still took 3.5 hours per torch build | 17:30:26 |
| 26 Jul 2025 |
Tristan Ross | Is that a PR that my 128 cores could be useful with? | 00:34:02 |
apyh | haha i mean, if you have the ram to match ;) | 01:07:29 |
apyh | it builds fine on my end - just a verification from someone else would be nice :) | 01:07:40 |