| 4 Nov 2025 |
Gaétan Lepage | I quadruple-checked.
Both commits of my PR are actually necessary to get an nvcc-free onnxruntime. | 21:48:42 |
Gaétan Lepage | Let me change one comment to mention the bisection | 21:48:57 |
Gaétan Lepage | connor (burnt/out) (UTC-8), I reviewed nccl-tests. Feel free to merge | 22:08:11 |
Ari Lotter | i'm trying to fix this exact linker error right now 😭 trying to get flash-attn built for cuda capabilities 7.5 thru 12.0a, and i'm so stuck, and every rebuild with an attempted fix takes ~2 hours... any ideas? 😭 | 22:17:28 |
Ari Lotter | maybe we're just screwed :) | 22:20:25 |
Robbie Buxton | Which flash attention version | 22:24:21 |
Robbie Buxton | V2 or v3 | 22:24:27 |
Robbie Buxton | And from what git tag? | 22:24:51 |
Ari Lotter | v2, from tag v2.8.2 | 22:29:50 |
Robbie Buxton | I think there is currently a pr open in nixpkgs to add this, is that the one you’re building? | 22:30:41 |
Ari Lotter | oh neat, no | 22:31:37 |
Ari Lotter | let me compare my derivation with that one | 22:31:40 |
Ari Lotter | ok yeah, decently similar. difference is i'm building against cutlass 4.0 instead of 4.1, and.. somehow my deps list is way simpler, yet the build works (on previous versions of my derivation, pre updating CUDA)? very strange.. | 22:35:13 |
Ari Lotter | but yeah i just smash into
```
build/lib.linux-x86_64-cpython-312/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: PC-relative offset overflow in PLT entry for `_ZNK3c1010TensorImpl4sizeEl'
```
🤷 | 22:35:28 |
Ari Lotter | i'm so tired of CUDA nightmares 😭 im so close to giving up and building dockerized devenvs, i just really don't want to give in..... :( | 22:37:57 |
Gaétan Lepage | (It's a secret, but you might want to add https://cache.nixos-cuda.org as a substituter, it is slowly getting more and more artifacts)
Public key: cache.nixos-cuda.org:74DUi4Ye579gUqzH4ziL9IyiJBlDpMRn9MBN8oNan9M= | 22:44:02 |
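[For anyone following along: a minimal sketch of wiring that cache in via `nix.conf`, using the URL and public key Gaétan posted above; the option names are standard Nix settings.]

```
# /etc/nix/nix.conf sketch: add the CUDA cache as an extra binary cache.
# Artifacts are only substituted if signed by the trusted public key below.
extra-substituters = https://cache.nixos-cuda.org
extra-trusted-public-keys = cache.nixos-cuda.org:74DUi4Ye579gUqzH4ziL9IyiJBlDpMRn9MBN8oNan9M=
```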
Gaétan Lepage | connor (burnt/out) (UTC-8), Serge and I got #457803 ready.
We are waiting for nixpkgs's CI to get fixed (https://github.com/NixOS/nixpkgs/pull/458647).
Let's merge ASAP | 23:38:07 |
Robbie Buxton | For flash attention you should use the version of cutlass in the repo | 23:54:57 |
Robbie Buxton | They have a rev | 23:55:06 |
Robbie Buxton | In csrc/cutlass | 23:56:01 |
| 5 Nov 2025 |
apyh | ah fair enough | 00:10:30 |
SomeoneSerge (back on matrix) | step 1: torchWithCuda = pkgsCuda.....torch (we were supposed to be here now, but it got out of hand)
step 2: torchWithCuda = warn "..." pkgsCuda...
step 3: torchWithCuda = throw | 00:12:18 |
SomeoneSerge (back on matrix) | and what we really want is late binding and incremental builds | 00:13:41 |
connor (burnt/out) (UTC-8) | Why are you building for so many CUDA capabilities? I can’t really think of a reason you’d need that range in particular. | 01:59:14 |
connor (burnt/out) (UTC-8) | Added to merge queue | 02:07:23 |
apyh | In reply to @connorbaker:matrix.org Why are you building for so many CUDA capabilities? I can’t really think of a reason you’d need that range in particular.
it's a distributed ml training application that needs to run on everything from gtx 10xx gpus to modern data center GH/GB200s :/ | 03:27:37 |
apyh | most common hardware is gonna be 30xx 40xx 50xx, h100, a100, b200 | 03:27:56 |
apyh | though.. i could just see what pytorch's precompiled wheels run on and limit to that | 03:28:54 |
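[If the list does get trimmed, a sketch of restricting capabilities via the nixpkgs config. The capability-to-GPU mapping in the comments is my assumption; verify it against NVIDIA's compute-capability table before relying on it.]

```nix
# Sketch: build CUDA packages only for the hardware actually targeted.
import <nixpkgs> {
  config = {
    allowUnfree = true;
    cudaSupport = true;
    # 8.0 = A100, 8.6 = 30xx, 8.9 = 40xx, 9.0 = H100,
    # 10.0 = B200, 12.0 = 50xx -- assumed mapping, double-check
    cudaCapabilities = [ "8.0" "8.6" "8.9" "9.0" "10.0" "12.0" ];
  };
}
```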