NixOS CUDA | 290 Members | 57 Servers
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda
| Message | Time |
|---|---|
| 23 Oct 2024 | |
| The topic is interesting. Imagine being able to have a massive decentralized build farm! That would be amazing. Of course, this is far from possible today (mostly because of Nix limitations). | 06:39:26 |
| The opposite of decentralized, but I've been trying to set up Azure instances which all share the same Nix store over NFS. You'd have a storage server with a lot of disk (or RAM, if you're putting the store in memory) and a bunch of build servers. The storage server would be the only machine with Nix installed, and it would be a single-user install (so no daemon). It would also have max-jobs set to zero. Build servers would mount the Nix store over NFS. The storage server would list all the build servers as remote builders, and specify their stores as the new-ish experimental SSH builder with mounted stores. Ideally, kicking off a build on the storage server would cause jobs to be taken up by the build servers, and because they're using mounted stores, Nix shouldn't try to copy paths to/from the build servers and the storage server. Additionally, there shouldn't be any traffic between builders, since they're all sharing the same store, so as long as locking works properly there'd be no duplicate downloads or builds of dependencies. The kicker is that to make all this fast I'm using NFS over RDMA with 200G (or 400G) InfiniBand. | 16:09:23 |
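For anyone wanting to try this, a minimal sketch of the storage-server side of the setup described above. The host names are hypothetical, and the `mounted-ssh-ng://` scheme and `mounted-ssh-store` feature name refer to the experimental "SSH store with mounted filesystem" in recent Nix releases; verify both (and whether the scheme is accepted in a builders file) against your Nix version.

```
# /etc/nix/nix.conf on the storage server (single-user install, so no daemon)
experimental-features = nix-command mounted-ssh-store
max-jobs = 0                    # never build locally, only dispatch to builders
builders = @/etc/nix/machines   # builder list lives in a separate file

# /etc/nix/machines -- one line per build server; the mounted scheme tells Nix
# the remote store is also visible locally over NFS, so no path copying occurs.
# Fields: URI, system, SSH key, max-jobs, speed factor, supported features
mounted-ssh-ng://builder1 x86_64-linux - 32 1 big-parallel
mounted-ssh-ng://builder2 x86_64-linux - 32 1 big-parallel
```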
| I've also started packaging the CUDA Library Samples (different from the CUDA Samples) to serve as a sort of test suite for changes made to the package set: https://github.com/ConnorBaker/CUDALibrarySamples/tree/feat/cmake-rewrite | 16:12:35 |
| The absolute biggest pain in my ass right now is packaging onnxruntime (https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnxruntime/package.nix). For the ONNX and TensorRT ecosystem, I'm doing a cursed build where the CMake and Python builds are interlaced. It turns out that doing a straight CMake build gives different results compared to doing a Python build. Go figure: the multi-thousand-line Python scripts invoked by setup.py change how stuff is configured. | 16:15:00 |
| Here's an example of the interlaced build: https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnx-tensorrt.nix Note that it does avoid building the library twice! | 16:16:45 |
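As a rough illustration of what "interlaced" means here, a hedged sketch (the package name and the environment variable are purely illustrative, not the actual onnx-tensorrt expression linked above): the native library is configured and built once with CMake, and the Python packaging step then runs against that existing build tree.

```nix
{ stdenv, cmake, python3 }:

stdenv.mkDerivation {
  pname = "interlaced-build-sketch"; # illustrative name
  version = "0.0.0";
  src = ./.;

  nativeBuildInputs = [ cmake python3 ];

  # The cmake setup hook has already configured into the build directory,
  # which is the current directory when buildPhase runs.
  buildPhase = ''
    runHook preBuild

    # 1. Native build via CMake, exactly once.
    make -j"$NIX_BUILD_CORES"

    # 2. Python build from the source root, pointed at the finished CMake
    #    tree so setup.py reuses the artifacts instead of rebuilding.
    #    How the build dir is communicated is project-specific; the env var
    #    below is shown only for illustration.
    ( cd .. && EXISTING_BUILD_DIR="$PWD/build" python3 setup.py bdist_wheel )

    runHook postBuild
  '';
}
```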
| That looks so cool! It's a very good idea. It would be great to have a single entry point to this crazy setup. Multi-platform support would also be great (but probably quite hard). | 16:28:31 |
| Btw connor (he/him) (UTC-7), we are facing a super weird onnx issue in this PR. Basically, updating torchmetrics makes some random package further down the tree fail on aarch64-linux. In case you have some ideas... | 16:29:45 |
| Yeah, part of the reason I'm iterating in a separate repo for this stuff is that I can just say "SCREW THE OTHER PLATFORMS MUAHAHAHHAAH" (and also because I don't have to re-evaluate nixpkgs on every change). I'll try to take a look... but no promises :) | 16:31:23 |
| Okay, I looked at it and have no idea 🤷‍♂️ | 16:32:07 |
| The whole ONNX ecosystem is difficult to package for Nix because they all use both git submodules AND CMake's FetchContent functionality, making it super difficult to package with the stuff we already provide. Some of them build with flags we don't, or apply patches before building, so it's painful. | 16:34:04 |
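One mitigation, sketched below: pre-fetch each vendored source with Nix and hand it to CMake via `FETCHCONTENT_SOURCE_DIR_<NAME>`, which FetchContent consults before attempting a download. The protobuf pin is only an example dependency, not taken from any of these projects.

```nix
{ lib, stdenv, cmake, fetchFromGitHub }:

let
  # Example pin; the real dependency set would mirror the project's
  # submodules and FetchContent declarations.
  protobufSrc = fetchFromGitHub {
    owner = "protocolbuffers";
    repo = "protobuf";
    rev = "v21.12";
    hash = lib.fakeHash; # substitute the real hash
  };
in
stdenv.mkDerivation {
  pname = "fetchcontent-override-sketch"; # illustrative name
  version = "0.0.0";
  src = ./.;

  nativeBuildInputs = [ cmake ];

  cmakeFlags = [
    # FetchContent checks FETCHCONTENT_SOURCE_DIR_<UPPERCASED NAME> before
    # trying the network, so this substitutes the Nix-fetched tree.
    "-DFETCHCONTENT_SOURCE_DIR_PROTOBUF=${protobufSrc}"
    # Fail fast if any declared content still tries to download.
    "-DFETCHCONTENT_FULLY_DISCONNECTED=ON"
  ];
}
```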
| That's partly why my packaging of onnxruntime involves rewriting some of their CMake files. | 16:34:35 |
| I also love that onnx by default builds with ONNX_ML=0 (disabling old APIs in favor of new ones), but various projects depend on it being set to one value or the other, so you could very easily end up with two copies of onnx, each configured with a different value for ONNX_ML. | 16:36:22 |
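One way to avoid that duplication is to pin a single ONNX_ML value once, in an overlay, so every consumer sees the same configuration. A minimal sketch, assuming an `onnx` attribute exists in the package set being overlaid (attribute and flag placement are illustrative):

```nix
# overlay.nix -- force every consumer onto one onnx configuration
final: prev: {
  onnx = prev.onnx.overrideAttrs (old: {
    # Whichever value is chosen, it is chosen exactly once, here; a package
    # that needs the other value now fails loudly instead of silently pulling
    # in a second, differently-configured copy of onnx.
    cmakeFlags = (old.cmakeFlags or [ ]) ++ [ "-DONNX_ML=1" ];
  });
}
```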
| God, what a nightmare. | 16:36:36 |
| We should also update OpenCV to 4.10 at some point, if we haven't already, so it can build with CUDA 12.4+. | 16:38:16 |
| Oh! Unrelated, but this was a cute change I made that I quite like: https://github.com/ConnorBaker/cuda-packages/blob/c81a6595f07456c6cc34d8976031c4fa972a741f/cudaPackages-common/backendStdenv.nix#L36 It sets some defaults for the CUDA stdenv and adds a name prefix, similar to what the Python packaging does, for more descriptive store paths. | 16:40:07 |
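The shape of that change, as a hedged sketch rather than the linked code: wrap the CUDA stdenv's `mkDerivation` so everything it builds gets a descriptive store-path prefix, much as buildPythonPackage yields names like `python3.12-foo-1.0`. The argument names here are assumptions.

```nix
{ lib, cudaStdenv, cudaMajorMinorVersion }:

cudaStdenv
// {
  mkDerivation = args:
    cudaStdenv.mkDerivation (
      args
      // lib.optionalAttrs (args ? pname) {
        # e.g. cuda12.4-onnxruntime-1.18.0 instead of onnxruntime-1.18.0
        pname = "cuda${cudaMajorMinorVersion}-${args.pname}";
      }
    );
}
```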
| Thanks for taking the time to look at it and explain all of this! | 21:26:24 |
| 25 Oct 2024 | |
| https://gist.github.com/ConnorBaker/6c9c522d46e4244eb33d2aad94c753b0 | 11:51:27 |
| 26 Oct 2024 | |
| 🥲 | 20:34:04 |
| [image attachment: clipboard.png] | 20:34:07 |
| 27 Oct 2024 | |
| Do any of you use clangd as a C LSP, with cudatoolkit coming from a shell? clangd seems not to take notice of CUDA in that case, saying `Cannot find CUDA installation; ...` | 07:54:44 |
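Clang only probes a few fixed filesystem locations for a CUDA installation, so a toolkit living in the Nix store is invisible to clangd unless you pass `--cuda-path` explicitly. A sketch of one way to wire that up from a dev shell; whether clang accepts the `cudatoolkit` layout may depend on the versions involved, so treat this as an assumption.

```nix
# shell.nix sketch: generate a .clangd that points clangd at the Nix-provided
# CUDA toolkit via clang's --cuda-path flag.
{ pkgs ? import <nixpkgs> { config.allowUnfree = true; } }:

pkgs.mkShell {
  packages = [
    pkgs.cudatoolkit
    pkgs.clang-tools # provides clangd
  ];

  shellHook = ''
    # clangd reads extra compile flags from a .clangd file in the project root
    cat > .clangd <<EOF
    CompileFlags:
      Add: [--cuda-path=${pkgs.cudatoolkit}]
    EOF
  '';
}
```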
| Also, does cudatoolkit miss a dependency on gcc, or am I misreading this error? EDIT: No, it does indeed seem to try to find GCC: | 07:57:18 |
| In reply to @msanft:matrix.org: Yeah, we don't link gcc directly into nvcc, but provide it independently via the overridden stdenv. | 09:40:39 |
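Concretely, the nixpkgs expectation is that the matching host gcc arrives via the CUDA package set's overridden stdenv, roughly like the sketch below. `backendStdenv` and `cuda_nvcc` are real nixpkgs attributes; the package itself is hypothetical.

```nix
{ cudaPackages }:

# backendStdenv carries the gcc that this CUDA release supports, and the nvcc
# wrapper points nvcc at it, so gcc never has to be a dependency of nvcc itself.
cudaPackages.backendStdenv.mkDerivation {
  pname = "hello-cuda"; # hypothetical example package
  version = "0.0.0";
  src = ./.;

  nativeBuildInputs = [ cudaPackages.cuda_nvcc ];

  buildPhase = ''
    runHook preBuild
    nvcc hello.cu -o hello
    runHook postBuild
  '';

  installPhase = ''
    runHook preInstall
    install -Dm755 hello $out/bin/hello
    runHook postInstall
  '';
}
```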
| In reply to @glepage:matrix.org: A horror security-wise, though xD | 11:08:32 |