NixOS CUDA - Public Room Timeline

	NixOS CUDA	290 Members
	CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda	57 Servers

Sender	Message	Time
23 Oct 2024
Gaétan Lepage	The topic is interesting. Imagine being able to have a massive decentralized build farm ! That would be amazing. Of course this is far from possible today (mostly because of nix limitations).	06:39:26
connor (he/him)	The opposite of decentralized, but I’ve been trying to set up Azure instances which all share the same Nix store over NFS. You’d have a storage server with a lot of disk (or RAM if you’re putting the store in memory) and a bunch of build servers. The storage server would be the only machine with Nix installed, and it would be a single-user install (so no daemon). It would also have max-jobs set to zero. Build servers would mount the Nix store over NFS. The storage server would list all the build servers as remote builders, and specify their stores using the new-ish experimental SSH store type with mounted stores. Ideally, kicking off a build on the storage server would cause jobs to be taken up by the build servers, and because they’re using mounted stores, Nix shouldn’t try to copy paths between the build servers and the storage server. Additionally, there shouldn’t be any traffic between builders since they’re all sharing the same store, so as long as locking works properly there’d be no duplicate downloads or builds of dependencies. The kicker is that to make all this fast I’m using NFS over RDMA with 200G (or 400G) InfiniBand.	16:09:23
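A rough sketch of what the storage server's configuration in the message above might look like, generated from Python so the pieces are easy to see. Everything concrete here is an assumption rather than something from the chat: the hostnames (`build1.example` etc.), the job counts, the `/etc/nix/machines` path, and the use of the `mounted-ssh-ng://` store scheme with the `mounted-ssh-store` experimental feature, which should be checked against the Nix manual for the version in use.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Builder:
    """One build server (all values here are hypothetical)."""
    host: str
    system: str = "x86_64-linux"
    max_jobs: int = 8
    speed_factor: int = 1


def machines_line(b: Builder, scheme: str = "mounted-ssh-ng") -> str:
    # Nix machines-file fields, in order: store URI, system, SSH key,
    # max jobs, speed factor. "-" leaves the SSH key at its default.
    return f"{scheme}://{b.host} {b.system} - {b.max_jobs} {b.speed_factor}"


def render(builders: Iterable[Builder]) -> str:
    # One builder per line, as the machines file expects.
    return "\n".join(machines_line(b) for b in builders)


if __name__ == "__main__":
    builders = [Builder(f"build{i}.example") for i in range(1, 4)]
    print("# nix.conf on the storage server (assumed settings)")
    print("max-jobs = 0  # never build locally, only dispatch")
    print("experimental-features = mounted-ssh-store")
    print("builders = @/etc/nix/machines")
    print()
    print("# /etc/nix/machines")
    print(render(builders))
```

Setting `max-jobs = 0` is what forces every job onto a remote builder; whether Nix then skips copying store paths depends on the mounted store type behaving as described in the message.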
connor (he/him)	I’ve also started packaging CUDA Library Samples (different from CUDA Samples) to serve as a sort of test suite for changes made to the package set: https://github.com/ConnorBaker/CUDALibrarySamples/tree/feat/cmake-rewrite	16:12:35
connor (he/him)	Absolute biggest pain in my ass right now is packaging onnxruntime (https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnxruntime/package.nix). For the ONNX and TensorRT ecosystem, I’m doing a cursed build where the CMake and Python builds are interlaced. Turns out doing a straight CMake build gives different results compared to doing a Python build. Go figure, the multi-thousand line Python scripts invoked by setup.py change how stuff is configured.	16:15:00
connor (he/him)	Here’s an example of the interlaced build: https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnx-tensorrt.nix Note that it does avoid building the library twice!	16:16:45
Gaétan Lepage	That looks so cool! It's a very good idea. It would be great to have a single entry point to this crazy setup. Multi-platform support would also be great (but probably quite hard).	16:28:31
Gaétan Lepage	Btw connor (he/him) (UTC-7) we are facing a super weird `onnx` issue in this PR. Basically, updating `torchmetrics` makes some random package further down the tree fail on `aarch64-linux`. In case you have some idea...	16:29:45
connor (he/him)	Yeah part of the reason I’m iterating in a separate repo for this stuff is because I can just say “SCREW THE OTHER PLATFORMS MUAHAHAHHAAH” (and also because I don’t have to re-evaluate nixpkgs on every change). I’ll try to take a look… but no promises :)	16:31:23
connor (he/him)	Okay I looked at it and have no idea 🤷‍♂️	16:32:07
connor (he/him)	The whole ONNX ecosystem is difficult to package for Nix because they all use both git submodules AND CMake’s FetchContent functionality, making it super difficult to build against the dependencies we already provide. For some dependencies, they build with flags we don’t, or have patches they apply before building, so it’s painful.	16:34:04
connor (he/him)	That’s partly why my packaging of onnxruntime involves rewriting some of their CMake files.	16:34:35
connor (he/him)	I also love that onnx by default builds with ONNX_ML=0 (disabling old APIs in favor of new ones), but various projects depend on it being set to one value or the other, so you could very easily end up with two copies of onnx, each configured with a different value for ONNX_ML.	16:36:22
connor (he/him)	God what a nightmare	16:36:36
connor (he/him)	We should also update OpenCV at some point to 4.10 if we haven’t already so it can build with CUDA 12.4+	16:38:16
connor (he/him)	Oh! Unrelated but this was a cute change I made that I quite like: https://github.com/ConnorBaker/cuda-packages/blob/c81a6595f07456c6cc34d8976031c4fa972a741f/cudaPackages-common/backendStdenv.nix#L36 Sets some defaults for the CUDA stdenv and adds a name prefix, similar to what the Python packaging does, for more descriptive store paths	16:40:07
Gaétan Lepage	Thanks for taking the time to look at it and explain all of this !	21:26:24
25 Oct 2024
	* connor (he/him) makes a sad noise	11:51:25
connor (he/him)	https://gist.github.com/ConnorBaker/6c9c522d46e4244eb33d2aad94c753b0	11:51:27
26 Oct 2024
Gaétan Lepage	🥲	20:34:04
Gaétan Lepage	[Image: clipboard.png]	20:34:07
27 Oct 2024
Moritz Sanft	Do any of you use `clangd` as a C LSP, with `cudatoolkit` coming from a shell? `clangd` doesn't seem to notice CUDA in that case, saying `Cannot find CUDA installation; ...`	07:54:44
Moritz Sanft	Also, does `cudatoolkit` miss a dependency on `gcc`, or am I mistaken by this error: `✖ nvcc main.cu gcc: No such file or directory` EDIT: No, it indeed seems to try and find GCC: [pid 46094] execve("/nix/store/j09h8v4ldx0ix547gjp0p2asj0asssaw-glib-2.80.4-bin/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory) [pid 46094] execve("/run/wrappers/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory) [pid 46094] execve("/home/msanft/.nix-profile/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory) [pid 46094] execve("/nix/profile/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory) [pid 46094] execve("/home/msanft/.local/state/nix/profile/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory) [pid 46094] execve("/etc/profiles/per-user/msanft/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory) [pid 46094] execve("/nix/var/nix/profiles/default/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory) [pid 46094] execve("/run/current-system/sw/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory)	07:57:18
SomeoneSerge (back on matrix)	In reply to @msanft:matrix.org Also, does `cudatoolkit` miss a dependency on `gcc`, or am I mistaken by this error: `✖ nvcc main.cu gcc: No such file or directory` [...] Yeah, we don't link gcc directly in nvcc but provide it independently via the overridden stdenv.	09:40:39
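The `execve` attempts in the trace are an ordinary `PATH` search: `nvcc` tries `gcc` in each directory in turn and gives up when none has it. A minimal Python sketch of that lookup, handy for checking a shell before invoking `nvcc`; the helper name `find_on_path` is made up for illustration and is not part of any CUDA or Nix tooling.

```python
import os
from typing import Optional


def find_on_path(name: str, path: str) -> Optional[str]:
    """Return the first executable called `name` in the PATH-style string
    `path`, or None if no directory provides one."""
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        # Mirrors the trace: each candidate either exists and is executable,
        # or the attempt fails (ENOENT) and the search moves on.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    found = find_on_path("gcc", os.environ.get("PATH", ""))
    if found is None:
        print("gcc is not on PATH; nvcc would fail with 'No such file or directory'")
    else:
        print(f"host compiler found at {found}")
```

If the lookup comes back empty, the host compiler has to be supplied by the environment (as the reply says, nixpkgs provides it through the overridden stdenv), or passed to `nvcc` explicitly with its `-ccbin` option.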
SomeoneSerge (back on matrix)	In reply to @glepage:matrix.org The topic is interesting. Imagine being able to have a massive decentralized build farm ! That would be amazing. Of course this is far from possible today (mostly because of nix limitations). A horror security-wise though xD	11:08:32
