!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

290 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda
57 Servers

23 Oct 2024
@glepage:matrix.org Gaétan Lepage The topic is interesting. Imagine being able to have a massive decentralized build farm! That would be amazing.
Of course, this is far from possible today (mostly because of Nix limitations).
06:39:26
@connorbaker:matrix.org connor (he/him) The opposite of decentralized, but I’ve been trying to set up Azure instances which all share the same Nix store over NFS.
You’d have a storage server with a lot of disk (or RAM if you’re putting the store in memory) and a bunch of build servers. The storage server would be the only machine with Nix installed, and it would be a single-user install (so no daemon). It would also have max-jobs set to zero.
Build servers would mount the Nix store over NFS.
The storage server would list all the build servers as remote builders, and point at their stores using the new-ish experimental SSH store type with mounted stores.
Ideally, kicking off a build on the storage server would cause jobs to be taken up by the build servers, and because they’re using mounted stores, Nix shouldn’t try to copy paths between the build servers and the storage server. Additionally, there shouldn’t be any traffic between builders since they’re all sharing the same store, so as long as locking works properly there’d be no duplicate downloads or builds of dependencies.
The kicker is that to make all this fast I’m using NFS over RDMA with 200G (or 400G) InfiniBand.
16:09:23
@connorbaker:matrix.org connor (he/him) I’ve also started packaging CUDA Library Samples (different from CUDA Samples) to serve as a sort of test suite for changes made to the package set: https://github.com/ConnorBaker/CUDALibrarySamples/tree/feat/cmake-rewrite
16:12:35
@connorbaker:matrix.org connor (he/him) Absolute biggest pain in my ass right now is packaging onnxruntime (https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnxruntime/package.nix).
For the ONNX and TensorRT ecosystem, I’m doing a cursed build where the CMake and Python builds are interlaced.
Turns out doing a straight CMake build gives different results compared to doing a Python build. Go figure, the multi-thousand-line Python scripts invoked by setup.py change how stuff is configured.
16:15:00
@connorbaker:matrix.org connor (he/him)

Here’s an example of the interlaced build: https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnx-tensorrt.nix

Note that it does avoid building the library twice!

16:16:45
@glepage:matrix.org Gaétan Lepage That looks so cool! It's a very good idea.
It would be great to have a single entry point to this crazy setup. Multi-platform support would also be great (but probably quite hard).
16:28:31
@glepage:matrix.org Gaétan Lepage Btw connor (he/him) (UTC-7), we are facing a super weird onnx issue in this PR.
Basically, updating torchmetrics makes some random package further down the tree fail on aarch64-linux.
In case you have any ideas...
16:29:45
@connorbaker:matrix.org connor (he/him) Yeah, part of the reason I’m iterating in a separate repo for this stuff is because I can just say “SCREW THE OTHER PLATFORMS MUAHAHAHHAAH” (and also because I don’t have to re-evaluate nixpkgs on every change).
I’ll try to take a look… but no promises :)
16:31:23
@connorbaker:matrix.org connor (he/him) Okay, I looked at it and have no idea 🤷‍♂️
16:32:07
@connorbaker:matrix.org connor (he/him) The whole ONNX ecosystem is difficult to package for Nix because they all use both git submodules AND CMake’s FetchContent functionality, making it super difficult to package with stuff we already provide. For some dependencies, they build with flags we don’t, or apply patches before building, so it’s painful.
16:34:04
@connorbaker:matrix.org connor (he/him) That’s partly why my packaging of onnxruntime involves rewriting some of their CMake files
16:34:35
@connorbaker:matrix.org connor (he/him) I also love that onnx by default builds with ONNX_ML=0 (disabling old APIs in favor of new ones), but various projects depend on it being set to one value or the other, so you could very easily end up with two copies of onnx, each configured with a different value for ONNX_ML.
16:36:22
@connorbaker:matrix.org connor (he/him) God what a nightmare
16:36:36
@connorbaker:matrix.org connor (he/him) We should also update OpenCV to 4.10 at some point, if we haven’t already, so it can build with CUDA 12.4+
16:38:16
@connorbaker:matrix.org connor (he/him) Oh! Unrelated but this was a cute change I made that I quite like: https://github.com/ConnorBaker/cuda-packages/blob/c81a6595f07456c6cc34d8976031c4fa972a741f/cudaPackages-common/backendStdenv.nix#L36
Sets some defaults for the CUDA stdenv and adds a name prefix, similar to what the Python packaging does, for more descriptive store paths.
16:40:07
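The name-prefix idea can be illustrated without the linked file. This is a minimal sketch, assuming a plain nixpkgs checkout; `namePrefix`, `withCudaPrefix`, and the choice of `hello` are hypothetical and only show the effect on the store path, not how the linked backendStdenv.nix actually implements it.

```nix
# Hypothetical illustration; the linked backendStdenv.nix may work differently.
{ pkgs ? import <nixpkgs> { } }:

let
  # Hypothetical prefix, in the spirit of the "python3.x-" prefix that
  # Python packages carry in their store paths.
  namePrefix = "cuda12.4-";

  # Re-evaluate the derivation with a prefixed pname, so the store path
  # should read /nix/store/<hash>-cuda12.4-<pname>-<version>.
  withCudaPrefix = drv:
    drv.overrideAttrs (old: { pname = namePrefix + old.pname; });
in
withCudaPrefix pkgs.hello
```

Doing this once in a shared stdenv wrapper, rather than per package as above, is presumably what makes the approach convenient for a whole package set.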
@glepage:matrix.org Gaétan Lepage Thanks for taking the time to look at it and explain all of this!
21:26:24
25 Oct 2024
* @connorbaker:matrix.org connor (he/him) makes a sad noise
11:51:25
@connorbaker:matrix.org connor (he/him) https://gist.github.com/ConnorBaker/6c9c522d46e4244eb33d2aad94c753b0
11:51:27
26 Oct 2024
@glepage:matrix.org Gaétan Lepage 🥲
20:34:04
@glepage:matrix.org Gaétan Lepage clipboard.png
20:34:07
27 Oct 2024
@msanft:matrix.org Moritz Sanft Does any of you use clangd as a C LSP, with cudatoolkit coming from a shell? clangd doesn't seem to notice CUDA in that case, saying "Cannot find CUDA installation; ..."
07:54:44
@msanft:matrix.org Moritz Sanft Also, is cudatoolkit missing a dependency on gcc, or am I misreading this error:
✖ nvcc main.cu 
gcc: No such file or directory

EDIT: No, it indeed seems to try and find GCC:
[pid 46094] execve("/nix/store/j09h8v4ldx0ix547gjp0p2asj0asssaw-glib-2.80.4-bin/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory)
[pid 46094] execve("/run/wrappers/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory)
[pid 46094] execve("/home/msanft/.nix-profile/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory)
[pid 46094] execve("/nix/profile/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory)
[pid 46094] execve("/home/msanft/.local/state/nix/profile/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory)
[pid 46094] execve("/etc/profiles/per-user/msanft/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory)
[pid 46094] execve("/nix/var/nix/profiles/default/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory)
[pid 46094] execve("/run/current-system/sw/bin/gcc", ["gcc", "-E", "/tmp/tmpxft_0000b40d_00000000-2."...], 0x7ffe40c921b0 /* 97 vars */) = -1 ENOENT (No such file or directory)
07:57:18
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @msanft:matrix.org
Also, is cudatoolkit missing a dependency on gcc, or am I misreading this error:
✖ nvcc main.cu 
gcc: No such file or directory
Yeah, we don't link gcc directly into nvcc; we provide it independently via the overridden stdenv.
09:40:39
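To make the reply above concrete, here is a minimal sketch of a development shell that takes its compiler from the CUDA stdenv, so that nvcc can find a host gcc on PATH. It assumes a nixpkgs revision where `cudaPackages.backendStdenv`, `cudaPackages.cuda_nvcc`, and `cudaPackages.cuda_cudart` exist and that unfree packages are allowed; the file name shell.nix is illustrative, and this is not necessarily how anyone in the room set up their shell.

```nix
# shell.nix (illustrative sketch, not taken from the conversation)
{ pkgs ? import <nixpkgs> { config.allowUnfree = true; } }:

let
  inherit (pkgs) cudaPackages;
in
# mkShell accepts a `stdenv` argument; overriding it swaps the default
# compiler toolchain for the one the CUDA package set is built with.
(pkgs.mkShell.override { stdenv = cudaPackages.backendStdenv; }) {
  packages = [
    cudaPackages.cuda_nvcc   # nvcc itself; it relies on a host compiler from the environment
    cudaPackages.cuda_cudart # CUDA runtime headers and libraries
  ];
}
```

Entering this shell with nix-shell and running `nvcc main.cu` should then resolve gcc from the stdenv instead of failing with ENOENT on every PATH entry.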
@ss:someonex.net SomeoneSerge (back on matrix)
In reply to @glepage:matrix.org
The topic is interesting. Imagine being able to have a massive decentralized build farm! That would be amazing.
Of course, this is far from possible today (mostly because of Nix limitations).
A horror security-wise though xD
11:08:32
