!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

293 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



20 Oct 2024
@sielicki:matrix.orgsielicki
In reply to @ss:someonex.net
I think I'd vote pro propagation if we could say with some certainty that that is the only way to guarantee correctness for users of libcudart_static and of cmake's CUDA::cuda_driver (just because supporting that scope sounds doable)

Not specific to nixos, but just a rant from me: there's been a pretty large push around the cuda world for everyone to move to static libcudart... largely because with cuda 12 they introduced the minor version compatibility and "cuda enhanced compatibility" guarantees, and there are a lot of public statements (on github, etc.) from nvidia that suggest this is the safest way to distribute packages. All of this is really complicated and I don't fault projects for moving forward under this guidance, but I'm pretty confident that this does not cover all cases and you do still need to think about this stuff.

One example of where you still need to think about it: a lot of code uses the runtime API to resolve the driver API (through cudaGetDriverEntryPoint). The returned function pointers are given by min(linked_runtime_api_ver, actual_driver_version), exclusively. There's no automatic detection of another copy of libcudart in the same process that would allow for automatically matching the API version -- it's exclusively based on what you linked against compared to the driver version in use. (There's no way to implement API-level alignment between libraries in the same process; they would need a way to invalidate fnptrs they've already handed out when they suddenly encounter some new library in the process operating at a new version.)

This is a really easy way to run afoul of the cuda version mixing guidelines, and I feel like it's pretty underdiscussed and underdocumented. Those version mixing guidelines are still important, minor version compatibility does not save you, it's not the case that if they all start with "12" you don't have to think about it anymore.

03:12:17
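The version comparison described above can be probed from the Python side. This is an illustrative sketch, not from the chat: it queries the driver API version that libcuda reports, which is the value the runtime compares against the runtime version it was *linked* against when handing out driver entry points. It assumes a Linux host where libcuda.so.1 may or may not be present.

```python
# Illustrative probe (assumption: Linux; libcuda.so.1 may be absent).
import ctypes

def driver_api_version():
    """Return e.g. 12040 for driver API 12.4, or None if no driver is installed."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # no NVIDIA driver on this machine
    version = ctypes.c_int(0)
    # cuDriverGetVersion returns CUDA_SUCCESS (0) on success.
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    return version.value

print(driver_api_version())
```

Note that nothing here inspects other libcudart copies in the process: the effective entry-point version is decided purely by this driver version versus the statically linked runtime version, which is exactly the footgun being described.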
@sielicki:matrix.orgsielickiDon't get me started on pypi wheels, and the nuance between RPATH and RUNPATH, and so on03:13:08
@connorbaker:matrix.orgconnor (he/him)
In reply to @ss:someonex.net

a footgun people keep firing,

True

autoAddDriverRunpath

Yes and no. Yes because that'd definitely make one-off and our own contributions easier. No because once we start propagating it we lose the knowledge of which packages actually need to be patched. It still seems to me that we don't have to patch most packages, because they call cudart and cudart itself is patchelfed. Maybe yes because I'm unsure what happens with libcudart_static.

autoPatchelfHook

I'd be rather strongly opposed to this one. Autopatchelf is a huge hammer, coarse and imprecise. It can actually erase correct runpaths from an originally correct binary. Let's reserve it for non

Another important thing to consider is (here we go again) whether we want to keep both backendStdenv and the hook and which of these things should be propagating what

My favorite functionality autoPatchelfHook has is that it will error on unresolved dependencies — I could live without the actual patching, I suppose, but I really like using it to check that all the libraries I need are in scope.
Any ideas if such functionality already exists in Nixpkgs or would be a useful check?
07:30:53
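The check being asked about can be sketched without any patching at all. A minimal sketch (assumptions: a Linux host with ldd on PATH; autoPatchelfHook's actual implementation works differently, parsing ELF headers itself):

```python
# Report NEEDED shared libraries the dynamic linker cannot resolve for a
# given ELF file -- the "error on unresolved dependencies" part of
# autoPatchelfHook, without the rewriting part.
import subprocess

def unresolved_libs(elf_path):
    """Return the names of shared libraries ldd reports as 'not found'."""
    out = subprocess.run(["ldd", elf_path], capture_output=True, text=True)
    return [line.split()[0]
            for line in out.stdout.splitlines()
            if "not found" in line]

# An ordinary system binary should have all its dependencies in scope:
print(unresolved_libs("/bin/sh"))
```

Run against a build output, a non-empty list flags exactly the "library not in scope" situation described, while leaving correct runpaths untouched.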
@alex_nordin:matrix.orgalex_nordin joined the room.18:27:40
22 Oct 2024
@connorbaker:matrix.orgconnor (he/him) Had a learning moment today: was wondering why all the traffic between my remote builders, which are all networked together in the same room, kept going through the router instead of the switch. Well, turns out if you don't specify a mask with Address in a systemd.network configuration, it defaults to /32... so all the builders thought they were alone on their network, and all traffic to other IP addresses went through the gateway (my router). 20:59:13
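The fix implied above can be sketched as a systemd.network fragment; the interface name and addresses here are invented for illustration:

```
# /etc/systemd/network/10-lan.network (illustrative; names/addresses made up)
[Match]
Name=enp1s0

[Network]
# Without the explicit /24 prefix this is treated as /32, so every peer --
# even one on the same switch -- is reached via the gateway.
Address=192.168.1.10/24
Gateway=192.168.1.1
```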
@ss:someonex.netSomeoneSerge (back on matrix)Added the notice about the nix-community cache at https://github.com/SomeoneSerge/nixpkgs-cuda-ci/?tab=readme-ov-file#nixpkgs-cuda-ci, still can't catch a moment for the discourse post21:16:44
@ss:someonex.netSomeoneSerge (back on matrix)
In reply to @connorbaker:matrix.org
Had a learning moment today: was wondering why all the traffic between my remote builders, which are all networked together in the same room, kept going through the router instead of the switch. Well, turns out if you don't specify a mask with Address in a systemd.network configuration, it defaults to /32... so all the builders thought they were alone on their network, and all traffic to other IP addresses went through the gateway (my router).
tbh it's incredible how easy it is to screw up with networking
21:23:55
23 Oct 2024
@glepage:matrix.orgGaétan Lepage Good to know indeed...
Is your builder configuration public connor (he/him) (UTC-7) ?
06:11:07
@connorbaker:matrix.orgconnor (he/him)
In reply to @glepage:matrix.org
Good to know indeed...
Is your builder configuration public connor (he/him) (UTC-7) ?
https://github.com/ConnorBaker/nixos-configs
06:30:11
@glepage:matrix.orgGaétan Lepage I caught up on https://github.com/nix-community/infra/issues/1343.
I think that infra access could be a major booster for people to participate, especially in nixpkgs.
Enabling more contributors to run nixpkgs-review on (ideally) all platforms in a reasonable time could avoid a lot of package breakage and help PRs advance faster.
06:37:28
@glepage:matrix.orgGaétan Lepage The topic is interesting. Imagine being able to have a massive decentralized build farm! That would be amazing.
Of course this is far from possible today (mostly because of nix limitations).
06:39:26
@connorbaker:matrix.orgconnor (he/him) The opposite of decentralized, but I’ve been trying to set up Azure instances which all share the same Nix store over NFS.
You’d have a storage server with a lot of disk (or RAM if you’re putting the store in memory) and a bunch of build servers. The storage server would be the only machine with Nix installed, and it would be a single user install (so no daemon). It would also have max-jobs set to zero.
Build servers would mount the Nix store over NFS.
The storage server would list all the build servers as remote builders, and specify their stores as the new-ish experimental ssh builder with mounted stores.
Ideally, kicking off a build on the storage server would cause jobs to be taken up by the build servers, and because they’re using mounted stores, Nix shouldn’t try to copy paths to/from build servers and the storage server. Additionally, there shouldn’t be any traffic between builders since they’re all sharing the same store, so as long as locking works properly there’d be no duplicate downloads or builds of dependencies.
The kicker is that to make all this fast I’m using NFS over RDMA with 200G (or 400G) InfiniBand.
16:09:23
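The storage-server side of the setup described above might look roughly like the following fragments. Hostnames, job counts, and mount options are placeholders; the builders field layout follows the nix.conf format (URI, systems, SSH key, max jobs, speed factor, supported features):

```
# nix.conf on the storage server (sketch; hostnames/counts invented)
max-jobs = 0
builders = ssh://builder1 x86_64-linux - 32 1 big-parallel ; ssh://builder2 x86_64-linux - 32 1 big-parallel

# /etc/fstab on each build server: mount the shared store over NFS/RDMA
# (20049 is the conventional NFS-over-RDMA port)
storage:/nix  /nix  nfs  proto=rdma,port=20049,nofail  0 0
```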
@connorbaker:matrix.orgconnor (he/him)I’ve also started packaging CUDA Library Samples (different than CUDA Samples) to serve as a sort of test suite for changes made to the package set https://github.com/ConnorBaker/CUDALibrarySamples/tree/feat/cmake-rewrite16:12:35
@connorbaker:matrix.orgconnor (he/him) Absolute biggest pain in my ass right now is packaging onnxruntime (https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnxruntime/package.nix).
For the ONNX and TensorRT ecosystem, I’m doing a cursed build where the CMake and Python builds are interlaced.
Turns out doing a straight CMake build gives different results compared to doing a Python build. Go figure, the multi-thousand line Python scripts invoked by setup.py change how stuff is configured.
16:15:00
@connorbaker:matrix.orgconnor (he/him)

Here’s an example of the interlaced build: https://github.com/ConnorBaker/cuda-packages/blob/main/cudaPackages-common/onnx-tensorrt.nix

Note that it does avoid building the library twice!

16:16:45
@glepage:matrix.orgGaétan Lepage That looks so cool! It's a very good idea.
It would be great to have a single entry point to this crazy setup. Multi-platform support would also be great (but probably quite hard).
16:28:31
@glepage:matrix.orgGaétan Lepage Btw connor (he/him) (UTC-7) we are facing a super weird onnx issue in this PR.
Basically, updating torchmetrics makes some random package further down the tree fail on aarch64-linux.
In case you have some idea...
16:29:45
@connorbaker:matrix.orgconnor (he/him) Yeah part of the reason I’m iterating in a separate repo for this stuff is because I can just say “SCREW THE OTHER PLATFORMS MUAHAHAHHAAH” (and also because I don’t have to re-evaluate nixpkgs on every change).
I’ll try to take a look… but no promises :)
16:31:23
@connorbaker:matrix.orgconnor (he/him) Okay I looked at it and have no idea 🤷‍♂️ 16:32:07
@connorbaker:matrix.orgconnor (he/him)The whole ONNX ecosystem is difficult to package for Nix because they all use both git submodules AND CMake’s fetchcontent functionality, making it super difficult to package with stuff we already provide. For some packages, they build with flags we don’t, or have patches they apply before building, so it’s painful.16:34:04
@connorbaker:matrix.orgconnor (he/him)That’s partly why my packaging of Onnxruntime involves rewriting some of their CMake files16:34:35
@connorbaker:matrix.orgconnor (he/him)I also love that onnx by default builds with ONNX_ML=0 (disabling old APIs in favor of new ones), but various projects depend on it being set to one value or the other, so you could very easily end up with two copies of onnx, each configured with a different value for ONNX_ML.16:36:22
@connorbaker:matrix.orgconnor (he/him)God what a nightmare16:36:36
@connorbaker:matrix.orgconnor (he/him)We should also update OpenCV at some point to 4.10 if we haven’t already so it can build with CUDA 12.4+16:38:16
@connorbaker:matrix.orgconnor (he/him) Oh! Unrelated but this was a cute change I made that I quite like: https://github.com/ConnorBaker/cuda-packages/blob/c81a6595f07456c6cc34d8976031c4fa972a741f/cudaPackages-common/backendStdenv.nix#L36
Sets some defaults for the CUDA stdenv and adds a name prefix, similar to what the Python packaging does, for more descriptive store paths
16:40:07
@glepage:matrix.orgGaétan LepageThanks for taking the time to look at it and explain all of this !21:26:24
25 Oct 2024
* @connorbaker:matrix.orgconnor (he/him) makes a sad noise11:51:25
@connorbaker:matrix.orgconnor (he/him)https://gist.github.com/ConnorBaker/6c9c522d46e4244eb33d2aad94c753b0 11:51:27
26 Oct 2024
@glepage:matrix.orgGaétan Lepage🥲20:34:04
@glepage:matrix.orgGaétan Lepage [image: clipboard.png]
20:34:07
