!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

336 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda
64 Servers



20 Oct 2024
@sielicki:matrix.orgsielicki
In reply to @ss:someonex.net
I think I'd vote pro propagation if we could say with some certainty, that that is the only way to guarantee correctness for users of libcudart_static and of cmake's CUDA::cuda_driver (just because supporting that scope sounds doable)

Not specific to nixos, but just a rant from me: there's been a pretty large push around the cuda world for everyone to move to static libcudart... largely because with cuda 12 they introduced the minor version compatibility and "cuda enhanced compatibility" guarantees, and there are a lot of public statements (on github, etc.) from nvidia that suggest this is the safest way to distribute packages. All of this is really complicated and I don't fault projects for moving forward under this guidance, but I'm pretty confident that this does not cover all cases and you do still need to think about this stuff.

One example of where you still need to think about it: a lot of code uses the runtime API to resolve the driver API (through cudaGetDriverEntryPoint). The API version of the returned function pointers is min(linked_runtime_api_ver, actual_driver_version), and nothing else feeds into it. There's no automatic detection of another copy of libcudart in the same process that would allow for automatically matching the API version -- it's based exclusively on what you linked against compared to the driver version in use. (There's no way to implement API-level alignment between libraries in the same process; they would need a way to invalidate fnptrs they've already handed out when they suddenly encounter some new library in the process operating at a new version.)

This is a really easy way to run afoul of the cuda version mixing guidelines, and I feel like it's pretty underdiscussed and underdocumented. Those version mixing guidelines are still relevant, dammit! It's not magic!

03:10:26
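
A toy model of the version selection described in the message above, in case it helps to see it spelled out. This is only a sketch: the `min(...)` rule is taken from the message as stated, the `Library` class and all function names are hypothetical, and nothing here calls into CUDA. Versions use the usual CUDA integer encoding (1000 * major + 10 * minor, so 12.2 is 12020).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Library:
    """Hypothetical stand-in for a shared object with its own copy of libcudart."""
    name: str
    linked_runtime_version: int  # e.g. 12020 for CUDA 12.2


def resolved_api_version(lib: Library, driver_version: int) -> int:
    # Assumption from the chat: each library gets
    # min(linked_runtime_api_ver, actual_driver_version),
    # independent of anything else loaded in the process.
    return min(lib.linked_runtime_version, driver_version)


def versions_in_process(libs, driver_version):
    # Map each library name to the driver API version it would resolve.
    return {lib.name: resolved_api_version(lib, driver_version) for lib in libs}


def is_mixed(libs, driver_version) -> bool:
    # True when libraries in one process end up at different API versions.
    return len(set(versions_in_process(libs, driver_version).values())) > 1


driver = 12040  # hypothetical installed driver supporting CUDA 12.4
libs = [Library("libfoo.so", 12020), Library("libbar.so", 12040)]

print(versions_in_process(libs, driver))
print(is_mixed(libs, driver))
```

With a 12.4 driver, the two libraries in this sketch resolve different API versions even though both "start with 12". With an older driver (say 12000), both would be clamped to the driver version and the mismatch would disappear, which is part of why this is easy to miss in testing.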
@sielicki:matrix.orgsielicki
*

This is a really easy way to run afoul of the cuda version mixing guidelines, and I feel like it's pretty underdiscussed and underdocumented. Those version mixing guidelines are still important: minor version compatibility does not save you, and it's not the case that if they all start with "12" you don't have to think about it anymore.

03:12:17
@sielicki:matrix.orgsielicki
Don't get me started on PyPI wheels, and the nuance between RPATH and RUNPATH, and so on
03:13:08
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)
In reply to @ss:someonex.net

a footgun people keep firing,

True

autoAddDriverRunpath

Yes and no. Yes because that'd definitely make one-off and our own contributions easier. No because once we start propagating it we lose the knowledge of which packages actually need to be patched. It still seems to me that most packages we don't have to patch because they call cudart and cudart is patchelfed. Maybe yes because I'm unsure what happens with libcudart_static.

autoPatchelfHook

I'd be rather strongly opposed to this one. Autopatchelf is a huge hammer, coarse and imprecise. It can actually erase correct runpaths from an originally correct binary. Let's reserve it for non

Another important thing to consider is (here we go again) whether we want to keep both backendStdenv and the hook and which of these things should be propagating what

My favorite functionality autoPatchelfHook has is that it will error on unresolved dependencies — I could live without the actual patching, I suppose, but I really like using it to check that all the libraries I need are in scope.
Any ideas if such functionality already exists in Nixpkgs or would be a useful check?
07:30:53
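
A rough sketch of what a check-only variant of that idea could look like: report unresolved libraries without patching anything. This is not how autoPatchelfHook works internally and not an existing Nixpkgs check; the function names are hypothetical. It assumes the text format printed by `readelf -d`, and it only looks at the binary's own RPATH/RUNPATH entries (no `LD_LIBRARY_PATH`, no default loader paths, no `$ORIGIN` expansion, no transitive dependencies).

```python
import os
import re
import tempfile

# Patterns for the relevant lines of `readelf -d` output.
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[(.+?)\]")
RUNPATH_RE = re.compile(r"\((?:RUNPATH|RPATH)\)\s+Library (?:runpath|rpath): \[(.*?)\]")


def parse_dynamic_section(readelf_output: str):
    """Return (needed sonames, search directories) from `readelf -d` text."""
    needed = NEEDED_RE.findall(readelf_output)
    search_dirs = []
    for entry in RUNPATH_RE.findall(readelf_output):
        search_dirs.extend(d for d in entry.split(":") if d)
    return needed, search_dirs


def unresolved(needed, search_dirs):
    """List the sonames that are not present in any of the search directories."""
    return [
        lib
        for lib in needed
        if not any(os.path.exists(os.path.join(d, lib)) for d in search_dirs)
    ]


# Demo on made-up input: one library is present in the runpath, one is not.
with tempfile.TemporaryDirectory() as libdir:
    open(os.path.join(libdir, "libcudart.so.12"), "w").close()
    sample = (
        " 0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.12]\n"
        " 0x0000000000000001 (NEEDED)             Shared library: [libcuda.so.1]\n"
        f" 0x000000000000001d (RUNPATH)            Library runpath: [{libdir}]\n"
    )
    needed, dirs = parse_dynamic_section(sample)
    missing = unresolved(needed, dirs)

print(missing)
```

In a real check the input would come from running `readelf -d` on each ELF file in the output, and a non-empty result would fail the build. Driver libraries such as `libcuda.so.1` would presumably need an allowlist, since they are not expected to be in the store at build time.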


