
NixOS CUDA

CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



19 Oct 2024
hacker1024 [12:57:52]: They have long lists of errata
hexa (UTC+1) [12:58:00]: yeah, worth a try
hexa (UTC+1) [12:58:03]: can always roll back
hexa (UTC+1) [12:58:58]: thx
connor (he/him) (UTC-7) [19:09:59]: I keep forgetting I added myself as a maintainer to glibc until I get emails for reviews lmao
connor (he/him) (UTC-7) [19:22:47]: SomeoneSerge (utc+3), Gaétan Lepage: thoughts on having backendStdenv automatically propagate autoAddDriverRunpath and autoPatchelfHook? I feel like forgetting to add the former is a footgun people keep firing, and the latter is a great check to make sure all your dependencies are either present or explicitly ignored.
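For concreteness, a minimal sketch of what that propagation could look like, assuming mkDerivation is called with a plain attribute set; cudaPackages.backendStdenv, autoAddDriverRunpath and autoPatchelfHook are real nixpkgs attributes, but the wrapping below is illustrative, not the actual backendStdenv implementation:

```nix
# Illustrative sketch only: a stdenv whose mkDerivation implicitly adds the
# two hooks, so individual CUDA packages no longer have to remember them.
{ cudaPackages, autoAddDriverRunpath, autoPatchelfHook }:

let
  base = cudaPackages.backendStdenv;
in
base // {
  mkDerivation = args:
    base.mkDerivation (args // {
      nativeBuildInputs = (args.nativeBuildInputs or [ ]) ++ [
        autoAddDriverRunpath  # add the /run/opengl-driver runpath at fixup
        autoPatchelfHook      # fail on (or patch) unresolved shared libraries
      ];
    });
}
```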
Gaétan Lepage [19:24:29]: I am not sure I am qualified to answer this properly. From my point of view, these kinds of automation do help and avoid sneaky mistakes.
hexa (UTC+1) [21:27:27]: hacker1024: I think your recommendation was spot on
hexa (UTC+1) [21:27:36]: at 22% I see the first [pt_main_thread] instances
hexa (UTC+1) [21:27:50]: and they don't seem to crash with microcode updates applied
hexa (UTC+1) [21:28:06]: wow, I hope that makes python-updates much smoother in the future
SomeoneSerge (utc+3) [21:29:49]:
> a footgun people keep firing,
True

> autoAddDriverRunpath
Yes and no. Yes, because that'd definitely make one-off and our own contributions easier. No, because once we start propagating it we lose the knowledge of which packages actually need to be patched. It still seems to me that we don't have to patch most packages, because they call cudart and cudart is patchelfed. Maybe yes, because I'm unsure what happens with libcudart_static.

> autoPatchelfHook
I'd be rather strongly opposed to this one. Autopatchelf is a huge hammer, coarse and imprecise. It can actually erase correct runpaths from an originally correct binary. Let's reserve it for non

Another important thing to consider is (here we go again) whether we want to keep both backendStdenv and the hook, and which of these things should be propagating what.
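For context, the explicit status quo being weighed here is roughly the following per-package pattern; the package itself is hypothetical, while autoAddDriverRunpath and cudaPackages.cuda_cudart are the real nixpkgs attributes:

```nix
# Sketch of the current opt-in pattern: the hook is listed only by packages
# known to need the driver runpath themselves (e.g. because they dlopen
# libcuda), rather than being propagated to everything.
{ stdenv, cudaPackages, autoAddDriverRunpath }:

stdenv.mkDerivation {
  pname = "example-cuda-app";  # hypothetical package
  version = "0.1";
  src = ./.;

  nativeBuildInputs = [
    autoAddDriverRunpath       # appends the NixOS driver lib dir to ELF runpaths at fixup
  ];

  buildInputs = [
    cudaPackages.cuda_cudart   # dynamic cudart; libcudart_static is the open question above
  ];
}
```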
SomeoneSerge (utc+3) [21:32:25]: Even right now there's something cursed going on that e.g. triggers a "rebuild" of libcublas when you override nvcc, and I think that happens because of the propagated hook. That, at the least, is surprising behaviour.
SomeoneSerge (utc+3) [21:38:08]: Another "no" for autoAddDriverRunpath is that it's also just not enough anyway, because of python and all other sorts of projects where we're forced to use wrappers instead
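The wrapper fallback mentioned here, for Python programs where patching ELF runpaths is not enough, usually amounts to something like the sketch below; the package name is hypothetical, and /run/opengl-driver/lib is the conventional NixOS driver path:

```nix
# Illustrative sketch: instead of an ELF runpath, the generated entry-point
# script gets the driver directory prepended to LD_LIBRARY_PATH at wrap time.
{ python3Packages }:

python3Packages.buildPythonApplication {
  pname = "example-cuda-tool";  # hypothetical
  version = "0.1";
  src = ./.;

  # honoured by the Python wrapping machinery when it wraps installed scripts
  makeWrapperArgs = [
    "--prefix LD_LIBRARY_PATH : /run/opengl-driver/lib"
  ];
}
```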
SomeoneSerge (utc+3) [21:42:47]: I think I'd vote pro propagation if we could say with some certainty that that is the only way to guarantee correctness for users of libcudart_static and of cmake's CUDA::cuda_driver (just because supporting that scope sounds doable)
SomeoneSerge (utc+3) [21:45:32]: hexa (UTC+1): I've been playing with (s)ccache lately and I'm almost amazed we're not using it more widely when building derivations for tests rather than for direct consumption
SomeoneSerge (utc+3) [21:46:18]:
> not using it more widely
Which I'm guessing based on the fact that the infrastructure for this is rather lacking
hexa (UTC+1) [21:46:47]: where would results be stored?
hexa (UTC+1) [21:48:49]: caching of intermediate build steps would also be super helpful 😭
SomeoneSerge (utc+3) [21:48:52]: Same as usual: derivation outputs would be stored in the nix store, and they'd never be confused with the "pure" ones because they'd have different hashes. The (s)ccache directory would have to be set up on each builder.
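A rough sketch of the kind of builder setup being described, using the ccache support that already exists in nixpkgs/NixOS; the overridden package is hypothetical and the options should be treated as illustrative rather than a recipe:

```nix
# Illustrative NixOS builder configuration: a shared ccache directory exposed
# to the build sandbox, plus a test-only rebuild of a package with
# ccacheStdenv. Outputs still land in the Nix store under their own hashes,
# distinct from the "pure" builds.
{ pkgs, ... }:

{
  programs.ccache = {
    enable = true;
    cacheDir = "/var/cache/ccache";  # must exist and be writable on each builder
  };

  # let sandboxed builds see the cache directory
  nix.settings.extra-sandbox-paths = [ "/var/cache/ccache" ];

  environment.systemPackages = [
    # hypothetical: rebuild a package with a caching compiler, for tests only
    (pkgs.some-cuda-package.override { stdenv = pkgs.ccacheStdenv; })
  ];
}
```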
SomeoneSerge (utc+3) [21:49:04], in reply to hexa (UTC+1):
> caching of intermediate build steps would also be super helpful 😭
Yes, but that's a much bigger refactor
SomeoneSerge (utc+3) [21:50:21]: Also, I feel like when we do that, Nixpkgs will effectively depend on the host having a CoW file system, because the alternative sounds too I/O-intensive
20 Oct 2024
sielicki [03:12:17], in reply to SomeoneSerge (utc+3):
> I think I'd vote pro propagation if we could say with some certainty that that is the only way to guarantee correctness for users of libcudart_static and of cmake's CUDA::cuda_driver (just because supporting that scope sounds doable)

Not specific to NixOS, but just a rant from me: there's been a pretty large push around the CUDA world for everyone to move to static libcudart, largely because with CUDA 12 they introduced the minor version compatibility and "CUDA enhanced compatibility" guarantees, and there are a lot of public statements (on GitHub, etc.) from NVIDIA suggesting this is the safest way to distribute packages. All of this is really complicated and I don't fault projects for moving forward under this guidance, but I'm pretty confident that it does not cover all cases and you do still need to think about this stuff.

One example of where you still need to think about it: a lot of code uses the runtime API to resolve the driver API (through cudaGetDriverEntryPoint). The returned function pointers are given by min(linked_runtime_api_ver, actual_driver_version), exclusively. There's no automatic detection of another copy of libcudart in the same process that would allow the API version to be matched automatically; it's based exclusively on what you linked against compared to the driver version in use. (There's no way to implement API-level alignment between libraries in the same process; they would need a way to invalidate function pointers they've already handed out when they suddenly encounter some new library in the process operating at a new version.)

This is a really easy way to run afoul of the CUDA version-mixing guidelines, and I feel like it's pretty underdiscussed and underdocumented. Those version-mixing guidelines are still important; minor version compatibility does not save you, and it's not the case that you can stop thinking about it just because all the versions start with "12".
sielicki [03:13:08]: Don't get me started on pypi wheels, and the nuance between RPATH and RUNPATH, and so on
connor (he/him) (UTC-7) [07:30:53], in reply to SomeoneSerge (utc+3):
> autoPatchelfHook: I'd be rather strongly opposed to this one. Autopatchelf is a huge hammer, coarse and imprecise. It can actually erase correct runpaths from an originally correct binary.

My favorite piece of autoPatchelfHook functionality is that it errors on unresolved dependencies. I could live without the actual patching, I suppose, but I really like using it to check that all the libraries I need are in scope.
Any idea whether such functionality already exists in Nixpkgs, or whether it would be a useful check?
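One way to approximate that check-only behaviour today might be a plain fixup step like the sketch below: isELF comes from stdenv's setup script, but the check itself is illustrative and not an existing hook, and ldd on cross-built binaries would need more care:

```nix
# Illustrative check-only alternative to autoPatchelfHook: fail the build if
# any ELF file in $out has NEEDED libraries the dynamic loader cannot
# resolve, without rewriting any runpaths.
{
  postFixup = ''
    echo "checking for unresolved shared libraries in $out"
    while IFS= read -r -d "" f; do
      if isELF "$f" && ldd "$f" 2>/dev/null | grep -q "not found"; then
        echo "unresolved dependencies in $f:" >&2
        ldd "$f" | grep "not found" >&2
        exit 1
      fi
    done < <(find "$out" -type f -print0)
  '';
}
```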
alex_nordin joined the room. [18:27:40]


