
NixOS CUDA

197 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



27 Feb 2023
@connorbaker:matrix.orgconnor (he/him)
In reply to @ss:someonex.net

RE: 3.

Btw, did you find any documentation for this env attribute?

* Seems like it’s going to be an upcoming change: there’s a new __structuredAttrs Boolean attribute stdenv can take (I think). When true, some additional machinery runs in the background to ensure that when you export or string-interpolate certain variables, they come through to bash in a sensible form.
This was kind of handy for an example: https://nixos.mayflower.consulting/blog/2020/01/20/structured-attrs/
It is disabled by default currently though!
11:28:08
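A minimal sketch of what opting in might look like, going off the blog post above (details here are assumptions, not a verified recipe):

  stdenv.mkDerivation {
    pname = "structured-attrs-demo";  # hypothetical package
    version = "0.1";
    __structuredAttrs = true;
    dontUnpack = true;
    # With structured attrs, lists reach the builder as real bash arrays
    # instead of one space-separated string, so elements with spaces survive.
    configureFlags = [ "--with-option=some value" ];
    env.FOO = "bar";  # entries of `env` become ordinary environment variables
    installPhase = ''echo "$FOO" > $out'';
  }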
@connorbaker:matrix.orgconnor (he/him)
In reply to @ss:someonex.net

RE: Caching "single build that supports all capabilities" vs "multiple builds that support individual cuda architectures"

Couldn't find an issue tracking this, so I'll drop a message here.
The more precise argument in favour of building for individual capabilities is easier maintenance and nixpkgs development.
When working on master it's desirable to only build for your own arch, but currently it means a cache-miss for transitive dependencies.
For example, you work on torchvision and you import nixpkgs with config.cudaCapabilities = [ "8.6" ]. Snap! You're rebuilding pytorch, you cancel, you write a custom shell that overrides torchvision specifically, you remove asserts, etc.

Alternative world: cuda-maintainers.cachix.org has a day-old pytorch build for 8.6, a build for 7.5, a build for 6.0, etc
Extra: faster nixpkgs-review, assuming fewer default capabilities

Can I add this question to the CUDA docs issue I have open on Nixpkgs? And if so, can I give you credit for it?
I’ve been spinning up 120-core VMs on Azure as spot instances to do reviews, and not having stuff cached is killing me. I’m currently working on my own binary cache with Cloudflare’s R2 (no ingress / egress fees and competitive pricing per GB) to take care of that. Cachix is nice, but I keep hitting the limit and don’t want to pay for it / would feel bad about asking for a discount or something
11:31:19
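For reference, pinning capabilities is plain nixpkgs config; a minimal sketch of the import described above:

  import <nixpkgs> {
    config = {
      allowUnfree = true;            # CUDA packages are unfree
      cudaSupport = true;
      cudaCapabilities = [ "8.6" ];  # build only for your own arch
    };
  }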
@ss:someonex.netSomeoneSerge (UTC+3)
In reply to @connorbaker:matrix.org
Can I add this question to the CUDA docs issue I have open on Nixpkgs? And if so, can I give you credit for it?
I’ve been spinning up 120-core VMs on Azure as spot instances to do reviews, and not having stuff cached is killing me. I’m currently working on my own binary cache with Cloudflare’s R2 (no ingress / egress fees and competitive pricing per GB) to take care of that. Cachix is nice, but I keep hitting the limit and don’t want to pay for it / would feel bad about asking for a discount or something
Yes, please. I just wasn't sure where the appropriate place to track this would be, and this sounds like a fit
11:34:25
@ss:someonex.netSomeoneSerge (UTC+3)

I think NCCL is (still) ignoring cudaCapabilities. We should probably pass NVCC_GENCODE in makeFlagsArray

The format is:

NVCC_GENCODE is -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80

Seems like we can use cudaFlags.cudaGencode for that

13:58:39
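A rough sketch of that override, taking the cudaFlags.cudaGencode attribute named above at face value (the exact attribute name is an assumption):

  nccl.overrideAttrs (old: {
    makeFlags = (old.makeFlags or [ ]) ++ [
      # make NCCL respect config.cudaCapabilities instead of its built-in list
      "NVCC_GENCODE=${lib.concatStringsSep " " cudaFlags.cudaGencode}"
    ];
  })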
@connorbaker:matrix.orgconnor (he/him)
Ah yeah I didn’t switch it over yet, that’s in https://github.com/NixOS/nixpkgs/pull/217619
14:36:18
@domenkozar:matrix.orgDomen Kožar
In reply to @connorbaker:matrix.org
Can I add this question to the CUDA docs issue I have open on Nixpkgs? And if so, can I give you credit for it?
I’ve been spinning up 120-core VMs on Azure as spot instances to do reviews, and not having stuff cached is killing me. I’m currently working on my own binary cache with Cloudflare’s R2 (no ingress / egress fees and competitive pricing per GB) to take care of that. Cachix is nice, but I keep hitting the limit and don’t want to pay for it / would feel bad about asking for a discount or something
I'm happy to sponsor such stuff :)
16:22:38
@justbrowsing:matrix.orgKevin Mittman
Redacted or Malformed Event
23:19:49
28 Feb 2023
@connorbaker:matrix.orgconnor (he/him)
I've got the stomach flu so sorry if I haven't responded / reviewed things recently; I should be able to resume tomorrow.
15:14:33
@connorbaker:matrix.orgconnor (he/him)
Unrelated but something to know: in the cudaPackages 11.7 to 11.8 transition, be aware that cuda_profiler_api.h is no longer in cuda_nvprof; it's in a new cuda_profiler_api package in cudaPackages.
15:19:39
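For downstream packages that include cuda_profiler_api.h, the adjustment would look roughly like this (a sketch; the version conditional and attribute names are assumptions about the cudaPackages set):

  buildInputs = [ cudaPackages.cuda_nvprof ]
    # the header moved out of cuda_nvprof in 11.8
    ++ lib.optionals (lib.versionAtLeast cudaPackages.cudaVersion "11.8")
      [ cudaPackages.cuda_profiler_api ];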
@ss:someonex.netSomeoneSerge (UTC+3)
In reply to @connorbaker:matrix.org
I've got the stomach flu so sorry if I haven't responded / reviewed things recently; I should be able to resume tomorrow.

Ooh, that sounds devastating. Take care!

P.S. Also remember that the whole affair is voluntary, there isn't any rush, and it's more important to keep things sustainable than to sprint

15:32:17
@ss:someonex.netSomeoneSerge (UTC+3)

RE: Building for individual arches

We'd need to choose a smaller list for nixpkgs' default cudaCapabilities, and we don't have a criterion for making that choice.
We could run a poll on nixos discourse, but I don't expect it to be representative.

One option is to include everything from Tim Dettmers' guide (available in JSON), which probably means just [ "8.6" "8.9" ]

Another is to choose whatever covers most of the CUDA support table from Wikipedia, i.e. [ "6.1" "7.5" "8.6" ]. I feel like this would still be pretty fat build-wise

And then, I still wonder what happens if we use something like [ "7.5" "8.6" "5.0" ] (i.e. with 5.0+PTX). I haven't seen anyone do that; I expect it would work, and it would cover all compute capabilities in between 5.0 and 8.6, just that everything except 8.6 and 7.5 might use suboptimal implementations?

20:29:39
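For concreteness, a sketch of what that 5.0+PTX set would mean at the nvcc level (hypothetical values; code=sm_XX emits arch-specific SASS, while code=compute_XX embeds PTX that newer GPUs JIT at load time, typically with less tuned kernels):

  # gencode flags for cudaCapabilities = [ "7.5" "8.6" "5.0" ] with 5.0+PTX
  NVCC_GENCODE =
    "-gencode=arch=compute_50,code=sm_50 "
    + "-gencode=arch=compute_75,code=sm_75 "
    + "-gencode=arch=compute_86,code=sm_86 "
    # the "+PTX" part: any capability >= 5.0 can JIT this, e.g. 6.1 or 7.0
    + "-gencode=arch=compute_50,code=compute_50";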
@ss:someonex.netSomeoneSerge (UTC+3)

Personally, I'd prefer that there was only one arch in there by default.

Alt: default to cudaCapabilities = [ "5.0" ] (with PTX); CUDA probably works out of the box for everyone, but it may be mysteriously slow and people won't know to override the config
Alt: default to cudaCapabilities = [ "8.6" ]; works for DL users, throws an error for lower-grade cards; maybe people find out they need to override the config, but maybe they don't and end up feeling overwhelmed by nixpkgs

20:43:13
@ss:someonex.netSomeoneSerge (UTC+3)
Smaller closures 🙏
20:43:44
@connorbaker:matrix.orgconnor (he/him)
8.6 wouldn’t work for people using an A100 though, right? Since that’s only 8.0
22:41:22
@ss:someonex.netSomeoneSerge (UTC+3)
Uh, right
22:52:46
1 Mar 2023
@justbrowsing:matrix.orgKevin Mittman

FYI, CUDA 12.1.0 is now available 

https://developer.download.nvidia.com/compute/cuda/redist/redistrib_12.1.0.json 

00:22:22
@justbrowsing:matrix.orgKevin Mittman

which presents some questions

  • how are these software releases typically noticed? organically? when something depends on it?
  • what sort of translation would hypothetically be needed to convert this or a similar manifest into something automation could pick up?
  • normally, how are changes such as added, removed, renamed, or split components discovered? if this was in the JSON, would that be helpful?
01:26:05
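On the automation question, the manifest looks close to directly consumable from Nix; a hedged sketch (key names inferred from the published JSON and worth double-checking):

  let
    manifest = builtins.fromJSON (builtins.readFile ./redistrib_12.1.0.json);
    # metadata keys observed at the top level, next to the package entries
    meta = [ "release_date" "release_label" "release_product" ];
    packages = builtins.filter (n: !(builtins.elem n meta))
      (builtins.attrNames manifest);
  in
  # each package entry carries per-platform relative_path / sha256 for fetchurl
  map (name: { inherit name; linux = manifest.${name}."linux-x86_64" or null; })
    packages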
@hexa:lossy.networkhexa
well, here we go again
03:12:02
@hexa:lossy.networkhexa
numba
03:12:03
@hexa:lossy.networkhexa
it just can't keep up
03:12:13
@hexa:lossy.networkhexa
no release in 5 months to address numpy lag
03:12:34
@hexa:lossy.networkhexa
we probably need https://github.com/numba/numba/pull/8691
03:15:19
@hexa:lossy.networkhexa
but it is 20 commits big
03:15:23
@hexa:lossy.networkhexa
and has failing tests
03:15:37
@ss:someonex.netSomeoneSerge (UTC+3)
Error in fail: Repository command failed
No library found under: /nix/store/iq5b0g0md105dsw3zkw07lasaghsy0wq-cudatoolkit-12.0.1-merged/lib/libcupti.so.12.0
ERROR: /build/source/WORKSPACE:15:14: fetching cuda_configure rule //external:local_config_cuda: Traceback (most recent call last)
❯ ldd /nix/store/iq5b0g0md105dsw3zkw07lasaghsy0wq-cudatoolkit-12.0.1-merged/lib/libcupti.so.12.0
ldd: /nix/store/iq5b0g0md105dsw3zkw07lasaghsy0wq-cudatoolkit-12.0.1-merged/lib/libcupti.so.12.0: No such file or directory
03:16:29
@ss:someonex.netSomeoneSerge (UTC+3)
I'ma sleep
03:16:37
@hexa:lossy.networkhexa
FRidh Someone S please tell me if https://github.com/NixOS/nixpkgs/pull/218929 is acceptable
03:48:17
@ss:someonex.netSomeoneSerge (UTC+3)
In reply to @hexa:lossy.network
we probably need https://github.com/numba/numba/pull/8691

Presently CI will fail due to the lack of NumPy 1.24 packages in Anaconda, but this should be resolved in time.
...
I think all review comments are now addressed and this is just waiting on package availability so as to complete testing.

What the actual fuck, they won't just relax pinned versions?

12:05:56


