
NixOS CUDA

288 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



11 Jan 2025
@ss:someonex.netSomeoneSerge (back on matrix) Odd, I thought we nuked old cuda releases that didn't have cuda_cccl? 11:47:40
@hexa:lossy.network hexa (UTC+1) oh yeah, that would explain why I could build it just fine, but eval would fail 🙂 16:09:02
13 Jan 2025
@ruroruro:matrix.orgruro

Hi, everyone. In my experience, CUDA packages (and CUDA-enabled packages when cudaSupport = true) are quite often broken in nixpkgs — more often than other packages.

For example, https://hydra.nix-community.org/jobset/nixpkgs/cuda/evals has a bunch of Eval Errors and build errors and I don't remember the last time that it was green (although some of those eval errors might not be indicative of actually broken packages).

I was thinking that we might be able to improve the situation by making general nixpkgs contributors more aware of this situation. For example, it would be pretty cool if we could track the nix-community hydra builds on status.nixos.org and on zh.fail (and try to include CUDA packages in future ZHF events).

Also, I understand why hydra.nixos.org doesn't build CUDA packages, but do you think that we could enable evaluation-only checks for CUDA packages on nixpkgs github PRs and then build those PRs using the nix-community builders and report the results on the PR?

Finally, I was wondering if there is some canonical place to track/discuss CUDA-specific build failures in nixpkgs?

14:28:08
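For readers unfamiliar with the flag mentioned above, here is a minimal sketch of what enabling CUDA support globally looks like; cudaSupport, cudaCapabilities, and allowUnfree are standard nixpkgs config options, but treat the exact shape as illustrative rather than a recommended setup:

```nix
# Minimal sketch: instantiate nixpkgs with CUDA support enabled.
# allowUnfree is required because the CUDA toolkit itself is unfree.
import <nixpkgs> {
  config = {
    allowUnfree = true;
    cudaSupport = true;
    # Optionally restrict to specific GPU architectures to cut build times:
    # cudaCapabilities = [ "8.6" ];
  };
}
```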
@ruroruro:matrix.orgruro It feels like fixing CUDA packages is currently "treadmill work": some package gets fixed only for something else to be broken by unrelated changes in nixpkgs (because the current automation on GitHub PRs doesn't check CUDA-enabled versions of packages). 14:35:04
@ruroruro:matrix.orgruro

On a related note, does anybody know what's up with the Eval Errors on the nix-community cuda job?

Looking at this tab (https://hydra.nix-community.org/jobset/nixpkgs/cuda#tabs-errors), it seems that most eval errors are caused by the fact that release-cuda.nix just tries to indiscriminately build everything in cudaPackages.

  • 119 of those errors are caused by the fact that TensorRT has an "unfree" license (and it can't be built in CI anyway, because you need to manually download the tarballs)
  • 101 errors are due to some of the cudnn_* packages being marked as broken (I think that for most of them, it's because "CUDA version is too new" or "CUDA version is too old")
  • 15 errors are due to the cuda-samples, colmap and deepin.image-editor packages depending on freeimage-unstable-2021-11-01, which is marked as insecure
  • 13 errors are due to some of the nvidia_driver packages being marked as broken (because "Package is not supported; use drivers from linuxPackages")
  • 4 errors are due to some of the nsight_systems packages being marked as broken (because "CUDA too old (<11.8)")
  • 4 errors are due to CUDA 10 being removed from nixpkgs but still being accessible via cudaPackages_10{,_0,_1,_2}
  • 2 errors are due to boxx and bpycv depending on fn, which doesn't work with python>=3.11

And the following individual packages are also failing to eval:

  • pixinsight because it is "unfree"
  • mxnet because it was marked as broken in #173463
  • truecrack-cuda because it was marked as broken in #167250
  • pymc because it depends on pytensor, which was marked as broken in #373239
  • Theano because it was removed from nixpkgs but is still accessible (and listed in release-cuda.nix)
  • tts because it depends on a -bin version of pytorch for some reason, which is "unfree" (bsd3, issl, unfreeRedistributable)

Interestingly, the "Evaluation Errors" tabs of individual job runs are empty for some reason.

15:26:04
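Most of the error classes listed above trace back to `meta` checks (unfree, broken, insecure) that make evaluation throw under the default nixpkgs config. A hedged sketch of the config under which these particular evals would stop throwing — all three attributes are standard nixpkgs config options, and the package names in comments are taken from the list above:

```nix
# Sketch: relax the meta checks that cause these eval errors locally.
import <nixpkgs> {
  config = {
    allowUnfree = true;  # e.g. TensorRT, pixinsight, pytorch-bin
    allowBroken = true;  # e.g. cudnn_*, nvidia_driver, nsight_systems
    # e.g. the freeimage dependency flagged above:
    permittedInsecurePackages = [ "freeimage-unstable-2021-11-01" ];
  };
}
```

Whether a CI job *should* evaluate with these relaxations is a separate question; this only shows why the errors appear under the default config.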
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)

Excellent questions and ideas!

You’re correct that CUDA packages are broken more often than other packages — we don’t get the benefit of any of the tooling Nixpkgs CI provides.

I’m all for Hydra integrations to make that information more visible, but I fear it won’t prevent breakages since those are usually caught when maintainers run Nixpkgs-review, and they don’t typically enable CUDA support from what I can tell.

I think evaluation-only checks are very reasonable for upstream.

I’m not sure what would be involved in getting the community builders to build CUDA packages (especially given some of their licenses and the fact that CUDA packages tend to be resource intensive to build).

We do have a CUDA project board on GitHub, but nothing solely for build failures IIRC.

I haven’t had the chance to follow what’s happening with the Nix community Hydra :(

15:50:21
@ruroruro:matrix.orgruro I am honestly not too familiar with the internals of nixpkgs-review and other CI/automation tooling. The nixpkgs-review README states that it uses ofborg evaluation results to determine what needs to be built. I wonder if release-cuda.nix could be included in ofborg (and consequently nixpkgs-review) without making hydra.nixos.org build it? 16:05:22
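As a local stopgap while reviewing, nixpkgs-review can be pointed at a CUDA-enabled config: its README documents an `--extra-nixpkgs-config` flag that accepts a Nix attribute set like the one below. Treat the exact invocation as an assumption; the attribute names themselves are standard nixpkgs config options:

```nix
# Attribute set passed to nixpkgs-review, e.g.:
#   nixpkgs-review pr <number> --extra-nixpkgs-config '{ allowUnfree = true; cudaSupport = true; }'
{
  allowUnfree = true;  # the CUDA toolkit is unfree
  cudaSupport = true;  # build the CUDA-enabled variants of the packages under review
}
```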
14 Jan 2025
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) When doing reviews for CUDA, that hasn't been my experience; eval happens locally, consumes a bunch of memory, and then builds stuff.
On the other hand, I haven't run nixpkgs-review for CUDA stuff since I split it out into a separate repository.
04:14:00
@msanft:matrix.orgMoritz Sanft Does any of you have some spare time to review https://github.com/NixOS/nixpkgs/pull/372320? It fixes a silly mistake I made in another PR when writing a patch for libnvidia-container, which unfortunately causes libnvidia-container to segfault under certain conditions. We could possibly mark this as security too, as it's memory corruption :/ 08:14:41
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) I’m assuming you kept the if expression to minimize the diff in the patch? Otherwise I’d recommend removing it entirely since the condition will always be false. 16:51:02
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) Merged 16:58:29
@maxwell325:matrix.org left the room. 22:41:36
15 Jan 2025
@msanft:matrix.orgMoritz Sanft
In reply to@connorbaker:matrix.org
I’m assuming you kept the if expression to minimize the diff in the patch? Otherwise I’d recommend removing it entirely since the condition will always be false.
exactly. Minimizes the room for human error here imo.
07:23:03
@ss:someonex.netSomeoneSerge (back on matrix) changed their display name from SomeoneSerge (utc+3) to SomeoneSerge.19:02:40
16 Jan 2025
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) If anyone enjoys the process of bumping opencv, it looks like 4.11 is out now (master is still using 4.9). 07:51:29
@glepage:matrix.orgGaétan Lepage Has to happen on staging, right? 07:52:29
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) I think so, due to the number of rebuilds 07:54:12
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) Good news is that 4.10 fixes compilation with CUDA 12.3+ 07:54:34
18 Jan 2025
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) just updated https://github.com/connorbaker/nix-cuda-test to use the latest changes in https://github.com/ConnorBaker/cuda-packages, which include getting rid of the need for a custom CUDA stdenv 08:51:19
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)At least, I’m fairly certain I’ve implemented it that way. I’ve also got a setup hook which checks for RPATHs in produced libraries that match library directories from NVCC’s host compiler09:35:02
@glepage:matrix.orgGaétan Lepage
In reply to @connorbaker:matrix.org
if anyone enjoys the process of bumping opencv looks like 4.11 is out now (master is still using 4.9)
I'm looking for some reviewers ;)
https://github.com/NixOS/nixpkgs/pull/374246
09:45:39
