1 Aug 2024 |
SomeoneSerge (utc+3) | * like
$ LD_DEBUG=libs python
import torch
# do whatever you do
| 14:53:34 |
SomeoneSerge (utc+3) | * like
$ LD_DEBUG=libs python my-repro.py
| 14:53:51 |
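The `LD_DEBUG=libs` suggestion above can also be scripted when the loader output is too long to read by eye. A minimal sketch, assuming a glibc-based system (where `LD_DEBUG` output goes to stderr); the helper names `filter_lines` and `run_with_ld_debug` and the default needle `libstdc++` are made up for illustration and come from neither nixpkgs nor torch:

```python
import os
import subprocess
import sys


def filter_lines(text, needle):
    """Keep only the lines of `text` that mention `needle`."""
    return [line for line in text.splitlines() if needle in line]


def run_with_ld_debug(argv, needle="libstdc++"):
    """Run `argv` with LD_DEBUG=libs set and return the loader lines mentioning `needle`.

    Assumption: the dynamic loader is glibc's, which writes its LD_DEBUG
    trace to stderr. With another loader the result is simply an empty list.
    """
    env = dict(os.environ, LD_DEBUG="libs")
    proc = subprocess.run(
        argv,
        env=env,
        stdout=subprocess.DEVNULL,   # the program's own output is not needed
        stderr=subprocess.PIPE,      # the loader trace ends up here
        text=True,
        errors="replace",            # tolerate odd bytes in the trace
        check=False,                 # a failing repro is still worth tracing
    )
    return filter_lines(proc.stderr, needle)
```

Calling `run_with_ld_debug([sys.executable, "my-repro.py"])` would then return only the loader lines that mention `libstdc++`, which is usually enough to see which store path the library was actually loaded from.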
yorickvp | alright, I'll try that | 14:54:17 |
yorickvp | the failing line is LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${cudaPackages.cuda_cudart.stubs}/lib python -m pybind11_stubgen -o . bindings | 14:55:02 |
yorickvp | In reply to @ss:someonex.net
like
$ LD_DEBUG=libs python my-repro.py
okay, so my libraries have rpath $ORIGIN:/home/yorick/outputs/out/lib:/nix/store/kzx58d5pbb78gnv9s4d62f4r46x9waw9-gcc-12.3.0-lib/lib:/nix/store/8rzflwd9bxri4s0bpicm8bkmi2ikmv7n-nccl-2.21.5-1/lib:/nix/store/61q201jxc1g6pkbvhyyriwlm7zasa81k-openmpi-4.1.6/lib:/nix/store/g798k855fny946jnycp61vkzy27kwlyl-libcublas-12.1.3.1-lib/lib:/nix/store/dbwp0scbb0rk78m636sb7cvycz8xzgyh-glibc-2.39-52/lib:/nix/store/bn7pnigb0f8874m6riiw6dngsmdyic1g-gcc-13.3.0-lib/lib:/nix/store/2v1jx43nsp9njldxh4bfljvh5wmnbzk3-python3.10-tensorrt-cu12-libs-10.2.0/lib/python3.10/site-packages/tensorrt_libs:/nix/store/ybqfab6p2p6ir9dcr6gn6rxn825wb86g-cudnn-8.9.7.29-lib/lib | 15:12:20 |
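The order of the entries in that rpath matters because the dynamic loader searches them from left to right, so a `gcc-12.3.0-lib` entry ahead of `gcc-13.3.0-lib` means the older `libstdc++` is found first. A small sketch that makes the ordering visible; the helper name `gcc_lib_entries` and the regular expression are illustrative assumptions about how nixpkgs names these store paths, and `RPATH` is only an excerpt of the value quoted above:

```python
import re

# Assumed naming scheme for gcc runtime-library outputs in the Nix store,
# e.g. "/nix/store/<hash>-gcc-12.3.0-lib/lib".
GCC_LIB = re.compile(r"/nix/store/[^/]+-gcc-(\d+(?:\.\d+)*)-lib/lib$")


def gcc_lib_entries(rpath):
    """Return (position, version, path) for each gcc -lib entry, in search order."""
    found = []
    for position, entry in enumerate(rpath.split(":")):
        match = GCC_LIB.search(entry)
        if match:
            found.append((position, match.group(1), entry))
    return found


# Excerpt of the rpath from the message above; the nccl, openmpi, libcublas,
# tensorrt and cudnn entries are left out for brevity.
RPATH = ":".join([
    "$ORIGIN",
    "/nix/store/kzx58d5pbb78gnv9s4d62f4r46x9waw9-gcc-12.3.0-lib/lib",
    "/nix/store/dbwp0scbb0rk78m636sb7cvycz8xzgyh-glibc-2.39-52/lib",
    "/nix/store/bn7pnigb0f8874m6riiw6dngsmdyic1g-gcc-13.3.0-lib/lib",
])

for position, version, path in gcc_lib_entries(RPATH):
    print(position, version, path)
```

On a real library the input string could come from `patchelf --print-rpath`; here it is hard-coded so the sketch runs without any Nix tooling.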
yorickvp | looks like cmake is writing it as a LINK_PATH | 15:22:52 |
SomeoneSerge (utc+3) | So there's something else propagating an unwrapped (differently wrapped) gcc12 maybe | 15:23:40 |
yorickvp | how can I list all propagated inputs? | 15:24:08 |
SomeoneSerge (utc+3) | all propagated inputs of | 15:24:30 |
yorickvp | I'm in a nix develop for the drv that produces the libraries with the wrong rpath | 15:25:24 |
SomeoneSerge (utc+3) | H'mm, maybe you can echo "${pkgsBuildHost[@]}" for compilers/build tools | 15:26:51 |
SomeoneSerge (utc+3) | But that won't tell you where it's coming from | 15:27:09 |
SomeoneSerge (utc+3) | Just do a nix-tree --derivation, or nix path-info / nix why-depends | 15:27:31 |
yorickvp | seems like there's no unwrapped gcc | 15:48:59 |
yorickvp | libtorch_cuda.so also manages to link it | 15:49:53 |
yorickvp | https://gist.github.com/yorickvP/b263b9d6d058280a3f7d4c70eff2a758
/nix/store/mbg29pcjydgss24z0v6jczjda7q4z9x6-gcc-12.3.0.drv (the offending gcc lib) only occurs as a dependency of the gcc-wrapper that has the correct lib first | 15:54:09 |
yorickvp | I'll try to repro with torch on nixos-unstable | 15:57:23 |
yorickvp | yeah, ${python3.pkgs.torchWithCuda.lib}/lib/libtorch_cuda.so links to gcc-12.4.0-lib | 16:16:14 |
SomeoneSerge (utc+3) | Wow | 16:40:20 |
SomeoneSerge (utc+3) | This looks like a regression | 16:40:27 |
SomeoneSerge (utc+3) | Well the first obvious leak (the one we see in the wrapper) is https://github.com/NixOS/nixpkgs/blob/fc27807b85986bb26a8f28e590e01fae742e6b53/pkgs/build-support/cc-wrapper/default.nix#L596-L606 | 16:53:54 |
SomeoneSerge (utc+3) | Notably, cudaPackages.saxpy works fine at that commit | 16:54:12 |
SomeoneSerge (utc+3) | I'm running github:NixOS/nixpkgs/c66e984bda09e7230ea7b364e677c5ba4f0d36d0#opencv4.tests.no-libstdcxx-errors now (only defined for cudaSupport = true) | 16:54:41 |
SomeoneSerge (utc+3) | Going to take a while | 16:54:45 |
SomeoneSerge (utc+3) | But it might be that the regression is somehow magically torch-specific | 16:54:59 |
SomeoneSerge (utc+3) | No idea why https://github.com/NixOS/nixpkgs/blame/fc27807b85986bb26a8f28e590e01fae742e6b53/pkgs/build-support/cc-wrapper/default.nix#L605-L606 uses cc_solib honestly | 16:55:53 |
yorickvp | you know, I blame cmake | 17:00:59 |
yorickvp | * you know, I blame cmake :) | 17:01:03 |
yorickvp | looking at 36 megabytes of cmake logs, it obviously parses it out of some gcc output (together with the correct one, which it puts first in the path). I'm not sure what it does with it after | 17:02:50 |
SomeoneSerge (utc+3) | Waiting for opencv, but so far I'm leaning towards "maybe pytorch devs replaced some of the cmake logic with an unnecessary gcc -print-search-dirs" | 17:06:46 |