| 17 Sep 2024 |
Gaétan Lepage | Any idea? Otherwise I can make a PR to mark it as broken (it builds fine on x86_64). | 12:19:32 |
Gaétan Lepage | * connor (he/him) (UTC-5) tiny-cuda-nn fails on aarch64-linux with:
-- Build files have been written to: /build/source/build
cmake: enabled parallel building
cmake: enabled parallel installing
Running phase: buildPhase
build flags: -j80
[1/20] Building CUDA object CMakeFiles/tiny-cuda-nn.dir/src/encoding.cu.o
FAILED: CMakeFiles/tiny-cuda-nn.dir/src/encoding.cu.o
/nix/store/z8g6ma876kbi5mxwq388aadn1h35yqy9-cuda-redist/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/nix/store/k0kxskxvmkw97h3z3b5y4hwd56fh9x33-gcc-wrapper-13.3.0/bin/c++ -DTCNN_MIN_GPU_ARCH=60 -DTCNN_SHAMPOO -I/build/source/include -I/build/source/dependencies -I/build/source/dependencies/cutlass/include -I/build/source/dependencies/cutlass/tools/util/include -I/build/source/dependencies/fmt/include -O3 -DNDEBUG -std=c++14 "--generate-code=arch=compute_60,code=[compute_60,sm_60]" "--generate-code=arch=compute_61,code=[compute_61,sm_61]" "--generate-code=arch=compute_70,code=[compute_70,sm_70]" "--generate-code=arch=compute_75,code=[compute_75,sm_75]" "--generate-code=arch=compute_80,code=[compute_80,sm_80]" "--generate-code=arch=compute_86,code=[compute_86,sm_86]" "--generate-code=arch=compute_89,code=[compute_89,sm_89]" "--generate-code=arch=compute_90,code=[compute_90,sm_90]" "--generate-code=arch=compute_90a,code=[compute_90a,sm_90a]" -Xcompiler=-mf16c -Xcompiler=-Wno-float-conversion -Xcompiler=-fno-strict-aliasing -Xcudafe=--diag_suppress=unrecognized_gcc_pragma --extended-lambda --expt-relaxed-constexpr -MD -MT CMakeFiles/tiny-cuda-nn.dir/src/encoding.cu.o -MF CMakeFiles/tiny-cuda-nn.dir/src/encoding.cu.o.d -x cu -c /build/source/src/encoding.cu -o CMakeFiles/tiny-cuda-nn.dir/src/encoding.cu.o
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
g++: error: unrecognized command-line option '-mf16c'
...
| 12:21:40 |
connor (burnt/out) (UTC-8) | Mark it as broken, it needs more love than I have to give right now :( | 14:35:33 |
Gaétan Lepage | Sure, no worries | 15:26:16 |
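For context, a minimal sketch of what marking the package broken on aarch64-linux could look like; this illustrates the meta.broken idiom only and is not the actual tiny-cuda-nn expression in nixpkgs:

# Hypothetical sketch; the real derivation differs in its inputs and attributes.
{ stdenv, cudaPackages, ... }:

stdenv.mkDerivation {
  pname = "tiny-cuda-nn";
  # ...

  meta = {
    # nvcc forwards -Xcompiler=-mf16c to the host g++, and -mf16c is an
    # x86-only flag, so the build currently fails on aarch64-linux.
    broken = stdenv.hostPlatform.isAarch64;
  };
}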
connor (burnt/out) (UTC-8) | SomeoneSerge (utc+3): if you have a chance, would you take one last look at https://github.com/NixOS/nixpkgs/pull/339619? I added tests (several of which fail because Torch requires that Magma be built with the same version of CUDA, which I'll handle in a follow-up PR) | 15:39:56 |
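One way that Magma/CUDA alignment could be expressed, sketched here as a hypothetical overlay; the actual follow-up PR may well solve it differently, for example inside the package set itself, and cudaPackages_12 below is just an example release:

# Hypothetical overlay sketch; not taken from the pending PR.
final: prev: {
  # Build magma against one specific cudaPackages release...
  magma = prev.magma.override { cudaPackages = final.cudaPackages_12; };

  # ...and build torch against the same release, so the Magma it links
  # against was compiled with the same CUDA toolkit version.
  python3 = prev.python3.override {
    packageOverrides = pyFinal: pyPrev: {
      torch = pyPrev.torch.override { cudaPackages = final.cudaPackages_12; };
    };
  };
}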
| 18 Sep 2024 |
evax | We have a flake-based setup using the NixOS cache, the cuda-maintainers cache, and our own private cache. For some reason, on our CI system cuda_nvcc always ends up being rebuilt from scratch, while we don't have the problem when developing locally. Does anybody have any idea what could cause this? At the end of the CI build, we recursively sign everything linked to ./result and upload it to our private cache. | 06:30:55 |
teto | You can diff the two derivations. Was it nix-diff that showed a nice result? | 10:37:13 |
myrkskog | Anyone know which Linux kernel and driver are most stable and performant for a Quadro RTX 4000? I'm finding it hard to gather this information. | 13:06:17 |
SomeoneSerge (back on matrix) | Mhm we should make a wiki page with a list of setups we run | 13:31:24 |
myrkskog | Great I’ll have a look. Thank you. | 13:31:50 |
SomeoneSerge (back on matrix) | I mean there isn't one yet | 13:32:13 |
SomeoneSerge (back on matrix) | Just acknowledging there is a visibility/discoverability issue here, and we could just do something like what nixos-mobile or postmarketOS do: a table with contributors, their devices, their caches, and the modules and packages they actively use | 13:33:40 |
myrkskog | Got it. Well that would be fantastic 👍 | 13:37:42 |
| 21 Sep 2024 |
aidalgol | Not CUDA-related, but Nvidia-specific: I have no idea where to even start troubleshooting this: https://github.com/NixOS/nixpkgs/pull/341219#issuecomment-2365253518 | 22:20:37 |
| 23 Sep 2024 |
connor (burnt/out) (UTC-8) | Kevin Mittman: does NVIDIA happen to have JSON (or otherwise structured) versions of their dependency constraints for packages somewhere, or are the tables on the docs for each respective package the only source? I'm working on update scripts and I'd like to avoid the manual stage of "go look on the website, find the table (it may have moved), and encode the contents as a Nix expression" | 18:39:25 |
| 24 Sep 2024 |
hexa (UTC+1) | _______ TestKernelLinearOperatorLinOpReturn.test_solve_matrix_broadcast ________
self = <test.operators.test_kernel_linear_operator.TestKernelLinearOperatorLinOpReturn testMethod=test_solve_matrix_broadcast>
def test_solve_matrix_broadcast(self):
linear_op = self.create_linear_op()
# Right hand size has one more batch dimension
batch_shape = torch.Size((3, *linear_op.batch_shape))
rhs = torch.randn(*batch_shape, linear_op.size(-1), 5)
self._test_solve(rhs)
if linear_op.ndimension() > 2:
# Right hand size has one fewer batch dimension
batch_shape = torch.Size(linear_op.batch_shape[1:])
rhs = torch.randn(*batch_shape, linear_op.size(-1), 5)
> self._test_solve(rhs)
linear_operator/test/linear_operator_test_case.py:1115:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
linear_operator/test/linear_operator_test_case.py:615: in _test_solve
self.assertAllClose(arg.grad, arg_copy.grad, **self.tolerances["grad"])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <test.operators.test_kernel_linear_operator.TestKernelLinearOperatorLinOpReturn testMethod=test_solve_matrix_broadcast>
tensor1 = tensor([[[[ 1.8514e+04, 7.1797e+03, -1.1073e+04, -6.6690e+03, 1.2985e+04,
6.8468e+03],
[ 1.685... -3.0153e+04],
[-9.0042e+04, -1.3429e+04, -3.1822e+04, 1.3839e+04, 5.9735e+04,
-5.4315e+04]]]])
tensor2 = tensor([[[[ 1.8514e+04, 7.1797e+03, -1.1073e+04, -6.6690e+03, 1.2985e+04,
6.8468e+03],
[ 1.685... -3.0153e+04],
[-9.0042e+04, -1.3429e+04, -3.1822e+04, 1.3839e+04, 5.9735e+04,
-5.4315e+04]]]])
rtol = 0.03, atol = 1e-05, equal_nan = False
def assertAllClose(self, tensor1, tensor2, rtol=1e-4, atol=1e-5, equal_nan=False):
if not tensor1.shape == tensor2.shape:
raise ValueError(f"tensor1 ({tensor1.shape}) and tensor2 ({tensor2.shape}) do not have the same shape.")
if torch.allclose(tensor1, tensor2, rtol=rtol, atol=atol, equal_nan=equal_nan):
return True
if not equal_nan:
if not torch.equal(tensor1, tensor1):
raise AssertionError(f"tensor1 ({tensor1.shape}) contains NaNs")
if not torch.equal(tensor2, tensor2):
raise AssertionError(f"tensor2 ({tensor2.shape}) contains NaNs")
rtol_diff = (torch.abs(tensor1 - tensor2) / torch.abs(tensor2)).view(-1)
rtol_diff = rtol_diff[torch.isfinite(rtol_diff)]
rtol_max = rtol_diff.max().item()
atol_diff = (torch.abs(tensor1 - tensor2) - torch.abs(tensor2).mul(rtol)).view(-1)
atol_diff = atol_diff[torch.isfinite(atol_diff)]
atol_max = atol_diff.max().item()
> raise AssertionError(
f"tensor1 ({tensor1.shape}) and tensor2 ({tensor2.shape}) are not close enough. \n"
f"max rtol: {rtol_max:0.8f}\t\tmax atol: {atol_max:0.8f}"
)
E AssertionError: tensor1 (torch.Size([2, 3, 4, 6])) and tensor2 (torch.Size([2, 3, 4, 6])) are not close enough.
E max rtol: 0.03577567 max atol: 0.00741313
linear_operator/test/base_test_case.py:46: AssertionError
| 11:40:36 |
hexa (UTC+1) | I think this one has been failing for me on the linear-operator package | 11:41:02 |
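If it only trips on tolerances (max rtol 0.0358 against the allowed 0.03 in the traceback above), a minimal sketch of skipping that single test in the derivation, assuming the package runs its tests via pytestCheckHook; the attribute name and existing attrs are assumptions:

# Hypothetical override; attribute name and build details are assumptions.
python3Packages.linear_operator.overridePythonAttrs (old: {
  disabledTests = (old.disabledTests or [ ]) ++ [
    # Fails only on numerical tolerance in the gradient comparison.
    "test_solve_matrix_broadcast"
  ];
})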
connor (burnt/out) (UTC-8) | As a sanity check: has anyone been able to successfully use torch.compile to speed up model training, or do they also get a Python stack trace when torch tries to call into OpenAI's Triton? | 15:23:08 |
| 25 Sep 2024 |
SomeoneSerge (back on matrix) | It used to work, but now our triton is lagging one major version behind | 19:36:58 |
Gaétan Lepage | Because those geniuses are not able to tag a freaking release | 20:20:55 |
Gaétan Lepage | https://github.com/triton-lang/triton/issues/3535 | 20:21:18 |
SomeoneSerge (back on matrix) | unstable-yyyy-mm-dd is ok for us; there were some minor but unresolved issues with the PR that does the bump though | 20:23:04 |
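A sketch of what an unstable-yyyy-mm-dd pin for triton could look like; the attribute name is assumed, and the date, rev, and hash below are placeholders rather than values from the pending PR:

# Hypothetical bump; rev and hash are placeholders to be filled in.
{ lib, fetchFromGitHub, python3Packages }:

python3Packages.triton.overrideAttrs (old: {
  version = "unstable-2024-09-25";
  src = fetchFromGitHub {
    owner = "triton-lang";
    repo = "triton";
    rev = "0000000000000000000000000000000000000000"; # placeholder commit
    hash = lib.fakeHash;                               # replace after the first fetch
  };
})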