!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

290 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda
58 Servers



24 Sep 2024
@hexa:lossy.network hexa (UTC+1):
_______ TestKernelLinearOperatorLinOpReturn.test_solve_matrix_broadcast ________

self = <test.operators.test_kernel_linear_operator.TestKernelLinearOperatorLinOpReturn testMethod=test_solve_matrix_broadcast>

    def test_solve_matrix_broadcast(self):
        linear_op = self.create_linear_op()
    
        # Right hand size has one more batch dimension
        batch_shape = torch.Size((3, *linear_op.batch_shape))
        rhs = torch.randn(*batch_shape, linear_op.size(-1), 5)
        self._test_solve(rhs)
    
        if linear_op.ndimension() > 2:
            # Right hand size has one fewer batch dimension
            batch_shape = torch.Size(linear_op.batch_shape[1:])
            rhs = torch.randn(*batch_shape, linear_op.size(-1), 5)
>           self._test_solve(rhs)

linear_operator/test/linear_operator_test_case.py:1115: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
linear_operator/test/linear_operator_test_case.py:615: in _test_solve
    self.assertAllClose(arg.grad, arg_copy.grad, **self.tolerances["grad"])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.operators.test_kernel_linear_operator.TestKernelLinearOperatorLinOpReturn testMethod=test_solve_matrix_broadcast>
tensor1 = tensor([[[[ 1.8514e+04,  7.1797e+03, -1.1073e+04, -6.6690e+03,  1.2985e+04,
            6.8468e+03],
          [ 1.685...  -3.0153e+04],
          [-9.0042e+04, -1.3429e+04, -3.1822e+04,  1.3839e+04,  5.9735e+04,
           -5.4315e+04]]]])
tensor2 = tensor([[[[ 1.8514e+04,  7.1797e+03, -1.1073e+04, -6.6690e+03,  1.2985e+04,
            6.8468e+03],
          [ 1.685...  -3.0153e+04],
          [-9.0042e+04, -1.3429e+04, -3.1822e+04,  1.3839e+04,  5.9735e+04,
           -5.4315e+04]]]])
rtol = 0.03, atol = 1e-05, equal_nan = False

    def assertAllClose(self, tensor1, tensor2, rtol=1e-4, atol=1e-5, equal_nan=False):
        if not tensor1.shape == tensor2.shape:
            raise ValueError(f"tensor1 ({tensor1.shape}) and tensor2 ({tensor2.shape}) do not have the same shape.")
    
        if torch.allclose(tensor1, tensor2, rtol=rtol, atol=atol, equal_nan=equal_nan):
            return True
    
        if not equal_nan:
            if not torch.equal(tensor1, tensor1):
                raise AssertionError(f"tensor1 ({tensor1.shape}) contains NaNs")
            if not torch.equal(tensor2, tensor2):
                raise AssertionError(f"tensor2 ({tensor2.shape}) contains NaNs")
    
        rtol_diff = (torch.abs(tensor1 - tensor2) / torch.abs(tensor2)).view(-1)
        rtol_diff = rtol_diff[torch.isfinite(rtol_diff)]
        rtol_max = rtol_diff.max().item()
    
        atol_diff = (torch.abs(tensor1 - tensor2) - torch.abs(tensor2).mul(rtol)).view(-1)
        atol_diff = atol_diff[torch.isfinite(atol_diff)]
        atol_max = atol_diff.max().item()
    
>       raise AssertionError(
            f"tensor1 ({tensor1.shape}) and tensor2 ({tensor2.shape}) are not close enough. \n"
            f"max rtol: {rtol_max:0.8f}\t\tmax atol: {atol_max:0.8f}"
        )
E       AssertionError: tensor1 (torch.Size([2, 3, 4, 6])) and tensor2 (torch.Size([2, 3, 4, 6])) are not close enough. 
E       max rtol: 0.03577567            max atol: 0.00741313

linear_operator/test/base_test_case.py:46: AssertionError
11:40:36
@hexa:lossy.network hexa (UTC+1): I think this one has been failing for me on the linear-operator package 11:41:02
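For reference, a minimal sketch (not part of the chat, assuming a recent PyTorch) of the tolerance check the traceback above trips over: the reported worst-case relative error of about 3.6% is just past the test's rtol of 0.03, so torch.allclose rejects tensors that a slightly looser tolerance would accept.

import torch

# Magnitudes comparable to the gradients in the traceback (~1e4).
torch.manual_seed(0)
reference = torch.randn(4, 6) * 1e4
perturbed = reference * 1.036            # ~3.6% relative error, just over rtol=0.03

print(torch.allclose(reference, perturbed, rtol=0.03, atol=1e-5))  # False
print(torch.allclose(reference, perturbed, rtol=0.05, atol=1e-5))  # True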
@connorbaker:matrix.org connor (burnt/out) (UTC-8): As a sanity check, has anyone been able to successfully use torch.compile to speed up model training, or do they also get a Python stack trace when torch tries to call into OpenAI's triton? 15:23:08
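For context, a hedged sketch (not from the chat) of the kind of torch.compile call being asked about: on a CUDA device the default Inductor backend lowers to Triton kernels, which is where a broken or version-mismatched triton package tends to surface as a Python stack trace. The model and shapes here are purely illustrative.

import torch
import torch.nn as nn

# Toy model; any module works, the point is exercising the compile path.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

compiled = torch.compile(model)          # default backend is Inductor

x = torch.randn(32, 128, device=device)
y = compiled(x)                          # first call triggers compilation (and Triton on CUDA)
print(y.shape)                           # torch.Size([32, 10])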
25 Sep 2024
@ss:someonex.net SomeoneSerge (back on matrix): It used to work, but now our triton is lagging one major version behind 19:36:58
@glepage:matrix.org Gaétan Lepage: Because those geniuses are not able to tag a freaking release 20:20:55
@glepage:matrix.org Gaétan Lepage: https://github.com/triton-lang/triton/issues/3535 20:21:18
@ss:someonex.net SomeoneSerge (back on matrix): unstable-yyyy-mm-dd is OK for us; there were some minor but unresolved issues with the PR that does the bump, though 20:23:04
26 Sep 2024
@connorbaker:matrix.org connor (burnt/out) (UTC-8)
In reply to @glepage:matrix.org
https://github.com/triton-lang/triton/issues/3535
Well that’s an infuriating read
16:33:18
@glepage:matrix.org Gaétan Lepage: It's OK, OpenAI is just a small startup with only a few people. And deep learning is not even their main activity 17:07:38
@connorbaker:matrix.org connor (burnt/out) (UTC-8): Yeah, and they're definitely not a for-profit organization 17:20:14
@adam:robins.wtf: "open" is in their name 17:24:26
@gsaurel:laas.fr nim65s: it's such a joke that I find it sad it was not opened one day earlier 17:28:20
@glepage:matrix.orgGaétan Lepage "I propose a 200€ bounty for this PR. Please git tag the freaking commit. 21:09:04
@glepage:matrix.orgGaétan Lepage * "I propose a 200€ bounty for this PR. Please git tag the freaking commit." 21:09:07
@glepage:matrix.org Gaétan Lepage: The ease of spinning up a release is a decreasing function of the project/company resources. 21:09:40
@gsaurel:laas.fr nim65s: same issue on a one-man project abandoned for the last year or so: https://github.com/bab2min/EigenRand/issues/56 : <48h 21:49:56
28 Sep 2024
@shekhinah:she.khinah.xyz shekhinah changed their profile picture. 07:04:58
@kaya:catnip.ee kaya 𖤐 changed their profile picture. 16:55:46
1 Oct 2024
@-_o:matrix.org -_o joined the room. 21:00:15
2 Oct 2024
@hexa:lossy.network hexa (UTC+1): Gaétan Lepage: please take care of tensordict 00:25:19
@hexa:lossy.network hexa (UTC+1): [image: image.png] 00:25:22
@glepage:matrix.org Gaétan Lepage: Sure, I will have a look right now.
I have not faced any failure on my end, weird...
06:21:33
@glepage:matrix.org Gaétan Lepage: Is this on staging? 06:23:26
@glepage:matrix.org Gaétan Lepage: All failures that I was able to find on Hydra are timeouts or upstream dependency failures.
I was able to build tensordict on all architectures...
07:05:50
@hexa:lossy.network hexa (UTC+1): this is on trunk 11:03:39
@hexa:lossy.network hexa (UTC+1): then you probably need to increase meta.timeout 11:04:00
@glepage:matrix.org Gaétan Lepage: Now that you mention it, I remember this package being stuck (indefinitely) during mass rebuilds.
I don't know if increasing the timeout will help. When everything works fine, it builds in ~1min...
Also, nothing has changed in the derivation for the past few months.
11:47:12
@justbrowsing:matrix.org Kevin Mittman (UTC-8): Back from vacation 18:23:19


