!eWOErHSaiddIbsUNsJ:nixos.org

NixOS CUDA

289 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda
57 Servers

23 Sep 2024
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) Kevin Mittman: does NVIDIA happen to have JSON (or otherwise structured) versions of their dependency constraints for packages somewhere, or are the tables on the docs for each respective package the only source? I'm working on update scripts and I'd like to avoid the manual stage of "go look on the website, find the table (it may have moved), and encode the contents as a Nix expression" 18:39:25
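For reference, NVIDIA does publish structured redistrib manifests (redistrib_<version>.json) next to the redist tarballs for products such as cuda, cudnn and cutensor; they describe package names, versions, per-architecture tarballs and hashes rather than the support-matrix tables from the docs, so they only partially cover the dependency constraints asked about here. A minimal sketch of fetching and listing one such manifest, with the URL pattern and the example product/version as assumptions to verify against the actual redist index:

# Hedged sketch: fetch an NVIDIA redistrib manifest and list its packages.
# The URL pattern and the example product/version are assumptions to check
# against https://developer.download.nvidia.com/compute/<product>/redist/.
import json
import urllib.request

REDIST_BASE = "https://developer.download.nvidia.com/compute/{product}/redist"

def fetch_manifest(product: str, version: str) -> dict:
    # e.g. product="cudnn", version="9.3.0" (hypothetical example values)
    url = f"{REDIST_BASE.format(product=product)}/redistrib_{version}.json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    manifest = fetch_manifest("cudnn", "9.3.0")
    for name, entry in manifest.items():
        if not isinstance(entry, dict):
            continue  # skip top-level string fields such as "release_date"
        archs = [k for k in entry if k.startswith(("linux-", "windows-"))]
        print(f"{name} {entry.get('version', '?')}: {', '.join(archs)}")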
24 Sep 2024
@pascal.grosmann:scs.ems.host@pascal.grosmann:scs.ems.host set a profile picture.08:56:22
@hexa:lossy.networkhexa
_______ TestKernelLinearOperatorLinOpReturn.test_solve_matrix_broadcast ________

self = <test.operators.test_kernel_linear_operator.TestKernelLinearOperatorLinOpReturn testMethod=test_solve_matrix_broadcast>

    def test_solve_matrix_broadcast(self):
        linear_op = self.create_linear_op()
    
        # Right hand side has one more batch dimension
        batch_shape = torch.Size((3, *linear_op.batch_shape))
        rhs = torch.randn(*batch_shape, linear_op.size(-1), 5)
        self._test_solve(rhs)
    
        if linear_op.ndimension() > 2:
            # Right hand side has one fewer batch dimension
            batch_shape = torch.Size(linear_op.batch_shape[1:])
            rhs = torch.randn(*batch_shape, linear_op.size(-1), 5)
>           self._test_solve(rhs)

linear_operator/test/linear_operator_test_case.py:1115: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
linear_operator/test/linear_operator_test_case.py:615: in _test_solve
    self.assertAllClose(arg.grad, arg_copy.grad, **self.tolerances["grad"])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.operators.test_kernel_linear_operator.TestKernelLinearOperatorLinOpReturn testMethod=test_solve_matrix_broadcast>
tensor1 = tensor([[[[ 1.8514e+04,  7.1797e+03, -1.1073e+04, -6.6690e+03,  1.2985e+04,
            6.8468e+03],
          [ 1.685...  -3.0153e+04],
          [-9.0042e+04, -1.3429e+04, -3.1822e+04,  1.3839e+04,  5.9735e+04,
           -5.4315e+04]]]])
tensor2 = tensor([[[[ 1.8514e+04,  7.1797e+03, -1.1073e+04, -6.6690e+03,  1.2985e+04,
            6.8468e+03],
          [ 1.685...  -3.0153e+04],
          [-9.0042e+04, -1.3429e+04, -3.1822e+04,  1.3839e+04,  5.9735e+04,
           -5.4315e+04]]]])
rtol = 0.03, atol = 1e-05, equal_nan = False

    def assertAllClose(self, tensor1, tensor2, rtol=1e-4, atol=1e-5, equal_nan=False):
        if not tensor1.shape == tensor2.shape:
            raise ValueError(f"tensor1 ({tensor1.shape}) and tensor2 ({tensor2.shape}) do not have the same shape.")
    
        if torch.allclose(tensor1, tensor2, rtol=rtol, atol=atol, equal_nan=equal_nan):
            return True
    
        if not equal_nan:
            if not torch.equal(tensor1, tensor1):
                raise AssertionError(f"tensor1 ({tensor1.shape}) contains NaNs")
            if not torch.equal(tensor2, tensor2):
                raise AssertionError(f"tensor2 ({tensor2.shape}) contains NaNs")
    
        rtol_diff = (torch.abs(tensor1 - tensor2) / torch.abs(tensor2)).view(-1)
        rtol_diff = rtol_diff[torch.isfinite(rtol_diff)]
        rtol_max = rtol_diff.max().item()
    
        atol_diff = (torch.abs(tensor1 - tensor2) - torch.abs(tensor2).mul(rtol)).view(-1)
        atol_diff = atol_diff[torch.isfinite(atol_diff)]
        atol_max = atol_diff.max().item()
    
>       raise AssertionError(
            f"tensor1 ({tensor1.shape}) and tensor2 ({tensor2.shape}) are not close enough. \n"
            f"max rtol: {rtol_max:0.8f}\t\tmax atol: {atol_max:0.8f}"
        )
E       AssertionError: tensor1 (torch.Size([2, 3, 4, 6])) and tensor2 (torch.Size([2, 3, 4, 6])) are not close enough. 
E       max rtol: 0.03577567            max atol: 0.00741313

linear_operator/test/base_test_case.py:46: AssertionError
11:40:36
@hexa:lossy.networkhexaI think this one has been failing for me on the linear-operator package11:41:02
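For context on what that assertion checks: torch.allclose passes iff |tensor1 - tensor2| <= atol + rtol * |tensor2| elementwise, so the reported max rtol of ~0.0358 against rtol=0.03 means the worst gradient entry is off by roughly 3.6% relative error, just outside the tolerance. A toy sketch of the criterion (made-up numbers, not taken from the failing test):

# Toy illustration of the torch.allclose criterion used by assertAllClose above:
# two tensors pass iff |a - b| <= atol + rtol * |b| elementwise.
import torch

a = torch.tensor([100.0, 200.0, 300.0])
b = torch.tensor([100.5, 207.5, 300.0])  # second entry is ~3.6% off

print(torch.allclose(a, b, rtol=0.03, atol=1e-5))  # False: 7.5 > 1e-5 + 0.03 * 207.5
print(torch.allclose(a, b, rtol=0.04, atol=1e-5))  # True:  7.5 < 1e-5 + 0.04 * 207.5

# Same "max rtol" diagnostic the test prints:
rel = (torch.abs(a - b) / torch.abs(b)).max().item()
print(f"max rtol: {rel:0.8f}")  # ~0.03614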
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) As a sanity check — has anyone been able to successfully use torch.compile to speed up model training, or do they also get a python stack trace when torch tries to call into OpenAI’s triton 15:23:08
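A minimal smoke test along those lines, assuming PyTorch 2.x on a CUDA machine; on GPU the default Inductor backend generates Triton kernels, so a broken or version-mismatched triton usually surfaces as an exception on the first call of the compiled function:

# Hedged sketch of a torch.compile smoke test (assumes PyTorch 2.x; uses CUDA if available).
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    # A few pointwise ops are enough to force the Inductor backend to generate a kernel.
    return torch.sin(x) + torch.cos(x) * x

device = "cuda" if torch.cuda.is_available() else "cpu"
compiled = torch.compile(f)

x = torch.randn(1024, device=device)
out = compiled(x)  # compilation (and any triton failure) happens on this first call
# Loose tolerance: fused kernels are not guaranteed to be bit-identical to eager mode.
print(torch.allclose(out, f(x), rtol=1e-3, atol=1e-5))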
25 Sep 2024
@ss:someonex.netSomeoneSerge (back on matrix)It used to work but now our triton is lagging 1 major version behind19:36:58
@glepage:matrix.orgGaétan LepageBecause those geniuses are not able to tag a freaking release20:20:55
@glepage:matrix.orgGaétan Lepage https://github.com/triton-lang/triton/issues/3535 20:21:18
@ss:someonex.netSomeoneSerge (back on matrix)unstable-yyyy-mm-dd is ok for us; there were some minor but unresolved issues with the PR that does the bump though20:23:04
26 Sep 2024
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8)
In reply to @glepage:matrix.org
https://github.com/triton-lang/triton/issues/3535
Well that’s an infuriating read
16:33:18
@glepage:matrix.orgGaétan LepageIt's OK, OpenAI is just a small startup with only a few people. And deep learning is not even their main activity17:07:38
@connorbaker:matrix.orgconnor (burnt/out) (UTC-8) Yeah and they're definitely not a for-profit organization 17:20:14
@adam:robins.wtf@adam:robins.wtf"open" is in their name17:24:26
@gsaurel:laas.frnim65sit's such a joke that I find it sad it was not opened one day earlier17:28:20
@glepage:matrix.orgGaétan Lepage "I propose a 200€ bounty for this PR. Please git tag the freaking commit." 21:09:04
@glepage:matrix.orgGaétan LepageThe ease of spinning up a release is a decreasing function of the project/company resources.21:09:40
@gsaurel:laas.frnim65ssame issue on a one-man project abandoned for the last year or so: https://github.com/bab2min/EigenRand/issues/56 : <48h21:47:05
28 Sep 2024
@shekhinah:she.khinah.xyzshekhinah changed their profile picture.07:04:58
@kaya:catnip.eekaya 𖤐 changed their profile picture.16:55:46
1 Oct 2024
@-_o:matrix.org-_o joined the room.21:00:15
2 Oct 2024
@hexa:lossy.networkhexa Gaétan Lepage: please take care of tensordict 00:25:19
@hexa:lossy.networkhexa [image attachment: image.png]00:25:22
@glepage:matrix.orgGaétan Lepage Sure, I will have a look right now.
I have not faced any failure on my end, weird...
06:21:33
@glepage:matrix.orgGaétan LepageIs this on staging?06:23:26
@glepage:matrix.orgGaétan Lepage All failures that I was able to find on hydra are timeouts or upstream dependency failures.
I was able to build tensordict on all architectures...
07:05:50
@hexa:lossy.networkhexathis is on trunk11:03:39
@hexa:lossy.networkhexathen you probably need to increase meta.timeout11:04:00
@glepage:matrix.orgGaétan Lepage Now that you mention it, I remember this package being stuck (indefinitely) during mass rebuilds.
I don't know if increasing the timeout will help. When everything works fine, it builds in ~1min...
Also, nothing has changed in the derivation for the past few months.
11:47:12
