
NixOS CUDA

290 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



9 Jun 2024
@connorbaker:matrix.orgconnor (he/him)
In reply to @glepage:matrix.org
connor (he/him) (UTC-5), in case you have a bit of available CPU time, could you please run a nixpkgs-review pr --post-result 317576 ?
(If you don't want to, that's fine ofc)
Rerunning it by the way; got stuck for 20h+ on tensordict’s checkPhase :/
04:05:26
@glepage:matrix.orgGaétan Lepage
In reply to @connorbaker:matrix.org
Rerunning it by the way; got stuck for 20h+ on tensordict’s checkPhase :/
Thanks. The problematic tensordict test has been disabled in https://github.com/NixOS/nixpkgs/pull/318111
07:47:15
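For context, hanging or flaky tests like this are usually skipped through pytestCheckHook's disabledTests list. A minimal sketch of such an override follows; the test name is a placeholder and the actual change is whatever landed in https://github.com/NixOS/nixpkgs/pull/318111:

# Hypothetical overlay sketch; the real fix lives in NixOS/nixpkgs#318111
final: prev: {
  pythonPackagesExtensions = prev.pythonPackagesExtensions ++ [
    (pyFinal: pyPrev: {
      tensordict = pyPrev.tensordict.overridePythonAttrs (old: {
        # skip the test that kept checkPhase running for 20h+ (placeholder name)
        disabledTests = (old.disabledTests or [ ]) ++ [ "test_that_hangs" ];
      });
    })
  ];
}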
@glepage:matrix.orgGaétan Lepage I now get stuck on python312Packages.botorch 07:47:27
@connorbaker:matrix.orgconnor (he/him)I have a custom config I use specifically for nixpkgs-review that you may like17:59:18
@connorbaker:matrix.orgconnor (he/him)https://gist.github.com/ConnorBaker/305b1aebd7ee74a258a616bbbd4dcd7b17:59:55
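The gist itself isn't reproduced here; as a rough illustration only, a CUDA-oriented nixpkgs config for reviews tends to look something like the following (all values are assumptions, not the contents of the linked gist):

# Illustrative review config, not the linked gist
{
  allowUnfree = true;            # CUDA packages are unfree
  cudaSupport = true;            # build the review set with CUDA enabled
  cudaCapabilities = [ "8.9" ];  # limit to one GPU architecture to keep build times down
}

Something like this can be passed inline through nixpkgs-review's --extra-nixpkgs-config flag, e.g. nixpkgs-review pr 317576 --extra-nixpkgs-config '{ allowUnfree = true; cudaSupport = true; }'.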
@glepage:matrix.orgGaétan Lepage
In reply to @connorbaker:matrix.org
I have a custom config I use specifically for nixpkgs-review you may like
Wow
18:07:32
@glepage:matrix.orgGaétan LepageSo botorch did build for you?18:07:41
@connorbaker:matrix.orgconnor (he/him)Yeah it did after I disabled checks for it18:15:10
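As a sketch of what "disabled checks" means here (names and paths assumed, not the merged change):

# Hypothetical review expression: build botorch without running its test suite
let
  pkgs = import <nixpkgs> {
    config = { allowUnfree = true; cudaSupport = true; };
  };
in
pkgs.python312Packages.botorch.overridePythonAttrs (_: {
  doCheck = false;  # the tests hang, so skip them for the review build
})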
@connorbaker:matrix.orgconnor (he/him) Just posted three variations of nixpkgs-review on the PR, https://github.com/NixOS/nixpkgs/pull/317576 18:15:45
@connorbaker:matrix.orgconnor (he/him)Looks good to me!18:15:49
@connorbaker:matrix.orgconnor (he/him)I'm going to try a run with nix-cuda-test real quick18:16:02
@glepage:matrix.orgGaétan LepageThank you so much!18:16:45
@glepage:matrix.orgGaétan Lepageyes, for me it hangs in the tests...18:16:54
@glepage:matrix.orgGaétan Lepage If this is not the case on master, we should probably investigate that?
Anyway, considering that the vast majority of the downstream packages still build fine, I would argue for merging this PR.
18:17:49
@glepage:matrix.orgGaétan Lepage As a more general thought, I find it very important to mark broken packages as such, as it saves us from having to dig through the nixpkgs-review failures every time to work out whether a breakage is a regression or not. 18:18:53
@connorbaker:matrix.orgconnor (he/him)Agreed; I can't do it fast enough, which is why I've just got that config I use18:25:26
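For reference, marking a known failure usually amounts to a meta.broken condition in the package expression, so review runs and Hydra report it as known-broken rather than as a fresh regression; the condition below is illustrative:

meta = with lib; {
  # ...
  # record the known failure so reviewers can distinguish it from regressions
  broken = cudaSupport;  # or a narrower condition, e.g. a specific CUDA or Python version
};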
@connorbaker:matrix.orgconnor (he/him) If I succeed in running nix-cuda-test, are you okay with me merging it? 18:30:56
@glepage:matrix.orgGaétan Lepage
In reply to @connorbaker:matrix.org
If I succeed in running nix-cuda-test, are you okay with me merging it?
Yes, I am OK with it.
19:22:14
@glepage:matrix.orgGaétan LepageI will go through the failures once again while this completes19:22:27
@hexa:lossy.networkhexa
In reply to @ss:someonex.net
Hmm there's a PR starting with python3Packages.torch: but Ofborg didn't try building it, just shows 20 green (eval&c) checks
https://github.com/NixOS/ofborg/issues/577
19:30:37
@hexa:lossy.networkhexause python311Packages or python312Packages instead19:30:56
@glepage:matrix.orgGaétan Lepage SomeoneSerge (UTC+3) what is your opinion on merging the torch update as-is? 19:47:12
@glepage:matrix.orgGaétan LepageI am pretty confident in the absence of regression in this PR19:48:03
@connorbaker:matrix.orgconnor (he/him) Gaétan Lepage: have you had a chance to try training a model with torch.compile? 20:27:04
@connorbaker:matrix.orgconnor (he/him) I've been testing with nix run -L --override-input nixpkgs github:nixos/nixpkgs/6f0e1545adfa64c9f3a22f5ce789b9f509080abd .#nix-cuda-test run inside https://github.com/ConnorBaker/nix-cuda-test 20:28:56
@connorbaker:matrix.orgconnor (he/him)
$ nix run -L --override-input nixpkgs github:nixos/nixpkgs/6f0e1545adfa64c9f3a22f5ce789b9f509080abd .#nix-cuda-test -- --compile
warning: not writing modified lock file of flake 'git+file:///home/connorbaker/nix-cuda-test':
• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/593754412bff02f735ba339d7a3afda41ad19bb5?narHash=sha256-a%2BVM3UnER9KOFZBPjIin3ojO1h3m4NzR9y8wwLka6oQ%3D' (2024-06-09)
  → 'github:nixos/nixpkgs/6f0e1545adfa64c9f3a22f5ce789b9f509080abd?narHash=sha256-EQDc%2BmcEQG7Q1PzZKikAnX5YtAHT/KjFR773m48L7m0%3D' (2024-06-09)
Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Files already downloaded and verified
Files already downloaded and verified
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | model     | ViT              | 86.3 M
-----------------------------------------------
86.3 M    Trainable params
0         Non-trainable params
86.3 M    Total params
345.317   Total estimated model params size (MB)
Sanity Checking DataLoader 0:   0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]ldconfig: Can't open cache file /nix/store/apab5i73dqa09wx0q27b6fbhd1r18ihl-glibc-2.39-31/etc/ld.so.cache
: No such file or directory

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
.nix-cuda-test-wrapped 9 <module>
sys.exit(main())

__main__.py 126 main
trainer.fit(

trainer.py 544 fit
call._call_and_handle_interrupt(

call.py 44 _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)

trainer.py 580 _fit_impl
self._run(model, ckpt_path=ckpt_path)

trainer.py 987 _run
results = self._run_stage()

trainer.py 1031 _run_stage
self._run_sanity_check()

trainer.py 1060 _run_sanity_check
val_loop.run()

utilities.py 182 _decorator
return loop_run(self, *args, **kwargs)

evaluation_loop.py 135 run
self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)

evaluation_loop.py 396 _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_args)

call.py 309 _call_strategy_hook
output = fn(*args, **kwargs)

strategy.py 412 validation_step
return self.lightning_module.validation_step(*args, **kwargs)

eval_frame.py 451 _fn
return fn(*args, **kwargs)

convert_frame.py 921 catch_errors
return callback(frame, cache_entry, hooks, frame_state, skip=1)

convert_frame.py 786 _convert_frame
result = inner_convert(

convert_frame.py 400 _convert_frame_assert
return _compile(

contextlib.py 81 inner
return func(*args, **kwds)

convert_frame.py 676 _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)

utils.py 262 time_wrapper
r = func(*args, **kwargs)

convert_frame.py 535 compile_inner
out_code = transform_code_object(code, transform)

bytecode_transformation.py 1036 transform_code_object
transformations(instructions, code_options)

convert_frame.py 165 _fn
return fn(*args, **kwargs)

convert_frame.py 500 transform
tracer.run()

symbolic_convert.py 2149 run
super().run()

symbolic_convert.py 810 run
and self.step()

symbolic_convert.py 773 step
getattr(self, inst.opname)(inst)

symbolic_convert.py 484 wrapper
return handle_graph_break(self, inst, speculation.reason)

symbolic_convert.py 548 handle_graph_break
self.output.compile_subgraph(self, reason=reason)

output_graph.py 1001 compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)

contextlib.py 81 inner
return func(*args, **kwds)

output_graph.py 1178 compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)

utils.py 262 time_wrapper
r = func(*args, **kwargs)

output_graph.py 1251 call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(

output_graph.py 1232 call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())

after_dynamo.py 117 debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)

__init__.py 1731 __call__
return compile_fx(model_, inputs_, config_patches=self.config)

contextlib.py 81 inner
return func(*args, **kwds)

compile_fx.py 1330 compile_fx
return aot_autograd(

common.py 58 compiler_fn
cg = aot_module_simplified(gm, example_inputs, **kwargs)

aot_autograd.py 903 aot_module_simplified
compiled_fn = create_aot_dispatcher_function(

utils.py 262 time_wrapper
r = func(*args, **kwargs)

aot_autograd.py 628 create_aot_dispatcher_function
compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)

runtime_wrappers.py 443 aot_wrapper_dedupe
return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)

runtime_wrappers.py 648 aot_wrapper_synthetic_base
return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)

jit_compile_runtime_wrappers.py 119 aot_dispatch_base
compiled_fw = compiler(fw_module, updated_flat_args)

utils.py 262 time_wrapper
r = func(*args, **kwargs)

compile_fx.py 1257 fw_compiler_base
return inner_compile(

after_aot.py 83 debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)

debug.py 304 inner
return fn(*args, **kwargs)

contextlib.py 81 inner
return func(*args, **kwds)

contextlib.py 81 inner
return func(*args, **kwds)

utils.py 262 time_wrapper
r = func(*args, **kwargs)

compile_fx.py 438 compile_fx_inner
compiled_graph = fx_codegen_and_compile(

compile_fx.py 714 fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()

graph.py 1307 compile_to_fn
return self.compile_to_module().call

utils.py 262 time_wrapper
r = func(*args, **kwargs)

graph.py 1250 compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()

graph.py 1208 codegen
self.scheduler.codegen()

utils.py 262 time_wrapper
r = func(*args, **kwargs)

scheduler.py 2339 codegen
self.get_backend(device).codegen_nodes(node.get_nodes())  # type: ignore[possibly-undefined]

cuda_combined_scheduling.py 63 codegen_nodes
return self._triton_scheduling.codegen_nodes(nodes)

triton.py 3255 codegen_nodes
return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)

triton.py 3425 codegen_node_schedule
src_code = kernel.codegen_kernel()

triton.py 2753 codegen_kernel
"backend_hash": torch.utils._triton.triton_hash_with_backend(),

_triton.py 101 triton_hash_with_backend
backend_hash = triton_backend_hash()

_triton.py 37 triton_backend_hash
from triton.common.backend import get_backend, get_cuda_version_key

torch._dynamo.exc.BackendCompilerFailed:
backend='inductor' raised:
ImportError: cannot import name 'get_cuda_version_key' from 'triton.common.backend' (/nix/store/4pd9qb5sd865n8nms3vadx83kzzr6i8v-python3.11-triton-2.1.0/lib/python3.11/site-packages/triton/common/backend.py)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
20:30:06
@glepage:matrix.orgGaétan LepageNope I haven't tried20:36:00
@glepage:matrix.orgGaétan LepageIs this with my branch or from master?20:36:12


