
NixOS CUDA

312 Members
CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda



9 Jun 2024
@glepage:matrix.org Gaétan Lepage: SomeoneSerge (UTC+3), what is your opinion on merging the torch update as is? 19:47:12
@glepage:matrix.org Gaétan Lepage: I am pretty confident in the absence of regression in this PR. 19:48:03
@connorbaker:matrix.org connor (he/him): Gaétan Lepage: have you had a chance to try training a model with torch.compile? 20:27:04
@connorbaker:matrix.org connor (he/him): I've been testing with nix run -L --override-input nixpkgs github:nixos/nixpkgs/6f0e1545adfa64c9f3a22f5ce789b9f509080abd .#nix-cuda-test run inside https://github.com/ConnorBaker/nix-cuda-test 20:28:56
@connorbaker:matrix.org connor (he/him):
$ nix run -L --override-input nixpkgs github:nixos/nixpkgs/6f0e1545adfa64c9f3a22f5ce789b9f509080abd .#nix-cuda-test -- --compile
warning: not writing modified lock file of flake 'git+file:///home/connorbaker/nix-cuda-test':
• Updated input 'nixpkgs':
    'github:nixos/nixpkgs/593754412bff02f735ba339d7a3afda41ad19bb5?narHash=sha256-a%2BVM3UnER9KOFZBPjIin3ojO1h3m4NzR9y8wwLka6oQ%3D' (2024-06-09)
  → 'github:nixos/nixpkgs/6f0e1545adfa64c9f3a22f5ce789b9f509080abd?narHash=sha256-EQDc%2BmcEQG7Q1PzZKikAnX5YtAHT/KjFR773m48L7m0%3D' (2024-06-09)
Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Files already downloaded and verified
Files already downloaded and verified
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | model     | ViT              | 86.3 M
-----------------------------------------------
86.3 M    Trainable params
0         Non-trainable params
86.3 M    Total params
345.317   Total estimated model params size (MB)
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]ldconfig: Can't open cache file /nix/store/apab5i73dqa09wx0q27b6fbhd1r18ihl-glibc-2.39-31/etc/ld.so.cache: No such file or directory

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
.nix-cuda-test-wrapped 9 <module>
sys.exit(main())

__main__.py 126 main
trainer.fit(

trainer.py 544 fit
call._call_and_handle_interrupt(

call.py 44 _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)

trainer.py 580 _fit_impl
self._run(model, ckpt_path=ckpt_path)

trainer.py 987 _run
results = self._run_stage()

trainer.py 1031 _run_stage
self._run_sanity_check()

trainer.py 1060 _run_sanity_check
val_loop.run()

utilities.py 182 _decorator
return loop_run(self, *args, **kwargs)

evaluation_loop.py 135 run
self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)

evaluation_loop.py 396 _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_args)

call.py 309 _call_strategy_hook
output = fn(*args, **kwargs)

strategy.py 412 validation_step
return self.lightning_module.validation_step(*args, **kwargs)

eval_frame.py 451 _fn
return fn(*args, **kwargs)

convert_frame.py 921 catch_errors
return callback(frame, cache_entry, hooks, frame_state, skip=1)

convert_frame.py 786 _convert_frame
result = inner_convert(

convert_frame.py 400 _convert_frame_assert
return _compile(

contextlib.py 81 inner
return func(*args, **kwds)

convert_frame.py 676 _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)

utils.py 262 time_wrapper
r = func(*args, **kwargs)

convert_frame.py 535 compile_inner
out_code = transform_code_object(code, transform)

bytecode_transformation.py 1036 transform_code_object
transformations(instructions, code_options)

convert_frame.py 165 _fn
return fn(*args, **kwargs)

convert_frame.py 500 transform
tracer.run()

symbolic_convert.py 2149 run
super().run()

symbolic_convert.py 810 run
and self.step()

symbolic_convert.py 773 step
getattr(self, inst.opname)(inst)

symbolic_convert.py 484 wrapper
return handle_graph_break(self, inst, speculation.reason)

symbolic_convert.py 548 handle_graph_break
self.output.compile_subgraph(self, reason=reason)

output_graph.py 1001 compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)

contextlib.py 81 inner
return func(*args, **kwds)

output_graph.py 1178 compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)

utils.py 262 time_wrapper
r = func(*args, **kwargs)

output_graph.py 1251 call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(

output_graph.py 1232 call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())

after_dynamo.py 117 debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)

__init__.py 1731 __call__
return compile_fx(model_, inputs_, config_patches=self.config)

contextlib.py 81 inner
return func(*args, **kwds)

compile_fx.py 1330 compile_fx
return aot_autograd(

common.py 58 compiler_fn
cg = aot_module_simplified(gm, example_inputs, **kwargs)

aot_autograd.py 903 aot_module_simplified
compiled_fn = create_aot_dispatcher_function(

utils.py 262 time_wrapper
r = func(*args, **kwargs)

aot_autograd.py 628 create_aot_dispatcher_function
compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)

runtime_wrappers.py 443 aot_wrapper_dedupe
return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)

runtime_wrappers.py 648 aot_wrapper_synthetic_base
return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)

jit_compile_runtime_wrappers.py 119 aot_dispatch_base
compiled_fw = compiler(fw_module, updated_flat_args)

utils.py 262 time_wrapper
r = func(*args, **kwargs)

compile_fx.py 1257 fw_compiler_base
return inner_compile(

after_aot.py 83 debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)

debug.py 304 inner
return fn(*args, **kwargs)

contextlib.py 81 inner
return func(*args, **kwds)

contextlib.py 81 inner
return func(*args, **kwds)

utils.py 262 time_wrapper
r = func(*args, **kwargs)

compile_fx.py 438 compile_fx_inner
compiled_graph = fx_codegen_and_compile(

compile_fx.py 714 fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()

graph.py 1307 compile_to_fn
return self.compile_to_module().call

utils.py 262 time_wrapper
r = func(*args, **kwargs)

graph.py 1250 compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()

graph.py 1208 codegen
self.scheduler.codegen()

utils.py 262 time_wrapper
r = func(*args, **kwargs)

scheduler.py 2339 codegen
self.get_backend(device).codegen_nodes(node.get_nodes())  # type: ignore[possibly-undefined]

cuda_combined_scheduling.py 63 codegen_nodes
return self._triton_scheduling.codegen_nodes(nodes)

triton.py 3255 codegen_nodes
return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)

triton.py 3425 codegen_node_schedule
src_code = kernel.codegen_kernel()

triton.py 2753 codegen_kernel
"backend_hash": torch.utils._triton.triton_hash_with_backend(),

_triton.py 101 triton_hash_with_backend
backend_hash = triton_backend_hash()

_triton.py 37 triton_backend_hash
from triton.common.backend import get_backend, get_cuda_version_key

torch._dynamo.exc.BackendCompilerFailed:
backend='inductor' raised:
ImportError: cannot import name 'get_cuda_version_key' from 'triton.common.backend' (/nix/store/4pd9qb5sd865n8nms3vadx83kzzr6i8v-python3.11-triton-2.1.0/lib/python3.11/site-packages/triton/common/backend.py)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
20:30:06
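(Aside, not from the chat.) The ImportError above is a torch/triton version mismatch: Inductor expects triton.common.backend.get_cuda_version_key, which the packaged triton 2.1.0 does not export. A hedged sketch of a cheap pre-flight check for this kind of mismatch; has_symbol is a hypothetical helper, demonstrated against stdlib modules since triton may not be installed where this runs:

```python
import importlib


def has_symbol(module_name: str, attr: str) -> bool:
    """Return True if module_name imports cleanly and exposes attr.

    Probing for triton.common.backend's get_cuda_version_key this way
    would surface the torch/triton mismatch before a long training run,
    instead of deep inside Dynamo's backend compilation.
    """
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(mod, attr)


# Demonstrated on stdlib modules (triton is likely absent here):
print(has_symbol("math", "sqrt"))              # True
print(has_symbol("math", "no_such_symbol"))    # False
print(has_symbol("definitely_not_a_module", "x"))  # False
```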
@glepage:matrix.org Gaétan Lepage: Nope, I haven't tried. 20:36:00
@glepage:matrix.org Gaétan Lepage: Is this with my branch or from master? 20:36:12
@connorbaker:matrix.org connor (he/him): This is with the latest commit from your branch. 20:50:29
@connorbaker:matrix.org connor (he/him): It works without compile, just curious if this is a problem with the PR. 20:50:43
@connorbaker:matrix.org connor (he/him): I'll try again with master to make sure it's not a regression. 20:51:08
@connorbaker:matrix.org connor (he/him): Cool, fails on master too. 20:51:57
@glepage:matrix.org Gaétan Lepage: Nice ^^ 20:53:19
@glepage:matrix.org Gaétan Lepage: Is it OK for me to merge now? 20:53:44
@glepage:matrix.org Gaétan Lepage: Oh, I've just seen your message. 20:53:58
10 Jun 2024
@mjolnir:nixos.org NixOS Moderation Bot unbanned @jonringer:matrix.org. 00:17:14
@glepage:matrix.org Gaétan Lepage: [image: clipboard.png] 06:44:40
@glepage:matrix.org Gaétan Lepage: Haha, botorch has probably taken ~11h but it succeeded X) 06:44:56
@shekhinah:she.khinah.xyz shekhinah set their display name to yaldebaoth. 11:02:59
@shekhinah:she.khinah.xyz shekhinah changed their display name from yaldebaoth to yaldabaoth. 11:03:43
@connorbaker:matrix.org connor (he/him): Gaétan Lepage: did you mention there was a PR or something merged to disable the checkPhase or test suite for botorch, or did I misunderstand? 14:01:56
@connorbaker:matrix.org connor (he/him): On another note, has anyone built elpa (https://github.com/NixOS/nixpkgs/blob/master/pkgs/development/libraries/elpa/default.nix) successfully with CUDA support? I let it run for like 20h and it was still building. Seems to compile four object files at a time? 14:04:03
@glepage:matrix.org Gaétan Lepage:
In reply to @connorbaker:matrix.org
Gaétan Lepage: did you mention there was a PR or something merged to disable the checkPhase or test suite for botorch, or did I misunderstand?
No, I have not done anything. I was actually able to build it just fine from master earlier today.
14:29:01
@hexa:lossy.network hexa: Gaétan Lepage: have you considered pulling this patch for tensorflow-bin? https://github.com/tensorflow/tensorflow/issues/58073#issuecomment-2097055553 20:58:34
11 Jun 2024
@keiichi:matrix.org teto: When using localai 2.15 from unstable, and even after a reboot, I get "ggml_cuda_init: failed to initialize CUDA: CUDA driver is a stub library". It's a bit random, but if anyone has a tip, I'll take it. nvidia-smi output looks fine. 00:25:38
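(Aside, not from the chat.) The stub libcuda shipped with cudatoolkit exports the driver API, but its calls fail; the real library comes from the NVIDIA driver package. A hedged diagnostic sketch: load whichever libcuda.so.1 the dynamic linker resolves and call cuInit. The error code 34 mentioned in the comment is what recent CUDA releases document as CUDA_ERROR_STUB_LIBRARY, but treat that value as an assumption:

```python
import ctypes


def probe_cuda_driver() -> str:
    """Report whether the resolved libcuda.so.1 looks like a real driver.

    On a working setup cuInit(0) returns 0 (CUDA_SUCCESS); on the
    cudatoolkit stub it returns a nonzero error (documented as 34,
    CUDA_ERROR_STUB_LIBRARY, in recent CUDA releases).
    """
    try:
        cuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return "libcuda.so.1 not found"
    err = cuda.cuInit(0)
    if err == 0:
        return "real driver (cuInit succeeded)"
    return f"cuInit failed with error {err} (34 would suggest the stub library)"


print(probe_cuda_driver())
```

On a machine hitting teto's symptom, this would show cuInit failing even though nvidia-smi (which talks to the kernel driver differently) looks fine, pointing at a library resolution problem rather than a broken driver.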
@glepage:matrix.org Gaétan Lepage:
In reply to @hexa:lossy.network
Gaétan Lepage: have you considered pulling this patch for tensorflow-bin? https://github.com/tensorflow/tensorflow/issues/58073#issuecomment-2097055553
This looks like it could work!
However, how do you apply a patch to a wheel-type python derivation?
06:38:47
@glepage:matrix.org Gaétan Lepage: What phase of the buildPythonPackage script should I hook it into? 06:39:02
@glepage:matrix.org Gaétan Lepage: I tried patches = [ but it does not work. 06:39:15
@glepage:matrix.org Gaétan Lepage:

I am packaging this: https://github.com/EricLBuehler/mistral.rs?tab=readme-ov-file#installation-and-build
You can see that it supports several build variants (CUDA, Metal, MKL, ...)

-> What should the approach be? Adding cudaSupport? metalSupport? mklSupport?

07:01:41
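(Aside, not from the chat.) A hedged sketch of the usual nixpkgs convention for questions like this: one boolean flag per optional backend, with only cudaSupport deferring to the global config (config.cudaSupport is an established nixpkgs-wide setting); the other flag names and the feature strings are illustrative for mistral.rs:

```nix
# Hypothetical derivation header; cudaSupport follows the nixpkgs-wide
# convention, metalSupport/mklSupport are illustrative per-package flags.
{ lib
, config
, stdenv
, rustPlatform
, cudaSupport ? config.cudaSupport
, metalSupport ? stdenv.hostPlatform.isDarwin
, mklSupport ? false
}:

rustPlatform.buildRustPackage {
  pname = "mistral-rs";
  # ...
  # Map each flag onto the corresponding cargo feature of the project.
  buildFeatures =
    lib.optionals cudaSupport [ "cuda" ]
    ++ lib.optionals metalSupport [ "metal" ]
    ++ lib.optionals mklSupport [ "mkl" ];
}
```

Defaulting metalSupport from the host platform mirrors how Darwin-only backends are usually gated, so the package builds out of the box on both platforms.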
@kaya:catnip.ee kaya 𖤐 changed their profile picture. 08:03:48
@hexa:lossy.network hexa:
In reply to @glepage:matrix.org
This looks like it could work!
However, how do you apply a patch to a wheel-type python derivation?
likely in postInstall 😕
11:58:18
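(Aside, not from the chat.) A hedged sketch of hexa's suggestion: a wheel unpacks to already-built files rather than a source tree, so the patches list (applied during patchPhase) has nothing matching to patch; rewriting the files after they land in $out works instead. The pname and patch file below are illustrative, not the actual tensorflow-bin fix:

```nix
# Hypothetical fragment: patch files a wheel installed into $out,
# since `patches` targets a source tree that wheel builds don't have.
buildPythonPackage {
  pname = "tensorflow";
  format = "wheel";
  # ...
  postInstall = ''
    patch -p1 -d "$out/${python.sitePackages}/tensorflow" \
      < ${./fix-ldconfig-stub.patch}
  '';
}
```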


