NixOS CUDA - Public Room Timeline

	NixOS CUDA	311 Members
	CUDA packages maintenance and support in nixpkgs | https://github.com/orgs/NixOS/projects/27/ | https://nixos.org/manual/nixpkgs/unstable/#cuda	62 Servers


Sender	Message	Time
17 May 2024
evax	libcuda.so is under /usr/lib/wsl/lib	08:56:55
evax	some finding, under NixOS-WSL, with the option to use windows drivers, a wsl-lib package is created in the nix store linking the contents of /usr/lib/wsl/lib	09:11:06
evax	it's exposed under /run/opengl-driver/lib	09:11:57
evax	it might just be that jax is expecting cuda12 but the actual version in the system is cuda11	09:12:26
SomeoneSerge (matrix works sometimes)	In reply to @evax:matrix.org ("some finding, under NixOS-WSL, with the option to use windows drivers, a wsl-lib package is created in the nix store linking the contents of /usr/lib/wsl/lib"): Good, this sounds much safer than putting `/usr/lib/wsl` in `LD_LIBRARY_PATH`	09:12:40
SomeoneSerge (matrix works sometimes)	In reply to @evax:matrix.org ("it might just be that jax is expecting cuda12 but the actual version in the system is cuda11"): It links its CUDA libraries directly, and the driver is likely compatible with both releases	09:13:07
evax	another finding: using jaxlibWithCuda (the Nix-compiled version), jax complains there's no CUDA-enabled jaxlib, while using jaxlib-bin there's an error message related to loading CUDA	09:14:53
evax	(I can't cut/paste/gist from that system, sorry)	09:16:40
evax	the jaxlib-bin error (with TF_CPP_MIN_LOG_LEVEL=0) is `external/tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.`	09:22:09
evax	I tried to LD_PRELOAD libcuda.so and it doesn't help	09:22:29
evax	with jaxlibWithCuda, the error is `An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.`	09:24:35
evax	torch finds the GPU with LD_LIBRARY_PATH pointing to either `/usr/lib/wsl/lib` or `/run/opengl-driver/lib`, but not without it; for jaxWithCuda, none of these options work	09:46:50
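One way to narrow this down is to check, outside of jax and torch, whether the driver library in each of the two directories mentioned above can be loaded at all, and which CUDA release it reports supporting (relevant to the cuda11 vs cuda12 question, since a driver supports applications built for its own CUDA release and older ones). A minimal Python sketch, assuming the driver is present as `libcuda.so.1` or `libcuda.so` in those directories; `probe_libcuda` and `decode_driver_version` are hypothetical helper names. A successful load here only shows the driver is reachable, not that a given jaxlib build will use it.

```python
import ctypes
import os

# Directories discussed in this thread; which ones exist depends on the host.
CANDIDATE_DIRS = ("/run/opengl-driver/lib", "/usr/lib/wsl/lib")
# Assumed file names: the usual soname plus the unversioned symlink.
CANDIDATE_NAMES = ("libcuda.so.1", "libcuda.so")


def decode_driver_version(version):
    """CUDA packs versions as 1000 * major + 10 * minor, e.g. 12040 -> (12, 4)."""
    return version // 1000, (version % 1000) // 10


def probe_libcuda(dirs=CANDIDATE_DIRS):
    """dlopen libcuda from each candidate directory and ask the driver which CUDA
    release it supports. Returns (path, (major, minor)), or None if nothing loads."""
    for directory in dirs:
        for name in CANDIDATE_NAMES:
            path = os.path.join(directory, name)
            if not os.path.exists(path):
                continue
            try:
                lib = ctypes.CDLL(path)
            except OSError:
                continue  # file exists but cannot be loaded (missing deps, wrong arch, ...)
            version = ctypes.c_int(0)
            # cuDriverGetVersion(int*) returns a CUresult; 0 is CUDA_SUCCESS.
            if lib.cuDriverGetVersion(ctypes.byref(version)) == 0:
                return path, decode_driver_version(version.value)
    return None


if __name__ == "__main__":
    result = probe_libcuda()
    print(result if result else "no loadable libcuda in the candidate directories")
```

If this succeeds for one directory but jax still falls back to CPU, the problem is more likely in how the jaxlib build locates the driver than in the driver itself.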
connor (burnt/out) (UTC-8)	Okay, tired of machines restarting, I just bought three different kits of RAM to replace the existing kits in my builders, and two 10GbE NICs to try to increase builder performance, since they’re all networked together and the 2.5GbE on two of the machines was a bottleneck.	17:09:15
connor (burnt/out) (UTC-8)	God I hate hardware 🫠	17:09:20
Gaétan Lepage	How many systems do you have as builders?	20:27:05
connor (burnt/out) (UTC-8)	I have three desktops I use as builders; I also pay for an aarch64-linux Hetzner server which I use for aarch64-linux builds for CI	23:13:40
18 May 2024
Gaétan Lepage	Ok cool! I am starting to think about building a workstation for Nix builds. Would you mind sharing the specs of your machines?	11:53:53
connor (burnt/out) (UTC-8)	Sure! Although keep in mind I've had a very difficult time managing consumer-grade hardware (especially given that I use ASUS motherboards, whose stupid default voltage levels trigger instability in games and also trigger very hard to reproduce segfaults during Nix builds)	12:16:19
connor (burnt/out) (UTC-8)	My main machine: https://pcpartpicker.com/user/connorbaker/saved/pxtbkL; a builder: https://pcpartpicker.com/user/connorbaker/saved/h6mvZL; a builder/storage machine: https://pcpartpicker.com/user/connorbaker/saved/Pyy7CJ	12:51:29
connor (burnt/out) (UTC-8)	FWIW, with the default set of capabilities, `magma-cuda-static` takes ~19m30s to build on `nixos-desktop` and ~21m12s on `nixos-build01` or `nixos-ext`.	12:52:26
connor (burnt/out) (UTC-8)	However, I would strongly recommend writing a few scripts to provision an Azure instance instead. For example, `Standard_HB120rs_v3` (https://learn.microsoft.com/en-us/azure/virtual-machines/hbv3-series) is available as a spot instance in US-East for just $0.36 an hour. Keep in mind that it has a 10Gb NIC in addition to two 1TB NVMe drives. It's also server-grade hardware, so no need to chase down segfaults caused by the motherboard melting your nice chips :)	12:54:53
connor (burnt/out) (UTC-8)	I mean seriously, just in troubleshooting stability issues yesterday I got frustrated and got new RAM for all my machines. That was about $1000 -- that would have bought me ~2,777h of the HBv3 as a spot instance.	12:58:17
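The ~2,777h figure is the RAM budget divided by the spot price. A tiny Python sketch of that arithmetic, assuming the $0.36/h spot price quoted above stays constant (real spot prices fluctuate); `spot_hours` is a hypothetical helper name.

```python
import math


def spot_hours(budget_usd, hourly_usd):
    """Whole instance-hours a budget buys at a fixed hourly spot price."""
    return math.floor(budget_usd / hourly_usd)


print(spot_hours(1000, 0.36))  # 2777
```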
SomeoneSerge (matrix works sometimes)	`magma_compute_hours = (19.5 / 60) * 24 * 2` (24 hyper-threading cores), then `2777 / magma_compute_hours` = `178.0128205128205`. AFAIU after about 180 magma builds Azure will have cost more than your RAM, and I think we build several magmas a day 🤔	18:46:46
19 May 2024
connor (burnt/out) (UTC-8)	Correction: the i9-13900K has 32 threads in total (24 cores, of which the 8 performance cores are hyper-threaded and the 16 efficiency cores are not): `magma_compute_hours = (19.5 / 60) * 32` (32 "cores"), then `2777 / magma_compute_hours` = `267.01923076923`	01:36:25
connor (burnt/out) (UTC-8)	However, that assumes it takes magma the same amount of time to build on an i9-13900k as it does on the HBv3 (it does not)	01:36:50
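The two break-even estimates can be reproduced with one function. This Python sketch only restates the back-of-the-envelope arithmetic from the messages above (2777 spot hours divided by build-minutes times thread count); like the original, it assumes a magma build costs the same on the HBv3 as on the i9-13900K, which connor points out is not true. The function names are hypothetical.

```python
def core_hours_per_build(build_minutes, threads):
    """Build time in hours multiplied by the number of hardware threads."""
    return build_minutes / 60 * threads


def break_even_builds(instance_hours, build_minutes, threads):
    """Number of builds after which the cloud budget is used up, using the
    same simplified model as the chat messages."""
    return instance_hours / core_hours_per_build(build_minutes, threads)


# SomeoneSerge's estimate: 24 cores counted twice for hyper-threading
print(round(break_even_builds(2777, 19.5, 24 * 2), 2))  # 178.01
# connor's correction: 32 hardware threads
print(round(break_even_builds(2777, 19.5, 32), 2))      # 267.02
```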
aidalgol	`nvidia-smi` is reporting 0% GPU usage even when I am running a game and I can hear my card's fans speed up. Is it reporting correctly for anyone else?	09:47:55
aidalgol	It sounds exactly like this: https://forums.developer.nvidia.com/t/nvidia-smi-reporting-0-gpu-utilization/261878	09:48:51


Room Version: 9