| 17 May 2024 |
evax | libcuda.so is under /usr/lib/wsl/lib | 08:56:55 |
evax | One finding: under NixOS-WSL, with the option to use the Windows drivers, a wsl-lib package is created in the Nix store linking the contents of /usr/lib/wsl/lib | 09:11:06 |
evax | it's exposed under /run/opengl-driver/lib | 09:11:57 |
evax | it might just be that jax is expecting cuda12 but the actual version in the system is cuda11 | 09:12:26 |
SomeoneSerge (matrix works sometimes) | In reply to @evax:matrix.org "One finding: under NixOS-WSL, with the option to use the Windows drivers, a wsl-lib package is created in the Nix store linking the contents of /usr/lib/wsl/lib": Good, this sounds much safer than putting /usr/lib/wsl in LD_LIBRARY_PATH | 09:12:40 |
SomeoneSerge (matrix works sometimes) | In reply to @evax:matrix.org "it might just be that jax is expecting cuda12 but the actual version in the system is cuda11": It links its cuda libraries directly, and the driver is likely compatible with both releases | 09:13:13 |
evax | Another finding: with jaxlibWithCuda (the Nix-compiled version) jax complains there's no CUDA-enabled jaxlib, while with jaxlib-bin there's an error message related to loading CUDA | 09:14:53 |
evax | (I can't cut/paste/gist from that system, sorry) | 09:16:40 |
evax | the jaxlib-bin error (with TF_CPP_MIN_LOG_LEVEL=0) is: external/tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used. | 09:22:09 |
evax | I tried to LD_PRELOAD libcuda.so and it doesn't help | 09:22:29 |
evax | with jaxlibWithCuda, the error is: An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu. | 09:24:35 |
evax | torch finds the GPU with LD_LIBRARY_PATH pointing to either /usr/lib/wsl/lib or /run/opengl-driver/lib, but not without it. For jaxWithCuda, none of these options work | 09:46:50 |
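One way to separate "the driver library cannot be found" from "the driver library loads but the framework still rejects it" is to load libcuda by explicit path, without relying on LD_LIBRARY_PATH. The following is a minimal sketch, not something posted in the thread: the helper names `find_libcuda` and `probe_driver` are hypothetical, the two directories are the ones mentioned above, and `cuInit` is the CUDA driver API entry point (a return value of 0 means CUDA_SUCCESS).

```python
import ctypes
import os

# Candidate directories mentioned in the discussion above.
CANDIDATE_DIRS = ["/usr/lib/wsl/lib", "/run/opengl-driver/lib"]


def find_libcuda(dirs):
    """Return the first existing libcuda path under dirs, or None."""
    for d in dirs:
        for name in ("libcuda.so.1", "libcuda.so"):
            path = os.path.join(d, name)
            if os.path.exists(path):
                return path
    return None


def probe_driver(path):
    """Load libcuda from an explicit path and call cuInit(0).

    Returns the CUresult code (0 means CUDA_SUCCESS), or None if the
    library could not be loaded at all.
    """
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)


if __name__ == "__main__":
    path = find_libcuda(CANDIDATE_DIRS)
    if path is None:
        print("no libcuda found in", CANDIDATE_DIRS)
    else:
        print(path, "-> cuInit returned", probe_driver(path))
```

If this reports a successful `cuInit` while jax still falls back to CPU, the problem is more likely in how jaxlib locates or loads its CUDA libraries than in the driver itself.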
connor (burnt/out) (UTC-8) | Okay, tired of machines restarting
I just bought three different kits of RAM to replace the existing kits in my builders. And two 10GbE NICs to try to increase builder performance, since they're all networked together and the 2.5GbE on two of the machines was a bottleneck. | 17:09:15 |
connor (burnt/out) (UTC-8) | God I hate hardware 🫠 | 17:09:20 |
Gaétan Lepage | How many systems do you have as builders? | 20:27:05 |
connor (burnt/out) (UTC-8) | I have three desktops I use as builders; I also pay for an aarch64-linux Hetzner server which I use for aarch64-linux builds for CI | 23:13:40 |
| 18 May 2024 |
Gaétan Lepage | Ok cool!
I am starting to think about building a workstation for nix builds.
Would you mind sharing the specs of your machines? | 11:53:53 |
connor (burnt/out) (UTC-8) | Sure! Although keep in mind I've had a very difficult time managing consumer-grade hardware (especially given I use ASUS motherboards, and the stupid default voltage levels which trigger instability in games also trigger very hard-to-reproduce segfaults during Nix builds) | 12:16:19 |
connor (burnt/out) (UTC-8) | My main machine: https://pcpartpicker.com/user/connorbaker/saved/pxtbkL
A builder: https://pcpartpicker.com/user/connorbaker/saved/h6mvZL
A builder/storage: https://pcpartpicker.com/user/connorbaker/saved/Pyy7CJ | 12:51:29 |
connor (burnt/out) (UTC-8) | FWIW, it takes magma-cuda-static with the default set of capabilities ~19m30s to build on nixos-desktop and ~21m12s to build on nixos-build01 or nixos-ext. | 12:52:26 |
connor (burnt/out) (UTC-8) | However, I would strongly recommend writing a few scripts to provision an Azure instance instead. For example, Standard_HB120rs_v3 (https://learn.microsoft.com/en-us/azure/virtual-machines/hbv3-series) is available as a spot instance in US-East for just $0.36 an hour. Keep in mind that it has a 10Gb NIC in addition to two 1TB NVMe drives. It's also server-grade hardware, so no need to chase down segfaults caused by the motherboard melting your nice chips :) | 12:54:53 |
connor (burnt/out) (UTC-8) | I mean seriously, just in troubleshooting stability issues yesterday I got frustrated and got new RAM for all my machines. That was about $1000 -- that would have bought me ~2,777h of the HBv3 as a spot instance. | 12:58:17 |
SomeoneSerge (matrix works sometimes) | >>> magma_compute_hours = (19.5 / 60) * 24 * 2 # 24 hyper-threading cores
>>> 2777 / magma_compute_hours
178.0128205128205
AFAIU after about 180 magma builds Azure will have cost more than your RAM, and I think we build several magmas a day 🤔 | 18:47:59 |
| 19 May 2024 |
connor (burnt/out) (UTC-8) | Correction: the i9-13900k has 32 hardware threads in total, since some of its cores are hyper-threaded and others are not
>>> magma_compute_hours = (19.5 / 60) * 32 # 32 "cores"
>>> 2777 / magma_compute_hours
267.01923076923
| 01:36:25 |
connor (burnt/out) (UTC-8) | However, that assumes it takes magma the same amount of time to build on an i9-13900k as it does on the HBv3 (it does not) | 01:36:50 |
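The estimates above can be restated with the inputs made explicit. This is a rough sketch under stated assumptions, not a benchmark: the function names are hypothetical, the $1000 budget, $0.36 hourly spot price, and 19.5 minute build time come from the messages above, and the build time on the HBv3 itself is unknown, so it is left as a parameter. The first function reproduces the core-hours arithmetic from the thread; the second assumes instead that the spot price is billed per instance-hour regardless of how many cores a build uses.

```python
def spot_hours_for_budget(budget_usd, rate_usd_per_hour):
    """Hours of spot instance time that a fixed budget buys."""
    return budget_usd / rate_usd_per_hour


def core_hour_estimate(instance_hours, build_minutes, threads):
    """The thread's arithmetic: instance hours divided by the
    thread-hours one build consumes on the desktop."""
    return instance_hours / ((build_minutes / 60) * threads)


def builds_per_budget(budget_usd, rate_usd_per_hour, build_minutes_on_instance):
    """Builds a budget covers if billing is per instance-hour
    (an assumption) and one build runs at a time."""
    hours = spot_hours_for_budget(budget_usd, rate_usd_per_hour)
    return hours / (build_minutes_on_instance / 60)


# $1000 at $0.36 per hour, as quoted above.
print(int(spot_hours_for_budget(1000, 0.36)))  # 2777
```

Calling `core_hour_estimate(2777, 19.5, 48)` and `core_hour_estimate(2777, 19.5, 32)` gives the two figures quoted in the thread (about 178 and about 267). `builds_per_budget` needs a measured HBv3 build time before its result means anything.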
aidalgol | nvidia-smi is reporting 0% GPU usage even when I am running a game and I can hear my card's fans speed up. Is it reporting correctly for anyone else? | 09:47:55 |
aidalgol | It sounds exactly like this: https://forums.developer.nvidia.com/t/nvidia-smi-reporting-0-gpu-utilization/261878 | 09:48:51 |
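For anyone wanting to compare readings, a small script can sample the same counters that nvidia-smi displays. This is a minimal sketch, assuming `nvidia-smi` is on PATH; the function names are hypothetical, while `--query-gpu` with the `utilization.gpu` and `utilization.memory` fields and `--format=csv,noheader,nounits` are standard nvidia-smi options. Fields the driver does not report (shown as `[N/A]`) are returned as None.

```python
import subprocess


def parse_utilization(csv_text):
    """Parse 'gpu, memory' utilization lines into (int or None) pairs."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Non-numeric fields such as "[N/A]" become None.
        rows.append(tuple(int(f) if f.isdigit() else None for f in fields))
    return rows


def query_utilization():
    """Run nvidia-smi once and return one (gpu %, memory %) pair per GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,utilization.memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_utilization(result.stdout)
```

Calling `query_utilization()` in a loop while the game is running shows whether the 0% reading is constant or only appears in the default summary view.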