| 12 Nov 2025 |
Gaétan Lepage | "replacing crashed worker" is the message pytest prints after one of its worker processes crashes. | 12:12:09 |
Daniel Fahey | Yeah, could be for any reason, still seems a bit fishy | 12:16:40 |
Gaétan Lepage | Try with -j 8 for instance. I have never experienced flakiness when building jax, even though I have done it dozens of times. | 12:17:35 |
Gaétan Lepage | Never tried on Intel though... | 12:17:44 |
Daniel Fahey | I also only saw build failures with python3.12
Looks like this particular version is in the team cache, so it built fine
$ nix-build https://github.com/daniel-fahey/nixpkgs/archive/fix/python3Packages.vllm.tar.gz --pure --arg config '{ allowUnfree = true; cudaSupport = true; }' --attr python312Packages.jax
these 3 paths will be fetched (443.62 MiB download, 443.62 MiB unpacked):
/nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0
/nix/store/78kfvx2q26r5053pkp4g9f9y41hc99xm-python3.12-jax-cuda12-pjrt-0.8.0
/nix/store/3a5iw8yqhsc5x16wllsanyyyzqm3xmvd-python3.12-jax-cuda12-plugin-0.8.0
copying path '/nix/store/78kfvx2q26r5053pkp4g9f9y41hc99xm-python3.12-jax-cuda12-pjrt-0.8.0' from 'https://cache.nixos-cuda.org'...
copying path '/nix/store/3a5iw8yqhsc5x16wllsanyyyzqm3xmvd-python3.12-jax-cuda12-plugin-0.8.0' from 'https://cache.nixos-cuda.org'...
copying path '/nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0' from 'https://cache.nixos-cuda.org'...
/nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0
Latest build with the matching hash https://hydra.nixos-cuda.org/build/8123#tabs-buildsteps
atlas.nixos-cuda.org and https://github.com/nixos-cuda/infra/blob/f150adfab4863b131ef67fb97919fff949793995/hosts/atlas/hardware.nix#L28 suggest it's Intel?
🤔
| 12:32:38 |
Gaétan Lepage | Then maybe some flakiness with your specific CPU... Sometimes things are weird. | 12:37:39 |
Gaétan Lepage | (atlas is configured with cores = 9, maybe this plays a role in building jax successfully) | 12:38:10 |
Daniel Fahey | Same exact hashed versions declared from nixos-unstable and master branches too:
[daniel@laptop:~]$ nix-build https://github.com/NixOS/nixpkgs/archive/nixos-unstable.tar.gz --pure --arg config '{ allowUnfree = true; cudaSupport = true; }' --attr python312Packages.jax --builders ''
unpacking 'https://github.com/NixOS/nixpkgs/archive/nixos-unstable.tar.gz' into the Git cache...
these 3 paths will be fetched (443.62 MiB download, 443.62 MiB unpacked):
/nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0
/nix/store/78kfvx2q26r5053pkp4g9f9y41hc99xm-python3.12-jax-cuda12-pjrt-0.8.0
/nix/store/3a5iw8yqhsc5x16wllsanyyyzqm3xmvd-python3.12-jax-cuda12-plugin-0.8.0
copying path '/nix/store/78kfvx2q26r5053pkp4g9f9y41hc99xm-python3.12-jax-cuda12-pjrt-0.8.0' from 'https://cache.nixos-cuda.org'...
copying path '/nix/store/3a5iw8yqhsc5x16wllsanyyyzqm3xmvd-python3.12-jax-cuda12-plugin-0.8.0' from 'https://cache.nixos-cuda.org'...
copying path '/nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0' from 'https://cache.nixos-cuda.org'...
/nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0
[daniel@laptop:~]$ nix-build https://github.com/NixOS/nixpkgs/archive/master.tar.gz --pure --arg config '{ allowUnfree = true; cudaSupport = true; }' --attr python312Packages.jax --builders ''
unpacking 'https://github.com/NixOS/nixpkgs/archive/master.tar.gz' into the Git cache...
/nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0
| 12:38:15 |
Gaétan Lepage | So you get a cache hit, all good then? | 12:39:15 |
Daniel Fahey | Yeah, Ari Lotter, might be a good idea to add this cache (as well as Flox's):
extra-trusted-substituters = https://cache.nixos-cuda.org
extra-trusted-public-keys = cache.nixos-cuda.org:74DUi4Ye579gUqzH4ziL9IyiJBlDpMRn9MBN8oNan9M=
| 12:44:00 |
Daniel Fahey | Yep, no cache hit with Flox:
[daniel@laptop:~]$ nix path-info --store https://cache.flox.dev /nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0
these 21 paths will be fetched (2586.20 MiB download, 4491.51 MiB unpacked):
/nix/store/mx2c21i61q6mm21cr27h3kpz09z9j3ds-cuda12.8-cuda_cccl-12.8.90
/nix/store/60bccal8rk5zm3nsxszvfvv6754imwcl-cuda12.8-cuda_cudart-12.8.90
/nix/store/js94l573zp6a325irbymcpajr95r8011-cuda12.8-cuda_cupti-12.8.90-lib
/nix/store/a9d5nqjvd81kq3rxpch647xxasvfvvpi-cuda12.8-cuda_nvcc-12.8.93
/nix/store/pdjnbw4sa9f4mag54hxq8wrk5qidk6pn-cuda12.8-cudnn-9.13.0.50-lib
/nix/store/1c0jcdqaf7pjf28jsizkysy6h1pj2048-cuda12.8-libcublas-12.8.4.1-lib
/nix/store/x1pf46gpsy4s0b18598p4byagl15im89-cuda12.8-libcufft-11.3.3.83-lib
/nix/store/2myxa089vbhxrls82nhhpi93gr68crwc-cuda12.8-libcusolver-11.7.3.90-lib
/nix/store/lh80i7q850hgk6m55yfxzllhx0mcim88-cuda12.8-libcusparse-12.5.8.93-lib
/nix/store/ajvrjfzbmmi2sarsf6xmjhd1ib1g6a8w-cuda12.8-libnvjitlink-12.8.93-lib
/nix/store/kx6rjjpgybnxci4wfm0yq54zdm4qidnp-cuda12.8-nccl-2.28.7-1
/nix/store/98qfxl63r5s3fa6q9dlaladsrb4pn8n1-python3.12-absl-py-2.3.1
/nix/store/kmi3l0wdnkma0sjfbm6661jsy9957r5g-python3.12-flatbuffers-25.2.10
/nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0
/nix/store/78kfvx2q26r5053pkp4g9f9y41hc99xm-python3.12-jax-cuda12-pjrt-0.8.0
/nix/store/3a5iw8yqhsc5x16wllsanyyyzqm3xmvd-python3.12-jax-cuda12-plugin-0.8.0
/nix/store/cfav66cmsr83k0hf45pps5azhys6kfl8-python3.12-jaxlib-0.8.0
/nix/store/ji79rzmqg5r66bkdhdglzfg9ji2lb32q-python3.12-ml-dtypes-0.5.3
/nix/store/jir196c5rj03a561hzp4scmvv0xcivwn-python3.12-numpy-2.3.3
/nix/store/p95s2s0id7n6lc7czsk5rg7j0qdy8853-python3.12-opt-einsum-3.4.0
/nix/store/2ga29gq011y1wa7gkcqfc1az7cp1mkah-python3.12-scipy-1.16.2
error: path '/nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0' is not valid
[daniel@laptop:~]$ nix path-info --store https://cache.nixos-cuda.org /nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0
/nix/store/s2w4h19yylw9ls7q84j8bd1md62kcrzh-python3.12-jax-0.8.0
| 12:52:31 |
Daniel Fahey | (maybe they're having the same build issue, lol) I don't know enough about them tbh
| 12:58:09 |
Daniel Fahey | Flox are building from their own fork of Nixpkgs (according to https://flox.dev/blog/the-flox-catalog-now-contains-nvidia-cuda/).
Their https://github.com/flox/nixpkgs/tree/unstable is ~10 days old, lol | 13:06:13 |
Daniel Fahey | So much for the private sector | 13:06:18 |
Robbie Buxton | In reply to @daniel-fahey:matrix.org: Are you sure they aren’t just wrapping nix unstable with some hacks for their project? | 15:23:23 |
Ari Lotter | allllllright let's try nixpkgs-review again with the new binary cache :p | 15:30:38 |
Ari Lotter | still building jax, but, uhhhh, we ball | 15:40:02 |
Gaétan Lepage | Which PR? | 15:56:37 |
Ari Lotter | this one https://github.com/NixOS/nixpkgs/pull/460701 | 15:57:23 |
Ari Lotter | yep, workers keep crashing :/
] building python3.12-jax-0.8.0 (pytestCheckPhase): replacing crashed worker gw1 | 15:57:38 |
Ari Lotter | hm,
warning: ignoring the client-specified setting 'sandbox', because it is a restricted setting and you are not a trusted user
warning: ignoring the client-specified setting 'system', because it is a restricted setting and you are not a trusted user
would setting myself as a trusted user fix this, i wonder | 16:02:06 |
Daniel Fahey | Yeah you might want to use extra-substituters (I never grok'd the difference, if I'm being honest) | 16:06:51 |
Daniel Fahey | not sure if it's different with Nix on Ubuntu, but on NixOS, I have to rebuild the system before the binary cache is available. Is there a rebuild step with plain Nix? | 16:08:28 |
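A minimal sketch of what the cache setup discussed above might look like on a plain multi-user Nix install (not NixOS), assuming the system-wide file lives at /etc/nix/nix.conf and the daemon runs under systemd; both are assumptions about the machine, not something stated in the chat. As far as the Nix manual describes it, `substituters` lists the caches that are actually queried, while `trusted-substituters` only lists caches that non-trusted users are allowed to opt into, and the `extra-` prefix appends to a setting instead of replacing it. The user name `alice` is hypothetical.

```
# /etc/nix/nix.conf (assumed location of the system-wide config)

# Query this cache in addition to the defaults.
extra-substituters = https://cache.nixos-cuda.org

# Public key used to verify signatures on paths from that cache (key as posted above).
extra-trusted-public-keys = cache.nixos-cuda.org:74DUi4Ye579gUqzH4ziL9IyiJBlDpMRn9MBN8oNan9M=

# Optional, and effectively root-equivalent access: lets the named user
# override restricted settings such as 'sandbox' and 'system'.
# 'alice' is a hypothetical user name.
# trusted-users = root alice
```

There is no rebuild step outside NixOS, but the daemon reads this file when it starts, so it likely needs a restart (for example `sudo systemctl restart nix-daemon` on a systemd host) before the new cache is used.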
Daniel Fahey | I reckon they might be in a private repo hinted at in https://github.com/flox/nixpkgs/pull/3#issuecomment-1276439899
But the https://github.com/flox/nixpkgs/tree/unstable is a simple fork that I'd like to see sync'd/rebased from upstream Nixpkgs more frequently.
Gaétan Lepage is this the kind of thing that could be discussed in your CUDA Team meetings, and maybe brought up to the Steering Committee for discussion with Flox?
| 16:12:16 |
Gaétan Lepage | Flox is managing their cache internally. As you pointed out, they use an internal fork of nixpkgs that is slightly delayed from nixos-unstable (or nixos-unstable-small). It's normal that their cache is less fresh than cache.nixos-cuda.org. | 16:13:41 |
Gaétan Lepage | The difference is that they have the permission from Nvidia to redistribute their binaries. | 16:14:01 |
Daniel Fahey | 😅 | 16:14:45 |
Ari Lotter | i have it in both because i don't understand it <3 | 16:15:52 |
Ari Lotter | also - 100% sure i'm not running out of ram anymore, at only ~400gb/2tb used on the machine, but jax still has crashed workers - and not sure if it's progressing | 16:17:00 |
Ari Lotter | can i pull up interactive logs for its derivation somehow? | 16:17:11 |