| 9 May 2026 |
hexa | https://grafana.nixos.org/d/he6lz9g/macos-disk-mem-swap?orgId=1&from=now-90d&to=now&timezone=utc | 18:03:47 |
John Ericson | in this situation, do we know why the (old) queue runner told two builders to build fish? | 18:52:31 |
hexa | I don't | 18:55:01 |
John Ericson | unless it thought the first one failed, I can't think of why either | 18:56:46 |
John Ericson | I am working on new queue runner right now making it more like old when with the use of BuildDerivation | 18:56:59 |
John Ericson | hopefully that will at lest help with the new scheduling things and building their dependencies problem | 18:57:24 |
emily | I know that another job can pull fish in as a dependency and it'll be built on the Nix level. | 19:00:47 |
emily | how that interacts with the queue runner/uploads I don't know | 19:00:54 |
John Ericson | the queue runner sends BasicDerivations, so the builder should not know how to build fish in that case | 19:03:12 |
John Ericson | it should either substitute or fail | 19:03:19 |
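For illustration, a minimal Python sketch of the distinction John is describing, with illustrative field and function names (Nix's actual types live in C++, and these are assumptions, not its API): a BasicDerivation's inputs are already concrete store paths, so a builder holding one can only substitute them or fail, never build them itself.

```python
from dataclasses import dataclass, field

@dataclass
class Derivation:
    """Full derivation: can reference other derivations it depends on."""
    name: str
    input_drvs: dict = field(default_factory=dict)  # drv path -> wanted outputs
    input_srcs: set = field(default_factory=set)    # plain store paths

@dataclass
class BasicDerivation:
    """What the queue runner ships to a builder: inputs already resolved
    to concrete store paths, so there is nothing left to recurse into."""
    name: str
    input_srcs: set = field(default_factory=set)

def builder_prepare(drv: BasicDerivation, have_locally, substitute) -> None:
    # The builder holds no derivations for the inputs, so it cannot build
    # them itself: each missing input must be substituted, or the step fails.
    for path in drv.input_srcs:
        if have_locally(path) or substitute(path):
            continue
        raise RuntimeError(f"missing input {path}: substitute or fail")
```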
emily | right, okay. I don't know exactly why it happens but I know vcunat has mentioned it happening multiple times before. | 19:03:34 |
emily | e.g. in the context of duplicate work during stdenv building. | 19:03:40 |
John Ericson | (the new queue runner shares the whole drv graph which is causing rebuilding problems, I am working on it right now to make it like the older queue runner so that cannot happen) | 19:04:03 |
emily | these problems predate when hexa (signing key rotation when?) mentioned the new runner had been deployed | 19:10:39 |
emily | (as does my memory of talk of duplicate builds) | 19:10:47 |
emily | so whatever is going on here is unrelated to any problems the new runner has I think | 19:10:54 |
Sergei Zimmerman (xokdvium) | Indeed. There are several related issues that can happen (frankenbuild-shaped, not necessarily our cases):
- A builder grabs something it has built previously which isn't what's in the cache. We've observed this in the nix repo. Can happen when a build gets scheduled after a successful build that couldn't be uploaded by the queue runner.
- A partial upload, where the queue runner uploads outputs that have been built by different machines. This is probably the case with fish. | 19:21:58 |
Sergei Zimmerman (xokdvium) | Ideally we'd ensure consistency between the narHash of the inputs on the builder and what's in the cache | 19:23:24 |
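A minimal sketch of the consistency check Sergei is suggesting, with all names assumed rather than taken from Hydra: before accepting a build's outputs, compare the narHash of each input the builder reports having used against what the binary cache holds.

```python
def inputs_consistent(builder_inputs: dict[str, str], cache_narhash) -> bool:
    """builder_inputs maps a store path to the narHash the builder used;
    cache_narhash(path) returns the hash the binary cache holds, or None."""
    for path, used_hash in builder_inputs.items():
        cached = cache_narhash(path)
        if cached is not None and cached != used_hash:
            # The builder built against a different realisation of this
            # input than clients will substitute: a frankenbuild risk.
            return False
    return True
```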
emily | (2) is confusing though, since it seems like we're observing large time gaps between the builds/uploads | 19:24:30 |
emily | so why would it get built twice not even in a race condition way but after many minutes? | 19:24:45 |
emily | the BasicDerivation point seems relevant as it's not like a transitive dependency can easily pull it in | 19:25:12 |
Sergei Zimmerman (xokdvium) | Probably can happen when the queue runner tries to upload outputs from the first builder, fails halfway, and reschedules the build on another builder afterwards? And it pulls the one already-known output from the cache? | 19:25:40 |
emily | I guess the queue runner would observe the output is missing and schedule another build? | 19:25:42 |
emily | right. so that could be fixed at the queue runner level? "if we have an output waiting to be uploaded, then don't spawn another build; just keep trying to upload that output"? | 19:26:14 |
Sergei Zimmerman (xokdvium) | (replying to emily's queue-runner-level fix above) Makes sense yeah | 19:26:57 |
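A sketch of the fix emily proposes, under assumed state names (Hydra's real step states differ): once a step's outputs are built but not yet fully uploaded, the scheduler retries the upload rather than dispatching a second build.

```python
from enum import Enum, auto

class StepState(Enum):
    PENDING = auto()
    BUILDING = auto()
    BUILT_AWAITING_UPLOAD = auto()  # build succeeded, upload incomplete
    UPLOADED = auto()

def next_action(state: StepState) -> str:
    if state == StepState.BUILT_AWAITING_UPLOAD:
        # Key invariant: never dispatch a second build here. The first
        # builder's outputs already exist, so keep retrying the upload.
        return "retry-upload"
    if state == StepState.PENDING:
        return "dispatch-build"
    return "wait"
```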
Sergei Zimmerman (xokdvium) | Things probably get complicated when the builder dies halfway? | 19:27:29 |
emily | as in half-way through sending its successfully-built outputs to the queue runner? | 19:27:57 |
K900 | Then it should probably discard the entire build | 19:28:13 |
emily | yeah, though I think the issue is potentially that stuff happens per-output? | 19:28:33 |
emily | "waiting for all outputs to be ready for upload before uploading any of them" would be good | 19:28:54 |