| 9 May 2026 |
John Ericson | (the new queue runner shares the whole drv graph which is causing rebuilding problems, I am working on it right now to make it like the older queue runner so that cannot happen) | 19:04:03 |
emily | these problems predate when hexa (signing key rotation when) mentioned the new runner had been deployed | 19:10:39 |
emily | (as does my memory of talk of duplicate builds) | 19:10:47 |
emily | so whatever is going on here is unrelated to any problems the new runner has I think | 19:10:54 |
Sergei Zimmerman (xokdvium) | Indeed. There are several related issues that can happen (frankenbuild shaped, not necessarily our cases).
1. A builder grabs something it has built previously which isn’t what’s in the cache. We’ve observed this in the nix repo. Can happen when a build gets scheduled after a successful build that couldn’t be uploaded by the queue runner.
2. A partial upload where the queue runner uploads outputs that have been built by different machines. This is probably the case with fish.
| 19:21:58 |
Sergei Zimmerman (xokdvium) | Ideally we’d ensure consistency of the narHash of the inputs on the builder and what’s in the cache | 19:23:24 |
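The consistency check suggested here could look roughly like the sketch below. It is not Hydra or Nix code: `find_nar_hash_mismatches` and both dictionaries are hypothetical, and it assumes the narHash of each input has already been collected from the builder and from the cache by some other means.

```python
from typing import Dict, List, Optional


def find_nar_hash_mismatches(
    builder_hashes: Dict[str, str],
    cache_hashes: Dict[str, Optional[str]],
) -> List[str]:
    """Return the store paths whose narHash on the builder differs from the cache.

    Both mappings go from store path to narHash string (hypothetical inputs).
    """
    mismatched = []
    for path, builder_hash in sorted(builder_hashes.items()):
        cached = cache_hashes.get(path)
        # A path that is absent from the cache is not a mismatch,
        # it simply has not been uploaded yet.
        if cached is not None and cached != builder_hash:
            mismatched.append(path)
    return mismatched
```

A non-empty result would mean the builder is about to build against an input that differs from what cache users will download, which is the frankenbuild shape described above.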
emily | (2) is confusing though, since it seems like we're observing large time gaps between the builds/uploads | 19:24:30 |
emily | so why would it get built twice not even in a race condition way but after many minutes? | 19:24:45 |
emily | the BasicDerivation point seems relevant as it's not like a transitive dependency can easily pull it in | 19:25:12 |
Sergei Zimmerman (xokdvium) | Probably can happen when the queue runner tries to upload outputs from the first builder and fails halfway and does it on another builder afterwards? And it pulls the one known output from the cache? | 19:25:40 |
emily | I guess the queue runner would observe the output is missing and schedule another build? | 19:25:42 |
emily | right. so that could be fixed at the queue runner level? "if we have an output waiting to be uploaded, then don't spawn another build; just keep trying to upload that output"? | 19:26:14 |
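As a rough sketch of the rule proposed in this message (a hypothetical decision function, not the queue runner's actual scheduling logic), the check could be as small as:

```python
from enum import Enum
from typing import Set


class Action(Enum):
    NOTHING = "nothing"
    RETRY_UPLOAD = "retry_upload"
    BUILD = "build"


def next_action(output: str, in_cache: Set[str], pending_upload: Set[str]) -> Action:
    """Decide what to do about one output path (all names are hypothetical)."""
    if output in in_cache:
        # Already available to everyone, nothing to do.
        return Action.NOTHING
    if output in pending_upload:
        # A successful build already produced this output:
        # keep trying to upload it instead of building it again.
        return Action.RETRY_UPLOAD
    return Action.BUILD
```

The point of the ordering is that "missing from the cache" alone never triggers a build. A build is only scheduled when no already-built copy is waiting to be uploaded.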
Sergei Zimmerman (xokdvium) | (in reply to emily) Makes sense yeah | 19:26:57 |
Sergei Zimmerman (xokdvium) | Things probably get complicated when the builder dies halfway? | 19:27:29 |
emily | as in half-way through sending its successfully-built outputs to the queue runner? | 19:27:57 |
K900 | Then it should probably discard the entire build | 19:28:13 |
emily | yeah, though I think the issue is potentially that stuff happens per-output? | 19:28:33 |
emily | "waiting for all outputs to be ready for upload before uploading any of them" would be good | 19:28:54 |
K900 | I wonder if there's any reasonable way to do two phase commit on this | 19:29:59 |
K900 | Like upload to a temporary directory and then move atomically | 19:30:07 |
K900 | If S3 lets you do that | 19:30:15 |
emily | yeah, if we could have S3 expose all the narinfos atomically that would be great | 19:30:20 |
emily | I mean even just uploading all the actual outputs first and then uploading the narinfos would probably help | 19:30:28 |
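The two ideas from this part of the discussion (wait until every output is ready, then upload the NARs before any narinfo) can be sketched as follows. This is only an illustration: `publish_outputs`, the `put` callback, and the object keys are hypothetical and do not reflect the real key layout of a binary cache or any S3 API.

```python
from typing import Callable, Dict, Iterable


def publish_outputs(
    expected: Iterable[str],
    ready: Dict[str, Dict[str, bytes]],
    put: Callable[[str, bytes], None],
) -> bool:
    """Upload a derivation's outputs only once all of them are ready.

    `ready` maps an output name to its {"nar": ..., "narinfo": ...} payloads.
    Returns False without uploading anything if an output is still missing.
    """
    names = sorted(expected)
    if any(name not in ready for name in names):
        return False
    # Phase 1: upload the bulky NAR data. Nothing is visible to cache
    # users yet, because they discover paths through the narinfo files.
    for name in names:
        put(f"nar/{name}.nar", ready[name]["nar"])
    # Phase 2: publish the small narinfo files last.
    for name in names:
        put(f"{name}.narinfo", ready[name]["narinfo"])
    return True
```

This is not an atomic commit: a failure in the middle of phase 2 can still leave only some narinfos published. It only narrows the window, since phase 2 moves very little data compared to phase 1.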
emily | I don't really understand why Hydra would decide to build something it's trying to upload anyway though | 19:32:03 |
emily | does it ever give up on retrying to upload? | 19:32:10 |
emily | like, anything that wants something in the process of uploading should block on that upload anyway, right? | 19:32:52 |
emily | so perhaps the solution could be as simple as just "never stop retrying uploads"? it wouldn't handle the "builder disappears" case Sergei Zimmerman (xokdvium) mentioned but that's at least an edge case | 19:33:24 |
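A minimal sketch of "never stop retrying uploads", assuming a hypothetical `upload` callable that raises `OSError` on failure. The capped exponential backoff is an addition for illustration and was not part of the discussion.

```python
import time
from typing import Callable


def upload_until_success(
    upload: Callable[[], None],
    base_delay: float = 1.0,
    max_delay: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `upload` until it succeeds and return the number of attempts.

    There is deliberately no attempt limit: the output stays pending
    until the upload goes through.
    """
    attempts = 0
    delay = base_delay
    while True:
        attempts += 1
        try:
            upload()
            return attempts
        except OSError:
            # Wait before the next attempt, doubling the delay up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

As noted in the message above, this does not cover the case where the builder holding the output disappears, because then there is nothing left to upload.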
emily | hexa (signing key rotation when): can we get queue runner logs for musikcube | 19:46:16 |
emily | https://github.com/NixOS/nixpkgs/issues/517508 is recent | 19:46:27 |
emily | though, let me check it's not a library dependency actually | 19:46:39 |