!RROtHmAaQIkiJzJZZE:nixos.org

NixOS Infrastructure

417 Members
Next Infra call: 2024-07-11, 18:00 CEST (UTC+2) | Infra operational issues backlog: https://github.com/orgs/NixOS/projects/52 | See #infra-alerts:nixos.org for real time alerts from Prometheus.

SenderMessageTime
9 May 2026
@emilazy:matrix.orgemilyso, uh… does the new queue runner retry uploads?17:52:27
@xokdvium:matrix.orgSergei Zimmerman (xokdvium)Tbh it’s not exactly clear to me. I thought that it was supposed to be doing presigned URLs and the builders would be the ones uploading17:53:08
@k900:0upti.meK900That's not actually implemented17:53:23
@k900:0upti.meK900AFAIUI17:53:25
@k900:0upti.meK900 And also I don't see how that would even help because you also need to sign the actual NAR 17:53:36
@k900:0upti.meK900Which the builders don't have keys for17:53:42
@k900:0upti.meK900So you'd need a custom protocol for the builders to ask the coordinator to sign the NAR and then you need to figure out how to actually authenticate the builder ideally with something like SPIFFE and that's a whole other can of worms17:54:14
@xokdvium:matrix.orgSergei Zimmerman (xokdvium)
In reply to @emilazy:matrix.org
so "failure at the time of upload" sounds very plausible to me. especially given that Nix retries substitutions a bunch out of the box, whereas these queue runner logs look like it's not retrying at all
Hm queue runner is supposed to be retrying to upload the same thing - at least nix binary cache store does this. Whether we are not doing it well enough is another question
17:56:58
@xokdvium:matrix.orgSergei Zimmerman (xokdvium) But from this portion of the logs it seems like retries do succeed after a couple of attempts 17:57:21
@xokdvium:matrix.orgSergei Zimmerman (xokdvium)But S3 robustness might not be best, AWS does this crazy thing where it returns 400 on a closed socket and it’s not retried - but I don’t see that particular error mode in the logs for now.17:58:22
@xokdvium:matrix.orgSergei Zimmerman (xokdvium)But it’s not clear to me what needs to happen to improve this since the hydra repo is in this strange state where the new queue runner is quite f’d17:59:45
@emilazy:matrix.orgemilyso I do notice that it doesn't seem like we've had new cases crop up in the past… idk, month or so?18:01:04
@emilazy:matrix.orgemilybut it was really bad for a while before that18:01:10
@emilazy:matrix.orgemilydoes that line up with the times where the disks were chronically full on the Darwin nodes?18:01:24
@emilazy:matrix.orgemilyI'm wondering if we could have had a situation where only some outputs were getting registered and pushed out somehow because of running out of disk. or where GC was getting aggressive and clobbering stuff before it was even uploaded.18:01:47
@hexa:lossy.networkhexaFeature wise that's in the new queue runner, we are still running the old queue runner18:02:02
@hexa:lossy.networkhexaEven after switching to the new queue runner we'll test centralized mode first and presigned URLs later18:02:42
@hexa:lossy.networkhexaPlausible18:03:00
@hexa:lossy.networkhexaI fixed that over a month ago18:03:12
@hexa:lossy.networkhexahttps://grafana.nixos.org/d/he6lz9g/macos-disk-mem-swap?orgId=1&from=now-90d&to=now&timezone=utc18:03:47
@Ericson2314:matrix.orgJohn Ericsonin this situation, do we know why the (old) queue runner told two builders to build fish?18:52:31
@hexa:lossy.networkhexaI don't18:55:01
@Ericson2314:matrix.orgJohn Ericsonunless it thought the first one failed I can't think of why either18:56:46
@Ericson2314:matrix.orgJohn EricsonI am working on the new queue runner right now, making it more like the old one with the use of BuildDerivation18:56:59
@Ericson2314:matrix.orgJohn Ericsonhopefully that will at least help with the new scheduling things and building their dependencies problem18:57:24
@emilazy:matrix.orgemily I know that another job can pull fish in as a dependency and it'll be built on the Nix level. 19:00:47
@emilazy:matrix.orgemilyhow that interacts with the queue runner/uploads I don't know19:00:54
@Ericson2314:matrix.orgJohn Ericson the queue runner sends BasicDerivations so that the builder should not know how to build fish in that case 19:03:12
@Ericson2314:matrix.orgJohn Ericsonit should either substitute or fail19:03:19
