| 9 May 2026 |
Sergei Zimmerman (xokdvium) | What the new non-deployed code does is unclear to me | 17:48:23 |
emily | since the log for fish shows the non-fallback paths, but the log for ffmpeg shows the data output as an odd-one-out fallback path | 17:48:47 |
emily | but they both seem to be broken for path-rewrite-related reasons | 17:48:54 |
Sergei Zimmerman (xokdvium) | Hm logs always get pushed over | 17:49:04 |
emily | hm, if it's just doing it through Nix, how come it's the queue runner talking about uploading them in https://termbin.com/69iy? | 17:49:33 |
emily | I thought all the compression/signing/uploads were done on the central queue runner machine | 17:49:41 |
Sergei Zimmerman (xokdvium) | It used to be linked to nix-store | 17:50:18 |
emily | ah, I see what you mean. (I thought you meant the builders were using that store directly) | 17:50:53 |
emily | so "failure at the time of upload" sounds very plausible to me. especially given that Nix retries substitutions a bunch out of the box, whereas these queue runner logs look like it's not retrying at all | 17:51:27 |
emily | (so download side should be expected to be more robust by default?) | 17:51:47 |
emily | so, uh… does the new queue runner retry uploads? | 17:52:27 |
Sergei Zimmerman (xokdvium) | Tbh it’s not exactly clear to me. I thought it was supposed to be doing presigned URLs and the builders would be the ones uploading | 17:53:08 |
K900 | That's not actually implemented | 17:53:23 |
K900 | AFAIUI | 17:53:25 |
K900 | And also I don't see how that would even help because you also need to sign the actual NAR | 17:53:36 |
K900 | Which the builders don't have keys for | 17:53:42 |
K900 | So you'd need a custom protocol for the builders to ask the coordinator to sign the NAR, and then you need to figure out how to actually authenticate the builder, ideally with something like SPIFFE, and that's a whole other can of worms | 17:54:14 |
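
(A minimal sketch of why the builders can't do this themselves, assuming Nix's documented binary-cache fingerprint format and an Ed25519 cache key; the function and parameter names here are illustrative, not Hydra's actual code:)

```rust
// Sketch of the coordinator-side signing step K900 describes. Nix signs a
// fingerprint of the form "1;<storePath>;<narHash>;<narSize>;<refs,...>"
// with the cache's Ed25519 secret key, which only the coordinator holds.
use base64::{engine::general_purpose::STANDARD, Engine as _};
use ed25519_dalek::{Signer, SigningKey};

fn sign_nar_fingerprint(
    key_name: &str,          // e.g. "cache.nixos.org-1"
    secret_key: &SigningKey, // never distributed to builders
    store_path: &str,
    nar_hash: &str, // "sha256:..." of the uncompressed NAR
    nar_size: u64,
    references: &[&str],
) -> String {
    // The fingerprint ties the signature to the path, its contents, and
    // its references, so a builder without the key cannot forge it.
    let fingerprint = format!(
        "1;{};{};{};{}",
        store_path,
        nar_hash,
        nar_size,
        references.join(",")
    );
    let sig = secret_key.sign(fingerprint.as_bytes());
    // Nix encodes signatures as "<key-name>:<base64 signature>".
    format!("{}:{}", key_name, STANDARD.encode(sig.to_bytes()))
}
```

(Hence the need for a builder-to-coordinator signing round-trip, and for authenticating the builder making that request.)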
Sergei Zimmerman (xokdvium) | In reply to @emilazy:matrix.org ("so 'failure at the time of upload' sounds very plausible to me…") Hm, the queue runner is supposed to retry uploading the same thing - at least Nix's binary cache store does this. Whether we're doing it well enough is another question | 17:56:58 |
Sergei Zimmerman (xokdvium) | But from the portion of the logs, it seems like retries do succeed after a couple of attempts | 17:57:21 |
Sergei Zimmerman (xokdvium) | But S3 robustness might not be the best: AWS does this crazy thing where it returns 400 on a closed socket and that's not retried - but I don’t see that particular error mode in the logs for now. | 17:58:22 |
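
(A minimal sketch of the kind of upload retry loop being discussed, which could also classify the 400-on-closed-socket case as retryable; `upload_nar` and `is_retryable` are hypothetical stand-ins, not Hydra's real API:)

```rust
// Generic retry-with-exponential-backoff wrapper for an upload step.
use std::thread::sleep;
use std::time::Duration;

fn upload_with_retries<E: std::fmt::Display>(
    mut upload_nar: impl FnMut() -> Result<(), E>,
    is_retryable: impl Fn(&E) -> bool, // would need to match the AWS 400 quirk
    max_attempts: u32,
) -> Result<(), E> {
    let mut delay = Duration::from_millis(250);
    for attempt in 1..=max_attempts {
        match upload_nar() {
            Ok(()) => return Ok(()),
            Err(e) if attempt < max_attempts && is_retryable(&e) => {
                eprintln!("upload attempt {attempt} failed ({e}), retrying in {delay:?}");
                sleep(delay);
                delay *= 2; // back off before the next attempt
            }
            // Non-retryable error, or the last attempt: give up.
            Err(e) => return Err(e),
        }
    }
    unreachable!("the final attempt always returns above")
}
```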
Sergei Zimmerman (xokdvium) | But it’s not clear to me what needs to happen to improve this, since the hydra repo is in this strange state where the new queue runner is quite f’d | 17:59:45 |
emily | so I do notice that it doesn't seem like we've had new cases crop up in the past… idk, month or so? | 18:01:04 |
emily | but it was really bad for a while before that | 18:01:10 |
emily | does that line up with the times where the disks were chronically full on the Darwin nodes? | 18:01:24 |
emily | I'm wondering if we could have had a situation where only some outputs were getting registered and pushed out somehow because of running out of disk. or where GC was getting aggressive and clobbering stuff before it was even uploaded. | 18:01:47 |
hexa | Feature wise that's in the new queue runner, we are still running the old queue runner | 18:02:02 |
hexa | Even after switching to the new queue runner we'll test centralized mode first and presigned urls later | 18:02:42 |
hexa | Plausible | 18:03:00 |
hexa | I fixed that over a month ago | 18:03:12 |