| 12 Oct 2023 |
@delroth:delroth.net | since you're asking... :p | 15:40:05 |
Lily Foster | The small stuff you listed and ability for mortals to run pieces of or all of ofborg locally are definitely pain points i'm looking at helping short-term. I appreciate you making the list ❤️ | 15:42:31 |
@delroth:delroth.net | yeah I don't think anything here is groundbreaking :) | 15:43:49 |
Lily Foster | (also local testing will let even people with infra access not have to test changes in prod 😅) | 15:44:01 |
@adam:robins.wtf | "properly mark errors as errors" - yes, this times 100 | 15:49:24 |
@delroth:delroth.net | another "larger stuff" topic: I'm not sure if ofborg auto-scales based on queue length, but there's been a few times recently where it's 4-6h behind on processing PRs, and I wonder if we could just throw more compute at it | 15:58:44 |
cole-h | I've already tried that (manually), unfortunately. A few years ago, 3-4 ofborg evaluators was enough to chew through the queue. Nowadays, even 9 is not enough, due to eval times blowing up. | 15:59:45 |
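For readers curious what queue-length-based scaling could look like: below is a minimal sketch that polls a RabbitMQ management API for queue depth and derives a worker count from the backlog (ofborg's work queue is AMQP/RabbitMQ-based). The queue name, URL, credentials, and thresholds are all invented for illustration, and actually resizing the worker pool is left out.

```rust
// Hypothetical sketch, not ofborg code. Needs reqwest (with the
// "blocking" and "json" features) and serde_json.
use std::{thread, time::Duration};

const PER_WORKER_TARGET: u64 = 50; // backlog one evaluator can absorb per cycle (assumed)
const MIN_WORKERS: u64 = 3;
const MAX_WORKERS: u64 = 16;

fn desired_workers(queue_depth: u64) -> u64 {
    // Scale linearly with backlog, clamped to a sane range.
    (queue_depth / PER_WORKER_TARGET + 1).clamp(MIN_WORKERS, MAX_WORKERS)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    loop {
        // RabbitMQ's management plugin exposes per-queue stats as JSON;
        // the host and queue name here are made up.
        let stats: serde_json::Value = reqwest::blocking::Client::new()
            .get("http://rabbitmq.example:15672/api/queues/%2f/build-inputs")
            .basic_auth("monitor", Some("secret"))
            .send()?
            .json()?;
        let depth = stats["messages"].as_u64().unwrap_or(0);
        println!("queue depth {depth} -> want {} workers", desired_workers(depth));
        // Actually resizing the pool (cloud API, NixOps, ...) is left out.
        thread::sleep(Duration::from_secs(60));
    }
}
```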
cole-h | Also, I don't know how I feel about marking "errors as errors" (I assume this means "failed builds turn into failed checks"). There could be any number of reasons why the build failed that may have nothing to do with the derivation itself.
Maybe the machine OOM'd. Maybe networking died. Maybe the kernel panicked. Maybe there was a hardware failure. Maybe....
Something that was decided early on was that things with a red X should not be merged under any circumstance (as always, there are exceptions, but those should be very rare).
If one of those transient (or not so transient) failures happens, but nobody can reproduce it and someone decides to merge it anyway, that cheapens the meaning of a failed CI check.
At least with a "skipped" check, it's communicated that something may have gone wrong, but it may not be anyone in particular's fault. | 16:03:02 |
cole-h | (Not to say I'd block that change, per se, but it'd be nice to be convinced that it's the right thing to do.) | 16:03:49 |
@delroth:delroth.net | allowing people to retry failed runs and figuring out how to address infra flakiness seem like they'd both help there - fwiw I've rarely seen ofborg failing for the reasons you're listing, and they seem to be all transient conditions | 16:04:44 |
@delroth:delroth.net | (could even do something like "retries get scheduled on a different runner" if we wanted to be fancy :p) | 16:05:06 |
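A sketch of that "different runner" idea, with all names invented: the retried job carries the hostnames that already failed it, and each worker declines jobs on that list, so a retry sidesteps host-local problems (full disk, bad hardware, ...).

```rust
// Illustrative only -- not ofborg's data model.
struct BuildJob {
    pr: u64,
    attr: String,
    tried_on: Vec<String>, // hostnames that have already failed this job
}

fn should_accept(job: &BuildJob, hostname: &str) -> bool {
    // A worker that already failed this job lets another one take it.
    !job.tried_on.iter().any(|h| h.as_str() == hostname)
}

fn main() {
    let job = BuildJob {
        pr: 12345, // placeholder PR number
        attr: "hello".into(),
        tried_on: vec!["builder-1".into()],
    };
    // builder-1 failed this job before, so it passes; builder-2 takes it.
    assert!(!should_accept(&job, "builder-1"));
    assert!(should_accept(&job, "builder-2"));
}
```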
@delroth:delroth.net | I agree that we should at the very least try to measure how often these problems happen before making any decision, but I don't think a low rate of false positives necessarily needs to be a blocker - it would still be a massive improvement | 16:06:01 |
@delroth:delroth.net | (imo) | 16:06:03 |
Lily Foster | Yeah, I definitely want measurements of how often those transient failures happen before changing anything. Stuff like build timeouts, OOMs, and nix daemon/builder errors should be technically distinguishable from a genuine derivation build error, though, so ideally we'd have the ability to report both "neutral" and "failure" conditions depending on how it failed
I think there would be value in having at least some scoped build conditions get a red X, though, because I see a lot of PRs get merged with failing builds or tests: failures like build timeouts (e.g. from rebuilding llvm) are not clearly distinguishable from the build just not working, without diving into the logs on the ofborg website
I'd have to see where the decision was originally made, but if enough of the community really wants the red X now, we could also decide to try it temporarily and roll it back if it does in fact turn out to be worse than the status quo (even though that is admittedly hard to measure) | 17:53:51 |
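A minimal sketch of the split Lily is proposing, assuming the failure cause can be detected at all; the type names are invented and this is not ofborg's actual data model.

```rust
// Illustrative sketch: only a genuine derivation failure earns the red X,
// while distinguishable transient conditions report as "neutral".
enum FailureKind {
    DerivationFailed, // the build itself errored: plausibly the PR's fault
    Timeout,          // e.g. an llvm rebuild blowing past the time budget
    OutOfMemory,      // the builder OOM-killed the build
    InfraError,       // nix daemon / network / hardware trouble
}

enum CheckConclusion {
    Failure, // red X: blocks merging
    Neutral, // "skipped": something went wrong, maybe nobody's fault
}

fn conclusion_for(kind: FailureKind) -> CheckConclusion {
    match kind {
        FailureKind::DerivationFailed => CheckConclusion::Failure,
        // Transient or infra conditions stay neutral, preserving the
        // "a red X means do not merge" invariant cole-h describes.
        FailureKind::Timeout | FailureKind::OutOfMemory | FailureKind::InfraError => {
            CheckConclusion::Neutral
        }
    }
}

fn main() {
    // A timeout surfaces as a neutral check rather than a hard failure.
    match conclusion_for(FailureKind::Timeout) {
        CheckConclusion::Failure => println!("red X"),
        CheckConclusion::Neutral => println!("neutral / skipped"),
    }
}
```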
@adam:robins.wtf | I've definitely had PRs merged that were failing a build, and nobody noticed. It seems like some of the concerns (timeouts, OOM, hardware failure) are common across other CI systems, yet it is also quite common for them to mark builds as failing through the UI. | 18:14:03 |
@adam:robins.wtf | Is the difference just a lack of ability to retry? | 18:15:55 |
Lily Foster | In reply to @adam:robins.wtf Is the difference just a lack of ability to retry? You can actually have ofborg retry-ish now by requeueing the same attrs (commenting @ofborg build attr1 attr1.tests attr2 attr2.tests ... on the PR) | 18:18:57 |
Lily Foster | It would be nice if there was a better way to request a retry of one build specifically though | 18:19:12 |
| 18 Oct 2023 |
| Hubble the Wolverine (they/them) joined the room. | 16:52:34 |
Hubble the Wolverine (they/them) | Heya! I've been redirected here with a question.
A PR of mine couldn't be built on ofborg, but I could build it on my end. I even tried on a separate machine from my laptop (albeit sharing a bit of the same configuration) and it was able to build it.
I don't understand why
https://github.com/NixOS/nixpkgs/pull/260812#issuecomment-1766750802 | 16:54:52 |
cole-h | Smells like a networking issue. Attempting to reproduce on my local machine, but fetching the source from blender is so slow x) | 17:02:58 |
Hubble the Wolverine (they/them) | In reply to @cole-h:matrix.org Smells like a networking issue. Attempting to reproduce on my local machine, but fetching the source from blender is so slow x) Ohhhh yeah, I know how that feels. I've been working on a PR for blender and the source fetch is always the longest part | 17:04:55 |
ElvishJerricco | I submitted an "approved" review and wanted ofborg to build a test, so I put @ofborg test grow-partition in the review comment. It doesn't appear to have picked it up. Am I correct in assuming it only picks up regular comments? Should I just repeat the comment? | 17:31:14 |
cole-h | Which PR? (Just so I can check if it appeared in the logs) | 17:31:56 |
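ElvishJerricco's guess matches how GitHub webhooks work: a regular comment on a PR's conversation tab arrives as an issue_comment event, while the body of a submitted review arrives under pull_request_review (and inline diff comments under pull_request_review_comment), so a bot subscribed only to the first never sees commands left in reviews. A hypothetical dispatcher illustrating the distinction (the event names are real GitHub webhook events; the function itself is invented and is not ofborg's code):

```rust
// Returns the command text if the payload carries an @ofborg mention.
fn command_text(event: &str, payload: &serde_json::Value) -> Option<String> {
    let body = match event {
        // A regular comment on the PR conversation tab.
        "issue_comment" => payload["comment"]["body"].as_str(),
        // A submitted review ("Approve" / "Request changes") carries its
        // text here instead -- invisible to an issue_comment-only handler.
        "pull_request_review" => payload["review"]["body"].as_str(),
        // An inline comment on a diff line is yet another event.
        "pull_request_review_comment" => payload["comment"]["body"].as_str(),
        _ => None,
    };
    body.filter(|b| b.contains("@ofborg")).map(str::to_owned)
}

fn main() {
    let payload = serde_json::json!({
        "review": { "body": "@ofborg test grow-partition" }
    });
    // The same text is found under pull_request_review but would be
    // missed by a handler that only processes issue_comment events.
    assert!(command_text("pull_request_review", &payload).is_some());
    assert!(command_text("issue_comment", &payload).is_none());
}
```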