5 Nov 2024 |
K900 ⚡️ | Can someone please look into the flaky tests | 06:44:50 |
K900 ⚡️ | It's been happening more and more lately | 06:44:56 |
7 Nov 2024 |
K900 ⚡️ | Folks I know I am starting to sound like a broken record | 07:00:33 |
K900 ⚡️ | But the tests are flaking | 07:00:39 |
K900 ⚡️ | And I really don't want to retire them from the blocking jobs | 07:00:47 |
K900 ⚡️ | And I have no idea what is going on there | 07:01:00 |
K900 ⚡️ | Can someone with either knowledge or more free time please take a look | 07:01:12 |
emily | cc m1cr0man | 07:02:07 |
emily | the ACME tests are pretty important since they're the one line of defence we have against everyone's services going completely unavailable. unfortunately they have also long since exceeded the complexity at which I feel like I have a handle on them and I know m1cr0man only has so much time these days :( | 07:03:32 |
m1cr0man | Are they still flaking? I did put out some fixes a few weeks ago to help reduce flakiness by wrapping some of the assertions in retries. I hadn't heard anything more so I assumed it was fixed.
I am a bit better for time now (house move over) so I can look into it again. Feel free to spam me with any failures you see. I'll take a look on Hydra too.
Wrt actual test complexity: I'm not sure how to simplify it. There are a lot of moving parts to testing ACME. I did put a nice summary into a PR comment last week. https://github.com/NixOS/nixpkgs/pull/340136#issuecomment-2448648944 | 09:20:12 |
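For readers unfamiliar with the approach, "wrapping some of the assertions in retries" could look roughly like the sketch below. It is only an illustration: `retry_assertion` and `flaky_check` are hypothetical names, not the helpers actually used in the nixpkgs ACME tests.

```python
import time


def retry_assertion(check, attempts=5, delay=0.0):
    """Call check() until it stops raising AssertionError.

    Hypothetical helper: re-raises the last AssertionError once
    `attempts` tries have been used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return check()
        except AssertionError as err:
            last_error = err
            time.sleep(delay)
    raise last_error


# Simulated flaky assertion: only passes on the third call.
calls = {"count": 0}


def flaky_check():
    calls["count"] += 1
    assert calls["count"] >= 3, "certificate not issued yet"
    return calls["count"]


result = retry_assertion(flaky_check)
print(f"passed after {result} attempts")
```

Retries like this hide timing jitter, but they do not help when the condition being waited on can never become true, which is a different kind of failure.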
K900 ⚡️ | Yes, they are | 09:27:47 |
K900 ⚡️ | webserver: waiting for unit acme-finished-http.example.test.target
Test "Can request certificate with Lego's built in web server" failed with error: "unit "acme-finished-http.example.test.target" is inactive and there are no pending jobs" | 16:09:28 |
K900 ⚡️ | Again | 16:09:29 |
8 Nov 2024 |
m1cr0man | https://github.com/NixOS/nixpkgs/pull/336412 sometimes a fresh set of eyes is all that's needed. ThinkChaos' change here should significantly reduce flakiness. | 15:45:47 |
K900 ⚡️ | Appreciated | 15:47:40 |
K900 ⚡️ | Is it good to merge? | 15:47:44 |
K900 ⚡️ | OK I assume yes | 15:50:52 |
m1cr0man | Yes - apologies I closed my client | 15:59:30 |
K900 ⚡️ | Nope :( | 20:41:11 |
K900 ⚡️ | webserver # the following new units were started: acme-http.example.test.timer, multi-user.target, network-online.target, run-credentials-getty\x40tty1.service.mount, run-credentials-systemd\x2dtmpfiles\x2dresetup.service.mount, sysinit-reactivation.target, systemd-tmpfiles-resetup.service
webserver # [ 14.902862] nixos[844]: finished switching to system configuration /nix/store/m1jmxwnpaibvj9szm7q3li1nia20q7d2-nixos-system-webserver-test
(finished: must succeed: /run/current-system/specialisation/http01lego/bin/switch-to-configuration test, in 2.37 seconds)
webserver: waiting for unit acme-finished-http.example.test.target
Test "Can request certificate with Lego's built in web server" failed with error: "unit "acme-finished-http.example.test.target" is inactive and there are no pending jobs"
cleanup
| 20:41:14 |
K900 ⚡️ | Hmm wait | 20:44:22 |
K900 ⚡️ | This feels wrong | 20:44:23 |
K900 ⚡️ | The service isn't even started by the switch | 20:44:30 |
K900 ⚡️ | Yeah OK this is definitely a race | 20:46:54 |
K900 ⚡️ | https://github.com/NixOS/nixpkgs/pull/354629 | 22:58:33 |
K900 ⚡️ | OK last thing I'm doing for the night | 22:58:38 |
K900 ⚡️ | I tried a bunch of ways to make it fail and it didn't | 23:02:46 |
K900 ⚡️ | Which is a good sign | 23:02:48 |
K900 ⚡️ | The funny thing is | 23:03:20 |
K900 ⚡️ | It actually fails if the test runs too fast | 23:03:27 |
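A toy model of why a wait can fail when the test runs "too fast". It assumes, based only on the log above, that the switch starts a timer rather than the service, so the start job is queued a little later, and that the waiter gives up as soon as it sees an inactive unit with no pending jobs. `FakeUnit`, `wait_for_unit`, and all the timings here are hypothetical; this is not the NixOS test driver's code.

```python
from dataclasses import dataclass


@dataclass
class FakeUnit:
    """Toy unit: a timer queues its start job at `queued_at` and the
    unit becomes active at `active_at` (simulated seconds)."""
    queued_at: float
    active_at: float

    def state(self, now):
        """Return (state, has_pending_jobs) at simulated time `now`."""
        if now >= self.active_at:
            return "active", False
        # Between queued_at and active_at a start job is pending.
        return "inactive", now >= self.queued_at


def wait_for_unit(unit, start, timeout=10.0, step=0.5):
    """Poll the unit on a simulated clock; fail early when it is
    inactive and nothing is queued to start it."""
    now = start
    while now <= start + timeout:
        state, has_pending_jobs = unit.state(now)
        if state == "active":
            return now
        if not has_pending_jobs:
            raise RuntimeError("unit is inactive and there are no pending jobs")
        now += step
    raise TimeoutError("unit did not become active in time")


unit = FakeUnit(queued_at=1.0, active_at=3.0)

# A "fast" test polls before the timer has queued the job and fails.
try:
    wait_for_unit(unit, start=0.0)
    fast_outcome = "ok"
except RuntimeError as err:
    fast_outcome = str(err)

# A "slow" test polls after the job is queued and waits it out.
slow_outcome = wait_for_unit(unit, start=1.5)

print(fast_outcome)
print(slow_outcome)
```

In this model the early-exit check is what turns a short delay into a hard failure: the slower run succeeds only because it happens to start polling after the job exists.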