| 7 Nov 2024 |
K900 ⚡️ | But the tests are flaking | 07:00:39 |
K900 ⚡️ | And I really don't want to retire them from the blocking jobs | 07:00:47 |
K900 ⚡️ | And I have no idea what is going on there | 07:01:00 |
K900 ⚡️ | Can someone with either knowledge or more free time please take a look | 07:01:12 |
emily | cc m1cr0man | 07:02:07 |
emily | the ACME tests are pretty important since they're the one line of defence we have against everyone's services going completely unavailable. unfortunately they have also long since exceeded the complexity at which I feel like I have a handle on them and I know m1cr0man only has so much time these days :( | 07:03:32 |
m1cr0man | Are they still flaking? I did put out some fixes a few weeks ago to help reduce flakiness by wrapping some of the assertions in retries. I hadn't heard anything more so I assumed it was fixed.
I am a bit better for time now (house move over) so I can look into it again. Feel free to spam me with any failures you see. I'll take a look on hydra too
Wrt actual test complexity: I'm not sure how to simplify it. There are a lot of moving parts to testing ACME. I did put a nice summary into an issue comment last week. https://github.com/NixOS/nixpkgs/pull/340136#issuecomment-2448648944 | 09:20:12 |
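(Editor's sketch, not part of the chat.) The "wrapping some of the assertions in retries" approach mentioned above can be illustrated roughly as follows. This is a minimal, generic sketch and not the code from the ACME test: `retry_assertion` and `cert_is_served` are hypothetical names, and the flaky check is simulated by a counter that passes on the third call.

```python
import time


def retry_assertion(check, attempts=5, delay=0.01):
    """Call `check` until it stops raising AssertionError or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return check()
        except AssertionError as err:
            last_error = err
            time.sleep(delay)  # give the system under test time to settle
    raise AssertionError(f"still failing after {attempts} attempts: {last_error}")


# Hypothetical stand-in for a flaky check: it only passes on the third call.
calls = {"n": 0}


def cert_is_served():
    calls["n"] += 1
    assert calls["n"] >= 3, "certificate not served yet"
    return calls["n"]


result = retry_assertion(cert_is_served)
```

A retry like this only hides timing problems where the condition eventually becomes true; it does not help when the check gives up for a different reason, such as a wait that aborts early.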
K900 ⚡️ | Yes, they are | 09:27:47 |
K900 ⚡️ | webserver: waiting for unit acme-finished-http.example.test.target
Test "Can request certificate with Lego's built in web server" failed with error: "unit "acme-finished-http.example.test.target" is inactive and there are no pending jobs" | 16:09:28 |
K900 ⚡️ | Again | 16:09:29 |
| 8 Nov 2024 |
m1cr0man | https://github.com/NixOS/nixpkgs/pull/336412 sometimes a fresh set of eyes is all that's needed. ThinkChaos' change here should significantly reduce flakiness. | 15:45:47 |
K900 ⚡️ | Appreciated | 15:47:40 |
K900 ⚡️ | Is it good to merge? | 15:47:44 |
K900 ⚡️ | OK I assume yes | 15:50:52 |
m1cr0man | Yes - apologies I closed my client | 15:59:30 |
K900 ⚡️ | Nope :( | 20:41:11 |
K900 ⚡️ | webserver # the following new units were started: acme-http.example.test.timer, multi-user.target, network-online.target, run-credentials-getty\x40tty1.service.mount, run-credentials-systemd\x2dtmpfiles\x2dresetup.service.mount, sysinit-reactivation.target, systemd-tmpfiles-resetup.service
webserver # [ 14.902862] nixos[844]: finished switching to system configuration /nix/store/m1jmxwnpaibvj9szm7q3li1nia20q7d2-nixos-system-webserver-test
(finished: must succeed: /run/current-system/specialisation/http01lego/bin/switch-to-configuration test, in 2.37 seconds)
webserver: waiting for unit acme-finished-http.example.test.target
Test "Can request certificate with Lego's built in web server" failed with error: "unit "acme-finished-http.example.test.target" is inactive and there are no pending jobs"
cleanup
| 20:41:14 |
K900 ⚡️ | Hmm wait | 20:44:22 |
K900 ⚡️ | This feels wrong | 20:44:23 |
K900 ⚡️ | The service isn't even started by the switch | 20:44:30 |
K900 ⚡️ | Yeah OK this is definitely a race | 20:46:54 |
K900 ⚡️ | https://github.com/NixOS/nixpkgs/pull/354629 | 22:58:33 |
K900 ⚡️ | OK last thing I'm doing for the night | 22:58:38 |
K900 ⚡️ | I tried a bunch of ways to make it fail and it didn't | 23:02:46 |
K900 ⚡️ | Which is a good sign | 23:02:48 |
K900 ⚡️ | The funny thing is | 23:03:20 |
K900 ⚡️ | It actually fails if the test runs too fast | 23:03:27 |
K900 ⚡️ | Unlike most of our other flakes | 23:03:40 |
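(Editor's sketch, not part of the chat.) One way to picture a test that fails only when it runs too fast: a wait that gives up as soon as the unit is inactive with no pending jobs will fail if it checks before the job has been queued. The toy model below is an assumption-laden illustration, not the NixOS test driver or the fix in the linked PR; `FakeSystemd`, `queue_delay`, `run_time`, and this `wait_for_unit` are all hypothetical.

```python
class FakeSystemd:
    """Toy model: the unit's job is queued only after `queue_delay` ticks."""

    def __init__(self, queue_delay, run_time):
        self.tick = 0
        self.queue_delay = queue_delay
        self.run_time = run_time

    def advance(self):
        self.tick += 1

    def state(self):
        """Return (active_state, has_pending_jobs) for the current tick."""
        if self.tick < self.queue_delay:
            return ("inactive", False)  # nothing queued yet
        if self.tick < self.queue_delay + self.run_time:
            return ("inactive", True)  # job queued or running
        return ("active", False)


def wait_for_unit(systemd, start_tick, max_ticks=100):
    """Simplified wait: gives up if the unit is inactive and nothing is queued."""
    while systemd.tick < start_tick:
        systemd.advance()  # models how late the test starts waiting
    for _ in range(max_ticks):
        active_state, pending = systemd.state()
        if active_state == "active":
            return True
        if not pending:
            raise RuntimeError("unit is inactive and there are no pending jobs")
        systemd.advance()
    raise TimeoutError("unit did not become active in time")
```

In this model, `wait_for_unit(FakeSystemd(2, 3), start_tick=0)` raises because the check happens before the job is queued, while the same call with `start_tick=3` succeeds, so a slower test run passes and a faster one fails.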
m1cr0man | (In reply to K900: "I tried a bunch of ways to make it fail and it didn't") I once left my server executing the test suite in a loop over 24 hours and had no failures. I've never been able to reproduce the issue when I want to 😅 that change does look good though. | 23:06:01 |
K900 ⚡️ | I was hoping I could get it to trigger by giving the server machine a lot of resources and the CA machine no resources | 23:06:40 |