Looser timeouts, disable broken test, less verbose output #1399
We've seen many failures on ofborg, and a lot of them ultimately appear to come down to a timeout being hit, resulting in something like this: `Failure executing slapadd -F /<path>/slap.d -b dc=example -l /<path>/load.ldif`. Hopefully this resolves it for most cases. I've done some endurance testing and this helps a lot.

Some other commands also regularly time out under high load:

- hydra-init
- hydra-create-user
- nix-store --delete

This should address most issues with tests randomly failing. I used the following script for endurance testing:

```
import os
import subprocess

run_counter = 0
fail_counter = 0

# Copy the environment once so the parent process's os.environ is not mutated.
env = os.environ.copy()
env["YATH_JOB_COUNT"] = "20"

# Run the test suite in a loop until interrupted with Ctrl+C.
while True:
    try:
        run_counter += 1
        print(f"Starting run {run_counter}")
        result = subprocess.run(["perl", "t/test.pl"], env=env)
        if result.returncode != 0:
            fail_counter += 1
        print(f"Finish run {run_counter}, total fail count: {fail_counter}")
    except KeyboardInterrupt:
        print(f"Finished {run_counter} runs with {fail_counter} fails")
        break
```

In case someone else wants to do it on their system :). Note that YATH_JOB_COUNT may need to be adjusted roughly according to your core count. I only have 4 cores (8 threads), so on other machines higher numbers might do a better job of flushing out unstable tests.
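To make the "adjust YATH_JOB_COUNT to your cores" advice concrete, here is a minimal sketch that derives the value from the number of hardware threads. The helper names `yath_job_count` and `stress_env` and the `multiplier` parameter are hypothetical and not part of Hydra or yath. Only the `YATH_JOB_COUNT` variable itself comes from the script above.

```python
import os
from typing import Dict, Optional


def yath_job_count(multiplier: float = 2.0, cpu_threads: Optional[int] = None) -> int:
    """Return a YATH_JOB_COUNT value scaled from the number of hardware threads.

    `multiplier` is a hypothetical knob: values above 1 oversubscribe the
    machine on purpose so that timeouts are more likely to show up.
    """
    if cpu_threads is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        cpu_threads = os.cpu_count() or 1
    return max(1, round(multiplier * cpu_threads))


def stress_env(multiplier: float = 2.0, cpu_threads: Optional[int] = None) -> Dict[str, str]:
    """Copy the current environment and set YATH_JOB_COUNT for a stress run."""
    env = os.environ.copy()
    env["YATH_JOB_COUNT"] = str(yath_job_count(multiplier, cpu_threads))
    return env
```

On an 8-thread machine, a multiplier of 2.5 gives the 20 jobs used in the script above, and a multiplier of 4 gives 32. The returned dictionary can be passed as `env=` to `subprocess.run`.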
Only log issues/failures when something's actually up. It has irked me for a long time that running the tests produced so much output; this seems to silence it. It does hide some warnings, but I think it makes the output so much more readable that it's worth the tradeoff.

Helps when running jobs highly in parallel: sometimes they would not give output for a while. Setting this timeout higher appears to help. Not completely sure if this is the right place to do it, but it works fine for me.
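The two ideas above (a generous timeout, and output shown only on failure) can be sketched together in a few lines. This is only an illustration in Python, the language of the endurance script, and not how Hydra's Perl test harness implements it. The function names `run_quietly` and `report` and the `timeout_s` parameter are assumptions made for this example.

```python
import subprocess
import sys
from typing import List, Tuple


def run_quietly(cmd: List[str], timeout_s: float) -> Tuple[bool, str]:
    """Run cmd with captured output; return (succeeded, captured_output)."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # Partial output may be bytes or None depending on platform/version.
        output = exc.output or ""
        if isinstance(output, bytes):
            output = output.decode(errors="replace")
        return False, f"timed out after {timeout_s}s\n{output}"
    return result.returncode == 0, result.stdout


def report(cmd: List[str], timeout_s: float) -> bool:
    """Stay silent on success; print the captured output only on failure."""
    ok, output = run_quietly(cmd, timeout_s)
    if not ok:
        print(f"FAILED: {' '.join(cmd)}", file=sys.stderr)
        print(output, file=sys.stderr)
    return ok
```

The tradeoff is the one described above: warnings from successful commands are swallowed, in exchange for output that only appears when something actually went wrong.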
We should look into how to resolve this, but I tried some things and nothing really worked. Let's skip it for now until someone comes along to improve it.
Ran this (once more) today for the whole day with
Note that one run had both tests failing, that's why the numbers don't seem to add up 😄. Since YATH_JOB_COUNT set to 32 is 4 times higher than the default (8 threads on this machine), I think this is good enough. Those timeouts (at least, it seems likely they are timeouts) shouldn't be hit when running the test suite normally, or even with 2x the number of threads (which I tested before). I don't think it makes sense to bump the timeouts much further than I already did here unless we actually run into them again. Some raw fail logs for reference
I did some testing (see how in the commit messages) to flush out any unstable tests. On pull requests, and possibly also on hydra, we keep running into trouble running the tests, since they keep failing with seemingly random errors (timeouts, as I have now figured out).
This patchset increases timeouts for some commands that kept timing out under higher load, likely resolving almost all issues we have with instability on ofborg and with local builds.
We've already disabled a test (NixOS/nixpkgs#173887) and increased the timeout in nixpkgs itself (https://github.com/NixOS/nixpkgs/blob/24e36589b79da526f339245641ade6261dcafaf8/pkgs/development/tools/misc/hydra/unstable.nix#L210-L212). Let's now resolve this upstream for good.
I also found out that t/evaluator/evaluate-oom-job.t always fails, so I added a comment for later reference and unconditionally skip the test for now. I'd like to keep fixing it out of scope for this PR, but I thought leaving a note would be good.