Allow redeploying to interactive tests #281331
base: master
Conversation
I opened the issue (#281332) for discussion of alternate solutions and whether this is an issue worth solving at all.
Force-pushed from 04373da to ececb67.
Question: The …
It feels like this should neither increase the complexity of the Python code nor of the standard test module code. The reason for my thinking is that this is quite an opinionated way to do it (i.e. not switching on a few flags but even adding another host that ought to be used as a jumphost, etc.) and might not automatically play well with any test.
Points well taken. To summarize: …

I will work on finding a way around (2). But if there isn't a clean way to do it in the end, maybe the solution is more to put this in NUR (or a flake), and in Nixpkgs the only change necessary would be to expose the underlying module system structure of the tests, so that it can be extended outside of Nixpkgs.
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/tweag-nix-dev-update-55/40996/1
cc @nikstur, as folks working on the NixOS tests ecosystem have not been paged.
Hard disagree. If we have slow iteration times, we should be removing complexity, not adding more complexity to bypass original complexity.
It's not complexity that leads to the slow iteration time, it's the time to restart the VM and spin up all the services again. For some tests this is quick, but for others it can be slow, and there's not much we can do to change that if the test needs services that are slow to start. There's also the issue of: if I am running the interactive driver and making stateful changes to test things out, and then I want to make a change to the configuration and test that out, I now have to redo all of the stateful changes I made. It's a question of workflow getting disrupted.
There's another approach demonstrated here: https://github.com/tweag/nix-hour/blob/master/templates/vm/configuration.nix#L164-L167

which mounts the relevant directory into the VM so that you can run some version of … This only works, though, because the directory structure is highly constrained. NixOS tests, and really any NixOS VMs, are Nix derivations that can come from, in theory, any Nix code, so you can't know how to extract the derivation from the directory, or even which directory to mount. There could even be references to arbitrary other parts of the file system. I don't see a way to solve that problem...
ececb67
to
916d8b0
Compare
916d8b0
to
b8eaddf
Compare
Okay, I've completely reworked how this PR functions. No jumphosts, no sshing, no modification of the machines in the test. Instead, I've added a method to the driver, … It then does that update by changing fields on the machines, adding new machines and removing old ones, and running nixos-rebuild on machines whose configurations have changed.

I've also added a flag to the driver for internal use which just produces the information needed to update.

There's a little bit of "magic" going on where we assume the path of the new driver executable will be the same as the one we're running now. This works well with …
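For orientation, here is a hedged sketch of that flow in driver-style Python. The class, helper names, and default behavior are illustrative, not the PR's actual internals; only the internal flag appears in the diff below.

```python
import subprocess
import sys


class DriverSketch:
    """Hypothetical outline of the rebuild() flow described above."""

    def __init__(self, rebuild_cmd: str) -> None:
        self.rebuild_cmd = rebuild_cmd

    def rebuild(self, rebuild_cmd: str | None = None) -> None:
        cmd = rebuild_cmd or self.rebuild_cmd
        # Re-run the command that rebuilds the interactive driver.
        subprocess.run(cmd, shell=True, check=True)
        # Ask the freshly built driver for the info needed to update, assuming
        # it lives at the same path as the running one (the "magic" above).
        info = subprocess.check_output(
            [sys.argv[0], "--internal-print-update-driver-info-and-exit"],
            text=True,
        )
        # Reconcile running machines against the new info: add new machines,
        # warn about removed-but-running ones, redeploy changed ones.
        self._update_machines(info)

    def _update_machines(self, info: str) -> None:
        raise NotImplementedError("reconciliation elided in this sketch")
```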
if machine.name not in names:
    if machine.is_up():
        self.logger.warning(
            f"{machine.name} removed from the test, but it's running so we're not going to shut it down automatically. Call driver.machines[\"{machine.name}\"].stop() and re-run rebuild() to remove."
If I understand correctly, with N machines running, we still need to stop all of them before executing rebuild(...). This approach doesn't seem significantly more advantageous than stopping the entire driver and rerunning it with the --keep-vm-state parameter. This way, the interactive shell history and the VMs' disks remain intact for the next run. The most important difference is that machines that are not stopped remain running with all their state in RAM, if that is of value for some scenario, right?

Could you please elaborate on your specific use case and the exact benefits you see from this additional complexity? Understanding this will help evaluate if the added complexity is justified.
> If I understand correctly, with N machines running, we still need to stop all of them before executing rebuild(...)

No! This is only if you delete a machine from the test, communicating that you no longer want it to be there at all.

If you change the config of a machine, it will do the equivalent of running nixos-rebuild switch on that machine. Namely, it will build the new configuration and run <new-configuration-store-path>/system/bin/switch-to-configuration test on the running VM.
I expect deleting machines from the test in this context will be quite uncommon. This code is just handling an edge case gracefully. Most of the time you will be modifying existing machines, and that's where this PR shines, since it allows you to redeploy the VMs in-place rather than restart them.
Logs from the running virtual machines get output in-line and can be disruptive
while using the python REPL. You can separate these streams by redirecting
stderr:
The documentation team recommends one sentence per line, which can be a long line. It works better with GitHub suggestions, as it requires no reflowing of these ignored line breaks.
(Pre-existing problem and not urgent for these additions)
Yeah, I guess I was trying to match the style of the document. Should I reformat the page since I'm editing it anyway, or should that be its own PR?
    machine.start_command = start_cmd
else:
    self.logger.warning(
        f"{start_cmd.machine_name} has multiple instances. This shouldn't be possible, but either way it's now ambiguous which to update."
Could we log info about the running instances?
I'm not sure how this would happen, so I'm not sure what would be useful info. If it does happen, it's certainly a bug in the testing framework and not a problem with the user's code. Maybe I should say that explicitly.
/ "bin" | ||
/ "switch-to-configuration" | ||
) | ||
machine.succeed(f"{switch_cmd} test") |
This only works if the host store is forwarded into the VM.
If the VM is completely image based, the new store paths will have to be copied in.
What if we always forward the host store to /root/host-store in the guests? No sane program will find it there, so it doesn't affect the purity or performance of the test, but it does allow us to do a nix copy just for the purpose of making this work.

This should error out early if virtualisation.useNixStoreImage = …
We don't have access to the nixos configuration here, only the built output. We could check for the presence of the store path, and if it doesn't exist produce a helpful error message that points to this as a likely culprit. Were you imagining something else?

Always forwarding the host store to /root/host-store seems like a big change for only this purpose. It seems not very common to set useNixStoreImage in tests, and IIUC it's only for performance purposes, so it can be turned off while interactively developing the test.

That being said, my feelings against adding /root/host-store are not strong, so if you or other people feel like it is a good change, I'm happy to add it in.
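For concreteness, a minimal sketch of what that copy-in step could look like, assuming the host store were exposed in the guest as a chroot store rooted at /root/host-store (the mount point floated above; none of this is implemented by the PR):

```python
# Assumption: the host's /nix/store is mounted at /root/host-store/nix/store,
# so the guest can address /root/host-store as a chroot store.
# `machine` is a running test-driver Machine object.
def copy_in_and_switch(machine, toplevel: str) -> None:
    # Copy the new system closure from the host store into the guest's store.
    machine.succeed(f"nix copy --no-check-sigs --from /root/host-store {toplevel}")
    # Activate the new configuration without touching the boot loader.
    machine.succeed(f"{toplevel}/bin/switch-to-configuration test")
```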
```ShellSession
$ build_cmd="nix-build . -A nixosTests.login.driverInteractive"
$ eval $build_cmd
$ ./result/bin/nixos-test-driver --rebuild-cmd $build_cmd 2>machine_output
```
Suggested change:

```ShellSession
$ ./result/bin/nixos-test-driver --rebuild-cmd "$build_cmd"
```

$build_cmd has spaces, so it should be in quotes to avoid potential issues depending on shell configuration.

2>machine_output will consume all stderr output, not just output from the VM. Maybe remove it from the docs to avoid confusion?
Good call on the quotes.
What stderr output are you worried about? Python errors, for instance, are output to stdout.
I added it to the docs because in my experience the interactive driver is extremely irritating to use with the VM output clogging up the interactive terminal, and it took me a while to realise I could redirect it.
Oh indeed, calls to logger.info() and the like get sent to stderr. I'll just remove this, since it's not actually related to the PR.
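For reference, a standalone snippet (not part of the PR) demonstrating that Python's logging handlers write to stderr by default, which is why 2>machine_output captures these messages too:

```python
import logging

logging.basicConfig(level=logging.INFO)  # default StreamHandler targets sys.stderr
logging.getLogger(__name__).info("this line goes to stderr")
print("this line goes to stdout")  # redirecting stderr leaves this visible
```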
new_driver_info = subprocess.check_output(
    [exe, "--internal-print-update-driver-info-and-exit"],
    text=True,
)
(
    start_scripts,
    vlans_str,
    testscript,
    output_directory,
) = new_driver_info.rstrip().split("\n")
I think it would be better to write these arguments as JSON in the interactive driver derivation output instead. This way we have less stuff to run outside of the sandbox, and less guessing about the binary path and its arguments.

You could even add a separate derivation that only outputs this JSON, to avoid having to fetch new versions of host packages (qemu) that we won't use here anyway.
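A hedged sketch of what that could look like on the consuming side; the file name driver-info.json and the key names here are invented for illustration:

```python
import json
import sys
from pathlib import Path

# Hypothetical: the driver derivation ships a JSON file next to bin/, so the
# running driver can read it instead of re-executing the new driver binary.
exe = Path(sys.argv[0]).resolve()
info = json.loads((exe.parent.parent / "driver-info.json").read_text())
start_scripts = info["startScripts"]
vlans_str = info["vlans"]
testscript = info["testScript"]
output_directory = info["outputDirectory"]
```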
The reason I did it the way I did is that things seem to be built so that the driver itself is a standalone executable. It gets wrapped in a wrapper that sets arguments. So I tried with this PR to maintain the separation between those layers. But they could also get set by command line arguments or environment variables.
What you're proposing puts us in a state where the executable depends on the wrapper being set up in a particular way, which feels like mixing layers to me.
Maybe I'm being overly concerned about this though, and the layers are already mixed up or we just don't care if we mix them up. If so, I think your solution is indeed better, but I didn't know enough to make that judgment call myself.
A couple of updates: …
@roberth Wow, thanks for that test! At some point I tried to work out how to automate testing for this feature and kind of gave up, as "tests within tests are broken". Good to see that it is possible.
Force-pushed from d7019ed to 0a0f92c.
Co-authored-by: Robert Hensing <[email protected]>
there are caveats, and it's not actually related to this PR
Co-authored-by: Robert Hensing <[email protected]>
Force-pushed from 0a0f92c to df81599.
@roberth can we push this through?
Hey @Radvendii, I am rejoining this thread very late. Just tried it out on my machine and got an error where I am not sure if I did anything wrong or if it is a bug. What I did so far was the following: …

The store path exists, but there's no … Can you give me a pointer to what I might have done wrong here?
You're holding it completely right. This is definitely a bug I introduced in later commits. Thanks for finding it! I know vaguely where it is, but I haven't had the time and brainspace to fix it and work out why the test isn't failing. I'll ping you when I do.
Description of changes

- rebuild() command to the NixOS testing driver, that will rebuild the test and push any changes to running VMs (see the usage sketch after this list)
- --rebuild-cmd flag to the driver to specify what command rebuilds the test (can also be passed directly to rebuild())
- --internal-print-update-driver-info-and-exit flag to the driver that will just print the information needed to update another driver.
- testScript …
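A hypothetical interactive session tying these flags together; the exact spelling of the REPL command follows the description above, not verified driver behavior:

```python
>>> # driver started with: ./result/bin/nixos-test-driver --rebuild-cmd "nix-build . -A nixosTests.login.driverInteractive"
>>> rebuild()  # re-runs the stored rebuild command and redeploys changed machines in place
>>> rebuild("nix-build . -A nixosTests.login.driverInteractive")  # or pass the command directly
```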
Bugs

- rebuild() … for some reason

This PR has been reworked since its original formulation. See below for the original text.
Original PR description

- … driverInteractive, as well as in its own output: redeploy, which will deploy the current test machine configuration to an actively running driverInteractive
- … driverInteractive, which serves as a jump host so that the user can ssh into machines in the test (including to do the redeploy).
- … driverInteractive, as well as in its own output: sshConfig, to take advantage of that jump host. This can be used with ssh -F ./result/ssh_config
- … redeploy_jumphost.start() after starting the interactive driver (couldn't figure out how to do this. does anyone have ideas?)