Allow redeploying to interactive tests #281331
base: master
Conversation
I opened the issue (#281332) for discussion of alternate solutions and whether this is an issue worth solving at all.
Force-pushed from 04373da to ececb67.
Question: The …
It feels like this should neither increase the complexity of the Python code nor of the standard test module code. The reason for my thinking is that this is quite an opinionated way to do it (i.e. not switching on a few flags but even adding another host that ought to be used as a jumphost, etc.) and might not automatically play well with any test.
Points well taken. To summarize: …

I will work on finding a way around (2). But if there isn't a clean way to do it in the end, maybe the solution is more to put this in NUR (or a flake), and in Nixpkgs the only change necessary would be to expose the underlying module system structure of the tests, so that it can be extended outside of Nixpkgs.
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/tweag-nix-dev-update-55/40996/1
cc @nikstur, as folks working on the NixOS tests ecosystem have not been paged.
Hard disagree. If we have slow iteration times, we should be removing complexity, not adding more complexity to bypass original complexity.
It's not complexity that leads to the slow iteration time, it's the time to restart the VM and spin up all the services again. For some tests this is quick, but for others it can be slow, and there's not much we can do to change that if the test needs services that are slow to start. There's also the issue of: if I am running the interactive driver and making stateful changes to test things out, and then I want to make a change to the configuration and test that out, I now have to redo all of the stateful changes I made. It's a question of workflow getting disrupted.
There's another approach demonstrated here: https://github.com/tweag/nix-hour/blob/master/templates/vm/configuration.nix#L164-L167

which mounts the relevant directory into the VM so that you can run some version of … This only works, though, because the directory structure is highly constrained. NixOS tests, and really any NixOS VMs, are Nix derivations that can come from, in theory, any Nix code, so you can't know how to extract the derivation from the directory, or even which directory to mount. There could even be references to arbitrary other parts of the file system. I don't see a way to solve that problem...
ececb67
to
916d8b0
Compare
916d8b0
to
b8eaddf
Compare
Okay, I've completely reworked how this PR functions. No jumphosts, no sshing, no modification of the machines in the test. Instead, I've added a method to the driver, … It then does that update by changing fields on the machines, adding new machines and removing old ones, and running nixos-rebuild on machines whose configurations have changed.

I've also added a flag to the driver for internal use which just produces the information needed to update.

There's a little bit of "magic" going on where we assume the path of the new driver executable will be the same as the one we're running now. This works well with …
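For orientation, here is a hedged sketch of that flow in driver-style Python. The class, helper names, and default behavior are illustrative, not the PR's actual internals; only the internal flag appears in the diff below.

```python
import subprocess
import sys


class DriverSketch:
    """Hypothetical outline of the rebuild() flow described above."""

    def __init__(self, rebuild_cmd: str) -> None:
        self.rebuild_cmd = rebuild_cmd

    def rebuild(self, rebuild_cmd: str | None = None) -> None:
        cmd = rebuild_cmd or self.rebuild_cmd
        # Re-run the command that rebuilds the interactive driver.
        subprocess.run(cmd, shell=True, check=True)
        # Ask the freshly built driver for the info needed to update, assuming
        # it lives at the same path as the running one (the "magic" above).
        info = subprocess.check_output(
            [sys.argv[0], "--internal-print-update-driver-info-and-exit"],
            text=True,
        )
        # Reconcile running machines against the new info: add new machines,
        # warn about removed-but-running ones, redeploy changed ones.
        self._update_machines(info)

    def _update_machines(self, info: str) -> None:
        raise NotImplementedError("reconciliation elided in this sketch")
```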
if machine.name not in names:
    if machine.is_up():
        self.logger.warning(
            f"{machine.name} removed from the test, but it's running so we're not going to shut it down automatically. Call driver.machines[\"{machine.name}\"].stop() and re-run rebuild() to remove."
If I understand correctly, with N machines running, we still need to stop all of them before executing rebuild(...). This approach doesn't seem significantly more advantageous than stopping the entire driver and rerunning it with the --keep-vm-state parameter. This way, the interactive shell history and the VMs' disks remain intact for the next run. The most important difference is that machines that are not stopped remain running with all their state in RAM, if that is of value for some scenario, right?

Could you please elaborate on your specific use case and the exact benefits you see from this additional complexity? Understanding this will help evaluate if the added complexity is justified.
> If I understand correctly, with N machines running, we still need to stop all of them before executing rebuild(...)

No! This is only if you delete a machine from the test, communicating that you no longer want it to be there at all.

If you change the config of a machine, it will do the equivalent of running nixos-rebuild switch on that machine. Namely, it will build the new configuration and run <new-configuration-store-path>/system/bin/switch-to-configuration test on the running VM.
I expect deleting machines from the test in this context will be quite uncommon. This code is just handling an edge case gracefully. Most of the time you will be modifying existing machines, and that's where this PR shines, since it allows you to redeploy the VMs in-place rather than restart them.
Logs from the running virtual machines get output in-line and can be disruptive
while using the python REPL. You can separate these streams by redirecting
stderr:
The documentation team recommends one sentence per line, which can be a long line. It works better with GitHub suggestions, as it requires no reflowing of these ignored line breaks.
(Pre-existing problem and not urgent for these additions)
Yeah, I guess I was trying to match the style of the document. Should I reformat the page since I'm editing it anyway, or should that be its own PR?
    machine.start_command = start_cmd
else:
    self.logger.warning(
        f"{start_cmd.machine_name} has multiple instances. This shouldn't be possible, but either way it's now ambiguous which to update."
Could we log info about the running instances?
I'm not sure how this would happen, so I'm not sure what would be useful info. If it does happen, it's certainly a bug in the testing framework and not a problem with the user's code. Maybe I should say that explicitly.
/ "bin" | ||
/ "switch-to-configuration" | ||
) | ||
machine.succeed(f"{switch_cmd} test") |
This only works if the host store is forwarded into the VM.
If the VM is completely image based, the new store paths will have to be copied in.
What if we always forward the host store to /root/host-store in the guests? No sane program will find it there, so it doesn't affect the purity or performance of the test, but it does allow us to do a nix copy just for the purpose of making this work.

This should error out early if virtualisation.useNixStoreImage = …
We don't have access to the nixos configuration here, only the built output. We could check for the presence of the store path, and if it doesn't exist produce a helpful error message that points to this as a likely culprit. Were you imagining something else?

Always forwarding the host store to /root/host-store seems like a big change for only this purpose. It seems not very common to set useNixStoreImage in tests, and IIUC it's only for performance purposes, so it can be turned off while interactively developing the test.

That being said, my feelings against adding /root/host-store are not strong, so if you or other people feel like it is a good change, I'm happy to add it in.
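For concreteness, a minimal sketch of what that copy-in step could look like, assuming the host store were exposed in the guest as a chroot store rooted at /root/host-store (the mount point floated above; none of this is implemented by the PR):

```python
# Assumption: the host's /nix/store is mounted at /root/host-store/nix/store,
# so the guest can address /root/host-store as a chroot store.
# `machine` is a running test-driver Machine object.
def copy_in_and_switch(machine, toplevel: str) -> None:
    # Copy the new system closure from the host store into the guest's store.
    machine.succeed(f"nix copy --no-check-sigs --from /root/host-store {toplevel}")
    # Activate the new configuration without touching the boot loader.
    machine.succeed(f"{toplevel}/bin/switch-to-configuration test")
```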
```ShellSession
$ build_cmd="nix-build . -A nixosTests.login.driverInteractive"
$ eval $build_cmd
$ ./result/bin/nixos-test-driver --rebuild-cmd $build_cmd 2>machine_output
```
Suggested change:

```ShellSession
$ ./result/bin/nixos-test-driver --rebuild-cmd "$build_cmd"
```

$build_cmd has spaces, so it should be in quotes to avoid potential issues depending on shell configuration.

2>machine_output will consume all stderr output, not just output from the VM. Maybe remove it from the docs to avoid confusion?
Good call on the quotes.
What stderr output are you worried about? Python errors, for instance, are output to stdout.
I added it to the docs because in my experience the interactive driver is extremely irritating to use with the VM output clogging up the interactive terminal, and it took me a while to realise I could redirect it.
Oh indeed, calls to logger.info() and the like get sent to stderr. I'll just remove this, since it's not actually related to the PR.
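For reference, a standalone snippet (not part of the PR) demonstrating that Python's logging handlers write to stderr by default, which is why 2>machine_output captures these messages too:

```python
import logging

logging.basicConfig(level=logging.INFO)  # default StreamHandler targets sys.stderr
logging.getLogger(__name__).info("this line goes to stderr")
print("this line goes to stdout")  # redirecting stderr leaves this visible
```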
new_driver_info = subprocess.check_output(
    [exe, "--internal-print-update-driver-info-and-exit"],
    text=True,
)
(
    start_scripts,
    vlans_str,
    testscript,
    output_directory,
) = new_driver_info.rstrip().split("\n")
I think it would be better to write these arguments as JSON in the interactive driver derivation output instead. This way we have less stuff to run outside of the sandbox, and less guessing about the binary path and its arguments.

You could even add a separate derivation that only outputs this JSON, to avoid having to fetch new versions of host packages (qemu) that we won't use here anyway.
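A hedged sketch of what that could look like on the consuming side; the file name driver-info.json and the key names here are invented for illustration:

```python
import json
import sys
from pathlib import Path

# Hypothetical: the driver derivation ships a JSON file next to bin/, so the
# running driver can read it instead of re-executing the new driver binary.
exe = Path(sys.argv[0]).resolve()
info = json.loads((exe.parent.parent / "driver-info.json").read_text())
start_scripts = info["startScripts"]
vlans_str = info["vlans"]
testscript = info["testScript"]
output_directory = info["outputDirectory"]
```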
The reason I did it the way I did is that things seem to be built so that the driver itself is a standalone executable. It gets wrapped in a wrapper that sets arguments. So I tried with this PR to maintain the separation between those layers. But they could also get set by command line arguments or environment variables.
What you're proposing puts us in a state where the executable depends on the wrapper being set up in a particular way, which feels like mixing layers to me.
Maybe I'm being overly concerned about this though, and the layers are already mixed up or we just don't care if we mix them up. If so, I think your solution is indeed better, but I didn't know enough to make that judgment call myself.
A couple of updates: …
@roberth Wow, thanks for that test! At some point I tried to work out how to automate testing for this feature and kind of gave up, as "tests within tests are broken". Good to see that it is possible.
Force-pushed from d7019ed to 0a0f92c.
Co-authored-by: Robert Hensing <[email protected]>
there are caveats, and it's not actually related to this PR
Co-authored-by: Robert Hensing <[email protected]>
Force-pushed from 0a0f92c to df81599.
@roberth can we push this through?
Hey @Radvendii, I am rejoining this thread very late. Just tried it out on my machine and got an error where I am not sure if I did anything wrong or if it is a bug. What I did so far was the following: …

The store path exists, but there's no … Can you give me a pointer to what I might have done wrong here?
You're holding it completely right. This is definitely a bug I introduced in later commits. Thanks for finding it! I know vaguely where it is, but I haven't had the time and brainspace to fix it and work out why the test isn't failing. I'll ping you when I do.
Description of changes

- rebuild() command to the NixOS testing driver, that will rebuild the test and push any changes to running VMs (see the usage sketch after this list)
- --rebuild-cmd flag to the driver to specify what command rebuilds the test (can also be passed directly to rebuild())
- --internal-print-update-driver-info-and-exit flag to the driver that will just print the information needed to update another driver.
- testScript …
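A hypothetical interactive session tying these flags together; the exact spelling of the REPL command follows the description above, not verified driver behavior:

```python
>>> # driver started with: ./result/bin/nixos-test-driver --rebuild-cmd "nix-build . -A nixosTests.login.driverInteractive"
>>> rebuild()  # re-runs the stored rebuild command and redeploys changed machines in place
>>> rebuild("nix-build . -A nixosTests.login.driverInteractive")  # or pass the command directly
```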
Bugs

- rebuild() … for some reason

This PR has been reworked since its original formulation. See below for the original text.
Original PR description

- … driverInteractive, as well as in its own output: redeploy, which will deploy the current test machine configuration to an actively running driverInteractive
- … driverInteractive, which serves as a jump host so that the user can ssh into machines in the test (including to do the redeploy).
- … driverInteractive, as well as in its own output: sshConfig, to take advantage of that jump host. This can be used with ssh -F ./result/ssh_config
- … redeploy_jumphost.start() after starting the interactive driver (couldn't figure out how to do this. does anyone have ideas?)