Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce maintenance mode #482

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

chanchiwai-ray
Copy link
Contributor

@chanchiwai-ray chanchiwai-ray commented Dec 19, 2024

Description

MicroCeph currently supports scaling up and down, but not maintenance. In the actual operation, the operators may also need to perform regular maintenance to the storage nodes, and MicroCeph does not offer a simple solution to this problem. We would like to introduce sub-commands (e.g. microceph cluster maintenance enter and microceph cluster maintenance exit) to MicroCeph to enable the feature.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • CleanCode (Code refactor, test updates, does not introduce functional changes)
  • Documentation update (Doc only change)

How Has This Been Tested?

NOTE: All functional changes should accompany corresponding tests (unit tests, functional tests etc).

Please describe the addition/modification of tests done to verify this change. Please also list any relevant details for your test configuration.

Contributor's Checklist

Please check that you have:

  • self-reviewed the code in this PR.
  • added code comments, particularly in hard-to-understand areas.
  • updated the user documentation with corresponding changes.
  • added tests to verify effectiveness of this change.

Add one endpoint `/osds` to enable stopping and starting snap service
for cluster members. The endpoint only accept PUT method.

Signed-off-by: Chi Wai Chan <[email protected]>
Introduce two subcommands: `microceph cluster maintenance enter` and
`microceph cluster maintenance exit` allow cloud operators to do
maintenance operations on the node without affecting the ceph cluster.

Signed-off-by: Chi Wai Chan <[email protected]>
Copy link

@jneo8 jneo8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall no big issue. Only some suggestions.

microceph/ceph/maintenance.go Outdated Show resolved Hide resolved
microceph/ceph/maintenance.go Outdated Show resolved Hide resolved
microceph/ceph/osd.go Outdated Show resolved Hide resolved
microceph/ceph/osd.go Show resolved Hide resolved
microceph/ceph/maintenance.go Outdated Show resolved Hide resolved
microceph/ceph/maintenance.go Outdated Show resolved Hide resolved
Signed-off-by: Chi Wai Chan <[email protected]>
@chanchiwai-ray chanchiwai-ray requested a review from jneo8 December 20, 2024 07:23
Copy link

@jneo8 jneo8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM.

@@ -0,0 +1,51 @@
package api
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking):

It will be nice if you can separate the API part to another PR, which may be more easy for reviewer.

Copy link
Collaborator

@sabaini sabaini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @chanchiwai-ray thanks for this, think this is a sound approach! I've did a first pass on review and left some suggestions / nits.

Also it would be good to have functional testing on this too.

microceph/api/osds.go Outdated Show resolved Hide resolved
microceph/api/types/osds.go Outdated Show resolved Hide resolved
microceph/ceph/operations.go Show resolved Hide resolved

// DryRun prints out the action plan.
func (o *CheckOsdOkToStopOps) DryRun(name string) string {
return fmt.Sprintf("Check if osds in node '%s' are ok-to-stop.", name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to list which OSDs exactly are going to be stopped

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is addressed

@chanchiwai-ray chanchiwai-ray force-pushed the maintenance-mode branch 2 times, most recently from 88b1d19 to 6d1aa77 Compare January 3, 2025 13:06
@chanchiwai-ray chanchiwai-ray marked this pull request as ready for review January 7, 2025 01:23
@chanchiwai-ray chanchiwai-ray requested a review from sabaini January 7, 2025 01:23
Copy link
Collaborator

@sabaini sabaini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jneo8 thanks for these!
Some suggestions inline.

I'd also like to suggest running an API test against the newly created /osds endpoint. See the hurl() function and tests/hurl dir. I think a simple dry-run test would be sufficient.

.github/workflows/tests.yml Outdated Show resolved Hide resolved
microceph/ceph/operations.go Show resolved Hide resolved
microceph/ceph/operations.go Outdated Show resolved Hide resolved
microceph/ceph/operations.go Show resolved Hide resolved
}

// Run checks if a node is in the microceph cluster.
func (o *CheckNodeInClusterOps) Run(name string) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if its necessary to go via the API client here. At this point we're in server code and could call the resp. handler directly I think.
Similarly in the below client calls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I am not sure what do you mean here, can you elaborate more?

At this point we're in server code

The operations are not run in server side (i.e. microcephd)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh apologies, missed that the operations are run client-side.

As a rule we try to make as much logic as possible available server-side. Many clients such as charm-microceph (and therefore sunbeam) control microceph via the API so for the sake of those clients we probably would need to have an API for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be addressed in the recent commits

microceph/cmd/microceph/cluster_maintenance_exit.go Outdated Show resolved Hide resolved
tests/scripts/actionutils.sh Outdated Show resolved Hide resolved
tests/scripts/actionutils.sh Outdated Show resolved Hide resolved
This test only check if it will handle PUT requests expectedly.

Signed-off-by: Chi Wai Chan <[email protected]>
Comment on lines 59 to 71
if c.flagSetNoout {
operations = append(operations, []ceph.Operation{
&ceph.SetNooutOps{ClusterOps: clusterOps},
&ceph.AssertNooutFlagSetOps{ClusterOps: clusterOps},
}...)
}

// optionally stop osd service
if c.flagStopOsds {
operations = append(operations, []ceph.Operation{
&ceph.StopOsdOps{ClusterOps: clusterOps},
}...)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, if these are optional, what actually happens if the user runs microceph maintenance enter --set-noout=false --stop-osds=false? What does maintenance mode mean without those?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will then be only checks

@@ -954,27 +954,55 @@ func outDownOSD(osd int64) error {
return nil
}

func safetyCheckStop(osd int64) error {
func setOsdNooutFlag(set bool) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be clearer as two functions:

func setOsdNooutFlag() error
func UnsetOsdNooutFlag() error

Then we don't need to deal with a boolean argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my initial implementation, but it turns out it's just simple to add a boolean argument to remove the all-most-identical code. Moreover, I think the function name is a bit clear now: "set the osd noout flag" true means set the flag, false means don't set the flag.

Comment on lines +272 to +284
func (o *UnsetNooutOps) Run(name string) error {
err := setOsdNooutFlag(false)
if err != nil {
return err
}
logger.Info("unset osd noout.")
return nil
}

// DryRun prints out the action plan.
func (o *UnsetNooutOps) DryRun(name string) string {
return "Run `ceph osd unset noout`."
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a little bit of abstraction leaking between Run and DryRun here: Run uses the setOsdNooutFlag() function, but DryRun prints the lower level ceph command used. Just an observation. Not sure how nitpicky this is, or the best way to go - perhaps inlining the setOsdNooutFlag function to remove the extra layer of indirection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DryRun function should give you the actual action performed, not the string representation of the function

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants