-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce maintenance mode #482
base: main
Are you sure you want to change the base?
Conversation
Add one endpoint `/osds` to enable stopping and starting snap service for cluster members. The endpoint only accept PUT method. Signed-off-by: Chi Wai Chan <[email protected]>
Introduce two subcommands: `microceph cluster maintenance enter` and `microceph cluster maintenance exit` allow cloud operators to do maintenance operations on the node without affecting the ceph cluster. Signed-off-by: Chi Wai Chan <[email protected]>
f342430
to
e45b338
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall no big issue. Only some suggestions.
Signed-off-by: Chi Wai Chan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM.
@@ -0,0 +1,51 @@ | |||
package api |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking):
It will be nice if you can separate the API part to another PR, which may be more easy for reviewer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @chanchiwai-ray thanks for this, think this is a sound approach! I've did a first pass on review and left some suggestions / nits.
Also it would be good to have functional testing on this too.
microceph/ceph/operations.go
Outdated
|
||
// DryRun prints out the action plan. | ||
func (o *CheckOsdOkToStopOps) DryRun(name string) string { | ||
return fmt.Sprintf("Check if osds in node '%s' are ok-to-stop.", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to list which OSDs exactly are going to be stopped
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is addressed
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
88b1d19
to
6d1aa77
Compare
Signed-off-by: Chi Wai Chan <[email protected]>
6d1aa77
to
c184eea
Compare
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @jneo8 thanks for these!
Some suggestions inline.
I'd also like to suggest running an API test against the newly created /osds endpoint. See the hurl() function and tests/hurl dir. I think a simple dry-run test would be sufficient.
microceph/ceph/operations.go
Outdated
} | ||
|
||
// Run checks if a node is in the microceph cluster. | ||
func (o *CheckNodeInClusterOps) Run(name string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if its necessary to go via the API client here. At this point we're in server code and could call the resp. handler directly I think.
Similarly in the below client calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I am not sure what do you mean here, can you elaborate more?
At this point we're in server code
The operations are not run in server side (i.e. microcephd
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh apologies, missed that the operations are run client-side.
As a rule we try to make as much logic as possible available server-side. Many clients such as charm-microceph (and therefore sunbeam) control microceph via the API so for the sake of those clients we probably would need to have an API for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be addressed in the recent commits
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
This test only check if it will handle PUT requests expectedly. Signed-off-by: Chi Wai Chan <[email protected]>
if c.flagSetNoout { | ||
operations = append(operations, []ceph.Operation{ | ||
&ceph.SetNooutOps{ClusterOps: clusterOps}, | ||
&ceph.AssertNooutFlagSetOps{ClusterOps: clusterOps}, | ||
}...) | ||
} | ||
|
||
// optionally stop osd service | ||
if c.flagStopOsds { | ||
operations = append(operations, []ceph.Operation{ | ||
&ceph.StopOsdOps{ClusterOps: clusterOps}, | ||
}...) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, if these are optional, what actually happens if the user runs microceph maintenance enter --set-noout=false --stop-osds=false
? What does maintenance mode mean without those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will then be only checks
@@ -954,27 +954,55 @@ func outDownOSD(osd int64) error { | |||
return nil | |||
} | |||
|
|||
func safetyCheckStop(osd int64) error { | |||
func setOsdNooutFlag(set bool) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be clearer as two functions:
func setOsdNooutFlag() error
func UnsetOsdNooutFlag() error
Then we don't need to deal with a boolean argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my initial implementation, but it turns out it's just simple to add a boolean argument to remove the all-most-identical code. Moreover, I think the function name is a bit clear now: "set the osd noout flag" true means set the flag, false means don't set the flag.
func (o *UnsetNooutOps) Run(name string) error { | ||
err := setOsdNooutFlag(false) | ||
if err != nil { | ||
return err | ||
} | ||
logger.Info("unset osd noout.") | ||
return nil | ||
} | ||
|
||
// DryRun prints out the action plan. | ||
func (o *UnsetNooutOps) DryRun(name string) string { | ||
return "Run `ceph osd unset noout`." | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a little bit of abstraction leaking between Run and DryRun here: Run uses the setOsdNooutFlag() function, but DryRun prints the lower level ceph command used. Just an observation. Not sure how nitpicky this is, or the best way to go - perhaps inlining the setOsdNooutFlag function to remove the extra layer of indirection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DryRun
function should give you the actual action performed, not the string representation of the function
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
Signed-off-by: Chi Wai Chan <[email protected]>
Description
MicroCeph currently supports scaling up and down, but not maintenance. In the actual operation, the operators may also need to perform regular maintenance to the storage nodes, and MicroCeph does not offer a simple solution to this problem. We would like to introduce sub-commands (e.g. microceph cluster maintenance enter and microceph cluster maintenance exit) to MicroCeph to enable the feature.
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the addition/modification of tests done to verify this change. Please also list any relevant details for your test configuration.
Contributor's Checklist
Please check that you have: