Add in-place upgrade explanation #770
Thanks for the images. I will need to put those in the asset manager and update the links in the text. I will do that first and then take a look through the words.
Excellent work, thank you!
Hi, looks like pyspelling job found some issues, you can check it here
Fantastic work here and lovely images!!
Let's cut some implementation details from this explanation and make this a focused doc.
Upgrading the Kubernetes version of the machines is sometimes a necessity.
Rolling upgrades are the most popular way to go, but in certain situations
we might want or need to go with an in-place upgrade. Examples of these
situations are:
Upgrading the Kubernetes version of the machines is sometimes a necessity.
Rolling upgrades are the most popular way to go, but in certain situations
we might want or need to go with an in-place upgrade. Examples of these
situations are:
Regularly upgrading the Kubernetes version of the machines in a cluster is important.
While rolling upgrades a popular strategy, certain situations
may demand in-place upgrades:
Try to focus your text more. As you write something, ask yourself about each piece: is this valuable information or just nice fluff? The art here is to convey the minimum information in a way that still flows nicely.
As a second note, make statements that are confident and avoid `sometimes`, `might want`, etc.
- Resource constraints (i.e. we can not afford spawning new machines).
- Costly and manual setup process for each machine which we can not afford.
- Resource constraints (i.e. we can not afford spawning new machines).
- Costly and manual setup process for each machine which we can not afford.
- Resource constraints (i.e. cost of additional machines).
- Expensive manual setup process for nodes.
In this section we're going to see how the orchestrated in-place upgrades
work in {{product}} CAPI.
In this section we're going to see how the orchestrated in-place upgrades
work in {{product}} CAPI.
This isn't new information; we're in the CAPI section of the docs.
In CAPI, machines are considered immutable. They can not be changed once
they’re created. In order to change a machine, according to CAPI design
decisions, we must replace it with a new one. Quoting from [CAPI concepts][1]:
In CAPI, machines are considered immutable. They can not be changed once
they’re created. In order to change a machine, according to CAPI design
decisions, we must replace it with a new one. Quoting from [CAPI concepts][1]:
CAPI machines are considered immutable. Consequently, machines are replaced instead of reconfigured.
> From the perspective of Cluster API, all Machines are immutable: once
they are created, they are never updated (except for labels, annotations and
status), only deleted.

{{product}} CAPI leverages the fact that "annotations" can be changed on
an already created machine. As we will later see in details, we can perform
an in-place upgrade by changing a machine’s annotations. Changing “labels”
or “status” does not make much sense, as the former is mostly used as a
[grouping/organization mechanism][2] and the latter is [supplied and updated
by Kubernetes][3].

That being said, the whole idea of doing an in-place upgrade might not be
completely aligned with the upstream cluster API design decisions as we’re
essentially changing a machine’s Kubernetes version, something that’s
supposed to be achieved by doing a rolling upgrade (replacing old ones).
Either way, having the ability to perform an in-place upgrade comes with
many benefits, so we decided to design and implement a way to enable users
perform these upgrades with minimum friction.
This is too much information for the user.
Let's just point to upstream docs for the curious and say something like:
While CAPI doesn't support in-place upgrades, Canonical Kubernetes CAPI does by leveraging annotations for the implementation.
### Orchestrated In-Place Upgrade Controller

While the “Single Machine In-Place Upgrade Controller” is responsible
for upgrading individual machines, if our workload cluster is made up
of tens, hundreds, or thousands of nodes, annotating them one by one and
keeping track of their statuses, failures, reasons and errors can become
a daunting or even impossible task. “Orchestrated In-Place Upgrade
Controller” to the rescue!

The main idea behind this class of controllers is to enable us to upgrade
multiple machines by only annotating a single object: the owner of
those machines. In CAPI, we mostly have two main machine groups in
our workload cluster: “control-plane-node machines” and “worker-node
machines”.

Let’s say we want to upgrade all of our worker nodes. In this case, we
only need to annotate the `MachineDeployment` object. The orchestrated
upgrade controller watches for the `upgrade-to` annotation on the
`MachineDeployment` and will trigger an in-place upgrade for all of
its owned machines by annotating them one by one. The responsibility
is then delegated to the single machine upgrader to make sure each
individual machine is getting upgraded successfully. The orchestrator
is only there to keep track of the upgrade status of these machines.
If any of the machines fail to get upgraded, the orchestrator marks
the owner (here, `MachineDeployment`) with the respective annotations
(`status: failed` and `last-failed-attempt-at`), publishes helpful
and informative events and trusts the single machine controller to do
the retry and eventually succeed. When all the owned machines are
upgraded as expected, the orchestrator considers the operation successful,
marks the owner with `release` and `status: done` and steps down.

To paint a better picture, let’s have a look the flow of the orchestrator:
Let's reduce/remove a good chunk of this.
Rename the section to `In-place upgrades on large workload clusters`.
Then add 2-3 sentences max and then your picture explains the rest.
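For readers following the thread without the picture, the orchestrator flow described in the diff above can be sketched roughly as follows. This is a minimal simulation, not the controller's code: the `Obj` class, the `orchestrate` function, the `upgrade_machine` callback, and the per-machine `release` marker are all hypothetical names made up for illustration. Only the `upgrade-to`, `status`, `release` and `last-failed-attempt-at` annotation names on the owner come from the text under review.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class Obj:
    """Hypothetical stand-in for a Kubernetes object that carries annotations."""
    name: str
    annotations: Dict[str, str] = field(default_factory=dict)


def orchestrate(owner: Obj, machines: List[Obj],
                upgrade_machine: Callable[[Obj], bool]) -> None:
    """One reconcile pass: upgrade owned machines one by one, record the result on the owner."""
    target = owner.annotations.get("upgrade-to")
    if target is None:
        return  # no upgrade was requested on the owner

    for machine in machines:
        # Assumption: a machine that already carries the target release is skipped.
        if machine.annotations.get("release") == target:
            continue
        # Annotating the machine hands the work to the single machine upgrader,
        # which is modelled here by the injected upgrade_machine callback.
        machine.annotations["upgrade-to"] = target
        if not upgrade_machine(machine):
            owner.annotations["status"] = "failed"
            owner.annotations["last-failed-attempt-at"] = (
                datetime.now(timezone.utc).isoformat()
            )
            return  # a later reconcile pass picks up where this one stopped
        machine.annotations["release"] = target

    # Every owned machine is on the target release.
    owner.annotations["release"] = target
    owner.annotations["status"] = "done"
```

Calling `orchestrate` again after a failed pass models the retry: machines that already succeeded are skipped and the failed one is attempted again.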
For the detailed implementation of this controller, make sure to check out
the [{{product}} CAPI repository][capi-repo].
For the detailed implementation of this controller, make sure to check out
the [{{product}} CAPI repository][capi-repo].
For the detailed implementation of this controller, make sure to check out
the [{{product}} CAPI repository][capi-repo].

#### Locking The Upgrade Process
#### Locking The Upgrade Process
#### (Optional) Locking The Upgrade Process
There might be scenarios where we need to make sure that only a limited
number of machines are going to get upgraded at the same time. An example
might be upgrading control plane machines. If multiple nodes become
unavailable due to getting upgraded, we might lose quorum, experience severe
downtime, or end up in an undesirable state that is very costly to get out of.

Let’s say we only want to allow a single upgrade at any given point (no
parallel upgrades). A lock or a semaphore can be implemented to ensure
that the orchestrator is not going to trigger an upgrade for multiple
machines, even if multiple instances of the orchestrator are reconciling
the same object in parallel. Note that this implementation can be
generalized to allow `n` upgrades at the same time, instead of only 1.

For the detailed implementation of the locking process, make sure to
checkout the [{{product}} CAPI repository][capi-repo].
As a user, all I care about is where I configure `n`. Please add that explanation and remove the details.
Well, for now it can't be configured :)
Alright, then let's remove this until we have a way to configure it.
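For context on what the `n` in this thread would mean, here is a rough sketch of a counting semaphore that caps concurrent upgrades. It is an in-process illustration only: `upgrade_all` and `max_parallel` are hypothetical names, the sleep stands in for real upgrade work, and a real controller would need a lock shared across orchestrator instances rather than Python threads. As noted above, the limit is not configurable in the product today.

```python
import threading
import time
from typing import List


def upgrade_all(machines: List[str], max_parallel: int = 1) -> int:
    """Upgrade every machine with at most max_parallel upgrades in flight.

    Returns the highest number of simultaneous upgrades that was observed.
    """
    slots = threading.BoundedSemaphore(max_parallel)  # the "n" from the discussion
    guard = threading.Lock()  # protects the two counters below
    in_flight = 0
    peak = 0

    def upgrade(machine: str) -> None:
        nonlocal in_flight, peak
        with slots:  # blocks while max_parallel upgrades are already running
            with guard:
                in_flight += 1
                peak = max(peak, in_flight)
            time.sleep(0.01)  # placeholder for the actual upgrade of `machine`
            with guard:
                in_flight -= 1

    threads = [threading.Thread(target=upgrade, args=(m,)) for m in machines]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak
```

With the default `max_parallel=1` the semaphore behaves like a plain lock, which matches the one-at-a-time behaviour discussed for control plane machines.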
## Conclusion

Provisioning a cluster can be challenging. Trying to upgrade the machines
of that cluster without carefully engineered tools can get out of hand
pretty quickly. In-Place Upgrade Controllers that come out of the box with
{{product}} CAPI providers can take a huge burden off our shoulders by
taking care of the upgrade process in a responsive, self-healing and
cloud-native way.
## Conclusion
Provisioning a cluster can be challenging. Trying to upgrade the machines
of that cluster without carefully engineered tools can get out of hand
pretty quickly. In-Place Upgrade Controllers that come out of the box with
{{product}} CAPI providers can take a huge burden off our shoulders by
taking care of the upgrade process in a responsive, self-healing and
cloud-native way.
This is nice fluff but not critical for understanding in-place upgrades.
Hi, looks like pyspelling job found some issues, you can check it here
Overall very good. Thanks for these details. Once the docs are more assertive we'll be good for merge
## Controllers

In {{product}} CAPI, we have two main types of controllers that handle the
In {{product}} CAPI, we have two main types of controllers that handle the
In {{product}} CAPI, there are two main types of controllers that handle the
#### (Optional) Locking The Upgrade Process

There might be scenarios where we need to make sure that only a limited
Be more assertive here too
There might be scenarios where we need to make sure that only a limited
The controller ensures that only a limited...
Hi, looks like pyspelling job found some issues, you can check it here
Great work! One last round of comments then we should be ready to merge :)
is important. While rolling upgrades a popular strategy, certain situations
will require in-place upgrades:
is important. While rolling upgrades a popular strategy, certain situations
will require in-place upgrades:
is important. While rolling upgrades are a popular strategy, certain situations
will require in-place upgrades:
For more information about CAPI design decisions, have a look at the
following:

- [Machine immutability in CAPI][1]
- [Kubernetes objects: `labels`][2]
- [Kubernetes objects: `spec` and `status`][3]
For more information about CAPI design decisions, have a look at the
following:
- [Machine immutability in CAPI][1]
- [Kubernetes objects: `labels`][2]
- [Kubernetes objects: `spec` and `status`][3]
For a deeper understanding of the CAPI design decisions, consider reading about
[machine immutability in CAPI][1], and Kubernetes objects: [`labels`][2], [`spec` and `status`][3].
Nice! Let's embed the links in the text.
Upgrader”. It watches for certain annotations on machines and reconciles
them to make sure the upgrades happen as expected. As its name suggests,
it reconciles each machine individually, considers the machines separately
and does not assume any sort of relation between them.
and does not assume any sort of relation between them.
and does not assume any relation between them.
it reconciles each machine individually, considers the machines separately
and does not assume any sort of relation between them.

The “Orchestrator” on the other hand, watches for certain annotations on
The “Orchestrator” on the other hand, watches for certain annotations on
The “Orchestrator” watches for certain annotations on
{{product}} has a daemon running in the background called the `k8sd`.
It’s responsible for many things in the context of {{product}} but
one of them is to expose certain endpoints that can be used to
interact with the cluster. The single machine upgrader calls the
{{product}} has a daemon running in the background called the `k8sd`.
It’s responsible for many things in the context of {{product}} but
one of them is to expose certain endpoints that can be used to
interact with the cluster. The single machine upgrader calls the
The {{product}} `k8sd` daemon exposes endpoints that can be used to
interact with the cluster. The single machine upgrader calls the
`/snap/refresh` endpoint on the machine that it’s trying to upgrade.
This endpoint call will trigger the “actual” upgrade process and in
the meantime, the `/snap/refresh-status` will be called periodically
by the single machine upgrader to see how things are going. It’s worth
noting that ensuring secure communication between the single machine
upgrader and the {{product}} daemon (k8sd) is really important and
something that is handled internally as well.
`/snap/refresh` endpoint on the machine that it’s trying to upgrade.
This endpoint call will trigger the “actual” upgrade process and in
the meantime, the `/snap/refresh-status` will be called periodically
by the single machine upgrader to see how things are going. It’s worth
noting that ensuring secure communication between the single machine
upgrader and the {{product}} daemon (k8sd) is really important and
something that is handled internally as well.
`/snap/refresh` endpoint on the machine to trigger the upgrade process while checking `/snap/refresh-status` periodically.
IMHO, we can remove the note on secure communication.
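The trigger-then-poll pattern in the paragraph above could look roughly like this. It is a sketch under loud assumptions: the thread does not show the request or response shape of `/snap/refresh` or `/snap/refresh-status`, so the HTTP calls are replaced by two injected callables, and the change identifier and the `"done"` / `"failed"` status strings are hypothetical placeholders rather than the real `k8sd` API.

```python
import time
from typing import Callable


def refresh_and_wait(trigger_refresh: Callable[[], str],
                     get_status: Callable[[str], str],
                     poll_interval: float = 1.0,
                     max_polls: int = 60) -> bool:
    """Trigger a refresh, then poll its status until it finishes or polls run out.

    trigger_refresh stands in for the call to `/snap/refresh` and returns a
    hypothetical change identifier. get_status stands in for the call to
    `/snap/refresh-status` and returns a hypothetical status string.
    """
    change_id = trigger_refresh()
    for _ in range(max_polls):
        status = get_status(change_id)
        if status == "done":
            return True
        if status == "failed":
            return False
        time.sleep(poll_interval)  # still in progress, check again later
    return False  # gave up after max_polls checks
```

Injecting the two calls keeps the sketch independent of transport and authentication, which the thread says are handled internally.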
While the “Single Machine In-Place Upgrade Controller” is responsible
for upgrading individual machines, the "Orchestrated In-Place Upgrade
Controller" makes sure that groups of machines will get upgraded.
Controller" makes sure that groups of machines will get upgraded.
Controller" ensures that groups of machines will get upgraded.
#### (Optional) Locking The Upgrade Process

The controllers ensure that only a single machine
is going to get upgraded at the same time.
This is to prevent undesirable situations like quorum loss or severe downtimes.

{{product}} CAPI uses a lock to ensure only 1 control plane machine is
getting upgraded at a given time.
#### (Optional) Locking The Upgrade Process
The controllers ensure that only a single machine
is going to get upgraded at the same time.
This is to prevent undesirable situations like quorum loss or severe downtimes.
{{product}} CAPI uses a lock to ensure only 1 control plane machine is
getting upgraded at a given time.
Collapsing this into the previous paragraph.
By applying the `upgrade-to` annotation on an object that owns machines
(e.g. a `MachineDeployment`), this controller will mark the owned machines
one by one which will cause the "Single Machine Upgrader" to pickup those
annotations and upgrade the machines.
annotations and upgrade the machines.
annotations and upgrade the machines. To avoid undesirable situations like quorum loss or severe downtime, these upgrades happen in sequence.
Hi, looks like pyspelling job found some issues, you can check it here
Some more little nits.
Upgrader”. It watches for certain annotations on machines and reconciles
them to make sure the upgrades happen as expected. As its name suggests,
it reconciles each machine individually, considers the machines separately
and does not assume any relation between them.
Upgrader”. It watches for certain annotations on machines and reconciles
them to make sure the upgrades happen as expected. As its name suggests,
it reconciles each machine individually, considers the machines separately
and does not assume any relation between them.
Upgrader”. The controller watches for annotations on machines and reconciles
them to ensure the upgrades happen smoothly.
Since the name suggests this, we don't need to write it out. I think it becomes clear that this is the single machine controller, as opposed to the Orchestrator :)
The core component of performing an in-place upgrade is the “Single Machine
Upgrader”. It watches for certain annotations on machines and reconciles
The core component of performing an in-place upgrade is the “Single Machine
Upgrader”. It watches for certain annotations on machines and reconciles
The core component of performing an in-place upgrade is the `Single Machine
Upgrader`. It watches for certain annotations on machines and reconciles
The “Orchestrator” watches for certain annotations on
machine owners, reconciles them and upgrades groups of owned machines.
It’s responsible for making sure that all the machines owned by the
reconciled object get upgraded successfully.
The “Orchestrator” watches for certain annotations on
machine owners, reconciles them and upgrades groups of owned machines.
It’s responsible for making sure that all the machines owned by the
reconciled object get upgraded successfully.
The `Orchestrator` watches for certain annotations on
machine owners, reconciles them and upgrades groups of owned machines.
It’s responsible for ensuring that all the machines owned by the
reconciled object get upgraded successfully.
Upgrade methods or options can be specified to upgrade to a snap channel,
revision, or install a new snap from a file already placed on the
machine to make up for air-gapped environments.
This needs to get applied :)
Hi, looks like pyspelling job found some issues, you can check it here
LGTM, please look into the build-the-docs CI error before merging.
6a63b76 to b168550
Hi, looks like pyspelling job found some issues, you can check it here
Overview
This PR adds the explanation page for CAPI in-place upgrades.