Add in-place upgrade explanation #770

Merged
merged 1 commit into from
Nov 15, 2024

Conversation

HomayoonAlimohammadi

Overview

This PR adds the explanation page for CAPI in-place upgrades.

@HomayoonAlimohammadi HomayoonAlimohammadi requested a review from a team as a code owner November 7, 2024 10:03
evilnick commented Nov 8, 2024

Thanks for the images. I will need to put those in the asset manager and update the links in the text. I will do that first and then take a look through the words

@eaudetcobello eaudetcobello left a comment

Excellent work, thank you!

docs/src/capi/explanation/in-place-upgrades.md Outdated Show resolved Hide resolved
@canonical canonical deleted a comment from github-actions bot Nov 11, 2024

Hi, looks like pyspelling job found some issues, you can check it here

@louiseschmidtgen louiseschmidtgen left a comment

Fantastic work here and lovely images!!
Let's cut some implementation details from this explanation and make this a focused doc.

Comment on lines 3 to 6
Upgrading the Kubernetes version of the machines is sometimes a necessity.
Rolling upgrades are the most popular way to go, but in certain situations
we might want or need to go with an in-place upgrade. Examples of these
situations are:

Suggested change
Upgrading the Kubernetes version of the machines is sometimes a necessity.
Rolling upgrades are the most popular way to go, but in certain situations
we might want or need to go with an in-place upgrade. Examples of these
situations are:
Regularly upgrading the Kubernetes version of the machines in a cluster is important.
While rolling upgrades a popular strategy, certain situations
may demand in-place upgrades:

Try to focus your text more. As you write, ask yourself about each piece: is this valuable information or just nice fluff? The art here is to convey the minimum information in a way that still flows nicely.

As a second note, make statements that are confident and avoid hedges such as "sometimes" and "might want".

Comment on lines 8 to 9
- Resource constraints (i.e. we can not afford spawning new machines).
- Costly and manual setup process for each machine which we can not afford.

Suggested change
- Resource constraints (i.e. we can not afford spawning new machines).
- Costly and manual setup process for each machine which we can not afford.
- Resource constraints (i.e. cost of additional machines).
- Expensive manual setup process for nodes.

Comment on lines 11 to 12
In this section we're going to see how the orchestrated in-place upgrades
work in {{product}} CAPI.

Suggested change
In this section we're going to see how the orchestrated in-place upgrades
work in {{product}} CAPI.

This isn't new information; we're already in the CAPI section of the docs.

Comment on lines 16 to 18
In CAPI, machines are considered immutable. They can not be changed once
they’re created. In order to change a machine, according to CAPI design
decisions, we must replace it with a new one. Quoting from [CAPI concepts][1]:

Suggested change
In CAPI, machines are considered immutable. They can not be changed once
they’re created. In order to change a machine, according to CAPI design
decisions, we must replace it with a new one. Quoting from [CAPI concepts][1]:
CAPI machines are considered immutable. Consequently, machines are replaced instead of reconfigured.

Comment on lines 20 to 37
> From the perspective of Cluster API, all Machines are immutable: once
they are created, they are never updated (except for labels, annotations and
status), only deleted.

{{product}} CAPI leverages the fact that "annotations" can be changed on
an already created machine. As we will later see in details, we can perform
an in-place upgrade by changing a machine’s annotations. Changing “labels”
or “status” does not make much sense, as the former is mostly used as a
[grouping/organization mechanism][2] and the latter is [supplied and updated
by Kubernetes][3].

That being said, the whole idea of doing an in-place upgrade might not be
completely aligned with the upstream cluster API design decisions as we’re
essentially changing a machine’s Kubernetes version, something that’s
supposed to be achieved by doing a rolling upgrade (replacing old ones).
Either way, having the ability to perform an in-place upgrade comes with
many benefits, so we decided to design and implement a way to enable users
perform these upgrades with minimum friction.

This is too much information for the user.
Let's just point to upstream docs for the curious and say something like:
While CAPI doesn't support in-place upgrades, Canonical Kubernetes CAPI does by leveraging annotations for the implementation.

Comment on lines 161 to 192
### Orchestrated In-Place Upgrade Controller

While the “Single Machine In-Place Upgrade Controller” is responsible
for upgrading individual machines, if our workload cluster is made up
of tens, hundreds, or thousands of nodes, annotating them one by one and
keeping track of their statuses, failures, reasons and errors can become
a daunting or even impossible task. “Orchestrated In-Place Upgrade
Controller” to the rescue!

The main idea behind this class of controllers is to enable us to upgrade
multiple machines by only annotating a single object: the owner of
those machines. In CAPI, we mostly have two main machine groups in
our workload cluster: “control-plane-node machines” and “worker-node
machines”.

Let’s say we want to upgrade all of our worker nodes. In this case, we
only need to annotate the `MachineDeployment` object. The orchestrated
upgrade controller watches for the `upgrade-to` annotation on the
`MachineDeployment` and will trigger an in-place upgrade for all of
its owned machines by annotating them one by one. The responsibility
is then delegated to the single machine upgrader to make sure each
individual machine is getting upgraded successfully. The orchestrator
is only there to keep track of the upgrade status of these machines.
If any of the machines fail to get upgraded, the orchestrator marks
the owner (here, `MachineDeployment`) with the respective annotations
(`status: failed` and `last-failed-attempt-at`), publishes helpful
and informative events and trusts the single machine controller to do
the retry and eventually succeed. When all the owned machines are
upgraded as expected, the orchestrator considers the operation successful,
marks the owner with `release` and `status: done` and steps down.

To paint a better picture, let’s have a look at the flow of the orchestrator:

Let's reduce/remove a good chunk of this.

Rename the section to In-place upgrades on large workload clusters.
Then add 2-3 sentences max and then your picture explains the rest.
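
For readers following this thread without the diagram, the orchestration flow in the quoted section can be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not the controller's actual code (which lives in the {{product}} CAPI repository). The `Machine` and `Owner` classes and the `upgrade_machine` callback are hypothetical stand-ins; only the `upgrade-to`, `status`, `release` and `last-failed-attempt-at` names come from the quoted text, and how the real controller stores them is not shown in this conversation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class Machine:
    name: str
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Owner:
    """Hypothetical stand-in for an object that owns machines, e.g. a MachineDeployment."""
    name: str
    machines: List[Machine]
    annotations: Dict[str, str] = field(default_factory=dict)


def orchestrate(owner: Owner, upgrade_machine: Callable[[Machine], bool]) -> None:
    """One reconcile pass: propagate `upgrade-to` to owned machines one by one."""
    target = owner.annotations.get("upgrade-to")
    if target is None:
        return  # no upgrade requested on the owner
    for machine in owner.machines:
        if machine.annotations.get("release") == target:
            continue  # this machine is already on the requested release
        machine.annotations["upgrade-to"] = target
        # Assumed hook standing in for the single machine upgrader.
        if not upgrade_machine(machine):
            owner.annotations["status"] = "failed"
            owner.annotations["last-failed-attempt-at"] = (
                datetime.now(timezone.utc).isoformat()
            )
            return  # leave the retry to a later reconcile pass
        machine.annotations["release"] = target
        del machine.annotations["upgrade-to"]
    # Every owned machine reports the target release.
    owner.annotations["release"] = target
    owner.annotations["status"] = "done"
    owner.annotations.pop("last-failed-attempt-at", None)
```

Calling `orchestrate` again after a failure skips machines that already carry the target `release`, which mirrors the "retry and eventually succeed" behaviour described in the quoted paragraph.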

Comment on lines 200 to 202

For the detailed implementation of this controller, make sure to check out
the [{{product}} CAPI repository][capi-repo].

Suggested change
For the detailed implementation of this controller, make sure to check out
the [{{product}} CAPI repository][capi-repo].

For the detailed implementation of this controller, make sure to check out
the [{{product}} CAPI repository][capi-repo].

#### Locking The Upgrade Process

Suggested change
#### Locking The Upgrade Process
#### (Optional) Locking The Upgrade Process

Comment on lines 206 to 121
There might be scenarios where we need to make sure that only a limited
number of machines are going to get upgraded at the same time. An example
might be upgrading control plane machines. If multiple nodes become
unavailable due to getting upgraded, we might lose quorum, experience severe
downtime, or end up in an undesirable state that is very costly to get out of.

Let’s say we only want to allow a single upgrade at any given point (no
parallel upgrades). A lock or a semaphore can be implemented to ensure
that the orchestrator is not going to trigger an upgrade for multiple
machines, even if multiple instances of the orchestrator are reconciling
the same object in parallel. Note that this implementation can be
generalized to allow `n` upgrades at the same time, instead of only 1.

For the detailed implementation of the locking process, make sure to
checkout the [{{product}} CAPI repository][capi-repo].


As a user, all I care about is where I configure `n`. Please add that explanation and remove the details.

Contributor Author

Well, for now it can't be configured :)

Contributor

Alright, then let's remove this until we have a way to configure it.
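
For context on the "lock or a semaphore" idea in the quoted paragraph, here is a small Python sketch of limiting concurrent upgrades to `n`. It is only an in-process illustration using `threading.BoundedSemaphore`. The function name `upgrade_with_limit` and the `upgrade` callback are hypothetical, and the real controller would need a lock shared across orchestrator instances, which this example does not attempt to model.

```python
import threading
from typing import Callable, Iterable, List


def upgrade_with_limit(
    machines: Iterable[str],
    upgrade: Callable[[str], None],
    n: int = 1,
) -> int:
    """Run `upgrade` for every machine, never more than `n` at once.

    Returns the highest number of upgrades observed running concurrently.
    """
    slots = threading.BoundedSemaphore(n)
    counter_lock = threading.Lock()
    running = 0
    peak = 0

    def worker(name: str) -> None:
        nonlocal running, peak
        with slots:  # blocks while n upgrades are already in flight
            with counter_lock:
                running += 1
                peak = max(peak, running)
            try:
                upgrade(name)
            finally:
                with counter_lock:
                    running -= 1

    threads: List[threading.Thread] = [
        threading.Thread(target=worker, args=(m,)) for m in machines
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

With `n=1` this behaves like a plain lock, which matches the "single upgrade at any given point" case discussed in the thread.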

Comment on lines 222 to 229
## Conclusion

Provisioning a cluster can be challenging. Trying to upgrade the machines
of that cluster without carefully engineered tools can get out of hand
pretty quickly. In-Place Upgrade Controllers that come out of the box with
{{product}} CAPI providers can take a huge burden off our shoulders by
taking care of the upgrade process in a responsive, self-healing and
cloud-native way.

Suggested change
## Conclusion
Provisioning a cluster can be challenging. Trying to upgrade the machines
of that cluster without carefully engineered tools can get out of hand
pretty quickly. In-Place Upgrade Controllers that come out of the box with
{{product}} CAPI providers can take a huge burden off our shoulders by
taking care of the upgrade process in a responsive, self-healing and
cloud-native way.

This is nice fluff but not critical for understanding in-place upgrades.

@addyess addyess left a comment

Overall very good. Thanks for these details. Once the docs are more assertive we'll be good for merge


## Controllers

In {{product}} CAPI, we have two main types of controllers that handle the

Suggested change
In {{product}} CAPI, we have two main types of controllers that handle the
In {{product}} CAPI, there are two main types of controllers that handle the


#### (Optional) Locking The Upgrade Process

There might be scenarios where we need to make sure that only a limited

Be more assertive here too

Suggested change
There might be scenarios where we need to make sure that only a limited
The controller ensures that only a limited...

@louiseschmidtgen louiseschmidtgen left a comment

Great work! One last round of comments then we should be ready to merge :)

Comment on lines 4 to 5
is important. While rolling upgrades a popular strategy, certain situations
will require in-place upgrades:

Suggested change
is important. While rolling upgrades a popular strategy, certain situations
will require in-place upgrades:
is important. While rolling upgrades are a popular strategy, certain situations
will require in-place upgrades:

Comment on lines 16 to 21
For more information about CAPI design decisions, have a look at the
following:

- [Machine immutability in CAPI][1]
- [Kubernetes objects: `labels`][2]
- [Kubernetes objects: `spec` and `status`][3]

Suggested change
For more information about CAPI design decisions, have a look at the
following:
- [Machine immutability in CAPI][1]
- [Kubernetes objects: `labels`][2]
- [Kubernetes objects: `spec` and `status`][3]
For a deeper understanding of the CAPI design decisions, consider reading about
[machine immutability in CAPI][1], and Kubernetes objects: [`labels`][2], [`spec` and `status`][3].

Nice! Let's embed the links in the text.

Upgrader”. It watches for certain annotations on machines and reconciles
them to make sure the upgrades happen as expected. As its name suggests,
it reconciles each machine individually, considers the machines separately
and does not assume any sort of relation between them.

Suggested change
and does not assume any sort of relation between them.
and does not assume any relation between them.

it reconciles each machine individually, considers the machines separately
and does not assume any sort of relation between them.

The “Orchestrator” on the other hand, watches for certain annotations on

Suggested change
The “Orchestrator” on the other hand, watches for certain annotations on
The “Orchestrator” watches for certain annotations on

Comment on lines 108 to 111
{{product}} has a daemon running in the background called the `k8sd`.
It’s responsible for many things in the context of {{product}} but
one of them is to expose certain endpoints that can be used to
interact with the cluster. The single machine upgrader calls the

Suggested change
{{product}} has a daemon running in the background called the `k8sd`.
It’s responsible for many things in the context of {{product}} but
one of them is to expose certain endpoints that can be used to
interact with the cluster. The single machine upgrader calls the
The {{product}} `k8sd` daemon exposes endpoints that can be used to
interact with the cluster. The single machine upgrader calls the

Comment on lines 112 to 118
`/snap/refresh` endpoint on the machine that it’s trying to upgrade.
This endpoint call will trigger the “actual” upgrade process and in
the meantime, the `/snap/refresh-status` will be called periodically
by the single machine upgrader to see how things are going. It’s worth
noting that ensuring secure communication between the single machine
upgrader and the {{product}} daemon (k8sd) is really important and
something that is handled internally as well.

Suggested change
`/snap/refresh` endpoint on the machine that it’s trying to upgrade.
This endpoint call will trigger the “actual” upgrade process and in
the meantime, the `/snap/refresh-status` will be called periodically
by the single machine upgrader to see how things are going. It’s worth
noting that ensuring secure communication between the single machine
upgrader and the {{product}} daemon (k8sd) is really important and
something that is handled internally as well.
`/snap/refresh` endpoint on the machine to trigger the upgrade process while checking `/snap/refresh-status` periodically.

IMHO, we can remove the note on secure communication.
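
The trigger-then-poll interaction described in the quoted lines can be sketched as follows. This is a hedged Python illustration, not the upgrader's real code: the `call` transport, the `upgrade-to`, `change-id`, `completed` and `error` field names, and the polling parameters are all assumptions made for the example. Only the `/snap/refresh` and `/snap/refresh-status` paths come from the text above.

```python
import time
from typing import Callable, Dict

# Endpoint paths as named in the discussion above.
REFRESH = "/snap/refresh"
REFRESH_STATUS = "/snap/refresh-status"


def upgrade_single_machine(
    call: Callable[[str, Dict], Dict],
    option: str,
    poll_interval: float = 5.0,
    max_polls: int = 60,
) -> bool:
    """Trigger a refresh, then poll until it reports completion.

    `call(path, payload)` is a hypothetical transport to k8sd. The payload
    and response shapes used here are assumptions, not the k8sd API.
    """
    change = call(REFRESH, {"upgrade-to": option})
    for _ in range(max_polls):
        status = call(REFRESH_STATUS, {"change-id": change.get("change-id")})
        if status.get("completed"):
            # Completed without an error field counts as success.
            return status.get("error") is None
        time.sleep(poll_interval)
    return False  # gave up polling; a controller would requeue instead
```

Passing the transport in as a function keeps the sketch independent of how authentication and TLS are handled, which the thread notes is dealt with internally.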


While the “Single Machine In-Place Upgrade Controller” is responsible
for upgrading individual machines, the "Orchestrated In-Place Upgrade
Controller" makes sure that groups of machines will get upgraded.

Suggested change
Controller" makes sure that groups of machines will get upgraded.
Controller" ensures that groups of machines will get upgraded.

Comment on lines 139 to 147

#### (Optional) Locking The Upgrade Process

The controllers ensure that only a single machine
is going to get upgraded at the same time.
This is to prevent undesirable situations like quorum loss or severe downtimes.

{{product}} CAPI uses a lock to ensure only 1 control plane machine is
getting upgraded at a given time.

Suggested change
#### (Optional) Locking The Upgrade Process
The controllers ensure that only a single machine
is going to get upgraded at the same time.
This is to prevent undesirable situations like quorum loss or severe downtimes.
{{product}} CAPI uses a lock to ensure only 1 control plane machine is
getting upgraded at a given time.

Collapsing this into the previous paragraph.

By applying the `upgrade-to` annotation on an object that owns machines
(e.g. a `MachineDeployment`), this controller will mark the owned machines
one by one, which will cause the "Single Machine Upgrader" to pick up those
annotations and upgrade the machines.

Suggested change
annotations and upgrade the machines.
annotations and upgrade the machines. To avoid undesirable situations like quorum loss or severe downtime, these upgrades happen in sequence.

@louiseschmidtgen louiseschmidtgen left a comment

Some more little nits.

Comment on lines 29 to 32
Upgrader”. It watches for certain annotations on machines and reconciles
them to make sure the upgrades happen as expected. As its name suggests,
it reconciles each machine individually, considers the machines separately
and does not assume any relation between them.

Suggested change
Upgrader”. It watches for certain annotations on machines and reconciles
them to make sure the upgrades happen as expected. As its name suggests,
it reconciles each machine individually, considers the machines separately
and does not assume any relation between them.
Upgrader”. The controller watches for annotations on machines and reconciles
them to ensure the upgrades happen smoothly.

Since the name suggests this, we don't need to write it out. I think it becomes clear that this is the single machine controller, as opposed to the Orchestrator :)

Comment on lines 28 to 29
The core component of performing an in-place upgrade is the “Single Machine
Upgrader”. It watches for certain annotations on machines and reconciles

Suggested change
The core component of performing an in-place upgrade is the Single Machine
Upgrader. It watches for certain annotations on machines and reconciles
The core component of performing an in-place upgrade is the `Single Machine
Upgrader`. It watches for certain annotations on machines and reconciles

Comment on lines 34 to 35
The “Orchestrator” watches for certain annotations on
machine owners, reconciles them and upgrades groups of owned machines.
It’s responsible for making sure that all the machines owned by the
reconciled object get upgraded successfully.

Suggested change
The Orchestrator watches for certain annotations on
machine owners, reconciles them and upgrades groups of owned machines.
It’s responsible for making sure that all the machines owned by the
reconciled object get upgraded successfully.
The `Orchestrator` watches for certain annotations on
machine owners, reconciles them and upgrades groups of owned machines.
It’s responsible for ensuring that all the machines owned by the
reconciled object get upgraded successfully.

Comment on lines 68 to 70
Upgrade methods or options can be specified to upgrade to a snap channel,
revision, or install a new snap from a file already placed on the
machine to make up for air-gapped environments.

This needs to get applied :)
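
To make the three upgrade methods in the quoted lines concrete, the sketch below parses an option string into its kind and value. The `<kind>=<value>` format and the `channel`, `revision` and `localPath` keys are assumptions for illustration only; this conversation does not show the actual annotation syntax, so check the {{product}} CAPI documentation before relying on it.

```python
from dataclasses import dataclass

# Assumed option kinds, mirroring the three methods named above:
# a snap channel, a snap revision, or a snap file already on the machine.
VALID_KINDS = {"channel", "revision", "localPath"}


@dataclass(frozen=True)
class UpgradeOption:
    kind: str
    value: str


def parse_upgrade_option(raw: str) -> UpgradeOption:
    """Parse a hypothetical `<kind>=<value>` upgrade option string."""
    kind, sep, value = raw.partition("=")
    if not sep or kind not in VALID_KINDS or not value:
        raise ValueError(f"invalid upgrade option: {raw!r}")
    if kind == "revision" and not value.isdigit():
        raise ValueError(f"revision must be a number: {raw!r}")
    return UpgradeOption(kind, value)
```

The `localPath` kind corresponds to the air-gapped case, where the snap file is placed on the machine ahead of time.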

@louiseschmidtgen louiseschmidtgen left a comment

LGTM, please look into the build-the-docs CI error before merging.

@HomayoonAlimohammadi HomayoonAlimohammadi force-pushed the KU-1894/in-place-upgrades-explanation branch from 6a63b76 to b168550 Compare November 15, 2024 14:23
@HomayoonAlimohammadi HomayoonAlimohammadi merged commit 3528193 into main Nov 15, 2024
6 checks passed
@HomayoonAlimohammadi HomayoonAlimohammadi deleted the KU-1894/in-place-upgrades-explanation branch November 15, 2024 14:28
5 participants