Address comments
HomayoonAlimohammadi committed Nov 13, 2024
1 parent 8823533 commit f735ec1
Showing 1 changed file with 45 additions and 111 deletions.
156 changes: 45 additions & 111 deletions docs/src/capi/explanation/in-place-upgrades.md
@@ -1,40 +1,23 @@
# In-Place Upgrades

Upgrading the Kubernetes version of the machines is sometimes a necessity.
Rolling upgrades are the most popular way to go, but in certain situations
we might want or need to go with an in-place upgrade. Examples of these
situations are:
Regularly upgrading the Kubernetes version of the machines in a cluster
is important. While rolling upgrades are a popular strategy, certain situations
may demand in-place upgrades:

- Resource constraints (i.e. we can not afford spawning new machines).
- Costly and manual setup process for each machine which we can not afford.

In this section we're going to see how the orchestrated in-place upgrades
work in {{product}} CAPI.
- Resource constraints (i.e. cost of additional machines).
- Expensive manual setup process for nodes.

## Annotations

In CAPI, machines are considered immutable. They can not be changed once
they’re created. In order to change a machine, according to CAPI design
decisions, we must replace it with a new one. Quoting from [CAPI concepts][1]:

> From the perspective of Cluster API, all Machines are immutable: once
they are created, they are never updated (except for labels, annotations and
status), only deleted.

{{product}} CAPI leverages the fact that "annotations" can be changed on
an already created machine. As we will later see in detail, we can perform
an in-place upgrade by changing a machine’s annotations. Changing “labels”
or “status” does not make much sense, as the former is mostly used as a
[grouping/organization mechanism][2] and the latter is [supplied and updated
by Kubernetes][3].

That being said, the whole idea of doing an in-place upgrade might not be
completely aligned with the upstream cluster API design decisions as we’re
essentially changing a machine’s Kubernetes version, something that’s
supposed to be achieved by doing a rolling upgrade (replacing old ones).
Either way, having the ability to perform an in-place upgrade comes with
many benefits, so we decided to design and implement a way to enable users
to perform these upgrades with minimum friction.
CAPI machines are considered immutable. Consequently, machines are replaced
instead of reconfigured.
While CAPI doesn't support in-place upgrades, {{product}} CAPI does
by leveraging annotations for the implementation.
For more information about CAPI design decisions, have a look at the
following:
- [Machine immutability in CAPI][1]

Check failure on line 18 in docs/src/capi/explanation/in-place-upgrades.md (GitHub Actions / markdown-lint):

docs/src/capi/explanation/in-place-upgrades.md:18 MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "- [Machine immutability in CAP..."] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md032.md
- [Kubernetes objects: `labels`][2]
- [Kubernetes objects: `spec` and `status`][3]

## Controllers

@@ -59,54 +42,41 @@ The main annotations that drive the upgrade process are as follows (for a
complete and up-to-date list of these annotations and their values please
refer to [annotations reference page][4]):

- `v1beta2.k8sd.io/in-place-upgrade-to` : Instructs the controller to
perform an upgrade with the specified option/method. This is the only
annotation that we as the users need to put on the objects.
- `v1beta2.k8sd.io/in-place-upgrade-status`: As soon as the controller
starts the upgrade process, the object will be marked with this
- `v1beta2.k8sd.io/in-place-upgrade-to` --> `upgrade-to` : Instructs
the controller to perform an upgrade with the specified option/method.
This is the only annotation that we as the users need to put on the objects.
- `v1beta2.k8sd.io/in-place-upgrade-status` --> `status` : As soon as the
controller starts the upgrade process, the object will be marked with this
annotation to indicate the status of the upgrade. It can either be
`in-progress`, `failed` or `done`.
- `v1beta2.k8sd.io/in-place-upgrade-release`: When the upgrade is
performed successfully, this annotation will indicate the current
- `v1beta2.k8sd.io/in-place-upgrade-release` --> `release` : When the
upgrade is performed successfully, this annotation will indicate the current
Kubernetes release/version installed on the machine.

From now on, let's use the following abbreviations to make this article
shorter and more readable:

- `v1beta2.k8sd.io/in-place-upgrade-to` --> `upgrade-to`
- `v1beta2.k8sd.io/in-place-upgrade-status` --> `status`
- `v1beta2.k8sd.io/in-place-upgrade-release` --> `release`
- `v1beta2.k8sd.io/in-place-upgrade-last-failed-attempt-at` -->
`last-failed-attempt-at`
Note that abbreviations of these annotations are used to make this article more readable.

Check failure on line 56 in docs/src/capi/explanation/in-place-upgrades.md (GitHub Actions / markdown-lint):

docs/src/capi/explanation/in-place-upgrades.md:56:81 MD013/line-length Line length [Expected: 80; Actual: 84] https://github.com/DavidAnson/markdownlint/blob/v0.34.0/doc/md013.md

### Single Machine In-Place Upgrade Controller

The Machine objects can be marked with the `upgrade-to` annotation to
trigger an in-place upgrade for that machine. As an example, let’s say
we have a machine with `1.30/stable` Kubernetes snap installed on it.
To initiate an in-place upgrade for this machine, we can annotate it
with `upgrade-to: channel=1.31/stable`. The single machine upgrader
trigger an in-place upgrade for that machine. The single machine upgrader
(which is watching for changes on machines) notices this annotation
and attempts to upgrade the Kubernetes version of that machine to the
specified version.

Because {{product}} is shipped in a snap package, performing an upgrade
can be as easy as doing a `snap refresh`. Upgrade methods or options
can be specified to upgrade to a snap channel, revision, or install a
new snap from a file already placed on the machine to make up for
air-gapped environments.
Upgrade methods or options can be specified to upgrade to a snap channel
or revision, or to install a new snap from a file already placed on the
machine, which accommodates air-gapped environments.
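
To make the shape of an `upgrade-to` value concrete, here is a minimal Python sketch that splits a value such as `channel=1.31/stable` into a method and a target. The helper name `parse_upgrade_option` and the `UpgradeOption` type are hypothetical and only illustrate the `<method>=<target>` form seen in this document; the accepted methods are listed on the annotations reference page, and the controller's real parsing may differ.

```python
from typing import NamedTuple


class UpgradeOption(NamedTuple):
    method: str  # e.g. "channel"
    target: str  # e.g. "1.31/stable"


def parse_upgrade_option(value: str) -> UpgradeOption:
    """Split an `upgrade-to` value of the assumed form `<method>=<target>`."""
    method, sep, target = value.partition("=")
    method, target = method.strip(), target.strip()
    if not sep or not method or not target:
        raise ValueError(f"expected '<method>=<target>', got {value!r}")
    return UpgradeOption(method, target)


option = parse_upgrade_option("channel=1.31/stable")
print(option.method, option.target)
```

The sketch deliberately does not validate the method name, so it makes no claim about which methods the controller accepts.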

When the upgrade is finished successfully, we will notice (at least) the
following annotations on the machine:
A successfully upgraded machine shows the following annotations:

```yaml
annotations:
v1beta2.k8sd.io/in-place-upgrade-release: "channel=1.31/stable"
v1beta2.k8sd.io/in-place-upgrade-status: "done"
```
If the upgrade fails, the controller will mark the machine with the
following annotations and retry immediately:
If the upgrade fails, the controller will mark the machine and retry
the upgrade immediately:
```yaml
annotations:
@@ -147,70 +117,34 @@ something that is handled internally as well.

![Diagram][img-k8sd-call]

### Orchestrated In-Place Upgrade Controller
### In-place upgrades on large workload clusters

While the “Single Machine In-Place Upgrade Controller” is responsible
for upgrading individual machines, if our workload cluster is made up
of tens, hundreds, or thousands of nodes, annotating them one by one and
keeping track of their statuses, failures, reasons and errors can become
a daunting or even impossible task. “Orchestrated In-Place Upgrade
Controller” to the rescue!

The main idea behind this class of controllers is to enable us to upgrade
multiple machines by only annotating a single object: the owner of
those machines. In CAPI, we mostly have two main machine groups in
our workload cluster: “control-plane-node machines” and “worker-node
machines”.

Let’s say we want to upgrade all of our worker nodes. In this case, we
only need to annotate the `MachineDeployment` object. The orchestrated
upgrade controller watches for the `upgrade-to` annotation on the
`MachineDeployment` and will trigger an in-place upgrade for all of
its owned machines by annotating them one by one. The responsibility
is then delegated to the single machine upgrader to make sure each
individual machine is getting upgraded successfully. The orchestrator
is only there to keep track of the upgrade status of these machines.
If any of the machines fail to get upgraded, the orchestrator marks
the owner (here, `MachineDeployment`) with the respective annotations
(`status: failed` and `last-failed-attempt-at`), publishes helpful
and informative events and trusts the single machine controller to do
the retry and eventually succeed. When all the owned machines are
upgraded as expected, the orchestrator considers the operation successful,
marks the owner with `release` and `status: done` and steps down.

To paint a better picture, let’s have a look at the flow of the orchestrator:
for upgrading individual machines, the "Orchestrated In-Place Upgrade
Controller" makes sure that groups of machines will get upgraded.
When the `upgrade-to` annotation is applied to an object that owns machines
(e.g. a `MachineDeployment`), this controller marks the owned machines
one by one, which causes the "Single Machine Upgrader" to pick up those
annotations and upgrade the machines.

![Diagram][img-orchestrated]
Failure and success of individual machine upgrades will be reported back
to the orchestrator by the single machine upgrader via annotations.
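
The orchestration described above can be sketched in Python as a small in-memory simulation. This is an illustration under stated assumptions, not the controller's implementation: objects are modeled as plain annotation dictionaries, `orchestrate` and `fake_upgrader` are hypothetical names, the single machine upgrader is called synchronously (real controllers reconcile asynchronously), and the loop simply stops at the first failure instead of waiting for retries.

```python
from typing import Callable, Dict, List

Annotations = Dict[str, str]

UPGRADE_TO = "v1beta2.k8sd.io/in-place-upgrade-to"
STATUS = "v1beta2.k8sd.io/in-place-upgrade-status"
RELEASE = "v1beta2.k8sd.io/in-place-upgrade-release"


def orchestrate(owner: Annotations, machines: List[Annotations],
                upgrade_machine: Callable[[Annotations], None]) -> None:
    """Propagate the owner's upgrade-to annotation to its machines one by one."""
    target = owner.get(UPGRADE_TO)
    if target is None:
        return  # no upgrade requested on the owner
    owner[STATUS] = "in-progress"
    for machine in machines:
        if machine.get(STATUS) == "done" and machine.get(RELEASE) == target:
            continue  # this machine already runs the requested release
        machine[UPGRADE_TO] = target
        upgrade_machine(machine)  # stand-in for the single machine upgrader
        if machine.get(STATUS) != "done":
            owner[STATUS] = "failed"  # simplification: stop at first failure
            return
    owner[RELEASE] = target
    owner[STATUS] = "done"


def fake_upgrader(machine: Annotations) -> None:
    """Hypothetical upgrader that always succeeds and reports via annotations."""
    machine[RELEASE] = machine[UPGRADE_TO]
    machine[STATUS] = "done"


deployment: Annotations = {UPGRADE_TO: "channel=1.31/stable"}
workers: List[Annotations] = [{}, {}, {}]
orchestrate(deployment, workers, fake_upgrader)
print(deployment[STATUS])
```

Only the owner is annotated by the user; every machine-level annotation in the sketch is written by the simulated controllers, which mirrors the division of responsibility described above.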

For the detailed implementation of this controller, make sure to check out
the [{{product}} CAPI repository][capi-repo].
To paint a better picture, let’s have a look at the flow of orchestrated
in-place upgrades:

![Diagram][img-orchestrated]

#### Locking The Upgrade Process
#### (Optional) Locking The Upgrade Process

There might be scenarios where we need to make sure that only a limited
number of machines are going to get upgraded at the same time. An example
might be upgrading control plane machines. If multiple nodes become
unavailable due to getting upgraded, we might lose quorum, experience severe
downtime, or end up in an undesirable state that is very costly to get out of.

Let’s say we only want to allow a single upgrade at any given point (no
parallel upgrades). A lock or a semaphore can be implemented to ensure
that the orchestrator is not going to trigger an upgrade for multiple
machines, even if multiple instances of the orchestrator are reconciling
the same object in parallel. Note that this implementation can be
generalized to allow `n` upgrades at the same time, instead of only 1.

For the detailed implementation of the locking process, make sure to
check out the [{{product}} CAPI repository][capi-repo].

## Conclusion

Provisioning a cluster can be challenging. Trying to upgrade the machines
of that cluster without carefully engineered tools can get out of hand
pretty quickly. In-Place Upgrade Controllers that come out of the box with
{{product}} CAPI providers can take a huge burden off our shoulders by
taking care of the upgrade process in a responsive, self-healing and
cloud-native way.
{{product}} CAPI uses a lock to ensure that only one control plane machine
is upgraded at any given time.
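
As a rough illustration of the locking idea, the Python sketch below uses an in-process `threading.Lock` so that only one upgrade runs at a time and a second attempt is skipped rather than queued. This is an assumption-laden stand-in: the names `upgrade_lock` and `try_upgrade` are hypothetical, and this document does not describe how the {{product}} CAPI lock is actually stored or shared between controller instances.

```python
import threading
from typing import Callable

# Hypothetical in-process stand-in for the upgrade lock.
upgrade_lock = threading.Lock()


def try_upgrade(machine_name: str, do_upgrade: Callable[[str], None]) -> bool:
    """Run do_upgrade only if no other upgrade currently holds the lock."""
    if not upgrade_lock.acquire(blocking=False):
        return False  # another machine is being upgraded; try again later
    try:
        do_upgrade(machine_name)
        return True
    finally:
        upgrade_lock.release()  # always release, even if the upgrade raised


started = try_upgrade("control-plane-0", lambda name: None)
print(started)
```

Replacing the lock with `threading.BoundedSemaphore(n)` would allow up to `n` upgrades at once instead of one.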

<!-- IMAGES -->
