Add in-place upgrade explanation #770

HomayoonAlimohammadi · 2024-11-07T10:03:52Z

Overview

This PR adds the explanation page for CAPI in-place upgrades.

evilnick · 2024-11-08T13:45:05Z

Thanks for the images. I will need to put those in the asset manager and update the links in the text. I will do that first and then take a look through the words

eaudetcobello

Excellent work, thank you!

docs/src/capi/explanation/in-place-upgrades.md

github-actions · 2024-11-12T09:19:56Z

Hi, looks like pyspelling job found some issues, you can check it here

louiseschmidtgen

Fantastic work here and lovely images!!
Let's cut some implementation details from this explanation and make this a focused doc.

louiseschmidtgen · 2024-11-13T12:02:34Z

docs/src/capi/explanation/in-place-upgrades.md

+Upgrading the Kubernetes version of the machines is sometimes a necessity. 
+Rolling upgrades are the most popular way to go, but in certain situations 
+we might want or need to go with an in-place upgrade. Examples of these 
+situations are:


Suggested change

-Upgrading the Kubernetes version of the machines is sometimes a necessity.
-Rolling upgrades are the most popular way to go, but in certain situations
-we might want or need to go with an in-place upgrade. Examples of these
-situations are:
+Regularly upgrading the Kubernetes version of the machines in a cluster is important.
+While rolling upgrades a popular strategy, certain situations
+may demand in-place upgrades:

Try to focus your text more. As you write, ask yourself about each piece: is this valuable information or just nice fluff? The art here is to convey the minimum information in a way that still flows nicely.

As a second note, make confident statements and avoid hedges such as "sometimes" and "might want".

louiseschmidtgen · 2024-11-13T12:04:10Z

docs/src/capi/explanation/in-place-upgrades.md

+- Resource constraints (i.e. we can not afford spawning new machines).
+- Costly and manual setup process for each machine which we can not afford.


Suggested change

-- Resource constraints (i.e. we can not afford spawning new machines).
-- Costly and manual setup process for each machine which we can not afford.
+- Resource constraints (i.e. cost of additional machines).
+- Expensive manual setup process for nodes.

louiseschmidtgen · 2024-11-13T12:06:39Z

docs/src/capi/explanation/in-place-upgrades.md

+In this section we're going to see how the orchestrated in-place upgrades 
+work in {{product}} CAPI.


Suggested change

-In this section we're going to see how the orchestrated in-place upgrades
-work in {{product}} CAPI.

This isn't new information: we're already in the CAPI section of the docs.

louiseschmidtgen · 2024-11-13T12:11:28Z

docs/src/capi/explanation/in-place-upgrades.md

+In CAPI, machines are considered immutable. They can not be changed once 
+they’re created. In order to change a machine, according to CAPI design 
+decisions, we must replace it with a new one. Quoting from [CAPI concepts][1]:


Suggested change

-In CAPI, machines are considered immutable. They can not be changed once
-they’re created. In order to change a machine, according to CAPI design
-decisions, we must replace it with a new one. Quoting from [CAPI concepts][1]:
+CAPI machines are considered immutable. Consequently, machines are replaced instead of reconfigured.

louiseschmidtgen · 2024-11-13T12:16:50Z

docs/src/capi/explanation/in-place-upgrades.md

+> From the perspective of Cluster API, all Machines are immutable: once 
+they are created, they are never updated (except for labels, annotations and 
+status), only deleted.
+
+{{product}} CAPI leverages the fact that "annotations" can be changed on 
+an already created machine. As we will later see in details, we can perform 
+an in-place upgrade by changing a machine’s annotations. Changing “labels” 
+or “status” does not make much sense, as the former is mostly used as a 
+[grouping/organization mechanism][2] and the latter is [supplied and updated 
+by Kubernetes][3].
+
+That being said, the whole idea of doing an in-place upgrade might not be 
+completely aligned with the upstream cluster API design decisions as we’re 
+essentially changing a machine’s Kubernetes version, something that’s 
+supposed to be achieved by doing a rolling upgrade (replacing old ones). 
+Either way, having the ability to perform an in-place upgrade comes with 
+many benefits, so we decided to design and implement a way to enable users 
+perform these upgrades with minimum friction.


This is too much information for the user.
Let's just point to upstream docs for the curious and say something like:
While CAPI doesn't support in-place upgrades, Canonical Kubernetes CAPI does by leveraging annotations for the implementation.

louiseschmidtgen · 2024-11-13T12:36:54Z

docs/src/capi/explanation/in-place-upgrades.md

+### Orchestrated In-Place Upgrade Controller
+
+While the “Single Machine In-Place Upgrade Controller” is responsible 
+for upgrading individual machines, if our workload cluster is made up 
+of tens, hundreds, or thousands of nodes, annotating them one by one and 
+keeping track of their statuses, failures, reasons and errors can become 
+a daunting or even impossible task. “Orchestrated In-Place Upgrade 
+Controller” to the rescue!
+
+The main idea behind this class of controllers is to enable us to upgrade 
+multiple machines by only annotating a single object: the owner of 
+those machines. In CAPI, we mostly have two main machine groups in 
+our workload cluster: “control-plane-node machines” and “worker-node 
+machines”.
+
+Let’s say we want to upgrade all of our worker nodes. In this case, we 
+only need to annotate the `MachineDeployment` object. The orchestrated 
+upgrade controller watches for the `upgrade-to` annotation on the 
+`MachineDeployment` and will trigger an in-place upgrade for all of 
+its owned machines by annotating them one by one. The responsibility 
+is then delegated to the single machine upgrader to make sure each 
+individual machine is getting upgraded successfully. The orchestrator 
+is only there to keep track of the upgrade status of these machines.
+If any of the machines fail to get upgraded, the orchestrator marks 
+the owner (here, `MachineDeployment`) with the respective annotations 
+(`status: failed` and `last-failed-attempt-at`), publishes helpful 
+and informative events and trusts the single machine controller to do 
+the retry and eventually succeed. When all the owned machines are 
+upgraded as expected, the orchestrator considers the operation successful, 
+marks the owner with `release` and `status: done` and steps down.
+
+To paint a better picture, let’s have a look the flow of the orchestrator:


Let's reduce/remove a good chunk of this.

Rename the section to In-place upgrades on large workload clusters.
Then add 2-3 sentences max and then your picture explains the rest.
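
For readers of this thread, the orchestrator flow described in the quoted hunk can be sketched roughly as follows. This is a minimal in-memory illustration, not the controller's actual code: the `Obj` class, the `reconcile_owner` function, the `upgrade_machine` callback and the un-prefixed annotation keys are all assumptions made for the example, and the real controller works against Kubernetes objects and retries through reconciliation.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """Hypothetical stand-in for a Kubernetes object that carries annotations."""
    name: str
    annotations: dict = field(default_factory=dict)


def reconcile_owner(owner, machines, upgrade_machine, now):
    """One orchestrator pass over an owner such as a MachineDeployment.

    upgrade_machine(machine) stands in for the single machine upgrader and
    returns True on success. Annotation keys are simplified placeholders.
    """
    target = owner.annotations.get("upgrade-to")
    if target is None:
        return "idle"  # nothing requested on the owner

    # Annotate owned machines one by one and delegate the actual upgrade.
    for machine in machines:
        if machine.annotations.get("release") == target:
            continue  # this machine is already on the target release
        machine.annotations["upgrade-to"] = target
        if not upgrade_machine(machine):
            # Record the failure on the owner; a later pass retries.
            owner.annotations["status"] = "failed"
            owner.annotations["last-failed-attempt-at"] = now
            return "failed"
        machine.annotations["release"] = target  # assumption for the sketch

    # Every owned machine is upgraded: mark the owner and step down.
    owner.annotations["release"] = target
    owner.annotations["status"] = "done"
    return "done"
```

Calling `reconcile_owner` again after a failure skips machines that already carry the target release, which mirrors the retry behaviour the hunk describes.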

louiseschmidtgen · 2024-11-13T12:37:05Z

docs/src/capi/explanation/in-place-upgrades.md

+
+For the detailed implementation of this controller, make sure to check out 
+the [{{product}} CAPI repository][capi-repo].


Suggested change

-For the detailed implementation of this controller, make sure to check out
-the [{{product}} CAPI repository][capi-repo].

louiseschmidtgen · 2024-11-13T12:38:28Z

docs/src/capi/explanation/in-place-upgrades.md

+For the detailed implementation of this controller, make sure to check out 
+the [{{product}} CAPI repository][capi-repo].
+
+#### Locking The Upgrade Process


Suggested change

-#### Locking The Upgrade Process
+#### (Optional) Locking The Upgrade Process

louiseschmidtgen · 2024-11-13T12:39:41Z

docs/src/capi/explanation/in-place-upgrades.md

+There might be scenarios where we need to make sure that only a limited 
+number of machines are going to get upgraded at the same time. An example 
+might be upgrading control plane machines. If multiple nodes become 
+unavailable due to getting upgraded, we might lose quorum, experience severe 
+downtime, or end up in an undesirable state that is very costly to get out of.
+
+Let’s say we only want to allow a single upgrade at any given point (no 
+parallel upgrades). A lock or a semaphore can be implemented to ensure 
+that the orchestrator is not going to trigger an upgrade for multiple 
+machines, even if multiple instances of the orchestrator are reconciling 
+the same object in parallel. Note that this implementation can be 
+generalized to allow `n` upgrades at the same time, instead of only 1.
+
+For the detailed implementation of the locking process, make sure to 
+checkout the [{{product}} CAPI repository][capi-repo].
+


As a user, all I care about is where I configure `n`. Please add that explanation and remove the details.

Well, for now it can't be configured :)

Alright, then let's remove this until we have a way to configure it.
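
To make the "allow `n` upgrades at the same time" idea from the quoted hunk concrete, here is a small sketch using a counting semaphore. It is only an in-process illustration: the `UpgradeLock` class and its method names are hypothetical, and a lock shared by several orchestrator instances would need cluster-level coordination that this example does not attempt.

```python
import threading


class UpgradeLock:
    """Hypothetical limiter: at most `limit` upgrades may be in flight."""

    def __init__(self, limit=1):
        self._sem = threading.BoundedSemaphore(limit)

    def try_acquire(self):
        # Non-blocking: a reconciler that gets no slot would simply requeue.
        return self._sem.acquire(blocking=False)

    def release(self):
        # Called once the machine holding the slot has finished upgrading.
        self._sem.release()
```

With `limit=1` this behaves like the single-upgrade lock discussed above; a larger `limit` permits that many parallel upgrades.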

louiseschmidtgen · 2024-11-13T12:41:00Z

docs/src/capi/explanation/in-place-upgrades.md

+## Conclusion
+
+Provisioning a cluster can be challenging. Trying to upgrade the machines 
+of that cluster without carefully engineered tools can get out of hand 
+pretty quickly. In-Place Upgrade Controllers that come out of the box with 
+{{product}} CAPI providers can take a huge burden off our shoulders by 
+taking care of the upgrade process in a responsive, self-healing and 
+cloud-native way.


Suggested change

-## Conclusion
-
-Provisioning a cluster can be challenging. Trying to upgrade the machines
-of that cluster without carefully engineered tools can get out of hand
-pretty quickly. In-Place Upgrade Controllers that come out of the box with
-{{product}} CAPI providers can take a huge burden off our shoulders by
-taking care of the upgrade process in a responsive, self-healing and
-cloud-native way.

This is nice fluff but not critical for understanding in-place upgrades.

github-actions · 2024-11-13T13:20:31Z

Hi, looks like pyspelling job found some issues, you can check it here

addyess

Overall very good. Thanks for these details. Once the docs are more assertive we'll be good for merge

addyess · 2024-11-13T16:23:43Z

docs/src/capi/explanation/in-place-upgrades.md

+
+## Controllers
+
+In {{product}} CAPI, we have two main types of controllers that handle the 


Suggested change

-In {{product}} CAPI, we have two main types of controllers that handle the
+In {{product}} CAPI, there are two main types of controllers that handle the

addyess · 2024-11-13T16:29:48Z

docs/src/capi/explanation/in-place-upgrades.md

+
+#### (Optional) Locking The Upgrade Process
+
+There might be scenarios where we need to make sure that only a limited 


Be more assertive here too

Suggested change

-There might be scenarios where we need to make sure that only a limited
+The controller ensures that only a limited...

github-actions · 2024-11-14T07:48:38Z

Hi, looks like pyspelling job found some issues, you can check it here

louiseschmidtgen

Great work! One last round of comments then we should be ready to merge :)

louiseschmidtgen · 2024-11-14T08:41:16Z

docs/src/capi/explanation/in-place-upgrades.md

+is important. While rolling upgrades a popular strategy, certain situations 
+will require in-place upgrades:


Suggested change

-is important. While rolling upgrades a popular strategy, certain situations
-will require in-place upgrades:
+is important. While rolling upgrades are a popular strategy, certain situations
+will require in-place upgrades:

louiseschmidtgen · 2024-11-14T08:55:55Z

docs/src/capi/explanation/in-place-upgrades.md

+For more information about CAPI design decisions, have a look at the 
+following:
+
+- [Machine immutability in CAPI][1]
+- [Kubernetes objects: `labels`][2]
+- [Kubernetes objects: `spec` and `status`][3]


Suggested change

-For more information about CAPI design decisions, have a look at the
-following:
-
-- [Machine immutability in CAPI][1]
-- [Kubernetes objects: `labels`][2]
-- [Kubernetes objects: `spec` and `status`][3]
+For a deeper understanding of the CAPI design decisions, consider reading about
+[machine immutability in CAPI][1], and Kubernetes objects: [`labels`][2],[`spec` and `status`][3].

Nice! Let's embed the links in the text.

docs/src/capi/explanation/in-place-upgrades.md

louiseschmidtgen · 2024-11-14T09:02:09Z

docs/src/capi/explanation/in-place-upgrades.md

+Upgrader”. It watches for certain annotations on machines and reconciles 
+them to make sure the upgrades happen as expected. As its name suggests, 
+it reconciles each machine individually, considers the machines separately 
+and does not assume any sort of relation between them.


Suggested change

-and does not assume any sort of relation between them.
+and does not assume any relation between them.

louiseschmidtgen · 2024-11-14T09:02:33Z

docs/src/capi/explanation/in-place-upgrades.md

+it reconciles each machine individually, considers the machines separately 
+and does not assume any sort of relation between them.
+
+The “Orchestrator” on the other hand, watches for certain annotations on 


Suggested change

-The “Orchestrator” on the other hand, watches for certain annotations on
+The “Orchestrator” watches for certain annotations on

louiseschmidtgen · 2024-11-14T09:18:03Z

docs/src/capi/explanation/in-place-upgrades.md

+{{product}} has a daemon running in the background called the `k8sd`. 
+It’s responsible for many things in the context of {{product}} but 
+one of them is to expose certain endpoints that can be used to 
+interact with the cluster. The single machine upgrader calls the 


Suggested change

-{{product}} has a daemon running in the background called the `k8sd`.
-It’s responsible for many things in the context of {{product}} but
-one of them is to expose certain endpoints that can be used to
-interact with the cluster. The single machine upgrader calls the
+The {{product}}'s `k8sd` daemon exposes endpoints that can be used to
+interact with the cluster. The single machine upgrader calls the

louiseschmidtgen · 2024-11-14T09:21:01Z

docs/src/capi/explanation/in-place-upgrades.md

+`/snap/refresh` endpoint on the machine that it’s trying to upgrade. 
+This endpoint call will trigger the “actual” upgrade process and in 
+the meantime, the `/snap/refresh-status` will be called periodically 
+by the single machine upgrader to see how things are going. It’s worth 
+noting that ensuring secure communication between the single machine 
+upgrader and the {{product}} daemon (k8sd) is really important and 
+something that is handled internally as well.


Suggested change

-`/snap/refresh` endpoint on the machine that it’s trying to upgrade.
-This endpoint call will trigger the “actual” upgrade process and in
-the meantime, the `/snap/refresh-status` will be called periodically
-by the single machine upgrader to see how things are going. It’s worth
-noting that ensuring secure communication between the single machine
-upgrader and the {{product}} daemon (k8sd) is really important and
-something that is handled internally as well.
+`/snap/refresh` endpoint on the machine to trigger the upgrade process while checking `/snap/refresh-status` periodically.

IMHO, we can remove the note on secure communication.
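
The trigger-then-poll pattern described in this hunk can be sketched as below. Only the two endpoint paths come from the text; the `refresh_and_wait` function, the injected `call` client and the `{"completed": ...}` response shape are assumptions made for illustration and do not reflect the real `k8sd` API.

```python
import time


def refresh_and_wait(call, interval=0.0, max_polls=10, sleep=time.sleep):
    """Trigger a refresh, then poll its status until it completes.

    call(path) is a hypothetical client for the k8sd endpoints and returns
    a dict; the "completed" key is an assumed response field.
    """
    call("/snap/refresh")  # kick off the upgrade on the machine
    for _ in range(max_polls):
        status = call("/snap/refresh-status")
        if status.get("completed"):
            return True
        sleep(interval)  # wait before asking again
    return False  # gave up; a controller would requeue and retry later
```

Injecting `call` and `sleep` keeps the sketch free of network access, so the polling logic can be exercised with a fake client.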

louiseschmidtgen · 2024-11-14T09:21:49Z

docs/src/capi/explanation/in-place-upgrades.md

+
+While the “Single Machine In-Place Upgrade Controller” is responsible 
+for upgrading individual machines, the "Orchestrated In-Place Upgrade 
+Controller" makes sure that groups of machines will get upgraded.


Suggested change

-Controller" makes sure that groups of machines will get upgraded.
+Controller" ensures that groups of machines will get upgraded.

louiseschmidtgen · 2024-11-14T09:24:55Z

docs/src/capi/explanation/in-place-upgrades.md

+
+#### (Optional) Locking The Upgrade Process
+
+The controllers ensure that only a single machine 
+is going to get upgraded at the same time.
+This is to prevent undesirable situations like quorum loss or severe downtimes. 
+
+{{product}} CAPI uses a lock to ensure only 1 control plane machine is 
+getting upgraded at a given time.


Suggested change

-#### (Optional) Locking The Upgrade Process
-
-The controllers ensure that only a single machine
-is going to get upgraded at the same time.
-This is to prevent undesirable situations like quorum loss or severe downtimes.
-
-{{product}} CAPI uses a lock to ensure only 1 control plane machine is
-getting upgraded at a given time.

Collapsing this into the previous paragraph.

louiseschmidtgen · 2024-11-14T09:27:57Z

docs/src/capi/explanation/in-place-upgrades.md

+By applying the `upgrade-to` annotation on an object that owns machines 
+(e.g. a `MachineDeployment`), this controller will mark the owned machines 
+one by one which will cause the "Single Machine Upgrader" to pickup those 
+annotations and upgrade the machines.


Suggested change

-annotations and upgrade the machines.
+annotations and upgrade the machines. To avoid undesirable situations like quorum loss or severe downtime, these upgrades happen in sequence.

github-actions · 2024-11-14T10:47:14Z

Hi, looks like pyspelling job found some issues, you can check it here

louiseschmidtgen

Some more little nits.

louiseschmidtgen · 2024-11-15T11:45:31Z

docs/src/capi/explanation/in-place-upgrades.md

+Upgrader”. It watches for certain annotations on machines and reconciles 
+them to make sure the upgrades happen as expected. As its name suggests, 
+it reconciles each machine individually, considers the machines separately 
+and does not assume any relation between them.


Suggested change

-Upgrader”. It watches for certain annotations on machines and reconciles
-them to make sure the upgrades happen as expected. As its name suggests,
-it reconciles each machine individually, considers the machines separately
-and does not assume any relation between them.
+Upgrader”. The controller watches for annotations on machines and reconciles
+them to ensure the upgrades happen smoothly.

Since the name already suggests this, we don't need to write it out. I think it is clear that this is the single machine controller, as opposed to the Orchestrator :)

louiseschmidtgen · 2024-11-15T11:45:52Z

docs/src/capi/explanation/in-place-upgrades.md

+The core component of performing an in-place upgrade is the “Single Machine 
+Upgrader”. It watches for certain annotations on machines and reconciles 


Suggested change

-The core component of performing an in-place upgrade is the “Single Machine
-Upgrader”. It watches for certain annotations on machines and reconciles
+The core component of performing an in-place upgrade is the `Single Machine
+Upgrader`. It watches for certain annotations on machines and reconciles

louiseschmidtgen · 2024-11-15T11:46:25Z

docs/src/capi/explanation/in-place-upgrades.md

+The “Orchestrator” watches for certain annotations on 
+machine owners, reconciles them and upgrades groups of owned machines. 
+It’s responsible for making sure that all the machines owned by the 
+reconciled object get upgraded successfully.


Suggested change

-The “Orchestrator” watches for certain annotations on
-machine owners, reconciles them and upgrades groups of owned machines.
-It’s responsible for making sure that all the machines owned by the
-reconciled object get upgraded successfully.
+The `Orchestrator` watches for certain annotations on
+machine owners, reconciles them and upgrades groups of owned machines.
+It’s responsible for ensuring that all the machines owned by the
+reconciled object get upgraded successfully.

docs/src/capi/explanation/in-place-upgrades.md

louiseschmidtgen · 2024-11-15T11:50:40Z

docs/src/capi/explanation/in-place-upgrades.md

+Upgrade methods or options can be specified to upgrade to a snap channel, 
+revision, or install a new snap from a file already placed on the 
+machine to make up for air-gapped environments.


This needs to get applied :)

docs/src/capi/explanation/in-place-upgrades.md

github-actions · 2024-11-15T13:49:38Z

Hi, looks like pyspelling job found some issues, you can check it here

louiseschmidtgen

LGTM, please look into the build-the-docs CI error before merging.

github-actions · 2024-11-15T14:24:31Z

Hi, looks like pyspelling job found some issues, you can check it here

HomayoonAlimohammadi requested a review from a team as a code owner November 7, 2024 10:03

eaudetcobello requested changes Nov 11, 2024

canonical deleted a comment from github-actions bot Nov 11, 2024

eaudetcobello approved these changes Nov 11, 2024

louiseschmidtgen reviewed Nov 13, 2024

HomayoonAlimohammadi requested a review from louiseschmidtgen November 13, 2024 14:45

addyess reviewed Nov 13, 2024

louiseschmidtgen reviewed Nov 14, 2024

louiseschmidtgen reviewed Nov 15, 2024

louiseschmidtgen approved these changes Nov 15, 2024

Add in-place upgrade explanation

b168550

HomayoonAlimohammadi force-pushed the KU-1894/in-place-upgrades-explanation branch from 6a63b76 to b168550 Compare November 15, 2024 14:23

HomayoonAlimohammadi merged commit 3528193 into main Nov 15, 2024
6 checks passed

HomayoonAlimohammadi deleted the KU-1894/in-place-upgrades-explanation branch November 15, 2024 14:28
