Summary:
The rancher-system-agent service appears to take many snapshots within seconds of each other if S3 is misconfigured and the upload to the S3 bucket fails. After this initial burst, it keeps retrying, seemingly indefinitely, taking a snapshot and attempting the upload about every 5 minutes. The details below provide more information and context.
Steps to Reproduce:
In Rancher, create a 1-node RKE2 cluster and set up a mostly correct S3 config -- correct keys, region, etc. -- but supply a fake bucket name instead of a real one. Trigger a snapshot from the Rancher UI. The rancher-system-agent logs will show the S3 upload failing, but snapshots are still taken locally -- many times, for some reason, in a very short period.
Watch the /var/lib/rancher/rke2/server/db/snapshots directory on the node: when you trigger the snapshot, many snapshots appear in a very short period.
If you simultaneously watch the rancher-system-agent systemd logs, you'll see it loop very quickly while taking all these snapshots, then back off and keep retrying more slowly from that point on.
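As a hypothetical way to observe this, a small Python helper can count the on-demand snapshot files as they accumulate (the function name and default path handling here are my own illustration, not part of Rancher or RKE2):

```python
import glob
import os

def count_snapshots(directory="/var/lib/rancher/rke2/server/db/snapshots"):
    """Count on-demand etcd snapshot files in an RKE2 snapshot directory."""
    return len(glob.glob(os.path.join(directory, "on-demand-*")))
```

Calling this in a loop every couple of seconds, alongside `journalctl -u rancher-system-agent -f`, makes the rapid snapshot bursts easy to see.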
Expected Result:
I expected the rancher-system-agent to retry taking a snapshot when the upload to my S3 bucket fails, but not to retake snapshots so aggressively in such a short loop.
Actual Result:
The rancher-system-agent takes many snapshots in a very short period of time, then more slowly, but seemingly indefinitely, over time.
Consider this directory listing of /var/lib/rancher/rke2/server/db/snapshots on the node:
Notice that the on-demand snapshot was initially taken about 12 times, only seconds apart. After that it seems to retry approximately every 5 minutes, which seems more reasonable. Ideally it would retry only about every 5 minutes right from the start, and fail/give up after perhaps a few tries (and log an error).
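The retry behavior I would expect can be sketched roughly like this (a minimal illustration only, not the actual agent code; `take_snapshot`, `upload`, and the parameters are hypothetical stand-ins):

```python
import time

def snapshot_and_upload(take_snapshot, upload, max_attempts=3,
                        retry_delay=300, sleep=time.sleep):
    """Take one local snapshot, then retry only the S3 upload a bounded
    number of times (~5 minutes apart) instead of retaking the snapshot
    in a tight loop. Returns True if an upload succeeded."""
    path = take_snapshot()  # the local snapshot succeeds regardless of S3
    for attempt in range(1, max_attempts + 1):
        try:
            upload(path)
            return True
        except Exception as err:
            print(f"upload attempt {attempt}/{max_attempts} failed: {err}")
            if attempt < max_attempts:
                sleep(retry_delay)  # wait ~5 minutes; do not retake the snapshot
    return False
```

The key points are that only one snapshot lands on disk and the retries are bounded, which is what the directory listing above suggests is not happening today.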
If we look at the rancher-system-agent logs, we can see that around this timeframe it was indeed taking all these snapshots very quickly; the loop repeats rapidly at first. A snippet is shown below (full logs are attached at the end).
Aug 19 20:09:25 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:25Z" level=debug msg="[K8s] (kube-apiserver) writing probe status to map"
Aug 19 20:09:25 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:25Z" level=debug msg="[K8s] secret data/string-data did not change, not updating secret"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="[K8s] Processing secret dave-s3improper-bootstrap-template-kqddg-machine-plan in namespace fleet-default at generation 0 with resource version 8094"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="[K8s] Calculated checksum to be 9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="[K8s] Remote plan had an applied checksum value of 134aeacac334133ae2d2616b36215edfa2cff4e1d74cf8d83f4860bb5ad27f79"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="[K8s] Calling Applyinator to apply the plan"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="Applying plan with checksum 9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="Writing applied calculated plan contents to historical plan directory /var/lib/rancher/agent/applied"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="reconciling file permissions for /var/lib/rancher/agent/applied/20210819-200930 to 0:0 384"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="Writing file /etc/rancher/agent/registries.yaml"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="requested file permission for /etc/rancher/agent/registries.yaml was , defaulting to 384"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="file /etc/rancher/agent/registries.yaml does not need to be written"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="reconciling file permissions for /etc/rancher/agent/registries.yaml to 0:0 384"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="Cleaning working directory before applying /var/lib/rancher/agent/work"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="Executing instruction 0 for plan 9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/agent/work/20210819-200930-applied.plan/9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd_0"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="requested file permission for /var/lib/rancher/agent/work/20210819-200930-applied.plan/9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd_0 was , defaulting to 493"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="reconciling file permissions for /var/lib/rancher/agent/work/20210819-200930-applied.plan/9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd_0 to 0:0 493"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="Running command: rke2 [etcd-snapshot --node-name=dave-s3improper-pool1-gzrql --s3 --s3-bucket=dave-fakebucket --s3-access-key=REDACTED_AWS_KEY --s3-region=us-east-2 --s3-folder=dave-fakefolder --s3-endpoint=s3.us-east-2.amazonaws.com --s3-skip-ssl-verify]"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: time=\"2021-08-19T20:09:30Z\" level=info msg=\"Managed etcd cluster bootstrap already complete and initialized\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: W0819 20:09:30.620735 15143 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: W0819 20:09:30.646278 15143 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: W0819 20:09:30.670290 15143 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682065 15143 client.go:360] parsed scheme: \"endpoint\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682109 15143 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 <nil> 0 <nil>}]"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682211 15143 client.go:360] parsed scheme: \"passthrough\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682247 15143 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://127.0.0.1:2379 <nil> 0 <nil>}] <nil> <nil>}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682268 15143 clientconn.go:948] ClientConn switching balancer to \"pick_first\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682314 15143 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: time=\"2021-08-19T20:09:30Z\" level=info msg=\"Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/on-demand-dave-s3improper-pool1-gzrql-1629403770\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.686770 15143 client.go:360] parsed scheme: \"endpoint\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.686792 15143 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 <nil> 0 <nil>}]"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stdout]: {\"level\":\"info\",\"msg\":\"created temporary db file\",\"path\":\"/var/lib/rancher/rke2/server/db/snapshots/on-demand-dave-s3improper-pool1-gzrql-1629403770.part\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: {\"level\":\"info\",\"ts\":\"2021-08-19T20:09:30.689Z\",\"caller\":\"clientv3/maintenance.go:200\",\"msg\":\"opened snapshot stream; downloading\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stdout]: {\"level\":\"info\",\"msg\":\"fetching snapshot\",\"endpoint\":\"https://127.0.0.1:2379\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: {\"level\":\"info\",\"ts\":\"2021-08-19T20:09:30.753Z\",\"caller\":\"clientv3/maintenance.go:208\",\"msg\":\"completed snapshot read; closing\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stdout]: {\"level\":\"info\",\"msg\":\"fetched snapshot\",\"endpoint\":\"https://127.0.0.1:2379\",\"size\":\"9.6 MB\",\"took\":\"77.981212ms\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stdout]: {\"level\":\"info\",\"msg\":\"saved\",\"path\":\"/var/lib/rancher/rke2/server/db/snapshots/on-demand-dave-s3improper-pool1-gzrql-1629403770\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: time=\"2021-08-19T20:09:30Z\" level=info msg=\"Saving etcd snapshot on-demand-dave-s3improper-pool1-gzrql-1629403770 to S3\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: time=\"2021-08-19T20:09:30Z\" level=info msg=\"Checking if S3 bucket dave-fakebucket exists\""
Aug 19 20:09:31 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:31Z" level=info msg="[stderr]: time=\"2021-08-19T20:09:31Z\" level=fatal msg=\"bucket: dave-fakebucket does not exist\""
The logs loop quickly with entries like these while snapshots keep accumulating on disk.
Desired Result:
Ideally, RKE2 (or perhaps the rancher-system-agent?) would validate that the provided credentials are valid and that the bucket actually exists before snapshotting; if not, it should skip the upload, take the snapshot locally, and log an error.
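A rough sketch of that up-front validation (hypothetical names throughout; `bucket_exists` stands in for whatever S3 HEAD-bucket check the agent would actually perform):

```python
def plan_snapshot(s3_config, bucket_exists, log=print):
    """Validate the S3 target before snapshotting. If the bucket is missing
    or inaccessible, fall back to a local-only snapshot and log an error
    instead of looping on failed uploads."""
    if s3_config is None:
        return {"local": True, "s3": False}
    bucket = s3_config.get("bucket")
    if not bucket or not bucket_exists(bucket):
        log(f"error: S3 bucket {bucket!r} does not exist or is inaccessible; "
            "taking local snapshot only")
        return {"local": True, "s3": False}
    return {"local": True, "s3": True}
```

With a check like this, a misconfigured bucket would produce one local snapshot and one clear error, rather than the burst of snapshots described above.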
Environment:
Rancher version: v2.6-head (commit 2778ecaba, pulled 8/19/21 12:51 pm Pacific)
Rancher cluster type: single-node docker
Docker version: 20.10
Downstream cluster type: rke2
Downstream K8s version: v1.21.3-rc6+rke2r2
FULL rancher-system-agent logs here:
rancher-system-agent.log