[WIP] Revert PR #17160 #17224

Open
wants to merge 2 commits into base: master

Conversation

@hakuna-matatah (Contributor) commented Jan 20, 2025

Testing a revert of #17160 (the etcd version bump) to see whether the scale tests succeed - https://testgrid.k8s.io/sig-scalability-aws#ec2-master-scale-performance

Tests started failing after the etcd version bump - 8063d83...581f363

To rule this out, I want to test the revert in a presubmit job.

Related to kubernetes/kubernetes#129593

@k8s-ci-robot added the do-not-merge/work-in-progress, cncf-cla: yes, and size/XXL labels on Jan 20, 2025
@k8s-ci-robot requested review from hakman and zetaab on January 20, 2025 20:15
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johngmyers for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hakuna-matatah (Contributor Author)

cc: @hakman

@hakuna-matatah (Contributor Author)

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah (Contributor Author)

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah changed the title from "[WIP] Revert https://github.com/kubernetes/kops/pull/17160/commits" to "[WIP] Revert https://github.com/kubernetes/kops/pull/17160" on Jan 20, 2025
@hakuna-matatah changed the title from "[WIP] Revert https://github.com/kubernetes/kops/pull/17160" to "[WIP] Revert #17160" on Jan 20, 2025
@hakuna-matatah changed the title from "[WIP] Revert #17160" to "[WIP] Revert PR #17160" on Jan 20, 2025
@hakuna-matatah (Contributor Author)

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah (Contributor Author) commented Jan 20, 2025

Looking at the latest test failure on Jan 20 - https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2/1879785938800873472


ClusterLoaderV2: load: [step: 29] gathering measurements [01] - APIResponsivenessPrometheusSimple (3s)
{ Failure :0 [measurement call APIResponsivenessPrometheus - APIResponsivenessPrometheusSimple error: top latency metric: there should be no high-latency requests, but: [got: &{Resource:deployments Subresource: Verb:DELETE Scope:resource Latency:perc50: 28.499225ms, perc90: 103.312822ms, perc99: 14.426998674s Count:16654 SlowCount:1545}; expected perc99 <= 1s got: &{Resource:configmaps Subresource: Verb:GET Scope:resource Latency:perc50: 26.098901ms, perc90: 46.978021ms, perc99: 5.65s Count:571 SlowCount:18}; expected perc99 <= 1s got: &{Resource:leases Subresource: Verb:PUT Scope:resource Latency:perc50: 25.656061ms, perc90: 46.18091ms, perc99: 2.923699145s Count:2685704 SlowCount:42303}; expected perc99 <= 1s got: &{Resource:pods Subresource:binding Verb:POST Scope:resource Latency:perc50: 25.608064ms, perc90: 46.094515ms, perc99: 2.498237242s Count:184035 SlowCount:2678}; expected perc99 <= 1s got: &{Resource:deployments Subresource: Verb:POST Scope:resource Latency:perc50: 25.553303ms, perc90: 45.995946ms, perc99: 1.412040099s Count:16643 SlowCount:213}; expected perc99 <= 1s]] :0}


It is failing for two reasons ^^^:

  • API server latency SLOs are being breached (a sketch of this check follows the list)
  • API server availability issues
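
For context, the APIResponsiveness measurement buckets request latency by resource, verb, and scope and compares the 99th percentile of each bucket against an SLO threshold (1s for resource-scoped calls, per the failure text above). A minimal sketch of that kind of check, purely illustrative and not the actual clusterloader2 code (the type and function names are made up):

// slo_check_sketch.go: illustrative perc99-vs-threshold check in the spirit
// of the APIResponsiveness measurement; the 1s threshold and the two sample
// buckets are taken from the failure message above, everything else is made up.
package main

import (
	"fmt"
	"time"
)

type bucket struct {
	Resource, Verb, Scope string
	Perc99                time.Duration
	Count, SlowCount      int
}

// check returns one violation string per bucket whose perc99 exceeds the threshold.
func check(buckets []bucket, threshold time.Duration) []string {
	var violations []string
	for _, b := range buckets {
		if b.Perc99 > threshold {
			violations = append(violations, fmt.Sprintf(
				"got: %s %s (%s) perc99=%v count=%d slow=%d; expected perc99 <= %v",
				b.Verb, b.Resource, b.Scope, b.Perc99, b.Count, b.SlowCount, threshold))
		}
	}
	return violations
}

func main() {
	// Two of the buckets reported in the failure above.
	buckets := []bucket{
		{Resource: "leases", Verb: "PUT", Scope: "resource", Perc99: 2924 * time.Millisecond, Count: 2685704, SlowCount: 42303},
		{Resource: "deployments", Verb: "DELETE", Scope: "resource", Perc99: 14427 * time.Millisecond, Count: 16654, SlowCount: 1545},
	}
	for _, v := range check(buckets, time.Second) {
		fmt.Println(v)
	}
}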

As for why the APIServer availability SLOs are not being met, the relevant error logs from CL2 are below (a rough sketch of the style of probe behind these messages follows them):

W0116 08:49:25.882192   62110 api_availability_measurement.go:117] cluster not available; HTTP status code: 0
W0116 08:49:35.888376   62110 api_availability_measurement.go:83] execservice issue: problem with RunCommand(): output="", err=signal: killed
W0116 08:49:35.888419   62110 api_availability_measurement.go:86] host 10.0.35.88 not available; HTTP status code: 
W0116 08:49:35.888972   62110 api_availability_measurement.go:83] execservice issue: problem with RunCommand(): output="", err=signal: killed
W0116 08:49:35.889008   62110 api_availability_measurement.go:86] host 10.0.162.64 not available; HTTP status code: 
W0116 08:49:35.891926   62110 api_availability_measurement.go:83] execservice issue: problem with RunCommand(): output="", err=signal: killed
W0116 08:49:35.891956   62110 api_availability_measurement.go:86] host 10.0.89.149 not available; HTTP status code: 
W0116 08:54:25.354601   62110 api_availability_measurement.go:117] cluster not available; HTTP status code: 0
W0116 08:59:27.509292   62110 api_availability_measurement.go:117] cluster not available; HTTP status code: 0
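
The availability measurement is essentially probing each control-plane host's health endpoint and recording status code 0 when the request never completes (here the exec helper itself was killed). A minimal sketch of that style of probe, assuming direct network access to the hosts; this is illustrative only, not the clusterloader2 implementation:

// availability_probe_sketch.go: illustrative health probe against the control
// plane hosts from the logs above; skipping TLS verification and hitting
// /healthz directly are assumptions about the test setup, not facts from this PR.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// probe returns the HTTP status code of GET https://<host>/healthz, or 0 if
// the request fails outright (matching the "HTTP status code: 0" log lines).
func probe(host string) int {
	client := &http.Client{
		Timeout:   5 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	resp, err := client.Get(fmt.Sprintf("https://%s/healthz", host))
	if err != nil {
		return 0
	}
	defer resp.Body.Close()
	return resp.StatusCode
}

func main() {
	for _, h := range []string{"10.0.35.88", "10.0.162.64", "10.0.89.149"} {
		fmt.Printf("host %s -> %d\n", h, probe(h))
	}
}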

So, we need to understand what's going on with the APIServer. Was it being overloaded in terms of resources like CPU/memory/network?

Let's check the apiserver health logs:

It seems the livez and healthz checks on all 3 instances were failing. Why? ^^^

Inspecting the APIServer logs shows that all 3 instances failed for the same reason, i.e.

I0116 07:49:23.962930       1 main.go:127] proxied to GET https://127.0.0.1/healthz: 500 Internal Server Error
I0116 07:49:23.964841       1 main.go:127] proxied to GET https://127.0.0.1/healthz: 500 Internal Server Error
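
To see which individual check sits behind those 500s, one can ask the health endpoint for verbose output, which lists every sub-check (e.g. "[-]etcd failed: ..."). A hypothetical one-off probe, assuming it runs on the control-plane node and that the local /healthz endpoint answers without extra credentials (depending on the apiserver's auth settings a bearer token may be needed):

// healthz_verbose_sketch.go: one-off check of the apiserver's verbose health
// output; the localhost URL and the lack of a bearer token are assumptions
// about this particular control-plane setup.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	// "?verbose" makes the apiserver print each sub-check, e.g. "[-]etcd failed: ..."
	resp, err := client.Get("https://127.0.0.1/healthz?verbose")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d\n%s\n", resp.StatusCode, body)
}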

Correlating the apiserver healthz failures with the following:

I0116 07:49:23.720377      12 trace.go:236] Trace[493004600]: "Update" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:78bd3622-bdf1-47c0-9740-5d1f388ea4e6,client:10.7.186.93,api-group:coordination.k8s.io,api-version:v1,name:i-0b8eb7d055e168aab,subresource:,namespace:kube-node-lease,protocol:HTTP/2.0,resource:leases,scope:resource,url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/i-0b8eb7d055e168aab,user-agent:kubelet/v1.33.0 (linux/amd64) kubernetes/6d24d3b,verb:PUT (16-Jan-2025 07:49:21.114) (total time: 2605ms):
Trace[493004600]: ["GuaranteedUpdate etcd3" audit-id:78bd3622-bdf1-47c0-9740-5d1f388ea4e6,key:/leases/kube-node-lease/i-0b8eb7d055e168aab,type:*coordination.Lease,resource:leases.coordination.k8s.io 2605ms (07:49:21.114)
Trace[493004600]:  ---"Txn call completed" 2605ms (07:49:23.720)]
Trace[493004600]: [2.605846831s] [2.605846831s] END
I0116 07:49:23.720456      12 trace.go:236] Trace[271183379]: "Patch" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:0a95c947-1de3-42df-a4d1-a26142427057,client:10.10.184.191,api-group:,api-version:v1,name:i-0aa04d8fec4795449,subresource:status,namespace:,protocol:HTTP/2.0,resource:nodes,scope:resource,url:/api/v1/nodes/i-0aa04d8fec4795449/status,user-agent:kubelet/v1.33.0 (linux/amd64) kubernetes/6d24d3b,verb:PATCH (16-Jan-2025 07:49:21.110) (total time: 2609ms):
Trace[271183379]: ["GuaranteedUpdate etcd3" audit-id:0a95c947-1de3-42df-a4d1-a26142427057,key:/minions/i-0aa04d8fec4795449,type:*core.Node,resource:nodes 2609ms (07:49:21.110)
Trace[271183379]:  ---"Txn call completed" 2606ms (07:49:23.720)]
I0116 07:49:23.962503      12 healthz.go:280] etcd check failed: healthz
[-]etcd failed: error getting data from etcd: context deadline exceeded
I0116 07:49:23.964388      12 healthz.go:280] etcd check failed: healthz
[-]etcd failed: error getting data from etcd: context deadline exceeded
I0116 07:49:24.174751      12 trace.go:236] Trace[2087418895]: "Update" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:9dfb98fe-c518-4110-ae6a-5bbc82b78030,client:10.5.185.132,api-group:coordination.k8s.io,api-version:v1,name:i-01b9c7e8e59c0452b,subresource:,namespace:kube-node-lease,protocol:HTTP/2.0,resource:leases,scope:resource,url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/i-01b9c7e8e59c0452b,user-agent:kubelet/v1.33.0 (linux/amd64) kubernetes/6d24d3b,verb:PUT (16-Jan-2025 07:49:21.150) (total time: 3023ms):
Trace[2087418895]: ["GuaranteedUpdate etcd3" audit-id:9dfb98fe-c518-4110-ae6a-5bbc82b78030,key:/leases/kube-node-lease/i-01b9c7e8e59c0452b,type:*coordination.Lease,resource:leases.coordination.k8s.io 3023ms (07:49:21.151)
Trace[2087418895]:  ---"Txn call completed" 3023ms (07:49:24.174)]
Trace[2087418895]: [3.023712131s] [3.023712131s] END
I0116 07:49:24.174769      12 trace.go:236] Trace[1464681109]: "Update" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:21a6a0b3-721a-4455-97b6-87978ad1e1d2,client:10.6.57.54,api-group:coordination.k8s.io,api-version:v1,name:i-0d0d9a8e21d529f4a,subresource:,namespace:kube-node-lease,protocol:HTTP/2.0,resource:leases,scope:resource,url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/i-0d0d9a8e21d529f4a,user-agent:kubelet/v1.33.0 (linux/amd64) kubernetes/6d24d3b,verb:PUT (16-Jan-2025 07:49:21.154) (total time: 3019ms):
Trace[1464681109]: ["GuaranteedUpdate etcd3" audit-id:21a6a0b3-721a-4455-97b6-87978ad1e1d2,key:/leases/kube-node-lease/i-0d0d9a8e21d529f4a,type:*coordination.Lease,resource:leases.coordination.k8s.io 3019ms (07:49:21.154)
Trace[1464681109]:  ---"Txn call completed" 3019ms (07:49:24.174)]
Trace[1464681109]: [3.01980043s] [3.01980043s] END

^^^ It appears that etcd is failing the health check probes from the APIServer, which explains the APIServer availability SLO breach (the correlation shows up multiple times).

Also, around the same timestamps (07:49:23.720456 and 07:49:24.174751), etcd was taking around 3 seconds for node object patches and lease renewals; these calls alone already breach the APIServer latency SLOs.
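
The apiserver's etcd health check is a short-deadline request against etcd, so once individual etcd transactions take on the order of seconds, the probe blows its deadline and healthz flips to "[-]etcd failed: ... context deadline exceeded". A rough sketch of what such a probe amounts to; this is not the apiserver's actual code, and the endpoint, the missing TLS config, and the 2s deadline are illustrative assumptions:

// etcd_probe_sketch.go: rough sketch of a short-deadline etcd probe in the
// spirit of the apiserver's etcd health check; not values taken from this cluster.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local etcd client endpoint
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// If etcd is spending ~3s per transaction, a probe like this exceeds its
	// deadline and the check reports "etcd failed: ... context deadline exceeded".
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "health"); err != nil {
		fmt.Println("etcd check failed:", err)
		return
	}
	fmt.Println("etcd check ok")
}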

Next steps to understand:

  • why etcd was not responding to the APIServer's health check probes in time
  • why etcd transactions for lease renewals, node object patches, and similar API calls are taking O(seconds)

Evaluate ^^^ whether this is due to high CPU/memory/network utilization or something else (one way to start pulling the relevant etcd metrics is sketched below).
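
One low-effort way to start on that, assuming the etcd metrics endpoint is reachable from the control-plane node (the URL below is an assumption about the local setup, not something this PR configures), is to scrape /metrics and look at the disk fsync and backend-commit latency histograms, which usually separate disk/CPU starvation from other causes:

// etcd_metrics_peek.go: illustrative helper that prints the etcd latency
// metrics most relevant to slow transactions; the metrics URL is an assumed
// --listen-metrics-urls port and may differ in kops/etcd-manager setups.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:2381/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	interesting := []string{
		"etcd_disk_wal_fsync_duration_seconds",      // slow fsync -> disk starvation
		"etcd_disk_backend_commit_duration_seconds", // slow commits -> disk/CPU pressure
		"etcd_server_slow_apply_total",              // backlog in the apply loop
	}
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		for _, m := range interesting {
			if strings.HasPrefix(line, m) {
				fmt.Println(line)
			}
		}
	}
}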

@hakuna-matatah (Contributor Author)

This should likely help with the presubmit job failure for the aws-scale tests - kubernetes/test-infra#34201

@hakuna-matatah (Contributor Author)

Given that kubernetes/test-infra#34203 is merged, I'm going to retry the scale test with this revert.

@hakuna-matatah (Contributor Author)

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

1 similar comment
@hakuna-matatah (Contributor Author)

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@k8s-ci-robot (Contributor)

@hakuna-matatah: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kops-e2e-k8s-gce-ipalias | d4217e7 | link | true | /test pull-kops-e2e-k8s-gce-ipalias
presubmit-kops-aws-scale-amazonvpc-using-cl2 | d4217e7 | link | false | /test presubmit-kops-aws-scale-amazonvpc-using-cl2

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
