[WIP] Revert PR #17160 #17224

Open
wants to merge 2 commits into base: master

Conversation

@hakuna-matatah (Contributor) commented Jan 20, 2025

Testing a revert of #17160 (the etcd version bump) to see whether the scale tests succeed - https://testgrid.k8s.io/sig-scalability-aws#ec2-master-scale-performance

Tests started failing after the etcd version bump - 8063d83...581f363

To rule this out, I want to test the revert in a presubmit job.

Related to kubernetes/kubernetes#129593

@k8s-ci-robot added the do-not-merge/work-in-progress, cncf-cla: yes, and size/XXL labels on Jan 20, 2025
@k8s-ci-robot requested review from hakman and zetaab on January 20, 2025 20:15
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johngmyers for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hakuna-matatah (Contributor Author)

cc: @hakman

@hakuna-matatah (Contributor Author)

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah (Contributor Author)

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah changed the title from "[WIP] Revert https://github.com/kubernetes/kops/pull/17160/commits" to "[WIP] Revert https://github.com/kubernetes/kops/pull/17160" on Jan 20, 2025
@hakuna-matatah changed the title from "[WIP] Revert https://github.com/kubernetes/kops/pull/17160" to "[WIP] Revert #17160" on Jan 20, 2025
@hakuna-matatah changed the title from "[WIP] Revert #17160" to "[WIP] Revert PR #17160" on Jan 20, 2025
@hakuna-matatah (Contributor Author)

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah (Contributor Author) commented Jan 20, 2025

Looking at the latest test failure on Jan 20 - https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2/1879785938800873472


ClusterLoaderV2: load: [step: 29] gathering measurements [01] - APIResponsivenessPrometheusSimple (3s)
{ Failure :0 [measurement call APIResponsivenessPrometheus - APIResponsivenessPrometheusSimple error: top latency metric: there should be no high-latency requests, but: [got: &{Resource:deployments Subresource: Verb:DELETE Scope:resource Latency:perc50: 28.499225ms, perc90: 103.312822ms, perc99: 14.426998674s Count:16654 SlowCount:1545}; expected perc99 <= 1s got: &{Resource:configmaps Subresource: Verb:GET Scope:resource Latency:perc50: 26.098901ms, perc90: 46.978021ms, perc99: 5.65s Count:571 SlowCount:18}; expected perc99 <= 1s got: &{Resource:leases Subresource: Verb:PUT Scope:resource Latency:perc50: 25.656061ms, perc90: 46.18091ms, perc99: 2.923699145s Count:2685704 SlowCount:42303}; expected perc99 <= 1s got: &{Resource:pods Subresource:binding Verb:POST Scope:resource Latency:perc50: 25.608064ms, perc90: 46.094515ms, perc99: 2.498237242s Count:184035 SlowCount:2678}; expected perc99 <= 1s got: &{Resource:deployments Subresource: Verb:POST Scope:resource Latency:perc50: 25.553303ms, perc90: 45.995946ms, perc99: 1.412040099s Count:16643 SlowCount:213}; expected perc99 <= 1s]] :0}


It is failing for two reasons ^^^:

  • API server latency SLOs are being breached (a sketch of this check follows the list)
  • API server availability issues
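
For context, the APIResponsiveness measurement buckets request latency by resource, verb, and scope and compares the 99th percentile of each bucket against an SLO threshold (1s for resource-scoped calls, per the failure text above). A minimal sketch of that kind of check, purely illustrative and not the actual clusterloader2 code (the type and function names are made up):

// slo_check_sketch.go: illustrative perc99-vs-threshold check in the spirit
// of the APIResponsiveness measurement; the 1s threshold and the two sample
// buckets are taken from the failure message above, everything else is made up.
package main

import (
	"fmt"
	"time"
)

type bucket struct {
	Resource, Verb, Scope string
	Perc99                time.Duration
	Count, SlowCount      int
}

// check returns one violation string per bucket whose perc99 exceeds the threshold.
func check(buckets []bucket, threshold time.Duration) []string {
	var violations []string
	for _, b := range buckets {
		if b.Perc99 > threshold {
			violations = append(violations, fmt.Sprintf(
				"got: %s %s (%s) perc99=%v count=%d slow=%d; expected perc99 <= %v",
				b.Verb, b.Resource, b.Scope, b.Perc99, b.Count, b.SlowCount, threshold))
		}
	}
	return violations
}

func main() {
	// Two of the buckets reported in the failure above.
	buckets := []bucket{
		{Resource: "leases", Verb: "PUT", Scope: "resource", Perc99: 2924 * time.Millisecond, Count: 2685704, SlowCount: 42303},
		{Resource: "deployments", Verb: "DELETE", Scope: "resource", Perc99: 14427 * time.Millisecond, Count: 16654, SlowCount: 1545},
	}
	for _, v := range check(buckets, time.Second) {
		fmt.Println(v)
	}
}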

As for why the APIServer availability SLOs are not being met, the relevant error logs from CL2 are below (a rough sketch of the style of probe behind these messages follows them):

W0116 08:49:25.882192   62110 api_availability_measurement.go:117] cluster not available; HTTP status code: 0
W0116 08:49:35.888376   62110 api_availability_measurement.go:83] execservice issue: problem with RunCommand(): output="", err=signal: killed
W0116 08:49:35.888419   62110 api_availability_measurement.go:86] host 10.0.35.88 not available; HTTP status code: 
W0116 08:49:35.888972   62110 api_availability_measurement.go:83] execservice issue: problem with RunCommand(): output="", err=signal: killed
W0116 08:49:35.889008   62110 api_availability_measurement.go:86] host 10.0.162.64 not available; HTTP status code: 
W0116 08:49:35.891926   62110 api_availability_measurement.go:83] execservice issue: problem with RunCommand(): output="", err=signal: killed
W0116 08:49:35.891956   62110 api_availability_measurement.go:86] host 10.0.89.149 not available; HTTP status code: 
W0116 08:54:25.354601   62110 api_availability_measurement.go:117] cluster not available; HTTP status code: 0
W0116 08:59:27.509292   62110 api_availability_measurement.go:117] cluster not available; HTTP status code: 0
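
The availability measurement is essentially probing each control-plane host's health endpoint and recording status code 0 when the request never completes (here the exec helper itself was killed). A minimal sketch of that style of probe, assuming direct network access to the hosts; this is illustrative only, not the clusterloader2 implementation:

// availability_probe_sketch.go: illustrative health probe against the control
// plane hosts from the logs above; skipping TLS verification and hitting
// /healthz directly are assumptions about the test setup, not facts from this PR.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// probe returns the HTTP status code of GET https://<host>/healthz, or 0 if
// the request fails outright (matching the "HTTP status code: 0" log lines).
func probe(host string) int {
	client := &http.Client{
		Timeout:   5 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	resp, err := client.Get(fmt.Sprintf("https://%s/healthz", host))
	if err != nil {
		return 0
	}
	defer resp.Body.Close()
	return resp.StatusCode
}

func main() {
	for _, h := range []string{"10.0.35.88", "10.0.162.64", "10.0.89.149"} {
		fmt.Printf("host %s -> %d\n", h, probe(h))
	}
}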

So, we need to understand what's going on with the APIServer. Was it being overloaded in terms of resources like CPU/memory/network?

Let's check the apiserver health logs:

It seems the livez and healthz checks on all 3 instances were failing. Why? ^^^

Inspecting the APIServer logs shows that all 3 instances failed for the same reason, i.e.

I0116 07:49:23.962930       1 main.go:127] proxied to GET https://127.0.0.1/healthz: 500 Internal Server Error
I0116 07:49:23.964841       1 main.go:127] proxied to GET https://127.0.0.1/healthz: 500 Internal Server Error
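
To see which individual check sits behind those 500s, one can ask the health endpoint for verbose output, which lists every sub-check (e.g. "[-]etcd failed: ..."). A hypothetical one-off probe, assuming it runs on the control-plane node and that the local /healthz endpoint answers without extra credentials (depending on the apiserver's auth settings a bearer token may be needed):

// healthz_verbose_sketch.go: one-off check of the apiserver's verbose health
// output; the localhost URL and the lack of a bearer token are assumptions
// about this particular control-plane setup.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	// "?verbose" makes the apiserver print each sub-check, e.g. "[-]etcd failed: ..."
	resp, err := client.Get("https://127.0.0.1/healthz?verbose")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d\n%s\n", resp.StatusCode, body)
}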

Correlating the apiserver healthz failures with the following:

I0116 07:49:23.720377      12 trace.go:236] Trace[493004600]: "Update" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:78bd3622-bdf1-47c0-9740-5d1f388ea4e6,client:10.7.186.93,api-group:coordination.k8s.io,api-version:v1,name:i-0b8eb7d055e168aab,subresource:,namespace:kube-node-lease,protocol:HTTP/2.0,resource:leases,scope:resource,url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/i-0b8eb7d055e168aab,user-agent:kubelet/v1.33.0 (linux/amd64) kubernetes/6d24d3b,verb:PUT (16-Jan-2025 07:49:21.114) (total time: 2605ms):
Trace[493004600]: ["GuaranteedUpdate etcd3" audit-id:78bd3622-bdf1-47c0-9740-5d1f388ea4e6,key:/leases/kube-node-lease/i-0b8eb7d055e168aab,type:*coordination.Lease,resource:leases.coordination.k8s.io 2605ms (07:49:21.114)
Trace[493004600]:  ---"Txn call completed" 2605ms (07:49:23.720)]
Trace[493004600]: [2.605846831s] [2.605846831s] END
I0116 07:49:23.720456      12 trace.go:236] Trace[271183379]: "Patch" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:0a95c947-1de3-42df-a4d1-a26142427057,client:10.10.184.191,api-group:,api-version:v1,name:i-0aa04d8fec4795449,subresource:status,namespace:,protocol:HTTP/2.0,resource:nodes,scope:resource,url:/api/v1/nodes/i-0aa04d8fec4795449/status,user-agent:kubelet/v1.33.0 (linux/amd64) kubernetes/6d24d3b,verb:PATCH (16-Jan-2025 07:49:21.110) (total time: 2609ms):
Trace[271183379]: ["GuaranteedUpdate etcd3" audit-id:0a95c947-1de3-42df-a4d1-a26142427057,key:/minions/i-0aa04d8fec4795449,type:*core.Node,resource:nodes 2609ms (07:49:21.110)
Trace[271183379]:  ---"Txn call completed" 2606ms (07:49:23.720)]
I0116 07:49:23.962503      12 healthz.go:280] etcd check failed: healthz
[-]etcd failed: error getting data from etcd: context deadline exceeded
I0116 07:49:23.964388      12 healthz.go:280] etcd check failed: healthz
[-]etcd failed: error getting data from etcd: context deadline exceeded
I0116 07:49:24.174751      12 trace.go:236] Trace[2087418895]: "Update" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:9dfb98fe-c518-4110-ae6a-5bbc82b78030,client:10.5.185.132,api-group:coordination.k8s.io,api-version:v1,name:i-01b9c7e8e59c0452b,subresource:,namespace:kube-node-lease,protocol:HTTP/2.0,resource:leases,scope:resource,url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/i-01b9c7e8e59c0452b,user-agent:kubelet/v1.33.0 (linux/amd64) kubernetes/6d24d3b,verb:PUT (16-Jan-2025 07:49:21.150) (total time: 3023ms):
Trace[2087418895]: ["GuaranteedUpdate etcd3" audit-id:9dfb98fe-c518-4110-ae6a-5bbc82b78030,key:/leases/kube-node-lease/i-01b9c7e8e59c0452b,type:*coordination.Lease,resource:leases.coordination.k8s.io 3023ms (07:49:21.151)
Trace[2087418895]:  ---"Txn call completed" 3023ms (07:49:24.174)]
Trace[2087418895]: [3.023712131s] [3.023712131s] END
I0116 07:49:24.174769      12 trace.go:236] Trace[1464681109]: "Update" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:21a6a0b3-721a-4455-97b6-87978ad1e1d2,client:10.6.57.54,api-group:coordination.k8s.io,api-version:v1,name:i-0d0d9a8e21d529f4a,subresource:,namespace:kube-node-lease,protocol:HTTP/2.0,resource:leases,scope:resource,url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/i-0d0d9a8e21d529f4a,user-agent:kubelet/v1.33.0 (linux/amd64) kubernetes/6d24d3b,verb:PUT (16-Jan-2025 07:49:21.154) (total time: 3019ms):
Trace[1464681109]: ["GuaranteedUpdate etcd3" audit-id:21a6a0b3-721a-4455-97b6-87978ad1e1d2,key:/leases/kube-node-lease/i-0d0d9a8e21d529f4a,type:*coordination.Lease,resource:leases.coordination.k8s.io 3019ms (07:49:21.154)
Trace[1464681109]:  ---"Txn call completed" 3019ms (07:49:24.174)]
Trace[1464681109]: [3.01980043s] [3.01980043s] END

^^^ It appears that etcd is failing the health check probes from the APIServer, which explains the APIServer availability SLO breach (the correlation shows up multiple times).

Also, around the same timestamps (07:49:23.720456 and 07:49:24.174751), etcd was taking around 3 seconds for node object patches and lease renewals; these calls alone already breach the APIServer latency SLOs.
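
The apiserver's etcd health check is a short-deadline request against etcd, so once individual etcd transactions take on the order of seconds, the probe blows its deadline and healthz flips to "[-]etcd failed: ... context deadline exceeded". A rough sketch of what such a probe amounts to; this is not the apiserver's actual code, and the endpoint, the missing TLS config, and the 2s deadline are illustrative assumptions:

// etcd_probe_sketch.go: rough sketch of a short-deadline etcd probe in the
// spirit of the apiserver's etcd health check; not values taken from this cluster.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local etcd client endpoint
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// If etcd is spending ~3s per transaction, a probe like this exceeds its
	// deadline and the check reports "etcd failed: ... context deadline exceeded".
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "health"); err != nil {
		fmt.Println("etcd check failed:", err)
		return
	}
	fmt.Println("etcd check ok")
}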

Next steps to understand:

  • why etcd was not responding to the APIServer's health check probes in time
  • why etcd transactions for lease renewals, node object patches, and similar API calls are taking O(seconds)

Evaluate ^^^ whether this is due to high CPU/memory/network utilization or something else (one way to start pulling the relevant etcd metrics is sketched below).
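
One low-effort way to start on that, assuming the etcd metrics endpoint is reachable from the control-plane node (the URL below is an assumption about the local setup, not something this PR configures), is to scrape /metrics and look at the disk fsync and backend-commit latency histograms, which usually separate disk/CPU starvation from other causes:

// etcd_metrics_peek.go: illustrative helper that prints the etcd latency
// metrics most relevant to slow transactions; the metrics URL is an assumed
// --listen-metrics-urls port and may differ in kops/etcd-manager setups.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:2381/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	interesting := []string{
		"etcd_disk_wal_fsync_duration_seconds",      // slow fsync -> disk starvation
		"etcd_disk_backend_commit_duration_seconds", // slow commits -> disk/CPU pressure
		"etcd_server_slow_apply_total",              // backlog in the apply loop
	}
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		for _, m := range interesting {
			if strings.HasPrefix(line, m) {
				fmt.Println(line)
			}
		}
	}
}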

@hakuna-matatah (Contributor Author)

This should likely help with the presubmit job failure for the aws-scale tests - kubernetes/test-infra#34201

@hakuna-matatah (Contributor Author)

Given that kubernetes/test-infra#34203 is merged, I'm going to retry the scale test with this revert.

@hakuna-matatah (Contributor Author)

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

1 similar comment
@hakuna-matatah (Contributor Author)

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@k8s-ci-robot (Contributor)

@hakuna-matatah: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kops-e2e-k8s-gce-ipalias | d4217e7 | link | true | /test pull-kops-e2e-k8s-gce-ipalias
presubmit-kops-aws-scale-amazonvpc-using-cl2 | d4217e7 | link | false | /test presubmit-kops-aws-scale-amazonvpc-using-cl2

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
