vpa-recommender can issue 0 as memory recommendation after restart #7726

Open
plkokanov opened this issue Jan 20, 2025 · 0 comments
Labels: area/vertical-pod-autoscaler, kind/bug


Which component are you using?:

/area vertical-pod-autoscaler

What version of the component are you using?:

Component version: vpa-recommender 1.2.1

What k8s version are you using (kubectl version)?:

kubectl version output:

```
$ kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.5
```

What environment is this in?:

Not relevant.

What did you expect to happen?:

vpa-recommender to always issue correct memory recommendations.

What happened instead?:
After a restart, the vpa-recommender can set very low (equal to minAllowed) memory recommendations, resulting in constant OOMKills. Once such an incorrect recommendation occurs, the vpa-recommender does not recommend higher memory again, and this can persist even across additional restarts. This seems to happen only when VerticalPodAutoscalerCheckpoints are used to persist historical histogram data.

In more detail, the vpa-recommender computes a memory recommendation of 0, which is then capped to the configured minimum allowed resources. The 0 comes from the histogram.Percentile(...) function returning 0, which happens because histogram.IsEmpty() returns true even though the histogram contains buckets with weights larger than epsilon (0.0001).
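For context, here is a minimal sketch of the emptiness check, paraphrased from my reading of the histogram implementation in pkg/recommender/util/histogram.go (field names simplified; details may differ). The key point is that IsEmpty() only inspects the weight of the cached minBucket index, so if that single bucket's weight slips below epsilon without minBucket being re-derived, the whole histogram is treated as empty:

```go
// Paraphrased sketch, not the exact upstream code.
const epsilon = 0.0001 // bucket weights below this are treated as zero

type histogram struct {
	bucketWeight []float64 // weight of each bucket
	minBucket    int       // cached index of the lowest non-empty bucket
	maxBucket    int       // cached index of the highest non-empty bucket
	totalWeight  float64   // sum of all bucket weights
}

// IsEmpty looks only at the cached minBucket. If that bucket's weight slips
// below epsilon without minBucket being re-derived, IsEmpty returns true even
// though other buckets still hold significant weight, and Percentile(...)
// then returns 0.
func (h *histogram) IsEmpty() bool {
	return h.bucketWeight[h.minBucket] < epsilon
}
```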

To investigate this, we used a forked version of the vpa-recommender with increased logging verbosity, to also display data from the histogram's buckets before they are saved in a VPACheckpoint.
Here are the logs we got when this issue occurred:

```
{"log":"Computed \"memory\" resources for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\" were 0 in percentile estimator of \"targetEstimator\"!","pid":"1","severity":"WARN","source":"estimator.go:133"}
2024-12-17 06:41:24	
{"log":"Computed \"memory\" resources for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\" were 0 after applying margin in margin estimator of \"targetEstimator\", they were 0 before that","pid":"1","severity":"WARN","source":"estimator.go:194"}
2024-12-17 06:41:24	
{"log":"Computed \"memory\" resources for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\" were below minimum in min resource estimator of \"targetEstimator\"! Computed 0, minimum is 10485760.","pid":"1","severity":"WARN","source":"estimator.go:212"}
2024-12-17 06:41:24	
{"log":"The memory histogram for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\" is empty!","pid":"1","severity":"WARN","source":"estimator.go:240"}
2024-12-17 06:41:24	
{"log":"Here's the string representation of the memory histogram for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\": referenceTimestamp: 2024-12-18 00:00:00 +0000 UTC, halfLife: 24h0m0s; minBucket: 1, maxBucket: 38, totalWeight: 9.050249754647215; %-tile value: ; 0: 0; 5: 0; 10: 0; 15: 0; 20: 0; 25: 0; 30: 0; 35: 0; 40: 0; 45: 0; 50: 0; 55: 0; 60: 0; 65: 0; 70: 0; 75: 0; 80: 0; 85: 0; 90: 0; 95: 0; 100: 0; buckets value; 0: 0; 1: 9.994727728962825e-05; 2: 1.5697793785237693; 3: 0.37959975914600813; 4: 0.17430805159311166; 5: 0; 6: 0; 7: 0; 8: 0; 9: 0; 10: 0; 11: 0; 12: 0; 13: 0; 14: 0; 15: 0; 16: 0; 17: 0; 18: 0.19719597809243655; 19: 0; 20: 0; 21: 0; 22: 0.001599156436634052; 23: 0.5493102359837969; 24: 0.6782422236874174; 25: 9.994727728962825e-05; 26: 0.27465511799189846; 27: 0.9994727728962826; 28: 0.6677477595720064; 29: 0.027685395809227027; 30: 0.9786837392200398; 31: 1.21727424476889; 32: 0.6442601494089437; 33: 0.6688471796221923; 34: 0; 35: 0; 36: 0; 37: 0; 38: 0.021388717339980445; 39: 0; 40: 0; 41: 0; 42: 0; 43: 0; 44: 0; 45: 0; 46: 0; 47: 0; 48: 0; 49: 0; 50: 0; 51: 0; 52: 0; 53: 0; 54: 0; 55: 0; 56: 0; 57: 0; 58: 0; 59: 0; 60: 0; 61: 0; 62: 0; 63: 0; 64: 0; 65: 0; 66: 0; 67: 0; 68: 0; 69: 0; 70: 0; 71: 0; 72: 0; 73: 0; 74: 0; 75: 0; 76: 0; 77: 0; 78: 0; 79: 0; 80: 0; 81: 0; 82: 0; 83: 0; 84: 0; 85: 0; 86: 0; 87: 0; 88: 0; 89: 0; 90: 0; 91: 0; 92: 0; 93: 0; 94: 0; 95: 0; 96: 0; 97: 0; 98: 0; 99: 0; 100: 0; 101: 0; 102: 0; 103: 0; 104: 0; 105: 0; 106: 0; 107: 0; 108: 0; 109: 0; 110: 0; 111: 0; 112: 0; 113: 0; 114: 0; 115: 0; 116: 0; 117: 0; 118: 0; 119: 0; 120: 0; 121: 0; 122: 0; 123: 0; 124: 0; 125: 0; 126: 0; 127: 0; 128: 0; 129: 0; 130: 0; 131: 0; 132: 0; 133: 0; 134: 0; 135: 0; 136: 0; 137: 0; 138: 0; 139: 0; 140: 0; 141: 0; 142: 0; 143: 0; 144: 0; 145: 0; 146: 0; 147: 0; 148: 0; 149: 0; 150: 0; 151: 0; 152: 0; 153: 0; 154: 0; 155: 0; 156: 0; 157: 0; 158: 0; 159: 0; 160: 0; 161: 0; 162: 0; 163: 0; 164: 0; 165: 0; 166: 0; 167: 0; 168: 0; 169: 0; 170: 0; 171: 0; 172: 0; 173: 0; 174: 0; 175: 0","pid":"1","severity":"WARN","source":"estimator.go:243"}
2024-12-17 06:41:24	
{"log":"Here's the checkpoint/state for VPA \"extension-os-suse-chost-7bhgq/gardener-extension-os-suse-chost-vpa\": {\"lastUpdateTime\":\"2024-12-17T06:41:24Z\",\"version\":\"v3\",\"cpuHistogram\":{\"referenceTimestamp\":\"2024-12-17T00:00:00Z\",\"bucketWeights\":{\"0\":10000,\"1\":637,\"10\":223,\"11\":202,\"12\":138,\"13\":77,\"14\":86,\"15\":79,\"16\":40,\"17\":42,\"18\":25,\"19\":27,\"2\":3816,\"20\":27,\"21\":27,\"22\":20,\"23\":16,\"24\":5,\"25\":4,\"26\":11,\"27\":13,\"28\":7,\"29\":5,\"3\":2836,\"30\":5,\"31\":1,\"32\":4,\"33\":4,\"34\":1,\"35\":1,\"36\":1,\"37\":4,\"38\":1,\"39\":1,\"4\":717,\"40\":1,\"5\":311,\"6\":183,\"7\":57,\"8\":146,\"9\":183},\"totalWeight\":487.1371144429181},\"memoryHistogram\":{\"referenceTimestamp\":\"2024-12-18T00:00:00Z\",\"bucketWeights\":{\"1\":1,\"18\":1256,\"2\":10000,\"22\":10,\"23\":3499,\"24\":4321,\"25\":1,\"26\":1750,\"27\":6367,\"28\":4254,\"29\":176,\"3\":2418,\"30\":6235,\"31\":7754,\"32\":4104,\"33\":4261,\"38\":136,\"4\":1110},\"totalWeight\":9.050249754647215},\"firstSampleStart\":\"2023-03-16T12:35:34Z\",\"lastSampleStart\":\"2024-12-17T06:40:46Z\",\"totalSamplesCount\":1623603}","pid":"1","severity":"WARN","source":"estimator.go:251"}
2024-12-17 06:42:12
```

You can see that in the memory histogram the first bucket has a weight of 9.994727728962825e-05, which is less than 0.0001, and hence histogram.IsEmpty() returns true when called.

Below I try to explain how this can happen:

The main reason for this issue seems to be the transformation of the histogram weights from floating point numbers to integer values when saving them to VPACheckpoints, and back from integer to float when loading them. A weight that was very close to epsilon, e.g. 0.00013, can get rounded to 1 when saved in the VPACheckpoint. When loaded from the VPACheckpoint, the same weight can come back as less than epsilon (e.g. 9.994727728962825e-05 as in the example above), because the weights are multiplied by different factors when saving and when loading, and there is some precision loss.
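To illustrate, here is a small self-contained Go program that approximates this round trip as I read it in SaveToChekpoint()/LoadFromCheckpoint(): on save, weights are normalized so the largest bucket becomes a maximum integer checkpoint weight (10000) and rounded; on load, the integers are scaled so they sum back up to the recorded totalWeight. The bucket values below are hypothetical, chosen to show a weight just above epsilon coming back below it:

```go
package main

import (
	"fmt"
	"math"
)

const (
	epsilon             = 0.0001
	maxCheckpointWeight = 10000.0 // checkpoint weights are stored as integers scaled to this maximum
)

// save approximates the checkpoint-save scaling: normalize the weights so the
// largest bucket becomes maxCheckpointWeight, then round to integers.
func save(weights map[int]float64) map[int]uint32 {
	maxWeight := 0.0
	for _, w := range weights {
		maxWeight = math.Max(maxWeight, w)
	}
	ratio := maxCheckpointWeight / maxWeight
	saved := make(map[int]uint32)
	for b, w := range weights {
		if r := uint32(math.Round(w * ratio)); r > 0 {
			saved[b] = r
		}
	}
	return saved
}

// load approximates the checkpoint-load scaling: scale the integer weights so
// that they sum back up to the recorded totalWeight.
func load(saved map[int]uint32, totalWeight float64) map[int]float64 {
	sum := 0.0
	for _, w := range saved {
		sum += float64(w)
	}
	ratio := totalWeight / sum
	loaded := make(map[int]float64)
	for b, w := range saved {
		loaded[b] = float64(w) * ratio
	}
	return loaded
}

func main() {
	// Hypothetical histogram: the min bucket holds 0.00013, just above epsilon.
	weights := map[int]float64{1: 0.00013, 2: 0.9, 27: 0.6}
	totalWeight := 0.0
	for _, w := range weights {
		totalWeight += w
	}
	loaded := load(save(weights), totalWeight)
	// Prints ~9.0e-05: the save ratio (maxCheckpointWeight/maxWeight) and the
	// load ratio (totalWeight/sum) are not exact inverses of each other, so
	// 0.00013 is saved as 1 and comes back below epsilon.
	fmt.Printf("bucket 1: %v -> %v (epsilon = %v)\n", weights[1], loaded[1], epsilon)
}
```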

Note also that in our clusters we run the vpa-recommender under VPA as well (managed by separate VPA components), which means it is restarted roughly every 12 hours to bring its resource requests closer to the recommendations. This restart frequency contributes to the issue.

Here are the steps that need to happen for the issue to occur:

  1. vpa-recommender is restarted.
  2. Histograms from VPACheckpoints are loaded and saved in the ContainersInitialAggregateState variable.
  3. New metrics are ingested and added to the aggregateStateMap and the aggregateContainerStates for the corresponding VPA.
  4. During the recommendation, the container states from the ContainersInitialAggregateState and aggregateContainerStates are merged. The referenceTimestamp of the histograms in the aggregateContainerStates happens to be ahead of the referenceTimestamp in the initial state, so the referenceTimestamp has to be shifted forward and the weights that were loaded from the checkpoint are scaled down. Weights that drop below epsilon are "discarded" via the histogram.updateMinAndMaxBucket() function (a sketch of this scaling follows below the list).
  5. A correct target recommendation is issued.
  6. The histograms are saved in the VPACheckpoints, with the weight of the min bucket being saved as 1.
  7. vpa-recommender is restarted again.
  8. Steps 2, 3 and 4 are repeated. The weight of the min bucket is loaded as 9.994727728962825e-05. However, this time the weights are not scaled during the merge, because the referenceTimestamps of the histograms match (the referenceTimestamp was already shifted during the previous restart, merge and save). This means that no scale-down has occurred and weights below epsilon have not been removed.
  9. VPA issues an incorrect recommendation, as the min bucket's weight is now less than epsilon.

In short: if there is no referenceTimestamp shift, and therefore no scale-down of the weights after they are loaded from a checkpoint, the histogram can be incorrectly detected as empty.
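For reference, here is a paraphrased sketch of the merge-time scaling, based on my reading of pkg/recommender/util/decaying_histogram.go and histogram.go (it reuses the histogram type from the sketch above; details may differ). The key point is that sub-epsilon buckets are only dropped as a side effect of scale(), which runs only when the referenceTimestamp actually moves:

```go
// Paraphrased sketch; assumes "math" and "time" are imported and the
// histogram type and epsilon constant from the earlier sketch.
type decayingHistogram struct {
	histogram                        // the bucketed histogram from the earlier sketch
	halfLife           time.Duration // decay half-life (24h by default)
	referenceTimestamp time.Time
}

// shiftReferenceTimestamp moves the reference timestamp forward and decays
// all weights by 2^(-delta/halfLife). If the timestamps already match,
// nothing is scaled and nothing is discarded.
func (h *decayingHistogram) shiftReferenceTimestamp(newReference time.Time) {
	newReference = newReference.Round(h.halfLife) // keep it a multiple of halfLife
	factor := math.Exp2(-float64(newReference.Sub(h.referenceTimestamp)) / float64(h.halfLife))
	h.scale(factor) // scale() also re-derives minBucket/maxBucket
	h.referenceTimestamp = newReference
}

// scale multiplies every bucket weight by factor; buckets whose weight drops
// below epsilon are then "discarded" by updateMinAndMaxBucket().
func (h *histogram) scale(factor float64) {
	for i := h.minBucket; i <= h.maxBucket; i++ {
		h.bucketWeight[i] *= factor
	}
	h.totalWeight *= factor
	h.updateMinAndMaxBucket()
}

// updateMinAndMaxBucket walks the cached indices past buckets whose weight
// fell below epsilon, so IsEmpty()/Percentile() ignore them afterwards.
func (h *histogram) updateMinAndMaxBucket() {
	lastBucket := len(h.bucketWeight) - 1
	for h.bucketWeight[h.minBucket] < epsilon && h.minBucket < lastBucket {
		h.minBucket++
	}
	for h.bucketWeight[h.maxBucket] < epsilon && h.maxBucket > 0 {
		h.maxBucket--
	}
}
```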

How to reproduce it (as minimally and precisely as possible):
This issue can only be reproduced if the vpa-recommender is restarted twice within 24 hours (more precisely, twice within the decay half-life configured for the histograms).
I wrote this small main function which approximates what VPA does to reproduce the issue: https://gist.github.com/plkokanov/339ddfe33d05cdc028b273bbac5bdb0a

Anything else we need to know?:

So far we've come up with two ways that seem to fix the issue:

  1. Call histogram.updateMinAndMaxBucket() after loading a checkpoint (sketched below the list). This seems to work, but there may be some additional precision loss: weights that contributed to the histogram's totalWeight before being saved to and loaded from the VPACheckpoint no longer do so.
  2. Ensure that whenever referenceTimestamps are set, they are set to a time at or before the timestamps of the added samples. This way (if my calculations are correct) the weight of the most recent sample will never drop below 1 after the weights are scaled, which keeps the ratios used when saving/loading weights to the checkpoint in a range where the issue does not occur.
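For illustration, a minimal sketch of fix 1; the real loading logic is elided, and updateMinAndMaxBucket() is the function sketched earlier, so this shows the shape of the change rather than an actual patch:

```go
func (h *histogram) LoadFromCheckpoint(checkpoint *vpa_types.HistogramCheckpoint) error {
	// ... existing logic: restore integer bucket weights scaled by totalWeight/sum ...

	// Hypothetical fix: re-derive minBucket/maxBucket right after restoring,
	// so a bucket whose weight came back below epsilon due to round-trip
	// precision loss can no longer masquerade as the min bucket.
	h.updateMinAndMaxBucket()
	return nil
}
```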

/cc @voelzmo and @ialidzhikov, who also took part in the investigation of this issue.
