vpa-recommender can issue 0 as memory recommendation after restart #7726
Which component are you using?:

/area vertical-pod-autoscaler

What version of the component are you using?:

Component version: vpa-recommender 1.2.1

What k8s version are you using (kubectl version)?:

What environment is this in?:

Not relevant.
What did you expect to happen?:

vpa-recommender to always issue correct memory recommendations.

What happened instead?:
After restarts of the vpa-recommender it could set very low (equal to minAllowed) memory recommendations, resulting in constant OOMKills. After such an incorrect recommendation occurs, vpa-recommender does not recommend higher memory again, and this can persist even after additional restarts of vpa-recommender. This seems to happen only when using VerticalPodAutoscalerCheckpoints to work with historical histogram data.

In more detail, vpa-recommender tries to set a 0 memory recommendation, which is then capped to the configured minimum allowed resources. The 0 memory recommendation comes from the histogram.Percentile(...) function returning 0. This happens because the histogram.IsEmpty() function returns true, even though there are buckets in the histogram with weights larger than epsilon (0.0001).

To investigate this, we used a forked version of vpa-recommender with increased logging verbosity to also display data from the histogram's buckets before they are saved in a VPACheckpoint. Here are the logs that we got when this issue occurred:
You can see that in the memory histogram, the first bucket has a weight of 9.994727728962825e-05, which is less than 0.0001, and hence histogram.IsEmpty() would return true when called.

Below I try to explain how this can happen:
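As a concrete illustration of the emptiness check first (a simplified sketch: it assumes IsEmpty boils down to comparing the min bucket's weight against epsilon, which approximates but is not the actual VPA histogram code):

```go
package main

import "fmt"

// epsilon matches the threshold mentioned above.
const epsilon = 0.0001

// isEmpty is a stand-in for histogram.IsEmpty(): a histogram whose min
// bucket weight is below epsilon is treated as holding no data at all.
func isEmpty(minBucketWeight float64) bool {
	return minBucketWeight < epsilon
}

func main() {
	// The weight observed in the logs above, just under epsilon:
	fmt.Println(isEmpty(9.994727728962825e-05)) // prints true
	// The same weight before the checkpoint round-trip would be fine:
	fmt.Println(isEmpty(0.00013)) // prints false
}
```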
The main reason for this issue seems to be the transformation of the weights in the histograms from floating point numbers to integer values when saving them to VPACheckpoints, and then back from integer to float when loading them from VPACheckpoints. Basically, a weight that was very close to epsilon, e.g. 0.00013, could get rounded to 1 when saved in the VPACheckpoint. When loaded from the VPACheckpoint, the same weight could become less than epsilon (e.g. 9.994727728962825e-05 as in the example above). This is because weights are multiplied by different factors when saving and loading, and there is some precision loss.

Note also that in our clusters we run vpa-recommender under VPA as well (managed by separate VPA components), which means that it is restarted around every 12 hours to bring the resource requests closer to the recommendations. The frequency of the restarts contributes to the issue.

Here are the steps that need to happen for the issue to occur:
1. vpa-recommender is restarted.
2. VPACheckpoints are loaded and saved in the ContainersInitialAggregateState variable.
3. Metrics samples are added to the aggregateStateMap and the aggregateContainerStates for the corresponding VPA.
4. ContainersInitialAggregateState and aggregateContainerStates are merged. The referenceTimestamp of the histogram in the aggregateContainerStates happens to be ahead of the referenceTimestamp in the initial state, so the referenceTimestamp has to be shifted forward and the weights that were loaded from the checkpoint are scaled down. Those less than epsilon are "discarded" via the histogram.updateMinAndMaxBucket() function.
5. VPACheckpoints are saved, with the weight of the min bucket being saved as 1.
6. vpa-recommender is restarted again.
7. The checkpoint is loaded and the min bucket's weight becomes 9.994727728962825e-05. However, this time during the merge the weights are not scaled, as the referenceTimestamps of the histograms match (the referenceTimestamp of the histogram was already shifted during the previous restart, merge, and save). This means that no scale-down and no removal of potential weights less than epsilon has occurred.

Basically, what this means is that if there is no referenceTimestamp shift and no scale-down of weights after they are loaded from a checkpoint, there is a possibility for the histogram to be incorrectly detected as being empty.

How to reproduce it (as minimally and precisely as possible):
This issue can only be reproduced if vpa-recommender is restarted twice in the frame of 24 hours (more concretely, twice within the decay half-life setting of the histograms).

I wrote this small main function which approximates what VPA does to reproduce the issue: https://gist.github.com/plkokanov/339ddfe33d05cdc028b273bbac5bdb0a
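The float-to-int-to-float weight round-trip described above can also be sketched in a few lines (the scaling formulas and the maxCheckpointWeight constant here are assumptions that approximate the described behavior, not the exact vpa-recommender code):

```go
package main

import (
	"fmt"
	"math"
)

// Assumed maximum integer weight stored in a checkpoint bucket.
const maxCheckpointWeight = 10000.0

// save scales bucket weights so the largest becomes maxCheckpointWeight,
// then rounds each to an integer, as would be stored in a VPACheckpoint.
func save(weights []float64) (ints []int64, totalWeight float64) {
	maxW := 0.0
	for _, w := range weights {
		totalWeight += w
		if w > maxW {
			maxW = w
		}
	}
	ratio := maxCheckpointWeight / maxW
	for _, w := range weights {
		ints = append(ints, int64(math.Round(w*ratio)))
	}
	return ints, totalWeight
}

// load converts the integer weights back to floats, scaling so that the
// restored weights sum to the recorded total weight. Note that this uses
// a different factor than save did, which is where precision is lost.
func load(ints []int64, totalWeight float64) []float64 {
	var sum int64
	for _, v := range ints {
		sum += v
	}
	ratio := totalWeight / float64(sum)
	out := make([]float64, len(ints))
	for i, v := range ints {
		out[i] = float64(v) * ratio
	}
	return out
}

func main() {
	const epsilon = 0.0001
	// A min bucket weight slightly above epsilon plus one large bucket.
	weights := []float64{0.00013, 0.9}
	ints, total := save(weights)
	restored := load(ints, total)
	fmt.Printf("saved ints: %v\n", ints)
	fmt.Printf("restored min bucket: %g, below epsilon: %v\n",
		restored[0], restored[0] < epsilon) // below epsilon: true
}
```

The min bucket weight 0.00013 is rounded to the integer 1 on save, but because the load factor differs from the save factor it comes back below epsilon, matching the behavior described above.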
Anything else we need to know?:

So far we've come up with two ways that seem to fix the issue:

1. Calling histogram.updateMinAndMaxBucket() after loading a checkpoint. Seems to be working OK, but there might be some additional precision loss: weights that would contribute to the histogram's totalWeight before being saved to and loaded from the VPACheckpoint no longer do so.
2. Ensuring that whenever referenceTimestamps are set, they are set to a time before the timestamp of the added samples. This way (if my calculations are correct), the weight of the most recent sample will never be less than 1 after weights are scaled. This seems to ensure that the ratios used when loading/saving weights to the checkpoint are such that the issue doesn't occur.

/cc @voelzmo and @ialidzhikov as they also took part during the investigation of this issue.
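The first of the two fixes can be sketched like this (a simplified model with an explicit minBucket index; an approximation of what calling histogram.updateMinAndMaxBucket() after load would achieve, not the actual VPA implementation):

```go
package main

import "fmt"

const epsilon = 0.0001

// histogram is a simplified model: the emptiness check looks only at the
// bucket referenced by minBucket, so a stale minBucket pointing at a
// sub-epsilon weight makes the whole histogram look empty.
type histogram struct {
	bucketWeight []float64
	minBucket    int
}

func (h *histogram) isEmpty() bool {
	return h.bucketWeight[h.minBucket] < epsilon
}

// updateMinBucket re-scans for the first bucket with weight at least
// epsilon, zeroing out the sub-epsilon buckets below it.
func (h *histogram) updateMinBucket() {
	for i, w := range h.bucketWeight {
		if w >= epsilon {
			h.minBucket = i
			return
		}
		h.bucketWeight[i] = 0 // discard weight lost to precision
	}
}

func main() {
	// State after loading a checkpoint: the min bucket's weight dipped
	// just below epsilon in the round-trip; the second bucket holds data.
	h := &histogram{
		bucketWeight: []float64{9.994727728962825e-05, 0.9},
		minBucket:    0,
	}
	fmt.Println(h.isEmpty()) // true: histogram wrongly looks empty
	h.updateMinBucket()
	fmt.Println(h.isEmpty()) // false: the real data is visible again
}
```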