[BUG] Model undeploying giving empty response #3285

Open
gaurav7830 opened this issue Dec 17, 2024 · 3 comments · May be fixed by #3380
Assignees
Labels
bug Something isn't working

Comments

@gaurav7830
What is the bug?
When a model is in the PARTIALLY_DEPLOYED state and undeploy is called on that domain, the undeploy API returns an empty response.

How can one reproduce the bug?

  1. Create a domain.
  2. Update the model state to partially deployed state.
  3. Call the model undeploy API.
  4. The response is empty.

What is the expected behavior?
The model should be in the undeployed state after the undeploy API is called.

What is your host/environment?
OS 2.17 version

Do you have any screenshots?
None

Do you have any additional context?
None

@gaurav7830 gaurav7830 added bug Something isn't working untriaged labels Dec 17, 2024
@brianf-aws
Contributor

Hey Gaurav, thank you for creating this issue. I will look into it!

@brianf-aws
Contributor

brianf-aws commented Dec 19, 2024

Hey @gaurav7830, I'm trying to replicate this locally, but I cannot get that empty response. Even when I purposely bring down a node during deployment, it seems smart enough to give back correct results.

I'm wondering whether you have plugins.ml_commons.sync_up_job_interval_in_seconds: (number) set on your cluster:
https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#set-sync-job-intervals

So far I have tested with the sync-up job "disabled". Please let me know what interval you are using so I can match your setup as closely as possible.
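For anyone who wants to try a specific interval, here is a minimal sketch of the request that would set it through the cluster settings API. It only builds the request and does not send it. The base URL `http://localhost:9200`, the class name, and the choice of `persistent` scope are assumptions for illustration; the 3-second value is just an example.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds (but does not send) a cluster settings update
// for the ML Commons sync-up job interval.
class SyncUpIntervalSketch {
    static final String SETTING = "plugins.ml_commons.sync_up_job_interval_in_seconds";

    // JSON body for PUT _cluster/settings; "persistent" scope is an assumption.
    static String settingsBody(int seconds) {
        return "{\"persistent\":{\"" + SETTING + "\":" + seconds + "}}";
    }

    // baseUrl is an assumption, e.g. a local test cluster.
    static HttpRequest buildRequest(String baseUrl, int seconds) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(seconds)))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://localhost:9200", 3);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(settingsBody(3));
    }
}
```

Sending the request would need an `HttpClient` plus whatever authentication the cluster requires, which is left out here.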

@brianf-aws
Contributor

brianf-aws commented Dec 24, 2024

You can get the scenario organically if you do the following.

Suppose you have an OS cluster with nodes a, b, c, d.

  1. Deploy to all nodes a,b,c,d
  2. Check the model status and the profile API: GET {{ _.domain }}/_plugins/_ml/profile/models/{{ _.model_id }}. Notice that the model is deployed on all nodes: worker_nodes : {a,b,c,d}
  3. Bring down one node (let's use node a). While this happens, make sure the sync-up job is running; for example, a 3-second interval should be enough for the state to change.
  4. After some time you'll see that the model changes state to PARTIALLY_DEPLOYED.

At this point I've theorized that a node's cache is volatile (i.e., it gets lost when the node is shut down).

  1. (Stop the sync-up job.) Bring back node a. If you check the profile API, you'll notice that even though the node is back, the profile API states that only 3 nodes serve the model.

If you undeploy in this state, you'll get back a valid response stating that the model was undeployed from {b,c,d} and that node a returned a not_found response.

        if (modelCacheHelper.isModelDeployed(modelId)) {
            modelUndeployStatus.put(modelId, UNDEPLOYED);
            mlStats.getStat(MLNodeLevelStat.ML_DEPLOYED_MODEL_COUNT).decrement();
            mlStats.getStat(MLNodeLevelStat.ML_REQUEST_COUNT).increment();
            mlStats
                .createCounterStatIfAbsent(getModelFunctionName(modelId), ActionName.UNDEPLOY, ML_ACTION_REQUEST_COUNT)
                .increment();
            mlStats.createModelCounterStatIfAbsent(modelId, ActionName.UNDEPLOY, ML_ACTION_REQUEST_COUNT).increment();
        } else {
            modelUndeployStatus.put(modelId, NOT_FOUND);
        }
        removeModel(modelId);
    }
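To make this first scenario concrete, here is a small self-contained simulation. It is only a sketch: `SketchNode`, `PartialUndeploySketch`, the model ID `model-1`, and the lowercase status strings are hypothetical stand-ins, not ml-commons classes. It assumes, as theorized above, that each node keeps its model cache in memory and loses it on restart.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a cluster node with a volatile, in-memory model cache.
class SketchNode {
    final String name;
    private final Set<String> modelCache = new HashSet<>();

    SketchNode(String name) {
        this.name = name;
    }

    void deploy(String modelId) {
        modelCache.add(modelId);
    }

    // Simulates a node restart: the in-memory cache is lost.
    void restart() {
        modelCache.clear();
    }

    // Same branching as the excerpt above: report "undeployed" if the model
    // was in the local cache, "not_found" otherwise, then drop it either way.
    String undeploy(String modelId) {
        String status = modelCache.contains(modelId) ? "undeployed" : "not_found";
        modelCache.remove(modelId);
        return status;
    }
}

class PartialUndeploySketch {
    // Collects the per-node status, keyed by node name, in node order.
    static Map<String, String> undeployEverywhere(List<SketchNode> nodes, String modelId) {
        Map<String, String> result = new LinkedHashMap<>();
        for (SketchNode node : nodes) {
            result.put(node.name, node.undeploy(modelId));
        }
        return result;
    }

    // Deploy to a, b, c, d; restart a; then undeploy from every node.
    static Map<String, String> runScenario() {
        List<SketchNode> nodes = List.of(
            new SketchNode("a"), new SketchNode("b"), new SketchNode("c"), new SketchNode("d"));
        for (SketchNode node : nodes) {
            node.deploy("model-1");
        }
        nodes.get(0).restart();
        return undeployEverywhere(nodes, "model-1");
    }

    public static void main(String[] args) {
        // prints {a=not_found, b=undeployed, c=undeployed, d=undeployed}
        System.out.println(runScenario());
    }
}
```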

  1. Bring down the other nodes {b,c,d}. When you bring them back, you'll see the edge case of PARTIALLY_DEPLOYED that is hard to work with: no node has the model in its cache any more, so all of the nodes {a,b,c,d} report a not-found-in-cache failure, the undeploy is skipped, and the API returns a {} response.
    private MLModelCache getExistingModelCache(String modelId) {
        MLModelCache modelCache = modelCaches.get(modelId);
        if (modelCache == null) {
            throw new IllegalArgumentException("Model not found in cache");
        }
        return modelCache;
    }

    void processUndeployModelResponseAndUpdate(
        MLUndeployModelNodesResponse undeployModelNodesResponse,
        ActionListener<MLUndeployModelNodesResponse> listener
    ) {
        List<MLUndeployModelNodesResponse> responses = undeployModelNodesResponse.getNodes();
        if (responses == null || responses.isEmpty()) {
            listener.onResponse(undeployModelNodesResponse);
            return;
        }

In Summary

The reality is that I've illustrated two edge cases of the PARTIALLY_DEPLOYED state here (there may be more). In one scenario, if the nodes still remember the model state, undeploy can give back an accurate response. But if all nodes have been shut down, there is no way to confirm that the model can be undeployed, since the nodes no longer have it in their cache.

@brianf-aws brianf-aws linked a pull request Jan 11, 2025 that will close this issue
Projects
Status: On-deck

3 participants