[BUG] Model undeploying giving empty response #3285

gaurav7830 · 2024-12-17T04:59:56Z

What is the bug?
When a model is in the partially deployed state and undeploy is called on that domain, the undeploy call returns an empty response.

How can one reproduce the bug?

Create a domain.
Update the model state to partially deployed state.
Call model undeploy api.
Response will be empty.

What is the expected behavior?
Model should be in undeployed state after calling undeploy api.

What is your host/environment?
OS 2.17 version

Do you have any screenshots?
None

Do you have any additional context?
None

brianf-aws · 2024-12-17T18:17:35Z

Hey Gaurav, thank you for creating this issue. I will look into it!

brianf-aws · 2024-12-19T23:16:42Z

Hey @gaurav7830, I'm trying to replicate this locally, but I can't get that empty response. Even when I purposely bring down a node during deployment, it seems smart enough to return correct results.

I'm wondering if you have plugins.ml_commons.sync_up_job_interval_in_seconds: (number) set on your cluster:
https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#set-sync-job-intervals

So far I have tested with the sync-up job "disabled". Please let me know what interval you are using so I can match your setup as closely as possible.

brianf-aws · 2024-12-24T18:29:37Z

You can get the scenario organically if you do the following.

Suppose you have an OS cluster with nodes a, b, c, d.

Deploy to all nodes a, b, c, d.
Check the model status and also the profile API: GET {{ _.domain }}/_plugins/_ml/profile/models/{{ _.model_id }}. Notice that the model is deployed on all nodes: worker_nodes : {a,b,c,d}.
Bring down one node (let's use node a). While this happens, make sure the sync-up job is running; for example, an interval of 3 seconds should be enough for the state to change.
After some time you'll see that the model changes state to PARTIALLY_DEPLOYED.
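
The profile check and the sync-up interval from the steps above could be issued from Java roughly as follows. This is only a sketch: the base URL http://localhost:9200, the class name MlCommonsRequests, and its helper names are assumptions made for illustration, and it assumes the interval can be updated dynamically through the cluster settings API. The profile path and the setting name are the ones quoted in this thread. The requests are only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative helper, not part of ml-commons. Builds requests without sending them.
final class MlCommonsRequests {
    // Assumed local cluster address; replace with your own domain endpoint.
    static final String BASE_URL = "http://localhost:9200";

    /** GET the profile of one model (path as quoted in the steps above). */
    static HttpRequest profileRequest(String modelId) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/_plugins/_ml/profile/models/" + modelId))
            .GET()
            .build();
    }

    /** JSON body that sets the sync-up job interval (assumes the setting is dynamic). */
    static String syncIntervalBody(int seconds) {
        return "{\"persistent\":{\"plugins.ml_commons.sync_up_job_interval_in_seconds\":" + seconds + "}}";
    }

    /** PUT _cluster/settings carrying the body above. */
    static HttpRequest syncIntervalRequest(int seconds) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(syncIntervalBody(seconds)))
            .build();
    }

    public static void main(String[] args) {
        // "my_model_id" is a placeholder, not a real model ID.
        System.out.println(profileRequest("my_model_id").uri());
        System.out.println(syncIntervalBody(3));
    }
}
```

Either request can then be sent with java.net.http.HttpClient once the base URL and any authentication match your cluster.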

At this point, my theory is that a node's cache is volatile (i.e. it is lost when the node shuts down).

(Stop the sync-up job.) Bring back node a. If you check the profile API, you'll notice that even though the node is back, the profile API reports that only 3 nodes serve the model.

If you undeploy in this state, you'll get back a valid response stating that the model was undeployed from {b,c,d} and that node a returned a not_found response.

ml-commons/plugin/src/main/java/org/opensearch/ml/model/MLModelManager.java

Lines 1872 to 1884 in 7209a10

            if (modelCacheHelper.isModelDeployed(modelId)) {
                modelUndeployStatus.put(modelId, UNDEPLOYED);
                mlStats.getStat(MLNodeLevelStat.ML_DEPLOYED_MODEL_COUNT).decrement();
                mlStats.getStat(MLNodeLevelStat.ML_REQUEST_COUNT).increment();
                mlStats
                    .createCounterStatIfAbsent(getModelFunctionName(modelId), ActionName.UNDEPLOY, ML_ACTION_REQUEST_COUNT)
                    .increment();
                mlStats.createModelCounterStatIfAbsent(modelId, ActionName.UNDEPLOY, ML_ACTION_REQUEST_COUNT).increment();
            } else {
                modelUndeployStatus.put(modelId, NOT_FOUND);
            }
            removeModel(modelId);
        }

Bring down the other nodes {b,c,d}. When you bring them back, you'll see the PARTIALLY_DEPLOYED edge case that is hard to work with: undeploy returns a {} response because none of the nodes {a,b,c,d} find the model in their cache, so the undeploy is skipped.

ml-commons/plugin/src/main/java/org/opensearch/ml/model/MLModelCacheHelper.java

Lines 735 to 741 in 618678f

    
    private MLModelCache getExistingModelCache(String modelId) {
        MLModelCache modelCache = modelCaches.get(modelId);
        if (modelCache == null) {
            throw new IllegalArgumentException("Model not found in cache");
        }
        return modelCache;
    }

ml-commons/plugin/src/main/java/org/opensearch/ml/action/undeploy/TransportUndeployModelAction.java

Lines 102 to 110 in d878fbd

    
    void processUndeployModelResponseAndUpdate(
        MLUndeployModelNodesResponse undeployModelNodesResponse,
        ActionListener<MLUndeployModelNodesResponse> listener
    ) {
        List<MLUndeployModelNodeResponse> responses = undeployModelNodesResponse.getNodes();
        if (responses == null || responses.isEmpty()) {
            listener.onResponse(undeployModelNodesResponse);
            return;
        }

In Summary

In reality, I've illustrated two edge cases of the PARTIALLY_DEPLOYED state here (there may be more). In one scenario, if some nodes still remember the model in their cache, undeploy can return an accurate response. But if all nodes have been shut down, there is no way to confirm what can be undeployed, since no node has the model in its cache.
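
To make the two scenarios concrete, here is a toy simulation. It is a sketch under stated assumptions, not ml-commons code: UndeploySimulation, Node, and every method on them are hypothetical, and the only behaviour borrowed from the thread is the branch quoted from MLModelManager (model cached means undeployed, otherwise not_found) plus the theory that a restart wipes a node's cache. It does not model how the REST layer ends up with a {} body.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy simulation of the two scenarios; none of these types exist in ml-commons.
final class UndeploySimulation {

    /** Hypothetical node whose model cache lives only in memory. */
    static final class Node {
        final String name;
        private final Set<String> cache = new HashSet<>();

        Node(String name) { this.name = name; }

        void deploy(String modelId) { cache.add(modelId); }

        /** Restarting a node drops its in-memory cache (the theory from the thread). */
        void restart() { cache.clear(); }

        /** Same branch as the quoted code: cached -> undeployed, otherwise not_found. */
        String undeploy(String modelId) {
            return cache.remove(modelId) ? "undeployed" : "not_found";
        }
    }

    /** Creates nodes a, b, c, d with the model deployed on each. */
    static List<Node> cluster(String modelId) {
        List<Node> nodes = new ArrayList<>();
        for (String name : List.of("a", "b", "c", "d")) {
            Node node = new Node(name);
            node.deploy(modelId);
            nodes.add(node);
        }
        return nodes;
    }

    /** Asks every node to undeploy and records each node's answer in order. */
    static Map<String, String> undeployEverywhere(List<Node> nodes, String modelId) {
        Map<String, String> statuses = new LinkedHashMap<>();
        for (Node node : nodes) {
            statuses.put(node.name, node.undeploy(modelId));
        }
        return statuses;
    }

    public static void main(String[] args) {
        // Scenario 1: only node a was restarted.
        List<Node> first = cluster("m1");
        first.get(0).restart();
        System.out.println(undeployEverywhere(first, "m1"));

        // Scenario 2: every node was restarted, so no cache remembers the model.
        List<Node> second = cluster("m1");
        second.forEach(Node::restart);
        System.out.println(undeployEverywhere(second, "m1"));
    }
}
```

In the first scenario nodes b, c, and d report undeployed and node a reports not_found. In the second, every node reports not_found, so nothing is actually undeployed.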

gaurav7830 added the bug and untriaged labels Dec 17, 2024

dhrubo-os removed the untriaged label Dec 17, 2024

dhrubo-os assigned brianf-aws Dec 17, 2024

dhrubo-os added this to ml-commons projects Dec 17, 2024

dhrubo-os moved this to On-deck in ml-commons projects Dec 17, 2024

brianf-aws linked a pull request Jan 11, 2025 that will close this issue

Undeploy models with no WorkerNodes #3380

Open

5 tasks
