Hi,

There seems to be a methodological error in the MIMIC evaluation. For each saliency method, the top-k most important features are replaced by an average. However, this top-k is selected over the whole test set, which means that, in principle, some patients could have all of their data replaced by the average. This is a problem, because such an evaluation method rewards not only important features, but also important patients.
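To make the failure mode concrete, here is a minimal sketch (hypothetical shapes and variable names, not the repository's code) of how a top-k taken over the whole test set can place all of one patient's entries inside the mask:

```python
import torch

# Hypothetical test-set size, number of time steps and number of features.
N, T, D = 100, 48, 31
saliency = torch.rand(N, T, D)
saliency[0] += 10.0  # one patient whose attributions are uniformly large

fraction = 0.2
k = int(fraction * N * T * D)  # top-k computed over the WHOLE test set
flat_idx = torch.topk(saliency.flatten(), k).indices
mask = torch.zeros(N * T * D)
mask[flat_idx] = 1.0
mask = mask.reshape(N, T, D)

# Fraction of perturbed inputs per patient: it is 1.0 for patient 0,
# i.e. that patient is entirely replaced by the average.
per_patient_fraction = mask.reshape(N, -1).mean(dim=1)
print(per_patient_fraction[0], per_patient_fraction.min())
```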
To corroborate this insight, I ran a simple method: select the top-k patients with the highest model predictions and replace all of their data with the average. This method performs "better" than DeepLift without explaining a single feature, which is concerning. I've included the plots and code below.
N.B.: The DeepLift results were run with the model in `eval` mode; please see #8.
In `experiments/results/mimic/plot_benchmarks.py`:
```python
...
name_dict = {
    "fit": "FIT",
    "deep_lift": "DL",
    "afo": "AFO",
    "fo": "FO",
    "retain": "RT",
    "integrated_gradient": "IG",
    "gradient_shap": "GS",
    "lime": "LIME",
    "dynamask": "MASK",
    "top_pred": "TOPK PREDS",
}
...
# Load the model:
model = StateClassifier(
    feature_size=N_features, n_state=2, hidden_size=200, rnn="GRU", device=device, return_all=True
)
model.load_state_dict(torch.load(os.path.join(path, f"model_{cv}.pt")))
model.eval()
# For each mask area, we compute the CE and the ACC for each attribution method:
for i, fraction in enumerate(areas):
    N_drop = int(fraction * N_exp * N_features * T)  # The number of inputs to perturb
    Y = model(X.transpose(1, 2))
    Y = Y[:, -1]
    Y = Y.reshape(-1, 2)
    Y_s = torch.softmax(Y, dim=-1)
    Y = torch.argmax(Y_s, dim=-1).detach().cpu().numpy()  # This is the predicted class for the unperturbed input
    # For each attribution method, use the saliency map to construct a perturbed input:
    for k, explainer in enumerate(explainers):
        if explainer == "dynamask":
            ...
        elif explainer == "top_pred":
            idx = torch.topk(Y_s[:, 1], int(len(Y_s) * fraction)).indices
            mask_tensor = torch.zeros_like(X)
            mask_tensor[idx] = 1.0
            # Perturb the most relevant inputs and compute the associated output:
            X_pert = (1 - mask_tensor) * X + mask_tensor * X_avg
            Y_pert = model(X_pert.transpose(1, 2))
            Y_pert = Y_pert[:, -1]
            Y_pert = Y_pert.reshape(-1, 2)
            Y_pert = torch.softmax(Y_pert, dim=-1)
            proba_pert = Y_pert.detach().cpu().numpy()
            Y_pert = torch.argmax(Y_pert, dim=-1).detach().cpu().numpy()
            metrics_array[k, i, 0, cv] = metrics.log_loss(Y, proba_pert)
            metrics_array[k, i, 1, cv] = metrics.accuracy_score(Y, Y_pert)  # This is ACC
        else:
            ...
```
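For comparison, here is a sketch (my own, not a patch against the repository; shapes and names are assumptions) of a per-patient top-k selection that perturbs the same fraction of inputs for every patient, so the benchmark could only reward important features:

```python
import torch

def per_patient_mask(saliency: torch.Tensor, fraction: float) -> torch.Tensor:
    """Select the top `fraction` of inputs separately for each patient.

    `saliency` is assumed to have shape (N_patients, T, N_features). Every patient
    gets exactly the same perturbation budget, so none can be fully averaged out.
    """
    N, T, D = saliency.shape
    k = int(fraction * T * D)                 # identical budget for every patient
    flat = saliency.reshape(N, -1)            # (N, T * D)
    idx = torch.topk(flat, k, dim=1).indices  # per-patient top-k
    mask = torch.zeros_like(flat)
    mask.scatter_(1, idx, 1.0)
    return mask.reshape(N, T, D)

# The perturbation itself would stay as in the benchmark:
# X_pert = (1 - mask) * X + mask * X_avg
```

With this selection, the perturbed fraction per patient is constant by construction, which removes the "important patients" confound.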
Plots: acc.pdf, ce.pdf