Variable Selection in Shifu
Shifu variable (aka feature) selection is configured in the "varSelect" section of the ModelConfig.json file. Here is a sample:
"varSelect" : {
"forceEnable" : true,
"candidateColumnNameFile" : "columns/candidate.column.names"
"forceSelectColumnNameFile" : "columns/forceselect.column.names",
"forceRemoveColumnNameFile" : "columns/forceremove.column.names",
"filterEnable" : true,
"filterNum" : 100,
"filterOutRatio" : 0.05,
"filterBy" : "FI",
"missingRateThreshold" : 0.98,
"params" : null
}
forceEnable: Whether to enable force selection. If true, every variable listed in forceSelectColumnNameFile is force-selected and every variable listed in forceRemoveColumnNameFile is force-removed for model training.
candidateColumnNameFile: File containing the names of variables that may be used in variable selection. If candidateColumnNameFile is not set, or the file is empty, all variables are candidates. Otherwise, only the variables listed in candidateColumnNameFile can be selected.
forceSelectColumnNameFile: File containing the names of variables that must be force-selected for model training, one variable name per line. For example:
variable_name_1
variable_name_2
...
variable_name_n
forceRemoveColumnNameFile: File containing the names of variables that must be force-removed for model training. The file format is the same as for forceSelectColumnNameFile.
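The one-name-per-line format and the "empty candidate file means all variables" rule can be illustrated with a small Python sketch. This is not Shifu's own code: the helper names `parse_column_names` and `candidate_columns` are hypothetical, and skipping blank lines and trimming whitespace are assumptions about how such a file would reasonably be read.

```python
def parse_column_names(text):
    """Parse the one-name-per-line format.

    Assumption: surrounding whitespace is trimmed and blank lines are skipped.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def candidate_columns(all_columns, candidate_names):
    """Return the columns eligible for variable selection.

    An empty (or unset) candidate list means every column is a candidate;
    otherwise only the listed columns are kept, in their original order.
    """
    if not candidate_names:
        return list(all_columns)
    allowed = set(candidate_names)
    return [name for name in all_columns if name in allowed]


# Example: the same text format is used by the force-select and force-remove files.
names = parse_column_names("variable_name_1\nvariable_name_2\n\n")
```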
filterEnable: Whether to enable filtering. If true, the ColumnConfig.json file is modified according to your variable selection settings after you run the shifu varselect command; if false, ColumnConfig.json is not modified. Set this to false when you only want to output a sensitivity analysis report or a feature importance report without re-selecting variables.
filterNum: Integer; the number of variables to select for model training. filterNum takes priority over filterOutRatio: in other words, once filterNum is set, filterOutRatio is ignored. To run variable selection iteratively, set filterNum to 0.
filterOutRatio: Float; the ratio of variables to filter out when shifu varselect runs. For example, if 100 variables have finalSelect=true in ColumnConfig.json and filterOutRatio is 0.05 in ModelConfig.json, running the shifu varselect command sets 5 of them to finalSelect=false in ColumnConfig.json.
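The interaction between filterNum and filterOutRatio can be written out as a short Python sketch. The function name `target_selected_count` is hypothetical, and rounding the removed count to the nearest integer is an assumption; the text above only states the 100 x 0.05 = 5 case.

```python
def target_selected_count(current_selected, filter_num, filter_out_ratio):
    """Number of variables left selected after one varselect run.

    filter_num > 0 takes priority and filter_out_ratio is ignored.
    With filter_num == 0 the ratio decides how many variables are removed.
    Assumption: the removed count is rounded to the nearest integer.
    """
    if filter_num and filter_num > 0:
        return filter_num
    removed = round(current_selected * filter_out_ratio)
    return current_selected - removed


# Example from the text: 100 selected variables, filterOutRatio = 0.05.
remaining = target_selected_count(100, 0, 0.05)
```

With filterNum set to 100, as in the sample configuration, the ratio would play no role and the target would simply be 100.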
filterBy: Method used to select variables. For KS and IV, the values are computed for each feature in the stats step; features are then sorted by KS or IV in descending order and the top ones are kept, up to the number of features to be selected. The supported methods are:
- KS: Kolmogorov-Smirnov statistic. The maximum difference between the cumulative distribution functions of the goods and the bads on a given feature; it highlights the "regions" (bins) of impact.
- IV: Information Value. Overall strength of the split characteristics, that is, how well a variable distinguishes between the categories of the response.
- FI: Feature Importance. Works only for tree models. If filterEnable is set to false, Shifu reads an existing tree model and writes the feature importance values to the featureImportance/all.fi file. If filterEnable is set to true, a new tree model is trained with the training settings and used for variable selection.
- SE: Sensitivity analysis that compares against the model output.
- ST: Sensitivity analysis that compares against the target value.
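For the KS and IV entries in the list above, the following Python sketch shows the textbook definitions computed from per-bin good/bad counts, followed by the descending sort used to pick the top features. It is an illustration, not Shifu's stats code: the function names, the `eps` smoothing for empty bins, and reporting KS as a fraction rather than a percentage are all assumptions, and both classes are assumed to be non-empty.

```python
import math


def ks_and_iv(good_counts, bad_counts, eps=1e-10):
    """Compute KS and IV for one feature from its per-bin good/bad counts.

    KS: maximum absolute gap between the cumulative good and bad distributions.
    IV: sum over bins of (good_rate - bad_rate) * ln(good_rate / bad_rate).
    Assumption: eps avoids log(0) and division by zero for empty bins.
    """
    total_good = sum(good_counts)
    total_bad = sum(bad_counts)
    cum_good = cum_bad = 0.0
    ks = 0.0
    iv = 0.0
    for good, bad in zip(good_counts, bad_counts):
        good_rate = good / total_good
        bad_rate = bad / total_bad
        cum_good += good_rate
        cum_bad += bad_rate
        ks = max(ks, abs(cum_good - cum_bad))
        iv += (good_rate - bad_rate) * math.log((good_rate + eps) / (bad_rate + eps))
    return ks, iv


def top_n(metric_by_feature, n):
    """Sort feature names by their metric in descending order and keep the first n."""
    return sorted(metric_by_feature, key=metric_by_feature.get, reverse=True)[:n]


# Example: three bins where goods concentrate in the last bin and bads in the first.
ks, iv = ks_and_iv([10, 30, 60], [60, 30, 10])
```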
Sensitivity analysis (SE or ST) works well for variable selection with neural network models. It proceeds as follows:
- Train a model first.
- For each instance in the training data, drop one feature at a time and compute a new score, then compute the difference between that score and the original score (SE) or the target value (ST).
- Over all training data, compute the mean and standard deviation of the differences for each feature, and sort the features by mean in descending order.
- Remove the fraction of features configured by the user (filterOutRatio, for example 5%).
- Repeat the cycle until the final number of features is reached.
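One round of the cycle above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not Shifu's implementation: the trained model is represented by a hypothetical callable `score(row)`, "dropping" a feature is approximated by replacing it with a neutral value, the difference is taken as an absolute value, the features with the smallest mean difference are the ones removed, and at least one feature is removed per round.

```python
from statistics import mean, pstdev


def sensitivity_ranking(score, rows, targets=None, neutral=0.0):
    """Rank feature indices by the mean score change when each one is dropped.

    targets=None compares with the original score (SE style);
    otherwise the new score is compared with the target value (ST style).
    Returns (feature_index, mean_diff, stddev_diff) sorted by mean, descending.
    """
    references = [score(row) for row in rows] if targets is None else list(targets)
    ranking = []
    for j in range(len(rows[0])):
        diffs = []
        for row, ref in zip(rows, references):
            dropped = list(row)
            dropped[j] = neutral  # assumption: "drop" means neutralising the value
            diffs.append(abs(score(dropped) - ref))
        ranking.append((j, mean(diffs), pstdev(diffs)))
    ranking.sort(key=lambda item: item[1], reverse=True)
    return ranking


def drop_least_sensitive(ranking, filter_out_ratio):
    """Keep all but the lowest-ranked fraction of features (at least one is removed)."""
    n_remove = max(1, round(len(ranking) * filter_out_ratio))
    return [j for j, _, _ in ranking[:len(ranking) - n_remove]]


# Toy model: feature 0 matters most, feature 2 not at all.
toy_score = lambda row: 2 * row[0] + 0.1 * row[1] + 0 * row[2]
ranking = sensitivity_ranking(toy_score, [[1, 1, 1], [2, 2, 2]])
kept = drop_least_sensitive(ranking, 0.05)
```

In a full run, the model would be retrained on the kept features and the round repeated until the target number of features is reached.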
A:
Solution 1: Select variables by feature importance (FI).
Solution 2: Set the training parameters to NN, do feature selection by sensitivity analysis, and then change the training parameters to the GBT/RF settings.
A: First do a coarse feature selection by KS/IV to keep 2000 features, then use sensitivity analysis to filter out 0.05 of the features in each round.
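Because every sensitivity analysis round needs a trained model, it can help to estimate how many rounds the second stage takes. The Python sketch below counts them; the function name `rounds_to_reach` and the target count in the example are hypothetical, and the per-round rounding (nearest integer, at least one feature removed) is an assumption.

```python
def rounds_to_reach(start_count, target_count, filter_out_ratio):
    """Count the filtering rounds needed to go from start_count to target_count.

    Assumption: each round removes round(count * ratio) features, at least one.
    """
    count, rounds = start_count, 0
    while count > target_count:
        removed = max(1, round(count * filter_out_ratio))
        count -= removed
        rounds += 1
    return rounds


# Example: start from the 2000 features kept by KS/IV; 200 is a hypothetical target.
estimated_rounds = rounds_to_reach(2000, 200, 0.05)
```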