This repository contains the code for the empirical evaluation in the paper *Pessimism: Offline Policy Optimization in Contextual Bandits*. We implement several offline policy optimization methods, combining inverse probability weighting (IPW) and doubly robust (DR) estimators, policy-gradient-based and linear-regression-based cost-sensitive classification oracles, and pseudo-loss and sample-variance regularizers.
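For orientation, here is a minimal sketch of the estimators and the regularized objective in standard notation (the paper's exact definitions and constants are authoritative). Given logged tuples $(x_i, a_i, c_i)$ collected by a logging policy $\mu$, and a cost model $\hat\eta(x, a)$ for the DR estimator:

$$
\hat{C}_{\mathrm{IPW}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\, c_i,
\qquad
\hat{C}_{\mathrm{DR}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \Big[ \sum_{a} \pi(a \mid x_i)\,\hat\eta(x_i, a) + \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \big(c_i - \hat\eta(x_i, a_i)\big) \Big].
$$

The learned policy minimizes a pessimistic, regularized objective

$$
\hat{\pi} = \arg\min_{\pi \in \Pi}\ \hat{C}(\pi) + \lambda\, \mathcal{R}(\pi),
$$

where $\mathcal{R}$ is either the pseudo-loss or the sample-variance regularizer and $\lambda \ge 0$ controls the degree of pessimism.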
## Setup

Make sure conda is installed, then create and activate the environment:

```bash
conda env create -f environment.yml
source activate cb-learn   # or `conda activate cb-learn` on newer conda versions
```
## Discrete-action experiments

Set the experiment parameters in `./scripts_discrete/exp_params.py`.
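The parameter file controls the experiment sweep. Below is a purely illustrative sketch of the kind of settings such a file typically holds; the names and values are hypothetical, so consult the actual `exp_params.py` before editing:

```python
# Hypothetical contents; the repo's actual exp_params.py is authoritative.
datasets = ["ecoli", "glass", "letter"]        # placeholder dataset names
n_replicates = 10                              # random seeds per configuration
logging_temperature = 1.0                      # softness of the logging policy
regularizer_weights = [0.0, 0.01, 0.1, 1.0]    # pseudo-loss / variance weights
```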
Prepare the datasets:

```bash
python ./scripts_discrete/prepare_data.py
```

On a cluster with the Slurm workload manager, simulate bandit feedback:

```bash
python ./scripts_discrete/run_simulate_bandit_feedback.py
```
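The `run_*.py` launchers assume Slurm is available. If your cluster differs, the submission pattern is straightforward to adapt. Here is a minimal, hypothetical sketch of submitting one job per configuration via `sbatch`; the job names, resources, inner script name, and flags are illustrative, not the repo's actual interface:

```python
import subprocess

# Hypothetical sketch: one Slurm job per (dataset, seed) configuration.
# The repo's actual launcher may use different scripts, flags, and resources.
datasets = ["ecoli", "glass", "letter"]  # placeholder dataset names
seeds = range(10)

for dataset in datasets:
    for seed in seeds:
        job = f"""#!/bin/bash
#SBATCH --job-name=simulate_{dataset}_{seed}
#SBATCH --time=01:00:00
#SBATCH --mem=4G
python ./scripts_discrete/simulate_bandit_feedback.py --dataset {dataset} --seed {seed}
"""
        # sbatch accepts the job script on stdin.
        subprocess.run(["sbatch"], input=job, text=True, check=True)
```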
Then, again on the Slurm cluster, run offline policy optimization:

```bash
python ./scripts_discrete/run_OPO.py
```
Perform model selection:

```bash
python ./scripts_discrete/model_selection.py
```
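For intuition, model selection amounts to picking, for each method, the hyperparameter setting with the best estimated validation performance. A purely hypothetical sketch follows; the actual script's inputs, file formats, and selection criterion may differ:

```python
# Hypothetical sketch: pick the regularizer weight with the lowest estimated
# validation cost per method. Names are illustrative, not the repo's actual
# data structures.
validation_cost = {
    ("IPW-pseudo-loss", 0.01): 0.42,
    ("IPW-pseudo-loss", 0.10): 0.38,
    ("IPW-pseudo-loss", 1.00): 0.45,
}

best = {}
for (method, lam), cost in validation_cost.items():
    if method not in best or cost < best[method][1]:
        best[method] = (lam, cost)

for method, (lam, cost) in best.items():
    print(f"{method}: best lambda={lam} (validation cost {cost:.2f})")
```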
Finally, plot the improvement figure and generate the result tables:

```bash
python ./scripts_discrete/plot_improvement_figure.py
python ./scripts_discrete/generate_table.py
python ./scripts_discrete/transform_table.py
```
## Continuous-action experiments

The continuous-action pipeline mirrors the discrete one. Set the experiment parameters in `./scripts_continuous/exp_params.py`.
Prepare the datasets:

```bash
python ./scripts_continuous/prepare_data.py
```

On a cluster with the Slurm workload manager, simulate bandit feedback:

```bash
python ./scripts_continuous/run_simulate_bandit_feedback.py
```

Then, again on the Slurm cluster, run offline policy optimization:

```bash
python ./scripts_continuous/run_OPO.py
```

Perform model selection:

```bash
python ./scripts_continuous/model_selection.py
```

Finally, plot the improvement figure and generate the result tables:

```bash
python ./scripts_continuous/plot_improvement_figure.py
python ./scripts_continuous/generate_table.py
python ./scripts_continuous/transform_table.py
```