This repository contains the code for the paper "Large Language Models for Integrating Social Determinant of Health Data: A Case Study on Heart Failure 30-Day Readmission Prediction".
Heart failure 30-day hospital readmission prediction (HF_readmission_prediction):
python 3.9
imblearn==0.0
joblib==1.2.0
numpy==1.24.4
pandas==2.0.0
pymongo==4.7.0
scikit_learn==1.4.2
shap==0.45.0
tqdm==4.65.0
xgboost==1.7.6
LLMs to annotate SDOH variables (LLM_SDOH_annotation):
python 3.9
datasets==2.11.0
huggingface_hub==0.17.3
numpy==1.24.4
pandas==2.0.0
peft==0.10.0
torch==2.0.0
tqdm==4.65.0
transformers==4.34.1
The social determinants of health (SDOH) datasets used in this study can be found below:
Dataset | Number of SDOH variables used |
---|---|
NaNDA | 223 |
AHRQ SDOHD | 506 |
For zero-shot and 1-shot inference of SDOH domains for AHRQ and NaNDA variables, use the commands in the LLM_SDOH_annotation/commands folder.
For example, the following command performs one round of inference:
python general_LLM_inference_rel_extraction_col_type.py --base_model='meta-llama/Llama-2-7b-chat-hf' --feat_set='a' --num_shots=0 --input_data_file='INPUT_AHRQ_tract_2010-2018.csv' --output_data_file='a_zeroshot_llama7b-chat_domain_AHRQ_outputs.csv'
- Language model: Llama-2-7b-chat-hf
- Feature set: A (SDOH variable name)
- Number of shots (inference): 0 (i.e., zero-shot)
- Input file: AHRQ variables
- Output file (optional): named automatically from the other arguments
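To run both the zero-shot and 1-shot settings without retyping the arguments, the command can be assembled programmatically. The sketch below is only an illustration: `build_command` is a hypothetical helper, not part of this repo, and it uses only the flags shown in the example above. It omits `--output_data_file` so that the script names the output itself.

```python
import shlex

# Inference script from the example command above.
SCRIPT = "general_LLM_inference_rel_extraction_col_type.py"


def build_command(base_model, feat_set, num_shots, input_data_file):
    """Return the argument list for one inference run (hypothetical helper)."""
    return [
        "python",
        SCRIPT,
        f"--base_model={base_model}",
        f"--feat_set={feat_set}",
        f"--num_shots={num_shots}",
        f"--input_data_file={input_data_file}",
    ]


# Zero-shot and 1-shot runs on the AHRQ variables with feature set A.
commands = [
    build_command(
        "meta-llama/Llama-2-7b-chat-hf",
        "a",
        shots,
        "INPUT_AHRQ_tract_2010-2018.csv",
    )
    for shots in (0, 1)
]

for cmd in commands:
    # Print the shell-quoted command; to execute it instead, pass the
    # list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the arguments before starting a long inference run.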
The patient dataset is unavailable for privacy reasons. However, the following commands demonstrate the steps we used to train and evaluate binary classification models (using clinical and public SDOH data):
To train binary classification models for HF 30-day hospital readmission prediction (choose the classification algorithm and features inside the file):
python bal_allfeats_nosmote_sgs_evaluate_baselines_nestKfold.py
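The script name suggests nested K-fold cross-validation. As a rough illustration of that general technique, and not the repo's actual implementation, the sketch below generates nested train/test index splits using only the standard library. The function names, fold counts, and seed are assumptions, and the splits are not stratified by class.

```python
import random


def kfold_indices(indices, k, seed=0):
    """Shuffle `indices` and yield (train, test) index lists for k folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def nested_kfold(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    The inner splits partition only the outer training indices, so the
    outer test fold is never seen during model selection.
    """
    for outer_train, outer_test in kfold_indices(range(n_samples), outer_k, seed):
        inner_splits = list(kfold_indices(outer_train, inner_k, seed + 1))
        yield outer_train, outer_test, inner_splits


splits = list(nested_kfold(n_samples=20, outer_k=5, inner_k=3))
for outer_train, outer_test, inner_splits in splits:
    print(len(outer_train), len(outer_test), len(inner_splits))
```

In a typical setup, the inner splits are used to pick hyperparameters and the outer test fold is used once to estimate performance for the selected model.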
To analyze the results of the HF models:
python sgs_analyze_baseline.py