Rx2Gantt is a Python project that extracts, processes, and visualizes medication data from PDF files. It reads medical prescription records from Taipei Veterans General Hospital and generates structured summaries and Gantt charts that show medication usage over time.
- PDF Data Extraction: Extracts prescription details from PDF files.
- Data Cleaning and Processing: Cleans and preprocesses extracted data, including handling dates, merging rows, and standardizing drug names.
- Drug Classification: Enriches the data with drug classifications such as MOA (Mechanism of Action), EPC (Established Pharmacologic Class), and PE (Physiologic Effect) via the RxNav API.
- Gantt Chart Visualization: Creates a detailed Gantt chart showing medication timelines, dosages, and frequencies.
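As an illustration of the drug classification step, here is a minimal sketch of looking up MOA, EPC, and PE classes for a drug name through the public RxClass endpoint of RxNav. The function names (`build_url`, `parse_classes`, `classify`) are hypothetical and are not taken from drug_classifier.py, and the sketch uses the standard library's `urllib` instead of `requests` so that it has no dependencies. The response field names follow the RxClass JSON format as I understand it and should be checked against the RxNav documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; the project's actual base URL is set in config.py.
RXCLASS_URL = "https://rxnav.nlm.nih.gov/REST/rxclass/class/byDrugName.json"

# Class types kept for the summary; other types in the response are ignored.
CLASS_TYPES = ("EPC", "MOA", "PE")


def build_url(drug_name):
    """Return the lookup URL for one drug name."""
    return f"{RXCLASS_URL}?{urlencode({'drugName': drug_name})}"


def parse_classes(payload):
    """Collect unique class names per class type from a decoded response."""
    result = {class_type: [] for class_type in CLASS_TYPES}
    infos = payload.get("rxclassDrugInfoList", {}).get("rxclassDrugInfo", [])
    for info in infos:
        item = info.get("rxclassMinConceptItem", {})
        class_type = item.get("classType")
        name = item.get("className")
        if class_type in result and name and name not in result[class_type]:
            result[class_type].append(name)
    return result


def classify(drug_name, timeout=10):
    """Query RxNav over the network and return the parsed classes."""
    with urlopen(build_url(drug_name), timeout=timeout) as response:
        return parse_classes(json.load(response))
```

Calling `classify("aspirin")` needs network access. Keeping `parse_classes` separate from the request makes the parsing testable offline with a saved response.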
- Clone the repository:
    git clone https://github.com/jimchen1551/Rx2Gantt.git
    cd Rx2Gantt
- Install the required dependencies:
    pip install -r requirements.txt
- Update the configuration:
  - Open config.py and update the INPUT_FOLDER variable to point to the directory containing your PDF files.
- Place your PDF files in the folder specified in the INPUT_FOLDER variable.
- Run the project:
    python main.py
- Outputs:
  - CSV File: A summary CSV file for each PDF, saved in the summary folder within the same directory as the input PDF.
  - Gantt Chart: A visual representation of the medication timeline, saved as a PNG file in the gantt folder.
Rx2Gantt/
├── examples # Example output files
├── config.py # Configuration settings
├── main.py # Main entry point for the program
├── pdf_processor.py # PDF extraction logic
├── data_cleaner.py # Data cleaning and preprocessing logic
├── drug_classifier.py # Drug classification using RxNav API
├── gantt_visualizer.py # Gantt chart visualization logic
├── requirements.txt # Python dependencies
└── README.md # Project documentation
The config.py file contains customizable settings, including:
- Input Folder: The directory containing PDF files to process.
- Column Boundaries: x-coordinate boundaries for column detection in PDFs.
- RxNav API URL: Base URL for drug classification.
- Logging Level: Adjust logging verbosity (default: INFO).
- Chart Color Scheme: Matplotlib color scheme for the Gantt chart.
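To make the settings above concrete, here is a hedged sketch of what such a configuration might look like, together with one way x-coordinate boundaries can be used to assign extracted words to columns. Only INPUT_FOLDER is a name taken from this README; every other name and value (COLUMN_BOUNDARIES, COLUMN_NAMES, RXNAV_API_URL, LOG_LEVEL, COLOR_MAP, `column_for`, `words_to_row`) is an assumption made for illustration and may not match the real config.py.

```python
import logging
from bisect import bisect_right

# Hypothetical settings; adjust to match the real config.py.
INPUT_FOLDER = "./pdfs"
COLUMN_BOUNDARIES = [0, 120, 300, 380, 460]  # left edge of each column, in PDF points
COLUMN_NAMES = ["date", "drug", "dose", "frequency", "days"]
RXNAV_API_URL = "https://rxnav.nlm.nih.gov/REST"
LOG_LEVEL = logging.INFO
COLOR_MAP = "tab20"  # name of a Matplotlib colormap


def column_for(x0):
    """Return the name of the column whose left edge is at or before x0."""
    index = bisect_right(COLUMN_BOUNDARIES, x0) - 1
    return COLUMN_NAMES[max(index, 0)]


def words_to_row(words):
    """Group (x0, text) pairs from one visual line into a column dict."""
    row = {name: "" for name in COLUMN_NAMES}
    for x0, text in sorted(words):
        name = column_for(x0)
        row[name] = f"{row[name]} {text}".strip()
    return row
```

With these example boundaries, `words_to_row([(10, "2024-01-05"), (130, "Aspirin"), (175, "100mg")])` places the date in the `date` column and joins the other two words in the `drug` column. If a PDF layout differs, the boundaries are the values to tune.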
- Python 3.8 or higher
- fitz (PyMuPDF) for PDF processing
- pandas for data handling
- matplotlib for Gantt chart visualization
- requests for RxNav API calls
Install dependencies with:
pip install -r requirements.txt
A sample CSV output includes enriched data with classifications:
Drug name, EPC, MOA, PE, DDI, SE
Aspirin, Cyclooxygenase Inhibitor, Platelet Aggregation Inhibition, Analgesic Effect, ,
The Gantt chart visually displays medication timelines, highlighting:
- Start and stop dates of each medication
- Dosage and frequency annotations
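The following is a minimal sketch of how such a chart could be drawn with Matplotlib, not the code in gantt_visualizer.py. The record layout `(drug, start, stop, note)` and the function names `bar_spans` and `plot_gantt` are assumptions for illustration. Each medication becomes one horizontal bar from its start date to its stop date, and the dosage and frequency note is appended to the row label.

```python
from datetime import date


def bar_spans(records):
    """Turn (drug, start, stop, note) records into (label, start, days) bars."""
    bars = []
    for drug, start, stop, note in sorted(records, key=lambda r: (r[1], r[0])):
        days = (stop - start).days + 1  # count the stop date as a treatment day
        bars.append((f"{drug} ({note})", start, days))
    return bars


def plot_gantt(records, path):
    """Draw one horizontal bar per medication and save the chart as an image."""
    # Imported here so bar_spans can be used without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render off-screen, no display required
    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt

    bars = bar_spans(records)
    fig, ax = plt.subplots(figsize=(8, 0.6 * len(bars) + 1))
    for row, (label, start, days) in enumerate(bars):
        ax.barh(row, days, left=mdates.date2num(start), height=0.5)
    ax.set_yticks(range(len(bars)))
    ax.set_yticklabels([label for label, _, _ in bars])
    ax.invert_yaxis()   # earliest medication at the top
    ax.xaxis_date()     # show the x axis as calendar dates
    fig.autofmt_xdate()
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


example = [
    ("Metformin", date(2024, 1, 10), date(2024, 1, 16), "500 mg BID"),
    ("Aspirin", date(2024, 1, 1), date(2024, 1, 28), "100 mg QD"),
]
```

Calling `plot_gantt(example, "gantt.png")` would write the chart to a PNG file. The drug names, dates, and doses in `example` are made-up illustration data, not output from the project.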
We welcome contributions! To contribute:
- Fork the repository.
- Create a feature branch: git checkout -b feature-name
- Commit your changes: git commit -m "Add new feature"
- Push to the branch: git push origin feature-name
- Open a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
- RxNav API: https://rxnav.nlm.nih.gov
- Matplotlib for charting
For questions or feedback, please open an issue on the repository or contact the project maintainer.