CDPH Scrape

These scripts scrape the monthly California Department of Public Health (CDPH) arbovirus case updates into CSV files for easy analysis.

  • pdfDownload.py: Downloads PDFs defined in sources.py to a directory.
  • pdfScrape.py: Extracts the tables from the PDFs and strips parenthesized data and notes, which are redundant with the counts.
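
The cleanup step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in pdfScrape.py: the helper names strip_parentheses and scrape_tables and the regular expression are hypothetical, and only camelot.read_pdf and the .df attribute of its tables come from the camelot library.

```python
import re

# Assumed pattern: a parenthesized group plus any whitespace before it.
PAREN = re.compile(r"\s*\([^)]*\)")


def strip_parentheses(cell):
    """Remove parenthesized text from one table cell and trim whitespace."""
    return PAREN.sub("", str(cell)).strip()


def scrape_tables(pdf_path):
    """Return every table found in a PDF as a cleaned pandas DataFrame."""
    import camelot  # imported lazily so the helper above works without it

    cleaned = []
    for table in camelot.read_pdf(pdf_path, pages="all"):
        df = table.df.copy()  # camelot exposes each table as a DataFrame of strings
        for column in df.columns:
            df[column] = df[column].map(strip_parentheses)
        cleaned.append(df)
    return cleaned
```

Each returned DataFrame could then be written out with its to_csv method. Whether the real script handles nested parentheses or multi-line notes is not stated here, so this sketch only covers the simple case.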

Use

Currently, paths must be edited directly in each script; a wrapper to run the whole pipeline from bash is planned. To parse the tables from the PDFs, run the scripts in the order below, making sure that PATH_O in pdfDownload.py matches PATH_I in pdfScrape.py:

python pdfDownload.py
python pdfScrape.py
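
Until the planned wrapper exists, the two steps could be chained from a small Python driver along these lines. This is only a sketch: the function name run_pipeline and the SCRIPTS constant are hypothetical, and it assumes both scripts sit in the current working directory with their paths already edited to match.

```python
import subprocess
import sys

# Script names are taken from this README; the order matters because
# pdfScrape.py reads the PDFs that pdfDownload.py writes.
SCRIPTS = ("pdfDownload.py", "pdfScrape.py")


def run_pipeline(scripts=SCRIPTS):
    """Run each script with the current interpreter, stopping on failure."""
    for script in scripts:
        # check=True raises CalledProcessError if a script exits non-zero,
        # so the scrape never runs after a failed download.
        subprocess.run([sys.executable, script], check=True)
```

Calling run_pipeline() from the repository directory would then be equivalent to running the two commands above by hand.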

Dependencies

To install the required dependencies, run:

pip install camelot-py pandas

Authors


Héctor M. Sánchez C., Tomás León