Spark application that creates a machine learning model for predicting the arrival delay of commercial flights
First, install the requirements:
python -m pip install -r requirements.txt
To run locally, execute: spark-submit --master local[*] FlightDelay.py PATH SAMPLE LOG
- PATH is the location of the CSV files; default: data/*.csv
- SAMPLE is the fraction (between 0 and 1) of the original data set to sample, e.g. 0.1 is 10%; default: 1.0 (100%)
- LOG is the log level: INFO, WARN, ERROR; default: WARN
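The three positional arguments and their defaults can be sketched in Python as follows. This is only an illustration of the behaviour described above, not the code in FlightDelay.py: the helper name `parse_args`, the validation rules, and the error messages are assumptions.

```python
# Hypothetical helper: FlightDelay.py may parse its arguments differently.
VALID_LOG_LEVELS = {"INFO", "WARN", "ERROR"}


def parse_args(argv):
    """Return (path, sample, log_level) from the positional arguments.

    argv is the list of arguments after the script name, e.g. sys.argv[1:].
    Missing arguments fall back to the defaults documented in this README.
    """
    path = argv[0] if len(argv) > 0 else "data/*.csv"
    sample = float(argv[1]) if len(argv) > 1 else 1.0
    log_level = argv[2].upper() if len(argv) > 2 else "WARN"

    # Assumed validation: the README only states the allowed ranges.
    if not 0.0 <= sample <= 1.0:
        raise ValueError("SAMPLE must be a fraction between 0 and 1")
    if log_level not in VALID_LOG_LEVELS:
        raise ValueError("LOG must be one of INFO, WARN, ERROR")
    return path, sample, log_level
```

For example, `parse_args(["data/2000", "0.1"])` would use the given path, sample 10% of the data, and keep the default WARN log level.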
You can download the data here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7
- Execution example with PATH
%SPARK_HOME%\bin\spark-submit --master local[*] FlightDelay.py file:///C:\UPM\big_data_assignments\data\2000
- Execution example with PATH and SAMPLE
%SPARK_HOME%\bin\spark-submit --master local[*] FlightDelay.py file:///C:\UPM\FlightDelaySpark\data 0.1
- Execution example with PATH, SAMPLE and LOG
%SPARK_HOME%\bin\spark-submit --master local[*] FlightDelay.py file:///C:\UPM\FlightDelaySpark\data 0.05 ERROR