Skip to content

experiments to test accuracy of skewed training dataset on worker nodes in a distributed setting (xgboost)

Notifications You must be signed in to change notification settings

andrewlawhh/xgb-distributed-skew-test

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Investigate effects of data skew across worker machines on distributed XGBoost

Directory Structure

The ./predictions/ directory contains predictions and evaluations of each setting.

The ./models/ directory contains XGBoost model binaries as well as human readable representations of each decision tree.

Experiment Information and Methods

This dataset was downloaded from the UCI Machine Learning Repository

Each model was trained using the same training parameters, found in the training function of ./scripts/train_model.py and evaluated by the same metrics, namely (# wrong predictions)/(# total predictions).

The 'skew' on the non-categorical variables was performed by first sorting the dataset by that column, and then partitioning to each worker machine in blocks of (1/num_workers), where num_workers in this case was 5.

The code for 'skew'-ing the data can be found in ./scripts/sort_training_data_by_columns.py, and the code for partitioning to each workercan be found in ./scripts/train_model.py

The skew on the categorical variables was performed by matching each worker node to a different value that the set could take on. For instance, in this dataset, 'month' was a categorical variable. I set up 12 worker machines that each took all the rows that corresponded to their respective months. ie. worker 1 took all rows with month == January, worker 2 took all rows with month == February, etc.

Results

Control (No skew)

eval-error:0.245060

{'false_positives': 490, 'false_negatives': 2834, 'total predictions': 13564}

Label Skew

eval-error:0.246461

{'false_positives': 501, 'false_negatives': 2842, 'total predictions': 13564}

Age Skew

eval-error:0.249115

{'false_positives': 465, 'false_negatives': 2914, 'total predictions': 13564}

Balance Skew

eval-error:0.246093

{'false_positives': 479, 'false_negatives': 2859, 'total predictions': 13564}

Campaign Skew

eval-error:0.246166

{'false_positives': 507, 'false_negatives': 2832, 'total predictions': 13564}

Default Skew

eval-error:0.249853

{'false_positives': 434, 'false_negatives': 2955, 'total predictions': 13564}

Duration Skew

eval-error:0.251401

{'false_positives': 462, 'false_negatives': 2948, 'total predictions': 13564}

Housing Skew

eval-error:0.244618

{'false_positives': 465, 'false_negatives': 2853, 'total predictions': 13564}

Loan Skew

eval-error:0.247272

{'false_positives': 429, 'false_negatives': 2925, 'total predictions': 13564}

PDays Skew

eval-error:0.244471

{'false_positives': 509, 'false_negatives': 2807, 'total predictions': 13564}

POutcome Skew

eval-error:0.249926

{'false_positives': 487, 'false_negatives': 2903, 'total predictions': 13564}

Previous Skew

eval-error:0.244471

{'false_positives': 509, 'false_negatives': 2807, 'total predictions': 13564}


Categorical Skews

Job Skew (12 Worker Machines + 1 Tracker)

eval-error:0.244176

{'false_positives': 403, 'false_negatives': 2909, 'total predictions': 13564}

Marital Skew (3 Worker Machines + 1 Tracker)

eval-error:0.246387

{'false_positives': 471, 'false_negatives': 2871, 'total predictions': 13564}

Education Skew (4 Worker Machines + 1 Tracker)

eval-error:0.249705

{'false_positives': 451, 'false_negatives': 2936, 'total predictions': 13564}

Contact Skew (3 Worker Machines + 1 Tracker)

eval-error:0.247051

{'false_positives': 450, 'false_negatives': 2901, 'total predictions': 13564}

About

experiments to test accuracy of skewed training dataset on worker nodes in a distributed setting (xgboost)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published