A classifier that distinguishes quasars from variable transients.
For the SDSS Stripe 82 quasar-targeted dataset, the data repository is here: https://www.kaggle.com/sherrysheng97/sdss-stripe82-quasar-targeted-dataset
For the PLAsTiCC dataset, the data repository is here: https://www.kaggle.com/c/PLAsTiCC-2018
- TensorFlow 2.1.0
- numpy 1.17.2
- pandas 0.25.1
- feets 0.4
- glob 1.2.0
- sklearn 0.23.2
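One possible way to install the pinned dependencies in a single command (a sketch only; the PyPI package name for sklearn is scikit-learn, and glob ships with the Python standard library, so it is not installed separately):

```bash
pip install tensorflow==2.1.0 numpy==1.17.2 pandas==0.25.1 feets==0.4 scikit-learn==0.23.2
```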
Jump to the Train your classifier section, adjust the `configs.txt` file, and run `train.py`.
A Kaggle notebook is strongly recommended for running all of the code: https://www.kaggle.com/sherrysheng97/quasar-classifier-sdss-plasticc
All training and test data are provided. You just need to modify the configuration settings, and then click 'run all'.
In the `train/` folder, `configs.txt` is used to design the architecture of the classifier. After setting the configuration, run the `train.py` file to train and test the classifier.
```bash
python train.py
```
| Config type | Parameter | Explanation | Example |
| --- | --- | --- | --- |
| input config | train_path | the path of the input data file | ../data/processed/unbalanced/final_v1.csv |
| | save_path | the folder that saves all results | results |
| | seed | the seed for random number generation | 1 |
| | features | the bands/features used in training. All features: g, r, i, z, u, g_error, r_error, i_error, z_error, u_error | g,r,i |
| | format | the input format for training. Three formats are provided: simple, group, season | group |
| | processed | the preprocessing method for the input data. Three methods are provided: s (standardization), n (normalization), d (difference between neighboring data points) | s |
| | set_GPR | whether to apply Gaussian Process Regression, which generates a new regressed light curve for each group of an object's light curve | True |
| | group_size | the number of days in each group | 67 |
| | group_num | the number of groups for each object | 7 |
| | cut_fraction | for prediction, the fraction of data dropped from each group; used to test how accuracy improves with more complete data | 0.1 or empty |
| network config | rnn_type | the type of RNN layer. Three types are provided: LSTM, GRU, Simple | LSTM |
| | hidden_layers | a list of the hidden layers' neuron counts | [256,256,256,256,256,256] |
| | dropout | the fraction of units dropped before being fed into the next layer, to avoid overfitting | 0.25 |
| | plot_model | whether to plot the model architecture | True |
| train config | batch_size | the number of sequences fed into the network at each step | 32 |
| | num_epochs | the number of times the input data is processed | 10 |
| | test_fraction | the fraction of all data used as the test set | 0.2 |
| | optimizer | the optimization method for the loss function. Two options are preferred: Adam, SGD | Adam |
| | learning_rate | a tuning parameter of the optimization algorithm that determines the step size at each iteration while moving toward a minimum of the loss function | 0.001 |
| | decay | whether the learning rate decreases as the number of epochs increases. If True, decay_value = learning_rate/num_epochs | True |
| | metrics | the metrics used during training to evaluate the classifier's performance | accuracy,AUC,f1_score |
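For orientation, the example values above might be collected into `configs.txt` roughly as sketched below. The exact key/value syntax expected by `train.py` is defined by the repository's own `configs.txt`, so treat this only as an illustration of which settings belong together; the `key = value` layout is an assumption.

```text
# input config
train_path = ../data/processed/unbalanced/final_v1.csv
save_path = results
seed = 1
features = g,r,i
format = group
processed = s
set_GPR = True
group_size = 67
group_num = 7
cut_fraction = 0.1

# network config
rnn_type = LSTM
hidden_layers = [256,256,256,256,256,256]
dropout = 0.25
plot_model = True

# train config
batch_size = 32
num_epochs = 10
test_fraction = 0.2
optimizer = Adam
learning_rate = 0.001
decay = True
metrics = accuracy,AUC,f1_score
```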
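The three `processed` options correspond to standard transformations of a light curve. The snippet below is a minimal sketch of what each option could do to a single array of flux/magnitude values; it assumes NumPy and is not the repository's actual preprocessing code.

```python
import numpy as np

def preprocess(values, method="s"):
    """Illustrative version of the 's', 'n', and 'd' preprocessing options."""
    values = np.asarray(values, dtype=float)
    if method == "s":        # standardization: zero mean, unit variance
        return (values - values.mean()) / values.std()
    if method == "n":        # normalization: rescale to [0, 1]
        return (values - values.min()) / (values.max() - values.min())
    if method == "d":        # difference between neighboring data points
        return np.diff(values)
    raise ValueError("method must be 's', 'n', or 'd'")

# Example: a short g-band segment
print(preprocess([18.2, 18.5, 18.1, 18.9], method="d"))   # -> [ 0.3 -0.4  0.8]
```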
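Similarly, the network and train config entries map onto a fairly standard recurrent model in `tf.keras`. The sketch below shows one way `rnn_type`, `hidden_layers`, `dropout`, `learning_rate`, and `decay` could translate into TensorFlow 2.1 code; the input shape (three features: g, r, i), the binary sigmoid output, and the loss are assumptions, and the model actually built by `train.py` may differ.

```python
import tensorflow as tf

# Example values taken from the table above; shapes are assumptions
rnn_type = "LSTM"                                   # "LSTM", "GRU", or "Simple"
hidden_layers = [256, 256, 256, 256, 256, 256]
dropout = 0.25
learning_rate = 0.001
num_epochs = 10
decay_value = learning_rate / num_epochs            # applied when decay = True

layer_cls = {"LSTM": tf.keras.layers.LSTM,
             "GRU": tf.keras.layers.GRU,
             "Simple": tf.keras.layers.SimpleRNN}[rnn_type]

model = tf.keras.Sequential()
for i, units in enumerate(hidden_layers):
    kwargs = {"return_sequences": i < len(hidden_layers) - 1}
    if i == 0:
        kwargs["input_shape"] = (None, 3)           # variable-length sequences of g, r, i
    model.add(layer_cls(units, **kwargs))
    model.add(tf.keras.layers.Dropout(dropout))     # drop a fraction of units between layers
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # quasar vs. other variable/transient

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate, decay=decay_value),
    loss="binary_crossentropy",
    # f1_score is not a built-in Keras metric and would need a custom implementation
    metrics=["accuracy", tf.keras.metrics.AUC()],
)
model.summary()
```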