This is the official repository and implementation associated with the paper Representation learning for time-domain high-energy astrophysics: Discovery of extragalactic fast X-ray transient XRT 200515 (Dillmann et al. 2024).
We present novel event file representations and the first representation-learning-based anomaly detection approach for the discovery of high-energy transients. This involves extracting features from the representations using principal component analysis or an autoencoder, followed by dimensionality reduction and clustering (see the sketch below). By associating the resulting clusters with previously identified transients and performing nearest-neighbor searches, we create a catalog of X-ray transient candidates. This novel transient detection method for time-domain high-energy astrophysics is applicable to data from other high-energy observatories such as XMM-Newton, Swift XRT, eROSITA and the Einstein Probe.
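The pipeline boils down to three steps: feature extraction, dimensionality reduction, and clustering. Below is a minimal, illustrative sketch of these steps using scikit-learn with placeholder data and example hyperparameters; the repository scripts described later in this README implement them with the configurations used in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# Placeholder: one flattened event file representation per row (e.g. 16x24 E-t Maps).
representations = np.random.rand(1000, 16 * 24)

# 1. Feature extraction (PCA here; the pipeline alternatively uses an autoencoder).
features = PCA(n_components=15).fit_transform(representations)

# 2. Dimensionality reduction to a 2D embedding with t-SNE.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)

# 3. Density-based clustering of the embedding with DBSCAN (-1 marks noise points).
clusters = DBSCAN(eps=2.0, min_samples=5).fit_predict(embedding)
```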
We denote the embedding results from the different event file representations and feature extractions as in the table below.
Case | Event File Representation | Feature Extraction
---|---|---
$2D-PCA$ | $E-t$ Maps | Principal Component Analysis (15 components)
$3D-PCA$ | $E-t-dt$ Cubes | Principal Component Analysis (22 components)
$2D-AE$ | $E-t$ Maps | Autoencoder, convolutional (12 latent features)
$3D-AE$ | $E-t-dt$ Cubes | Autoencoder, fully-connected (24 latent features)
We provide the project datasets in this Google Drive Folder. It includes the following folders and files:
This folder includes the input data for the transient detection pipeline and additional event file properties. Include these files in the `ml-xraytransients/datasets` directory if you want to reproduce the published results from scratch.

- `eventfiles_table.csv`: A table including all event files used in this project. To filter for a single event file, use the `obsreg_id` column (see the loading sketch after this list).
- `properties_table.csv`: A table listing the variability and hardness ratio properties for all event files, retrieved from the Chandra Source Catalog.
- `properties_full_table.csv`: A table listing further properties for all event files, retrieved from the Chandra Source Catalog.
- `bonafide_transients.json`: A dictionary listing the `obsreg_id` IDs of the event files for the bona-fide transients used in this project. Each key corresponds to a different type of transient and can be customized.
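For example, a single event file row can be selected from the table with pandas (a minimal sketch; the example ID is hypothetical):

```python
import pandas as pd

# Load the table of all event files used in the project.
eventfiles = pd.read_csv('../datasets/eventfiles_table.csv')

# Filter for a single event file via its obsreg_id (example ID is hypothetical).
single = eventfiles[eventfiles['obsreg_id'] == '13454_1']
```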
This folder includes the final output data of the transient detection pipeline, generated with the hyperparameters outlined in the paper. Include these files in the `ml-xraytransients/output/embeddings` directory if you want to use the embedding results described in the paper (a sketch for loading them follows the list below).
- `paper2DAE_embedding.csv`: The t-SNE embedding of the autoencoder features of the $E-t$ Maps.
- `paper2DAE_clusters.csv`: The t-SNE embedding of the autoencoder features of the $E-t$ Maps (including DBSCAN clusters).
- `paper2DPCA_embedding.csv`: The t-SNE embedding of the PCA features of the $E-t$ Maps.
- `paper2DPCA_clusters.csv`: The t-SNE embedding of the PCA features of the $E-t$ Maps (including DBSCAN clusters).
- `paper3DAE_embedding.csv`: The t-SNE embedding of the autoencoder features of the $E-t-dt$ Cubes.
- `paper3DAE_clusters.csv`: The t-SNE embedding of the autoencoder features of the $E-t-dt$ Cubes (including DBSCAN clusters).
- `paper3DPCA_embeddings.csv`: The t-SNE embedding of the PCA features of the $E-t-dt$ Cubes.
- `paper3DPCA_clusters.csv`: The t-SNE embedding of the PCA features of the $E-t-dt$ Cubes (including DBSCAN clusters).
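As a quick-start sketch, one of the published embeddings can be loaded and visualized as follows (the column names are assumptions and may differ from the actual CSV headers):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the published t-SNE embedding with DBSCAN cluster labels.
df = pd.read_csv('../output/embeddings/paper2DAE_clusters.csv')

# Assumed columns: two t-SNE coordinates and a cluster label.
plt.scatter(df['tsne_1'], df['tsne_2'], c=df['cluster'], s=2, cmap='tab20')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.show()
```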
To get started with this project, ensure your system meets the following requirements:
- Python: Version 3.9 or higher must be installed on your system.
- Conda: This is used for managing the Python environment.
- Keras (TensorFlow): Version 2.12.0 or higher is required for running the neural networks.
To set up the project environment and install the corresponding dependencies, use the following commands:

```bash
conda create --name [new_env] python=3.9
conda activate [new_env]
pip install --upgrade pip
pip install -r requirements.txt
```

where `[new_env]` is your chosen environment name.
The following commands are to be executed from the `../src` directory.
Run the following scripts to generate the event file representations (a conceptual sketch of the $E-t$ Map representation follows this list):

- $E-t$ Maps (normalized):
  ```bash
  python run_eventfile_representation.py '../datasets/eventfiles_table.csv' 'et'
  ```
- $E-t$ Maps (not normalized):
  ```bash
  python run_eventfile_representation.py '../datasets/eventfiles_table.csv' 'et' -norm False
  ```
- $E-t-dt$ Cubes (normalized):
  ```bash
  python run_eventfile_representation.py '../datasets/eventfiles_table.csv' 'etdt'
  ```
- $E-t-dt$ Cubes (not normalized):
  ```bash
  python run_eventfile_representation.py '../datasets/eventfiles_table.csv' 'etdt' -norm False
  ```
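Conceptually, an $E-t$ Map is a 2D histogram of a single event file, binning event energies against arrival times; the `16-24` tag in the output file names suggests a 16x24 grid (which axis receives which binning is an assumption here). A minimal sketch:

```python
import numpy as np

def et_map(times, energies, t_bins=24, e_bins=16, normalize=True):
    """Bin events into an E-t histogram (conceptual sketch of the representation)."""
    hist, _, _ = np.histogram2d(energies, times, bins=(e_bins, t_bins))
    if normalize and hist.max() > 0:
        hist /= hist.max()  # scale to [0, 1], analogous to the -norm option
    return hist
```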
Run the following scripts to extract features from the event file representations (a sketch of both extraction paths follows this list):

- $2D-PCA$ Case:
  ```bash
  python run_feature_extraction.py '../output/representations/et_16-24_normFalse_representations.pkl' 'PCA' 15
  ```
- $3D-PCA$ Case:
  ```bash
  python run_feature_extraction.py '../output/representations/etdt_16-24-24_normFalse_representations.pkl' 'PCA' 22
  ```
- $2D-AE$ Case:
  ```bash
  python run_feature_extraction.py '../output/representations/et_16-24_normTrue_representations.pkl' '../encoders/encoder_et.h5'
  ```
- $3D-AE$ Case:
  ```bash
  python run_feature_extraction.py '../output/representations/etdt_16-24-24_normTrue_representations.pkl' '../encoders/encoder_etdt.h5'
  ```
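Under the hood, the PCA cases reduce each flattened representation to a small set of components, while the AE cases pass the normalized representations through a pre-trained encoder. A rough sketch of both paths, assuming the pickle files hold an array-like stack of representations:

```python
import pickle
import numpy as np
from sklearn.decomposition import PCA
from tensorflow import keras

# PCA path: load the unnormalized representations and keep 15 components (2D case).
with open('../output/representations/et_16-24_normFalse_representations.pkl', 'rb') as f:
    reps = np.asarray(pickle.load(f))
pca_features = PCA(n_components=15).fit_transform(reps.reshape(len(reps), -1))

# AE path: load the normalized representations and encode them instead.
with open('../output/representations/et_16-24_normTrue_representations.pkl', 'rb') as f:
    reps_norm = np.asarray(pickle.load(f))
encoder = keras.models.load_model('../encoders/encoder_et.h5')
ae_features = encoder.predict(reps_norm)  # may need reshaping to the encoder's input shape
```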
Run the following script to perform dimensionality reduction on the extracted features:

```bash
python run_dimensionality_reduction.py <feature_path> [-n n_components] [-p perplexity] [-lr learning_rate] [-iter n_iter] [-exag early_exaggeration] [-init init] [-rs random_state]
```

where `<feature_path>` is the path to the chosen feature set and the remaining arguments are the t-SNE algorithm hyperparameters.
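Internally, this corresponds to running scikit-learn's t-SNE with the given hyperparameters, roughly as follows (a sketch with placeholder features; the values shown are examples, not necessarily the paper's):

```python
import numpy as np
from sklearn.manifold import TSNE

features = np.random.rand(1000, 15)  # placeholder for the chosen feature set

# Hyperparameters mirror the CLI flags above.
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
            n_iter=1000, early_exaggeration=12, init='pca', random_state=42)
embedding = tsne.fit_transform(features)
```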
Run the following script to perform clustering on the embeddings:

```bash
python run_embedding_clustering.py <embedding_path> [-eps eps] [-ms min_samples]
```

where `<embedding_path>` is the path to the chosen feature embedding and the remaining arguments are the DBSCAN algorithm hyperparameters.
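This corresponds to a DBSCAN run over the 2D embedding, roughly (a sketch with placeholder data and example hyperparameters):

```python
import numpy as np
from sklearn.cluster import DBSCAN

embedding = np.random.rand(1000, 2)  # placeholder for a t-SNE embedding

# Label each embedding point; DBSCAN assigns -1 to noise points outside any cluster.
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(embedding)
```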
A demonstration of the full pipeline is available in the Jupyter notebook `demo.ipynb`. This notebook also includes a tool to identify analogs of the bona-fide transients defined in `datasets/bonafide_transients.json`.
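The analog search amounts to a nearest-neighbor query in the embedding space around a known transient. A minimal sketch (the embedding column names, the ID column, and the JSON key are assumptions):

```python
import json
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Load a published embedding and the bona-fide transient IDs.
df = pd.read_csv('../output/embeddings/paper2DAE_embedding.csv')
with open('../datasets/bonafide_transients.json') as f:
    bonafide = json.load(f)

# Fit a nearest-neighbor index on the embedding coordinates (assumed columns).
coords = df[['tsne_1', 'tsne_2']].to_numpy()
nn = NearestNeighbors(n_neighbors=10).fit(coords)

# Query around one known transient to find candidate analogs ('FXT' key is hypothetical).
target = df[df['obsreg_id'] == bonafide['FXT'][0]][['tsne_1', 'tsne_2']]
_, idx = nn.kneighbors(target.to_numpy())
analogs = df.iloc[idx[0]]
```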
For any questions, feedback, or assistance, please feel free to reach out via email at [email protected].
This project is licensed under the MIT License.
The project is ready for submission. All essential features have been implemented, and the codebase is stable. Future updates may focus on minor improvements, bug fixes, or optimizations.
```bibtex
@article{Parker_2024,
  title     = {AstroCLIP: a cross-modal foundation model for galaxies},
  volume    = {531},
  number    = {4},
  pages     = {4990--5011},
  journal   = {Monthly Notices of the Royal Astronomical Society},
  publisher = {Oxford University Press (OUP)},
  author    = {Parker, Liam and Lanusse, Francois and Golkar, Siavash and Sarra, Leopoldo and Cranmer, Miles and Bietti, Alberto and Eickenberg, Michael and Krawezik, Geraud and McCabe, Michael and Morel, Rudy and Ohana, Ruben and Pettee, Mariel and Régaldo-Saint Blancard, Bruno and Cho, Kyunghyun and Ho, Shirley},
  year      = {2024},
  month     = jun,
  ISSN      = {1365-2966},
  url       = {http://dx.doi.org/10.1093/mnras/stae1450},
  DOI       = {10.1093/mnras/stae1450}
}
```
```bibtex
@article{Dillmann_2024,
  title     = {Representation learning for time-domain high-energy astrophysics: Discovery of extragalactic Fast X-ray Transient XRT 200515},
  journal   = {Monthly Notices of the Royal Astronomical Society},
  publisher = {Oxford University Press (OUP)},
  author    = {Dillmann, Steven and Martínez-Galarza, Juan Rafael and Soria, Roberto and Di Stefano, Rosanne and Kashyap, Vinay L.},
  year      = {2024},
  month     = dec,
  ISSN      = {1365-2966},
  url       = {http://dx.doi.org/10.1093/mnras/stae2808},
  DOI       = {10.1093/mnras/stae2808}
}
```
Many thanks to the following contributors:
- Steven Dillmann, Stanford University
- Rafael Martínez-Galarza, Center for Astrophysics | Harvard & Smithsonian
- Rosanne Di Stefano, Center for Astrophysics | Harvard & Smithsonian
- Roberto Soria, Italian National Institute for Astrophysics (INAF)
- Vinay Kashyap, Center for Astrophysics | Harvard & Smithsonian
This project is maintained by Steven Dillmann.
2nd January 2025