- Introduction
- Project Structure
- Requirements
- Installation
- ETL Flow
- Clustering Flow
- Dashboard Map
- Testing
- CI/CD
- License
- Acknowledgements
## Introduction

This is a monolithic Docker image to help you get started with geospatial analysis and visualization of lightning atmospheric data. The data comes from the US National Oceanic and Atmospheric Administration (NOAA) Geostationary Lightning Mapper (GLM) data product, sourced from AWS S3 buckets. There are currently two main components:

- ETL ingestion: data ingestion and analysis processes.
- Streamlit dashboard app: frontend GIS visualization dashboard.

Processing is done using Pandas DataFrames, with SQLite plus the SpatiaLite extension as local storage, and a self-hosted Prefect server instance for orchestration and observability of the processing pipelines.

Architecture: Docker + Prefect + Pandas + SQLite + Streamlit
### Brief Data Summary: Lightning Cluster Filter Algorithm (LCFA)

The multidimensional data structures stored in the netCDF4 files contain a rich variety of data, including metadata with descriptors. The main variables (flashes, groups, and events) form a hierarchy: a series of detected radiant events is clustered into groups, and groups are clustered into flashes using the LCFA.
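For illustration, here is a minimal sketch of inspecting this hierarchy with the `netCDF4` package. The file name is a placeholder, and the parent-ID variable names follow the GLM L2 LCFA convention but should be verified against your files:

```python
# Minimal sketch: peeking at the LCFA event -> group -> flash hierarchy.
# Assumes the `netCDF4` package and a locally downloaded GLM file.
from netCDF4 import Dataset

glm = Dataset("OR_GLM-L2-LCFA_G18.nc")  # placeholder file name

# Each event points to its parent group, and each group to its parent flash.
print(glm.variables["event_parent_group_id"][:5])
print(glm.variables["group_parent_flash_id"][:5])
print(glm.variables["flash_id"][:5])

glm.close()
```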
## Project Structure

```
lightning-containers/
├── src/
│   ├── flows.py
│   └── tasks/
│       ├── analytics/
│       └── etl/
├── app/
│   └── dashboard.py
├── notebooks/
│   ├── clustering/
│   ├── mapping/
│   └── streaming/
├── tests/
│   ├── test_clustering.py
│   ├── test_extract.py
│   ├── test_load.py
│   └── test_transform.py
├── docs/
│   └── index.md
├── img/
├── .streamlit/
│   ├── config.toml
│   └── secrets.toml
├── .github/
│   └── workflows/
│       └── docker-image.yml
├── data/
├── .gitignore
├── LICENSE
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── docker-compose.yml
└── README.md
```
## Requirements

| Resource | Minimum | Recommended |
|----------|---------|-------------|
| CPU      | 2 cores | 4+ cores    |
| RAM      | 6 GB    | 16 GB       |
| Storage  | 8 GB    | 24 GB       |
## Installation

- Clone the repository:

```sh
git clone https://github.com/BayoAdejare/lightning-containers.git
cd lightning-containers
```

- The project can be run with Docker containers or installed locally. To use Docker:

```sh
docker-compose up -d # spin up containers
```
For a local install, make sure you have a virtual environment configured:

```sh
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
```

Then install the requirements from the project directory via pip:

```sh
pip install -r requirements.txt # requires Python <= 3.12
```
Run the following command to start the Prefect workflow orchestration:

```sh
prefect server start # Start Prefect engine and UI, i.e. http://localhost:4200/
```

The Prefect orchestration platform is required for scheduling; from the Prefect UI, you can run and monitor the data flows.
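For orientation, here is a hedged sketch of how a Prefect flow such as the one in `src/flows.py` might be structured. The task names mirror the ETL steps described below, but the signatures and bodies are assumptions, not the repo's actual code:

```python
# Hedged sketch of a Prefect flow; names and signatures are illustrative assumptions.
from prefect import flow, task

@task
def extract(start: str, end: str) -> list[str]:
    """Download GLM netCDF files for the time window; return local paths."""
    ...

@task
def transform(files: list[str]) -> str:
    """Convert netCDF files into a time/geo series CSV; return the CSV path."""
    ...

@task
def load(csv_path: str) -> None:
    """Load the CSV into the SQLite/SpatiaLite backend."""
    ...

@flow
def ingestion(start: str = "2023-01-01", end: str = "2023-01-02") -> None:
    # Chain the ETL tasks; each run is observable in the Prefect UI.
    load(transform(extract(start, end)))

if __name__ == "__main__":
    ingestion()
```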
Run the following commands to start the data app:

```sh
python src/flows.py # Start backend
streamlit run app/dashboard.py # Start frontend, i.e. http://localhost:8501/
```
## ETL Flow

ETL flow data tasks:

- **Source**: extracts NOAA GOES-R GLM file datasets from the AWS S3 bucket; the default is GOES-18.
- **Transformations**: transforms the dataset into time series CSVs.
- **Sink**: loads the dataset to persistent storage.

The flow ingests the data needed based on a specified time window (start and end dates):

- **extract**: downloads NOAA GOES-R GLM netCDF4 files from the AWS S3 bucket.
- **transform**: converts GLM netCDF into time and geo series CSVs.
- **load**: loads the CSVs to a local backend, persistent SQLite with the SpatiaLite extension (see the sketch after this list).
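As a concrete illustration of the load step, here is a minimal sketch using pandas and sqlite3 with the SpatiaLite extension. The database path, table name, and CSV path are assumptions:

```python
# Minimal sketch of loading a transformed CSV into SQLite with SpatiaLite.
# Paths and the table name are placeholders, not the repo's actual values.
import sqlite3

import pandas as pd

conn = sqlite3.connect("data/glm.db")
conn.enable_load_extension(True)        # requires an extension-enabled SQLite build
conn.load_extension("mod_spatialite")   # SpatiaLite must be installed on the system

df = pd.read_csv("data/flashes.csv")    # hypothetical output of the transform step
df.to_sql("flashes", conn, if_exists="append", index=False)

conn.close()
```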
## Clustering Flow

Performs grouping of the ingested data by implementing the k-means clustering algorithm:

- **preprocessor**: prepares the data for the cluster model; cleans and normalizes the data.
- **kmeans_cluster**: fits the data to an implementation of the k-means clustering algorithm.
- **silhouette_evaluator**: evaluates the choice of `k` clusters by calculating the silhouette coefficient for each `k` in a defined range.
- **elbow_evaluator**: evaluates the choice of `k` clusters by calculating the sum of squared distances for each `k` in a defined range (see the sketch after this list).
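Here is a compact sketch of the evaluation loop these tasks describe, using scikit-learn (an assumption; the repo may implement k-means differently) on stand-in coordinates:

```python
# Sketch of k-means with silhouette and elbow evaluation over a range of k.
# scikit-learn and the synthetic coordinates are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
coords = rng.uniform([-130.0, 20.0], [-60.0, 55.0], size=(500, 2))  # stand-in lon/lat
X = StandardScaler().fit_transform(coords)  # preprocessor: normalize the features

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)  # silhouette_evaluator
    sse = km.inertia_                      # elbow_evaluator: sum of squared distances
    print(f"k={k}: silhouette={sil:.3f}, sse={sse:.1f}")
```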
## Dashboard Map

*Lightning containers dashboard*
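For reference, a minimal sketch of a Streamlit map view in the spirit of `app/dashboard.py`; the CSV path and column names are assumptions:

```python
# Minimal Streamlit map sketch; data path and columns are illustrative.
import pandas as pd
import streamlit as st

df = pd.read_csv("data/flashes.csv")  # hypothetical time/geo series CSV
st.title("Lightning flashes")
st.map(df.rename(columns={"flash_lat": "latitude", "flash_lon": "longitude"}))
```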
## Testing

Use the following command to run tests:

```sh
pytest
```
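An illustrative test in the spirit of the files under `tests/` (the CSV column contract here is an assumption, not the repo's actual schema):

```python
# Illustrative pytest case; the CSV schema is an assumption for demonstration.
import pandas as pd

def test_csv_roundtrip_keeps_time_and_geo_columns(tmp_path):
    # Build a tiny stand-in DataFrame resembling transformed GLM output.
    df = pd.DataFrame({
        "timestamp": ["2023-01-01T00:00:00Z"],
        "flash_lat": [33.5],
        "flash_lon": [-84.4],
    })
    out = tmp_path / "flashes.csv"
    df.to_csv(out, index=False)
    loaded = pd.read_csv(out)
    assert {"timestamp", "flash_lat", "flash_lon"} <= set(loaded.columns)
```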
## CI/CD

This project uses GitHub Actions for CI/CD. The workflow is defined in the `.github/workflows/docker-image.yml` file. It includes:

- Automated testing on pull requests
- Data quality checks on scheduled intervals
- Deployment of updated ML models and Spark jobs to production
Please read `CONTRIBUTING.md` for details on our contributing guidelines and the process for submitting pull requests.
## License

This project is licensed under the Apache 2.0 License; see the `LICENSE` file for details.
## Acknowledgements

This work would not have been possible without amazing open source software and datasets, including but not limited to:
- GLM Dataset from NOAA NESDIS
- Prefect from PrefectHQ
- Streamlit
- Built on the codebase of Lightning Streams.
Thank you to the authors of this software and these datasets for making them available to the community!