
⚑Lightning Containers: docker-powered lightning atmospheric dataset πŸ“ˆ



Introduction

This is a monolith Docker image to help you get started with geospatial analysis and visualization of lightning atmospheric data. The data comes from the US National Oceanic and Atmospheric Administration (NOAA) Geostationary Lightning Mapper (GLM) data product, sourced from AWS S3 buckets. There are currently two main components:

  1. ETL ingestion: data ingestion and analysis processes.
  2. Streamlit dashboard app: frontend GIS visualization dashboard.

Processing is done with Pandas DataFrames, SQLite with the SpatiaLite extension as local storage, and a self-hosted Prefect server instance for orchestration and observability of the processing pipelines.

Architecture: Docker + Prefect + Pandas + SQLite + Streamlit

Brief Data Summary: Lightning Cluster Filter Algorithm (LCFA)

The multidimensional data structures stored in the netCDF4 files contain a rich variety of data, including metadata with descriptors. The main variables (flashes, groups, and events) form a hierarchy: a series of detected radiant events is clustered into groups, and groups are clustered into flashes using the LCFA.
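The event-to-group-to-flash hierarchy can be illustrated with a small pure-Python sketch. The field names here are toy stand-ins: real GLM L2 LCFA files link the levels through netCDF parent-id variables, so treat the record shape below as an assumption for illustration only.

```python
from collections import defaultdict

# Toy records mimicking the parent-id linkage in GLM files:
# each event points at a group, each group points at a flash.
events = [
    {"id": 1, "parent_group": 10},
    {"id": 2, "parent_group": 10},
    {"id": 3, "parent_group": 11},
]
groups = [
    {"id": 10, "parent_flash": 100},
    {"id": 11, "parent_flash": 100},
]

def build_hierarchy(events, groups):
    """Cluster events into groups, and groups into flashes."""
    events_by_group = defaultdict(list)
    for e in events:
        events_by_group[e["parent_group"]].append(e["id"])
    flashes = defaultdict(dict)
    for g in groups:
        flashes[g["parent_flash"]][g["id"]] = events_by_group[g["id"]]
    return dict(flashes)

print(build_hierarchy(events, groups))
# {100: {10: [1, 2], 11: [3]}}
```

Here three radiant events collapse into two groups, which in turn belong to a single flash.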

Project Structure

lightning-containers/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ flows.py
β”‚   └── tasks/
β”‚       β”œβ”€β”€ analytics/
β”‚       └── etl/
β”œβ”€β”€ app/
β”‚   └── dashboard.py
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ clustering/
β”‚   β”œβ”€β”€ mapping/
β”‚   └── streaming/
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_clustering.py
β”‚   β”œβ”€β”€ test_extract.py
β”‚   β”œβ”€β”€ test_load.py
β”‚   └── test_transform.py
β”œβ”€β”€ docs/
β”‚   └── index.md
β”œβ”€β”€ img/
β”œβ”€β”€ .streamlit/
β”‚   β”œβ”€β”€ config.toml
β”‚   └── secrets.toml
β”œβ”€β”€ .github/
β”‚   └── workflows/
β”‚       └── docker-image.yml
β”œβ”€β”€ data/
β”œβ”€β”€ .gitignore
β”œβ”€β”€ LICENSE
β”œβ”€β”€ CONTRIBUTING.md
β”œβ”€β”€ CODE_OF_CONDUCT.md
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ docker-compose.yml
└── README.md

Requirements

Resource   Minimum   Recommended
CPU        2 cores   4+ cores
RAM        6 GB      16 GB
Storage    8 GB      24 GB

Installation

Quick Start: Docker Container

  1. Clone the repository:
git clone https://github.com/BayoAdejare/lightning-containers.git
cd lightning-containers
  2. Run with Docker containers, or install locally (see below):
docker-compose up -d # spin up containers

Local install

Make sure you have the virtual environment configured:

python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

Install the requirements from the project directory via pip:

pip install -r requirements.txt # requires Python <= 3.12

Start Flow

Run this command to start the Prefect workflow orchestration:

prefect server start # Start the Prefect engine and UI, i.e. http://localhost:4200/

The Prefect orchestration platform is required for scheduling; from the Prefect UI, you can run and monitor the data flows.

Run these commands to start the data app:

python src/flows.py # Start backend

streamlit run app/dashboard.py # Start frontend i.e. http://localhost:8501/

ETL Flow

ETL flow data tasks:

  • Source: extracts NOAA GOES-R GLM file datasets from AWS s3 bucket, default is GOES-18.
  • Transformations: transforms dataset into time series csv.
  • Sink: loads dataset to persistant storage.

Data Ingestion

Ingests the data needed based on specified time window: start and end dates.
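A stdlib-only sketch of how a start/end window might expand into the hourly S3 key prefixes the extract step would list. The product/year/day-of-year/hour layout reflects the public GOES archive on AWS, but the exact prefix format here is an assumption, not the project's actual code.

```python
from datetime import datetime, timedelta

def hourly_prefixes(start: datetime, end: datetime,
                    product: str = "GLM-L2-LCFA") -> list:
    """Return one S3 key prefix per hour in the window [start, end)."""
    prefixes = []
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        # Assumed layout: <product>/<year>/<day-of-year>/<hour>/
        prefixes.append(f"{product}/{t:%Y}/{t:%j}/{t:%H}/")
        t += timedelta(hours=1)
    return prefixes

print(hourly_prefixes(datetime(2023, 1, 1, 0), datetime(2023, 1, 1, 3)))
# ['GLM-L2-LCFA/2023/001/00/', 'GLM-L2-LCFA/2023/001/01/', 'GLM-L2-LCFA/2023/001/02/']
```

Each prefix could then be listed with an S3 client to enumerate the netCDF4 files for that hour.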

Data Processes
  • extract: downloads NOAA GOES-R GLM netCDF4 files from AWS s3 bucket.
  • transform: converts GLM netCDF into time and geo series CSVs.
  • load: loads CSVs to a local backend, persistant SQLite with Spatialite extension.

Clustering Flow

Cluster Analysis

Performs grouping of the ingested data using the K-Means clustering algorithm.

Data Tasks
  • preprocessor: prepares the data for cluster model, clean and normalize the data.
  • kmeans_cluster: fits the data to an implementation of k-means cluster algorithm.
  • silhouette_evaluator: evaluates the choice of 'k' clusters by calculating the silhouette coefficient for each k in defined range.
  • elbow_evaluator: evaluates the choice of 'k' clusters by calculating the sum of the squared distance for each k in defined range.

Dashboard Map

An example dashboard of flash event data points

Testing

Use the following command to run tests:

pytest

CI/CD

This project uses GitHub Actions for CI/CD. The workflow is defined in the .github/workflows/docker-image.yml file. This includes:

  • Automated testing on pull requests
  • Data quality checks on scheduled intervals
  • Deployment of updated ml models and Spark jobs to production

Contributing

Please read CONTRIBUTING.md for details on our contributing guidelines and the process for submitting pull requests.

License

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.

Acknowledgements

This work would not have been possible without amazing open source software and datasets, including but not limited to:

Thank you to the authors of these software packages and datasets for making them available to the community!