Applying Kedro Orchestration Pipelines to Deploy a Deep Learning Transformer Architecture for the Toxic Comment Classification Problem
This project leverages Kedro, a data and machine learning pipeline orchestration tool, to classify toxic comments using a deep learning transformer architecture. It integrates Kedro-Viz for pipeline visualization and follows data engineering best practices for reproducibility and maintainability.
- Pipeline Visualization: Integrates Kedro-Viz to provide an interactive visual representation of data workflows for better debugging and workflow comprehension.
- Multilingual Dataset Support: Handles multilingual datasets, including English, with preprocessing pipelines optimized for transformer models.
- Transformer-Based Modeling: Supports state-of-the-art deep learning models, such as BERT, for toxicity prediction.
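For the modeling bullet above, here is a hedged sketch of what a BERT-based toxicity classifier can look like with TensorFlow and the Hugging Face `transformers` library. The checkpoint name and preprocessing details are assumptions chosen to match the multilingual setting, not this project's confirmed code:

```python
# Sketch of a BERT toxicity classifier; "bert-base-multilingual-cased" is an
# assumed checkpoint, picked to match the multilingual setting described above.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # toxic vs. non-toxic
)

# Tokenize a batch of comments into fixed-length tensors for the model.
batch = tokenizer(
    ["You are a wonderful person"],
    padding=True, truncation=True, max_length=128, return_tensors="tf",
)
outputs = model(batch)
probs = tf.nn.softmax(outputs.logits, axis=-1)  # [p(non-toxic), p(toxic)]
print(probs.numpy())
```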
The project stack includes:
- Kedro `v0.19.10` for pipeline orchestration
- TensorFlow for model training
- FastAPI for serving predictions
- MLflow for experiment tracking
- Docker for containerization
- Prometheus and Grafana for monitoring
For more information, see the Kedro documentation.
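For orientation, a Kedro pipeline is assembled from nodes: plain Python functions wired together by named inputs and outputs. A minimal sketch of that wiring follows; the function and dataset names here are illustrative, not this project's actual registry:

```python
# Illustrative Kedro pipeline wiring; node and dataset names are hypothetical.
from kedro.pipeline import node, pipeline

def preprocess_comments(raw_comments):
    """Clean and tokenize raw comment text (placeholder logic)."""
    return raw_comments

def train_model(preprocessed_comments):
    """Fit a transformer classifier on the prepared data (placeholder logic)."""
    return "trained_model"

def create_pipeline(**kwargs):
    return pipeline(
        [
            node(preprocess_comments, inputs="raw_comments",
                 outputs="preprocessed_comments", name="preprocess"),
            node(train_model, inputs="preprocessed_comments",
                 outputs="trained_model", name="train"),
        ]
    )
```

With Kedro-Viz installed, `kedro viz` renders this graph interactively for debugging and workflow comprehension.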
This project utilizes datasets from the Jigsaw "Toxic Comment Classification" competition. The datasets include labeled comments sourced from platforms like Wikipedia and Civil Comments.
Description:
- `comment_text`: the primary data, containing user-submitted comments;
- `toxic`: a binary label, where `1` indicates a toxic comment and `0` a non-toxic one.
After that, download the following files:
- `jigsaw-toxic-comment-train.csv`
- `validation.csv`
- `test.csv`

Place all of them in `data/01_raw/`.
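Once the files are in place, a quick sanity check of the training data (assuming the standard Jigsaw column names described above):

```python
# Quick look at the raw training data; column names follow the Jigsaw files.
import pandas as pd

train = pd.read_csv("data/01_raw/jigsaw-toxic-comment-train.csv")
print(train[["comment_text", "toxic"]].head())
print(train["toxic"].value_counts())  # distribution of toxic (1) vs non-toxic (0)
```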
To ensure consistency and reproducibility:
- `.gitignore` Compliance: Do not remove any lines from the provided `.gitignore` file.
- Data Engineering Convention: Follow Kedro's data engineering conventions.
- Data Handling: Do not commit raw datasets to the repository.
- Credential Security: Do not commit credentials or local configuration to the repository. Store these in `conf/local/` (see the sketch after this list).
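For example, credentials kept in `conf/local/credentials.yml` can be read through Kedro's config loader instead of being hard-coded. A minimal sketch; the file layout and the `my_api` key are hypothetical examples:

```python
# Sketch of reading credentials from conf/local/ via Kedro's OmegaConfigLoader;
# the "my_api" entry is a hypothetical example, not a key this project defines.
from kedro.config import OmegaConfigLoader

conf_loader = OmegaConfigLoader(conf_source="conf", env="local")
credentials = conf_loader["credentials"]   # merges conf/base with conf/local
api_key = credentials["my_api"]["key"]     # never commit this value
```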
Install the required dependencies:
```bash
pip install -r requirements.txt
```
Execute the Kedro pipeline:
```bash
kedro run
```
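Equivalently, the pipeline can be run from Python, which is handy for notebooks or custom entry points. A sketch using Kedro's documented session API, executed from the project root:

```python
# Programmatic alternative to `kedro run`.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path.cwd()
bootstrap_project(project_path)  # reads pyproject.toml and registers the project
with KedroSession.create(project_path=project_path) as session:
    session.run()  # optionally: session.run(pipeline_name="...")
```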
Run tests using `pytest`:

```bash
pytest
```
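Tests in this style often assert simple data invariants. A purely illustrative example, not one of the project's actual tests:

```python
# Illustrative test; the project's real tests live in its tests/ directory
# and target its actual node functions.
import pandas as pd

def toxic_labels_are_binary(df: pd.DataFrame) -> bool:
    """Hypothetical helper: check the `toxic` column only contains 0 or 1."""
    return bool(df["toxic"].isin([0, 1]).all())

def test_toxic_labels_are_binary():
    df = pd.DataFrame({"comment_text": ["hello", "go away"], "toxic": [0, 1]})
    assert toxic_labels_are_binary(df)
```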
To serve the champion model, you can build the `toxic-comment-endpoint` container, which uses the `fastapi` package for serving.
- Build the Docker image:

```bash
docker build -t toxic-comment-endpoint .
```

- Run the Docker container:

```bash
docker run -p 8001:8001 -e PORT=8001 toxic-comment-endpoint
```

- Open http://0.0.0.0:8001 in your browser to see the web application.
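Once the container is running, you can also query the endpoint programmatically. A minimal client sketch, assuming the service exposes a `/predict` route accepting JSON with a `text` field; the actual route and payload shape may differ, so check the FastAPI app for the real contract:

```python
# Minimal client sketch; the /predict route and payload schema are assumptions.
import requests

response = requests.post(
    "http://0.0.0.0:8001/predict",
    json={"text": "You are a wonderful person"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. a toxicity label and/or probability
```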
Create a `prometheus.yml` file for monitoring and sending metrics to Grafana:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "fastapi"
    metrics_path: "/metrics"
    static_configs:
      - targets: ["0.0.0.0:8001"]

remote_write:
  - url: "<YOUR_GRAFANA_URL>"
    remote_timeout: "30s"
    send_exemplars: false
    follow_redirects: true
    basic_auth:
      username: "<YOUR_USERNAME>"
      password: "<YOUR_API_TOKEN>"
```
Run Prometheus with the configuration file:
```bash
prometheus --config.file=prometheus.yml
```
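For Prometheus to find metrics at `/metrics`, the FastAPI app has to expose them. One common way is the `prometheus-fastapi-instrumentator` package; this is a sketch of that approach, which may differ from how this project's app wires it up:

```python
# One way to expose /metrics from FastAPI; prometheus-fastapi-instrumentator
# is an assumption here, not confirmed as this project's dependency.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # serves metrics at /metrics

@app.get("/health")
def health():
    return {"status": "ok"}
```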