Applying Kedro Orchestration Pipelines to Deploy a Deep Learning Transformer Architecture for the Toxic Comment Classification Problem
This project leverages Kedro, a data and machine learning pipeline orchestration tool, to classify toxic comments using a deep learning transformer architecture. It integrates Kedro-Viz for pipeline visualization and follows data engineering best practices for reproducibility and maintainability.
- Pipeline Visualization: Integrates Kedro-Viz to provide an interactive visual representation of data workflows for better debugging and workflow comprehension.
- Multilingual Dataset Support: Handles multilingual datasets, including English, with preprocessing pipelines optimized for transformer models.
- Transformer-Based Modeling: Supports state-of-the-art deep learning models, such as BERT, for toxicity prediction.
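For the modeling bullet above, here is a hedged sketch of what a BERT-based toxicity classifier can look like with TensorFlow and the Hugging Face `transformers` library. The checkpoint name and preprocessing details are assumptions chosen to match the multilingual setting, not this project's confirmed code:

```python
# Sketch of a BERT toxicity classifier; "bert-base-multilingual-cased" is an
# assumed checkpoint, picked to match the multilingual setting described above.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # toxic vs. non-toxic
)

# Tokenize a batch of comments into fixed-length tensors for the model.
batch = tokenizer(
    ["You are a wonderful person"],
    padding=True, truncation=True, max_length=128, return_tensors="tf",
)
outputs = model(batch)
probs = tf.nn.softmax(outputs.logits, axis=-1)  # [p(non-toxic), p(toxic)]
print(probs.numpy())
```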
The project stack includes:
- Kedro `v0.19.10` for pipeline orchestration
- TensorFlow for model training
- FastAPI for serving predictions
- MLflow for experiment tracking
- Docker for containerization
- Prometheus and Grafana for monitoring
For more information, see the Kedro documentation.
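For orientation, a Kedro pipeline is assembled from nodes: plain Python functions wired together by named inputs and outputs. A minimal sketch of that wiring follows; the function and dataset names here are illustrative, not this project's actual registry:

```python
# Illustrative Kedro pipeline wiring; node and dataset names are hypothetical.
from kedro.pipeline import node, pipeline

def preprocess_comments(raw_comments):
    """Clean and tokenize raw comment text (placeholder logic)."""
    return raw_comments

def train_model(preprocessed_comments):
    """Fit a transformer classifier on the prepared data (placeholder logic)."""
    return "trained_model"

def create_pipeline(**kwargs):
    return pipeline(
        [
            node(preprocess_comments, inputs="raw_comments",
                 outputs="preprocessed_comments", name="preprocess"),
            node(train_model, inputs="preprocessed_comments",
                 outputs="trained_model", name="train"),
        ]
    )
```

With Kedro-Viz installed, `kedro viz` renders this graph interactively for debugging and workflow comprehension.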
This project utilizes datasets from the Jigsaw "Toxic Comment Classification" competition. The datasets include labeled comments sourced from platforms like Wikipedia and Civil Comments.
Description:
- `comment_text`: the primary data, containing user-submitted comments;
- `toxic`: a binary label, where `1` indicates a toxic comment and `0` a non-toxic one.
After that, download the following files:
- `jigsaw-toxic-comment-train.csv`
- `validation.csv`
- `test.csv`

Place all of them in `data/01_raw/`.
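Once the files are in place, a quick sanity check of the training data (assuming the standard Jigsaw column names described above):

```python
# Quick look at the raw training data; column names follow the Jigsaw files.
import pandas as pd

train = pd.read_csv("data/01_raw/jigsaw-toxic-comment-train.csv")
print(train[["comment_text", "toxic"]].head())
print(train["toxic"].value_counts())  # distribution of toxic (1) vs non-toxic (0)
```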
To ensure consistency and reproducibility:
- `.gitignore` Compliance: Do not remove any lines from the provided `.gitignore` file.
- Data Engineering Convention: Follow Kedro's data engineering conventions.
- Data Handling: Do not commit raw datasets to the repository.
- Credential Security: Do not commit credentials or local configuration to the repository. Store these in `conf/local/` (see the sketch after this list).
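For example, credentials kept in `conf/local/credentials.yml` can be read through Kedro's config loader instead of being hard-coded. A minimal sketch; the file layout and the `my_api` key are hypothetical examples:

```python
# Sketch of reading credentials from conf/local/ via Kedro's OmegaConfigLoader;
# the "my_api" entry is a hypothetical example, not a key this project defines.
from kedro.config import OmegaConfigLoader

conf_loader = OmegaConfigLoader(conf_source="conf", env="local")
credentials = conf_loader["credentials"]   # merges conf/base with conf/local
api_key = credentials["my_api"]["key"]     # never commit this value
```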
Install the required dependencies:
```bash
pip install -r requirements.txt
```
Execute the Kedro pipeline:
```bash
kedro run
```
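Equivalently, the pipeline can be run from Python, which is handy for notebooks or custom entry points. A sketch using Kedro's documented session API, executed from the project root:

```python
# Programmatic alternative to `kedro run`.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path.cwd()
bootstrap_project(project_path)  # reads pyproject.toml and registers the project
with KedroSession.create(project_path=project_path) as session:
    session.run()  # optionally: session.run(pipeline_name="...")
```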
Run tests using `pytest`:

```bash
pytest
```
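Tests in this style often assert simple data invariants. A purely illustrative example, not one of the project's actual tests:

```python
# Illustrative test; the project's real tests live in its tests/ directory
# and target its actual node functions.
import pandas as pd

def toxic_labels_are_binary(df: pd.DataFrame) -> bool:
    """Hypothetical helper: check the `toxic` column only contains 0 or 1."""
    return bool(df["toxic"].isin([0, 1]).all())

def test_toxic_labels_are_binary():
    df = pd.DataFrame({"comment_text": ["hello", "go away"], "toxic": [0, 1]})
    assert toxic_labels_are_binary(df)
```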
To serve the champion model, you can build the `toxic-comment-endpoint` container, which uses the `fastapi` package for serving.
- Build the Docker image:

```bash
docker build -t toxic-comment-endpoint .
```

- Run the Docker container:

```bash
docker run -p 8001:8001 -e PORT=8001 toxic-comment-endpoint
```

- Open http://0.0.0.0:8001 in your browser to see the web application.
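Once the container is running, you can also query the endpoint programmatically. A minimal client sketch, assuming the service exposes a `/predict` route accepting JSON with a `text` field; the actual route and payload shape may differ, so check the FastAPI app for the real contract:

```python
# Minimal client sketch; the /predict route and payload schema are assumptions.
import requests

response = requests.post(
    "http://0.0.0.0:8001/predict",
    json={"text": "You are a wonderful person"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. a toxicity label and/or probability
```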
Create a `prometheus.yml` file for monitoring and sending metrics to Grafana:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "fastapi"
    metrics_path: "/metrics"
    static_configs:
      - targets: ["0.0.0.0:8001"]

remote_write:
  - url: "<YOUR_GRAFANA_URL>"
    remote_timeout: "30s"
    send_exemplars: false
    follow_redirects: true
    basic_auth:
      username: "<YOUR_USERNAME>"
      password: "<YOUR_API_TOKEN>"
```
Run Prometheus with the configuration file:
```bash
prometheus --config.file=prometheus.yml
```
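For Prometheus to find metrics at `/metrics`, the FastAPI app has to expose them. One common way is the `prometheus-fastapi-instrumentator` package; this is a sketch of that approach, which may differ from how this project's app wires it up:

```python
# One way to expose /metrics from FastAPI; prometheus-fastapi-instrumentator
# is an assumption here, not confirmed as this project's dependency.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # serves metrics at /metrics

@app.get("/health")
def health():
    return {"status": "ok"}
```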