Merge pull request #39 from danilyef/readme_branch

README.md
danilyef · Dec 17, 2024 · commit bdc47e3
2 parents b216735 + d8316a6

Showing 45 changed files with 405 additions and 438 deletions.
diff --git a/README.md b/README.md
@@ -1,22 +1,84 @@
-# How to start:
+# Machine Learning in Production
 
-#### create virtual environment in the root folder:
+![image](intro.jpg)
+
+
+The **Machine Learning in Production Course** is a comprehensive curriculum designed to equip learners with the knowledge and practical skills needed to build, deploy, and manage machine learning systems at scale. The course combines theoretical insights with hands-on assignments to prepare participants for real-world challenges in MLOps (Machine Learning Operations). Below is an overview of the key topics covered in this course:
+
+### **Course Modules**
+
+1. **MLOps Introduction**  
+   - Fundamentals of MLOps and its importance in modern machine learning workflows.
+
+2. **Infrastructure Setup**  
+   - Setting up infrastructure for machine learning projects.
+   - Focus on tools, cloud platforms, and deployment environments.
+
+3. **Data Storage and Processing**  
+   - Best practices for managing data at scale.
+   - Storage strategies, data preprocessing, and pipelines.
+
+4. **Versioning and Labeling**  
+   - Version control for datasets and models.
+   - Effective labeling and validation strategies.
+
+5. **Training and Experimentation**  
+   - Designing robust training pipelines and running experiments.
+   - Tools for tracking metrics and improving model performance.
+
+6. **Testing and CI/CD**  
+   - Implementing testing strategies for machine learning systems.
+   - Continuous Integration and Continuous Deployment for ML projects.
+
+7. **Orchestration with Kubeflow and Airflow**  
+   - Automating workflows using orchestration tools like Kubeflow and Airflow.
+
+8. **Orchestration with Dagster**  
+   - Advanced orchestration techniques with Dagster.
+
+9. **Serving Basics**  
+   - Fundamentals of serving machine learning models via APIs.
+
+10. **Inference Servers**  
+    - Understanding inference servers and optimizing their performance.
+
+11. **Advanced Serving Features and Benchmarking**  
+    - Advanced serving techniques and benchmarking model performance.
+
+12. **Scaling Infrastructure and Models**  
+    - Techniques for scaling machine learning models and infrastructure to handle production workloads.
+
+13. **Monitoring and Observability**  
+    - Tools and techniques for monitoring ML systems in production.
+    - Implementing observability to track model health and data quality.
+
+14. **Tools, LLMs, and Data Moats**  
+    - Exploring state-of-the-art tools and methodologies.
+    - Leveraging large language models (LLMs) and building competitive data strategies.
+
+15. **ML Platforms**  
+    - Overview of ML platforms and their role in scaling machine learning operations.
+
+
+### How to start:
+
+1. **Create a virtual environment in the root folder:**
 ```bash
 cd /path/to/your/root/folder
 python -m venv env
 ```
 
-#### activate virtual environment:  
+2. **Activate the virtual environment:**
 ```bash
 source env/bin/activate
 ```
 
-#### upgrade pip:
+3. **Upgrade pip:**
 ```bash
 python -m pip install --upgrade pip
 ```
 
-#### install requirements:
+4. **Install requirements:**
 ```bash
 pip install -r main_requirements.txt
 ```
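
Reviewer note on the setup steps above: a quick way to confirm that step 2 took effect before installing requirements is to ask the interpreter itself. This is a minimal sketch, not part of the commit; the helper name `in_virtual_env` is hypothetical, and it assumes the environment was created with `python -m venv` as in step 1.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_env():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment detected; run `source env/bin/activate` first.")
```

Running it before `pip install -r main_requirements.txt` helps avoid installing the pinned packages into the system interpreter by accident.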
diff --git a/homework_2/README.md b/homework_2/README.md
@@ -1,20 +1,18 @@
-# Homework 2
+# Homework 2: Infrastructure setup
 
 This repository contains tasks related to Docker, Kubernetes, and GitHub Actions.
 
-## Structure
 
-The repository is organized into the following directories:
+## Tasks:
 
-- `homework_2/task1/`: Contains Docker-related tasks.
-- `homework_2/task2/`: Contains Kubernetes-related tasks.
-- `.github/workflows/`: Contains the GitHub Actions workflows.
+- PR1: Write a dummy Dockerfile with a simple server and push the image to Docker Hub or the GitHub container registry.
+- PR2: Write a CI/CD pipeline with GitHub Actions that does this for each PR.
+- PR3: Write YAML definitions for a Pod, Deployment, Service, and Job that use your Docker image. Use minikube/kind to test them. Install the k9s tool.
 
 ### Task 1: Docker
 
-#### PR1: `homework_2/task1`
-
-This task involves working with Docker. The following scripts are available:
+- folder: `homework_2/pr1`
+- This task involves working with Docker. The following scripts are available:
 
 1. **First Docker Container:**
    - **Purpose**: Builds and runs a task that prints output.
@@ -35,15 +33,13 @@ This task involves working with Docker. The following scripts are available:
 
 ### Task 2: GitHub Actions
 
-#### PR2: `.github/workflows`
-
-This directory contains GitHub Actions workflows used for CI/CD automation.
+- folder: `.github/workflows`
+- This directory contains GitHub Actions workflows used for CI/CD automation.
 
 ### Task 3: Kubernetes
 
-#### PR3: `homework_2/task2`
-
-This task involves working with Kubernetes resources.
+- folder: `homework_2/pr3`
+- This task involves working with Kubernetes resources.
 
 1. **Pod**:
    - **Command**: 

diff --git a/homework_2/task1/Dockerfile.run b/homework_2/pr1/Dockerfile.run
diff --git a/homework_2/task1/Dockerfile.web b/homework_2/pr1/Dockerfile.web
diff --git a/homework_2/task1/app.py b/homework_2/pr1/app.py
diff --git a/homework_2/task1/dockerignore b/homework_2/pr1/dockerignore
diff --git a/homework_2/task1/requirements.txt b/homework_2/pr1/requirements.txt
diff --git a/homework_2/task1/run.sh b/homework_2/pr1/run.sh
diff --git a/homework_2/task1/web.sh b/homework_2/pr1/web.sh
diff --git a/homework_2/task2/deployment_service.yaml b/homework_2/pr3/deployment_service.yaml
diff --git a/homework_2/task2/job.yaml b/homework_2/pr3/job.yaml
diff --git a/homework_2/task2/pod.yaml b/homework_2/pr3/pod.yaml
diff --git a/homework_3/README.md b/homework_3/README.md
@@ -0,0 +1,18 @@
+# Homework 3: Storage and Processing
+
+## Tasks:
+
+- PR1: Write README instructions detailing how to deploy MinIO with the following options: Local, Docker, Kubernetes (K8S)-based.
+- PR2: Develop a CRUD Python client for MinIO and accompany it with comprehensive tests.
+- PR3: Write code to benchmark various Pandas formats in terms of data saving/loading, focusing on load time and save time.
+- PR4: Create code to benchmark inference performance using single and multiple processes, and report the differences in time.
+- PR5: Develop code for converting your dataset into the StreamingDataset format.
+- PR6: Write code for transforming your dataset into a vector format, and utilize VectorDB for ingestion and querying.
+
+
+### PR6: example
+
+```bash
+python main.py create-index
+python main.py search-index "Who are you?" --top-n 2
+```
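
Reviewer note on the PR6 example above: the two commands suggest an index-then-query flow with a top-n cutoff, but the diff does not show how `main.py` implements it. The sketch below is only an illustration of that flow using brute-force cosine similarity in NumPy. The function names `build_index` and `search_index` and the toy 3-dimensional vectors are assumptions; a real pipeline would embed text with an encoder and store the vectors in a vector database.

```python
import numpy as np


def build_index(vectors: np.ndarray) -> np.ndarray:
    """Normalize rows to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def search_index(index: np.ndarray, query: np.ndarray, top_n: int = 2):
    """Return (row, cosine similarity) pairs for the top_n most similar rows."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q
    best = np.argsort(-scores)[:top_n]  # highest similarity first
    return [(int(i), float(scores[i])) for i in best]


if __name__ == "__main__":
    # Toy "embeddings" standing in for encoded documents (hypothetical data).
    docs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
    index = build_index(docs)
    print(search_index(index, np.array([1.0, 0.1, 0.0]), top_n=2))
```

Brute-force search scans every row per query, which is fine for small datasets; a vector database replaces that scan with an approximate nearest-neighbour index.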
diff --git a/homework_3/pr2/tests.py b/homework_3/pr2/tests.py
@@ -1,14 +1,3 @@
-'''
-Before starting the script, create a virtual environment:
-
-1. cd /path/to/your/project
-2. python -m venv env
-3. source env/bin/activate
-4. pip install -r requirements.txt
-
-After these steps start script from cmd:
-5. python tests.py
-'''
 from minio import Minio
 from minio.error import S3Error
 import pytest

diff --git a/homework_3/pr3/README.md b/homework_3/pr3/README.md
diff --git a/homework_3/pr3/main.py b/homework_3/pr3/main.py
@@ -1,18 +1,3 @@
-'''
-Before starting the script, create a virtual environment:
-
-1. cd /path/to/your/project
-2. python -m venv env
-3. source env/bin/activate
-4. pip install -r requirements.txt
-
-After these steps start script from cmd:
-5. python main.py
-'''
-
-
-
-
 import pandas as pd
 import numpy as np
 import time

diff --git a/homework_3/pr5/inference_time_comparison.jpg b/homework_3/pr4/inference_time_comparison.jpg
diff --git a/homework_3/pr4/main.py b/homework_3/pr4/main.py
@@ -0,0 +1,90 @@
+'''
+Before starting the script, create a virtual environment:
+
+1. cd /path/to/your/project
+2. python -m venv env
+3. source env/bin/activate
+4. pip install -r requirements.txt
+
+After these steps start script from cmd:
+5. python main.py
+'''
+import time
+import multiprocessing as mp
+from sklearn.linear_model import LinearRegression
+from sklearn.datasets import make_regression
+from sklearn.model_selection import train_test_split
+import numpy as np
+import matplotlib.pyplot as plt
+
+
+# Prepare a sample dataset and model for benchmarking
+def create_model_and_data():
+    # Create a synthetic regression dataset with 100 features
+    X, y = make_regression(n_samples=200000, n_features=100, noise=0.1, random_state=42)
+
+    # Hold out 99% of the rows for inference benchmarking; the remaining 1% is used to fit the model
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.99, random_state=42)
+
+    # Train a simple Linear Regression model
+    model = LinearRegression()
+    model.fit(X_train, y_train)
+
+    return model, X_test
+
+# Single inference task using the sklearn model
+def inference_task(args):
+    model, data = args
+    # Sleep 5 ms per batch to simulate additional fixed per-call overhead
+    time.sleep(0.005)
+    # Run the actual model inference on this batch
+    return model.predict(data)
+
+
+def single_process_inference(model, batches):
+    # perf_counter is a monotonic, high-resolution clock suited to measuring intervals
+    start_time = time.perf_counter()
+
+    for batch in batches:
+        inference_task((model, batch))
+
+    elapsed_time = time.perf_counter() - start_time
+    return elapsed_time
+
+
+def multiple_process_inference(model, batches, num_processes=16):
+    start_time = time.perf_counter()
+
+    # The measured time includes pool start-up and pickling the model and batches sent to the workers
+    with mp.Pool(processes=num_processes) as pool:
+        pool.map(inference_task, [(model, batch) for batch in batches])
+
+    elapsed_time = time.perf_counter() - start_time
+    return elapsed_time
+
+
+if __name__ == '__main__':
+    model, X_test = create_model_and_data()
+
+    batch_sizes = [100, 2000]
+    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6))
+
+    colors = ['#1f77b4', '#ff7f0e']  
+
+    for i, batch_size in enumerate(batch_sizes):
+        num_batches = len(X_test) // batch_size + (1 if len(X_test) % batch_size != 0 else 0)
+        data_batches = np.array_split(X_test, num_batches)
+
+        single_process_time = single_process_inference(model, data_batches)
+        multiple_process_time = multiple_process_inference(model, data_batches)
+
+        methods = ['Single Process', 'Multiple Processes']
+        times = [single_process_time, multiple_process_time]
+
+        ax = ax1 if i == 0 else ax2
+        ax.bar(methods, times, color=colors)
+        ax.set_title(f'Inference Time Comparison (Batch Size: {batch_size})')
+        ax.set_xlabel('Method')
+        ax.set_ylabel('Time (seconds)')
+
+    plt.tight_layout()
+    plt.savefig('inference_time_comparison.jpg')
+    plt.close(fig)
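
Reviewer note on the batching in `main.py` above: the script computes the batch count by ceiling division and then calls `np.array_split`, which produces near-equal batches rather than batches of exactly `batch_size` rows. The sketch below contrasts the two behaviours on a small array. The helper names (`num_batches_for`, `even_batches`, `fixed_batches`) are hypothetical and not part of the commit.

```python
import math

import numpy as np


def num_batches_for(n_rows: int, batch_size: int) -> int:
    """Ceiling division: how many batches are needed to cover n_rows."""
    return math.ceil(n_rows / batch_size)


def even_batches(data, batch_size):
    """Split as the script does: near-equal batches via np.array_split."""
    return np.array_split(data, num_batches_for(len(data), batch_size))


def fixed_batches(data, batch_size):
    """Alternative: every batch has batch_size rows except possibly the last."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


if __name__ == "__main__":
    data = np.arange(10)
    print([len(b) for b in even_batches(data, 4)])   # [4, 3, 3]
    print([len(b) for b in fixed_batches(data, 4)])  # [4, 4, 2]
```

When the row count is a multiple of the batch size the two approaches coincide; otherwise the chart title's "Batch Size" is only an upper bound for the `np.array_split` variant.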
diff --git a/homework_3/pr4/requirements.txt b/homework_3/pr4/requirements.txt
@@ -0,0 +1,6 @@
+pandas==2.2.2
+numpy==1.26.4
+pyarrow==17.0.0
+matplotlib==3.8.4
+tables==3.9.2
+scikit-learn==1.5.1
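
Reviewer note on the pinned requirements above: benchmark numbers are only comparable when the installed versions match the pins, so it can help to verify them at the start of a run. This is a minimal sketch using the standard-library `importlib.metadata`; the helper names `parse_pins` and `check_pins` are hypothetical, and the parser assumes simple `name==version` lines (it also accepts the `===` operator) without extras or environment markers.

```python
import re
from importlib import metadata

# Matches "name==version" or "name===version"; other specifiers are ignored.
PIN = re.compile(r"^\s*([A-Za-z0-9_.-]+)\s*={2,3}\s*([^\s#]+)")


def parse_pins(lines):
    """Return {package: version} for lines of the form name==version."""
    pins = {}
    for line in lines:
        match = PIN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def check_pins(pins):
    """Return {package: (pinned, installed or None)} for every mismatch."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    with open("requirements.txt") as fh:
        print(check_pins(parse_pins(fh)) or "All pinned versions match.")
```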
diff --git a/homework_3/pr5/README.md b/homework_3/pr5/README.md