Merge pull request #39 from danilyef/readme_branch
README.md
danilyef authored Dec 17, 2024
2 parents b216735 + d8316a6 commit bdc47e3
Showing 45 changed files with 405 additions and 438 deletions.
72 changes: 67 additions & 5 deletions README.md
@@ -1,22 +1,84 @@
# How to start:
# Machine Learning in Production

#### create virtual environment in the root folder:
![image](intro.jpg)


The **Machine Learning in Production Course** is a comprehensive curriculum designed to equip learners with the knowledge and practical skills needed to build, deploy, and manage machine learning systems at scale. The course combines theoretical insights with hands-on assignments to prepare participants for real-world challenges in MLOps (Machine Learning Operations). Below is an overview of the key topics covered in this course:

### **Course Modules**

1. **MLOps Introduction**
- Fundamentals of MLOps and its importance in modern machine learning workflows.

2. **Infrastructure Setup**
- Setting up infrastructure for machine learning projects.
- Focus on tools, cloud platforms, and deployment environments.

3. **Data Storage and Processing**
- Best practices for managing data at scale.
- Storage strategies, data preprocessing, and pipelines.

4. **Versioning and Labeling**
- Version control for datasets and models.
- Effective labeling and validation strategies.

5. **Training and Experimentation**
- Designing robust training pipelines and running experiments.
- Tools for tracking metrics and improving model performance.

6. **Testing and CI/CD**
- Implementing testing strategies for machine learning systems.
- Continuous Integration and Continuous Deployment for ML projects.

7. **Orchestration with Kubeflow and Airflow**
- Automating workflows using orchestration tools like Kubeflow and Airflow.

8. **Orchestration with Dagster**
- Advanced orchestration techniques with Dagster.

9. **Serving Basics**
- Fundamentals of serving machine learning models via APIs.

10. **Inference Servers**
- Understanding inference servers and optimizing their performance.

11. **Advanced Serving Features and Benchmarking**
- Advanced serving techniques and benchmarking model performance.

12. **Scaling Infrastructure and Models**
- Techniques for scaling machine learning models and infrastructure to handle production workloads.

13. **Monitoring and Observability**
- Tools and techniques for monitoring ML systems in production.
- Implementing observability to track model health and data quality.

14. **Tools, LLMs, and Data Moats**
- Exploring state-of-the-art tools and methodologies.
- Leveraging large language models (LLMs) and building competitive data strategies.

15. **ML Platforms**
- Overview of ML platforms and their role in scaling machine learning operations.


### How to start:

1. **Create virtual environment in the root folder:**
```bash
cd /path/to/your/root/folder
python -m venv env
```

#### activate virtual environment:
2. **Activate virtual environment:**
```bash
source env/bin/activate
```

#### upgrade pip:
3. **Upgrade pip:**
```bash
python -m pip install --upgrade pip
```

#### install requirements:
4. **Install requirements:**
```bash
pip install -r main_requirements.txt
```
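
Once the steps above are done, a quick way to confirm that the interpreter really is running inside the virtual environment is to compare `sys.prefix` with `sys.base_prefix`. The helper below is a minimal sketch and not part of the repository; the name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If this reports `False` after step 2, the `source env/bin/activate` command was probably run in a different shell session.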
26 changes: 11 additions & 15 deletions homework_2/README.md
@@ -1,20 +1,18 @@
# Homework 2
# Homework 2: Infrastructure setup

This repository contains tasks related to Docker, Kubernetes, and GitHub Actions.

## Structure

The repository is organized into the following directories:
## Tasks:

- `homework_2/task1/`: Contains Docker-related tasks.
- `homework_2/task2/`: Contains Kubernetes-related tasks.
- `.github/workflows/`: Contains the GitHub Actions workflows.
- PR1: Write a dummy Dockerfile with a simple server and push it to your Docker Hub or GitHub Docker registry.
- PR2: Write a CI/CD pipeline with GitHub Actions that does this for each PR.
- PR3: Write YAML definitions for a Pod, Deployment, Service, and Job with your Docker image. Use minikube or kind to test them. Install the k9s tool.

### Task 1: Docker

#### PR1: `homework_2/task1`

This task involves working with Docker. The following scripts are available:
- folder: `homework_2/pr1`
- This task involves working with Docker. The following scripts are available:

1. **First Docker Container:**
- **Purpose**: Builds and runs a task that prints output.
@@ -35,15 +33,13 @@ This task involves working with Docker. The following scripts are available:

### Task 2: GitHub Actions

#### PR2: `.github/workflows`

This directory contains GitHub Actions workflows used for CI/CD automation.
- folder: `.github/workflows`
- This directory contains GitHub Actions workflows used for CI/CD automation.

### Task 3: Kubernetes

#### PR3: `homework_2/task2`

This task involves working with Kubernetes resources.
- folder: `homework_2/pr3`
- This task involves working with Kubernetes resources.

1. **Pod**:
- **Command**:
10 files renamed without changes.
18 changes: 18 additions & 0 deletions homework_3/README.md
@@ -0,0 +1,18 @@
# Homework 3: Storage and Processing

## Tasks:

- PR1: Write README instructions detailing how to deploy MinIO with the following options: Local, Docker, Kubernetes (K8S)-based.
- PR2: Develop a CRUD Python client for MinIO and accompany it with comprehensive tests.
- PR3: Write code to benchmark various Pandas formats in terms of data saving/loading, focusing on load time and save time.
- PR4: Create code to benchmark inference performance using single and multiple processes, and report the differences in time.
- PR5: Develop code for converting your dataset into the StreamingDataset format.
- PR6: Write code for transforming your dataset into a vector format, and utilize VectorDB for ingestion and querying.


### PR6: example

```bash
python main.py create-index
python main.py search-index "Who are you?" --top-n 2
```
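
The PR6 `main.py` is not shown in this commit, so the following is only a sketch of how a command-line interface with these two subcommands could be wired up using the standard-library `argparse` module. The function names, help strings, the default of 5 for `--top-n`, and the placeholder print statements are all assumptions; the real script may use a different CLI library and real index logic.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two subcommands used in the example above."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # `create-index` takes no arguments in the example invocation.
    subparsers.add_parser("create-index", help="build the vector index")

    # `search-index` takes a positional query and an optional result count.
    search = subparsers.add_parser("search-index", help="query the vector index")
    search.add_argument("query", help="free-text query")
    search.add_argument("--top-n", type=int, default=5, help="number of results")
    return parser


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    if args.command == "create-index":
        # Placeholder: the real script would embed the dataset and store vectors.
        print("creating index (placeholder)")
    else:
        # Placeholder: the real script would embed the query and run a search.
        print(f"searching for {args.query!r}, top {args.top_n} (placeholder)")


# Demonstration with an explicit argument list instead of sys.argv.
main(["search-index", "Who are you?", "--top-n", "2"])
```

Calling `main()` without arguments under an `if __name__ == "__main__":` guard would make the same parser read from the command line.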
11 changes: 0 additions & 11 deletions homework_3/pr2/tests.py
@@ -1,14 +1,3 @@
'''
Before starting the script, create a virtual environment:
1. cd /path/to/your/project
2. python -m venv env
3. source env/bin/activate
4. pip install -r requirements.txt
After these steps start script from cmd:
5. python tests.py
'''
from minio import Minio
from minio.error import S3Error
import pytest
25 changes: 0 additions & 25 deletions homework_3/pr3/README.md

This file was deleted.

15 changes: 0 additions & 15 deletions homework_3/pr3/main.py
@@ -1,18 +1,3 @@
'''
Before starting the script, create a virtual environment:
1. cd /path/to/your/project
2. python -m venv env
3. source env/bin/activate
4. pip install -r requirements.txt
After these steps start script from cmd:
5. python main.py
'''




import pandas as pd
import numpy as np
import time
90 changes: 90 additions & 0 deletions homework_3/pr4/main.py
@@ -0,0 +1,90 @@
'''
Before starting the script, create a virtual environment:
1. cd /path/to/your/project
2. python -m venv env
3. source env/bin/activate
4. pip install -r requirements.txt
After these steps start script from cmd:
5. python main.py
'''
import time
import multiprocessing as mp
from multiprocessing import Pool, cpu_count
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt


# Prepare a sample dataset and model for benchmarking
def create_model_and_data():
    # Create a synthetic regression dataset with 100 features
    X, y = make_regression(n_samples=200000, n_features=100, noise=0.1, random_state=42)

    # Split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.99, random_state=42)

    # Train a simple Linear Regression model
    model = LinearRegression()
    model.fit(X_train, y_train)

    return model, X_test

# Single inference task using the sklearn model
def inference_task(args):
    model, data = args
    time.sleep(0.005)
    # Simulate model inference (predicting the data)
    return model.predict(data)


def single_process_inference(model, batches):
    start_time = time.time()

    for batch in batches:
        inference_task((model, batch))

    elapsed_time = time.time() - start_time
    return elapsed_time


def multiple_process_inference(model, batches, num_processes=16):
    start_time = time.time()

    with mp.Pool(processes=num_processes) as pool:
        pool.map(inference_task, [(model, batch) for batch in batches])

    elapsed_time = time.time() - start_time
    return elapsed_time


if __name__ == '__main__':
    model, X_test = create_model_and_data()

    batch_sizes = [100, 2000]
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6))

    colors = ['#1f77b4', '#ff7f0e']

    for i, batch_size in enumerate(batch_sizes):
        num_batches = len(X_test) // batch_size + (1 if len(X_test) % batch_size != 0 else 0)
        data_batches = np.array_split(X_test, num_batches)

        single_process_time = single_process_inference(model, data_batches)
        multiple_process_time = multiple_process_inference(model, data_batches)

        methods = ['Single Process', 'Multiple Processes']
        times = [single_process_time, multiple_process_time]

        ax = ax1 if i == 0 else ax2
        ax.bar(methods, times, color=colors)
        ax.set_title(f'Inference Time Comparison (Batch Size: {batch_size})')
        ax.set_xlabel('Method')
        ax.set_ylabel('Time (seconds)')

    plt.tight_layout()
    plt.savefig('inference_time_comparison.jpg')
    plt.close(fig)
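
A note on the batching in the script above: it computes the number of batches by ceiling division and then hands that count to `np.array_split`, which produces near-equal chunks rather than chunks of exactly `batch_size` rows. The pure-Python sketch below (helper names are hypothetical, not from the repository) compares the chunk sizes of that approach with plain fixed-stride slicing. With the 198,000 test rows this script should produce (99% of 200,000), both batch sizes divide evenly, so the two approaches agree; they differ only when there is a remainder.

```python
def array_split_sizes(n_rows, batch_size):
    """Chunk sizes when n_rows is split into ceil(n_rows / batch_size) near-equal parts."""
    if n_rows <= 0 or batch_size <= 0:
        raise ValueError("n_rows and batch_size must be positive")
    # Ceiling division, equivalent to the `// ... + (1 if ... else 0)` expression above.
    num_batches = (n_rows + batch_size - 1) // batch_size
    base, extra = divmod(n_rows, num_batches)
    # np.array_split gives the first `extra` chunks one additional row each.
    return [base + 1] * extra + [base] * (num_batches - extra)


def fixed_stride_sizes(n_rows, batch_size):
    """Chunk sizes when slicing in strides of batch_size; only the last chunk may be short."""
    return [min(batch_size, n_rows - start) for start in range(0, n_rows, batch_size)]


print(array_split_sizes(10, 4))   # [4, 3, 3]
print(fixed_stride_sizes(10, 4))  # [4, 4, 2]
print(set(array_split_sizes(198000, 2000)))  # {2000}
```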
6 changes: 6 additions & 0 deletions homework_3/pr4/requirements.txt
@@ -0,0 +1,6 @@
pandas==2.2.2
numpy==1.26.4
pyarrow==17.0.0
matplotlib==3.8.4
tables==3.9.2
scikit-learn==1.5.1
25 changes: 0 additions & 25 deletions homework_3/pr5/README.md

This file was deleted.
