ray-serve-onnx provides a collection of starter code and examples to deploy ONNX models using Ray Serve. Whether you are exploring ONNX for the first time or looking for best practices in production, this repository will help you get started with a scalable and flexible serving setup on Ray.
ONNX (Open Neural Network Exchange) is a popular format for model interoperability, enabling you to easily switch between frameworks without rewriting model code. Ray Serve is a scalable model serving framework that can handle high-throughput inference.
This repository provides:
- Starter code for serving ONNX models with Ray Serve.
- Best practices for model loading, inference, and deployment.
- Example code showing how to customize request handling, scale out to multiple replicas, and monitor performance.
- ONNX Model Deployment: Easily load and run inference on ONNX models using Ray Serve.
- Scalability: Leverage Ray’s distributed architecture to scale out your inference workloads.
- Customization: Inject custom logic into request handling with Ray Serve’s flexible deployment APIs.
- Performance: Combine ONNX’s efficient runtime with Ray Serve’s built-in scaling and concurrency management.
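For orientation, here is the bare ONNX Runtime call that the Ray Serve deployments in this repo wrap. This is a generic, standalone sketch: `model.onnx` is a placeholder path, and the input shape/dtype are illustrative rather than tied to any model shipped here.

```python
# Minimal, standalone ONNX Runtime inference -- no Ray involved yet.
# "model.onnx" is a placeholder; substitute any ONNX model you have locally.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Inspect the graph to discover input names and shapes instead of hard-coding them.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Build a feed dict keyed by input name; dtype and shape depend on your model.
feed = {session.get_inputs()[0].name: np.zeros((1, 8), dtype=np.int64)}
outputs = session.run(None, feed)  # None = return all graph outputs
print([o.shape for o in outputs])
```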
- Python 3.10+
- Ray 2.40+ (or another recent release)
- `onnxruntime` (or an equivalent ONNX runtime library)
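The repository's `requirements.txt` is the source of truth for dependencies; as a rough sketch, the prerequisites above boil down to something like the following (version pins here are illustrative only):

```text
# Illustrative only -- install from the repository's requirements.txt for the real pins.
ray[serve]>=2.40
onnxruntime
```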
- Clone the repository:

  ```bash
  git clone https://github.com/gilljon/ray-serve-onnx.git
  cd ray-serve-onnx
  ```
- (Optional) Create and activate a new virtual environment:

  ```bash
  python -m venv env
  source env/bin/activate   # On Unix-based systems
  # or
  env\Scripts\activate      # On Windows
  ```
- Install the required Python packages via pip:

  ```bash
  pip install -r requirements.txt
  ```

  or via Poetry:

  ```bash
  poetry install --no-root
  ```
- Start Ray:

  ```bash
  ray start --head
  ```

  Or just let Ray start automatically in local mode from your Python script (see the examples).
- Run a serving script:

  ```bash
  serve run examples.embedding_inference:build
  ```

  This will:

  - Initialize Ray (if not already initialized).
  - Download the ONNX model weights from `onnx-models/all-MiniLM-L12-v2-onnx` and cache them in `./onnx-models`.
  - Create a Ray Serve deployment.
  - Start the Ray Serve HTTP API on port `8000` to receive inference requests.
- Send an inference request:

  ```bash
  curl -X POST -H "Content-Type: application/json" \
    -d '{"inputs": ["John Doe works at the store."]}' \
    http://localhost:8000/
  ```

  Adjust the request body based on your model's input structure.
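If you prefer Python over `curl`, a roughly equivalent client using the `requests` library (not a repo dependency; install it separately) looks like this:

```python
# Hypothetical Python client for the curl example above.
import requests

payload = {"inputs": ["John Doe works at the store."]}
resp = requests.post("http://localhost:8000/", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # response shape depends on the example's deployment
```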
- File: `examples/embedding_inference.py`
- Description: A simple demonstration of loading a single ONNX embedding model and serving it through Ray Serve.
- Command: `serve run examples.embedding_inference:build`
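For reference, a deployment wired the way this example describes might look roughly like the sketch below. It is an illustrative reconstruction, not a copy of `examples/embedding_inference.py`: the Hugging Face Hub download, the tokenizer, the ONNX input names, and the mean pooling are all assumptions about how MiniLM-style ONNX exports are commonly served.

```python
# Hypothetical sketch of an embedding deployment; see examples/embedding_inference.py
# for the repository's actual implementation.
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download   # assumed download mechanism
from starlette.requests import Request
from transformers import AutoTokenizer          # assumed tokenizer dependency

from ray import serve
from ray.serve import Application

MODEL_REPO = "onnx-models/all-MiniLM-L12-v2-onnx"
CACHE_DIR = "./onnx-models"


@serve.deployment
class EmbeddingModel:
    def __init__(self) -> None:
        # Download (or reuse cached) model files, then load them once per replica.
        local_dir = snapshot_download(repo_id=MODEL_REPO, local_dir=CACHE_DIR)
        self.tokenizer = AutoTokenizer.from_pretrained(local_dir)
        self.session = ort.InferenceSession(
            f"{local_dir}/model.onnx", providers=["CPUExecutionProvider"]
        )

    async def __call__(self, request: Request) -> dict:
        texts = (await request.json())["inputs"]
        enc = self.tokenizer(texts, padding=True, truncation=True, return_tensors="np")
        # Feed only the inputs the ONNX graph actually declares.
        feed = {i.name: enc[i.name] for i in self.session.get_inputs() if i.name in enc}
        token_embeddings = self.session.run(None, feed)[0]
        # Mean-pool token embeddings into one vector per input sentence.
        mask = enc["attention_mask"][..., None].astype(np.float32)
        pooled = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)
        return {"embeddings": pooled.tolist()}


def build(args: dict) -> Application:
    # `serve run examples.embedding_inference:build` calls this application builder.
    return EmbeddingModel.bind()
```

Running `serve run examples.embedding_inference:build` invokes the `build` application builder, starts Serve (plus a local Ray instance if none is running), and exposes the deployment on port 8000; calling `serve.run(build({}))` from a Python script has the same effect.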
More examples coming soon. Please feel free to open an issue for specific example requests! :)
- Use ONNX Runtime Execution Providers: Try GPU providers (CUDA, ROCm) or specialized hardware providers (e.g., TensorRT) to speed up your inference.
- Batching: Ray Serve allows request batching to improve throughput. Configure `max_batch_size` and `batch_wait_timeout_s` for your deployment (see the sketch after this list).
- Parallelization: Ray allows easy scaling with multiple replicas; each replica can host the ONNX model in memory.
- Profiling: Use Ray’s built-in metrics or external profilers to identify bottlenecks (I/O, CPU/GPU usage, etc.).
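To make the batching and parallelization tips concrete, here is a hedged sketch that combines `@serve.batch` with multiple replicas; the model path, batch size, and timeout are placeholders to tune for your workload.

```python
# Illustrative batching + replica configuration; tune the numbers for your model.
from typing import List

import numpy as np
import onnxruntime as ort
from starlette.requests import Request

from ray import serve


@serve.deployment(num_replicas=2)  # each replica holds its own ONNX session in memory
class BatchedOnnxModel:
    def __init__(self) -> None:
        # "model.onnx" is a placeholder path; swap in "CUDAExecutionProvider" or
        # "TensorrtExecutionProvider" to try GPU/TensorRT execution providers.
        self.session = ort.InferenceSession(
            "model.onnx", providers=["CPUExecutionProvider"]
        )

    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.05)
    async def predict(self, inputs: List[np.ndarray]) -> List[np.ndarray]:
        # Ray Serve gathers up to 16 concurrent requests (or whatever arrives
        # within 50 ms) and hands them to this method as a list.
        batch = np.stack(inputs)
        input_name = self.session.get_inputs()[0].name
        outputs = self.session.run(None, {input_name: batch})[0]
        # Return one result per request, in the order the requests arrived.
        return list(outputs)

    async def __call__(self, request: Request) -> dict:
        array = np.asarray((await request.json())["inputs"], dtype=np.float32)
        result = await self.predict(array)
        return {"outputs": result.tolist()}


app = BatchedOnnxModel.bind()
```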
Contributions are welcome! If you want to add new examples, fix bugs, or propose new features:
- Fork the repository.
- Create a new branch for your contribution:

  ```bash
  git checkout -b feature/my-new-feature
  ```

- Commit your changes:

  ```bash
  git commit -m "Add some feature"
  ```

- Push to the branch:

  ```bash
  git push origin feature/my-new-feature
  ```

- Create a pull request in this repository describing your changes.
This project is licensed under the MIT License. Feel free to use, modify, and distribute the code in this repository. See the LICENSE file for full details.
If you encounter any issues or have any questions, please open an issue. I welcome all feedback and contributions. Happy serving!