Support Cross encoder models (vllm-project#10400)
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Flavia Beo <[email protected]>
Co-authored-by: Flavia Beo <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
2 people authored and mfournioux committed Nov 28, 2024
1 parent 4ac4813 commit 42adcc8
Showing 28 changed files with 1,370 additions and 62 deletions.
142 changes: 142 additions & 0 deletions docs/source/serving/openai_compatible_server.md
@@ -44,6 +44,148 @@ We currently support the following OpenAI APIs:
- This enables multi-modal inputs to be passed to embedding models, see [Using VLMs](../models/vlm.rst).
- *Note: You should run `vllm serve` with `--task embedding` to ensure that the model is being run in embedding mode.*

## Score API for Cross Encoder Models

vLLM supports *cross encoder models* at the **/v1/score** endpoint, which is not part of the standard OpenAI API. You can find documentation for this kind of model at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

A ***Cross Encoder*** takes exactly two sentences or texts as input and predicts either a score or a label for the pair. For example, it can predict the similarity of the sentence pair on a scale of 0 to 1.

### Example of usage for a pair of a string and a list of texts

In this case, the model compares the first text to each of the texts in the list.

```bash
curl -X 'POST' \
'http://127.0.0.1:8000/v1/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"text_1": "What is the capital of France?",
"text_2": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
```

Response:

```json
{
"id": "score-request-id",
"object": "list",
"created": 693570,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": [
0.001094818115234375
]
},
{
"index": 1,
"object": "score",
"score": [
1
]
}
],
"usage": {}
}
```
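In the response above, each entry's `index` matches the position of the corresponding text in `text_2`, and each `score` is a one-element list. The following is a minimal sketch of mapping the scores back onto the candidate texts and ranking them. It assumes exactly the response shape shown above; `rank_candidates` is a hypothetical helper name and not part of vLLM.

```python
from typing import Any, Dict, List, Tuple

# Response body copied from the example above (trimmed to the fields used here).
response: Dict[str, Any] = {
    "object": "list",
    "model": "BAAI/bge-reranker-v2-m3",
    "data": [
        {"index": 0, "object": "score", "score": [0.001094818115234375]},
        {"index": 1, "object": "score", "score": [1]},
    ],
}

candidates = [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
]


def rank_candidates(response: Dict[str, Any],
                    candidates: List[str]) -> List[Tuple[str, float]]:
    """Attach each score to its candidate text and sort best match first."""
    scored = [(candidates[item["index"]], float(item["score"][0]))
              for item in response["data"]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_candidates(response, candidates)
for text, score in ranked:
    print(f"{score:.4f}  {text}")
```

Looking candidates up by `index` rather than by list position keeps the mapping correct even if the entries were ever returned in a different order.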

### Example of usage for a pair of two lists of texts

In this case, the model compares the texts pairwise, matching each text in `text_1` with the text at the same index in `text_2`.

```bash
curl -X 'POST' \
'http://127.0.0.1:8000/v1/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"text_1": [
"What is the capital of Brazil?",
"What is the capital of France?"
],
"text_2": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
```

Response:

```json
{
"id": "score-request-id",
"object": "list",
"created": 693447,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": [
1
]
},
{
"index": 1,
"object": "score",
"score": [
1
]
}
],
"usage": {}
}
```
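The `curl` requests in this section can also be assembled in Python. The sketch below uses only the standard library and builds the request without sending it, so it runs without a server. `build_score_request` is a hypothetical helper name, and the URL and model are the ones used in the examples above.

```python
import json
import urllib.request
from typing import List, Union

TextInput = Union[str, List[str]]


def build_score_request(api_url: str, model: str, text_1: TextInput,
                        text_2: TextInput) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/score endpoint."""
    payload = {"model": model, "text_1": text_1, "text_2": text_2}
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_score_request(
    "http://127.0.0.1:8000/v1/score",
    "BAAI/bge-reranker-v2-m3",
    ["What is the capital of Brazil?", "What is the capital of France?"],
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)

# Sending the request requires a running vLLM server, for example:
#   with urllib.request.urlopen(request) as resp:
#       scores = json.load(resp)
```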

### Example of usage for a pair of two strings

In this case, the model compares the two strings directly.

```bash
curl -X 'POST' \
'http://127.0.0.1:8000/v1/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"text_1": "What is the capital of France?",
"text_2": "The capital of France is Paris."
}'
```

Response:

```json
{
"id": "score-request-id",
"object": "list",
"created": 693447,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": [
1
]
}
],
"usage": {}
}
```
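The three input shapes shown above differ only in how `text_1` and `text_2` are combined into sentence pairs. The sketch below restates those pairing rules in plain Python. It is an illustration of the behavior described in this section, not the server's actual implementation; `make_pairs` is a hypothetical name, and rejecting mismatched list lengths or a list paired with a single string is an assumption of this sketch, since the examples do not cover those cases.

```python
from typing import List, Tuple, Union

TextInput = Union[str, List[str]]


def make_pairs(text_1: TextInput, text_2: TextInput) -> List[Tuple[str, str]]:
    """Combine the two inputs into the sentence pairs that get scored."""
    if isinstance(text_1, str) and isinstance(text_2, str):
        # Two strings: a single pair.
        return [(text_1, text_2)]
    if isinstance(text_1, str):
        # String and list: the string is compared to every text in the list.
        return [(text_1, candidate) for candidate in text_2]
    if isinstance(text_2, str):
        # Assumption: this combination is not shown in the examples above.
        raise ValueError("a list in text_1 needs a list in text_2")
    if len(text_1) != len(text_2):
        # Assumption: index-wise pairing needs lists of equal length.
        raise ValueError("text_1 and text_2 must have the same length")
    # Two lists: pair the texts that share the same index.
    return list(zip(text_1, text_2))


print(make_pairs("What is the capital of France?",
                 ["The capital of Brazil is Brasilia.",
                  "The capital of France is Paris."]))
```

The number of pairs equals the number of entries in the response's `data` list: two for the first two examples and one for the last.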

## Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API.
58 changes: 58 additions & 0 deletions examples/openai_cross_encoder_score.py
@@ -0,0 +1,58 @@
"""Example Python client for the Score API for cross encoder models
"""

import argparse
import pprint
from typing import Any, Dict

import requests


def post_http_request(prompt: Dict[str, Any], api_url: str) -> requests.Response:
    headers = {"User-Agent": "Test Client"}
    response = requests.post(api_url, headers=headers, json=prompt)
    return response


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, default="localhost")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--model", type=str, default="BAAI/bge-reranker-v2-m3")
    args = parser.parse_args()
    api_url = f"http://{args.host}:{args.port}/v1/score"

    model_name = args.model

    text_1 = "What is the capital of France?"
    text_2 = [
        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
    ]
    prompt = {"model": model_name, "text_1": text_1, "text_2": text_2}
    score_response = post_http_request(prompt=prompt, api_url=api_url)
    print("Prompt when text_1 is a string and text_2 is a list:")
    pprint.pprint(prompt)
    print("Score Response:")
    pprint.pprint(score_response.json())

    text_1 = [
        "What is the capital of Brazil?", "What is the capital of France?"
    ]
    text_2 = [
        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
    ]
    prompt = {"model": model_name, "text_1": text_1, "text_2": text_2}
    score_response = post_http_request(prompt=prompt, api_url=api_url)
    print("Prompt when text_1 and text_2 are both lists:")
    pprint.pprint(prompt)
    print("Score Response:")
    pprint.pprint(score_response.json())

    text_1 = "What is the capital of Brazil?"
    text_2 = "The capital of Brazil is Brasilia."
    prompt = {"model": model_name, "text_1": text_1, "text_2": text_2}
    score_response = post_http_request(prompt=prompt, api_url=api_url)
    print("Prompt when text_1 and text_2 are both strings:")
    pprint.pprint(prompt)
    print("Score Response:")
    pprint.pprint(score_response.json())
20 changes: 20 additions & 0 deletions tests/conftest.py
@@ -265,6 +265,7 @@ def __init__(
        model_kwargs: Optional[Dict[str, Any]] = None,
        is_embedding_model: bool = False,
        is_sentence_transformer: bool = False,
        is_cross_encoder: bool = False,
        skip_tokenizer_init: bool = False,
        auto_cls: Type[_BaseAutoModelClass] = AutoModelForCausalLM,
        postprocess_inputs: Callable[..., BatchEncoding] = identity,
@@ -282,6 +283,14 @@ def __init__(
                    device="cpu",
                    trust_remote_code=True,
                ).to(dtype=torch_dtype))
        elif is_cross_encoder:
            # Lazy init required for AMD CI
            from sentence_transformers import CrossEncoder
            self.model = CrossEncoder(model_name,
                                      device="cpu",
                                      trust_remote_code=True)
            self.model.model = self.wrap_device(self.model.model)\
                .to(dtype=torch_dtype)
        else:
            model_kwargs = model_kwargs if model_kwargs is not None else {}
            self.model = self.wrap_device(
@@ -625,6 +634,9 @@ def generate_encoder_decoder_greedy_logprobs_limit(
    def encode(self, prompts: List[str]) -> List[List[torch.Tensor]]:
        return self.model.encode(prompts)

    def predict(self, prompts: List[List[str]]) -> torch.Tensor:
        return self.model.predict(prompts, convert_to_tensor=True)

    def __enter__(self):
        return self

@@ -898,6 +910,14 @@ def encode(
        req_outputs = self.model.encode(inputs)
        return [req_output.outputs.embedding for req_output in req_outputs]

    def score(
        self,
        text_1: Union[str, List[str]],
        text_2: Union[str, List[str]],
    ) -> List[List[float]]:
        req_outputs = self.model.score(text_1, text_2)
        return [req_output.outputs.embedding for req_output in req_outputs]

    def __enter__(self):
        return self

93 changes: 93 additions & 0 deletions tests/entrypoints/openai/test_score.py
@@ -0,0 +1,93 @@
import pytest
import requests

from vllm.entrypoints.openai.protocol import ScoreResponse

from ...utils import RemoteOpenAIServer

MODEL_NAME = "BAAI/bge-reranker-v2-m3"


@pytest.fixture(scope="module")
def server():
    args = [
        "--enforce-eager",
    ]

    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_text_1_str_text_2_list(server: RemoteOpenAIServer,
                                      model_name: str):
    text_1 = "What is the capital of France?"
    text_2 = [
        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
    ]

    score_response = requests.post(server.url_for("v1/score"),
                                   json={
                                       "model": model_name,
                                       "text_1": text_1,
                                       "text_2": text_2,
                                   })
    score_response.raise_for_status()
    score = ScoreResponse.model_validate(score_response.json())

    assert score.id is not None
    assert score.data is not None
    assert len(score.data) == 2
    assert score.data[0].score[0] <= 0.01
    assert score.data[1].score[0] >= 0.9


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_text_1_list_text_2_list(server: RemoteOpenAIServer,
                                       model_name: str):
    text_1 = [
        "What is the capital of the United States?",
        "What is the capital of France?"
    ]
    text_2 = [
        "The capital of Brazil is Brasilia.", "The capital of France is Paris."
    ]

    score_response = requests.post(server.url_for("v1/score"),
                                   json={
                                       "model": model_name,
                                       "text_1": text_1,
                                       "text_2": text_2,
                                   })
    score_response.raise_for_status()
    score = ScoreResponse.model_validate(score_response.json())

    assert score.id is not None
    assert score.data is not None
    assert len(score.data) == 2
    assert score.data[0].score[0] <= 0.01
    assert score.data[1].score[0] >= 0.9


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_text_1_str_text_2_str(server: RemoteOpenAIServer,
                                     model_name: str):
    text_1 = "What is the capital of France?"
    text_2 = "The capital of France is Paris."

    score_response = requests.post(server.url_for("v1/score"),
                                   json={
                                       "model": model_name,
                                       "text_1": text_1,
                                       "text_2": text_2,
                                   })
    score_response.raise_for_status()
    score = ScoreResponse.model_validate(score_response.json())

    assert score.id is not None
    assert score.data is not None
    assert len(score.data) == 1
    assert score.data[0].score[0] >= 0.9