feat: add background processing jobs (#5432)
# Description

This PR adds the following changes:

- [x] Add `rq` to help us execute background jobs.
- [x] Add a background job to update all records for a dataset when the dataset distribution strategy is updated (a hedged enqueueing sketch follows this description).
- [x] Change the Hugging Face Dockerfile to install Redis and run `rq` workers inside the honcho Procfile.
- [x] Add documentation about the new `ARGILLA_REDIS_URL` environment variable.
- [x] Add a ping to Redis so the Argilla server is not started if Redis is not ready.
- [x] Change the Argilla docker compose file to include a container with Redis and rq workers.
- [x] Update the Argilla server `README.md` file, adding Redis as a dependency to install.
- [x] Add documentation about Redis being a new Argilla server dependency.
- [x] Add a `BACKGROUND_NUM_WORKERS` environment variable to specify the number of workers in the HF Space container.
- [ ] ~~Modify the `Dockerfile` template on HF to include the environment variable~~ #5443
  ```
  # (since: v2.2.0) Uncomment the next line to specify the number of background job workers to run (default: 2).
  # ENV BACKGROUND_NUM_WORKERS=2
  ```
- [ ] Remove some `TODO` sections before merging.
- [ ] Review K8s documentation (maybe delete it?).
- [ ] If we want to persist Redis data on HF Spaces, we can change our `Procfile` Redis process to the following:
  ```
  redis: /usr/bin/redis-server --dbfilename argilla-redis.rdb --dir ${ARGILLA_HOME_PATH}
  ```
- [ ] <del>Allow testing job workers synchronously (with pytest)</del> This is not working due to asyncio issues (running an asynchronous loop inside another one; more info here: rq/rq#1986).

Closes #5431

# Benchmarks

The following timings were obtained by updating the distribution strategy of a dataset with 100 and 10,000 records, using a basic and an upgraded CPU on HF Spaces, with and without persistent storage, and measuring how long the background job takes to complete:

- CPU basic: 2 vCPU, 16GB RAM
- CPU upgrade: 8 vCPU, 32GB RAM

* CPU basic (with persistent storage):
  * 100 records dataset: ~8 seconds.
  * 10,000 records dataset: ~9 minutes.
* CPU upgrade (with persistent storage):
  * 100 records dataset: ~5 seconds.
  * 10,000 records dataset: ~6 minutes.
* CPU basic (no persistent storage):
  * 10,000 records dataset: ~101 seconds.
* CPU upgrade (no persistent storage):
  * 10,000 records dataset: ~62 seconds.

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Testing it on HF Spaces.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Damián Pumar <[email protected]>
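The core addition is the background job that recomputes record statuses when a dataset's distribution strategy changes. As a rough illustration of how such an `rq` job is typically triggered, here is a minimal, hypothetical enqueueing sketch; the module path and the calling function are assumptions, and only `update_dataset_records_status_job` comes from this diff:

```
# Hypothetical sketch: enqueueing the background job after a dataset's
# distribution strategy has been updated. The module path and the calling
# function are assumed for illustration; `.delay()` is provided by rq's
# @job decorator and pushes the call onto the job's queue instead of
# running it inline.
from uuid import UUID

from argilla_server.jobs.dataset_jobs import update_dataset_records_status_job  # path assumed


def on_distribution_strategy_updated(dataset_id: UUID) -> None:
    update_dataset_records_status_job.delay(dataset_id)
```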
1 parent fee1f5a · commit 84c8aa7 · Showing 24 changed files with 308 additions and 107 deletions.
```
@@ -1,2 +1,4 @@
elastic: /usr/share/elasticsearch/bin/elasticsearch
redis: /usr/bin/redis-server
worker: sleep 30; rq worker-pool --num-workers ${BACKGROUND_NUM_WORKERS}
argilla: sleep 30; /bin/bash start_argilla_server.sh
```
```
@@ -1 +1,2 @@
honcho
rq ~= 1.16.2
```
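Besides pulling in `honcho` and `rq`, the description mentions pinging Redis so the Argilla server does not start while Redis is unreachable. A minimal sketch of such a readiness check, using only standard redis-py calls and a hypothetical helper name:

```
# Hypothetical readiness check: fail fast if Redis is not reachable yet.
# The helper name and its call site are assumptions for illustration.
from redis import Redis
from redis.exceptions import ConnectionError as RedisConnectionError


def assert_redis_is_ready(redis_url: str) -> None:
    try:
        # PING returns True when the server is up and accepting commands.
        Redis.from_url(redis_url).ping()
    except RedisConnectionError as e:
        raise RuntimeError(f"Redis at {redis_url!r} is not ready yet") from e
```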
```
@@ -0,0 +1,37 @@
# Copyright 2021-present, the Recognai S.L. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import typer

from typing import List

from argilla_server.jobs.queues import DEFAULT_QUEUE

DEFAULT_NUM_WORKERS = 2


def worker(
    queues: List[str] = typer.Option([DEFAULT_QUEUE.name], help="Name of queues to listen"),
    num_workers: int = typer.Option(DEFAULT_NUM_WORKERS, help="Number of workers to start"),
) -> None:
    from rq.worker_pool import WorkerPool
    from argilla_server.jobs.queues import REDIS_CONNECTION

    worker_pool = WorkerPool(
        connection=REDIS_CONNECTION,
        queues=queues,
        num_workers=num_workers,
    )

    worker_pool.start()
```
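The `worker` command above builds an rq `WorkerPool` listening on the given queues. How it is registered in the Argilla CLI is not shown in this excerpt; a hedged sketch of wiring it into a Typer app (the app object, module path, and command name are assumptions) could look like:

```
# Hypothetical wiring of the worker command into a Typer CLI app.
# The app object and command name are assumptions for illustration.
import typer

from argilla_server.cli.worker import worker  # module path assumed

app = typer.Typer()
app.command(name="worker")(worker)

if __name__ == "__main__":
    app()  # e.g. `python cli.py worker --num-workers 2`
```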
```
@@ -0,0 +1,14 @@
# Copyright 2021-present, the Recognai S.L. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```
```
@@ -0,0 +1,54 @@
# Copyright 2021-present, the Recognai S.L. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from uuid import UUID

from rq import Retry
from rq.decorators import job

from sqlalchemy import func, select

from argilla_server.models import Record, Response
from argilla_server.database import AsyncSessionLocal
from argilla_server.jobs.queues import DEFAULT_QUEUE
from argilla_server.search_engine.base import SearchEngine
from argilla_server.settings import settings
from argilla_server.contexts import distribution

JOB_TIMEOUT_DISABLED = -1
JOB_RECORDS_YIELD_PER = 100


@job(DEFAULT_QUEUE, timeout=JOB_TIMEOUT_DISABLED, retry=Retry(max=3))
async def update_dataset_records_status_job(dataset_id: UUID):
    """This Job updates the status of all the records in the dataset when the distribution strategy changes."""

    record_ids = []

    async with AsyncSessionLocal() as db:
        stream = await db.stream(
            select(Record.id)
            .join(Response)
            .where(Record.dataset_id == dataset_id)
            .order_by(Record.inserted_at.asc())
            .execution_options(yield_per=JOB_RECORDS_YIELD_PER)
        )

        async for record_id in stream.scalars():
            record_ids.append(record_id)

    # NOTE: We are updating the records status outside the database transaction to avoid database locks with SQLite.
    async with SearchEngine.get_by_name(settings.search_engine) as search_engine:
        for record_id in record_ids:
            await distribution.update_record_status(search_engine, record_id)
```
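The job above imports `DEFAULT_QUEUE` and `REDIS_CONNECTION` from `argilla_server.jobs.queues`, which is not shown in this excerpt. A minimal sketch of what such a module might contain, assuming the connection is built from the new `ARGILLA_REDIS_URL` setting (the exact settings attribute name is an assumption):

```
# Hypothetical sketch of argilla_server.jobs.queues: a shared Redis
# connection plus a default rq queue used by the @job decorator.
# The settings attribute name (`redis_url`) is an assumption.
from redis import Redis
from rq import Queue

from argilla_server.settings import settings

REDIS_CONNECTION = Redis.from_url(settings.redis_url)

DEFAULT_QUEUE = Queue("default", connection=REDIS_CONNECTION)
```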