[FEAT] hub integration with dataset and configuration #5161

burtenshaw · 2024-07-04T12:36:20Z

This is an experimental WIP PR to get feedback on the approach. I've migrated the code out of the v1 argilla client and reimplemented it for the v2 changes. It would be great to get feedback on issues like these:

- Testing with a mocked Hub API.
- Dealing with responses across Argilla servers and mismatched user_ids.
- Dealing with dependencies like huggingface_hub. I like the decorator used in the v1 client.
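For the dependency point, here is a minimal sketch of the decorator idea. It is not the v1 decorator itself: the name `requires_dependencies`, the error message, and the two example functions are assumptions made for illustration. The check runs at call time, so importing the module never requires the optional package.

```python
import functools
import importlib.util
from typing import Any, Callable, TypeVar

F = TypeVar("F", bound=Callable[..., Any])


def requires_dependencies(*packages: str) -> Callable[[F], F]:
    """Hypothetical decorator: raise a clear ImportError if an optional package is missing."""

    def decorator(func: F) -> F:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Look up each top-level package without importing it.
            missing = [name for name in packages if importlib.util.find_spec(name) is None]
            if missing:
                raise ImportError(
                    f"{func.__name__} requires: {', '.join(missing)}. "
                    f"Install with `pip install {' '.join(missing)}`."
                )
            return func(*args, **kwargs)

        return wrapper  # type: ignore[return-value]

    return decorator


@requires_dependencies("json")  # stdlib stand-in for an optional package that is installed
def export_settings(settings: dict) -> str:
    import json  # lazy import, only reached once the check has passed

    return json.dumps(settings, sort_keys=True)


@requires_dependencies("surely_missing_package_xyz")  # stand-in for a missing optional package
def push_to_missing_backend() -> None:
    raise AssertionError("never reached while the dependency is missing")
```

In the hub integration, the same pattern could wrap the export and import methods with the relevant package names, so users without the extras only see an error when they actually call those methods.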

Here's a dataset that I pushed. It still uses the default README from v1:

https://huggingface.co/datasets/burtenshaw/test-argilla-dataset

To test this implementation, run:

import uuid
from datetime import datetime

import argilla as rg

client = rg.Argilla(api_key="owner.apikey")
workspace = client.workspaces[0]
mock_dataset_name = (
    f"test_add_record_with_suggestions {datetime.now().strftime('%Y%m%d%H%M%S')}"
)
mock_data = [
    {
        "text": "Hello World, how are you?",
        "label": "positive",
        "id": uuid.uuid4(),
        "comment": "I'm doing great, thank you!",
        "topics": ["topic1", "topic2"],
        "topics.score": [0.9, 0.8],
    },
    {
        "text": "Hello World, how are you?",
        "label": "negative",
        "id": uuid.uuid4(),
        "comment": "I'm doing great, thank you!",
        "topics": ["topic3"],
        "topics.score": [0.9],
    },
    {
        "text": "Hello World, how are you?",
        "label": "positive",
        "id": uuid.uuid4(),
        "comment": "I'm doing great, thank you!",
        "comment_score": 0.9,  # mapped to the score of the "comment" suggestion in the mapping below
        "rating": 1,
        "topics": ["topic1", "topic2", "topic3"],
        "topics.score": [0.9, 0.8, 0.7],
        "ranking": ["label1", "label2", "label3"],
        "span": [
            {
                "start": 0,
                "end": 5,
                "label": "label1",
            },
            {
                "start": 6,
                "end": 11,
                "label": "label2",
            },
            {
                "start": 12,
                "end": 17,
                "label": "label3",
            },
        ],
        "vector": [1, 2, 3],
    },
]
settings = rg.Settings(
    fields=[
        rg.TextField(name="text"),
    ],
    questions=[
        rg.LabelQuestion(name="label", labels=["positive", "negative"]),
        rg.RatingQuestion(name="rating", values=[1, 2, 3, 4, 5]),
        rg.RankingQuestion(name="ranking", values=["label1", "label2", "label3"]),
        rg.TextQuestion(name="comment", use_markdown=False),
        rg.MultiLabelQuestion(
            name="topics",
            labels=["topic1", "topic2", "topic3"],
            labels_order="suggestion",
        ),
        rg.SpanQuestion(
            name="span", labels=["label1", "label2", "label3"], field="text"
        ),
    ],
    metadata=[
        rg.FloatMetadataProperty(name="comment_score"),
    ],
    vectors=[
        rg.VectorField(name="vector", dimensions=3),
    ],
)
dataset = rg.Dataset(
    name=mock_dataset_name,
    settings=settings,
    client=client,
)
dataset.create()
dataset.records.log(
    mock_data,
    mapping={
        "comment": "comment.suggestion",
        "comment_score": "comment.suggestion.score",  # score attached to the "comment" suggestion
        "topics": "topics.suggestion",
        "topics.score": "topics.suggestion.score",
        "label": "label.response",
        "span": "span",
    },
)

dataset.to_hub(repo_id="burtenshaw/test-argilla-dataset")


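# Pull the dataset back from the Hub. `from_huggingface` was never defined in this script;
# `rg.Dataset.from_hub` (the Argilla v2 method) is assumed to be the intended call.
pulled_dataset = rg.Dataset.from_hub(repo_id="burtenshaw/test-argilla-dataset")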
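
On the mocked Hub API question raised above, one possible shape for a unit test is sketched below. The helper `push_records_to_hub` and the `records.json` file name are hypothetical and are not the code in this PR. The only assumption about the real client is that it exposes `create_repo` and `upload_file` with the keyword arguments used here, as `huggingface_hub.HfApi` does. With huggingface_hub installed, `unittest.mock.create_autospec(HfApi)` would give stricter signature checking than a bare `MagicMock`.

```python
import json
from unittest.mock import MagicMock


def push_records_to_hub(api, repo_id: str, records: list) -> None:
    """Hypothetical export helper: `api` is any object shaped like huggingface_hub.HfApi."""
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    # default=str serializes values such as UUIDs that json cannot encode natively.
    payload = json.dumps(records, default=str).encode("utf-8")
    api.upload_file(
        path_or_fileobj=payload,
        path_in_repo="records.json",  # assumed file name, for illustration only
        repo_id=repo_id,
        repo_type="dataset",
    )


def test_push_records_to_hub_uses_dataset_repo() -> None:
    api = MagicMock(name="HfApi")  # no network access and no token needed
    records = [{"text": "Hello World, how are you?", "label": "positive"}]

    push_records_to_hub(api, "user/test-argilla-dataset", records)

    api.create_repo.assert_called_once_with(
        repo_id="user/test-argilla-dataset", repo_type="dataset", exist_ok=True
    )
    upload_kwargs = api.upload_file.call_args.kwargs
    assert upload_kwargs["path_in_repo"] == "records.json"
    assert json.loads(upload_kwargs["path_or_fileobj"]) == records


test_push_records_to_hub_uses_dataset_repo()
```

Passing the API client in as an argument keeps the export logic testable without patching module globals.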

Reviewing and improving records.log. Instead of (screenshot 1: https://github.com/argilla-io/argilla/assets/2518789/02283f4c-fe6a-464f-96b3-36853e6c7622) for 50 records, records.log can log 1000 (screenshot 2: https://github.com/argilla-io/argilla/assets/2518789/d20f0469-0b33-427e-aa12-b4b7e1d40cd1).
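
Two commit titles in this PR describe the ingestion change as "move mapping out of record loop" and "extract dot notation with regex not string splitting". The sketch below illustrates that idea only. The `Route` type, the pattern, and both function names are assumptions, and the real ingestion code handles more cases (ids, fields, vectors, unmapped keys).

```python
import re
from typing import Any, Dict, List, NamedTuple, Optional

# Assumed grammar: "<name>", optionally ".suggestion" or ".response", optionally ".score".
ROUTE_PATTERN = re.compile(
    r"^(?P<name>\w+)(?:\.(?P<kind>suggestion|response))?(?:\.(?P<attr>score))?$"
)


class Route(NamedTuple):
    name: str            # question or field name, e.g. "topics"
    kind: Optional[str]  # "suggestion", "response", or None
    attr: Optional[str]  # "score" or None


def parse_mapping(mapping: Dict[str, str]) -> Dict[str, Route]:
    """Parse every dot-notation target once, failing early on invalid ones."""
    routes: Dict[str, Route] = {}
    for source_key, target in mapping.items():
        match = ROUTE_PATTERN.match(target)
        if match is None:
            raise ValueError(f"Invalid mapping target: {target!r}")
        routes[source_key] = Route(match["name"], match["kind"], match["attr"])
    return routes


def route_records(
    records: List[Dict[str, Any]], mapping: Dict[str, str]
) -> List[Dict[Route, Any]]:
    routes = parse_mapping(mapping)  # resolved once, before the record loop
    routed = []
    for record in records:
        # Only mapped keys are routed here; unmapped keys are dropped in this sketch.
        routed.append({routes[key]: value for key, value in record.items() if key in routes})
    return routed
```

Because the mapping is parsed before the loop, the per-record work reduces to dictionary lookups, and an invalid target is reported before any record is processed.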

argilla/src/argilla/datasets/_export/_hub.py

Co-authored-by: Paco Aranda

Wauplin

Final review from my side. Integration looks good to me! ✔️
Left 2 small comments regarding future maintenance.

argilla/src/argilla/datasets/_export/_hub.py

This PR introduces a new guide on exporting datasets and/or their records. It moves content out of the current 'query and export' guide and creates a guide dedicated solely to export.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Vila Suero

burtenshaw and others added 30 commits June 24, 2024 20:29

test: update tests for refactored mapping method

10965d3

refactor: introduce independent mapping method and move logic to before ingestion loop

a416a2f

docs: update all doc strings in dataset records

35db9f6

chore: improve typing and docs on type

eae088b

docs: wrong method in records api reference

4490d11

feat: add exception for record ingestion

b5b3396

refactor: improve explainabilitity and readability in ingestion code with refactor and tqdm and exceptions

ffeb0b0

[pre-commit.ci] auto fixes from pre-commit.com hooks

594283e

for more information, see https://pre-commit.ci

enhancement: move mapping out of record loop

16f14d1

Merge branch 'spike/mapping-to-tuple' of https://github.com/argilla-io/argilla into spike/mapping-to-tuple

5f06e20

enhancement: use just one progress bar

05df51a

chore: update typing of mapping

863dde2

fix: move render mapping into infer record method

bf9e864

[pre-commit.ci] auto fixes from pre-commit.com hooks

07aa249

fix: align add records parameters with render function

8a6d484

feat: implement ingestion mapping as class

0b623fd

feat: use ingestion mapping class in dataset records not dataset records

14faccf

[pre-commit.ci] auto fixes from pre-commit.com hooks

e2bfc88

chore: tidy imports

8889c0a

Merge branch 'spike/mapping-to-tuple' of https://github.com/argilla-io/argilla into spike/mapping-to-tuple

7ad5075

docs: update mapping parameters in how to guides

63e0f7b

test: broaden suggestion mapping in test

ecbdd4e

feat: add flat for including records in export

ebd77e4

feat: extract dot notation with regex not string splitting

99235b2

[pre-commit.ci] auto fixes from pre-commit.com hooks

3ca8932

docs: typo in docs

3eb8c0d

feat: improve record switch in ingest method

b10fbe8

feat: refactor id mapping away from dict to type

db27e1b

[pre-commit.ci] auto fixes from pre-commit.com hooks

716dfa8

pre-commit-ci bot and others added 5 commits July 11, 2024 12:14

[pre-commit.ci] auto fixes from pre-commit.com hooks

4cf072f

fix: deprecate _hf_datasets helper submodule due to dependency change

a0dd106

Merge branch 'feat/hub-integration' of https://github.com/argilla-io/argilla into feat/hub-integration

799ddc9

Merge branch 'feat/hub-integration' of https://github.com/argilla-io/argilla into feat/hub-integration

6f1b719

Merge branch 'feat/hub-integration' of https://github.com/argilla-io/argilla into feat/hub-integration

20a10d4

frascuchon reviewed Jul 11, 2024
