[FEAT] hub integration with dataset and configuration #5161

Merged: burtenshaw merged 106 commits into develop from feat/hub-integration on Jul 12, 2024

@burtenshaw burtenshaw commented Jul 4, 2024

This is an experimental WIP PR to get feedback on the approach. I've migrated the code out of the v1 argilla client and reimplemented it for the v2 changes. It would be great to get feedback on issues like these:

  • testing with a mocked hub api.
  • dealing with responses across argilla servers and mismatched user_ids.
  • dealing with dependencies like huggingface_hub. I like the decorator used in the v1 client.
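
For the optional-dependency point, a decorator along these lines is one way to keep `huggingface_hub` out of the hard requirements. This is only a sketch: `requires_dependency` and `push_to_hub` are hypothetical names, not the v1 client's decorator, and the install hint assumes the pip package name matches the module name.

```python
import functools
import importlib.util


def requires_dependency(module_name: str):
    """Hypothetical decorator: raise a clear error if an optional package is missing."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be found.
            if importlib.util.find_spec(module_name) is None:
                raise ImportError(
                    f"`{module_name}` is required to call `{func.__name__}`. "
                    f"Install it with `pip install {module_name}`."
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependency("huggingface_hub")
def push_to_hub(repo_id: str) -> str:
    # Hypothetical export entry point; the real work would import
    # huggingface_hub lazily inside this function body.
    return repo_id
```

The check runs at call time, so importing the module that defines `push_to_hub` never fails for users who do not need the Hub export.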

Here's a dataset that I pushed. It still uses the default README from v1:

https://huggingface.co/datasets/burtenshaw/test-argilla-dataset

To test this implementation, run:

import uuid
from datetime import datetime

import argilla as rg

client = rg.Argilla(api_key="owner.apikey")
workspace = client.workspaces[0]
mock_dataset_name = (
    f"test_add_record_with_suggestions {datetime.now().strftime('%Y%m%d%H%M%S')}"
)
mock_data = [
    {
        "text": "Hello World, how are you?",
        "label": "positive",
        "id": uuid.uuid4(),
        "comment": "I'm doing great, thank you!",
        "topics": ["topic1", "topic2"],
        "topics.score": [0.9, 0.8],
    },
    {
        "text": "Hello World, how are you?",
        "label": "negative",
        "id": uuid.uuid4(),
        "comment": "I'm doing great, thank you!",
        "topics": ["topic3"],
        "topics.score": [0.9],
    },
    {
        "text": "Hello World, how are you?",
        "label": "positive",
        "id": uuid.uuid4(),
        "comment": "I'm doing great, thank you!",
        "comment_score": 0.9,  # Mapped to the score of the "comment" suggestion in records.log
        "rating": 1,
        "topics": ["topic1", "topic2", "topic3"],
        "topics.score": [0.9, 0.8, 0.7],
        "ranking": ["label1", "label2", "label3"],
        "span": [
            {
                "start": 0,
                "end": 5,
                "label": "label1",
            },
            {
                "start": 6,
                "end": 11,
                "label": "label2",
            },
            {
                "start": 12,
                "end": 17,
                "label": "label3",
            },
        ],
        "vector": [1, 2, 3],
    },
]
settings = rg.Settings(
    fields=[
        rg.TextField(name="text"),
    ],
    questions=[
        rg.LabelQuestion(name="label", labels=["positive", "negative"]),
        rg.RatingQuestion(name="rating", values=[1, 2, 3, 4, 5]),
        rg.RankingQuestion(name="ranking", values=["label1", "label2", "label3"]),
        rg.TextQuestion(name="comment", use_markdown=False),
        rg.MultiLabelQuestion(
            name="topics",
            labels=["topic1", "topic2", "topic3"],
            labels_order="suggestion",
        ),
        rg.SpanQuestion(
            name="span", labels=["label1", "label2", "label3"], field="text"
        ),
    ],
    metadata=[
        rg.FloatMetadataProperty(name="comment_score"),
    ],
    vectors=[
        rg.VectorField(name="vector", dimensions=3),
    ],
)
dataset = rg.Dataset(
    name=mock_dataset_name,
    settings=settings,
    client=client,
)
dataset.create()
dataset.records.log(
    mock_data,
    mapping={
        "comment": "comment.suggestion",
        "comment_score": "comment.suggestion.score",  # Score for the "comment" suggestion
        "topics": "topics.suggestion",
        "topics.score": "topics.suggestion.score",
        "label": "label.response",
        "span": "span",
    },
)

dataset.to_hub(repo_id="burtenshaw/test-argilla-dataset")


# Pull the dataset back from the Hub
pulled_dataset = rg.Dataset.from_hub(repo_id="burtenshaw/test-argilla-dataset")
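
On the question of testing with a mocked Hub API, a unit test could avoid the network entirely by passing in a mock object. The sketch below is an assumption about how that might look, not the PR's test suite: `push_records` is a hypothetical helper, and `api` stands in for an object shaped like `huggingface_hub.HfApi` (only `create_repo` and `upload_file` are used).

```python
import json
from unittest.mock import MagicMock


def push_records(api, repo_id: str, records: list) -> None:
    """Hypothetical export helper; `api` is expected to behave like huggingface_hub.HfApi."""
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    payload = json.dumps(records).encode("utf-8")
    api.upload_file(
        path_or_fileobj=payload,
        path_in_repo="records.json",
        repo_id=repo_id,
        repo_type="dataset",
    )


def test_push_records_uses_hub_api():
    # The mock records every call, so no request ever reaches the Hub.
    mock_api = MagicMock()
    push_records(mock_api, "user/test-dataset", [{"text": "Hello World"}])

    mock_api.create_repo.assert_called_once_with(
        repo_id="user/test-dataset", repo_type="dataset", exist_ok=True
    )
    upload_kwargs = mock_api.upload_file.call_args.kwargs
    assert upload_kwargs["repo_id"] == "user/test-dataset"
    assert json.loads(upload_kwargs["path_or_fileobj"]) == [{"text": "Hello World"}]
```

Injecting the API object keeps the test independent of whether `huggingface_hub` is installed; patching the client where the export code looks it up would be the alternative if the helper builds its own client.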

burtenshaw and others added 30 commits June 24, 2024 20:29
Reviewing and improving records.log

Before, logging 50 records looked like this:

<img width="1335" alt="Screenshot 2024-06-26 at 12 48 14"
src="https://github.com/argilla-io/argilla/assets/2518789/02283f4c-fe6a-464f-96b3-36853e6c7622">

Now records.log can log 1000 records:

<img width="870" alt="Screenshot 2024-06-26 at 12 48 57"
src="https://github.com/argilla-io/argilla/assets/2518789/d20f0469-0b33-427e-aa12-b4b7e1d40cd1">

@Wauplin Wauplin left a comment

Final review from my side. Integration looks good to me! ✔️
Left 2 small comments regarding future maintenance.

Two review threads on argilla/src/argilla/datasets/_export/_hub.py (outdated, resolved)
burtenshaw and others added 9 commits July 12, 2024 13:27
This PR introduces a new guide on exporting datasets and/or their
records. It moves content out of the current 'query and export' guide
and creates a dedicated export guide.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Vila Suero
@burtenshaw burtenshaw merged commit 9daefee into develop Jul 12, 2024
7 checks passed
@burtenshaw burtenshaw deleted the feat/hub-integration branch July 12, 2024 12:56
Linked issue: [FEATURE] Hub integration with v2 SDK
5 participants