-
Notifications
You must be signed in to change notification settings - Fork 398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DOCS] add migration notebook to docs format #5002
Conversation
Hi @burtenshaw, looks good. Adding it to guides makes sense to me too. I have not taken the time to look at the migration guides yet but I think we might be able to make the migration a lot more user-friendly by providing code that maps everything to the Datasets using the 2.0 SDK automatically. TextClassification -> TextField ("name=text") | MultiLabel|LabelQuestion ("name=label") + metadata configs It takes us some more time to figure out the logic and mappings but will save a lot of community members a lot of time and effort too. @frascuchon @burtenshaw WDYT? |
The document says that Feedback datasets do not need to be migrated, but if you have a server with legacy+feedback datasets I assume that one would like to move all of them over to the new server, right? @burtenshaw We may also want to discuss what to do with the migration of predictions to suggestions in multi-label text classification datasets. If I'm not mistaken, with the current code all labels would be selected as suggestions (as in legacy datasets you can have a prediction for each label). Maybe we can add a threshold that the user can modify so only labels where |
@nataliaElv I don't think the FeedbackDataset need to be moved. The underlying server logic is the same but the way of interacting with the server through the new SDK as API is different. |
@nataliaElv Good catch. I think we need to provide a warning and on let the user choose to add the score processing mechanism, which might be as simple as choosing the highest or lowest one. |
This is a nice suggestion but I think it depends on use, and the metrics are declining on legacy datasets. Personally, I would implement this based on community engagement with this guide. |
@nataliaElv Is on to something here. The guide proposes that the v2 server is 'their new server'. So from a user persepctive, they would want their datasets migrated. I will add the a snippet that shows them how to |
I would add this to the tab with multilabel text classification as a note box. |
argilla/docs/guides/how_to_guides/migrate_from_legacy_datasets.md
Outdated
Show resolved
Hide resolved
argilla/docs/guides/how_to_guides/migrate_from_legacy_datasets.md
Outdated
Show resolved
Hide resolved
argilla/docs/guides/how_to_guides/migrate_from_legacy_datasets.md
Outdated
Show resolved
Hide resolved
argilla/docs/guides/how_to_guides/migrate_from_legacy_datasets.md
Outdated
Show resolved
Hide resolved
4452759
to
fd979dd
Compare
Helper functions for record mapping have been reviewed and tested with several datasets.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #5002 +/- ##
============================================
+ Coverage 61.13% 91.93% +30.80%
============================================
Files 328 135 -193
Lines 17670 5818 -11852
============================================
- Hits 10802 5349 -5453
+ Misses 6868 469 -6399
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Co-authored-by: Paco Aranda <[email protected]>
Co-authored-by: Paco Aranda <[email protected]>
## Migrating Feedback Datasets on your Argilla 1.* server | ||
|
||
As mentioned above, `FeedbackDataset`'s are compatible with Argilla V2 and do not need to be reformatted. However, you may want to migrate your feedback datasets to the new server so that you can deprecate your Argilla 1.* server. Here is a guide on how to migrate your feedback datasets: | ||
|
||
```python | ||
import argilla as rg | ||
|
||
# Initialize the API with an Argilla server less than 2.0 | ||
old_client = rg.Argilla(old_server_api_url, old_server_api_key) | ||
new_client = rg.Argilla(new_server_api_url, new_server_api_key) | ||
|
||
dataset_name = "feedback-dataset" | ||
old_dataset = old_client.datasets(name=dataset_name) | ||
new_dataset = new_client.datasets.add(old_dataset) | ||
|
||
# Load the records from the old server | ||
new_dataset.records.log( | ||
old_dataset.records( | ||
with_responses=True, # (1) | ||
with_suggestions=True, | ||
with_vectors=True, | ||
) | ||
) | ||
``` | ||
|
||
1. The `with_responses`, `with_suggestions`, `with_vectors`, and `with_metadata` flags are used to load the records with the responses, suggestions, vectors, and metadata respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This scenario has several problems.
- We cannot share resources between clients, or even create new resources from other resources. Those actions will modify the internal state of the original resource, so old_dataset settings will change when the new_dataset is created
- Copy responses from one server to another requires a revision of response users. User IDs won't exist in the target server so we should provide a user mapping in the same way we do for legacy datasets.
Not sure how common would be this scenario. But I propose to skip this for now and think a bit more about how the SDK should work for this. cc @burtenshaw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, suggestions are linked to questions by a question ID. When we do:
new_dataset.records.log(
list(old_dataset.records(with_responses=False, with_suggestions=True, with_vectors=True)),
)
an error will be raised since the error contains wrong question info for the provided dataset.
This PR transfer the notebook guide by @frascuchon and @nataliaElv over to the documentation format, which makes it easier to follow.
You can find the page here: http://localhost:8000/argilla-python/guides/how_to_guides/migrate_from_legacy_datasets/
n.b. the page is not available via the menu. @davidberenstein1957 we should figure out navigation presentation for this page.
The guide looks like this: