Update image preference dataset
davidberenstein1957 committed Aug 14, 2024
1 parent bbf45ac commit b8eee26
Showing 2 changed files with 10 additions and 307 deletions.
2 changes: 1 addition & 1 deletion argilla/docs/tutorials/image_classification.ipynb
@@ -211,7 +211,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Even if we have created the dataset, it still lacks the information to be annotated (you can check it in the UI). We will use the `ylecun/mnist` dataset from [the Hugging Face Hub](https://huggingface.co/datasets/ylecun/mnist). Specifically, we will use the `train` split and get `100` examples. \n",
"Even if we have created the dataset, it still lacks the information to be annotated (you can check it in the UI). We will use the `ylecun/mnist` dataset from [the Hugging Face Hub](https://huggingface.co/datasets/ylecun/mnist). Specifically, we will use `100` examples. Because we are dealing with a potentially large image dataset, we will set `streaming=True` to avoid loading the entire dataset into memory and iterate over the data to lazily load it.\n",
"\n",
"!!! tip\n",
" When working with Hugging Face dataset you can set `Image(decode=False)` so that we can get [public image URLs](https://huggingface.co/docs/datasets/en/image_load#local-files), however, this depends on the dataset."
315 changes: 9 additions & 306 deletions argilla/docs/tutorials/image_preference.ipynb
@@ -219,14 +219,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add records"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even if we have created the dataset, it still lacks the information to be annotated (you can check it in the UI). We will use the `openbmb/RLAIF-V-Dataset` dataset from [the Hugging Face Hub](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset). Specifically, we will use the `train` split and get `100` examples. Because we are dealing with a large dataset, we will set `streaming=True` to avoid loading the entire dataset into memorym and iterate over the data to lazily load it.\n",
"## Add records\n",
"\n",
"Even if we have created the dataset, it still lacks the information to be annotated (you can check it in the UI). We will use the `openbmb/RLAIF-V-Dataset` dataset from [the Hugging Face Hub](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset). Specifically, we will use `100` examples. Because we are dealing with a potentially large image dataset, we will set `streaming=True` to avoid loading the entire dataset into memory and iterate over the data to lazily load it.\n",
"\n",
"!!! tip\n",
" When working with Hugging Face dataset you can set `Image(decode=False)` so that we can get [public image URLs](https://huggingface.co/docs/datasets/en/image_load#local-files), however, this depends on the dataset."
@@ -370,7 +365,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Log into Argilla\n"
"### Log to Argilla\n"
]
},
{
@@ -387,20 +382,23 @@
"outputs": [],
"source": [
"hf_dataset = hf_dataset.add_column(\"id\", range(len(hf_dataset)))\n",
"dataset.records.log(records=hf_dataset[:100], mapping={\n",
"dataset.records.log(records=hf_dataset, mapping={\n",
" \"image_data_uri\": \"image\",\n",
" \"idx\": \"id\",\n",
" \"question\": \"question\",\n",
" \"chosen\": \"chosen\",\n",
" \"rejected\": \"rejected\",\n",
" \"task_type\": \"task_type\",\n",
" \"question_vector\": \"question_vector\",\n",
" \"origin_dataset\": \"origin_dataset\"\n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Voilà! We have added the suggestions to the dataset, and they will appear in the UI marked with a ✨. "
"Voilà! We have also added the suggestions to the dataset for the `chosen` `rejected` pairs, and they will appear in the UI marked with a ✨. "
]
},
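
Logging columns that map onto questions is what fills in these suggestions; they can also be attached explicitly when building records. A hedged sketch, with field and question names invented for illustration rather than taken from this tutorial's schema:

```python
import argilla as rg

# Attach a pre-filled answer as a suggestion; it will show up in the UI
# marked with a sparkle until an annotator confirms or corrects it.
# "image" and "preference" are hypothetical names for this sketch.
record = rg.Record(
    fields={"image": "https://example.com/some-image.png"},
    suggestions=[rg.Suggestion(question_name="preference", value="chosen")],
)
dataset.records.log(records=[record])
```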
{
@@ -425,301 +423,6 @@
" Check this [how-to guide](../how_to_guides/annotate.md) to know more about annotating in the UI."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train your model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the annotation, we will have a robust dataset to train the main model. In our case, we will fine-tune using transformers and the . However, you can select the one that best fits your requirements. So, let's start by retrieving the annotated records.\n",
"\n",
"!!! note\n",
" Check this [how-to guide](../how_to_guides/query.md) to know more about filtering and querying in Argilla. Also, you can check the Hugging Face docs on [fine-tuning an image classification model](https://huggingface.co/docs/transformers/en/tasks/image_classification)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Formatting the data"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"dataset = client.datasets(\"image_classification_dataset\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"status_filter = rg.Query(filter=rg.Filter((\"response.status\", \"==\", \"submitted\")))\n",
"\n",
"submitted = dataset.records(status_filter).to_list(flatten=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then need to convert our base64 images to a format that the model can understand so we will convert them to PIL images again."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"def base64_to_pil(base64_string):\n",
" image_data = re.sub('^data:image/.+;base64,', '', base64_string)\n",
" image = Image.open(io.BytesIO(base64.b64decode(image_data)))\n",
" return image"
]
},
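
For reference, the helper above presumes a handful of imports made earlier in the notebook (outside this hunk):

```python
import base64  # decode the base64 payload
import io      # wrap raw bytes in a file-like object
import re      # strip the data-URI prefix

from PIL import Image  # rebuild a PIL image from the bytes
```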
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's apply that to the whole dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"submitted_pil_image = [\n",
" {\n",
" \"id\": sample[\"id\"],\n",
" \"image\": base64_to_pil(sample[\"image\"]),\n",
" \"label\": sample[\"image_label.responses\"][0],\n",
" }\n",
" for sample in submitted\n",
"]\n",
"submitted_pil_image[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now need to ensure our images are forwarded with the correct dimensions. Because the original MNIST dataset is greyscale and the VIT model expects RGB, we need to add a channel dimension to the images. We will do this by stacking the images along the channel axis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def greyscale_to_rgb(img) -> Image:\n",
" return Image.merge('RGB', (img, img, img))\n",
"\n",
"submitted_pil_image_rgb = [\n",
" {\n",
" \"image\": greyscale_to_rgb(sample[\"image\"]),\n",
" \"label\": sample[\"label\"],\n",
" }\n",
" for sample in submitted_pil_image\n",
"]\n",
"submitted_pil_image_rgb[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will load the `ImageProcessor` for fine-tuning the model. This processor will handle the image resizing and normalization in order to be compatible with the model we intend to use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"checkpoint = \"google/vit-base-patch16-224-in21k\"\n",
"processor = AutoImageProcessor.from_pretrained(checkpoint)\n",
"\n",
"submitted_pil_image_rgb_processed = [\n",
" {\n",
" \"pixel_values\": processor(sample[\"image\"], return_tensors='pt')[\"pixel_values\"],\n",
" \"label\": sample[\"label\"],\n",
" }\n",
" for sample in submitted_pil_image_rgb\n",
"]\n",
"submitted_pil_image_rgb_processed[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now convert the images to a Hugging Face datasets Dataset that is ready for fine-tuning."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prepared_ds = Dataset.from_list(submitted_pil_image_rgb_processed)\n",
"prepared_ds = prepared_ds.train_test_split(test_size=0.2)\n",
"prepared_ds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The actual training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then need to define our data collator, which will ensure the data is unpacked and stacked correctly for the model. We wi"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def collate_fn(batch):\n",
" return {\n",
" 'pixel_values': torch.stack([torch.tensor(x['pixel_values'][0]) for x in batch]),\n",
" 'labels': torch.tensor([int(x['label']) for x in batch])\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we can define our training metrics. We will use the accuracy metric to evaluate the model's performance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metric = load_metric(\"accuracy\")\n",
"def compute_metrics(p):\n",
" return metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then load our model and configure the labels that we will use for training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = AutoModelForImageClassification.from_pretrained(\n",
" checkpoint,\n",
" num_labels=len(labels),\n",
" id2label={int(i): int(c) for i, c in enumerate(labels)},\n",
" label2id={int(c): int(i) for i, c in enumerate(labels)}\n",
")\n",
"model.config"
]
},
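
The `labels` variable is defined earlier in the notebook, outside this hunk; for the ten MNIST digit classes it would plausibly be something like the following (an assumption, not the tutorial's verbatim code):

```python
# Ten digit classes; string labels keep the int() casts above meaningful.
labels = [str(digit) for digit in range(10)]
```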
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we define the training arguments and start the training process."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"training_args = TrainingArguments(\n",
" output_dir=\"./image-classifier\",\n",
" per_device_train_batch_size=16,\n",
" evaluation_strategy=\"steps\",\n",
" num_train_epochs=1,\n",
" fp16=False, # True if you have a GPU with mixed precision support\n",
" save_steps=100,\n",
" eval_steps=100,\n",
" logging_steps=10,\n",
" learning_rate=2e-4,\n",
" save_total_limit=2,\n",
" remove_unused_columns=True,\n",
" push_to_hub=False,\n",
" load_best_model_at_end=True,\n",
")\n",
"\n",
"trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" data_collator=collate_fn,\n",
" compute_metrics=compute_metrics,\n",
" train_dataset=prepared_ds[\"train\"],\n",
" eval_dataset=prepared_ds[\"test\"],\n",
" tokenizer=processor,\n",
")\n",
"\n",
"train_results = trainer.train()\n",
"trainer.save_model()\n",
"trainer.log_metrics(\"train\", train_results.metrics)\n",
"trainer.save_metrics(\"train\", train_results.metrics)\n",
"trainer.save_state()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the training data had a better-quality, we can expect a better model. So, we can update the remainder of our original dataset with the new model's suggestions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipe = pipeline(\"image-classification\", model=model, image_processor=processor)\n",
"\n",
"def run_inference(batch):\n",
" predictions = pipe(batch[\"image\"])\n",
" batch[\"image_label\"] = [prediction[0][\"label\"] for prediction in predictions]\n",
" batch[\"image_label.score\"] = [prediction[0][\"score\"] for prediction in predictions]\n",
" return batch\n",
"\n",
"hf_dataset = hf_dataset.map(run_inference, batched=True)\n",
"dataset.records.log(records=hf_dataset[:100], mapping={\"image_data_uri\": \"image\"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
