diff --git a/integrations/model-training/sagemaker/log_custom_scripts/huggingface-transformers-cifar/sagemaker_notebook.ipynb b/integrations/model-training/sagemaker/log_custom_scripts/huggingface-transformers-cifar/sagemaker_notebook.ipynb
new file mode 100644
index 0000000..d7f0c58
--- /dev/null
+++ b/integrations/model-training/sagemaker/log_custom_scripts/huggingface-transformers-cifar/sagemaker_notebook.ipynb
@@ -0,0 +1,601 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "Z-1m1KP-Jtek"
+   },
+   "source": [
+    "# Huggingface Sagemaker - Vision Transformer\n",
+    "\n",
+    "### Image Classification with `google/vit` on `cifar10`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "7oCkA-P1Jtet"
+   },
+   "source": [
+    "1. [Introduction](#Introduction)  \n",
+    "2. [Development Environment and Permissions](#Development-Environment-and-Permissions)\n",
+    "    1. [Installation](#Installation)  \n",
+    "    2. [Permissions](#Permissions)\n",
+    "3. [Preprocessing](#Preprocessing)  \n",
+    "    1. [Convert features and transform images](#Convert-features-and-transform-images)  \n",
+    "    2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket)  \n",
+    "4. [Fine-tuning & starting Sagemaker Training Job](#Fine-tuning-\\&-starting-Sagemaker-Training-Job)  \n",
+    "    1. [Creating an Estimator and starting a training job](#Creating-an-Estimator-and-starting-a-training-job)  "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "mM3N4p52Jtew"
+   },
+   "source": [
+    "# Introduction\n",
+    "\n",
+    "Welcome to our end-to-end image classification example. In this demo, we will use the Hugging Face `transformers` and `datasets` libraries together with Amazon SageMaker to fine-tune a pre-trained Vision Transformer for image classification.\n",
+    "\n",
+    "The script and notebook are inspired by [Niels Rogge's](https://github.com/NielsRogge) example notebook [Fine-tune the Vision Transformer on CIFAR-10](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_the_%F0%9F%A4%97_Trainer.ipynb). Niels also contributed the Vision Transformer implementation to `transformers`.\n",
+    "\n",
+    "\n",
+    "_**NOTE: You can run this demo in SageMaker Studio, on your local machine, or on a SageMaker Notebook Instance.**_"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "SphiTeR7Jtey"
+   },
+   "source": [
+    "![Bildschirmfoto%202021-06-09%20um%2010.08.22.png](attachment:Bildschirmfoto%202021-06-09%20um%2010.08.22.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "DFTSuPhdJte0"
+   },
+   "source": [
+    "# Development Environment and Permissions\n",
+    "\n",
+    "\n",
+    "_**Use at least a `t3.large` instance, otherwise preprocessing will take a long time.**_"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "UIQyuqMTJte2"
+   },
+   "source": [
+    "## Installation\n",
+    "\n",
+    "_*Note:* we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow, if not already installed._"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "AD-mEXfoJte3"
+   },
+   "outputs": [],
+   "source": [
+    "%pip install \"comet_ml>=3.44.0\" \"sagemaker>=2.140.0\" \"transformers~=4.36.1\" \"datasets\" s3fs \"torch~=2.1.0\" --upgrade"
+   ]
+  },
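+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This example logs the training run to [Comet](https://www.comet.com/site/). The cell below is a minimal sketch of one way to configure your Comet credentials locally, assuming you already have a Comet account and API key; alternatively, you can set the `COMET_API_KEY` environment variable or create a `~/.comet.config` file yourself."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import comet_ml\n",
+    "\n",
+    "# Prompts for an API key (if one is not already configured) and saves it locally,\n",
+    "# so the estimator cell further below can read it from the Comet config.\n",
+    "comet_ml.login()"
+   ]
+  },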
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "YaYTTk-GJte7"
+   },
+   "source": [
+    "## Permissions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "v-I65aPoJte_"
+   },
+   "source": [
+    "_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about this [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "-UP-YR-wJtfF",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import sagemaker\n",
+    "import boto3\n",
+    "\n",
+    "# Uncomment if you need to use a specific AWS profile\n",
+    "# boto3.setup_default_session(profile_name=\"profile\")\n",
+    "\n",
+    "sess = sagemaker.Session()\n",
+    "# sagemaker session bucket -> used for uploading data, models and logs\n",
+    "# sagemaker will automatically create this bucket if it does not exist\n",
+    "sagemaker_session_bucket = None\n",
+    "if sagemaker_session_bucket is None and sess is not None:\n",
+    "    # set to default bucket if a bucket name is not given\n",
+    "    sagemaker_session_bucket = sess.default_bucket()\n",
+    "\n",
+    "role = None\n",
+    "\n",
+    "# Uncomment if you need to use a specific AWS SageMaker role\n",
+    "# role = \"arn:aws:iam::276069367280:role/service-role/AmazonSageMaker-ExecutionRole-20240620T150642\"\n",
+    "\n",
+    "if role is None:\n",
+    "    try:\n",
+    "        role = sagemaker.get_execution_role()\n",
+    "    except ValueError:\n",
+    "        iam = boto3.client(\"iam\")\n",
+    "        role = iam.get_role(RoleName=\"sagemaker_execution_role\")[\"Role\"][\"Arn\"]\n",
+    "\n",
+    "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
+    "\n",
+    "print(f\"sagemaker role arn: {role}\")\n",
+    "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n",
+    "print(f\"sagemaker session region: {sess.boto_region_name}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(role)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "YemPLk8LJtfI"
+   },
+   "source": [
+    "# Preprocessing\n",
+    "\n",
+    "We are using the `datasets` library to download and preprocess the `cifar10` dataset. After preprocessing, the dataset will be uploaded to our `sagemaker_session_bucket` to be used within our training job. The [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset is a labeled subset of the 80 million tiny images dataset. It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "krpuo-EKJtfL"
+   },
+   "source": [
+    "_Note from Niels: \"in the ViT paper, the best results were obtained when fine-tuning at a higher resolution. For this, one interpolates the pre-trained absolute position embeddings.\"_\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "PPhwZGW1JtfO"
+   },
+   "source": [
+    "## Convert features and transform images"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "2haEdKSBJtfP",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from transformers import AutoProcessor\n",
+    "from datasets import load_dataset\n",
+    "import numpy as np\n",
+    "from PIL import Image\n",
+    "from random import randint\n",
+    "\n",
+    "# dataset used\n",
+    "dataset_name = \"cifar10\"\n",
+    "\n",
+    "# s3 key prefix for the data\n",
+    "s3_prefix = \"samples/datasets/cifar10\"\n",
+    "\n",
+    "# image processor used in preprocessing\n",
+    "model_name = \"google/vit-base-patch16-224-in21k\"\n",
+    "\n",
+    "image_processor = AutoProcessor.from_pretrained(model_name)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "SWSSbO9SJtfR"
+   },
+   "source": [
+    "We downsample the dataset to make preprocessing faster."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "NGN-KTtjJtfS",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# load dataset\n",
+    "train_dataset, test_dataset = load_dataset(\n",
+    "    dataset_name, split=[\"train[:500]\", \"test[:200]\"]\n",
+    ")\n",
+    "\n",
+    "# display the first sample\n",
+    "train_dataset[0][\"img\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "W6vcCLiUJtfU",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from datasets import Features, Array3D\n",
+    "\n",
+    "# we need to extend the features with the pixel values produced by the image processor\n",
+    "features = Features(\n",
+    "    {\n",
+    "        **train_dataset.features,\n",
+    "        \"pixel_values\": Array3D(dtype=\"float32\", shape=(3, 224, 224)),\n",
+    "    }\n",
+    ")\n",
+    "\n",
+    "# preprocessing helper function\n",
+    "def preprocess_images(examples):\n",
+    "    # get batch of images\n",
+    "    images = examples[\"img\"]\n",
+    "    inputs = image_processor(images=images)\n",
+    "    examples[\"pixel_values\"] = inputs[\"pixel_values\"]\n",
+    "\n",
+    "    return examples\n",
+    "\n",
+    "\n",
+    "# preprocess dataset\n",
+    "train_dataset = train_dataset.map(preprocess_images, batched=True, features=features)\n",
+    "test_dataset = test_dataset.map(preprocess_images, batched=True, features=features)\n",
+    "\n",
+    "# set to torch format for training\n",
+    "train_dataset.set_format(\"torch\", columns=[\"pixel_values\", \"label\"])\n",
+    "test_dataset.set_format(\"torch\", columns=[\"pixel_values\", \"label\"])\n",
+    "\n",
+    "# remove unused column\n",
+    "train_dataset = train_dataset.remove_columns(\"img\")"
+   ]
+  },
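+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an optional sanity check (a small sketch added for illustration), you can confirm that each processed example now carries a `pixel_values` tensor of shape `(3, 224, 224)` together with its `label`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# inspect one processed example: pixel_values should be a float tensor of shape (3, 224, 224)\n",
+    "sample = train_dataset[0]\n",
+    "print(sample[\"pixel_values\"].shape)\n",
+    "print(sample[\"label\"])"
+   ]
+  },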
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "R0LdSmWtJtfW"
+   },
+   "source": [
+    "## Uploading data to `sagemaker_session_bucket`\n",
+    "\n",
+    "After we have processed the datasets, we use the `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) of `datasets` to upload them to S3."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "pITn2ip0JtfX",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# `save_to_disk` can write directly to S3 because `s3fs` is installed (see the installation cell above)\n",
+    "\n",
+    "# save train_dataset to s3\n",
+    "training_input_path = f\"s3://{sess.default_bucket()}/{s3_prefix}/train\"\n",
+    "train_dataset.save_to_disk(training_input_path, num_shards=1)\n",
+    "\n",
+    "# save test_dataset to s3\n",
+    "test_input_path = f\"s3://{sess.default_bucket()}/{s3_prefix}/test\"\n",
+    "test_dataset.save_to_disk(test_input_path, num_shards=1)\n",
+    "\n",
+    "print(f\"train dataset is uploaded to {training_input_path}\")\n",
+    "print(f\"test dataset is uploaded to {test_input_path}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Training code\n",
+    "\n",
+    "Here is our training script, which the cell below writes to `src/train.py` (make sure a local `src/` directory exists first):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile src/train.py\n",
+    "\n",
+    "import comet_ml\n",
+    "from transformers import ViTForImageClassification, Trainer, TrainingArguments, default_data_collator\n",
+    "from datasets import load_from_disk, load_metric\n",
+    "import logging\n",
+    "import sys\n",
+    "import argparse\n",
+    "import os\n",
+    "import numpy as np\n",
+    "import subprocess\n",
+    "\n",
+    "subprocess.run([\n",
+    "    \"git\",\n",
+    "    \"config\",\n",
+    "    \"--global\",\n",
+    "    \"user.email\",\n",
+    "    \"sagemaker@huggingface.co\",\n",
+    "    ], check=True)\n",
+    "subprocess.run([\n",
+    "    \"git\",\n",
+    "    \"config\",\n",
+    "    \"--global\",\n",
+    "    \"user.name\",\n",
+    "    \"sagemaker\",\n",
+    "    ], check=True)\n",
+    "\n",
+    "\n",
+    "def main(args):\n",
+    "    # start a Comet experiment for this training run\n",
+    "    experiment = comet_ml.start()\n",
+    "\n",
+    "    # Set up logging\n",
+    "    logger = logging.getLogger(__name__)\n",
+    "\n",
+    "    logging.basicConfig(\n",
+    "        level=logging.getLevelName(\"INFO\"),\n",
+    "        handlers=[logging.StreamHandler(sys.stdout)],\n",
+    "        format=\"%(asctime)s - %(name)s - %(levelname)s - %(message)s\",\n",
+    "    )\n",
+    "\n",
+    "    # load datasets\n",
+    "    train_dataset = load_from_disk(args.training_dir)\n",
+    "    test_dataset = load_from_disk(args.test_dir)\n",
+    "    num_classes = train_dataset.features[\"label\"].num_classes\n",
+    "\n",
+    "    logger.info(f\"loaded train_dataset length is: {len(train_dataset)}\")\n",
+    "    logger.info(f\"loaded test_dataset length is: {len(test_dataset)}\")\n",
+    "\n",
+    "    metric_name = \"accuracy\"\n",
+    "    # compute metrics function for classification\n",
+    "    metric = load_metric(metric_name)\n",
+    "\n",
+    "    def compute_metrics(eval_pred):\n",
+    "        predictions, labels = eval_pred\n",
+    "        predictions = np.argmax(predictions, axis=1)\n",
+    "        return metric.compute(predictions=predictions, references=labels)\n",
+    "\n",
+    "    # download model from model hub\n",
+    "    model = ViTForImageClassification.from_pretrained(args.model_name, num_labels=num_classes)\n",
+    "\n",
+    "    # map class indices to the human-readable CIFAR-10 label names\n",
+    "    labels = train_dataset.features[\"label\"].names\n",
+    "    model.config.id2label = {index: label for index, label in enumerate(labels)}\n",
+    "    model.config.label2id = {label: index for index, label in enumerate(labels)}\n",
+    "\n",
+    "    # define training args\n",
+    "    training_args = TrainingArguments(\n",
+    "        output_dir=args.output_dir,\n",
+    "        num_train_epochs=args.num_train_epochs,\n",
+    "        per_device_train_batch_size=args.per_device_train_batch_size,\n",
+    "        per_device_eval_batch_size=args.per_device_eval_batch_size,\n",
+    "        warmup_steps=args.warmup_steps,\n",
+    "        weight_decay=args.weight_decay,\n",
+    "        evaluation_strategy=\"steps\",\n",
+    "        logging_dir=f\"{args.output_dir}/logs\",\n",
+    "        learning_rate=args.learning_rate,\n",
+    "        load_best_model_at_end=True,\n",
+    "        metric_for_best_model=metric_name,\n",
+    "    )\n",
+    "\n",
+    "    # create Trainer instance\n",
+    "    trainer = Trainer(\n",
+    "        model=model,\n",
+    "        args=training_args,\n",
+    "        compute_metrics=compute_metrics,\n",
+    "        train_dataset=train_dataset,\n",
+    "        eval_dataset=test_dataset,\n",
+    "        data_collator=default_data_collator,\n",
+    "    )\n",
+    "\n",
+    "    # train model\n",
+    "    trainer.train()\n",
+    "\n",
+    "    # evaluate model\n",
+    "    eval_result = trainer.evaluate(eval_dataset=test_dataset)\n",
+    "\n",
+    "    # write eval results to a file which can be accessed later in the S3 output\n",
+    "    with open(os.path.join(args.output_dir, \"eval_results.txt\"), \"w\") as writer:\n",
+    "        print(\"***** Eval results *****\")\n",
+    "        for key, value in sorted(eval_result.items()):\n",
+    "            writer.write(f\"{key} = {value}\\n\")\n",
+    "\n",
+    "    # save the model so SageMaker uploads it to S3\n",
+    "    trainer.save_model(args.output_dir)\n",
+    "\n",
+    "\n",
+    "if __name__ == \"__main__\":\n",
+    "\n",
+    "    parser = argparse.ArgumentParser()\n",
+    "\n",
+    "    # hyperparameters sent by the client are passed as command-line arguments to the script.\n",
+    "    parser.add_argument(\"--model_name\", type=str)\n",
+    "    parser.add_argument(\"--output_dir\", type=str, default=\"/opt/ml/model\")\n",
+    "    parser.add_argument(\"--extra_model_name\", type=str, default=\"sagemaker\")\n",
+    "    parser.add_argument(\"--dataset\", type=str, default=\"cifar10\")\n",
+    "    parser.add_argument(\"--task\", type=str, default=\"image-classification\")\n",
+    "\n",
+    "    parser.add_argument(\"--num_train_epochs\", type=int, default=3)\n",
+    "    parser.add_argument(\"--per_device_train_batch_size\", type=int, default=32)\n",
+    "    parser.add_argument(\"--per_device_eval_batch_size\", type=int, default=64)\n",
+    "    parser.add_argument(\"--warmup_steps\", type=int, default=500)\n",
+    "    parser.add_argument(\"--weight_decay\", type=float, default=0.01)\n",
+    "    parser.add_argument(\"--learning_rate\", type=float, default=2e-5)\n",
+    "\n",
+    "    parser.add_argument(\"--training_dir\", type=str, default=os.environ[\"SM_CHANNEL_TRAIN\"])\n",
+    "    parser.add_argument(\"--test_dir\", type=str, default=os.environ[\"SM_CHANNEL_TEST\"])\n",
+    "\n",
+    "    args, _ = parser.parse_known_args()\n",
+    "\n",
+    "    main(args)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And we need to add a few dependencies for the training job:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile src/requirements.txt\n",
+    "\n",
+    "comet_ml"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "s2qdxSWPJtfZ"
+   },
+   "source": [
+    "# Fine-tuning & starting Sagemaker Training Job\n",
+    "\n",
+    "In order to create a SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In the Estimator, we define which fine-tuning script should be used as `entry_point`, which `instance_type` should be used, which `hyperparameters` are passed in, and so on. Inside the training container, the script is then invoked roughly like this:\n",
+    "\n",
+    "```python\n",
+    "/opt/conda/bin/python train.py --num_train_epochs 1 --model_name google/vit-base-patch16-224-in21k --per_device_train_batch_size 16\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "VHkl_MyOJtfa"
+   },
+   "source": [
+    "## Creating an Estimator and starting a training job"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "hzad36OmJtfb",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from sagemaker.huggingface import HuggingFace\n",
+    "\n",
+    "# hyperparameters, which are passed into the training job\n",
+    "hyperparameters = {\n",
+    "    \"num_train_epochs\": 3,  # train epochs\n",
+    "    \"per_device_train_batch_size\": 16,  # batch size\n",
+    "    \"model_name\": model_name,  # model which will be trained on\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "buTX9A3-Jtfc",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import comet_ml.config\n",
+    "\n",
+    "# read the Comet API key from the local Comet configuration and forward it to the training job\n",
+    "COMET_API_KEY = comet_ml.config.get_config()[\"comet.api_key\"]\n",
+    "\n",
+    "huggingface_estimator = HuggingFace(\n",
+    "    entry_point=\"train.py\",\n",
+    "    source_dir=\"./src\",\n",
+    "    instance_type=\"ml.p3.2xlarge\",\n",
+    "    instance_count=1,\n",
+    "    role=role,\n",
+    "    transformers_version=\"4.36\",\n",
+    "    pytorch_version=\"2.1\",\n",
+    "    py_version=\"py310\",\n",
+    "    hyperparameters=hyperparameters,\n",
+    "    environment={\n",
+    "        \"COMET_API_KEY\": COMET_API_KEY,\n",
+    "    },\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "EvKi347tJtfe",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# start the training job with our uploaded datasets as input\n",
+    "huggingface_estimator.fit({\"train\": training_input_path, \"test\": test_input_path})"
+   ]
+  },
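+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "While the job runs, the Comet integration in `transformers` logs training metrics and parameters from `train.py` to your Comet workspace. As an optional last step (a minimal sketch using the standard SageMaker estimator attribute), you can print where SageMaker stored the trained model artifact once the job has finished:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# S3 location of the model artifact produced by the training job\n",
+    "print(huggingface_estimator.model_data)"
+   ]
+  }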
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "instance_type": "ml.t3.medium",
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}