From a14b3c5553799d87bca0c9e15430c243447b500c Mon Sep 17 00:00:00 2001 From: Brian Nord <184985+bnord@users.noreply.github.com> Date: Mon, 18 Nov 2024 11:59:50 -0600 Subject: [PATCH 1/3] upload notebook tutorial --- ...geClassificationWithTensorflow_Draft.ipynb | 1870 +++++++++++++++++ 1 file changed, 1870 insertions(+) create mode 100644 AI0_Intro_AI_ImageClassificationWithTensorflow_Draft.ipynb diff --git a/AI0_Intro_AI_ImageClassificationWithTensorflow_Draft.ipynb b/AI0_Intro_AI_ImageClassificationWithTensorflow_Draft.ipynb new file mode 100644 index 0000000..9939bcd --- /dev/null +++ b/AI0_Intro_AI_ImageClassificationWithTensorflow_Draft.ipynb @@ -0,0 +1,1870 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + " \n", + "
AI0: Introduction to AI-based Image Classification with TensorFlow
\n", + "Contact author: Brian Nord
\n", + "Last verified to run: YYYY-MM-DD
\n", + "LSST Science Pipelines version: ??
\n", + "Container size: medium
\n", + "Targeted learning level: beginner
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jce50kKEfHC1" + }, + "source": [ + "**Description:** An introduction to image classification with AI-based algorithms." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jce50kKEfHC1" + }, + "source": [ + "**Skills:** Examine AI training data, prepare it for a classification task, perform classification with a neural network, and examine the diagnostics of the classification task." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**LSST Data Products:** None; MNIST data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Packages:** numpy, matplotlib, sklearn, tensorflow" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Credits and Acknowledgments:** None" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Get Support:**\n", + "Find DP0-related documentation and resources at dp0.lsst.io. Questions are welcome as new topics in the Support - Data Preview 0 Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Introduction\n", + "\n", + "This Jupyter Notebook introduces artificial intelligence (AI)-based image classification. It demonstrates how to perform a few key steps:\n", + "1. examine and prepare data for classification;\n", + "2. train an AI algorithm;\n", + "3. plot diagnostics of the training performance;\n", + "4. make an initial assessment of those diagnostics. \n", + "\n", + "AI is a class of algorithms for building statistical models. These algorithms build models primarily by training on data, as opposed to models that use analytic formulae or that are based on physical reasoning. Machine learning is a subclass of AI algorithms -- e.g., random forests. Deep learning is a subclass of machine learning algorithms -- e.g., neural networks. 
\n", + "\n", + "This notebook uses `tensorflow`, one of the two most commonly used `python` libraries for deep learning. The `tensorflow` library is often easier to use because of how it handles data sets and the logic used for model building. However, it is typically also more difficult to use for developing network models creatively. We use `tensorflow` first in this series of tutorials so that users who are new to deep learning can focus on learning AI. In later tutorials, we will use `pytorch` because it is more flexible and more commonly used in science applications. \n", + "\n", + "This notebook uses [MNIST AI benchmarking data](https://en.wikipedia.org/wiki/MNIST_database). In a future notebook, we will use stars and galaxies drawn from DP0 data.\n", + "\n", + "The use of data in this notebook requires a medium-sized RAM allocation (8Gi).\n", + "\n", + "The end of this notebook contains a Glossary of Terms and a comment regarding usage of terms in AI contexts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%reload_ext pycodestyle_magic\n", + "%flake8_on\n", + "import logging\n", + "logging.getLogger(\"flake8\").setLevel(logging.FATAL)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V3xHhKu6c5-e" + }, + "source": [ + "### 1.1. Import Packages\n", + "\n", + "[`numpy`](https://numpy.org/) is a widely used Python library for computations and mathematical operations on multi-dimensional arrays.\n", + "\n", + "[`matplotlib`](https://matplotlib.org/) is a widely used Python plotting library. \n", + "\n", + "[`tensorflow`](https://www.tensorflow.org) is a widely used library from Google for fast tensor operations --- often used for building neural network models. \n", + "\n", + "[`sklearn`](https://scikit-learn.org/stable/) is a library for machine learning."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "puW54XTfdo1C" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import os\n", + "import datetime\n", + "\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.pyplot import cm\n", + "\n", + "import tensorflow as tf\n", + "\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix\n", + "from sklearn.metrics import roc_curve, roc_auc_score, auc, RocCurveDisplay\n", + "from sklearn.preprocessing import LabelBinarizer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1.2. Define Functions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def normalizeInputs(x_temp, input_minimum, input_maximum):\n", + " \"\"\"Min-max normalize data that are inputs to the neural network\n", + "\n", + " Parameters\n", + " ----------\n", + " x_temp: `numpy.ndarray`\n", + " image data\n", + " input_minimum: `float`\n", + " minimum value for normalization\n", + " input_maximum: `float`\n", + " maximum value for normalization\n", + "\n", + " Returns\n", + " -------\n", + " x_temp_norm: `numpy.ndarray`\n", + " normalized image data\n", + " \"\"\"\n", + " # divide by the full range so that the output spans 0 to 1\n", + " x_temp_norm = (x_temp - input_minimum)/(input_maximum - input_minimum)\n", + " return x_temp_norm" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def createFileUidTimestamp():\n", + " \"\"\"Create a timestamp for a filename.\n", + "\n", + " Parameters\n", + " ----------\n", + " None\n", + "\n", + " Returns\n", + " -------\n", + " file_uid_timestamp : `string`\n", + " String from date and time.\n", + " \"\"\"\n", + " file_uid_timestamp = datetime.datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", + " return file_uid_timestamp\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + 
"source": [ + "def createFileName(file_prefix=\"\", file_location=\"Data/Sandbox/\",\n", + " file_suffix=\"\", useuid=True, verbose=True):\n", + " \"\"\"Create a file name.\n", + "\n", + " Parameters\n", + " ----------\n", + " file_prefix: `string`\n", + " prefix of file name\n", + " file_location: `string`\n", + " path to file\n", + " file_suffix: `string`\n", + " suffix/extension of file name\n", + " useuid: 'bool'\n", + " choose to use a unique id\n", + " verbose: 'bool'\n", + " choose to print the file name\n", + "\n", + " Returns\n", + " -------\n", + " file_final: `string`\n", + " filename used for saving\n", + " \"\"\"\n", + " if useuid:\n", + " file_uid = createFileUidTimestamp()\n", + " else:\n", + " file_uid = \"\"\n", + "\n", + " file_final = file_location + file_prefix + \"_\" + file_uid + file_suffix\n", + "\n", + " if verbose:\n", + " print(file_final)\n", + "\n", + " return file_final\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plotArrayImageExamples(x_tra, y_tra, num=10,\n", + " save_file=False,\n", + " file_prefix=\"prediction_histogram\",\n", + " file_location=\"./\",\n", + " file_suffix=\".png\"):\n", + " \"\"\"Plot an array of examples of images and labels\n", + "\n", + " Parameters\n", + " ----------\n", + " x_tra: `numpy.ndarray`\n", + " training data images\n", + " y_tra: `numpy.ndarray`\n", + " training data labels\n", + " num: `int`, optional\n", + " number examples to plot\n", + " file_prefix: `string`, optional\n", + " prefix of file name\n", + " file_location: `string`, optional\n", + " path to file\n", + " file_suffix: `string`, optional\n", + " suffix/extension of file name\n", + "\n", + " Returns\n", + " -------\n", + " None\n", + " \"\"\"\n", + " num_row = 2\n", + " num_col = 5\n", + " images = x_tra[:num]\n", + " labels = y_tra[:num]\n", + "\n", + " fig, axes = plt.subplots(num_row, num_col,\n", + " figsize=(1.5*num_col, 2*num_row))\n", + " for i in 
range(num):\n", + " ax = axes[i//num_col, i%num_col]\n", + " ax.imshow(images[i], cmap='gray')\n", + " ax.set_title('Label: {}'.format(labels[i]))\n", + "\n", + " plt.tight_layout()\n", + "\n", + " if save_file:\n", + " file_final = createFileName(file_prefix=file_prefix,\n", + " file_location=file_location,\n", + " file_suffix=file_suffix,\n", + " useuid=True)\n", + " plt.savefig(file_final, bbox_inches='tight')\n", + "\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plotROCMulticlassOnevsrest(y_tra, y_tes, y_pred_tes, label_target_list,\n", + " color_list,\n", + " save_file=False,\n", + " file_prefix=\"prediction_histogram\",\n", + " file_location=\"./\",\n", + " file_suffix=\".png\"):\n", + " \"\"\"Plot Receiver Operating Characteristic curves, one-vs-rest\n", + "\n", + " Parameters\n", + " ----------\n", + " y_tra: `numpy.ndarray`\n", + " training data true labels\n", + " y_tes: `numpy.ndarray`\n", + " test data true labels\n", + " y_pred_tes: `numpy.ndarray`\n", + " test data predicted class probabilities\n", + " label_target_list: 'list'\n", + " color_list: 'list'\n", + "\n", + " Returns\n", + " -------\n", + " None\n", + " \"\"\"\n", + " fig, ax = plt.subplots(figsize=(6, 6))\n", + "\n", + " for label_target, color in zip(label_target_list, color_list):\n", + "\n", + " label_binarizer = LabelBinarizer().fit(y_tra)\n", + " y_onehot_tes = label_binarizer.transform(y_tes)\n", + "\n", + " class_id = np.flatnonzero(label_binarizer.classes_ == label_target)[0]\n", + "\n", + " display = RocCurveDisplay.from_predictions(\n", + " y_onehot_tes[:, class_id],\n", + " y_pred_tes[:, class_id],\n", + " name=f\"{label_target} vs the rest\",\n", + " color=color,\n", + " ax=ax,\n", + " plot_chance_level=(class_id == 0)\n", + " )\n", + "\n", + " _ = display.ax_.set(\n", + " xlabel=\"False Positive Rate\",\n", + " ylabel=\"True Positive Rate\",\n", + " title=\"ROC: One-vs-Rest\",\n", + " )\n", + 
"\n", + " if save_file:\n", + " file_final = createFileName(file_prefix=file_prefix,\n", + " file_location=file_location,\n", + " file_suffix=file_suffix,\n", + " useuid=True)\n", + "\n", + " plt.savefig(file_final, bbox_inches='tight')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plotROCMulticlassOnevsone(y_tra, y_tes, y_pred_tes, label_target_list,\n", + " color_list, save_file=False,\n", + " file_prefix=\"prediction_histogram\",\n", + " file_location=\"./\",\n", + " file_suffix=\".png\"):\n", + " \"\"\"Plot Receiver Operator Curve for one-vs-one scenario\n", + "\n", + " Parameters\n", + " ----------\n", + " y_tra: `numpy.ndarray`\n", + " training data true labels\n", + " y_tes: `numpy.ndarray`\n", + " test data true labels\n", + " y_pred_tes: `numpy.ndarray`\n", + " test data predicted labels\n", + " label_target_list: 'list'\n", + " color_list: 'list'\n", + " file_prefix: `string`, optional\n", + " prefix of file name\n", + " file_location: `string`, optional\n", + " path to file\n", + " file_suffix: `string`, optional\n", + " suffix/extension of file name\n", + "\n", + " Returns\n", + " -------\n", + " None\n", + " \"\"\"\n", + " fig, ax = plt.subplots(figsize=(6, 6))\n", + "\n", + " for label_target, color in zip(label_target_list, color_list):\n", + "\n", + " label_binarizer = LabelBinarizer().fit(y_tra)\n", + " y_onehot_tes = label_binarizer.transform(y_tes)\n", + "\n", + " class_id = np.flatnonzero(label_binarizer.classes_ == label_target)[0]\n", + "\n", + " display = RocCurveDisplay.from_predictions(\n", + " y_onehot_tes[:, class_id],\n", + " y_pred_tes[:, class_id],\n", + " name=f\"{label_target} vs the rest\",\n", + " color=color,\n", + " ax=ax,\n", + " plot_chance_level=(class_id == 0)\n", + " )\n", + "\n", + " _ = display.ax_.set(\n", + " xlabel=\"False Positive Rate\",\n", + " ylabel=\"True Positive Rate\",\n", + " title=\"ROC: One-vs-Rest\",\n", + " )\n", + "\n", + " if save_file:\n", + " file_final = createFileName(file_prefix=file_prefix,\n", + " 
file_location=file_location,\n", + " file_suffix=file_suffix,\n", + " useuid=True)\n", + "\n", + " plt.savefig(file_final, bbox_inches='tight')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plotArrayHistogramExamples(x_tra, y_tra, num=10,\n", + " save_file=False,\n", + " file_prefix=\"prediction_histogram\",\n", + " file_location=\"./\",\n", + " file_suffix=\".png\"):\n", + " \"\"\"Plot histograms of image pixel values\n", + "\n", + " Parameters\n", + " ----------\n", + " x_tra: `numpy.ndarray`\n", + " training image data\n", + " y_tra: `numpy.ndarray`\n", + " training label data\n", + " num: `int`, optional\n", + " number of examples to show\n", + " file_prefix: 'string', optional\n", + " prefix of file name\n", + " file_location: 'string', optional\n", + " path to file\n", + " file_suffix: 'string', optional\n", + " suffix/extension of file name\n", + "\n", + " Returns\n", + " -------\n", + " None\n", + " \"\"\"\n", + " n_bins = 10\n", + " num = 10\n", + " num_row = 2\n", + " num_col = 5\n", + " images = x_tra[:num]\n", + " labels = y_tra[:num]\n", + "\n", + " fig, axes = plt.subplots(num_row, num_col,\n", + " figsize=(1.5*num_col, 2*num_row))\n", + "\n", + " for i in range(num):\n", + " ax = axes[i//num_col, i%num_col]\n", + " ax.hist(images[i], bins=n_bins)\n", + " ax.set_title('Label: {}'.format(labels[i]))\n", + "\n", + " plt.tight_layout()\n", + "\n", + " if save_file:\n", + " file_final = createFileName(file_prefix=file_prefix,\n", + " file_location=file_location,\n", + " file_suffix=file_suffix,\n", + " useuid=True)\n", + "\n", + " plt.savefig(file_final, bbox_inches='tight')\n", + "\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plotPredictionHistogram(y_prediction_a, y_prediction_b=None,\n", + " y_prediction_c=None, n_classes=None,\n", + " n_objects_a=None, n_colors=None,\n", + " 
title_a=None, title_b=None,\n", + " title_c=None, label_a=None,\n", + " label_b=None, label_c=None,\n", + " alpha=0.5, figsize=(12, 5),\n", + " save_file=False,\n", + " file_prefix=\"prediction_histogram\",\n", + " file_location=\"./\",\n", + " file_suffix=\".png\"):\n", + " \"\"\"Plot histogram of predicted labels\n", + "\n", + " Parameters\n", + " ----------\n", + " y_prediction_a: `numpy.ndarray`\n", + " y_prediction_b: `numpy.ndarray`, optional\n", + " y_prediction_c: `numpy.ndarray`, optional\n", + " n_classes: `int`, optional\n", + " n_objects_a: `int`, optional\n", + " n_colors: `int`, optional\n", + " title_a: `string`, optional\n", + " title_b: `string`, optional\n", + " title_c: `string`, optional\n", + " label_a: `string`, optional\n", + " label_b: `string`, optional\n", + " label_c: `string`, optional\n", + " alpha: `float`, optional\n", + " transparency\n", + " figsize: `tuple`, optional\n", + " figure size\n", + " file_prefix: `string`, optional\n", + " prefix of file name\n", + " file_location: `string`, optional\n", + " path to file\n", + " file_suffix: `string`, optional\n", + " suffix/extension of file name\n", + "\n", + " Returns\n", + " -------\n", + " None\n", + " \"\"\"\n", + " ndim = y_prediction_a.ndim\n", + "\n", + " if ndim == 2:\n", + " fig, (axa, axb, axc) = plt.subplots(1, 3, figsize=figsize)\n", + " fig.subplots_adjust(wspace=0.35)\n", + " elif ndim == 1:\n", + " fig, ax = plt.subplots(figsize=figsize)\n", + "\n", + " shape_a = np.shape(y_prediction_a)\n", + "\n", + " if n_objects_a is None:\n", + " n_objects_a = shape_a[0]\n", + "\n", + " if ndim == 2:\n", + " if n_classes is None:\n", + " n_classes = shape_a[1]\n", + " if n_colors is None:\n", + " n_colors = n_classes\n", + " elif ndim == 1:\n", + " if n_colors is None:\n", + " n_colors = 1\n", + "\n", + " if ndim == 2:\n", + " colors = cm.Purples(np.linspace(0, 1, n_colors))\n", + " xlabel = \"Probability for Each Class\"\n", + "\n", + " axa.set_ylim(0, n_objects_a)\n", + " 
axa.set_xlabel(xlabel)\n", + " axa.set_title(title_a)\n", + "\n", + " for i in np.arange(n_classes):\n", + " axa.hist(y_prediction_a[:, i], alpha=alpha,\n", + " color=colors[i], label=\"'\" + str(i) + \"'\")\n", + "\n", + " if y_prediction_b is not None:\n", + " shape_b = np.shape(y_prediction_b)\n", + " axb.set_ylim(0, shape_b[0])\n", + " axb.set_xlabel(xlabel)\n", + " axb.set_title(title_b)\n", + "\n", + " for i in np.arange(n_classes):\n", + " axb.hist(y_prediction_b[:, i], alpha=alpha,\n", + " color=colors[i], label=\"'\" + str(i) + \"'\")\n", + "\n", + " if y_prediction_c is not None:\n", + " shape_c = np.shape(y_prediction_c)\n", + " axc.set_ylim(0, shape_c[0])\n", + " axc.set_xlabel(xlabel)\n", + " axc.set_title(title_c)\n", + "\n", + " for i in np.arange(n_classes):\n", + " axc.hist(y_prediction_c[:, i], alpha=alpha,\n", + " color=colors[i], label=\"'\" + str(i) + \"'\")\n", + "\n", + " elif ndim == 1:\n", + " ya, xa, _ = plt.hist(y_prediction_a, alpha=alpha, color='purple',\n", + " label=label_a)\n", + " y_max_list = [max(ya)]\n", + "\n", + " if y_prediction_b is not None:\n", + " yb, xb, _ = plt.hist(y_prediction_b, alpha=alpha, color='blue',\n", + " label=label_b)\n", + " y_max_list.append(max(yb))\n", + "\n", + " if y_prediction_c is not None:\n", + " yc, xc, _ = plt.hist(y_prediction_c, alpha=alpha, color='green',\n", + " label=label_c)\n", + " y_max_list.append(max(yc))\n", + "\n", + " plt.ylim(0, np.max(y_max_list)*1.1)\n", + " plt.xlabel(\"Top Choice-Class\")\n", + "\n", + " plt.legend(loc='upper right')\n", + "\n", + " if save_file:\n", + " file_final = createFileName(file_prefix=file_prefix,\n", + " file_location=file_location,\n", + " file_suffix=file_suffix,\n", + " useuid=True)\n", + " plt.savefig(file_final, bbox_inches='tight')\n", + "\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plotLossHistory(history, figsize=(8, 5),\n", + " save_file=False,\n", + 
" file_prefix=\"prediction_histogram\",\n", + " file_location=\"./\",\n", + " file_suffix=\".png\"):\n", + " \"\"\"Plot loss history of the model as function of epoch\n", + "\n", + " Parameters\n", + " ----------\n", + " history: `keras.src.callbacks.history.History`\n", + " keras callback history object containing the losses at each epoch\n", + " figsize: `tuple`, optional\n", + " figure size\n", + " file_prefix: `string`, optional\n", + " prefix of file name\n", + " file_location: `string`, optional\n", + " path to file\n", + " file_suffix: `string`, optional\n", + " suffix/extension of file name\n", + "\n", + " Returns\n", + " -------\n", + " None\n", + " \"\"\"\n", + " fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=figsize)\n", + "\n", + " loss_tra = np.array(history.history['loss'])\n", + " loss_val = np.array(history.history['val_loss'])\n", + " loss_dif = loss_val - loss_tra\n", + "\n", + " ax1.plot(loss_tra, label='Training')\n", + " ax1.plot(loss_val, label='Validation')\n", + " ax1.legend()\n", + "\n", + " ax2.plot(loss_dif, color='red', label='residual')\n", + " ax2.axhline(y=0, color='grey', linestyle='dashed', label='zero bias')\n", + " ax2.sharex(ax1)\n", + " ax2.legend()\n", + "\n", + " ax1.set_title('Loss History')\n", + " ax1.set_ylabel('Loss')\n", + " ax2.set_ylabel('Loss Residual')\n", + " ax2.set_xlabel('Epoch')\n", + " plt.tight_layout()\n", + "\n", + " if save_file:\n", + " file_final = createFileName(file_prefix=file_prefix,\n", + " file_location=file_location,\n", + " file_suffix=file_suffix,\n", + " useuid=True)\n", + "\n", + " plt.savefig(file_final, bbox_inches='tight')\n", + "\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plotConfusionMatrix(cm_tra, cm_val, cm_tes, save_file=False,\n", + " file_prefix=\"prediction_histogram\",\n", + " file_location=\"./\",\n", + " file_suffix=\".png\"):\n", + " \"\"\"Plot the confusion matrix of 
predictions.\n", + "\n", + " Parameters\n", + " ----------\n", + " cm_tra: `numpy.ndarray`\n", + " confusion matrix for the training data\n", + " cm_val: `numpy.ndarray`\n", + " confusion matrix for the validation data\n", + " cm_tes: `numpy.ndarray`\n", + " confusion matrix for the test data\n", + " file_prefix: `string`, optional\n", + " prefix of file name\n", + " file_location: `string`, optional\n", + " path to file\n", + " file_suffix: `string`, optional\n", + " suffix/extension of file name\n", + "\n", + " Returns\n", + " -------\n", + " None\n", + " \"\"\"\n", + "\n", + " cm_display_tra = ConfusionMatrixDisplay(confusion_matrix=cm_tra)\n", + " cm_display_val = ConfusionMatrixDisplay(confusion_matrix=cm_val)\n", + " cm_display_tes = ConfusionMatrixDisplay(confusion_matrix=cm_tes)\n", + "\n", + " fig, (axa, axb, axc) = plt.subplots(1, 3, figsize=(22, 5))\n", + "\n", + " cm_display_tra.plot(ax=axa)\n", + " cm_display_val.plot(ax=axb)\n", + " cm_display_tes.plot(ax=axc)\n", + "\n", + " axa.set_title(\"Training\")\n", + " axb.set_title(\"Validation\")\n", + " axc.set_title(\"Testing\")\n", + "\n", + " if save_file:\n", + " file_final = createFileName(file_prefix=file_prefix,\n", + " file_location=file_location,\n", + " file_suffix=file_suffix,\n", + " useuid=True)\n", + "\n", + " plt.savefig(file_final, bbox_inches='tight')\n", + "\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plotArrayImageConfusion(x_tra, y_tra, y_pred_tra_topchoice,\n", + " title_main=None, num=10,\n", + " save_file=False,\n", + " file_prefix=\"prediction_histogram\",\n", + " file_location=\"./\",\n", + " file_suffix=\".png\"):\n", + " \"\"\"Plot images of example objects that are misclassified.\n", + "\n", + " Parameters\n", + " ----------\n", + " x_tra: `numpy.ndarray`\n", + " training image data\n", + " y_tra: `numpy.ndarray`\n", + " training label 
data\n", + " y_pred_tra_topchoice: `numpy.ndarray`\n", + " top choice of the predicted labels\n", + " title_main: `string`, optional\n", + " title for the plot\n", + " num: `int`, optional\n", + " number of examples\n", + " file_prefix: `string`, optional\n", + " prefix of file name\n", + " file_location: `string`, optional\n", + " path to file\n", + " file_suffix: `string`, optional\n", + " suffix/extension of file name\n", + "\n", + " Returns\n", + " -------\n", + " None\n", + " \"\"\"\n", + " num_row = 2\n", + " num_col = 5\n", + " images = x_tra[:num]\n", + " labels_true = y_tra[:num]\n", + " labels_pred = y_pred_tra_topchoice[:num]\n", + "\n", + " fig, axes = plt.subplots(num_row, num_col,\n", + " figsize=(1.5*num_col, 2*num_row))\n", + "\n", + " fig.patch.set_linewidth(5)\n", + " fig.patch.set_edgecolor('cornflowerblue')\n", + "\n", + " for i in range(num):\n", + " ax = axes[i//num_col, i%num_col]\n", + " ax.imshow(images[i], cmap='gray')\n", + " ax.set_title(r'True: {}'.format(labels_true[i]) + '\\n'\n", + " + 'Pred: {}'.format(labels_pred[i]))\n", + "\n", + " fig.suptitle(title_main)\n", + " plt.tight_layout()\n", + "\n", + " if save_file:\n", + " file_final = createFileName(file_prefix=file_prefix,\n", + " file_location=file_location,\n", + " file_suffix=file_suffix,\n", + " useuid=True)\n", + "\n", + " plt.savefig(file_final, bbox_inches='tight')\n", + "\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plotArrayHistogramConfusion(x_tra, y_tra, y_pred_tra_topchoice,\n", + " title_main=None, num=10,\n", + " save_file=False,\n", + " file_prefix=\"prediction_histogram\",\n", + " file_location=\"./\",\n", + " file_suffix=\".png\"):\n", + " \"\"\"Plot histograms of pixel values for images that are misclassified.\n", + "\n", + " Parameters\n", + " ----------\n", + " x_tra: `numpy.ndarray`\n", + " training image data\n", + " y_tra: `numpy.ndarray`\n", + " training label 
data\n", + " y_pred_tra_topchoice: `numpy.ndarray`\n", + " top choice of the predicted labels\n", + " title_main: `string`, optional\n", + " title of plot\n", + " num: `int`, optional\n", + " number of examples\n", + " file_prefix: `string`, optional\n", + " prefix of file name\n", + " file_location: `string`, optional\n", + " path to file\n", + " file_suffix: `string`, optional\n", + " suffix/extension of file name\n", + "\n", + " Returns\n", + " -------\n", + " None\n", + " \"\"\"\n", + " n_bins = 10\n", + " num_row = 2\n", + " num_col = 5\n", + " images = x_tra[:num]\n", + " labels_true = y_tra[:num]\n", + " labels_pred = y_pred_tra_topchoice[:num]\n", + "\n", + " fig, axes = plt.subplots(num_row, num_col,\n", + " figsize=(1.5*num_col, 2*num_row))\n", + "\n", + " fig.patch.set_linewidth(5)\n", + " fig.patch.set_edgecolor('cornflowerblue')\n", + "\n", + " for i in range(num):\n", + " ax = axes[i//num_col, i%num_col]\n", + " ax.hist(images[i], bins=n_bins)\n", + " ax.set_title(r'True: {}'.format(labels_true[i]) + '\\n'\n", + " + 'Pred: {}'.format(labels_pred[i]))\n", + "\n", + " fig.suptitle(title_main)\n", + " plt.tight_layout()\n", + "\n", + " if save_file:\n", + " file_final = createFileName(file_prefix=file_prefix,\n", + " file_location=file_location,\n", + " file_suffix=file_suffix,\n", + " useuid=True)\n", + "\n", + " plt.savefig(file_final, bbox_inches='tight')\n", + "\n", + " plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1.3 Define Paths for Data and Plots" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Neural network training (i.e., model fitting) typically requires many numerical experiments to achieve an ideal model. To facilitate the comparison of these experiments/models, it is helpful to organize data carefully. We set paths for the model weight parameters and diagnostic figures. We also set the variable `run_label` for each training run. 
We save these paths in a dictionary to make it easy to pass them to plotting functions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_label = \"Run000\"\n", + "\n", + "path_dict = {'run_label': run_label,\n", + " 'dir_data_model': \"Data/Models/\",\n", + " 'dir_data_figures': \"Data/Figures/\",\n", + " 'file_model_prefix': \"Model\",\n", + " 'file_figure_prefix': \"Figure\",\n", + " 'file_figure_suffix': \".png\",\n", + " 'file_model_suffix': \".keras\"\n", + " }\n", + "\n", + "if not os.path.exists(path_dict['dir_data_model']):\n", + " os.makedirs(path_dict['dir_data_model'])\n", + "\n", + "if not os.path.exists(path_dict['dir_data_figures']):\n", + " os.makedirs(path_dict['dir_data_figures'])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jce50kKEfHC1" + }, + "source": [ + "## 2. Load and Prepare Data: MNIST Handwritten Digits" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-dHXIDGEfLmO" + }, + "source": [ + "### 2.1. Download Dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-dHXIDGEfLmO" + }, + "source": [ + "The [`MNIST handwritten digits dataset`](https://ieeexplore.ieee.org/document/6296535) comprises 10 classes --- one for each digit. This is a useful dataset for learning the basics of neural networks and other AI algorithms. MNIST is one of a few canonical AI benchmark data sets for image classification. `tensorflow` provides a simple function for downloading the MNIST data to your local server for free. It returns the data already split into training and test sets.\n", + "\n", + "The **input** data are held in variables that begin with `x_`, while the **output** (i.e., label) data are held in variables that begin with `y_`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "zDRvT2QkfISn", + "outputId": "68276f81-5e32-443a-d9c3-5291ef61715c" + }, + "outputs": [], + "source": [ + "mnist = tf.keras.datasets.mnist\n", + "train_temp, test_temp = mnist.load_data()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O4FPxkLKiJKe" + }, + "source": [ + "### 2.2. Split Data into Train/Validation/Test" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O4FPxkLKiJKe" + }, + "source": [ + "It is essential to split the data for a proper 'blind' analysis and optimization of an AI model.\n", + "\n", + "There are three primary data sets used in model optimization:\n", + "\n", + "* **Training** (`_tra`) data are used directly by the algorithm to update the parameters of the AI model -- e.g., the weights on the edges between the computational neurons in neural networks.\n", + "* **Validation** (`_val`) data are used indirectly to update the hyperparameters of the AI model -- e.g., the batch size, the learning rate, or the layers in the architecture of a neural network. Each time the neural network has completed training with the training data, the human compares the diagnostics of the model when run on the training data and on the validation data.\n", + "* **Test(ing)** (`_tes`) data are only used once the model is trained and validated and will no longer be updated or further trained. \n", + "\n", + "The `tensorflow` download function automatically provides the data as separate training and test data sets. Therefore, we use the `sklearn` `train_test_split()` function to further split the training set into training and validation data sets. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RgjGyErDfNrg" + }, + "outputs": [], + "source": [ + "fraction_validation = 0.25" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# set the test data sets from the temp data at read-in\n", + "x_tes, y_tes = test_temp[0], test_temp[1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# set the training and validation data sets from the temp data at read-in\n", + "# use the sklearn train_test_split function\n", + "x_tra, x_val, y_tra, y_val = train_test_split(train_temp[0], train_temp[1],\n", + " test_size=fraction_validation,\n", + " random_state=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e8Ab7Yx7v1CY" + }, + "source": [ + "### 2.3. Normalize Data\n", + "\n", + "First, we make sure that the input data are floats. This allows us to perform computations on the real number line for the inputs.\n", + "\n", + "Second, we normalize the data by the maximum value across all the data sets, so that the inputs all lie on a smaller range. This improves the stability of the training."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ci1q1O8hv8U6" + }, + "outputs": [], + "source": [ + "# set to floats\n", + "x_tra = x_tra.astype('float32')\n", + "x_val = x_val.astype('float32')\n", + "x_tes = x_tes.astype('float32')\n", + "\n", + "# calculate min and max across all input images\n", + "input_minimum = np.min([np.min(x_tra), np.min(x_val), np.min(x_tes)])\n", + "input_maximum = np.max([np.max(x_tra), np.max(x_val), np.max(x_tes)])\n", + "\n", + "print(\"Before\")\n", + "print(\"min/max\", np.min(x_tra), np.max(x_tra))\n", + "\n", + "x_tra = normalizeInputs(x_tra, input_minimum, input_maximum)\n", + "x_val = normalizeInputs(x_val, input_minimum, input_maximum)\n", + "x_tes = normalizeInputs(x_tes, input_minimum, input_maximum)\n", + "\n", + "print(\"After\")\n", + "print(\"min/max\", np.min(x_tra), np.max(x_tra))\n", + "\n", + "# get shapes\n", + "image_shape = x_tra[0, :, :].shape" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CdwlTbFafOYc" + }, + "source": [ + "### 2.4. Examine Raw Data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CdwlTbFafOYc" + }, + "source": [ + "Review data shapes. \n", + "\n", + "The zeroth elements of the `x` and `y` shapes should match: this is the number of objects. The first and second elements of the `x` shape are the dimensions of the images; they should be equal because the MNIST images are square. The image size in part determines the depth of the neural network that can be created." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Print the data shapes to make sure you understand how many objects there are and how many pixels each image has."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 484 + }, + "id": "u3nrxn00fRUE", + "outputId": "5bac7964-621b-4984-be23-8e711bd00dc4" + }, + "outputs": [], + "source": [ + "print('check data shapes')\n", + "print('x_train:', x_tra.shape)\n", + "print('y_train:', y_tra.shape)\n", + "print('x_valid:', x_val.shape)\n", + "print('y_valid:', y_val.shape)\n", + "print('x_test:', x_tes.shape)\n", + "print('y_test:', y_tes.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Plot examples to gain visual familiarity. Do these all look like hand-written digits?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 484 + }, + "id": "u3nrxn00fRUE", + "outputId": "5bac7964-621b-4984-be23-8e711bd00dc4" + }, + "outputs": [], + "source": [ + "file_prefix = path_dict['file_figure_prefix'] + \"_\"\\\n", + " + \"Example_Image_Array\"\\\n", + " + \"_\" + path_dict['run_label']\n", + "plotArrayImageExamples(x_tra, y_tra,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Plot pixel distributions to further understand the data. Are the data normalized? Do the distributions of the pixel values make sense according to what you see in the related images above? 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "file_prefix = path_dict['file_figure_prefix'] + \"_\"\\\n", + " + \"Example_Histogram_Array\"\\\n", + " + \"_\" + path_dict['run_label']\n", + "\n", + "plotArrayHistogramExamples(x_tra, y_tra,\n", + " num=10,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rmAXXIHOfUnD" + }, + "source": [ + "## 3. Train Model: Dense Neural Network" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": { + "iopub.execute_input": "2024-07-14T14:21:02.040978Z", + "iopub.status.busy": "2024-07-14T14:21:02.040269Z", + "iopub.status.idle": "2024-07-14T14:21:02.043255Z", + "shell.execute_reply": "2024-07-14T14:21:02.042798Z", + "shell.execute_reply.started": "2024-07-14T14:21:02.040956Z" + }, + "id": "bKRmx2k2wNtE" + }, + "source": [ + "### 3.1. 
Define Model Training Parameters" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": { + "iopub.execute_input": "2024-07-14T14:21:02.040978Z", + "iopub.status.busy": "2024-07-14T14:21:02.040269Z", + "iopub.status.idle": "2024-07-14T14:21:02.043255Z", + "shell.execute_reply": "2024-07-14T14:21:02.042798Z", + "shell.execute_reply.started": "2024-07-14T14:21:02.040956Z" + }, + "id": "bKRmx2k2wNtE" + }, + "source": [ + "Define optimizer\n", + "Define loss\n", + "Define accuracy\n", + "Define batch_size\n", + "Define epochs\n", + "Define metrics" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tlgJk3oLwRnh" + }, + "outputs": [], + "source": [ + "epochs = 10\n", + "batch_size = 32\n", + "verbose = True\n", + "optimizer = \"sgd\"\n", + "loss = tf.keras.losses.SparseCategoricalCrossentropy()\n", + "metrics = ['accuracy']\n", + "dropout_rate = 0.3\n", + "learning_rate = 0.01\n", + "momentum = 0.9\n", + "seed = 1000" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Set the random seed for neural network weight initialization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tf.keras.utils.set_random_seed(seed)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rmAXXIHOfUnD" + }, + "source": [ + "### 3.2. 
Define Model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rmAXXIHOfUnD" + }, + "source": [ + "Define Sequential Model\n", + "Define layers\n", + "Define flat layer\n", + "Define dense layers\n", + "Define activation function; define types activation functions -- sigmoid and relu\n", + "Define weights and biases" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "U6vymJn1wJBu" + }, + "outputs": [], + "source": [ + "model_layers = [tf.keras.layers.Input(shape=image_shape),\n", + " tf.keras.layers.Flatten(),\n", + " tf.keras.layers.Dense(256, activation='sigmoid'),\n", + " tf.keras.layers.Dense(64, activation='sigmoid'),\n", + " tf.keras.layers.Dropout(dropout_rate),\n", + " tf.keras.layers.Dense(10, activation='softmax')]\n", + "\n", + "model = tf.keras.models.Sequential(model_layers)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U6vymJn1wJBu" + }, + "source": [ + "View a summary of the network architecture. Examine the shapes of the layers and the numbers of parameters. Too few parameters may prevent the model from being flexible enough to model the data. Too many parameters could lead to overfitting of the model and a high computational cost." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "U6vymJn1wJBu" + }, + "outputs": [], + "source": [ + "model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": { + "iopub.execute_input": "2024-07-14T14:21:02.040978Z", + "iopub.status.busy": "2024-07-14T14:21:02.040269Z", + "iopub.status.idle": "2024-07-14T14:21:02.043255Z", + "shell.execute_reply": "2024-07-14T14:21:02.042798Z", + "shell.execute_reply.started": "2024-07-14T14:21:02.040956Z" + }, + "id": "bKRmx2k2wNtE" + }, + "source": [ + "### 3.3. Compile and Train Model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Compile the model with the model settings created earlier." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YwLTHCccwTeo", + "scrolled": true + }, + "outputs": [], + "source": [ + "model.compile(optimizer=optimizer, loss=loss, metrics=metrics)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Train (fit) the model. The output `history` contains the loss value of the training data and the validation data for each epoch." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "history = model.fit(x_tra, y_tra,\n", + " batch_size=batch_size,\n", + " epochs=epochs,\n", + " validation_data=(x_val, y_val),\n", + " verbose=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Save the model as a `.keras` zip archive so that it can be used later -- e.g., for comparison to other models." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "file_prefix = path_dict['file_model_prefix'] + \"_\" + path_dict['run_label']\n", + "file_name_final = createFileName(file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_model'],\n", + " file_suffix=path_dict['file_model_suffix'],\n", + " useuid=True,\n", + " verbose=True)\n", + "\n", + "model.save(file_name_final)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "suHSArn6wb27" + }, + "source": [ + "## 4. Diagnosing the Results of the Classification Model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.1. Key Terms for Diagnostic Metrics\n", + "\n", + "We use the following diagnostics to assess the status of the network optimization and efficacy. \n", + "https://scikit-learn.org/stable/modules/model_evaluation.html\n", + "\n", + "\n", + "* **Metrics**\n", + " * Loss:\n", + " * Accuracy: Use as a rough indicator of model training progress/convergence for balanced datasets. 
For model performance, use only in combination with other metrics. Avoid for imbalanced datasets; consider using another metric instead.\n", + " * tpr (Recall): Use when false negatives are more expensive than false positives.\n", + " * fpr (False Positive Rate): Use when false positives are more expensive than false negatives.\n", + " * precision: Use when it's very important for positive predictions to be accurate.\n", + " * \n", + "* **Generalization Error**: The Generalization Error (GE) is the difference in loss when the model is applied to training data versus when it is applied to validation data and test data.\n", + "* **Confusion Matrix**:\n", + "* **Receiver Operator Characteristic (ROC) Curve**:\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.2. Classification Predictions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Predict classification probabilities on the training, validation, and test sets." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y_pred_tra = model.predict(x_tra, verbose=True)\n", + "y_pred_val = model.predict(x_val, verbose=True)\n", + "y_pred_tes = model.predict(x_tes, verbose=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Identify the top-choice class for each object in the training, validation, and test sets." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y_pred_tra_topchoice = y_pred_tra.argmax(axis=1)\n", + "y_pred_val_topchoice = y_pred_val.argmax(axis=1)\n", + "y_pred_tes_topchoice = y_pred_tes.argmax(axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"10 probabilities for each object:\", np.shape(y_pred_tra))\n", + "print(\"Top choice for each object:\", np.shape(y_pred_tra_topchoice))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Histograms of prediction distributions by class" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "file_prefix = path_dict['file_figure_prefix'] + \"_\"\\\n", + " + \"Histograms_top_choice\"\\\n", + " + \"_\" + path_dict['run_label']\n", + "plotPredictionHistogram(y_pred_tra_topchoice,\n", + " y_prediction_b=y_pred_val_topchoice,\n", + " y_prediction_c=y_pred_tes_topchoice,\n", + " label_a=\"Training Set\",\n", + " label_b=\"Validation Set\",\n", + " label_c=\"Testing Set\",\n", + " figsize=(12, 5),\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])\n", + "\n", + "file_prefix = path_dict['file_figure_prefix'] + \"_\"\\\n", + " + \"Histograms_class_probabilities\"\\\n", + " + \"_\" + path_dict['run_label']\n", + "plotPredictionHistogram(y_pred_tra,\n", + " y_prediction_b=y_pred_val,\n", + " y_prediction_c=y_pred_tes,\n", + " title_a='Training Set',\n", + " title_b='Validation Set',\n", + " title_c='Testing Set',\n", + " figsize=(15, 4),\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Observations about these histograms ...\n", + "1. 
very similar shapes across the data sets: that's good" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "### 4.3. Generalization Error\n", + "\n", + "The primary task in optimizing a network is to minimize the Generalization Error. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.3.1. Loss History: History of Loss and Accuracy during Training" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Plot the loss history for the validation and training sets. We reserve the test set for a 'blind' analysis." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EIiJTdK-weWf" + }, + "outputs": [], + "source": [ + "file_prefix = path_dict['file_figure_prefix'] + \"_\"\\\n", + " + \"LossHistory\"\\\n", + " + \"_\"\\\n", + " + path_dict['run_label']\n", + "plotLossHistory(history,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.3.2. Confusion Matrix: Bias in Trained Model?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Compute confusion matrices." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cm_tra = confusion_matrix(y_pred_tra_topchoice, y_tra)\n", + "cm_val = confusion_matrix(y_pred_val_topchoice, y_val)\n", + "cm_tes = confusion_matrix(y_pred_tes_topchoice, y_tes)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Plot confusion matrices for the training, validation, and test samples (left, middle, right)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plotConfusionMatrix(cm_tra, cm_val, cm_tes)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.3.3. Investigating Errant Classifications: Look at the examples\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Choose a digit/class (human option/choice) for examination." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class_value = 2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Find all objects that have that class value. \n", + "Obtain indices for the true positives (tp's), false positives (fp's), true negatives (tn's), and false negatives (fn's)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ind_class_tp_tra = np.where((y_tra == class_value)\n", + " & (y_pred_tra_topchoice == class_value))[0]\n", + "\n", + "ind_class_fp_tra = np.where((y_tra != class_value)\n", + " & (y_pred_tra_topchoice == class_value))[0]\n", + "\n", + "ind_class_tn_tra = np.where((y_tra != class_value)\n", + " & (y_pred_tra_topchoice != class_value))[0]\n", + "\n", + "ind_class_fn_tra = np.where((y_tra == class_value)\n", + " & (y_pred_tra_topchoice != class_value))[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "plot examples of false positives" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "file_prefix = path_dict['file_figure_prefix'] + \"_\"\\\n", + " + \"ExampleImages_TruePostives_on_class_\"\\\n", + " + str(class_value) + \"_\" + path_dict['run_label']\n", + "plotArrayImageConfusion(x_tra[ind_class_tp_tra],\n", + " y_tra[ind_class_tp_tra],\n", + " y_pred_tra_topchoice[ind_class_tp_tra],\n", + " title_main=\"True Positives\",\n", + " num=10,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])\n", + "\n", + "file_prefix = path_dict['file_figure_prefix'] + \"_\"\\\n", + " + \"ExampleImages_FalsePostives_on_class_\"\\\n", + " + str(class_value) + \"_\" + path_dict['run_label']\n", + "plotArrayImageConfusion(x_tra[ind_class_fp_tra],\n", + " y_tra[ind_class_fp_tra],\n", + " y_pred_tra_topchoice[ind_class_fp_tra],\n", + " title_main=\"False Positives\",\n", + " num=10,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])\n", + "\n", + "file_prefix = path_dict['file_figure_prefix'] + \"_\"\\\n", + " + \"ExampleImages_TrueNegatives_on_class_\"\\\n", + " + str(class_value) + \"_\" + 
path_dict['run_label']\n", + "plotArrayImageConfusion(x_tra[ind_class_tn_tra],\n", + " y_tra[ind_class_tn_tra],\n", + " y_pred_tra_topchoice[ind_class_tn_tra],\n", + " title_main=\"True Negatives\",\n", + " num=10,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])\n", + "\n", + "file_prefix = path_dict['file_figure_prefix'] + \"_\"\\\n", + " + \"ExampleImages_FalseNegatives_on_class_\"\\\n", + " + str(class_value) + \"_\" + path_dict['run_label']\n", + "plotArrayImageConfusion(x_tra[ind_class_fn_tra],\n", + " y_tra[ind_class_fn_tra],\n", + " y_pred_tra_topchoice[ind_class_fn_tra],\n", + " title_main=\"False Negatives\",\n", + " num=10,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Plot histograms of images pixels of true positives, false positives, true negatives, and false negatives." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "file_prefix = path_dict['file_figure_prefix']\\\n", + " + \"_\" + \"ExampleImages_TruePostives_on_class_\"\\\n", + " + str(class_value) + \"_\" + path_dict['run_label']\n", + "plotArrayHistogramConfusion(x_tra[ind_class_tp_tra],\n", + " y_tra[ind_class_tp_tra],\n", + " y_pred_tra_topchoice[ind_class_tp_tra],\n", + " title_main=\"True Positives\",\n", + " num=10,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])\n", + "\n", + "file_prefix = path_dict['file_figure_prefix']\\\n", + " + \"_\" + \"ExampleImages_FalsePostives_on_class_\"\\\n", + " + str(class_value) + \"_\" + path_dict['run_label']\n", + "plotArrayHistogramConfusion(x_tra[ind_class_fp_tra],\n", + " y_tra[ind_class_fp_tra],\n", + " y_pred_tra_topchoice[ind_class_fp_tra],\n", + " title_main=\"False Positives\",\n", + " num=10,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])\n", + "\n", + "file_prefix = path_dict['file_figure_prefix']\\\n", + " + \"_\" + \"ExampleImages_TrueNegatives_on_class_\"\\\n", + " + str(class_value) + \"_\" + path_dict['run_label']\n", + "plotArrayHistogramConfusion(x_tra[ind_class_tn_tra],\n", + " y_tra[ind_class_tn_tra],\n", + " y_pred_tra_topchoice[ind_class_tn_tra],\n", + " title_main=\"True Negatives\",\n", + " num=10,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])\n", + "\n", + "file_prefix = path_dict['file_figure_prefix']\\\n", + " + \"_\" + \"ExampleImages_FalseNegatives_on_class_\"\\\n", + " + str(class_value) + \"_\" + path_dict['run_label']\n", + "plotArrayHistogramConfusion(x_tra[ind_class_fn_tra],\n", + " y_tra[ind_class_fn_tra],\n", + " y_pred_tra_topchoice[ind_class_fn_tra],\n", + " 
title_main=\"False Negatives\",\n", + " num=10,\n", + " file_prefix=file_prefix,\n", + " file_location=path_dict['dir_data_figures'],\n", + " file_suffix=path_dict['file_figure_suffix'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Exercises for the Learner" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each time you train a new model, re-run all the diagnostic plots.\n", + "\n", + "1. How do the loss and accuracy histories change when batch size is small or large? Why?\n", + "2. Does the NN take more or less time (more or fewer epochs) to converge if the input image data are normalized or not normalized? Why?\n", + "3. How does the size of the training set affect the model's accuracy and loss -- keeping the number of epochs the same? Why?\n", + "4. How does the random seed for the weight initialization affect the model's accuracy and loss -- keeping the number of epochs the same?\n", + "5. Use the `time` module to estimate the time for the model fitting. Record that time. Increase and then decrease the number of weights in the NN by an order of magnitude. Train the NN for each of those models and record the times. How does the number of weights in the neural network affect the training time and model loss and accuracy?\n", + "6. Use the `time` module to estimate the time for the model fitting. Record that time. Increase and then decrease the number of layers in the NN. Train the NN for each of those models and record the times. How does the number of layers in the neural network affect the training time and model loss and accuracy?\n", + "7. Use the `time` module to estimate the time for the model fitting. Record that time. Add a convolutional layer to the NN. Train the new NN and record the time. How does adding a convolutional layer affect the training time and model loss and accuracy?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Glossary of neural network terms\n", + "\n", + "1. network weights\n", + "2. deep learning\n", + "3. machine learning\n", + "4. learning\n", + "5. activation function\n", + "6. pool(ing)\n", + "7. convolution\n", + "8. layer\n", + "9. loss function\n", + "10. confusion matrix\n", + "11. epoch\n", + "12. batch size\n", + "13. learning rate\n", + "14. momentum\n", + "15. stochastic gradient descent\n", + "16. optimizer\n", + "17. receiver operator characteristic (ROC)\n", + "18. area under the curve (AUC)\n", + "19. training\n", + "20. validation\n", + "21. testing\n", + "22. class\n", + "23. hyperparameter (vs. parameter)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. AI is math, not magic.\n", + "\n", + "AI is firmly based in math, computer science, and statistics. Additionally, some of the approaches are inspired by concepts or notions in biology (e.g., the computational neuron) and in physics (e.g., the restricted Boltzmann machine). \n", + "\n", + "Much of the jargon in AI is anthropomorphic, which can make it appear that something other than math is happening. For example, consider the following list of terms that are very often used in AI -- and what these terms actually mean mathematically.\n", + "\n", + "1. learn $\\rightarrow$ fit\n", + "2. hallucinate/lie $\\rightarrow$ predict incorrectly\n", + "3. understand $\\rightarrow$ model has converged\n", + "4. cheat $\\rightarrow$ more efficiently guesses the best weight parameters of the model\n", + "5. believe $\\rightarrow$ predict/infer based on statistical priors\n", + "\n", + "When we over-anthropomorphize this mathematical tool, we obfuscate how it actually works, and that makes it harder to build and refine models. That is, AI models are not 'learning' or 'understanding'; they are large-parameter models that are being fit to data. The only learning that's happening is what we do with these models." 
+ ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "LSST", + "language": "python", + "name": "lsst" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 6e3d13758ec4f4404ff154b3301cc7539b2f792b Mon Sep 17 00:00:00 2001 From: MelissaGraham Date: Mon, 25 Nov 2024 20:17:48 +0000 Subject: [PATCH 2/3] add 16a --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 8ec5f12..9307751 100644 --- a/README.md +++ b/README.md @@ -36,6 +36,7 @@ Tutorial titles in **bold** have Spanish-language versions. | 13a. Using The Image Cutout Tool With DP0.2 | Demonstration of the use of the image cutout tool with a few science applications. | | 14. Injecting Synthetic Sources Into Single-Visit Images | Inject artificial stars and galaxies into images. | | 15. Survey Property Maps | Use the tools to visualize full-area survey property maps. | +| 16a. Introduction to Tensorflow | Learn to classify images with AI-based classification algorithms. | ## DP0.3 Tutorials @@ -119,7 +120,7 @@ The *content* of these notebooks are licensed under the Apache 2.0 License. Tha | **13a. Using The Image Cutout Tool With DP0.2**
**Level:** Beginner
**Description:** This notebook demonstrates how to use the Rubin Image Cutout Service.
**Skills:** Run the Rubin Image Cutout Service for visual inspection of small cutouts of LSST images.
**Data Products:** Images (deepCoadd, calexp), catalogs (objectTable, diaObject, truthTables, ivoa.ObsCore).
**Packages:** PyVO, lsst.rsp.get_tap_service, lsst.pipe.tasks.registerImage, lsst.afw.display
| | **14. Injecting Synthetic Sources Into Single-Visit Images**
**Level:** Advanced
**Description:** This tutorial demonstrates a method to inject artificial sources (stars and galaxies) into calexp images using the measured point-spread function of the given calexp image. Confirmation that the synthetic sources were correctly injected into the image is done by running a difference imaging task from the pipelines.
**Skills:** Use the `source_injection` tools to inject synthetic sources into images. Create a difference image from a `calexp` with injected sources.
**Data Products:** Butler calexp images and corresponding src catalogs, goodSeeingDiff_templateExp images, and injection_catalogs.
**Packages:** lsst.source.injection
| | **15. Survey Property Maps**
**Level:** Intermediate
**Description:** Use the tools to visualize full-area survey property maps.
**Skills:** Load and visualize survey property maps using healsparse and skyproj.
**Data Products:** Survey property maps.
**Packages:** healsparse, skyproj, lsst.daf.butler
| - +| **16a. Introduction to Tensorflow**
**Level:** Beginner
**Description:** An introduction to the classification of images with AI-based classification algorithms.
**Skills:** Examine AI training data, prepare it for a classification task, perform classification with a neural network, and examine the diagnostics of the classification task.
**Data Products:** MNIST data.
**Packages:** sklearn, tensorflow
| | Skills in **DP0.3** Tutorial Notebooks | |---| From fab6265ebcbf5b8c9b20b759190b2dff59efea1d Mon Sep 17 00:00:00 2001 From: MelissaGraham Date: Tue, 26 Nov 2024 00:28:51 +0000 Subject: [PATCH 3/3] MLG edits --- ...geClassificationWithTensorflow_Draft.ipynb | 293 ++++++++++++------ 1 file changed, 192 insertions(+), 101 deletions(-) diff --git a/AI0_Intro_AI_ImageClassificationWithTensorflow_Draft.ipynb b/AI0_Intro_AI_ImageClassificationWithTensorflow_Draft.ipynb index 9939bcd..e7aafd0 100644 --- a/AI0_Intro_AI_ImageClassificationWithTensorflow_Draft.ipynb +++ b/AI0_Intro_AI_ImageClassificationWithTensorflow_Draft.ipynb @@ -7,10 +7,10 @@ }, "source": [ " \n", - "
AI0: Introduction to AI-based Image Classification with Tensorflow
\n", + "
Introduction to AI-based Image Classification with Tensorflow
\n", "Contact author: Brian Nord
\n", - "Last verified to run: YYYY-MM-DD
\n", - "LSST Science Pipelines version: ??
\n", + "Last verified to run: 2024-11-25
\n", + "LSST Science Pipelines version: Weekly 2024_42
\n", "Container size: medium
\n", "Targeted learning level: beginner
" ] @@ -76,13 +76,25 @@ "\n", "AI is a class of algorithms for building statistical models. These algorithms primarily use data for training, as opposed to models that use analytic formulae or models that are based on physical reasoning. Machine learning is a subclass of algorithms -- e.g., random forests. Deep learning is a subclass of algorithms -- e.g., neural networks. \n", "\n", - "This notebook uses `tensorflow`, one of the two most commonly used `python` libraries for deep learning. `Tensorflow` is often easier to use because of how it handles data sets and the logic used for model building. However, it is typically also difficult to develop network models creatively. We use `tensorflow` first in this series of tutorials so that users who are new to deep learning can focus on learning AI. In later tutorials, we will use `pytorch` because it is more flexible and more commonly used in science applications. \n", + "This notebook uses `tensorflow`, one of the two most commonly used `python` libraries for deep learning. `Tensorflow` is often easier to use because of how it handles data sets and the logic used for model building. However, it is typically also difficult to develop network models creatively. This tutorial is the first in a series, and uses `tensorflow` so that users who are new to deep learning can focus on learning AI. In later tutorials, `pytorch` will be used because it is more flexible and more commonly used in science applications. \n", "\n", - "This notebook uses [MNIST AI benchmarking data](https://en.wikipedia.org/wiki/MNIST_database). In a future notebook, we will we'll use stars and galaxies drawn from DP0 data.\n", + "Instead of using DP0 data, this tutorials uses [MNIST AI benchmarking data](https://en.wikipedia.org/wiki/MNIST_database), a large database of handwritten digits that is commonly used for training and testing machine learning algorithms. 
It is simple to understand, so that users who are new to deep learning can focus on learning AI. Later tutorials in this series will use stars and galaxies drawn from DP0 data.\n", "\n", - "The use of data in this notebook requires a medium-sized ram allocation (8Gi).\n", + "### 1.1. AI is math, not magic.\n", "\n", - "The end of this notebook contains a Glossary of Terms and a comment regarding usage of terms in AI contexts." + "AI is firmly based in math, computer science, and statistics. Additionally, some of the approaches are inspired by concepts or notions in biology (e.g., the computational neuron) and in physics (e.g., the restricted Boltzmann machine). \n", + "\n", + "Much of the jargon in AI is anthropomorphic, which can make it appear that something other than math is happening. For example, consider the following list of terms that are very often used in AI -- and what these terms actually mean mathematically.\n", + "\n", + "1. learn $\\rightarrow$ fit\n", + "2. hallucinate/lie $\\rightarrow$ predict incorrectly\n", + "3. understand $\\rightarrow$ model has converged\n", + "4. cheat $\\rightarrow$ more efficiently guesses the best weight parameters of the model\n", + "5. believe $\\rightarrow$ predict/infer based on statistical priors\n", + "\n", + "When we over-anthropomorphize this mathematical tool, we obfuscate how it actually works, and that makes it harder to build and refine models. That is, AI models are not 'learning' or 'understanding'; they are large-parameter models that are being fit to data. The only learning that's happening is what we do with these models.\n", + "\n", + "The end of this notebook contains a glossary of AI-related terms." ] }, { @@ -103,7 +115,7 @@ "id": "V3xHhKu6c5-e" }, "source": [ - "### 1.1. Import Packages\n", + "### 1.2. 
Import packages\n", "\n", "[`numpy`](https://numpy.org/) is a widely used Python library for computations and mathematical operations on multi-dimensional arrays.\n", "\n", @@ -141,7 +153,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 1.2 Define Functions" + "### 1.3. Define functions\n", + "\n", + "The following functions are defined and used throughout this notebook.\n", + "\n", + "It is not necessary to understand exactly what every function does in order to proceed with this tutorial.\n", + "\n", + "Execute all cells and move on to Section 2." ] }, { @@ -539,7 +557,7 @@ " n_objects_a = shape_a[0]\n", "\n", " if ndim == 2:\n", - " if n_classes == None:\n", + " if n_classes is None:\n", " n_classes = shape_a[1]\n", " if n_colors is None:\n", " n_colors = n_classes\n", @@ -871,14 +889,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 1.3 Define Paths for Data and Plots" + "### 1.4. Define paths for data and plots" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Neural network training (i.e., model fitting) typically requires many numerical experiments to achieve an ideal model. To facilitate the comparison of these experiments/models, it is helpful to organize data carefully. We set paths for the model weight parameters and diagnostic figures. We also set the variable `run_label` for each training run. We also save these paths in a dictionary to facilitate passing information to plotting functions." + "Neural network training (i.e., model fitting) typically requires many numerical experiments to achieve an ideal model. To facilitate the comparison of these experiments/models, it is helpful to organize data carefully. \n", + "\n", + "Set the variable `run_label` for each training run. 
\n", + "Set paths for the model weight parameters and diagnostic figures, and\n", + "save these paths in the dictionary `path_dict` to facilitate passing information to plotting functions.\n", + "Check whether the paths exist, and if not, create them." ] }, { @@ -889,14 +912,16 @@ "source": [ "run_label = \"Run000\"\n", "\n", + "temppath = os.getenv(\"HOME\") + '/dp02_16a_temp/'\n", "path_dict = {'run_label': run_label,\n", - " 'dir_data_model': \"Data/Models/\",\n", - " 'dir_data_figures': \"Data/Figures/\",\n", + " 'dir_data_model': temppath + \"Data/Models/\",\n", + " 'dir_data_figures': temppath + \"Data/Figures/\",\n", " 'file_model_prefix': \"Model\",\n", " 'file_figure_prefix': \"Figure\",\n", " 'file_figure_suffix': \".png\",\n", " 'file_model_suffix': \".keras\"\n", " }\n", + "del temppath\n", "\n", "if not os.path.exists(path_dict['dir_data_model']):\n", " os.makedirs(path_dict['dir_data_model'])\n", @@ -911,7 +936,7 @@ "id": "jce50kKEfHC1" }, "source": [ - "## 2. Load and Prepare data: MNIST Handwritten Digits" + "## 2. Load and prepare data: MNIST Handwritten Digits" ] }, { @@ -920,7 +945,7 @@ "id": "-dHXIDGEfLmO" }, "source": [ - "### 2.1. Download Dataset" + "### 2.1. Download the dataset" ] }, { @@ -929,9 +954,11 @@ "id": "-dHXIDGEfLmO" }, "source": [ - "The [`MNIST handwritten digits dataset`](https://ieeexplore.ieee.org/document/6296535) comprises 10 classes --- one for each digit. This is a useful dataset for learning the basics of neural networks and other AI algorithms. MNIST is one of a few canonical AI benchmark data sets for image classification. `tensorflow` has a simple function easily downloading the MNIST data to your local server for free. It automatically downloads the data into.\n", + "The [MNIST handwritten digits dataset](https://ieeexplore.ieee.org/document/6296535) comprises 10 classes --- one for each digit. This is a useful dataset for learning the basics of neural networks and other AI algorithms. 
MNIST is one of a few canonical AI benchmark data sets for image classification. `tensorflow` has a simple function for easily downloading the MNIST data to your local server for free.\n", "\n", - "The **input** data are held in `x_`, while the **output** (aka, label) data are held in `y_`." + "Automatically download the data using `tf.keras.datasets.mnist`.\n", + "The `tf` class automatically downloads data into training and test data sets;\n", + "load them into variables `train_temp` and `test_temp`, respectively." ] }, { @@ -956,7 +983,7 @@ "id": "O4FPxkLKiJKe" }, "source": [ - "### 2.2. Split Data into Train/Validation/Test" + "### 2.2. Split data into training, validation, and testing" ] }, { @@ -973,7 +1000,7 @@ "* **Validation** (`_val`) data is used indirectly to update the hyperparameters of the AI model -- e.g., the batchsize, the learning rate, or the layers in the architecture of a neural network. Each time the neural network has completed training with the training data, the human looks at those diagnostics when run on the training and the validation data.\n", "* **Test(ing)** (`_tes`) data is only used when the model is trained and validated and will no longer be update or further trained. \n", "\n", - "The `TF` class automatically downloads data into training and test data sets. Therefore, we use the `sklearn` `train_test_split()` function to further split the training set into training and validation data sets. We then \n" + "Set the fraction of the data set to use for validation as 25%." ] }, { @@ -987,24 +1014,37 @@ "fraction_validation = 0.25" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The **input** data are held in `x_`, while the **output** (aka, label) data are held in `y_`.\n", + "\n", + "Set the test data sets from `test_temp`, which was loaded above." 
+ ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# set the test data sets from the temp data at read-in\n", "x_tes, y_tes = test_temp[0], test_temp[1]" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use the `sklearn` `train_test_split()` function to further split the training set into training and validation data sets." + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# set the training and validata data sets from the temp data at read-in\n", - "# use the sklearn train_test_split function\n", "x_tra, x_val, y_tra, y_val = train_test_split(train_temp[0], train_temp[1],\n", " test_size=fraction_validation,\n", " random_state=1)" @@ -1018,9 +1058,10 @@ "source": [ "### 2.3. Normalize data\n", "\n", - "First, we make sure that the input data are floats. This allows us to perform computations on the real number line for the inputs.\n", + "Make sure that the input data are type float, to enable computations on the real number line for the inputs.\n", "\n", - "Second, we normalize the data according to the maximum value in all the data sets. The inputs will all exist on a smaller range. This improves the stability of the training." + "Calculate the minimum and maximum value across all data sub-sets.\n", + "Use the `normalizeInputs` function to normalize the data according to the minimum and maximum value in all the data sets. The inputs will all exist on a smaller range. This improves the stability of the training." 
] }, { @@ -1031,27 +1072,20 @@ }, "outputs": [], "source": [ - "# set to floats\n", "x_tra = x_tra.astype('float32')\n", "x_val = x_val.astype('float32')\n", "x_tes = x_tes.astype('float32')\n", "\n", - "# calculate min and max across all input images\n", "input_minimum = np.min([np.min(x_tra), np.min(x_val), np.min(x_tes)])\n", "input_maximum = np.max([np.max(x_tra), np.max(x_val), np.max(x_tes)])\n", "\n", - "print(\"Before\")\n", - "print(\"min/max\", np.min(x_tra), np.max(x_tra))\n", + "print(\"Before normalization, min and max: \", np.min(x_tra), np.max(x_tra))\n", "\n", "x_tra = normalizeInputs(x_tra, input_minimum, input_maximum)\n", "x_val = normalizeInputs(x_val, input_minimum, input_maximum)\n", "x_tes = normalizeInputs(x_tes, input_minimum, input_maximum)\n", "\n", - "print(\"After\")\n", - "print(\"min/max\", np.min(x_tra), np.max(x_tra))\n", - "\n", - "# get shapes\n", - "image_shape = x_tra[0, :, :].shape" + "print(\"After normalization, min and max: \", np.min(x_tra), np.max(x_tra))" ] }, { @@ -1060,7 +1094,7 @@ "id": "CdwlTbFafOYc" }, "source": [ - "### 2.4. Examine Raw Data" + "### 2.4. Examine raw data" ] }, { @@ -1071,14 +1105,27 @@ "source": [ "Review data shapes. \n", "\n", - "The zeroth elements of the `x` and `y` shapes should match. The first and second elements of `x` should be equal: these are the dimensions of the images. The image size, in part determines the depth of the neural network that can be created." + "The variable `image_shape` will be used again in Section 3.2." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"The shape of the input training set is:\")\n", + "image_shape = x_tra[0, :, :].shape\n", + "print(image_shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Print the data shapes to make sure you understand how many objects there are and what the number of pixels is for each image." 
+ "The zeroth elements of the `x` and `y` shapes should match. The first and second elements of `x` should be equal: these are the dimensions of the images. The image size, in part determines the depth of the neural network that can be created.\n", + "\n", + "Print the data shapes to understand how many objects there are and what the number of pixels is for each image." ] }, { @@ -1094,7 +1141,6 @@ }, "outputs": [], "source": [ - "print('check data shapes')\n", "print('x_train:', x_tra.shape)\n", "print('y_train:', y_tra.shape)\n", "print('x_valid:', x_val.shape)\n", @@ -1107,7 +1153,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Plot examples to gain visual familiarity. Do these all look like hand-written digits?" + "Plot examples to gain visual familiarity. All images should look like hand-written digits." ] }, { @@ -1136,7 +1182,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Plot pixel distributions to further understand data. Is it normalized? Do the disributions of the pixel values make sense according to what you see in the related images above? \n" + "> Figure 1: Two rows of five images, each a handwritten number in white on a black background.\n", + "\n", + "Plot the distributions of pixel values to further understand data.\n", + "\n", + "Note that all pixel data has been normalized, and pixels have values between 0 and 1 only.\n", + "\n", + "Note that the distribution of pixel values matches the images shown above: mostly black (values near 0) pixels, with some white (values near 1), and a few grey (values in between 0 and 1)." ] }, { @@ -1162,7 +1214,9 @@ "id": "rmAXXIHOfUnD" }, "source": [ - "## 3. Train Model: Dense Neural Network" + "> Figure 2: Two rows of five plots, each showing the distribution of pixel values (number of pixels of a given value) for the handwritten digit images shown in Figure 1.\n", + "\n", + "## 3. 
Train the model: dense neural network" ] }, { @@ -1178,7 +1232,7 @@ "id": "bKRmx2k2wNtE" }, "source": [ - "### 3.1. Define Model Training Parameters" + "### 3.1. Define model training parameters" ] }, { @@ -1194,12 +1248,17 @@ "id": "bKRmx2k2wNtE" }, "source": [ - "Define optimizer\n", - "Define loss\n", - "Define accuracy\n", - "Define batch_size\n", - "Define epochs\n", - "Define metrics" + " * `epochs` : \n", + " * `batch_size` : \n", + " * `verbose` : When `True`, the code will write more output to screen. This can help users diagnose issues.\n", + " * `optimizer` : \n", + " * `loss` : \n", + " * `metrics` : \n", + " * `dropout_rate` : \n", + " * `learning_rate` : \n", + " * `momentum` :\n", + "\n", + "Define the variables to hold the model training parameters." ] }, { @@ -1218,15 +1277,14 @@ "metrics = ['accuracy']\n", "dropout_rate = 0.3\n", "learning_rate = 0.01\n", - "momentum = 0.9\n", - "seed = 1000" + "momentum = 0.9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Set the random seed for neural network weight initialization" + "Set the random seed for neural network weight initialization." ] }, { @@ -1235,6 +1293,7 @@ "metadata": {}, "outputs": [], "source": [ + "seed = 1000\n", "tf.keras.utils.set_random_seed(seed)" ] }, @@ -1244,7 +1303,7 @@ "id": "rmAXXIHOfUnD" }, "source": [ - "### 3.2. Define Model" + "### 3.2. Define the model" ] }, { @@ -1253,12 +1312,16 @@ "id": "rmAXXIHOfUnD" }, "source": [ - "Define Sequential Model\n", - "Define layers\n", - "Define flat layer\n", - "Define dense layers\n", - "Define activation function; define types activation functions -- sigmoid and relu\n", - "Define weights and biases" + " * layers : \n", + " * flat layer : \n", + " * dense layers : \n", + " * activation function : \n", + " * sigmoid : \n", + " * softmax : \n", + " * weights and biases : \n", + " * sequential model :\n", + "\n", + "Define `model_layers` and use it to set `model` as a sequential model." 
] }, { @@ -1312,7 +1375,7 @@ "id": "bKRmx2k2wNtE" }, "source": [ - "### 3.3. Compile and Train Model" + "### 3.3. Compile and train the model" ] }, { @@ -1383,29 +1446,31 @@ "id": "suHSArn6wb27" }, "source": [ - "## 4. Diagnosing the Results of the Classification Model" + "## 4. Diagnosing the results of the classification model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### 4.1. Key Terms for Diagnostic Metrics\n", + "### 4.1. Key terms for diagnostic metrics\n", + "\n", + "Use the following diagnostics to assess the status of the network optimization and efficacy. \n", + "\n", + "A good reference is this [scikit-learn page on metrics and scoring](https://scikit-learn.org/stable/modules/model_evaluation.html).\n", + "\n", + "**Metrics**\n", + " * Loss:\n", + " * Accuracy: Use as a rough indicator of model training progress/convergence for balanced datasets. For model performance, use only in combination with other metrics. Avoid for imbalanced datasets. Consider using another metric.\n", + " * True Positive Rate (TPR; \"Recall\"): Use when false negatives are more expensive than false positives.\n", + " * for: Use when false positives are more expensive than false negatives.\n", + " * precision: Use when it's very important for positive predictions to be accurate.\n", "\n", - "We use the following diagnostics to assess the status of the network optimization and efficacy. \n", - "https://scikit-learn.org/stable/modules/model_evaluation.html\n", + "**Generalization Error**: The Generalization Error (GE) is the difference in loss when the model is applied to training data versus when it is applied to validation data and test data.\n", "\n", + "**Confusion Matrix**:\n", "\n", - "* **Metrics**\n", - " * Loss:\n", - " * Accuracy: Use as a rough indicator of model training progress/convergence for balanced datasets. For model performance, use only in combination with other metrics. Avoid for imbalanced datasets. 
Consider using another metric.\n", - " * tpr (Recall): Use when false negatives are more expensive than false positives.\n", - " * for: Use when false positives are more expensive than false negatives.\n", - " * precision: Use when it's very important for positive predictions to be accurate.\n", - " * \n", - "* **Generalization Error**: The Generalization Error (GE) is the difference in loss when the model is applied to training data versus when it is applied to validation data and test data.\n", - "* **Confusion Matrix**:\n", - "* **Receiver Operator Characteristic (ROC) Curve**:\n" + "**Receiver Operating Characteristic (ROC) Curve**:\n" ] }, { @@ -1451,6 +1516,13 @@ "y_pred_tes_topchoice = y_pred_tes.argmax(axis=1)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Verify that the length of `y_pred_tra` is 45000, matching the size of the training set in Section 2.4." + ] + }, { "cell_type": "code", "execution_count": null, @@ -1465,7 +1537,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Histograms of prediction distributions by class" + "Use the `plotPredictionHistogram` function to plot histograms of prediction distributions by class." ] }, { @@ -1486,8 +1558,22 @@ " figsize=(12, 5),\n", " file_prefix=file_prefix,\n", " file_location=path_dict['dir_data_figures'],\n", - " file_suffix=path_dict['file_figure_suffix'])\n", - "\n", + " file_suffix=path_dict['file_figure_suffix'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Figure 3: Histograms of the number of images for which the top-choice class was each number 0 through 9, for the training, validation, and test sets. Note that these are overlapping histograms, not stacked." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "file_prefix = path_dict['file_figure_prefix'] + \"_\"\\\n", " + \"Histograms_class_probabilities\"\\\n", " + \"_\" + path_dict['run_label']\n", @@ -1507,8 +1593,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Observations about these histograms ...\n", - "1. very similar shapes across the data sets: that's good" + "> Figure 4: Histograms of the number of images (y-axis) that had a probability (x-axis) of being each class 0 through 9 (light to dark shades).\n", + "\n", + "In both of Figure 3 and 4, the histograms show very similar shapes across the classification categories.\n", + "This is a good sign because it indicates the model does not 'prefer' a class." ] }, { @@ -1526,14 +1614,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 4.3.1. Loss History: History of Loss and Accuracy during Training" + "### 4.3.1. Loss History: history of loss and accuracy during training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Plot the loss history for the validation and training sets. We reserve the test set for a 'blind' analysis." + "Plot the loss history for the validation and training sets, but reserve the test set for a 'blind' analysis." ] }, { @@ -1558,14 +1646,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 4.3.2. Confusion Matrix: Bias in Trained Model?" + "> Figure 5: In the top panel, the loss history as a function of epoch for the training and validation sets decreases with time, as it should as the model improves. In the bottom panel, the loss residual (validation - training) shows a dip at epoch 2, indicating the model caused a divergence in the training and validation set classifications, but that this was rectified in later epochs.\n", + "\n", + "### 4.3.2. 
Confusion matrix: look for bias in the trained model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Compute confusion matrices" + "Compute the confusion matrices." ] }, { @@ -1583,7 +1673,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "plot confusion matrices for training, validation, and test samples (left, right, middle)" + "Plot the confusion matrices for the training, validation, and test samples (left, middle, right)." ] }, { @@ -1599,14 +1689,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 4.3.4. Investigating Errant Classifications: Look at the examples\n" + "> Figure 6: Confusion matrices for the training, validation, and test sets (left to right). The number of images with a given true (y-axis) and predicted (x-axis) classification is written in each box, and boxes are colored based on the number (from purple to yellow). The diagonal represents images that were correctly classified." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.3.4. Investigating errant classifications: look at the examples" + ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Choose a digit/class (human option/choice) for examination." + "Choose to investigate the digit classification `class_value` = 2." ] }, { @@ -1649,7 +1746,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "plot examples of false positives" + "Display 10 images that exemplify:\n", + " * true positives (correctly classified digits; a true 2 classified as 2);\n", + " * false positives (another digit classified as 2);\n", + " * true negatives (another digit classified as another digit); and\n", + " * false negatives (a true 2 classified as another digit)." 
] }, { @@ -1711,6 +1812,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "> Figure 7: Four panels of 10 images each, representing true positives (top), false positives (second), true negatives (third), and false negatives (bottom), for classification category 2.\n", + "\n", "Plot histograms of images pixels of true positives, false positives, true negatives, and false negatives." ] }, @@ -1773,6 +1876,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "> Figure 8: Histograms of the pixel flux values for the images shown in Figure 7.\n", + "\n", "## 5. Exercises for the Learner" ] }, @@ -1825,21 +1930,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## 7. AI is math, not magic.\n", - "\n", - "AI is firmly based in math, computer science, and statistics. Additionally, some of the approaches are inspired by concepts or notions in biology (e.g., the computational neuron) and in physics (e.g., the reverse Boltzmann machine). \n", - "\n", - "Much of the jargon in AI is anthropomorphic, which can make it appear that some other than math is happening. For example, consider the following list of terms that are very often used in AI -- and what these terms actually mean mathematically.\n", - "\n", - "1. learn $\\rightarrow$ fit\n", - "2. hallucinate/lie $\\rightarrow$ predict incorrectly\n", - "3. understand $\\rightarrow$ model has converged\n", - "4. cheat $\\rightarrow$ more efficiently guesses the best weight parameters of the model\n", - "5. believe $\\rightarrow$ predict/infer based on statistical priors\n", - "\n", - "When we over-anthropomorphize this mathematical tool, we obfuscate how it actually works, and that makes it harder to build and refine models. That is, AI models are not 'learning' or 'understanding'; they are large-parameter models that are being fit to data. The only learning that's happening is what we do with these models." - ] + "source": [] } ], "metadata": {