diff --git a/notebooks/212-pyannote-speaker-diarization/212-pyannote-speaker-diarization.ipynb b/notebooks/212-pyannote-speaker-diarization/212-pyannote-speaker-diarization.ipynb index 851e96ac272..e85bd30b90c 100644 --- a/notebooks/212-pyannote-speaker-diarization/212-pyannote-speaker-diarization.ipynb +++ b/notebooks/212-pyannote-speaker-diarization/212-pyannote-speaker-diarization.ipynb @@ -9,7 +9,7 @@ "\n", "Speaker diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. It is used to answer the question \"who spoke when?\"\n", "\n", - "![image.png](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/_images/asr_sd_diagram.png)\n", + "![image.png](https://developer-blogs.nvidia.com/wp-content/uploads/2022/09/speaker-diarization.png)\n", "\n", "With the increasing number of broadcasts, meeting recordings and voice mail collected every year, speaker diarization has received much attention by the speech community. 
Speaker diarization is an essential feature for a speech recognition system to enrich the transcription with speaker labels.\n", "\n", @@ -49,7 +49,7 @@ }, "outputs": [], "source": [ - "%pip install -q librosa>=0.8.1 \"ruamel.yaml>=0.17.8,<0.17.29\" --extra-index-url https://download.pytorch.org/whl/cpu torch torchvision torchaudio git+https://github.com/eaidova/pyannote-audio.git@hub0.10 openvino>=2023.1.0" + "%pip install -q \"librosa>=0.8.1\" \"matplotlib<3.8\" \"ruamel.yaml>=0.17.8,<0.17.29\" --extra-index-url https://download.pytorch.org/whl/cpu torch torchvision torchaudio git+https://github.com/eaidova/pyannote-audio.git@hub0.10 openvino>=2023.1.0" ] }, { @@ -98,7 +98,6 @@ "```python\n", "\n", "## login to huggingfacehub to get access to pre-trained model\n", - "[back to top ⬆️](#Table-of-contents:)\n", "from huggingface_hub import notebook_login, whoami\n", "\n", "try:\n", @@ -336,7 +335,7 @@ "## Convert model to OpenVINO Intermediate Representation format\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", - "For best results with OpenVINO, it is recommended to convert the model to OpenVINO IR format. OpenVINO supports PyTorch via ONNX conversion. We will use `torch.onnx.export` for exporting the ONNX model from PyTorch. We need to provide initialized model's instance and example of inputs for shape inference. We will use `mo.convert_model` functionality to convert the ONNX models. The `mo.convert_model` Python function returns an OpenVINO model ready to load on the device and start making predictions. We can save it on disk for the next usage with `openvino.runtime.serialize`." + "For best results with OpenVINO, it is recommended to convert the model to OpenVINO IR format. OpenVINO supports PyTorch via ONNX conversion. We will use `torch.onnx.export` for exporting the ONNX model from PyTorch. We need to provide initialized model's instance and example of inputs for shape inference. We will use `ov.convert_model` functionality to convert the ONNX models. 
The `ov.convert_model` Python function returns an OpenVINO model ready to load on the device and start making predictions. We can save it on disk for the next usage with `ov.save_model`." ] }, { @@ -567,7 +566,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -581,7 +580,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.5" + "version": "3.8.10" }, "vscode": { "interpreter": { @@ -598,4 +597,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/notebooks/212-pyannote-speaker-diarization/README.md b/notebooks/212-pyannote-speaker-diarization/README.md index f6f8c508a1a..aceff46035c 100644 --- a/notebooks/212-pyannote-speaker-diarization/README.md +++ b/notebooks/212-pyannote-speaker-diarization/README.md @@ -2,7 +2,7 @@ Speaker diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. It is used to answer the question "who spoke when?". Speaker diarization is an essential feature for a speech recognition system to enrich the transcription with speaker labels. -![image.png](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/_images/asr_sd_diagram.png) +![image.png](https://developer-blogs.nvidia.com/wp-content/uploads/2022/09/speaker-diarization.png) In this tutorial, we consider how to build speaker diarization pipeline using `pyannote.audio` and OpenVINO. `pyannote.audio` is an open-source toolkit written in Python for speaker diarization. 
Based on PyTorch deep learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. You can find more information about pyannote pre-trained models in [model card](https://huggingface.co/pyannote/speaker-diarization), [repo](https://github.com/pyannote/pyannote-audio) and [paper](https://arxiv.org/abs/1911.01255). diff --git a/notebooks/248-stable-diffusion-xl/248-ssd-b1.ipynb b/notebooks/248-stable-diffusion-xl/248-ssd-b1.ipynb index 7477c3ef18a..cb586dc3408 100644 --- a/notebooks/248-stable-diffusion-xl/248-ssd-b1.ipynb +++ b/notebooks/248-stable-diffusion-xl/248-ssd-b1.ipynb @@ -1,10 +1,14 @@ { "cells": [ { + "attachments": {}, "cell_type": "markdown", "id": "73eedd1846fbbc76", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "# Image generation with Segmind Stable Diffusion 1B (SSD-1B) model and OpenVINO\n", @@ -39,10 +43,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "a5e736551f15ee21", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "## Install prerequisites\n", @@ -58,22 +66,28 @@ "end_time": "2023-10-31T13:35:19.220747700Z", "start_time": "2023-10-31T13:35:19.220747700Z" }, - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ "%pip install -q \"git+https://github.com/huggingface/optimum-intel.git\"\n", "%pip install -q \"openvino>=2023.1.0\"\n", "%pip install -q --upgrade-strategy eager \"invisible-watermark>=0.2.0\" \"transformers>=4.33\" \"accelerate\" \"onnx\" \"onnxruntime\" safetensors \"diffusers>=0.22.0\"\n", - "%pip install -q diffusers transformers peft\n", "%pip install -q gradio" ] }, { + "attachments": {}, "cell_type": "markdown", "id": "bf52326c380ca9fc", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": 
{ + "outputs_hidden": false + } }, "source": [ "## SSD-1B Base model\n", @@ -92,7 +106,10 @@ "execution_count": null, "id": "1d28ca20bc63cc9b", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -105,10 +122,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "8c1b73f7f7b2ec5d", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "### Select inference device SSD-1B Base model\n", @@ -122,7 +143,10 @@ "execution_count": 2, "id": "8bc866dd4513a227", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [ { @@ -163,7 +187,10 @@ "execution_count": 3, "id": "2e170d7fb8271f0b", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [ { @@ -193,10 +220,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "9fce7a3b2f805652", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "### Run Text2Image generation pipeline\n", @@ -211,7 +242,10 @@ "execution_count": 6, "id": "ed145470184014ff", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [ { @@ -249,10 +283,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "d814c52f181d2be6", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "Generating a 512x512 image requires about 27GB for the SSD-1B model and about 42GB RAM for the SDXL model in case if the converted model is loaded from disk." 
@@ -263,7 +301,10 @@ "execution_count": 5, "id": "be7b2783f2102b1a", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [ { @@ -303,10 +344,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "32dff8c3c4923c4f", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "### Image2Image Generation Interactive Demo\n", @@ -318,7 +363,10 @@ "execution_count": null, "id": "8be18d8cdab7f519", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -361,10 +409,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "638828db7e5f5063", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "## Latent Consistency Model (LCM)\n", @@ -378,10 +430,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "2583b0876656dc25", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "### Infer the original model\n", @@ -393,7 +449,10 @@ "execution_count": 23, "id": "b89f0ec11edac979", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [ { @@ -462,10 +521,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "7d9f6a79afedaf11", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "### Convert the model to OpenVINO IR\n", @@ -481,10 +544,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "3dec5f5b0ffa6ef3", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "#### Imports\n", @@ -496,7 +563,10 @@ "execution_count": 24, "id": "ac96fd938479a24", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + 
"outputs_hidden": false + } }, "outputs": [], "source": [ @@ -509,10 +579,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "6b01303e1935524e", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "Let's define the conversion function for PyTorch modules. We use `ov.convert_model` function to obtain OpenVINO Intermediate Representation object and `ov.save_model` function to save it as XML file." @@ -523,7 +597,10 @@ "execution_count": 25, "id": "7bdacbe5bbe43b57", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -542,10 +619,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "efb3f4bc4c310a83", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "#### Convert VAE\n", @@ -563,7 +644,10 @@ "execution_count": 26, "id": "cdb432352d725f9", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -586,10 +670,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "c6de1e10ffb77ccc", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "#### Convert U-NET\n", @@ -603,7 +691,10 @@ "execution_count": 27, "id": "9571138b84a07a42", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -642,10 +733,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "c98577817b7b4fd", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "#### Convert Encoders\n", @@ -659,7 +754,10 @@ "execution_count": 28, "id": "aceff98bb77574a3", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], 
"source": [ @@ -683,7 +781,10 @@ "execution_count": 29, "id": "afde23a1d2b20c42", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -701,7 +802,10 @@ "execution_count": 30, "id": "214209a8eec921de", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -716,7 +820,10 @@ "execution_count": 31, "id": "8ef31e611078a50e", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -727,10 +834,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "f8db21ed7bdb17f5", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "### Compiling models\n", @@ -744,7 +855,10 @@ "execution_count": 32, "id": "de9eac5688e0ee71", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [ { @@ -783,7 +897,10 @@ "execution_count": 33, "id": "d0d9de3ffdcf60e6", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -794,10 +911,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "480a2608b0f3179", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "### Building the pipeline\n", @@ -811,7 +932,10 @@ "execution_count": 34, "id": "61a4a7518121b9c3", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -849,7 +973,10 @@ "execution_count": 35, "id": "1433accbbc474080", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -883,7 +1010,10 @@ "execution_count": 36, "id": "c4b85d4f193f8457", "metadata": { - 
"collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -902,10 +1032,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "e82978d2880bf6e0", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "And insert wrappers instances in the pipeline:" @@ -916,7 +1050,10 @@ "execution_count": 37, "id": "a3e49ecea1d7efd0", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -927,10 +1064,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "58cda6e40cd714aa", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "### Inference\n", @@ -942,7 +1083,10 @@ "execution_count": 40, "id": "2caefa6d9b741f35", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [ { @@ -979,10 +1123,14 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "c8a088d0be7c286d", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "source": [ "### Image2Image Generation with LCM Interactive Demo\n", @@ -994,7 +1142,10 @@ "execution_count": null, "id": "18a2de015398e5bb", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, "outputs": [], "source": [ @@ -1055,7 +1206,14 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.12" + "version": "3.8.10" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "state": {}, + "version_major": 2, + "version_minor": 0 + } } }, "nbformat": 4, diff --git a/notebooks/252-fastcomposer-image-generation/252-fastcomposer-image-generation.ipynb b/notebooks/252-fastcomposer-image-generation/252-fastcomposer-image-generation.ipynb index 
5b339983f5c..b051f57e6e9 100644 --- a/notebooks/252-fastcomposer-image-generation/252-fastcomposer-image-generation.ipynb +++ b/notebooks/252-fastcomposer-image-generation/252-fastcomposer-image-generation.ipynb @@ -85,14 +85,28 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Cloning into 'fastcomposer'...\n", + "remote: Enumerating objects: 339, done.\u001b[K\n", + "remote: Counting objects: 100% (276/276), done.\u001b[K\n", + "remote: Compressing objects: 100% (119/119), done.\u001b[K\n", + "remote: Total 339 (delta 170), reused 231 (delta 142), pack-reused 63\u001b[K\n", + "Receiving objects: 100% (339/339), 35.12 MiB | 13.94 MiB/s, done.\n", + "Resolving deltas: 100% (186/186), done.\n" + ] + } + ], "source": [ "from pathlib import Path\n", "\n", @@ -113,7 +127,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": { "collapsed": false, "jupyter": { @@ -140,21 +154,39 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config[\"id2label\"]` will be overriden.\n", + "`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config[\"bos_token_id\"]` will be overriden.\n", + "`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. 
The value `text_config[\"eos_token_id\"]` will be overriden.\n" + ] + }, + { + "data": { + "text/plain": [ + "_IncompatibleKeys(missing_keys=['vae.encoder.mid_block.attentions.0.to_q.weight', 'vae.encoder.mid_block.attentions.0.to_q.bias', 'vae.encoder.mid_block.attentions.0.to_k.weight', 'vae.encoder.mid_block.attentions.0.to_k.bias', 'vae.encoder.mid_block.attentions.0.to_v.weight', 'vae.encoder.mid_block.attentions.0.to_v.bias', 'vae.encoder.mid_block.attentions.0.to_out.0.weight', 'vae.encoder.mid_block.attentions.0.to_out.0.bias', 'vae.decoder.mid_block.attentions.0.to_q.weight', 'vae.decoder.mid_block.attentions.0.to_q.bias', 'vae.decoder.mid_block.attentions.0.to_k.weight', 'vae.decoder.mid_block.attentions.0.to_k.bias', 'vae.decoder.mid_block.attentions.0.to_v.weight', 'vae.decoder.mid_block.attentions.0.to_v.bias', 'vae.decoder.mid_block.attentions.0.to_out.0.weight', 'vae.decoder.mid_block.attentions.0.to_out.0.bias'], unexpected_keys=['text_encoder.embeddings.position_ids', 'image_encoder.vision_model.embeddings.position_ids', 'vae.encoder.mid_block.attentions.0.query.weight', 'vae.encoder.mid_block.attentions.0.query.bias', 'vae.encoder.mid_block.attentions.0.key.weight', 'vae.encoder.mid_block.attentions.0.key.bias', 'vae.encoder.mid_block.attentions.0.value.weight', 'vae.encoder.mid_block.attentions.0.value.bias', 'vae.encoder.mid_block.attentions.0.proj_attn.weight', 'vae.encoder.mid_block.attentions.0.proj_attn.bias', 'vae.decoder.mid_block.attentions.0.query.weight', 'vae.decoder.mid_block.attentions.0.query.bias', 'vae.decoder.mid_block.attentions.0.key.weight', 'vae.decoder.mid_block.attentions.0.key.bias', 'vae.decoder.mid_block.attentions.0.value.weight', 'vae.decoder.mid_block.attentions.0.value.bias', 'vae.decoder.mid_block.attentions.0.proj_attn.weight', 'vae.decoder.mid_block.attentions.0.proj_attn.bias'])" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "from dataclasses import 
dataclass\n", "\n", "import torch\n", "\n", - "from model import FastComposerModel\n", - "\n", "\n", "@dataclass()\n", "class Config:\n", @@ -201,7 +233,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": { "collapsed": false, "jupyter": { @@ -239,14 +271,27 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/ea/work/genai_env/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):\n", + "/home/ea/work/genai_env/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):\n", + "/home/ea/work/genai_env/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n", + " if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):\n" + ] + } + ], "source": [ "text_encoder_ir_xml_path = Path('models/text_encoder_ir.xml')\n", "example_input = torch.zeros((1, 77), dtype=torch.int64)\n", @@ -270,14 +315,32 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/ea/work/openvino_notebooks/notebooks/252-fastcomposer-image-generation/fastcomposer/fastcomposer/transforms.py:35: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " if h == w:\n", + "/home/ea/work/openvino_notebooks/notebooks/252-fastcomposer-image-generation/fastcomposer/fastcomposer/transforms.py:37: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n", + " elif h > w:\n" + ] + } + ], "source": [ "from collections import OrderedDict\n", "from torchvision import transforms as T\n", @@ -323,14 +386,23 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/ea/work/openvino_notebooks/notebooks/252-fastcomposer-image-generation/model.py:108: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " if h != self.image_size or w != self.image_size:\n" + ] + } + ], "source": [ "image_encoder_ir_xml_path = Path('models/image_encoder_ir.xml')\n", "example_input = torch.zeros((1, 2, 3, 256, 256), dtype=torch.float32)\n", @@ -354,7 +426,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": { "collapsed": false, "jupyter": { @@ -391,14 +463,45 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/ea/work/genai_env/lib/python3.8/site-packages/diffusers/models/unet_2d_condition.py:878: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n", + " if dim % default_overall_up_factor != 0:\n", + "/home/ea/work/genai_env/lib/python3.8/site-packages/peft/tuners/loha/layer.py:270: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", + " def forward(ctx, w1a, w1b, w2a, w2b, scale=torch.tensor(1)):\n", + "/home/ea/work/genai_env/lib/python3.8/site-packages/peft/tuners/loha/layer.py:293: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", + " def forward(ctx, t1, w1a, w1b, t2, w2a, w2b, scale=torch.tensor(1)):\n", + "/home/ea/work/genai_env/lib/python3.8/site-packages/diffusers/models/resnet.py:265: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " assert hidden_states.shape[1] == self.channels\n", + "/home/ea/work/genai_env/lib/python3.8/site-packages/diffusers/models/resnet.py:271: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n", + " assert hidden_states.shape[1] == self.channels\n", + "/home/ea/work/genai_env/lib/python3.8/site-packages/diffusers/models/resnet.py:173: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " assert hidden_states.shape[1] == self.channels\n", + "/home/ea/work/genai_env/lib/python3.8/site-packages/diffusers/models/resnet.py:186: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " if hidden_states.shape[0] >= 64:\n" + ] + }, + { + "data": { + "text/plain": [ + "16724" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "unet_ir_xml_path = Path('models/unet_ir.xml')\n", "\n", @@ -429,7 +532,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 19, "metadata": { "collapsed": false, "jupyter": { @@ -439,18 +542,12 @@ "outputs": [], "source": [ "import numpy as np\n", - "from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker\n", "from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput\n", "from diffusers.pipelines.stable_diffusion import StableDiffusionPipeline\n", "from diffusers.loaders import TextualInversionLoaderMixin\n", - "from diffusers.models import AutoencoderKL, UNet2DConditionModel\n", "from typing import Any, Callable, Dict, List, Optional, Union\n", - "from diffusers.schedulers import KarrasDiffusionSchedulers\n", - "from transformers import CLIPImageProcessor, CLIPTokenizer\n", "from PIL import Image\n", "\n", - 
"from model import FastComposerTextEncoder\n", - "\n", "\n", "class StableDiffusionFastCompposerPipeline(StableDiffusionPipeline):\n", " r\"\"\"\n", @@ -459,27 +556,6 @@ " This model inherits from [`StableDiffusionPipeline`]. Check the superclass documentation for the generic methods the\n", " library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)\n", " \"\"\"\n", - " def __init__(\n", - " self,\n", - " vae: AutoencoderKL,\n", - " text_encoder: FastComposerTextEncoder,\n", - " tokenizer: CLIPTokenizer,\n", - " unet: UNet2DConditionModel,\n", - " scheduler: KarrasDiffusionSchedulers,\n", - " safety_checker: StableDiffusionSafetyChecker,\n", - " feature_extractor: CLIPImageProcessor,\n", - " requires_safety_checker: bool = True,\n", - " ):\n", - " super().__init__(\n", - " vae,\n", - " text_encoder,\n", - " tokenizer,\n", - " unet,\n", - " scheduler,\n", - " safety_checker,\n", - " feature_extractor,\n", - " requires_safety_checker,\n", - " )\n", "\n", "\n", " @torch.no_grad()\n", @@ -711,7 +787,7 @@ " The prompt or prompts not to guide the image generation. If not defined, one has to pass\n", " `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is\n", " less than `1`).\n", - " num_images_per_prompt (`int`, *optional*, defaults to 1):\n", + " num_images_per_prompt (`int`, *optional*, defaults to 1):_unwrap_model\n", " The number of images to generate per prompt.\n", " eta (`float`, *optional*, defaults to 0.0):\n", " Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. 
Only applies to\n", @@ -937,14 +1013,38 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, - "outputs": [], + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "f5313f3d12634444bc2e20bb7f936156", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading pipeline components...: 0%| | 0/7 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "display(result[0][0])" ]