diff --git a/tutorials/README.md b/tutorials/README.md deleted file mode 100644 index b1d9668..0000000 --- a/tutorials/README.md +++ /dev/null @@ -1,91 +0,0 @@ ->This repository falls under the NIH STRIDES Initiative. STRIDES aims to harness the power of the cloud to accelerate biomedical discoveries. To learn more, visit https://cloud.nih.gov. - -# Microsoft Azure Tutorial Resources - -NIH Cloud Lab’s goal is to make Cloud easy and accessible for you, so that you can spend less time on administrative tasks and focus more on research. - -Use this repository to learn about how to use Azure by exploring the linked resources and walking through the tutorials. If you are a beginner, we suggest you start with the jumpstart section on the [Cloud Lab website](https://cloud.nih.gov/resources/cloudlab/) before returning here. - ---------------------------------- -## Overview of Page Contents - -+ [Artificial Intelligence](#ai) -+ [Clinical Informatics](#ci) -+ [Medical Imaging](#mi) -+ [Genomics on Azure](#bio) -+ [GWAS](#gwas) -+ [BLAST](#blast) -+ [VCF Query](#vcf) -+ [RNAseq](#rna) -+ [scRNAseq](#sc) -+ [Long Read Sequencing Analysis](#long) -+ [Open Data](#open) - -## **Artificial Intelligence** -Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed. Artificial intelligence and machine learning algorithms are being applied to a variety of biomedical research questions, ranging from image classification to genomic variant calling. Azure offers AI services through Azure AI Studio and Azure Machine Learning. - -See our suite of tutorials to learn more about [Gen AI on Azure](/notebooks/GenAI/) that highlight Azure products such as [Azure AI Studio](/notebooks/GenAI/Azure_AI_Studio_README.md), [Azure OpenAI](/notebooks/GenAI/Azure_Open_AI_README.md) and [Azure AI Search](/notebooks/GenAI/notebooks/Pubmed_RAG_chatbot.ipynb) and external tools like [Langchain](/notebooks/GenAI/notebooks/AzureAIStudio_langchain.ipynb). These notebooks walk you through how to deploy, train, and query models, as well as how to implement techniques like [Retrieval-Augmented Generation (RAG)](/notebooks/GenAI/notebooks/Pubmed_RAG_chatbot.ipynb). If you are interested in configuring a model to work with structured data like csv or json files, we've created tutorials that walk you through how to index your csv using the [Azure UI](/docs/create_index_from_csv.md) and query your database using a [notebook within Azure ML](/notebooks/GenAI/notebooks/AzureAIStudio_index_structured_with_console.ipynb). We also have another [tutorial that runs all the necessary steps directly from a notebook](/notebooks/GenAI/notebooks/AzureAIStudio_index_structured_notebook.ipynb). - - ## **Clinical Informatics with FHIR** -Azure Health Data Services is a set of services that enables you to store, process, and analyze medical data in Azure. These services are designed to help organizations quickly connect disparate health data sources and formats, such as structured, imaging, and device data, and normalize it to be persisted in the cloud. At its core, Azure Health Data Services possesses the ability to transform and ingest data into FHIR (Fast Healthcare Interoperability Resources) format. This allows you to transform health data from legacy formats, such as HL7v2 or CDA, or from high-frequency IoT data in device proprietary formats to FHIR. 
This makes it easier to connect data stored in Azure Health Data Services with services across the Azure ecosystem, like Azure Synapse Analytics, and Azure Machine Learning (Azure ML). - -Azure Health Data Services includes support for multiple health data standards for the exchange of structured data, and the ability to deploy multiple instances of different service types (FHIR, DICOM, and MedTech) that seamlessly work with one another. Services deployed within a workspace also share a compliance boundary and common configuration settings. The product scales automatically to meet the varying demands of your workloads, so you spend less time managing infrastructure and more time generating insights from health data. - -Copying healthcare data stored in Azure FHIR Server to Synapse Analytics allows researchers to leverage a cloud-scale data warehousing and analytics tool to extract insights from their data as well as build scalable research pipelines. -For information on how to perform this export and downstream analytics, please visit [this repository](https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/healthcare-apis/fhir/copy-to-synapse.md). - -You can also see hands-on examples of using [FHIR on Azure](https://github.com/microsoft/genomicsnotebook/tree/main/fhirgenomics), but note that you will need to supply your own VCF files as these are not provided with the tutorial content. - -## **Medical Imaging Analysis** -Medical imaging analysis requires the analysis of large image files and often requires elastic storage and accelerated computing. Microsoft Azure offers cloud-based medical imaging analysis capabilities through its Azure Healthcare APIs and Azure Medical Imaging solutions. Azure's DICOM Service allows for the secure storage, management, and processing of medical images in the cloud, using industry standard DICOM (Digital Imaging and Communications in Medicine) format. The DICOM Service provides features like high availability, disaster recovery, and scalable storage options, making it an ideal solution for pipelines that need to store, manage, and analyze large amounts of medical imaging data. In addition, the server integrates with other Azure services like Azure ML, facilitating the use of advanced machine learning algorithms for image analysis tasks such as object detection, segmentation, and classification. Read about how to deploy the service [here](https://learn.microsoft.com/en-us/azure/healthcare-apis/dicom/deploy-dicom-services-in-azure). - -Microsoft has several medical imaging notebooks that showcase different medical imaging use-cases on Azure Machine Learning. These notebooks demonstrate various data science techniques such as manual model development with PyTorch, automated machine learning, and MLOPS-based examples for automating the machine learning lifecycle in medical use cases, including retraining. -These notebooks are available [here](https://github.com/Azure/medical-imaging). Make sure you select a kernel that includes Pytorch else the install of dependencies can be challenging. Note also that you need to use a GPU VM for most of the notebook cells, but you can create several compute environments and switch between them as needed. Be sure to shut them off when you are finished. - -For Cloud Lab users interested in multi-modal clinical informatics, DICOMcast provides the ability to synchronize data from a DICOM service to a FHIR service, allowing users to integrate clinical and imaging data. 
DICOMcast expands the use cases for health data by supporting both a streamlined view of longitudinal patient data and the ability to effectively create cohorts for medical studies, analytics, and machine learning. For more information on how to utilize DICOMcast please visit Microsoft’s [documentation](https://learn.microsoft.com/en-us/azure/healthcare-apis/dicom/dicom-cast-overview) or the open-source [GitHub repository](https://github.com/microsoft/dicom-server/blob/main/docs/quickstarts/deploy-dicom-cast.md). - -For users hoping to train deep learning models on imaging data, InnerEye-DeepLearning (IE-DL) is a toolbox that Microsoft developed for easily training deep learning models on 3D medical images. Simple to run both locally and in the cloud with Azure Machine Learning, it allows users to train and run inference on the following: -• Segmentation models -• Classification and regression models -• Any PyTorch Lightning model, via a bring-your-own-model setup -This project exists in a separate [GitHub repository](https://github.com/microsoft/InnerEye-DeepLearning). - -## **Microsoft Genomics** -Microsoft has several genomics-related offerings that will be useful to many Cloud Lab users. For a broad overview, visit the [Microsoft Genomics Community site](https://microsoft.github.io/Genomics-Community/index.html). You can also get an overview of different execution options from [this blog](https://techcommunity.microsoft.com/t5/healthcare-and-life-sciences/genomic-workflow-managers-on-microsoft-azure/ba-p/3747052), and a detailed analysis for Nextflow with AWS Batch at [this blog](https://techcommunity.microsoft.com/t5/healthcare-and-life-sciences/rna-sequencing-analysis-on-azure-using-nextflow-configuration/ba-p/3738854). We highlight a few key services here: -+ [Genomics Notebooks](https://github.com/microsoft/genomicsnotebook): These example notebooks highlight many common use cases in genomics research. The Bioconductor/Rstudio notebook will not work in Cloud Lab. To run Rstudio, look at [Posit Workbench from the Marketplace](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/rstudio-5237862.rstudioserverprostandard). -+ [Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure): Documentation on how to spin up the resources needed to run Cromwell on Azure. Note that this service will not work within Cloud Lab because you need high-level permissions, but we list it here for demonstration purposes. -+ [Microsoft Genomics](https://learn.microsoft.com/en-us/azure/genomics/quickstart-run-genomics-workflow-portal): Run BWA and GATK using this managed service. Note that it uses Python 2.7 and thus is not compatible with AzureML (which uses Python 3), but you can run it from any other shell environment. -+ [Nextflow on Azure](https://microsoft.github.io/Genomics-Community/mydoc_nextflow.html): Run Nextflow workflows using Azure Batch. -+ [NVIDIA Parabricks for Secondary Genomics Analysis on Azure](https://techcommunity.microsoft.com/t5/healthcare-and-life-sciences/benchmarking-the-nvidia-clara-parabricks-for-secondary-genomics/ba-p/3722434). Follow this guide to run Parabricks on a VM by pulling the Docker container directly from NVIDIA. - -## **Genome Wide Association Studies** -Genome-wide association studies (GWAS) are large-scale investigations that analyze the genomes of many individuals to identify common genetic variants associated with traits, diseases, or other phenotypes. 
-- This [NIH CFDE written tutorial](https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud -) walks you through running a simple GWAS on AWS, thus we converted it to Azure in [this notebook](/notebooks/GWAS). Note that the CFDE page has a few other bioinformatics related tutorials like BLAST and Illumina read simulation. -- This blog post [illustrates some of the costs associated](https://techcommunity.microsoft.com/t5/azure-high-performance-computing/azure-to-accelerate-genome-wide-analysis-study/ba-p/2644120) with running GWAS on Azure - -## **NCBI BLAST+** -NCBI BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics program provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against a large database to identify similar sequences and infer evolutionary relationships, functional annotations, and structural information. -- [This Microsoft Blog](https://techcommunity.microsoft.com/t5/azure-high-performance-computing/running-ncbi-blast-on-azure-performance-scalability-and-best/ba-p/2410483) explains how to optimize BLAST analyses on Azure VMs. Feel free to install BLAST+ on a VM or an AzureML notebook and run queries there. - -## **Query a VCF file in Azure Synapse** -- You can use SQL to rapidly query a VCF file in Azure Synapse. The requires converting the file from VCF to Parquet format, a common format for databases. Read more about how to do this in Azure on [this Microsoft blog](https://techcommunity.microsoft.com/t5/healthcare-and-life-sciences/genomic-data-in-parquet-format-on-azure/ba-p/3150554). Although the notebooks for this tutorial are bundled with the other genomics notebooks, to get them to work you will need to use Azure Databricks or Synapse Analytics, not AzureML. - -## **RNAseq** -RNA-seq analysis is a high-throughput sequencing method that allows the measurement and characterization of gene expression levels and transcriptome dynamics. Workflows are typically run using workflow managers, and final results can often be visualized in notebooks. -- You can run this [Nextflow on Azure tutorial](https://microsoft.github.io/Genomics-Community/mydoc_nextflow.html) for RNAseq a variety of ways on Azure. Following the instructions outlined above, you could use Virtual Machines, Azure Machine Learning, or Azure Batch. -- For a notebook version of a complete RNAseq pipeline from Fastq to Salmon quantification from the NIGMS Sandbox Program use this [notebook](/notebooks/rnaseq-myco-tutorial-main), which we re-wrote to work on Azure. - -## **Single Cell RNAseq** -Single-cell RNA sequencing (scRNA-seq) is a technique that enables the analysis of gene expression at the individual cell level, providing insights into cellular heterogeneity, identifying rare cell types, and revealing cellular dynamics and functional states within complex biological systems. -- This [NVIDIA blog](https://developer.nvidia.com/blog/accelerating-single-cell-genomic-analysis-using-rapids/) details how to run an accelerated scRNAseq pipeline using RAPIDS. You can find a link to the GitHub that has lots of example notebooks [here](https://github.com/clara-parabricks/rapids-single-cell-examples). For each example use case they show some nice benchmarking data with time and cost for CPU vs. GPU machine types on AWS. You will see that most runs cost less than $1.00 with GPU machines (priced on AWS). 
If you want a CPU version that users Scanpy you can use this [notebook](https://github.com/clara-parabricks/rapids-single-cell-examples/blob/master/notebooks/hlca_lung_cpu_analysis.ipynb). Pay careful attention to the environment setup as there are a lot of dependencies for these notebooks. Create a conda environment in the terminal, then run the notebook. Consider using [mamba](https://github.com/mamba-org/mamba) to speed up environment creation. We created a [guide](/docs/create_conda_env.md) for conda environment set up as well. - -## **Long Read Sequence Analysis** -Long read DNA sequence analysis involves analyzing sequencing reads typically longer than 10 thousand base pairs (bp) in length, compared with short read sequencing where reads are about 150 bp in length. -Oxford Nanopore has a pretty complete offering of notebook tutorials for handling long read data to do a variety of things including variant calling, RNAseq, Sars-Cov-2 analysis and much more. Access the notebooks [here](https://labs.epi2me.io/nbindex/) and on [GitHub](https://github.com/epi2me-labs). These notebooks expect you are running locally and accessing the epi2me notebook server. To run them in Cloud Lab, skip the first cell that connects to the server and then the rest of the notebook should run correctly, with a few tweaks. Oxford Nanopore also offers a host of [Nextflow workflows](https://labs.epi2me.io/wfindex/) that will allow you to run a variety of long read pipelines. - -## **Open Data** -These publicly available datasets can save you time on data discovery and preparation by being curated and ready to use in your workflows. -+ The [COVID-19 Data Lake](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-covid-19-data-lake) contains COVID-19 related datasets from various sources. It covers testing and patient outcome tracking data, social distancing policy, hospital capacity and mobility. -+ In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the [COVID-19 Open Research Dataset (CORD-19)](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-covid-19-open-research?tabs=azure-storage). This dataset is a free resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community. This dataset mobilizes researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. -+ [The Genomics Data Lake](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-genomics-data-lake) provides various public datasets that you can access for free and integrate into your genomics analysis workflows and applications. 
The datasets include genome sequences, variant info, and subject/sample metadata in BAM, FASTA, VCF, CSV file formats: [Illumina Platinum Genomes](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-illumina-platinum-genomes), [Human Reference Genomes](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-human-reference-genomes), [ClinVar Annotations](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-clinvar-annotations), [SnpEff](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-snpeff), [Genome Aggregation Database (gnomAD)](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-gnomad), [1000 Genomes](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-1000-genomes), [OpenCravat](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-open-cravat), [ENCODE](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-encode), [GATK Resource Bundle](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-gatk-resource-bundle). diff --git a/tutorials/notebooks/GWAS/GWAS_coat_color.ipynb b/tutorials/notebooks/GWAS/GWAS_coat_color.ipynb deleted file mode 100644 index fd6bf6d..0000000 --- a/tutorials/notebooks/GWAS/GWAS_coat_color.ipynb +++ /dev/null @@ -1,583 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "7a244bb3", - "metadata": {}, - "source": [ - "# Runing Genome Wide Association Studies in the cloud" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview\n", - "Genome Wide Association Study analyses are conducted via the command line using mostly BASH commands, and then plotting often done using Python or R. Here, we adapted an [NIH CFDE tutorial](https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud/background/) and fit it to a notebook. We have greatly simplified the instructions, so if you need or want more details, look at the full tutorial to find out more.\n", - "\n", - "Most of this notebook is bash, but expects that you are using a Python kernel, until step 3, plotting, you will need to switch your kernel to R." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "We assume you have provisioned a compute environment in Azure ML Studio" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Learning objectives\n", - "+ Learn how to run GWAS analysis and visualize results in Azure AI Studio" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "id": "8fbf6304", - "metadata": {}, - "source": [ - "### Download the data\n", - "Use %%bash to denote a bash block. You can also use '!' 
to denote a single bash command within a Python notebook" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8ec900bd", - "metadata": { - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "%%bash\n", - "mkdir GWAS\n", - "curl -LO https://de.cyverse.org/dl/d/E0A502CC-F806-4857-9C3A-BAEAA0CCC694/pruned_coatColor_maf_geno.vcf.gz\n", - "curl -LO https://de.cyverse.org/dl/d/3B5C1853-C092-488C-8C2F-CE6E8526E96B/coatColor.pheno" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4d43ae73", - "metadata": { - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "%%bash\n", - "mv *.gz GWAS\n", - "mv *.pheno GWAS\n", - "ls GWAS" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "28aadbf8", - "metadata": {}, - "source": [ - "### Install packages\n", - "Here we install mamba, which is faster than conda. You could also skip this install and just use conda since that is preinstalled in the kernel." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b3ba3eef", - "metadata": { - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "%%bash\n", - "curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n", - "bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ae20d01c", - "metadata": { - "gather": { - "logged": 1686580882939 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "#add to your path\n", - "import os\n", - "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b219074a", - "metadata": { - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "! mamba install -y -c bioconda plink vcftools" - ] - }, - { - "cell_type": "markdown", - "id": "013d960d", - "metadata": {}, - "source": [ - "### Make map and ped files from the vcf file to feed into plink" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e91c7a01", - "metadata": { - "gather": { - "logged": 1686579597925 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "cd GWAS" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9b770f7f", - "metadata": { - "gather": { - "logged": 1686579600325 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "ls GWAS" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6570875d", - "metadata": { - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "! vcftools --gzvcf pruned_coatColor_maf_geno.vcf.gz --plink --out coatColor" - ] - }, - { - "cell_type": "markdown", - "id": "b9a38761", - "metadata": {}, - "source": [ - "### Create a list of minor alleles.\n", - "For more info on these terms, look at step 2 at https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud/analyze/" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6c868a67", - "metadata": { - "gather": { - "logged": 1686581972147 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "#unzip vcf\n", - "! 
vcftools --gzvcf pruned_coatColor_maf_geno.vcf.gz --recode --out pruned_coatColor_maf_geno" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8e11f991", - "metadata": { - "gather": { - "logged": 1686581979545 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "#create list of minor alleles\n", - "! cat pruned_coatColor_maf_geno.recode.vcf | awk 'BEGIN{FS=\"\\t\";OFS=\"\\t\";}/#/{next;}{{if($3==\".\")$3=$1\":\"$2;}print $3,$5;}' > minor_alleles" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8cff47e3", - "metadata": { - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "! head minor_alleles" - ] - }, - { - "cell_type": "markdown", - "id": "56d901c7", - "metadata": {}, - "source": [ - "### Run quality controls" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dafa14a6", - "metadata": { - "gather": { - "logged": 1686582023237 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "#calculate missingness per locus\n", - "! plink --file coatColor --make-pheno coatColor.pheno \"yellow\" --missing --out miss_stat --noweb --dog --reference-allele minor_alleles --allow-no-sex --adjust" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5cf5f51b", - "metadata": { - "gather": { - "logged": 1686582030150 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "#take a look at lmiss, which is the per locus rates of missingness\n", - "! head miss_stat.lmiss" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "915bb263", - "metadata": { - "gather": { - "logged": 1686582034753 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "#peek at imiss which is the individual rates of missingness\n", - "! head miss_stat.imiss" - ] - }, - { - "cell_type": "markdown", - "id": "4c11ca71", - "metadata": {}, - "source": [ - "### Convert to plink binary format" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3b8f2d7f", - "metadata": { - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "! plink --file coatColor --allow-no-sex --dog --make-bed --noweb --out coatColor.binary" - ] - }, - { - "cell_type": "markdown", - "id": "e36f6cd7", - "metadata": {}, - "source": [ - "### Run a simple association step (the GWAS part!)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f926ef9b", - "metadata": { - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "! plink --bfile coatColor.binary --make-pheno coatColor.pheno \"yellow\" --assoc --reference-allele minor_alleles --allow-no-sex --adjust --dog --noweb --out coatColor" - ] - }, - { - "cell_type": "markdown", - "id": "b397d484", - "metadata": {}, - "source": [ - "### Identify statistical cutoffs\n", - "This code finds the equivalent of 0.05 and 0.01 p value in the negative-log-transformed p values file. We will use these cutoffs to draw horizontal lines in the Manhattan plot for visualization of haplotypes that cross the 0.05 and 0.01 statistical threshold (i.e. 
have a statistically significant association with yellow coat color)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b94e1e2a", - "metadata": { - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "%%bash\n", - "unad_cutoff_sug=$(tail -n+2 coatColor.assoc.adjusted | awk '$10>=0.05' | head -n1 | awk '{print $3}')\n", - "unad_cutoff_conf=$(tail -n+2 coatColor.assoc.adjusted | awk '$10>=0.01' | head -n1 | awk '{print $3}')" - ] - }, - { - "cell_type": "markdown", - "id": "1f52e97c", - "metadata": {}, - "source": [ - "### Plotting\n", - "In this tutorial, plotting is done in R. Azure gets a bit funny about running these R commands, so we recommend just runnning the rest of the commands in the Terminal. Run `R` before running the commands. Otherwise you can just download the inputs and run locally in R studio." - ] - }, - { - "cell_type": "markdown", - "id": "effb5acd", - "metadata": {}, - "source": [ - "### Install qqman" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "60feed89", - "metadata": { - "gather": { - "logged": 1686582094642 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "install.packages('qqman', contriburl=contrib.url('http://cran.r-project.org/'))" - ] - }, - { - "cell_type": "markdown", - "id": "d3f1fcd2", - "metadata": {}, - "source": [ - "### Run the plotting function" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a7e8cd2b", - "metadata": { - "gather": { - "logged": 1686584355516 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "#make sure you are still CD in GWAS, when you change kernel it may reset to home\n", - "setwd('GWAS')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7946a3a7", - "metadata": { - "gather": { - "logged": 1686584356532 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "require(qqman)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0d28ef2c", - "metadata": { - "gather": { - "logged": 1686584364339 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "data=read.table(\"coatColor.assoc\", header=TRUE)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8e5207be", - "metadata": { - "gather": { - "logged": 1686584368241 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "data=data[!is.na(data$P),]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6330b1e0", - "metadata": { - "gather": { - "logged": 1686584371278 - }, - "vscode": { - "languageId": "r" - } - }, - "outputs": [], - "source": [ - "manhattan(data, p = \"P\", col = c(\"blue4\", \"orange3\"),\n", - " suggestiveline = 12,\n", - " genomewideline = 15,\n", - " chrlabs = c(1:38, \"X\"), annotateTop=TRUE, cex = 1.2)" - ] - }, - { - "cell_type": "markdown", - "id": "26787d84", - "metadata": {}, - "source": [ - "In our graph, haplotypes in four parts of the genome (chromosome 2, 5, 28 and X) are found to be associated with an increased occurrence of the yellow coat color phenotype.\n", - "\n", - "The top associated mutation is a nonsense SNP in the gene MC1R known to control pigment production. The MC1R allele encoding yellow coat color contains a single base change (from C to T) at the 916th nucleotide." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusions\n", - "You learned here how to run and visualize GWAS results using a notebook in Azure ML Studio." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Clean Up\n", - "Make sure you stop your compute instance and if desired, delete the resource group associated with this tutorial." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - } - ], - "metadata": { - "kernel_info": { - "name": "ir" - }, - "kernelspec": { - "display_name": "R", - "language": "R", - "name": "ir" - }, - "language_info": { - "codemirror_mode": "r", - "file_extension": ".r", - "mimetype": "text/x-r-source", - "name": "R", - "pygments_lexer": "r", - "version": "4.2.2" - }, - "microsoft": { - "ms_spell_check": { - "ms_spell_check_language": "en" - } - }, - "nteract": { - "version": "nteract-front-end@1.0.0" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/tutorials/notebooks/GenAI/Azure_AI_Studio_README.md b/tutorials/notebooks/GenAI/Azure_AI_Studio_README.md deleted file mode 100644 index 8c5573b..0000000 --- a/tutorials/notebooks/GenAI/Azure_AI_Studio_README.md +++ /dev/null @@ -1,436 +0,0 @@ -# Azure AI Studio Studio -Microsoft Azure migrated the AI front end from Azure OpenAI to Azure AI Studio. - -✨ The following tutorial was modified from this excellent [Microsoft workshop](https://github.com/t-cjackson/AOAI-FED-CIV-Workshop) developed by [Cameron Jackson](https://github.com/t-cjackson). ✨ - -Welcome to this repository, a comprehensive collection of examples that will help you chat with your data using the Azure OpenAI Studio Playground, create highly efficient large language model prompts, and build Azure OpenAI embeddings. - -The purpose of this workshop is to equip participants with the necessary skills to make the most out of the Azure OpenAI Playground, Prompt Engineering, and Azure OpenAI Embeddings in Python. You can view in-depth info on these topics in the [workshop slides](/notebooks/GenAI/search_documents/aoai_workshop_content.pdf). - -You can also learn a lot about the details of using Azure AI at this [site](https://azure.microsoft.com/en-us/products/ai-studio). - -We recommend you 1) go through the steps in this README, 2) complete the general notebook called `notebooks/AzureOpenAI_embeddings.ipynb`, then 3) explore the other notebooks at [this directory](/notebooks/GenAI/notebooks) - -## Overview of Page Contents -+ [Azure AI Playground Prerequisites](#Azure-OpenAI-Playground-Prerequisites) -+ [Chat Playground Navigation](#Chat-Playground-Navigation) -+ [Upload your own data and query over it](#Upload-your-own-data-and-query-over-it) -+ [Prompt Engineering Best Practices](#Prompt-Engineering-Best-Practices) -+ [Azure OpenAI Embeddings](#Azure-OpenAI-Embeddings) -+ [Additional Resources](#Additional-Resources) - -## Azure OpenAI Playground Prerequisites - -Navigate to Azure AI Studio. The easiest way is to search at the top of the page. - - ![search for azure openai](/docs/images/1_azure_ai_studio.png) - -Click new Azure AI. - - ![click to open azure open ai](/docs/images/2_click_new_azureai.png) - -Fill out the necessary information. Create a new Resource Group if needed. Click **Review and Create**. - - ![fill in the info](/docs/images/3_fill_form_azureai.png) - -Once the resource deploys, click **go to resource**. - - ![go to resource](/docs/images/4_go_to_resource.png) - -Now click **Go to Azure AI Studio**. 
You can also view your access keys at the bottom of the page. - - ![connect to OpenAI UI](/docs/images/5_launch_ai_studio.png) - -Before diving into the UI, stop and watch [this 14 minute overview video](https://www.youtube.com/watch?v=Qes7p5w8Tz8) to learn how to take full advantage of the Studio. We won't cover every option in this tutorial, but feel free to explore! - -When ready, go to **Build** and then click **+ New Project**. - - ![select new project](/docs/images/6_select_new_project.png) - -Fill in the resource name and other relevant information. Make sure you put your resource in the same resource group and region as your other Azure AI resources/environments. Then click **Create a Project**. - - ![create new project and resource](/docs/images/7_create_new_project.png) - -When ready, select your project. Now go to **Build** then **Playground**. - -Next, you need to deploy a model to power your chat bot. - -## Deploy a model - -On the left navigation panel, click **Deployments**, then click **Create** on the next screen. - - ![Click Models](/docs/images/8_create_model.png) - -Select your model of choice. Here we select gpt-4. - - ![select your model](/docs/images/9_select_model.png) - -Name your deployment and then click **Deploy**. - - ![deploy your model](/docs/images/10_name_and_deploy.png) - -Now under Deployments you should see your model. Feel free to deploy other models here, but be aware that you will pay for those deployed models. - - ![model is deployed](/docs/images/11_model_is_deployed.png) - -Run a quick test to ensure your deployment is acting as expected. Navigate to `Playground`, add an optional system message (we will cover this more later), and then type `Hello World` in the chat box. If you get a response, things are working well! Double check that on the far right it shows the correct deployment. - - ![test model](/docs/images/12_test_hello_world.png) - -Now we will look at [adding and querying over your own data](#Upload-your-own-data-and-query-over-it) and then review [prompt engineering best practices](#prompt-engineering-best-practices) using a general GPT model. - -## Chat Playground Navigation - -If you have not already, (A) navigate to the Chat Playground. Here we will walk through the various options available to you. First, you can specify a `System Message`, which tells the model what context to use when responding to inquiries. To modify this, (B) select `System message`, then (C) input a [System Message](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/system-message) in the prompt box, then (D) click **Apply Changes**. - -On the next tab over, you can (A) add your own data, which we dive into in the [next section](#Upload-your-own-data-and-query-over-it). In the middle of the page is where you actually interact with the model (B) through the chat prompts. Always (C) clear the chat after each session. - -On the far right under *Configuration*, you can modify which model you are using, which allows you to switch between different model deployments depending on the context. You can also modify the model's parameters on the same tab. - - ![modify deployment](/docs/images/19_deployment.png) - -Finally, you can select the `parameters` tab to modify the model parameters. Review [this presentation](/notebooks/GenAI/search_documents/aoai_workshop_content.pdf) to learn more about the parameters. 
- - ![modify parameters](/docs/images/20_parameters.png) - - Finally, click on **Prompt Samples** along the top and explore a few of these example prompts. - - ![modify parameters](/docs/images/13_prompt_examples.png) - - -## Upload your own data and query over it - -For an in-depth overview of adding your own data, check out this [Microsoft documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/use-your-data-quickstart?tabs=command-line&pivots=programming-language-studio). We give a quick start version here. - -Now, if you want to add your own data and query it, keep going here. If you want to jump ahead to prompt engineering with LLMs, jump down to [Prompt Engineering Best Practices](#prompt-engineering-best-practices). - -Within this repo there is a directory called `search_documents`. This directory contains a few PDFs that we will upload and query over related to [Immune Response to Mpox in a Woman Living with HIV](https://www.niaid.nih.gov/news-events/immune-response-mpox) and the [DCEG Diesel Exhaust in Miners Study](https://dceg.cancer.gov/news-events/news/2023/dems-ii). - -We are going to upload these PDFs to an Azure Storage Account and then add them to our Azure OpenAI workspace. Note that there are [upload limits](https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits#quotas-and-limits-reference) on the number and size of documents you can query within Azure OpenAI, so be sure to read these before getting started. For example, you can only query over a max of 30 documents and/or 1 GB of data. You can only upload the datatypes listed [here](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/use-your-data#data-formats-and-file-types) and shown below, and you will have the best results with markdown files. - -+ `.txt` -+ `.md` -+ `.html` -+ Microsoft Word files -+ Microsoft PowerPoint files -+ PDF - -Follow [this guide](/docs/create_storage_account.md) to create and upload to a storage account. Use a separate browser window so that you can easily get back to Azure OpenAI. - -Once you have uploaded your PDFs (or other datatypes if you are trying that), navigate back to the `Playground`, select `Add your data`, then click **Add a data source**. - - ![Add data source image](/docs/images/14_add_your_data.png) - -Select `Azure Blob Storage`, and then the correct `Storage Account` and `Container`. If this is your first time indexing documents, for `Select Azure AI Search resource` click **Create a new AI Search resource**, which will open a new window. You can add vector search to your AI Search resource, but you will need to first deploy the embedding model for it to be available. - - ![select data source](/docs/images/15_point_to_data.png) - -If needed, create the new Azure AI Search resource. Make sure you delete this when you are finished with Azure AI because it will accrue charges over time. - - ![create cog search](/docs/images/7_cog_search_resource.png) - -Now select your newly made Azure AI Search (formerly known as Cognitive Search) resource, and click **Next**. You can select to search with either [Vector](https://learn.microsoft.com/en-us/azure/search/vector-search-overview) or [Hybrid](https://learn.microsoft.com/en-us/azure/search/hybrid-search-overview) search. - - ![choose keyword](/docs/images/16_hybrid_search.png) - -On the last page, click **Save and close**. It will now take a few minutes to index your updated data. 
Read more [here](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search) about how Azure AI Search works behind the scenes. - - ![Save and close](/docs/images/17_save_and_close.png) - -Once it is complete, you should see your data source listed. - -Also check that the index is complete by viewing your AI Search resource, and going to `AI Search` on the left. - - ![Cog Search](/docs/images/18_check_ai_search.png) - -Now select your resource, select **indexes**, and then ensure that the number of documents listed is greater than 0. - - ![Cog Search](/docs/images/15_check_index.png) - -Now let's go back to the Playground and run some example queries of our custom data set. Feel free to modify and experiment. After reading the prompt engineering section below, return to this section and see how you can improve these examples. If you get errors after adding your data, try to refresh the page, and if all else fails, send us an email at CloudLab@nih.gov. - -``` -Summarize each of the documents I uploaded in a single paragraph, listing the title, the authors, followed by a five sentence summary for each. Give a new line after each summary. -``` -``` -What were some of the phenotypic presentations of MPOX on patients with HIV? -``` -``` -Are the phenotypic effects of MPOX the same for a patient with HIV and other patients? -``` -``` -Describe the primary findings of the Diesel Exhaust in Miners Study? -``` -``` -Does exposure to Diesel exhaust increase your risk for lung cancer? What about other cancers? Keep your response to one sentence for each of these queries. -``` - - ![search custom data files](/docs/images/16_search_custom_data.png) - -### Bonus: try uploading the grant data in the search_documents folder and run a few queries -Follow the instructions above for the two files in search_documents called `grant_data_sub1.txt` and `grant_data_sub2.txt`. These data were produced by searching [NIH Reporter](https://reporter.nih.gov/) for NCI-funded projects from fiscal years 2022-2024. The data were downloaded as a csv, converted to txt using Excel, then split in half using a very simple `head -2500 data.txt > grant_data_sub1.txt` and `tail -2499 data.txt > grant_data_sub2.txt`. The reason we split the data is that Azure has an upload limit of 16 MB and the downloaded file was over 30 MB. If you are downloading your own data, be mindful of these limits and split your files as necessary. - -Once the data is uploaded, try adding a system message like the following: -``` -Pretend to be a Program Officer at the National Institutes of Health in the National Cancer Institute. Your job is to review and summarize funded opportunities. Respond in a professional manner. -``` -Now try some prompts like these: - -``` -What funding years are included in the data I provided? -``` -``` -Based on the Project Abstract, Project Title, and public health relevance, please list the Project number of all projects related to women's health research and provide a summary of the women's health relevance for each. -``` -``` -Based on the Project Abstracts, what were the most commonly funded research areas in Fiscal Year 2022? -``` - -## Prompt Engineering Best Practices -First, review [this summary of prompt engineering](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/prompt-engineering) from Microsoft. - -### Write Clear Instructions - -1. 
Alter the system message to reply with a document that includes a playful comment or joke in each paragraph when responding to inquiries concerning writing assistance. This format should only be used for writing-related questions - -Add the following in the System Message box (SYSTEM:) -``` -You are a comedian English professor at the University of Giggles. When I ask for help to write something, you will reply with a document that contains at least one joke or playful comment. -``` -Add this query to the chat prompt box (QUERY:). -``` -Write a thank you note to my steel bolt vendor for getting a delivery in on time with short notice. This made it possible for my company to deliver an important order. -``` -Add the following to the system message, directing the LLM to only answer questions that involve writing assistance and then rerun the original query. -``` -If the user query does not have "write" in it, respond I do not know truthfully. -``` -2. Modify the system message by adding the prefix "Summary:" which should summarize the paragraph given, delimited with XML tags. Following the summary, the system should translate the paragraph from English to Spanish and add the prefix "Translation:". -To accomplish these tasks, the following steps should be taken: - 1. Identify the paragraph to be summarized, which should be delimited by XML tags. - 2. Generate a summary of the paragraph. - 3. Add the prefix "Summary:" to the beginning of the summary. - 4. Translate the paragraph from English to Spanish. - 5. Add the prefix "Translation:" to the beginning of the translated paragraph. - -SYSTEM: -``` -You will be given a paragraph delimited by XML tags. Use the following step-by-step sequence to respond to user inputs. - - Step 1) The user will provide you with a paragraph delimited by XML tags. Summarize the paragraph in one sentence with a prefix “Summary:” - Step 2) Translate the summary from Step 1 into Spanish, with a prefix “Translation:” -``` -QUERY: -``` - Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are designed to perform tasks that normally require human intelligence, such as learning, problem-solving, and decision-making. AI technology uses algorithms and statistical models to analyze data and make predictions and can be applied to a wide range of fields, including healthcare, finance, and transportation. AI is a rapidly growing field that has the potential to revolutionize many industries by increasing efficiency and productivity. However, as with any technology, there are also concerns about the ethical implications of AI, such as job displacement and privacy concerns. -``` - -Note: When implementing the above example, you might encounter a problem in Step 2 of the prompt where the model translates the entire paragraph instead of the single sentence summary. This issue is likely to arise when using the gpt-35-turbo model, primarily due to its limitations in reasoning capabilities, which impact its translation proficiency. A solution to this minor glitch is the gpt-4 model, which is designed to reason more effectively than the gpt-35-turbo model. - -1. Revise the model to classify the text it is given as either positive, neutral or negative. Once classified, have the LLM recognize the adjective it used to classify the text. Provide an example to the assitant for the LLM to comprehend tasks. - -SYSTEM -``` -Classify the text as either positive, neutral, or negative. Then find the adjective that allows you to classify the text. 
Follow the example to respond. - - USER: The movie was awesome! - - ASSISTANT: Positive. The adjective here is: awesome. - - USER: The movie was terrible. - - ASSISTANT: Negative. The adjective here is: terrible. - - USER: The movie was ok. - - ASSISTANT: Neutral. The adjective here is: ok. - - QUERY: I can’t wait to go to the beach. -``` - -### Providing Reference Text - -4. Revise the system message to create four bullet points outlining the key principles of the provided text delimited by triple quotes. -To accomplish this, the following steps should be taken: - 1. Identify the text to be analyzed, which should be delimited by triple quotes. - 2. Analyze the text to determine the key principles. - 3. Generate four bullet points that succinctly summarize each principle. - 4. Display the bullet points in the system message. - -SYSTEM: -``` -You will be given text delimited by triple quotes. Create 4 bullet points on the key principles of the text. Answer in the following format: - - Key principle 1 - - Key principle 2 - - Key principle 3 - - Key principle 4 -``` -QUERY: -``` - “”” - Learning a new language is an excellent way to broaden your horizons and improve your cognitive abilities. Firstly, being multilingual can open new opportunities both personally and professionally, such as traveling to new countries, connecting with people from different cultures, and expanding your job prospects. Secondly, it has been shown that learning a new language can improve cognitive function, such as memory, problem-solving, and decision-making skills. Additionally, it can increase empathy and cultural understanding, as well as enhance creativity and communication skills. Finally, it can boost confidence and self-esteem, as mastering a new language is a significant achievement and can provide a sense of accomplishment. Overall, the benefits of learning a new language are numerous and can have a positive impact on many aspects of your life. - “”” -``` -### Split complex tasks into simpler subtasks - -5. Give the system message primary and secondary categories for classifying customer service inquiries. The system should: - - take in customer service queries - - classify the query into primary and secondary categories - - output the response in JSON format with the following keys: primary and secondary - -SYSTEM: -``` -You will be provided with customer services queries. Classify each query into a primary category and a secondary category. Provide your output in JSON format with the keys: primary and secondary - Primary categories: Billing, Technical Support, Account Management, or General Inquiry - Billing secondary categories: - - Unsubscribe or upgrade - - Add a payment method - - Explanation for charge - - Dispute a charge - Technical Support secondary categories: - - Troubleshooting - - Device compatibility - - Software updates - Account Management secondary categories: - - Password reset - - Update personal information - - Close account - - Account security - General Inquiry secondary categories: - - Product information - - Pricing - - Feedback - - Speak to a human -``` -QUERY: -``` -I need to get my internet working again. -``` -(5) Continued: Based on the classification of the customer query from above, provide a set of more specific set of instructions to the system message for troubleshooting in a technical support context. - -SYSTEM: - -``` -You will be provided with customer service inquiries that require troubleshooting in a technical support context. 
Help the user by: - - Check all router cables are connected properly. If not, reconnect them as needed. - - Ask the customer which router model they are using. - - For MTD-327J model, instruct the customer to hold the red button for 5 seconds and wait 5 minutes before testing the connection. - - For MTD-327S model, advise the customer to unplug and replug the device and wait for 5 minutes before testing the connection. - - If the issue persists, output {"IT support requested"} to connect them to IT support. - - If the customer's questions are unrelated to troubleshooting, ask if they would like to end the chat and classify their request accordingly. - - -``` -QUERY: -``` -I need to get my internet working again. -``` - -### Give GPTs time to "think" - -6. Modify the system message to determine if the student’s solution is correct or not. The model should work out its own solution before comparing it to the student’s solution. Following an in-depth comparison, a decision on whether the answer is correct or not should be made. Add the following example so the model knows how to solve the problem correctly: - -SYSTEM: - - First work out your own solution to the problem. Then compare your solution to the student's solution and evaluate if the student's solution is correct or not. Don't decide if the student's solution is correct until you have done the problem yourself. - - USER: - - Problem Statement: I'm building a solar power installation and I need help working out the financials. - - Land costs $100 / square foot - - I can buy solar panels for $250 / square foot - - I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot - What is the total cost for the first year of operations as a function of the number of square feet. - - ASSISTANT: - - Let x be the size of the installation in square feet. - 1. Land cost: 100x - 2. Solar panel cost: 250x - 3. Maintenance cost: 100,000 + 10x - Total cost: 100x + 250x + 100,000 + 10x = 360x + 100,000 - -QUERY: - - Problem Statement: I'm building a solar power installation and I need help working out the financials. - - Land costs $100 / square foot - - I can buy solar panels for $250 / square foot - - I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot - What is the total cost for the first year of operations as a function of the number of square feet. - Student's Solution: Let x be the size of the installation in square feet. - 1. Land cost: 100x - 2. Solar panel cost: 250x - 3. Maintenance cost: 100,000 + 100x - Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000 - -### Test changes systematically - -7. Modify the system to detect whether the text it was given contains the following pieces of information it has directly. The text should be delimited by triple quotes. Here are the pieces of information to look for: - - Neil Armstrong was the first person to walk on the moon. - - The date Neil Armstrong walked on the moon was July 21, 1969. - -SYSTEM: - - You will be provided with text delimited by triple quotes that is supposed to be the answer to a question. Check if the following pieces of information are directly contained in the answer: - - - Neil Armstrong was the first person to walk on the moon. - - The date Neil Armstrong first walked on the moon was July 21, 1969. - - For each of these points perform the following steps but do not display the step number: - - Step 1 - Restate the point. 
- Step 2 - Provide a citation from the answer which is closest to this point. - Step 3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point. Explain why or why not before making up your mind. - Step 4 - Write "yes" if the answer to 3 was yes, otherwise write "no". - Finally, provide a count of how many "yes" answers there are. Provide this count as {"count": }. - -QUERY: - - """Neil Armstrong is famous for being the first human to set foot on the Moon. This historic event took place on July 21, 1969, during the Apollo 11 mission.""" - -## Azure OpenAI API and Embeddings - -### Background -Creating embeddings of search documents allows you to use vector search, which is much more powerful than basic keyword search. First, review this page on [how to create embeddings](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-generate-embeddings), and then review [how vector search works](https://learn.microsoft.com/en-us/azure/search/vector-search-overview). - -### Environment Setup -Navigate to your [Azure Machine Learning Studio environment](https://github.com/STRIDES/NIHCloudLabAzure#launch-a-machine-learning-workspace-jupyter-environment-). If you have not created your environment, [create one now](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-cloud-workstation?view=azureml-api-2). - -Navigate to `Notebooks`, then clone this Git repo into your environment and navigate to the notebook called [AzureOpenAI_embeddings.ipynb](/notebooks/GenAI/notebooks/AzureOpenAI_embeddings.ipynb). - -You will need a variety of parameters to authenticate with the API. You can find these within the Playground by clicking **View Code**. Input these parameters into the notebook cell when asked. - - ![Code View Image](/docs/images/find_endpointv2.png) - -Follow along with the notebook, and when finished, feel free to explore the other notebooks which use more advanced tools like Azure AI Search and LangChain. - -Finally, navigate back here to view the Additional Resources. Make sure to **Stop your Compute** when finished in Azure ML Studio. - -## Additional Resources - -### Azure OpenAI PLayground - -*These resources are for the older Azure OpenAI, but not all the docs have been updated to Azure AI Studio. While the front end has changed, the underlying services largely have not, so these docs should still serve you well.* - -[Azure OpenAI Service models](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models) - -[Adding data to Azure OpenAI Playground](https://learn.microsoft.com/en-us/azure/ai-services/openai/use-your-data-quickstart?tabs=command-line&pivots=programming-language-studio) - -[Azure OpenAI Chat API](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#chat-completions) - - -### Basics of Prompt Egineering - -[Prompting Techniques](https://www.promptingguide.ai/techniques) - -[Prompting Best Practices](https://platform.openai.com/docs/guides/gpt-best-practices) - -### Azure OpenAI Embeddings - -[Getting Started with Embeddings](https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/embeddings?tabs=command-line) - -[OpenAI Cookbook GitHub Repository](https://github.com/openai/openai-cookbook) - -## License - -This repository is licensed under the MIT License. See the [LICENSE](https://github.com/t-cjackson/Azure-OpenAI-Workshop/blob/main/LICENSE) file for more information. 
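**A minimal embeddings call in Python.** As a companion to the Azure OpenAI API and Embeddings section above, the sketch below shows one way to request an embedding from an Azure OpenAI deployment using the `openai` Python package (v1.x). This is a simplified illustration rather than the exact code in `AzureOpenAI_embeddings.ipynb`; the endpoint, API key, API version, and deployment name are placeholders that you would copy from **View Code** in the Playground.

```python
# Minimal sketch: request an embedding from an Azure OpenAI deployment.
# The endpoint, key, API version, and deployment name are placeholders;
# copy your own values from the "View Code" panel in the Playground.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<your-api-key>",                                    # placeholder
    api_version="2023-05-15",                                    # example API version
)

response = client.embeddings.create(
    model="<your-embedding-deployment>",  # e.g. a text-embedding-ada-002 deployment (assumption)
    input=["Immune response to mpox in a woman living with HIV"],
)

vector = response.data[0].embedding
print(len(vector))  # embedding length, e.g. 1536 for ada-002
```

If the call succeeds, the returned vectors can be stored in an Azure AI Search index for the vector and hybrid search approaches described earlier.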
diff --git a/tutorials/notebooks/GenAI/Azure_Open_AI_README.md b/tutorials/notebooks/GenAI/Azure_Open_AI_README.md deleted file mode 100644 index 6b41e99..0000000 --- a/tutorials/notebooks/GenAI/Azure_Open_AI_README.md +++ /dev/null @@ -1,395 +0,0 @@ -# Azure OpenAI Tutorial -✨ The following tutorial was modified from this excellent [Microsoft workshop](https://github.com/t-cjackson/AOAI-FED-CIV-Workshop) developed by [Cameron Jackson](https://github.com/t-cjackson). ✨ - -Welcome to this repository, a comprehensive collection of examples that will help you chat with your data using the Azure OpenAI Playground, create highly efficient large language model prompts, and build Azure OpenAI embeddings. This repository offers a wide range of examples that can be catered to your use cases, including: - -- Documents for LLM interactions in the Azure OpenAI Playground. -- 7 best practices for implementing prompt egineering in LLM applications. -- 4 Python scripts that demonstrate how to use Azure OpenAI Embeddings to create embedding applications. -- 42 in-depth content slides on the information covered in this workshop. Please find ```aoai_workshop_content.pdf``` in [search_documents](https://github.com/t-cjackson/Azure-OpenAI-Workshop/tree/main/search_documents) folder in this repository. - -The purpose of this workshop is to equip participants with the necessary skills to make the most out of the Azure OpenAI Playground, Prompt Engineering, and Azure OpenAI Embeddings in Python. You can view in-depth info on these topics in the [workshop slides](/notebooks/GenAI/search_documents/aoai_workshop_content.pdf). - -You can also learn a lot about the details of using Azure OpenAI at this [site](https://learn.microsoft.com/en-us/azure/ai-services/openai/use-your-data-quickstart?tabs=command-line&pivots=programming-language-studio). - -We recommend you 1) go through the steps in this README, 2) complete the general notebook called `notebooks/AzureOpenAI_embeddings.ipynb`, then 3) explore the other notebooks at [this directory](/notebooks/GenAI/notebooks) - -## Overview of Page Contents -+ [Azure OpenAI Playground Prerequisites](#Azure-OpenAI-Playground-Prerequisites) -+ [Chat Playground Navigation](#Chat-Playground-Navigation) -+ [Upload your own data and query over it](#Upload-your-own-data-and-query-over-it) -+ [Prompt Engineering Best Practices](#Prompt-Engineering-Best-Practices) -+ [Azure OpenAI Embeddings](#Azure-OpenAI-Embeddings) -+ [Additional Resources](#Additional-Resources) - -## Azure OpenAI Playground Prerequisites - -Navigate to Azure OpenAI. The easiest way is to search at the top of the page. - - ![search for azure openai](/docs/images/1_navigate_openai.png) - -At the time of writing, Azure OpenAI is in Beta and only available to customers via an application form, if you click **Create** that is the message you will see. If you click **Create** and do not get this message, then feel free to create a new OpenAI Service. Otherwise, please email us at CloudLab@nih.gov and ask us to set this part up for you. Once you have an OpenAI Service provisioned, click to open it. - - ![click to open azure open ai](/docs/images/2_select_openai_project.png) - -Now click **Go to Azure OpenAI Studio** or **Explore** to be connected to the Azure OpenAI studio user interface. - - ![connect to OpenAI UI](/docs/images/3_connet_open_ai.png) - -Click **Chat** - - ![click chat image](/docs/images/4_click_chat.png) - -Next, you need to deploy an OpenAI model. 
- -## Deploy an OpenAI model - -On the left navigation panel, click **Models** - - ![Click Models](/docs/images/10_click_models.png) - -Select the (A) `gpt-35-turbo model`, click (B) **Deploy**. You can learn more about the available models by clicking (C) **Learn more about the different types of base models**, or [here](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models). - - ![Deploy the model](/docs/images/11_deploy_model.png) - -Name your deployment and then click **Create**. - - ![Name your Deployment](/docs/images/12_name_your_deployment.png) - -Now if you select `Deployments` on the left panel, you should see your deployed model listed. - - ![Check Deployments](/docs/images/13_check_deployments.png) - -Run a quick test to ensure your deployment is acting as expected. Navigate to `Chat`, add an optional system message (we will cover this more later), and then type `Hello World` in the chat box. If you get a response, things are working well! - - ![test model](/docs/images/14_test_your_model.png) - -Now we will look at [adding and querying over your own data](#Upload-your-own-data-and-query-over-it) and then review [prompt engineering best practices](#prompt-engineering-best-practices) using a general GPT model. - -## Chat Playground Navigation - -If you have not already, (A) navigate to the Chat Playground. Here we will walk through the various options available to you. First, you can specify a `System Message`, which tells the model the context with which to respond to inquiries. To modify this, (B) select `System message`, then (C) input a [System Message](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/system-message#define-the-models-profile-capabilities-and-limitations-for-your-scenario) in the prompt box, then (D) click **Save**. - -On the next tab over, you can (A) add your own data, which we dive into in the [next section](#Upload-your-own-data-and-query-over-it). In the middle of the page is where you actually interact with the model (B) through the chat prompts. Always (C) clear the chat after each session. - - ![add your own data](/docs/images/18_add_custom_data.png) - -On the far right, you can modify which model deployment you are using, which allows you to switch between different model deployments depending on the context. - - ![modify deployment](/docs/images/19_deployment.png) - -Finally, you can select the `parameters` tab to modify the model parameters. Review [this presentation](/notebooks/GenAI/search_documents/aoai_workshop_content.pdf) to learn more about the parameters. - - ![modify parameters](/docs/images/20_parameters.png) - -## Upload your own data and query over it - -For an in-depth overview of adding your own data, check out this [Microsoft documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/use-your-data-quickstart?tabs=command-line&pivots=programming-language-studio). We give a quick start version here. - -Now, if you want to add your own data and query it, keep going here. If you want to jump ahead to prompt engineering with the general GPT model, jump down to [Prompt Engineering Best Practices](#prompt-engineering-best-practices). - -Within this repo there is a directory called `search_documents`. This directory contains a few PDFs that we will upload and query over related to [Immune Response to Mpox in Woman Living with HIV](https://www.niaid.nih.gov/news-events/immune-response-mpox) and the [DCEG Diesel Exhaust in Minors Study](https://dceg.cancer.gov/news-events/news/2023/dems-ii).
- -We are going to upload these PDFs to an Azure Storage Account and then add them to our Azure OpenAI workspace. Note that there are [upload limits](https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits#quotas-and-limits-reference) on the number and size of documents you can query within Azure OpenAI, so be sure to read these before getting started. For example, you can only query over a max of 30 documents and/or 1 GB of data. You can only upload the datatypes listed below (the full list is [here](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/use-your-data#data-formats-and-file-types)), and you will have the best results with markdown files. - -+ `.txt` -+ `.md` -+ `.html` -+ Microsoft Word files -+ Microsoft PowerPoint files -+ PDF - -Follow [this guide](/docs/create_storage_account.md) to create and upload to a storage account. Use a separate browser window so that you can easily get back to Azure OpenAI. - -Once you have uploaded your PDFs (or other datatypes if you are trying that), navigate back to the `Chat` section of Azure OpenAI and click **Add a data source**. - - ![Add data source image](/docs/images/5_add_data_source.png) - -Select `Azure Blob Storage`, and then the correct `Storage Account` and `Container`. If this is your first time indexing documents, for `Select Azure Cognitive Search resource` click **Create a new Azure Cognitive Search resource**, which will open a new window. - - ![select data source](/docs/images/6_point_to_data.png) - -If needed, create the new Azure Cognitive Search resource. Make sure you delete this when you are finished with Azure OpenAI because it will accrue charges over time. - - ![create cog search](/docs/images/7_cog_search_resource.png) - -Now select your newly made Azure Cognitive Search resource, and click **Next**. You can select to search with either [Keyword or Semantic search](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/use-your-data#search-options). Keyword is simple keyword-driven search; semantic search takes the context of the words into account and is normally better. If Semantic search is not allowed in your account, just use **Keyword**. Select **Next**. - - ![choose keyword](/docs/images/choose_keyword.png) - -On the last page, click **Save and close**. It will now take a few minutes to index your updated data. Read more [here](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search) about how Azure Cognitive Search is working behind the scenes. - - ![Save and close](/docs/images/9_review_and_close.png) - -Once it is complete, you should see your data source listed. Note that you can select the box that says `Limit responses to your data content` depending on whether you want to limit responses to your data or query your data plus the general model. - -Also check that the index is complete by viewing your Cognitive Search resource, and going to `Indexes`. Ensure that the number of documents listed > 0. - - ![Cog Search](/docs/images/15_check_index.png) - -Now let's run some example queries of our custom data set. Feel free to modify and experiment. After reading the prompt engineering section below, return to this section and see how you can improve these examples. If you get errors after adding your data, try to refresh the page, and if all else fails, send us an email at CloudLab@nih.gov. - -``` -Summarize each of the documents I uploaded in a single paragraph, listing the title, the authors, followed by a five sentence summary for each. Give a new line after each summary.
-``` -``` -What were some of the phenotypic presentations of MPOX on patients with HIV? -``` -``` -Are the phenotypic effects of MPOX the same for a patient with HIV and other patients? -``` -``` -Describe the primary findings of the Diesel Exhaust in Miners Study? -``` -``` -Does exposure to Diesel exhaust increase your risk for lung cancer? What about other cancers? Keep your response to one sentence for each of these queries. -``` - - ![search custom data files](/docs/images/16_search_custom_data.png) - -## Prompt Engineering Best Practices -First, review [this summary of prompt engineering](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/prompt-engineering) from Microsoft. - -### Write Clear Instructions - -1. Alter the system message to reply with a document that includes a playful comment or joke in each paragraph when responding to inquiries concerning writing assistance. This format should only be used for writing-related questions - -Add the following in the System Message box (SYSTEM:) -``` -You are a comedian English professor at the University of Giggles. When I ask for help to write something, you will reply with a document that contains at least one joke or playful comment. -``` -Add this query to the chat prompt box (QUERY:). -``` -Write a thank you note to my steel bolt vendor for getting a delivery in on time with short notice. This made it possible for my company to deliver an important order. -``` -Add the following to the system message, directing the LLM to only answer questions that involve writing assistance and then rerun the original query. -``` -If the user query does not have "write" in it, respond I do not know truthfully. -``` -2. Modify the system message by adding the prefix "Summary:" which should summarize the paragraph given, delimited with XML tags. Following the summary, the system should translate the paragraph from English to Spanish and add the prefix "Translation:". -To accomplish these tasks, the following steps should be taken: - 1. Identify the paragraph to be summarized, which should be delimited by XML tags. - 2. Generate a summary of the paragraph. - 3. Add the prefix "Summary:" to the beginning of the summary. - 4. Translate the paragraph from English to Spanish. - 5. Add the prefix "Translation:" to the beginning of the translated paragraph. - -SYSTEM: -``` -You will be given a paragraph delimited by XML tags. Use the following step-by-step sequence to respond to user inputs. - - Step 1) The user will provide you with a paragraph delimited by XML tags. Summarize the paragraph in one sentence with a prefix “Summary:” - Step 2) Translate the summary from Step 1 into Spanish, with a prefix “Translation:” -``` -QUERY: -``` - Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are designed to perform tasks that normally require human intelligence, such as learning, problem-solving, and decision-making. AI technology uses algorithms and statistical models to analyze data and make predictions and can be applied to a wide range of fields, including healthcare, finance, and transportation. AI is a rapidly growing field that has the potential to revolutionize many industries by increasing efficiency and productivity. However, as with any technology, there are also concerns about the ethical implications of AI, such as job displacement and privacy concerns. 
-``` - -Note: When implementing the above example, you might encounter a problem in Step 2 of the prompt where the model translates the entire paragraph instead of the single sentence summary. This issue is likely to arise when using the gpt-35-turbo model, primarily due to its limitations in reasoning capabilities, which impact its translation proficiency. A solution to this minor glitch is the gpt-4 model, which is designed to reason more effectively than the gpt-35-turbo model. - -3. Revise the model to classify the text it is given as either positive, neutral, or negative. Once classified, have the LLM recognize the adjective it used to classify the text. Provide an example to the assistant so the LLM can comprehend the task. - -SYSTEM: -``` -Classify the text as either positive, neutral, or negative. Then find the adjective that allows you to classify the text. Follow the example to respond. - - USER: The movie was awesome! - - ASSISTANT: Positive. The adjective here is: awesome. - - USER: The movie was terrible. - - ASSISTANT: Negative. The adjective here is: terrible. - - USER: The movie was ok. - - ASSISTANT: Neutral. The adjective here is: ok. - - QUERY: I can’t wait to go to the beach. -``` - -### Providing Reference Text - -4. Revise the system message to create four bullet points outlining the key principles of the provided text delimited by triple quotes. -To accomplish this, the following steps should be taken: - 1. Identify the text to be analyzed, which should be delimited by triple quotes. - 2. Analyze the text to determine the key principles. - 3. Generate four bullet points that succinctly summarize each principle. - 4. Display the bullet points in the system message. - -SYSTEM: -``` -You will be given text delimited by triple quotes. Create 4 bullet points on the key principles of the text. Answer in the following format: - - Key principle 1 - - Key principle 2 - - Key principle 3 - - Key principle 4 -``` -QUERY: -``` - “”” - Learning a new language is an excellent way to broaden your horizons and improve your cognitive abilities. Firstly, being multilingual can open new opportunities both personally and professionally, such as traveling to new countries, connecting with people from different cultures, and expanding your job prospects. Secondly, it has been shown that learning a new language can improve cognitive function, such as memory, problem-solving, and decision-making skills. Additionally, it can increase empathy and cultural understanding, as well as enhance creativity and communication skills. Finally, it can boost confidence and self-esteem, as mastering a new language is a significant achievement and can provide a sense of accomplishment. Overall, the benefits of learning a new language are numerous and can have a positive impact on many aspects of your life. - “”” -``` -### Split complex tasks into simpler subtasks - -5. Give the system message primary and secondary categories for classifying customer service inquiries. The system should: - - take in customer service queries - - classify the query into primary and secondary categories - - output the response in JSON format with the following keys: primary and secondary - -SYSTEM: -``` -You will be provided with customer service queries. Classify each query into a primary category and a secondary category.
Provide your output in JSON format with the keys: primary and secondary - Primary categories: Billing, Technical Support, Account Management, or General Inquiry - Billing secondary categories: - - Unsubscribe or upgrade - - Add a payment method - - Explanation for charge - - Dispute a charge - Technical Support secondary categories: - - Troubleshooting - - Device compatibility - - Software updates - Account Management secondary categories: - - Password reset - - Update personal information - - Close account - - Account security - General Inquiry secondary categories: - - Product information - - Pricing - - Feedback - - Speak to a human -``` -QUERY: -``` -I need to get my internet working again. -``` -(5) Continued: Based on the classification of the customer query from above, provide a set of more specific set of instructions to the system message for troubleshooting in a technical support context. - -SYSTEM: - -``` -You will be provided with customer service inquiries that require troubleshooting in a technical support context. Help the user by: - - Check all router cables are connected properly. If not, reconnect them as needed. - - Ask the customer which router model they are using. - - For MTD-327J model, instruct the customer to hold the red button for 5 seconds and wait 5 minutes before testing the connection. - - For MTD-327S model, advise the customer to unplug and replug the device and wait for 5 minutes before testing the connection. - - If the issue persists, output {"IT support requested"} to connect them to IT support. - - If the customer's questions are unrelated to troubleshooting, ask if they would like to end the chat and classify their request accordingly. - - -``` -QUERY: -``` -I need to get my internet working again. -``` - -### Give GPTs time to "think" - -6. Modify the system message to determine if the student’s solution is correct or not. The model should work out its own solution before comparing it to the student’s solution. Following an in-depth comparison, a decision on whether the answer is correct or not should be made. Add the following example so the model knows how to solve the problem correctly: - -SYSTEM: - - First work out your own solution to the problem. Then compare your solution to the student's solution and evaluate if the student's solution is correct or not. Don't decide if the student's solution is correct until you have done the problem yourself. - - USER: - - Problem Statement: I'm building a solar power installation and I need help working out the financials. - - Land costs $100 / square foot - - I can buy solar panels for $250 / square foot - - I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot - What is the total cost for the first year of operations as a function of the number of square feet. - - ASSISTANT: - - Let x be the size of the installation in square feet. - 1. Land cost: 100x - 2. Solar panel cost: 250x - 3. Maintenance cost: 100,000 + 10x - Total cost: 100x + 250x + 100,000 + 10x = 360x + 100,000 - -QUERY: - - Problem Statement: I'm building a solar power installation and I need help working out the financials. - - Land costs $100 / square foot - - I can buy solar panels for $250 / square foot - - I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot - What is the total cost for the first year of operations as a function of the number of square feet. 
- Student's Solution: Let x be the size of the installation in square feet. - 1. Land cost: 100x - 2. Solar panel cost: 250x - 3. Maintenance cost: 100,000 + 100x - Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000 - -### Test changes systematically - -7. Modify the system to detect whether the text it was given contains the following pieces of information it has directly. The text should be delimited by triple quotes. Here are the pieces of information to look for: - - Neil Armstrong was the first person to walk on the moon. - - The date Neil Armstrong walked on the moon was July 21, 1969. - -SYSTEM: - - You will be provided with text delimited by triple quotes that is supposed to be the answer to a question. Check if the following pieces of information are directly contained in the answer: - - - Neil Armstrong was the first person to walk on the moon. - - The date Neil Armstrong first walked on the moon was July 21, 1969. - - For each of these points perform the following steps but do not display the step number: - - Step 1 - Restate the point. - Step 2 - Provide a citation from the answer which is closest to this point. - Step 3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point. Explain why or why not before making up your mind. - Step 4 - Write "yes" if the answer to 3 was yes, otherwise write "no". - Finally, provide a count of how many "yes" answers there are. Provide this count as {"count": }. - -QUERY: - - """Neil Armstrong is famous for being the first human to set foot on the Moon. This historic event took place on July 21, 1969, during the Apollo 11 mission.""" - -## Azure OpenAI API and Embeddings - -### Background -Creating embeddings of search documents allows you to use vector search, which is much more powerful than the keyword search we used above. First, review this page on [how to create embeddings](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-generate-embeddings), and then review [how vector search works](https://learn.microsoft.com/en-us/azure/search/vector-search-overview). - -### Environment Setup -Navigate to your [Azure Machine Learning Studio environment](https://github.com/STRIDES/NIHCloudLabAzure#launch-a-machine-learning-workspace-jupyter-environment-). If you have not created your environment, [create one now](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-cloud-workstation?view=azureml-api-2). - -Navigate to `Notebooks`, then clone this Git repo into your environment and navigate to the notebook called [AzureOpenAI_embeddings.ipynb](/notebooks/GenAI/notebooks/AzureOpenAI_embeddings.ipynb). - -You will need a variety of parameters to authenticate with the API. You can find these within the Chat Playground by clicking **View Code**. Input these parameters into the notebook cell when asked. - - ![Code View Image](/docs/images/find_endpointv2.png) - -Follow along with the notebook, and when finished, feel free to explore the other notebooks which use more advanced tools like Azure AI Search and LangChain. - -Finally, navigate back here to view the Additional Resources. Make sure to **Stop your Compute** when finished in Azure ML Studio. 
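If you would like to confirm that the parameters you copied from **View Code** work before running the full notebook, the minimal sketch below mirrors what the notebook and the scripts in `embedding_demos/` do: it loads the same `.env` variables those scripts expect, embeds two short strings with your embedding deployment, and compares them with cosine similarity. This is only an illustration, not part of the tutorial code; it assumes the pre-1.0 `openai` package (installed with `pip install "openai[embeddings]"`) and that the `engine` value matches the name of your `text-embedding-ada-002` deployment, which may differ in your workspace.

```python
import os
import openai
from dotenv import load_dotenv
from openai.embeddings_utils import get_embedding, cosine_similarity

# Load the same variables the example scripts in this repo read from a .env file:
# AZURE_OPENAI_VERSION, AZURE_OPENAI_ENDPOINT, and AZURE_OPENAI_KEY (copied from View Code).
load_dotenv()
openai.api_type = "azure"
openai.api_version = os.environ["AZURE_OPENAI_VERSION"]
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
openai.api_key = os.environ["AZURE_OPENAI_KEY"]

# Embed two short strings and compare them; a score closer to 1 means the texts are more similar.
query_vector = get_embedding("immune response to mpox", engine="text-embedding-ada-002")
doc_vector = get_embedding("Immune response to mpox in a woman living with HIV", engine="text-embedding-ada-002")
print(f"Cosine similarity: {cosine_similarity(query_vector, doc_vector):.3f}")
```

If this fails with an authentication or resource-not-found error, double-check the endpoint, key, API version, and deployment name you copied from **View Code** before moving on to the notebook.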
- -## Additional Resources - -### Azure OpenAI PLayground - -[Azure OpenAI Service models](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models) - -[Adding data to Azure OpenAI Playground](https://learn.microsoft.com/en-us/azure/ai-services/openai/use-your-data-quickstart?tabs=command-line&pivots=programming-language-studio) - -[Azure OpenAI Chat API](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#chat-completions) - - -### Basics of Prompt Egineering - -[Prompting Techniques](https://www.promptingguide.ai/techniques) - -[Prompting Best Practices](https://platform.openai.com/docs/guides/gpt-best-practices) - -### Azure OpenAI Embeddings - -[Getting Started with Embeddings](https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/embeddings?tabs=command-line) - -[OpenAI Cookbook GitHub Repository](https://github.com/openai/openai-cookbook) - -## License - -This repository is licensed under the MIT License. See the [LICENSE](https://github.com/t-cjackson/Azure-OpenAI-Workshop/blob/main/LICENSE) file for more information. diff --git a/tutorials/notebooks/GenAI/LICENSE b/tutorials/notebooks/GenAI/LICENSE deleted file mode 100644 index 48bc6bb..0000000 --- a/tutorials/notebooks/GenAI/LICENSE +++ /dev/null @@ -1,21 +0,0 @@ -MIT License - -Copyright (c) Microsoft Corporation - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. 
diff --git a/tutorials/notebooks/GenAI/embedding_demos/acs_embeddings.py b/tutorials/notebooks/GenAI/embedding_demos/acs_embeddings.py deleted file mode 100644 index 8a4a68a..0000000 --- a/tutorials/notebooks/GenAI/embedding_demos/acs_embeddings.py +++ /dev/null @@ -1,79 +0,0 @@ -from langchain.retrievers import AzureCognitiveSearchRetriever -from langchain.embeddings import OpenAIEmbeddings -from langchain.vectorstores import FAISS -from langchain.chains import RetrievalQA -from langchain.chat_models import AzureChatOpenAI -from PIL import Image -import os -import streamlit as st -from dotenv import load_dotenv - -# load in .env variables -load_dotenv() - -def config_keys(): - # set api keys for AOAI and Azure Search - os.environ['OPENAI_API_VERSION'] = os.getenv('AZURE_OPENAI_VERSION') - os.environ['OPENAI_API_KEY'] = os.getenv('AZURE_OPENAI_KEY') - os.environ['OPENAI_API_BASE'] = os.getenv('AZURE_OPENAI_ENDPOINT') - os.environ['OPENAI_EMBEDDING_DEPLOYMENT_NAME'] = os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME') - os.environ['AZURE_COGNITIVE_SEARCH_SERVICE_NAME'] = os.getenv('AZURE_COGNITIVE_SEARCH_SERVICE_NAME') - os.environ['AZURE_COGNITIVE_SEARCH_API_KEY'] = os.getenv('AZURE_COGNITIVE_SEARCH_API_KEY') - os.environ['AZURE_COGNITIVE_SEARCH_INDEX_NAME'] = os.getenv('AZURE_COGNITIVE_SEARCH_INDEX_NAME') - - -def main(): - # Streamlit config - st.title("Demo - Azure OpenAI & Cognitive Search Embeddings") - image = Image.open('image_logo2.png') - st.image(image, caption = '') - st.write('This program is designed to chat over your files in Azure Cognitive Search. \ - Be specific and clear with the questions you ask. \ - Welcome to CHATGPT over your own data !!') - if 'generated' not in st.session_state: - st.session_state.generated = [] - if 'past' not in st.session_state: - st.session_state.past = [] - - # create your LLM and embeddings. Will be conifuring 'azure' in the openai_api_type parameter. 
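- # deployment_name below must match the name you gave your chat model deployment in Azure OpenAI Studio (assumed here to be "gpt-35-turbo")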
- llm = AzureChatOpenAI( - deployment_name = "gpt-35-turbo", - openai_api_type = "azure", - model = "gpt-35-turbo", - temperature=0.7, - max_tokens=200 - ) - - embeddings = OpenAIEmbeddings(chunk_size=1, openai_api_type="azure") - - # ask for the user query - query = st.text_input("Enter a search query: ", key='search_term', placeholder="") - - if query: - st.session_state.past.append(query) - - # set up Azure Cognitive Search to retrieve documents - # top_k = 1: we only want first related doc - retriever = AzureCognitiveSearchRetriever(content_key="content", top_k=1) - - # get the relevant document from Azure Cognitive Search that are only relevant to the query being asked - docs = retriever.get_relevant_documents(query) - - # create embedding from the document retrieved and place in a FAISS vector database - db = FAISS.from_documents(documents=docs, embedding=embeddings) - - # set up the chain that will feed the retrieved document to the LLM - chain = RetrievalQA.from_chain_type(llm=llm, retriever = db.as_retriever(), chain_type="stuff") - - # run the chain on the query asked - response = chain.run(query) - st.session_state.generated.append(response) - - with st.expander('Vector Search'): - for i in range(len(st.session_state.generated)-1, -1, -1): - st.info(st.session_state.past[i]) - st.success(st.session_state.generated[i]) - -if __name__ == '__main__': - config_keys() - main() diff --git a/tutorials/notebooks/GenAI/embedding_demos/aoai_embeddings.py b/tutorials/notebooks/GenAI/embedding_demos/aoai_embeddings.py deleted file mode 100644 index eb694c7..0000000 --- a/tutorials/notebooks/GenAI/embedding_demos/aoai_embeddings.py +++ /dev/null @@ -1,102 +0,0 @@ -import openai -from openai.embeddings_utils import get_embedding, cosine_similarity # must pip install openai[embeddings] -import pandas as pd -import numpy as np -import os -import streamlit as st -import time -from PIL import Image -from dotenv import load_dotenv - -# load in .env variables -load_dotenv() - -# configure azure openai keys -openai.api_type = 'azure' -openai.api_version = os.environ['AZURE_OPENAI_VERSION'] -openai.api_base = os.environ['AZURE_OPENAI_ENDPOINT'] -openai.api_key = os.environ['AZURE_OPENAI_KEY'] - -def embedding_create(): - # acquire the filename to be embed - st.subheader("Vector Creation") - st.write('This program is designed to embed your pre-chunked .csv file. \ - By accomplishing this task, you will be able to chat over all cotent in your .csv via vector searching. \ - Just enter the file and the program will take care of the rest (specify file path if not in this directory). \ - Welcome to CHATGPT over your own data !!') - filename = st.text_input("Enter a file: ", key='filename', value="") - - # start the embeddings process if filename provided - if filename: - - # read the data file to be embed - df = pd.read_csv('C:\\src\\AzureOpenAI_Gov_Workshop\\' + filename) - st.write(df) - - # calculate word embeddings - df['embedding'] = df['text'].apply(lambda x:get_embedding(x, engine='text-embedding-ada-002')) - df.to_csv('C:\\src\\AzureOpenAI_Gov_Workshop\\microsoft-earnings_embeddings.csv') - time.sleep(3) - st.subheader("Post Embedding") - st.success('Embeddings Created Sucessfully!!') - st.write(df) - - -def embeddings_search(): - - # Streamlit configuration - st.subheader("Vector Search") - st.write('This program is designed to chat over your vector stored (embedding) .csv file. \ - This Chat Bot works alongside the "Embeddings Bot" Chat Bot. 
\ - Be specific with the information you want to obtain over your data. \ - Welcome to CHATGPT over your own data !!') - if 'answer' not in st.session_state: - st.session_state.answer = [] - if 'score' not in st.session_state: - st.session_state.score = [] - if 'past' not in st.session_state: - st.session_state.past = [] - - # read in the embeddings .csv - # convert elements in 'embedding' column back to numpy array - df = pd.read_csv('C:\\src\\AzureOpenAI_Gov_Workshop\\microsoft-earnings_embeddings.csv') - df['embedding'] = df['embedding'].apply(eval).apply(np.array) - - # caluculate user query embedding - search_term = st.text_input("Enter a search query: ", key='search_term', placeholder="") - if search_term: - st.session_state.past.append(search_term) - search_term_vector = get_embedding(search_term, engine='text-embedding-ada-002') - - # find similiarity between query and vectors - df['similarities'] = df['embedding'].apply(lambda x:cosine_similarity(x, search_term_vector)) - df1 = df.sort_values("similarities", ascending=False).head(5) - - # output the response - answer = df1['text'].loc[df1.index[0]] - score = df1['similarities'].loc[df1.index[0]] - st.session_state.answer.append(answer) - st.session_state.score.append(score) - with st.expander('Vector Search'): - for i in range(len(st.session_state.answer)-1, -1, -1): - st.info(st.session_state.past[i]) - st.write(st.session_state.answer[i]) - st.write('Score: ', st.session_state.score[i]) - - -def main(): - # Streamlit config - st.title("Demo-Azure OpenAI Embeddings") - image = Image.open('image_logo2.png') - st.image(image, caption = '') - st.sidebar.title('Chat Bot Type Selection') - chat_style = st.sidebar.selectbox( - 'Choose between Embeddings Bot or Search Bot', ['Embeddings Bot','Search Bot'] - ) - if chat_style == 'Embeddings Bot': - embedding_create() - elif chat_style == 'Search Bot': - embeddings_search() - -if __name__ == '__main__': - main() diff --git a/tutorials/notebooks/GenAI/example_scripts/example_azureaisearch_openaichat_zeroshot.py b/tutorials/notebooks/GenAI/example_scripts/example_azureaisearch_openaichat_zeroshot.py deleted file mode 100644 index 8440bc8..0000000 --- a/tutorials/notebooks/GenAI/example_scripts/example_azureaisearch_openaichat_zeroshot.py +++ /dev/null @@ -1,99 +0,0 @@ -from langchain.chains import ConversationalRetrievalChain -from langchain.prompts import PromptTemplate -from langchain_community.retrievers import AzureCognitiveSearchRetriever -from langchain_openai import AzureChatOpenAI -import sys -import json -import os - - -class bcolors: - HEADER = '\033[95m' - OKBLUE = '\033[94m' - OKCYAN = '\033[96m' - OKGREEN = '\033[92m' - WARNING = '\033[93m' - FAIL = '\033[91m' - ENDC = '\033[0m' - BOLD = '\033[1m' - UNDERLINE = '\033[4m' - -MAX_HISTORY_LENGTH = 1 - -def build_chain(): - - os.getenv("AZURE_OPENAI_API_KEY") - os.getenv("AZURE_OPENAI_ENDPOINT") - os.getenv("AZURE_COGNITIVE_SEARCH_SERVICE_NAME") - os.getenv("AZURE_COGNITIVE_SEARCH_INDEX_NAME") - os.getenv("AZURE_COGNITIVE_SEARCH_API_KEY") - AZURE_OPENAI_DEPLOYMENT_NAME = os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"] - - llm = AzureChatOpenAI( - openai_api_version="2023-05-15", - azure_deployment=AZURE_OPENAI_DEPLOYMENT_NAME, - #max_tokens = 3000 -) - - retriever = AzureCognitiveSearchRetriever(content_key="content", top_k=2) - - - prompt_template = """ - Instructions: - I will provide you question and scientific documents you will answer my question with information from documents in English, and you will create a cumulative summary 
that is concise and accurate. - You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers. - Your summary should be written in your own words; ensure that your summary is clear and concise. - - {question} Answer "don't know" if not present in the documents. - {context} - Solution:""" - - - PROMPT = PromptTemplate( - template=prompt_template, input_variables=["context", "question"], - ) - - condense_qa_template = """ - Chat History: - {chat_history} - Here is a new question for you: {question} - Standalone question:""" - standalone_question_prompt = PromptTemplate.from_template(condense_qa_template) - - qa = ConversationalRetrievalChain.from_llm( - llm=llm, - retriever=retriever, - condense_question_prompt=standalone_question_prompt, - return_source_documents=True, - combine_docs_chain_kwargs={"prompt":PROMPT} - ) - return qa - -def run_chain(chain, prompt: str, history=[]): - print(prompt) - return chain({"question": prompt, "chat_history": history}) - -if __name__ == "__main__": - chat_history = [] - qa = build_chain() - print(bcolors.OKBLUE + "Hello! How can I help you?" + bcolors.ENDC) - print(bcolors.OKCYAN + "Ask a question, start a New search: or CTRL-D to exit." + bcolors.ENDC) - print(">", end=" ", flush=True) - for query in sys.stdin: - if (query.strip().lower().startswith("new search:")): - query = query.strip().lower().replace("new search:","") - chat_history = [] - elif (len(chat_history) == MAX_HISTORY_LENGTH): - chat_history.pop(0) - result = run_chain(qa, query, chat_history) - chat_history.append((query, result["answer"])) - print(bcolors.OKGREEN + result['answer'] + bcolors.ENDC) - if 'source_documents' in result: - print(bcolors.OKGREEN + 'Sources:') - for d in result['source_documents']: - dict_meta=json.loads(d.metadata['metadata']) - print(dict_meta['source']) - print(bcolors.ENDC) - print(bcolors.OKCYAN + "Ask a question, start a New search: or CTRL-D to exit." + bcolors.ENDC) - print(">", end=" ", flush=True) - print(bcolors.OKBLUE + "Bye" + bcolors.ENDC) diff --git a/tutorials/notebooks/GenAI/example_scripts/example_langchain_openaichat_zeroshot.py b/tutorials/notebooks/GenAI/example_scripts/example_langchain_openaichat_zeroshot.py deleted file mode 100644 index 9d04ccd..0000000 --- a/tutorials/notebooks/GenAI/example_scripts/example_langchain_openaichat_zeroshot.py +++ /dev/null @@ -1,93 +0,0 @@ -from langchain.retrievers import PubMedRetriever -from langchain.chains import ConversationalRetrievalChain -from langchain.prompts import PromptTemplate -from langchain_openai import AzureChatOpenAI -import sys -import json -import os - - -class bcolors: - HEADER = '\033[95m' - OKBLUE = '\033[94m' - OKCYAN = '\033[96m' - OKGREEN = '\033[92m' - WARNING = '\033[93m' - FAIL = '\033[91m' - ENDC = '\033[0m' - BOLD = '\033[1m' - UNDERLINE = '\033[4m' - -MAX_HISTORY_LENGTH = 1 - -def build_chain(): - os.getenv("AZURE_OPENAI_API_KEY") - os.getenv("AZURE_OPENAI_ENDPOINT") - AZURE_OPENAI_DEPLOYMENT_NAME = os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"] - - llm = AzureChatOpenAI( - openai_api_version="2023-05-15", - azure_deployment=AZURE_OPENAI_DEPLOYMENT_NAME, - #max_tokens = 3000 -) - - retriever= PubMedRetriever() - - prompt_template = """ - Ignore everything before. - Instructions: - I will provide you with research papers on a specific topic in English, and you will create a cumulative summary
- The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic. - You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers. - Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers. First, provide a concise summary then citations at the end. - {question} Answer "don't know" if not present in the document. - {context} - Solution:""" - - - PROMPT = PromptTemplate( - template=prompt_template, input_variables=["context", "question"], - ) - - condense_qa_template = """ - Chat History: - {chat_history} - Here is a new question for you: {question} - Standalone question:""" - standalone_question_prompt = PromptTemplate.from_template(condense_qa_template) - - qa = ConversationalRetrievalChain.from_llm( - llm=llm, - retriever=retriever, - condense_question_prompt=standalone_question_prompt, - return_source_documents=True, - combine_docs_chain_kwargs={"prompt":PROMPT}, - ) - return qa - -def run_chain(chain, prompt: str, history=[]): - print(prompt) - return chain({"question": prompt, "chat_history": history}) - -if __name__ == "__main__": - chat_history = [] - qa = build_chain() - print(bcolors.OKBLUE + "Hello! How can I help you?" + bcolors.ENDC) - print(bcolors.OKCYAN + "Ask a question, start a New search: or CTRL-D to exit." + bcolors.ENDC) - print(">", end=" ", flush=True) - for query in sys.stdin: - if (query.strip().lower().startswith("new search:")): - query = query.strip().lower().replace("new search:","") - chat_history = [] - elif (len(chat_history) == MAX_HISTORY_LENGTH): - chat_history.pop(0) - result = run_chain(qa, query, chat_history) - chat_history.append((query, result["answer"])) - print(bcolors.OKGREEN + result['answer'] + bcolors.ENDC) - if 'source_documents' in result: - print(bcolors.OKGREEN + 'Sources:') - for idx, ref in enumerate(result["source_documents"]): - print("PubMed UID: "+ref.metadata["uid"]) - print(bcolors.ENDC) - print(bcolors.OKCYAN + "Ask a question, start a New search: or CTRL-D to exit." 
+ bcolors.ENDC) - print(">", end=" ", flush=True) - print(bcolors.OKBLUE + "Bye" + bcolors.ENDC) diff --git a/tutorials/notebooks/GenAI/example_scripts/workshop_embedding.py b/tutorials/notebooks/GenAI/example_scripts/workshop_embedding.py deleted file mode 100644 index 4212701..0000000 --- a/tutorials/notebooks/GenAI/example_scripts/workshop_embedding.py +++ /dev/null @@ -1,34 +0,0 @@ -import openai -from openai.embeddings_utils import get_embedding, cosine_similarity # must pip install openai[embeddings] -import pandas as pd -import numpy as np -import os -import streamlit as st -from dotenv import load_dotenv -import time - - -# load in variables from .env -load_dotenv() - - -# set keys and configure Azure OpenAI -openai.api_type = 'azure' -openai.api_version = os.environ['AZURE_OPENAI_VERSION'] -openai.api_base = os.environ['AZURE_OPENAI_ENDPOINT'] -openai.api_key = os.environ['AZURE_OPENAI_KEY'] - - -# read the data file to be embed -df = pd.read_csv('microsoft-earnings.csv') -print(df) - - -# calculate word embeddings -df['embedding'] = df['text'].apply(lambda x:get_embedding(x, engine='text-embedding-ada-002')) -df.to_csv('microsoft-earnings_embeddings.csv') -time.sleep(3) -print(df) - - - diff --git a/tutorials/notebooks/GenAI/example_scripts/workshop_search.py b/tutorials/notebooks/GenAI/example_scripts/workshop_search.py deleted file mode 100644 index 4da089e..0000000 --- a/tutorials/notebooks/GenAI/example_scripts/workshop_search.py +++ /dev/null @@ -1,41 +0,0 @@ -import openai -from openai.embeddings_utils import get_embedding, cosine_similarity # must pip install openai[embeddings] -import pandas as pd -import numpy as np -import os -import streamlit as st -from dotenv import load_dotenv - - - -load_dotenv() - - - - -# set keys and configure Azure OpenAI -openai.api_type = 'azure' -openai.api_version = os.environ['AZURE_OPENAI_VERSION'] -openai.api_base = os.environ['AZURE_OPENAI_ENDPOINT'] -openai.api_key = os.environ['AZURE_OPENAI_KEY'] - -# read in the embeddings .csv -# convert elements in 'embedding' column back to numpy array -df = pd.read_csv('microsoft-earnings_embeddings.csv') -df['embedding'] = df['embedding'].apply(eval).apply(np.array) - -# caluculate user query embedding -search_term = input("Enter a search term: ") -if search_term: - search_term_vector = get_embedding(search_term, engine='text-embedding-ada-002') - - # find similiarity between query and vectors - df['similarities'] = df['embedding'].apply(lambda x:cosine_similarity(x, search_term_vector)) - df1 = df.sort_values("similarities", ascending=False).head(5) - - # output the response - print('\n') - print('Answer: ', df1['text'].loc[df1.index[0]]) - print('\n') - print('Similarity Score: ', df1['similarities'].loc[df1.index[0]]) - print('\n') diff --git a/tutorials/notebooks/GenAI/microsoft-earnings.csv b/tutorials/notebooks/GenAI/microsoft-earnings.csv deleted file mode 100644 index c4fcb16..0000000 --- a/tutorials/notebooks/GenAI/microsoft-earnings.csv +++ /dev/null @@ -1,63 +0,0 @@ -text -"Thank you, Brett. To start, I want to outline the principles that are guiding us through these changing economic times. First, we will invest behind categories that will drive the long-term secular trend where digital technology as a percentage of world's GDP will continue to increase.Second, we'll prioritize helping our customers get the most value out of their digital spending, so that they can do more with less. And finally, we will be disciplined in managing our cost structure." 
-"With that context, this quarter, the Microsoft Cloud again exceeded $25 billion in quarterly revenue, up 24% and 31% in constant currency. And based on current trends continuing, we expect our broader commercial business to grow at around 20% in constant currency this fiscal year, as we manage through the cyclical trends affecting our consumer business. With that, let me highlight our progress starting with Azure. Moving to the cloud is the best way for organizations to do more with less today." -"It helps them align their spend with demand and mitigate risk around increasing energy costs and supply chain constraints. We're also seeing more customers turn to us to build and innovate with infrastructure they already have. With Azure Arc, organizations like Wells Fargo can run Azure services, including containerized applications across on-premises, edge, and multi-cloud environments. We now have more than 8,500 Arc customers, more than double the number a year ago." -"We are the platform of choice for customers' SAP workloads in the cloud, companies like Thabani, Munich Re's, Sodexo, Volvo Cars, all run SAP on Azure. We are the only cloud provider with direct and secure access to Oracle databases running an Oracle Cloud infrastructure, making it possible for companies like FedEx, GE, and Marriott to use capabilities from both companies. And with Azure Confidential Computing, we're enabling companies in highly regulated industries, including RBC, to bring their most sensitive applications to the cloud. Just last week, UBS said it will move more than 50% of its applications to Azure." -"Now to data and AI. With our Microsoft Intelligent Data Platform, we provide a complete data fabric, helping customers cut integration tax associated with bringing together siloed solutions. Customers like Mercedes-Benz are standardizing on our data stack to process and govern massive amounts of data. Cosmos DB is the go-to database powering the world's most demanding workloads at limitless scale." -"Cosmos DB now supports postscript SQL, making Azure the first cloud provider to offer a database service that supports both relational and no SQL workloads. And in AI, we are turning the world's most advanced models into platforms for customers. Earlier this month, we brought the power of DALL-E to Azure OpenAI service, helping customers like Mattel apply the breakthrough image generation model to commercial use cases for the first time. In Azure machine learning, provides industry-leading ML apps, helping organizations like 3M deploy, manage and govern models." -"All of, Azure ML revenue has increased more than 100% for four quarters in a row. Now on to developers. We have the most complete platform for developers to build cloud-native applications. Four years since our acquisition, GitHub is now at $1 billion in annual recurring revenue." -"And GitHub's developer-first ethos has never been stronger. More than 90 million people now use the service to build software for any cloud on any platform up three times. GitHub advanced security is helping organizations improve their security posture by bringing features directly into the developer's workflow. Toyota North America chose the offering this quarter to help its developers build and secure many of its most critical applications." -"Now on to Power Platform. We are helping customers save time and money with our end-to-end suite spanning Low-Code/No-Code tools, robotic process automation, virtual agents, and business intelligence. 
Power BI is the market leader in business intelligence in the cloud and is growing faster than competition, as companies like Walmart standardize on the tool for reporting and analytics. Power Apps is the market leader in Low-Code/No-Code tools and has nearly 15 million monthly active users, up more than 50% compared to a year ago." -"Power Automate has more than seven million monthly active users and is being used by companies like Brown-Forman, Komatsu, Mass, T-Mobile to digitize manual business processes and save thousands of hours of employee time. And we're going further with new AI-powered capabilities and power automate that turn natural language into advanced workflows. Now on to Dynamics 365. From customer experience and service to finance and supply chain, we continue to take share across all categories we serve." -"For example, Lufthansa Cargo chose us to centralize customer information and related shipments. CBRE is optimizing its field service operations, gaining cost efficiencies. Darden is using our solutions to increase both guest frequency and spend at its restaurants. And Tillamook is scaling its growth and improving supply chain visibility." -"All up more than 400,000 organizations now use our business applications. Now on to Industry Solutions. We are seeing increased adoption of our industry and cross-industry clouds. Bank of Queensland chose our cloud for financial services to deliver new digital experiences for its customers." -"Our cloud for sustainability is off to a fast start as organizations like Telstra use the solution to track their environmental footprint. New updates provide insights on hard-to-measure Scope 3 carbon emissions, and we are seeing record growth in healthcare, driven, in part, by our Nuance DAX ambient intelligence solutions, which automatically documents patient encounters at the point of care. Physicians tell us DAX dramatically improves their productivity, and it's quickly becoming an on-ramp to our broader healthcare offerings. Now on to new systems of work, Microsoft 365, Teams, and Viva uniquely enable employees to thrive in today's digitally connected distributed world of work." -"Microsoft 365 is the cloud-first platform that supports all the ways people work and every type of worker reducing cost and complexity for IT. The new Microsoft 365 app brings together our productivity apps with third-party content, as well as personalized recommendations. Microsoft Teams is the de facto standard for collaboration and has become essential to how hundreds of millions of people meet, call, chat, collaborate and do business. As we emerge from the pandemic, we are retaining users we have gained and are seeing increased engagement, too." -"Users interact with Teams 1,500 times per month on average. In a typical day, the average commercial user spends more time in Teams chat than they do in email, and the number of users who use four or more features within Teams increased over 20% year over year. Teams is becoming a ubiquitous platform for business process. Monthly active enterprise users running third party and custom applications within Teams increased nearly 60% year over year, and over 55% of our enterprise customers who use Teams today also buy Teams Rooms or Teams Phone." -"Teams Phone provides the best-in-class calling. PSTN users have grown by double digits for five quarters in a row. We are bringing Teams Rooms to a growing hardware ecosystem, including Cisco's devices and peripherals, which will now run Teams natively. 
And we are creating a new category with Microsoft Places to help organizations evolve and manage the space for hybrid and in-person work." -"Just like Outlook calendar orchestrates when people can meet and collaborate, Places will do the same for where. We also announced Teams Premium, addressing enterprise demands for advanced meeting features like additional security options and intelligent meeting recaps. All this innovation is driving growth across Microsoft 365. Leaders in every industry from Fannie Mae and Land O'Lakes to Rabobank continue to turn to our premium E5 offerings for advanced security, compliance, voice, and analytics." -"We've also built a completely new suite for our employee experience platform, Microsoft Viva, which now has more than 20 million monthly active users at companies like Finastra, SES, and Unilever. And we are extending Viva to meet role-specific needs. Viva Sales is helping salespeople at companies like Adobe, Crayon, and PwC reclaim their time by bringing customer interactions across Teams and Outlook directly into their CRM system. Now on to Windows." -"Despite the drop in PC shipments during the quarter, Windows continues to see usage growth. All up, there are nearly 20% more monthly active Windows devices than pre-pandemic. And on average, Windows 10 and Windows 11 users are spending 8.5% more time on their PCs than they were two and a half years ago. And we are seeing larger commercial deployments of Windows 11." -"Accenture, for example, has deployed Windows 11 to more than 450,000 employees' PCs, up from just 25,000, seven months ago, and L'Oreal has deployed the operating system to 85,000 employees. Now to security. Security continues to be a top priority for every organization. We're the only company with integrated end-to-end tools spanning security, compliance, identity and device management, and privacy across all clouds and platforms." -"More than 860,000 organizations across every industry from BP and Fuji Film to ING Bank, iHeartMedia, and Lumen Technologies now use our security solutions, up 33% year over year. They can save up to 60% when they consolidate our security stack, and the number of customers with more than four workloads have increased 50% year over year. More organizations are choosing both our XDR and cloud-native SIM to secure their entire digital estate. The number of E5 customers who also purchased Sentinel increased 44% year over year." -"And as threats become more sophisticated, we are innovating to protect customers. New capabilities in Defender help secure the entire DevOps life cycle and manage security posture across clouds. And Entra now provides comprehensive identity governance for both on-premise and cloud-based user directories. Now on to LinkedIn." -"We once again saw record engagement among our more than 875 million members, with international growth increasing at nearly 2x the pace as in the United States. There are now more than 150 million subscriptions to newsletters on LinkedIn, up 4x year over year. New integrations between Viva and LinkedIn Learning helped companies invest in their existing employees by providing access to courses directly in the flow of work. Members added 365 million skills to their profiles over the last 12 months, up 43% year over year." -"And with our acquisition of EduBrite, they will also soon be able to earn professional certificates from trusted partners directly on the platform. 
We launched the next-generation sales navigator this quarter, helping sellers increase win rates and deal sizes by better understanding and evaluating customer interest. Finally, LinkedIn Marketing Solution continues to provide leading innovation and ROI in B2B digital advertising. More broadly, with Microsoft Advertising, we offer a trusted platform for any marketeer or advertiser looking to innovate." -"We've expanded our geographies we serve by nearly 4x over the past year. We are seeing record daily usage of Edge, Start, and Bing driven by Windows. Edge is the fastest-growing browser on Windows and continues to gain share as people use built-in coupon price comparison features to save money. We surface more than $2 billion in savings to date." -"And this quarter, we brought our shopping tools to 15 new markets. Users of our Start, personalized content feed are consuming 2x more content compared to a year ago. And we're also expanding our third-party ad inventory. Netflix will launch its first ad-supported subscription plan next month, exclusively powered by our technology and sales." -"And with PromoteIQ we offer an omnichannel media platform for retailers like the auto group looking to generate additional revenue while maintaining ownership of their own data and customer relationships. Now onto gaming. We are adding new gamers to our ecosystem as we execute on our ambition to reach players wherever and whenever they want on any device. We saw usage growth across all platforms driven by the strength of console." -"PC Game Pass subscriptions increased 159% year over year. And with cloud gaming, we're transforming how games are distributed, played and viewed. More than 20 million people have used the service to stream games to date. And we are adding support for new devices like handhelds from Logitech and Razor as well as Meta Quest." -"And as we look toward the holidays, we offer the best value in gaming with Game Pass and Xbox Series S, nearly half of the Series S buyers are new to our ecosystem. In closing, in a world facing increasing headwinds, digital technology is the ultimate tailwind. And we're innovating across the entire tech stack to help every organization, while also focusing intensely on our operational excellence and execution discipline. With that, I'll hand it over to Amy." -"Thank you, Satya, and good afternoon, everyone. Our first quarter revenue was $50.1 billion, up 11% and 16% in constant currency. Earnings per share was $2.35, increased 4% and 11% in constant currency when adjusted for the net tax benefit for the first quarter of fiscal year 2022. Driven by strong execution in a dynamic environment, we delivered a solid start to our fiscal year, in line with our expectations even as we saw many of the macro trends from the end of the fourth quarter continued to weaken through Q1." -"In our consumer business, PC market demand further deteriorated in September, which impacted our Windows OEM and Surface businesses. And reductions in customer advertising spend, which also weakened later in the quarter, impacted search and news advertising and LinkedIn Marketing Solutions. As you heard from Satya, in our commercial business, we saw strong overall demand for our Microsoft cloud offerings with a growth of 31% in constant currency as well as share gains across many businesses. Commercial bookings declined 3% and increased 16% in constant currency on a flat expiry base." 
-"Excluding the FX impact, growth was driven by strong renewal execution, and we continue to see growth in the number of large long-term Azure and Microsoft 365 contracts across all deal sizes. More than half of the $10 million plus Microsoft 365 bookings came from E5. Commercial remaining performance obligation increased 31% and 34% in constant currency to $180 billion. Roughly 45% will be recognized in revenue in the next 12 months, up 23% year over year." -"The remaining portion, which we recognized beyond the next 12 months, increased 38% year over year, and our annuity mix increased one point year over year to 96%. FX impacted company results in line with expectations. With the stronger US dollar, FX decreased total company revenue by five points, and at the segment level, FX decreased productivity and business processes and intelligent cloud revenue growth by six points and more personal computing revenue growth by three points. Additionally, FX decreased COGS and operating expense growth by three points." -"Microsoft Cloud gross margin percentage increased roughly two points year over year to 73%. Excluding the impact of the change in accounting estimate for useful lives, Microsoft cloud gross margin percentage decreased roughly one point driven by sales mix shift to Azure and lower Azure margin, primarily due to higher energy costs. Company gross margin dollars increased 9% and 16% in constant currency, and gross margin percentage decreased slightly year over year to 69%, excluding the impact of the latest change in accounting estimate, gross margin percentage decreased roughly three points, driven by sales mix shift to cloud, the lower Azure margin noted earlier and Nuance. Operating expense increased 15% and 18% in constant currency, driven by investments in cloud engineering, LinkedIn, Nuance, and commercial sales." -"At a total company level, headcount grew 22% year over year, as we continue to invest in key areas just mentioned, as well as customer deployment. Headcount growth included roughly six points from the Nuance and Xandr acquisitions, which closed last Q3 and Q4, respectively. Operating income increased 6% and 15% in constant currency, and operating margins decreased roughly two points year over year to 43%. Excluding the impact of the change in accounting estimate, operating margins declined roughly four points year over year driven by sales mix shift to cloud, unfavorable FX impact, Nuance, and the lower Azure margin noted earlier." -"Now to our segment results. Revenue from productivity and business processes was $16.5 billion and grew 9% and 15% in constant currency, ahead of expectations, with better-than-expected results in Office commercial and LinkedIn. Office commercial revenue grew 7% and 13% in constant currency. Office 365 commercial revenue increased 11% and 17% in constant currency, slightly better than expected, with the strong renewal execution noted earlier." -"Growth was driven by installed base expansion across all workloads and customer segments, as well as higher ARPU from E5. Demand for security, compliance, and voice value in Microsoft 365 drove strong E5 momentum again this quarter. Paid Office 365 commercial seats grew 14% year over year, driven by our small and medium business and frontline worker offerings, although we saw a continued impact of new deal moderation outside of E5. Office consumer revenue grew 7% and 11% in constant currency, driven by continued momentum in Microsoft 365 subscriptions, which grew 13% to $61.3 million." 
-"Dynamics revenue grew 15% and 22% in constant currency, driven by Dynamics 365, which grew 24% and 32% in constant currency. LinkedIn revenue increased 17% and 21% in constant currency, ahead of expectations, driven by better-than-expected growth in talent solutions, partially offset by weakness in marketing solutions from the advertising trends noted earlier. Segment gross margin dollars increased 11% and 18% in constant currency, and gross margin percentage increased roughly 1 point year over year. Excluding the impact of the latest change in accounting estimate, gross margin percentage decreased slightly, driven by sales mix shift to cloud offerings." -"Operating expense increased 13% and 16% in constant currency, and operating income increased 10% and 19% in constant currency, including four points due to the latest change in accounting estimate. Next, the Intelligent cloud segment. Revenue was $20.3 billion, increasing 20% and 26% in constant currency, in line with expectations. Overall, server products and cloud services revenue increased 22% and 28% in constant currency." -"Azure and other cloud services revenue grew 35% and 42% in constant currency, about one point lower than expected, driven by the continued moderation in Azure consumption growth, as we help customers optimize current workloads while they prioritize new workloads. In our per-user business, the enterprise mobility and security installed base grew 18% to over 232 million seats, with continued impact from the new deal moderation noted earlier. In our on-premises server business, revenue was flat, and increased 4% in constant currency, slightly ahead of expectations, driven by hybrid demand, including better-than-expected annuity purchasing ahead of the SQL Server 2022 launch. Enterprise services revenue grew 5% and 10% in constant currency, driven by enterprise support services." -"Segment gross margin dollars increased 20% and 26% in constant currency, and gross margin percentage decreased slightly. Excluding the impact of the latest change in accounting estimate, gross margin percentage declined roughly three points, driven by sales mix shift to Azure and higher energy costs impacting Azure margins. Operating expenses increased 25% and 28% in constant currency, including roughly eight points of impact from Nuance. And operating income grew 17% and 25% in constant currency with roughly nine points of favorable impact from the latest change in accounting estimate." -"Now to more personal computing. Revenue decreased slightly year over year to $13.3 billion and grew 3% in constant currency, in line with expectations overall, but with OEM and Surface weakness offset by upside in gaming consoles. Windows OEM revenue decreased 15% year over year. Excluding the impact from the Windows 11 deferral last year, revenue declined 20%, driven by PC market demand deterioration noted earlier." -"Devices revenue grew 2% and 8% in constant currency, in line with expectations, driven by the impact of a large Hollands deal, partially offset by low double-digit declines in consumer Surface sales. Windows commercial products and cloud services revenue grew 8% and 15% in constant currency, in line with expectations, driven by demand for Microsoft 365 E5 noted earlier. Search and news advertising revenue, ex TAC, increased 16% and 21% in constant currency, in line with expectations, benefiting from an increase in search volumes and roughly five points of impact from Xandr even as we saw increased ad market headwinds during September. 
Edge browser gained share again this quarter." -"And in gaming, revenue grew slightly and was up 4% in constant currency, ahead of expectations, driven by better-than-expected console sales. Xbox hardware revenue grew 13% and 19% in constant currency. Xbox content and services revenue declined 3% and increased 1% in constant currency, driven by declines in first-party content as well as in third-party content where we had lower engagement hours and higher monetization, partially offset by growth in Xbox Game Pass subscriptions. Segment gross margin dollars declined 9% and 4% in constant currency, and gross margin percentage decreased roughly five points year over year driven by sales mix shift to lower margin businesses." -"Operating expenses increased 2% and 5% in constant currency, driven by the Xandr acquisition. And operating income decreased 15% and 9% in constant currency. Now back to total company results. Capital expenditures, including finance leases were $6.6 billion, and cash paid for PP&E was $6.3 billion." -"Our data center investments continue to be based on strong customer demand and usage signals. Cash flow from operations was $23.2 billion, down 5% year over year, driven by strong cloud billings and collections, which were more than offset by a tax payment related to the transfer of intangible property completed in Q1 of FY '22. Free cash flow was $16.9 billion, down 10% year over year. Excluding the impact of this tax payment, cash flow from operations grew 2%, and free cash flow was relatively unchanged [indiscernible] year." -"This quarter, other income and expense was $54 million, driven by interest income, which was mostly offset by interest expense and net losses on foreign currency remeasurement. Our effective tax rate was approximately 19%. And finally, we returned $9.7 billion to shareholders through share repurchases and dividends. Now, moving to our Q2 outlook, which, unless specifically noted otherwise, is on a US dollar basis." -"My commentary, for both the full year and next quarter, does not include any impact from AC division, which we still expect to close by the end of the fiscal year. First, FX. With the stronger US dollar and based on current rates, we now expect FX to decrease total revenue growth by approximately five points and to decrease total COGS and operating expense growth by approximately three points. Within the segments, we anticipate roughly seven points of negative FX impact on revenue growth in productivity and business processes, six points in Intelligent cloud and three points in more personal computing." -"Our outlook has many of the trends we saw at the end of Q1, continue into Q2. In our consumer business, materially weaker PC demand from September will continue, and impact both Windows OEM and Surface device results even as the Windows installed base and usage grows, as you heard from Satya. Additionally, customers focusing our advertising spend will impact LinkedIn and Search and News advertising revenue. In our commercial business, demand for our differentiated hybrid and cloud offerings, together with consistent execution, should drive healthy growth across the Microsoft Cloud." -"In commercial bookings, continued strong execution across core annuity sales motions and commitments to our platform should drive solid growth on a moderately growing expiry base against a strong prior year comparable, which included a significant volume of large long-term Azure contracts. 
As a reminder, the growing mix of larger long-term Azure contracts which are more unpredictable in their timing, always drives increased quarterly volatility in our bookings growth rate. Microsoft Cloud gross margin percentage should be up roughly one point year over year, driven by the latest accounting estimate change noted earlier. Excluding that impact, Q2 gross margin percentage will decrease roughly two points driven by lower Azure margin, primarily due to higher energy cost, revenue mix shift to Azure and the impact from Nuance." -"In capital expenditures, we expect a sequential increase on a dollar basis with normal quarterly spend variability in the timing of our cloud infrastructure build-out. Next, to segment guidance. In productivity and business processes, we expect revenue to grow between 11% and 13% in constant currency or USD16.6 billion to USD 16.9 billion. In Office Commercial, revenue growth will again be driven by Office 365, with seat growth across customer segments and ARPU growth from E5." -"We expect Office 365 revenue growth to be similar to last quarter on a constant currency basis. In our on-premises business, we expect revenue to decline in the low to mid-30s. In Office Consumer, we expect revenue to decline low to mid-single digits as Microsoft 365 subscription growth will be more than offset by unfavorable FX impact. For LinkedIn, we expect continued strong engagement on the platform, although results will be impacted by a slowdown in advertising spend and hiring, resulting in mid to high single-digit revenue growth or low to mid-teens growth in constant currency." -"And in Dynamics, we expect revenue growth in the low double digits or the low 20s in constant currency, driven by continued share gains in Dynamics 365. For intelligent cloud, we expect revenue to grow between 22% and 24% in constant currency or US$21.25 billion to US$21.55 billion. Revenue will continue to be driven by Azure, which, as a reminder, can have quarterly variability primarily from our per-user business and from in-period recognition depending on the mix of contracts. We expect Azure revenue growth to be sequentially lower by roughly 5 points on a constant currency basis." -"Azure revenue will continue to be driven by strong growth in consumption, with some impact from the Q1 trends noted earlier. And our per user business should continue to benefit from Microsoft 365 suite momentum, though we expect moderation in growth rate given the size of the installed base. In our on-premise server business, we expect revenue to decline low single digits, as demand for our hybrid solutions, including strong annuity purchasing from the SQL Server 2022 launch, will be more than offset by unfavorable FX impact. And in enterprise services, we expect revenue growth to be in the low single digits, driven by enterprise support." -"In More Personal Computing, we expect revenue of US$14.5 billion to US$14.9 billion. In Windows OEM, we expect revenue to decline in the high 30s. Excluding the impact from the Windows 11 revenue deferral last year, revenue would decline mid-30s, reflecting both PC market demand and a strong prior year comparable, particularly in the commercial segment. In devices, revenue should decline approximately 30%, again, roughly in line with the PC market." -"In Windows commercial products and cloud services, customer demand for Microsoft 365 and our advanced security solutions should drive growth in the mid-single digits or low double digits in constant currency. 
Search and news advertising, ex TAC, should grow in the low to mid-teens, roughly six points faster than overall search and news advertising revenue, driven by growing first-party revenue and the inclusion of Xandr. And in gaming, we expect revenue to decline in the low to mid-teens against a strong prior year comparable. That included several first-party title launches, partially offset by growth in Xbox Game Pass subscribers." -"We expect Xbox content and services revenue to decline in the low to mid-teens. Now back to company guidance. We expect COGS to grow between 6% and 7% in constant currency or to be between US$17.4 billion and US$17.6 billion, and operating expense to grow between 17% and 18% in constant currency, or to be between US$14.3 billion and US$14.4 billion. As we continue to focus our investment in key growth areas, total headcount growth sequentially should be minimal." -"Other income and expense should be roughly $100 million, as interest income is expected to more than offset interest expense. Further FX and equity movements through Q2 are not reflected in this number. And as a reminder, we are required to recognize mark-to-market gains or losses on our equity portfolio, which can increase quarterly volatility. And we expect our Q2 effective tax rate to be between 19% and 20%." -"And finally, as a reminder, for Q2 cash flow, we expect to make a $2.4 billion cash tax payment related to the capitalization of R&D provision enacted in 2017 TCJA and effective as of July 1, 2022. Now some thoughts on the full fiscal year. First, FX. Based on current rates, we now expect a roughly five-point headwind to full-year revenue growth." -"And FX should decrease COGS and operating expense growth by approximately three points. At the total company level, we continue to expect double-digit revenue and operating income growth on a constant currency basis. Revenue will be driven by around 20% constant currency growth in our commercial business, driven by strong demand for our Microsoft cloud offerings. That growth will be partially offset by the increased declines we now see in the PC market." -"With the high margins in our Windows OEM business and the cyclical nature of the PC market, we take a long-term approach to investing in our core strategic growth areas and maintain these investment levels regardless of PC market conditions. Therefore, with our first quarter results and lower expected OEM revenue for the remainder of the year as well as over $800 million of greater-than-expected energy cost, we now expect operating margins in US dollars to be down roughly a point year over year. On a constant currency basis, excluding the incremental impact of the lower Windows OEM revenue and the favorable impact of the latest accounting change, we continue to expect FY 2023 operating margins to be roughly flat year over year. In closing, in this environment, it is more critical than ever to continue to invest in our strategic growth markets such as cloud, security, Teams, Dynamics 365, and LinkedIn where we have opportunities to continue to gain share as we provide problem-solving innovations to our customers." -"And while we continue to help our customers do more with less, we will do the same internally. And you should expect to see our operating expense growth moderate materially through the year. While we focus on growing productivity of the significant headcount investments we've made over the last year. With that, let's go to Q&A." 
diff --git a/tutorials/notebooks/GenAI/notebooks/AzureAIStudio_index_structured_notebook.ipynb b/tutorials/notebooks/GenAI/notebooks/AzureAIStudio_index_structured_notebook.ipynb deleted file mode 100644 index 6f56588..0000000 --- a/tutorials/notebooks/GenAI/notebooks/AzureAIStudio_index_structured_notebook.ipynb +++ /dev/null @@ -1,857 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "403555d2-4703-4bec-94ec-9d17ec656d62", - "metadata": {}, - "source": [ - "# Indexing Delimited Files on Azure AI Search" - ] - }, - { - "cell_type": "markdown", - "id": "e421b4fd", - "metadata": {}, - "source": [ - "## Overview\n", - "LLMs work best when querying vector databases (DBs). In a few of our tutorials in this repo, we have created vector DBs from unstructured data like PDF documents. Here, we create a vector DB from structured data, which is technically complex and requires additional steps. Here we will vectorize (embed) a csv file, index our DB using Azure AI Search, and then query our vector DB using a GPT model deployed within Azure AI Studio." - ] - }, - { - "cell_type": "markdown", - "id": "f3fe0439", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "We assume you have access to Azure AI Studio and Azure AI Search Service and have already deployed an LLM." - ] - }, - { - "cell_type": "markdown", - "id": "6c5376ef", - "metadata": {}, - "source": [ - "## Learning objectives\n", - "\n", - "This tutorial will cover the following topics:\n", - "+ Introduce embeddings from structured data\n", - "+ Create Azure AI Search index from command line\n", - "+ Query Azure AI Search index from command line using LLMs\n" - ] - }, - { - "cell_type": "markdown", - "id": "6fd8094c", - "metadata": {}, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "id": "7efcb4f8-f3a1-4ea8-b826-012ceb733f4a", - "metadata": {}, - "source": [ - "### Install packages" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "adb68924-26c9-4e6e-9880-2ea371c8d188", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "pip install -U \"langchain\" \"openai\" \"langchain-openai\" \"langchain-community\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5292c962-637e-4169-ad77-2e62da44596f", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "pip install azure-search-documents --pre --upgrade" - ] - }, - { - "cell_type": "markdown", - "id": "687d3dea-ea4c-4eea-8e5a-cff78c0340de", - "metadata": {}, - "source": [ - "### Import CSV data" - ] - }, - { - "cell_type": "markdown", - "id": "d4687557-c31a-4b6c-a17a-8628e46ad7a3", - "metadata": {}, - "source": [ - "For this tutorial we are using a Kaggle dataset about data scientist salaries from 2023. This dataset can be downloaded from [here](https://www.kaggle.com/datasets/henryshan/2023-data-scientists-salary)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ed34bedc-6339-4991-96b3-e9a2ac62b1a4", - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np \n", - "# reading the csv file using read_csv\n", - "# storing the data frame in variable called df\n", - "df = pd.read_csv('ds_salaries.csv')\n", - " \n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "95c03a9b-9719-45e3-9bca-31fd310eb916", - "metadata": {}, - "source": [ - "Add an ID to each row of your data this will be the key in our Index. 
If you choose to use your own data make sure to clean up any trailing whitespaces or punctuation. Your headers should not have any spaces between the words." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a41c5439-4b14-41a3-aaf4-c6f704c9f232", - "metadata": {}, - "outputs": [], - "source": [ - "df['ID'] = np.arange(df.shape[0]).astype(str)\n", - "\n", - "#making the entire dataset into strings\n", - "df= df.astype(str)\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "db997a43-4195-487f-b023-795d361db993", - "metadata": {}, - "source": [ - "#### Optional: Adding embeddings to our data" - ] - }, - { - "cell_type": "markdown", - "id": "84a03ce7-87a4-4e94-acd1-cd01fa3cfa81", - "metadata": {}, - "source": [ - "If you want to add embeddings to your data you can run the code below! Embeddings will help our vector store (Azure AI Search) to retrieve relevant information based on the query or question you have supplied the model. Here we use the embedding **text-embedding-ada-002** to convert our data into numerical values which represents how similar each word is to another in your data. Embedding are usually used for dense data so if you have any columns in your dataset that contains sentences of text its recommended to add embeddings. Although the dataset we are using doesn't have that for this example we will be adding embeddings for the `job_title` column and add them to a new column called and `job_title_vector`.\n", - "\n", - "**If you don't want to add embeddings you can skip this code cell and run the [next one](#csv2json).**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "87181df6-c8d5-4498-ac6b-faa86631125e", - "metadata": {}, - "outputs": [], - "source": [ - "os.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"\"\n", - "os.environ[\"AZURE_OPENAI_KEY\"] = \"\"\n", - "\n", - "#create embeddings functions to apply to a given column\n", - "from openai import AzureOpenAI\n", - " \n", - "client = AzureOpenAI(\n", - " api_key=os.getenv(\"AZURE_OPENAI_KEY\"), \n", - " api_version=\"2023-05-15\",\n", - " azure_endpoint = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n", - " )\n", - "\n", - "def generate_embeddings(text, model=\"text-embedding-ada-002\"):\n", - " return client.embeddings.create(input = [text], model=model).data[0].embedding\n", - "\n", - "#adding embeddings for job title to get more accurate search results\n", - "df['job_title_vector'] = df['job_title'].apply(lambda x : generate_embeddings (x)) # model should be set to the deployment name you chose when you deployed the text-embedding-ada-002 (Version 2) model" - ] - }, - { - "cell_type": "markdown", - "id": "c0b1da1e-4eda-47c3-a8d4-d04df80b0476", - "metadata": {}, - "source": [ - " Now we will convert our dataframe into JSON format. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fd4c4453-d958-49ad-9b72-34a829ac0437", - "metadata": {}, - "outputs": [], - "source": [ - "df_json = df.to_json(orient=\"records\")" - ] - }, - { - "cell_type": "markdown", - "id": "a532eacf-ff14-44fc-b703-844f83c846bd", - "metadata": {}, - "source": [ - "### Connect to our Azure Open AI Models" - ] - }, - { - "cell_type": "markdown", - "id": "6bfd4382-8809-4bb8-8940-68d81f8fba44", - "metadata": {}, - "source": [ - "Here we are setting the keys and endpoint to our OpenAI models as environmental variables which will help us connect to our LLM model which in this case is **gpt-4**." 
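One caveat about the optional embeddings cell above: it calls `os.environ` and `os.getenv` before `os` has been imported anywhere in this notebook, so it will fail in a fresh kernel. Below is a minimal, self-contained sketch of that setup (the endpoint, key, and deployment name are placeholders to replace with your own values); the cells that follow set the same variables again before creating the chat client.

```python
import os
from openai import AzureOpenAI

# Placeholders -- fill in your own Azure OpenAI endpoint and key before running
os.environ["AZURE_OPENAI_ENDPOINT"] = ""
os.environ["AZURE_OPENAI_KEY"] = ""

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2023-05-15",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

def generate_embeddings(text, model="text-embedding-ada-002"):
    # `model` must be the deployment name you gave your text-embedding-ada-002 deployment
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Example: embed one column of the dataframe
# df["job_title_vector"] = df["job_title"].apply(generate_embeddings)
```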
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0377877c-73f4-4508-9a4f-9ddcd14bea96", - "metadata": {}, - "outputs": [], - "source": [ - "os.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"\"\n", - "os.environ[\"AZURE_OPENAI_KEY\"] = \"\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b1c91f3a-5a72-463c-b869-a970ae1139df", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from openai import AzureOpenAI\n", - " \n", - "client = AzureOpenAI(\n", - " api_key=os.getenv(\"AZURE_OPENAI_KEY\"), \n", - " api_version=\"2023-05-15\",\n", - " azure_endpoint = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n", - " )" - ] - }, - { - "cell_type": "markdown", - "id": "daa799c1-f587-4c9b-b1bc-7bc37415ca12", - "metadata": {}, - "source": [ - "### Create Azure AI Search Service" - ] - }, - { - "cell_type": "markdown", - "id": "da8b68e1-2488-41d1-be51-436cb47e5c86", - "metadata": {}, - "source": [ - "Enter in the name you would like for your AI Search service and index along with the name of your resource group and the location you would like your index to be held in." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "711cbd23-73ab-4ce0-b499-0f254549a7c3", - "metadata": {}, - "outputs": [], - "source": [ - "service_name=''\n", - "index_name = ''\n", - "location = 'eastus2'\n", - "resource_group = ''" - ] - }, - { - "cell_type": "markdown", - "id": "53d9a98f-4772-48ea-acef-0f1e947685e4", - "metadata": {}, - "source": [ - "Authenticate to use Azure cli, follow the outputs instructions." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "63ab007c-5c88-4206-aa51-318fc3e82292", - "metadata": {}, - "outputs": [], - "source": [ - "! az login" - ] - }, - { - "cell_type": "markdown", - "id": "19a076aa-d2c1-4709-aec2-9532e6457e48", - "metadata": {}, - "source": [ - "Create your Azure AI Search service." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bf4f5c0e-3fb6-4f79-aaf4-a8f9920af590", - "metadata": {}, - "outputs": [], - "source": [ - "! az search service create --name {service_name} --sku free --location {location} --resource-group {resource_group} --partition-count 1 --replica-count 1" - ] - }, - { - "cell_type": "markdown", - "id": "94fd109a-2729-47b3-9327-811e2ce598b9", - "metadata": {}, - "source": [ - "Save the key to a JSON file and then we will save the value to our **search_key** variable." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cd6b4dee-0ea2-434e-b98f-55d22e871f36", - "metadata": {}, - "outputs": [], - "source": [ - "! az search admin-key show --resource-group {resource_group} --service-name {service_name} > keys.json" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b0c98d31-ebad-4e66-adef-abcbe222f317", - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "with open('keys.json', mode='r') as f:\n", - " data = json.load(f)\n", - "search_key = data[\"primaryKey\"]" - ] - }, - { - "cell_type": "markdown", - "id": "c4cc7e03-d080-4486-9e0d-30167fec715a", - "metadata": {}, - "source": [ - "### Create Azure AI Index" - ] - }, - { - "cell_type": "markdown", - "id": "dc9f88db-a05a-499a-891b-be32ec44cd97", - "metadata": {}, - "source": [ - "Import the necessary tools to create our index and the fields this will be our **vector store**." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "094c737d-d941-4681-ac78-e499f9b96805", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "from azure.search.documents import SearchClient\n", - "from azure.core.credentials import AzureKeyCredential\n", - "from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient\n", - "from azure.search.documents.indexes.models import (\n", - " SimpleField,\n", - " SearchField,\n", - " SearchableField,\n", - " SearchFieldDataType,\n", - " SearchIndexerDataContainer,\n", - " SearchIndexerDataSourceConnection,\n", - " SearchIndex,\n", - " SearchIndexer,\n", - " TextWeights,\n", - " VectorSearch,\n", - " VectorSearchProfile,\n", - " HnswAlgorithmConfiguration,\n", - " ComplexField\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "5e836161-1910-411a-8dfa-7820c0fc9bd0", - "metadata": {}, - "source": [ - "Create your index client to pass on information about our index too." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "80e780b6-a045-4779-876a-2b36e4dc5225", - "metadata": {}, - "outputs": [], - "source": [ - "endpoint = \"https://{}.search.windows.net/\".format(service_name)\n", - "index_client = SearchIndexClient(endpoint, AzureKeyCredential(index_key))" - ] - }, - { - "cell_type": "markdown", - "id": "defec19a-0a82-413d-9a6b-13b9db5873b7", - "metadata": {}, - "source": [ - "Next you will add in the field names to the index which are based on the names of your columns. Notice that the **Key** is our 'ID' column and it is a string also that even columns that hold integers will also be strings because we want to be able to search and retrieve data from our index which can only be done so if our data is in string format.\n", - "\n", - "If you **added embeddings** to your data skip to the next section [Adding Embeddings to Vector Store](#Embeddings-to-Vector-Store)." 
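A quick note on the index client cell above: it passes `index_key` to `AzureKeyCredential`, but the admin key read from `keys.json` was stored in the variable `search_key`, so the cell will raise a `NameError` as written. A corrected sketch, reusing the variables defined earlier in this notebook:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient

# `service_name` and `search_key` come from the cells above (search_key was read from keys.json)
endpoint = "https://{}.search.windows.net/".format(service_name)
index_client = SearchIndexClient(endpoint, AzureKeyCredential(search_key))
```

The same substitution applies to the `SearchClient` constructed later in the upload section.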
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a949a7ec-6a28-43ed-b16a-0245836a187f", - "metadata": {}, - "outputs": [], - "source": [ - "fields = [\n", - " SimpleField(\n", - " name=\"ID\",\n", - " type=SearchFieldDataType.String,\n", - " key=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"work_year\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"experience_level\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ), \n", - " SearchableField(\n", - " name=\"employment_type\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"job_title\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"salary\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"salary_currency\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"salary_in_usd\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"employee_residence\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"remote_ratio\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"company_location\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"company_size\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " )\n", - "]\n", - " \n", - "#set our index values\n", - "index = SearchIndex(name=index_name, fields=fields)\n", - "#create our index\n", - "index_client.create_index(index)\n" - ] - }, - { - "cell_type": "markdown", - "id": "a7c113a3-3b95-4359-b307-73dda33d5394", - "metadata": {}, - "source": [ - "

<a id='Embeddings-to-Vector-Store'></a>\n#### Optional: Adding Embeddings to Vector Store
" - ] - }, - { - "cell_type": "markdown", - "id": "361e905c-a945-427f-ac03-741ff6bb54be", - "metadata": {}, - "source": [ - "If you are working with embeddings you need to add a **SearchField** that holds a collection which is your array of numerical values. The name of the column is the same as our dataset **job_title_vector**. We also need to set a **vector profile** which dictates what algorithm we will have our vector store use to find text that are similar to each other (find the nearest neighbors) for this profile we will be using the **Hierarchical Navigable Small World (HNSW) algorithm**, we have named our profile **vector_search**." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8107360d-a4f2-4b55-9fe0-8cf8ea3c618a", - "metadata": {}, - "outputs": [], - "source": [ - "fields = [\n", - " fields = [\n", - " SimpleField(\n", - " name=\"ID\",\n", - " type=SearchFieldDataType.String,\n", - " key=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"work_year\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"experience_level\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ), \n", - " SearchableField(\n", - " name=\"employment_type\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"job_title\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchField(\n", - " name=\"job_title_vector\",\n", - " type=SearchFieldDataType.Collection(SearchFieldDataType.Single),\n", - " searchable=True,\n", - " vector_search_dimensions=len(generate_embeddings(\"Text\")),\n", - " vector_search_profile_name=\"my-vector-config\"\n", - " ),\n", - " SearchableField(\n", - " name=\"salary\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"salary_currency\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"salary_in_usd\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"employee_residence\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"remote_ratio\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"company_location\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"company_size\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " )\n", - "]\n", - "\n", - "vector_search = VectorSearch(\n", - " profiles=[VectorSearchProfile(name=\"my-vector-config\", algorithm_configuration_name=\"my-algorithms-config\")],\n", - " algorithms=[HnswAlgorithmConfiguration(name=\"my-algorithms-config\")],\n", - ")\n", - " \n", - "#set our index values\n", - "index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)\n", - "#create our index\n", - "index_client.create_index(index)\n" - ] - }, - { - "cell_type": "markdown", - "id": "07ed728d-2af3-42d5-843c-e94266ce66d1", - "metadata": {}, - "source": [ - "### Upload Data to our Index" - ] - }, - { - "cell_type": "markdown", - "id": "b65aece2-ad33-4600-b862-a1eedd2cf50a", - "metadata": {}, - "source": [ - "Here we are creating a **search client** that will allow us to upload our data to our index and query 
our index." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4a3273fd-748e-41ee-ad42-6af31db93855", - "metadata": {}, - "outputs": [], - "source": [ - "from azure.search.documents import SearchClient\n", - "search_client = SearchClient(endpoint, index_name, AzureKeyCredential(index_key))" - ] - }, - { - "cell_type": "markdown", - "id": "e2dd96f4-439e-4c74-877e-6da7b10331ff", - "metadata": {}, - "source": [ - "Next we will convert our dataset into a JSON object because even though it is in JSON format its still labeled as a Python object. After that we will upload each row of our data, or in this case, since we are now dealing with JSON, each group as a separate document. This process is essentially **chunking** our data to help our index easily query our data and only retrieve the groups that hold similar text to the our query. This also minimizes hallucinations." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d151e32a-ad57-4cf4-9928-8f8933e98959", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "import json\n", - " \n", - "# Convert JSON data to a Python object\n", - "data = json.loads(df_json)\n", - "\n", - "# Iterate through the JSON array\n", - "for item in data:\n", - " result = search_client.upload_documents(documents=[item])\n", - "\n", - "print(\"Upload of new document succeeded: {}\".format(result[0].succeeded))" - ] - }, - { - "cell_type": "markdown", - "id": "fc10ce7e-e9da-443e-9cd1-7cad4faa740b", - "metadata": {}, - "source": [ - "### Interacting with our Model" - ] - }, - { - "cell_type": "markdown", - "id": "b9630cd3-98ab-456f-8e3e-095ce6982143", - "metadata": {}, - "source": [ - "First, we will write our query. You can run any of the ones below or make your own. That query will be passed to our index which will then give us results of documents that held similar text to our query." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7f13c6e1-c719-4fae-a097-f6d7f5081caa", - "metadata": {}, - "outputs": [], - "source": [ - "query = \"Please count how many ML Engineers are there.\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4dcf66f5-a0e1-4b9b-9b55-532b9ead433a", - "metadata": {}, - "outputs": [], - "source": [ - "query = \"Please list the unique job titles.\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "34af7401-4c24-4e5f-888a-990e72e82afc", - "metadata": {}, - "outputs": [], - "source": [ - "query = \"Please count how many employees worked in 2020.\"" - ] - }, - { - "cell_type": "markdown", - "id": "011eedb0-663e-4e4b-9304-f2a810ad5992", - "metadata": {}, - "source": [ - "Here is where we will input our query and then fix the formatting of the results in a way that our model can understand. This will mean first gathering our results in a list, removing any unncessary keys to lessen the token count, converting that list into JSON format so that it is also a string, and then adding quotes around spaces for the model to better decipher our query results." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8e34ae18-89f3-40f6-96c6-25b9b1ec5c09", - "metadata": {}, - "outputs": [], - "source": [ - "#gathering our query results\n", - "search_results = list(search_client.search(query))\n", - "\n", - "#removing any removing any unncessary keys to lessen the token count (some of these are provided by the vector store)\n", - "#job_title_vector is for users that included embeddings to their data\n", - "remove_keys = ['job_title_vector', '@search.reranker_score', '@search.highlights', '@search.captions', '@search.score']\n", - "for l in search_results:\n", - " for i in remove_keys:\n", - " l.pop(i, None)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e35dfd1c-06ed-47cf-b6d7-bb6368ff203b", - "metadata": {}, - "outputs": [], - "source": [ - "#converting that list into JSON format\n", - "search_results = json.dumps(search_results)\n", - "#adding quotes around spaces\n", - "context=' '.join('\"{}\"'.format(word) for word in search_results.split(' '))" - ] - }, - { - "cell_type": "markdown", - "id": "e4a6d875-4ce3-4a0d-a803-d65eebb17821", - "metadata": {}, - "source": [ - "We will then pass our context and query to our model via a message." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "69ff5e34-c60f-4e99-a3f3-29950c6e617b", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "response = client.chat.completions.create(\n", - " model=\"gpt-4\",\n", - " messages=[\n", - " {\"role\": \"system\", \"content\": \"You are a helpful assistant who answers only from the given Context and answers the question from the given Query. If you are asked to count then you must count all of the occurances mentioned.\"},\n", - " {\"role\": \"user\", \"content\": \"Context: \"+ context + \"\\n\\n Query: \" + query}\n", - " ],\n", - " #max_tokens=100,\n", - " temperature=1,\n", - " top_p=1,\n", - " n=1\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "2cf1549d-fea7-4a37-b94c-269724c7c519", - "metadata": {}, - "source": [ - "Now we can see our results!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1541aa6e-e09c-4151-9938-c42c063e6dcf", - "metadata": {}, - "outputs": [], - "source": [ - "response.choices[0].message.content" - ] - }, - { - "cell_type": "markdown", - "id": "31c915c9", - "metadata": {}, - "source": [ - "## Conclusion\n", - "Here we created embeddings from structured data and fed these embeddings to our LLM. Key skills you learned were to : \n", - "+ Create embeddings and a vector store using Azure AI Search\n", - "+ Send prompts to the LLM grounded on your structured data" - ] - }, - { - "cell_type": "markdown", - "id": "0459e0ae-5183-4b6a-9eca-41c97b0b8a8c", - "metadata": {}, - "source": [ - "## Clean up" - ] - }, - { - "cell_type": "markdown", - "id": "edb500e2-c8cb-428c-85f7-d4886d89899d", - "metadata": {}, - "source": [ - "**Warning:** Dont forget to delete the resources we just made to avoid accruing additional costs, including shutting down your Azure ML compute, delete your AI search resource, and optionally delete your deployed models in AI Studio" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4d0e5690-1e15-4c97-9422-410c54a92f6f", - "metadata": {}, - "outputs": [], - "source": [ - "#delete search service this will also delete any indexes\n", - "! 
az search service delete --name {service_name} --resource-group {resource_group} -y" - ] - } - ], - "metadata": { - "kernel_info": { - "name": "python38-azureml" - }, - "kernelspec": { - "display_name": "Python 3.8 - AzureML", - "language": "python", - "name": "python38-azureml" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - }, - "microsoft": { - "ms_spell_check": { - "ms_spell_check_language": "en" - } - }, - "nteract": { - "version": "nteract-front-end@1.0.0" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/tutorials/notebooks/GenAI/notebooks/AzureAIStudio_index_structured_with_console.ipynb b/tutorials/notebooks/GenAI/notebooks/AzureAIStudio_index_structured_with_console.ipynb deleted file mode 100644 index 5ad4ee9..0000000 --- a/tutorials/notebooks/GenAI/notebooks/AzureAIStudio_index_structured_with_console.ipynb +++ /dev/null @@ -1,413 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "# Indexing Delimited Files on Azure AI Search using Console and Notebook" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview\n", - "LLMs work best when querying vector databases (DBs). In a few of our tutorials in this repo, we have created vector DBs from unstructured data like PDF documents. Here, we create a vector DB from structured data, which is technically complex and requires additional steps. Here we will vectorize (embed) a csv file, index our DB using Azure AI Search, and then query our vector DB using a GPT model deployed within Azure AI Studio.\n", - "\n", - "This notebook differs slightly from the tutorial titled `AzureAIStudio_index_structured_notebook.ipynb` in that here we create the index within Azure AI Search directly, rather than in the notebook. We also use NIH grant data here rather than a Kaggle dataset. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "We assume you have access to both Azure AI Studio and Azure AI Search Service, and have already deployed an LLM." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Learning objectives\n", - "\n", - "This tutorial will cover the following topics:\n", - "+ Introduce embeddings from structured data\n", - "+ Create Azure AI Search index from the console\n", - "+ Query Azure AI Search index from command line using LLMs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Install packages" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "Use Python3 (ipykernel) kernel" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1707424158923 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "pip install langchain openai" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "Import libraries" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1707412445314 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "import os\n", - "import pandas as pd\n", - "from openai import AzureOpenAI\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Connect to an index\n", - "This is the index you created via [these instructions](https://github.com/STRIDES/NIHCloudLabAzure/blob/main/docs/create_index_from_csv.md).\n", - "Look [here](https://learn.microsoft.com/en-us/azure/search/search-create-service-portal#name-the-service) for your endpoint name, and [here](https://learn.microsoft.com/en-us/azure/search/search-security-api-keys?tabs=portal-use%2Cportal-find%2Cportal-query#find-existing-keys) for your index key." 
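One dependency note before connecting: the install cell above only pulls in `langchain` and `openai`, while the connection cell that follows imports `azure.search.documents`, and later cells use LangChain's Azure OpenAI wrapper and document loaders. A sketch of the extra installs this likely requires (exact package needs may vary with your LangChain version):

```python
pip install --upgrade azure-search-documents langchain-openai langchain-community
```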
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1707412411658 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "endpoint=\"\"\n", - "index_name=\"\"\n", - "index_key=''" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "#connect to vector store \n", - "from azure.search.documents import SearchClient\n", - "from azure.core.credentials import AzureKeyCredential\n", - "\n", - "search_client = SearchClient(endpoint, index_name, AzureKeyCredential(index_key))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Connect to your model\n", - "First, make sure you have a [model deployed](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-openai), and if not, deploy a model.\n", - "To get your endpoint, key, and version number, just go to the Chat Playground and click **View Code** at the top." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1707412412208 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "#connect to model\n", - "os.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"\"\n", - "os.environ[\"AZURE_OPENAI_API_KEY\"] = \"\",\n", - " azure_endpoint = \"\",\n", - " openai_api_key=\"\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "Run the prompt against the LLM" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1699909895704 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "message = HumanMessage(\n", - " content=\"Translate this sentence from English to French: Why all the hype about Generative AI?\"\n", - ")\n", - "llm([message])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Use the LLM to summarize a scientific document" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "Now let's load in a scientific document to run a query against. Read more about document loaders from langchain [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf). Note that we are both loading, and splitting our document. You can read more about the default document chunking/splitting procedures [here](https://python.langchain.com/docs/modules/data_connection/document_transformers/#get-started-with-text-splitters)." 
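Note that the "connect to model" cell earlier in this notebook appears garbled: it sets two environment variables and then trails off into orphaned keyword arguments, and `llm`, `HumanMessage`, `WebBaseLoader`, and `load_summarize_chain` are used without being imported or defined. The sketch below shows one plausible reconstruction using LangChain's `AzureChatOpenAI` wrapper; the deployment name and API version are assumptions to replace with your own, and the summarize cell further down presumably means to receive this `llm` rather than the undefined `model`.

```python
import os
from langchain_openai import AzureChatOpenAI
from langchain.schema import HumanMessage
from langchain_community.document_loaders import WebBaseLoader
from langchain.chains.summarize import load_summarize_chain

# Placeholders -- substitute your own endpoint, key, and chat-model deployment name
os.environ["AZURE_OPENAI_ENDPOINT"] = ""
os.environ["AZURE_OPENAI_API_KEY"] = ""

llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_deployment="",  # the name you gave your GPT deployment
)

# Sanity check that the model responds before moving on to the loader cells below
print(llm([HumanMessage(content="Say hello in one sentence.")]))
```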
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1699909908495 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "loader = WebBaseLoader(\"https://pubmed.ncbi.nlm.nih.gov/37883540/\")\n", - "pages = loader.load_and_split()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "Define that we want to use [stuffing](https://python.langchain.com/docs/modules/chains/document/stuff) to summarize the document." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1699909918382 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "chain = load_summarize_chain(model, chain_type=\"stuff\")\n", - "\n", - "chain.run(pages)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "In this notebook you learned how to feed a PDF document directly to an LLM that you deployed in the Azure console and summarize the document." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Clean up\n", - "Make sure to shut down your Azure ML compute and if desired you can delete your deployed model on Azure AI Studio." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - } - ], - "metadata": { - "kernel_info": { - "name": "python310-sdkv2" - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - }, - "microsoft": { - "host": { - "AzureML": { - "notebookHasBeenCompleted": true - } - }, - "ms_spell_check": { - "ms_spell_check_language": "en" - } - }, - "nteract": { - "version": "nteract-front-end@1.0.0" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/tutorials/notebooks/GenAI/notebooks/AzureAIStudio_sql_chatbot.ipynb b/tutorials/notebooks/GenAI/notebooks/AzureAIStudio_sql_chatbot.ipynb deleted file mode 100644 index fb76a70..0000000 --- a/tutorials/notebooks/GenAI/notebooks/AzureAIStudio_sql_chatbot.ipynb +++ /dev/null @@ -1,678 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "66bdb4fe-7ae4-4b3f-8b61-0004d49baa91", - "metadata": {}, - "source": [ - "# Creating a chatbot for structured data using SQL" - ] - }, - { - "cell_type": "markdown", - "id": "4d7509ad", - "metadata": {}, - "source": [ - "## Overview\n", - "**Generative AI (GenAI)** is a groundbreaking technology that generates human-like texts, images, code, and other forms of content. Although this is all true the focus of many GenAI techniques or implementations have been on unstructured data such as PDF's, text docs, image files, websites, etc. where it is required to set a parameter called *top K*. Top K utilizes an algorithm to only retrieve the top scored pieces of content or docs that is relevant to the users ask. 
This limits the amount of data the model is presented putting a disadvantage for users that may want to gather information from structured data like CSV and JSON files because they typically want all the occurrences relevant data appears. \n", - "\n", - "An example would be if you had a table that lists different types of apples, where they originate, and their colors and you want a list of red apples that originate from the US the model would only give you partial amount of the data you need because it is limited to looking for the top relevant data which may be limited to only finding the top 4 or 20 names of apples (depending on how you have configured your model) instead of listing them all. \n", - "\n", - "The technique that is laid our in this tutorial utilizes **SQL databases** and asks the model to create a query based on the ask of the user. It will then submit that query to the database and present the user with the results. This will not only give us all the information we need but will also decrease the chances of hitting our token limit." - ] - }, - { - "cell_type": "markdown", - "id": "6c69574a-dc53-414c-9606-97c1f871f603", - "metadata": {}, - "source": [ - "## Prerequisites" - ] - }, - { - "cell_type": "markdown", - "id": "7497c624-f592-4061-8dbf-8a9e2baf7fb2", - "metadata": {}, - "source": [ - "We assume you have access to Azure AI Studio, Azure SQL Databases, and have already deployed an LLM. For this tutorial we used **gpt 3.5** and used the **Python 3.10** kernel within our Azure Jupyter notebook." - ] - }, - { - "cell_type": "markdown", - "id": "431e4421-0b41-4a12-9811-0d7a030cf0f9", - "metadata": {}, - "source": [ - "## Learning objectives" - ] - }, - { - "cell_type": "markdown", - "id": "8aee4c83-bb83-442b-a158-61962f43c80a", - "metadata": {}, - "source": [ - "In this tutorial you will learn:\n", - "- Setting up a Azure SQL database\n", - "- Creating a SQl table and query from it\n", - "- Creating a chatbot and utilizing langchains SQL agent to connect the bot to a database" - ] - }, - { - "cell_type": "markdown", - "id": "3d2aa60a-cf87-4083-80fa-e9dc9179dcc8", - "metadata": {}, - "source": [ - "## Table of Contents" - ] - }, - { - "cell_type": "markdown", - "id": "3bad1638-6fcd-4299-b714-48c7cfd865ff", - "metadata": {}, - "source": [ - "- [Summary](#summary)\n", - "- [Install Packages](#packages)\n", - "- [Create Azure SQL Database](#azure_db)\n", - "- [Create Azure SQL Table](#azure_table)\n", - "- [Submitting a Query](#query)\n", - "- [Setting up a Chatbot](#chatbot)\n", - "- [Conclusion](#conclusion)\n", - "- [Cleaning up Resources](#cleanup)" - ] - }, - { - "cell_type": "markdown", - "id": "3d98bdb4", - "metadata": {}, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "id": "79baaa6a-b851-45b5-9002-68af981fb145", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Install packages " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "db3117be-a03f-490f-84ed-b322d9df992e", - "metadata": {}, - "outputs": [], - "source": [ - "pip install 'pyodbc' 'fast_to_sql' 'sqlalchemy'\n", - "pip install --upgrade \"langchain-openai\" \"langchain\" \"langchain-community\"" - ] - }, - { - "cell_type": "markdown", - "id": "094bf011-77ca-41ba-a42c-b2b95f890fc7", - "metadata": {}, - "source": [ - "### Create Azure SQL Database " - ] - }, - { - "cell_type": "markdown", - "id": "64e959a6-4515-49cd-bdf8-b0da0544c10a", - "metadata": {}, - "source": [ - "Follow the instructions 
[here](https://learn.microsoft.com/en-us/azure/azure-sql/database/single-database-create-quickstart?view=azuresql&tabs=azure-portal) to create a single database in Azure SQL Database. Note that for this tutorials database the field name **Use existing data** was set to **None**." - ] - }, - { - "cell_type": "markdown", - "id": "35d70a02-7a15-4c32-b453-3a97752f9755", - "metadata": {}, - "source": [ - "### Create Azure SQL Table " - ] - }, - { - "cell_type": "markdown", - "id": "69ff0cb2-89cb-4c34-a7c0-19cd09b1d3fb", - "metadata": {}, - "source": [ - "Now that we have our SQL database we will connect to it using the python package `pyodbc` which will allow us to commit changes to our database and query tables." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5776725d-df8e-4a74-8b01-eb4f33a74b83", - "metadata": {}, - "outputs": [], - "source": [ - "import pyodbc\n", - "\n", - "server_name = \"\"\n", - "user = \"\"\n", - "password = \"\"\n", - "database = \"\"\n", - "driver= '{ODBC Driver 18 for SQL Server}'\n", - "\n", - "conn = pyodbc.connect('DRIVER='+driver+';PORT=1433;SERVER='+server+'.database.windows.net/;PORT=1443;DATABASE='+database+';UID='+user+';PWD='+ password)" - ] - }, - { - "cell_type": "markdown", - "id": "506c4b63-276e-438a-a1e5-f4b16ff34cfd", - "metadata": {}, - "source": [ - "Now that we are connected to our database we can upload our data as a table, in this example we are using a csv file from Kaggle that can be downloaded from [here](https://www.kaggle.com/datasets/henryshan/2023-data-scientists-salary). \n", - "\n", - "**Tip:** If you are using a json file you can used the command `pd.read_json` to load in the data frame." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6db341b4-62c0-4bb4-a9b6-925b7ebbeccd", - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np \n", - "# reading the csv file using read_csv and storing the data frame in variable called df\n", - "df = pd.read_csv('ds_salaries.csv')\n", - "\n", - "# view the data\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "d071bf99-2f0d-4450-935c-94732a1f27f7", - "metadata": {}, - "source": [ - "**Tip:** If you receive a **timeout error** wait a couple of minutes and then run the above code again." - ] - }, - { - "cell_type": "markdown", - "id": "48f72cf4-b307-47b1-a3e4-7b5809ec7715", - "metadata": {}, - "source": [ - "Our second python package we are using is `fast_to_sql` **(fts)** which will allow us to easily create tables from our data. Usually, you would have to create a SQL query that outlines the columns, datatype, and values of our table but **fts** does all the work for us." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "41c6cba4-f6f0-4210-a116-1a0242804fae", - "metadata": {}, - "outputs": [], - "source": [ - "from fast_to_sql import fast_to_sql as fts\n", - "table_name = \"ds_salaries\"\n", - "create_table = fts(df, table_name , conn, if_exists=\"replace\", temp=\"FALSE\")" - ] - }, - { - "cell_type": "markdown", - "id": "8beb69bb-864c-41e9-b5d2-0ed9625861b8", - "metadata": {}, - "source": [ - "Now we will commit our change to make it permanent." 
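Before committing, one aside about the connection cell above: it defines `server_name` but interpolates an undefined `server` variable into the connection string, and it specifies the port twice (1433 and 1443) with a stray `/` after the host name. A corrected sketch of the same connection, using the placeholders already defined:

```python
import pyodbc

server_name = ""   # e.g. "my-sql-server" (without .database.windows.net)
user = ""
password = ""
database = ""
driver = "{ODBC Driver 18 for SQL Server}"

# Azure SQL listens on port 1433; connect with the server's fully qualified domain name
conn = pyodbc.connect(
    "DRIVER=" + driver +
    ";SERVER=" + server_name + ".database.windows.net" +
    ";PORT=1433;DATABASE=" + database +
    ";UID=" + user + ";PWD=" + password
)
```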
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "298dbb17-b78a-4b72-8ac2-db18581a8cc8", - "metadata": {}, - "outputs": [], - "source": [ - "conn.commit()" - ] - }, - { - "cell_type": "markdown", - "id": "5d68afeb-3166-4729-9dd4-7c67e84f7673", - "metadata": {}, - "source": [ - "### Submiting a query " - ] - }, - { - "cell_type": "markdown", - "id": "217c8b47-ac37-4c28-aa5b-01a7cc997e7a", - "metadata": {}, - "source": [ - "To submit a query to our database we first need to establish our connection with a **cursor** which allows you to process data row by row.\n", - "\n", - "**Tip:** At any time you can close the connection to your database using the command `conn.close()`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5f0f9982-87fa-43c0-9b5a-0c349efad1e6", - "metadata": {}, - "outputs": [], - "source": [ - "cursor = conn.cursor()" - ] - }, - { - "cell_type": "markdown", - "id": "a67e7f2a-c081-47db-a699-e64987f8ed58", - "metadata": {}, - "source": [ - "Now we can finally submit a query to our database! In the query below we ask to count the number of workers that worked in 2023. Then we use the `execute` command to send our query to the database. The result will be an **iterable** which we will need to create a for loop to see our query result. the result you should receive is **1785**." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "329f1fa3-a542-4033-926b-58e55085c73b", - "metadata": {}, - "outputs": [], - "source": [ - "query=\"SELECT COUNT(work_year) FROM ds_salaries WHERE work_year = '2023';\"\n", - "\n", - "cursor.execute(query)\n", - "for row in cursor:\n", - " print(f'QUERY RESULT: {str(row)}') " - ] - }, - { - "cell_type": "markdown", - "id": "5666452a-54be-414f-a29d-561f91f6de82", - "metadata": {}, - "source": [ - "Another way to output our query is to make it into a list and we can use the python function `replace` to get rid of the parentheses." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "117d2be0-6e06-4f57-81e7-51ce2548e71a", - "metadata": {}, - "outputs": [], - "source": [ - "query=\"\"\"SELECT name FROM sys.columns WHERE object_id = OBJECT_ID('ds_salaries') \n", - "\"\"\"\n", - "cursor.execute(query)\n", - "\n", - "result = [str(row).replace(\"('\", \"\").replace(\"',)\", \"\") for row in cursor]\n", - "\n", - "print(result)" - ] - }, - { - "cell_type": "markdown", - "id": "9523a38d-16b7-4c34-a8bc-af64ae696853", - "metadata": {}, - "source": [ - "### Setting up a chatbot " - ] - }, - { - "cell_type": "markdown", - "id": "0b2aa0b8-fcc1-440d-b5fe-4762f0ec7f86", - "metadata": {}, - "source": [ - "For our chatbot we will be utilizing langchain to connect our model to our database." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f2225a78-8919-4821-ba29-d14f8445ace0", - "metadata": {}, - "outputs": [], - "source": [ - "#load in the required tools\n", - "from langchain_openai import AzureChatOpenAI\n", - "from sqlalchemy import create_engine\n", - "from langchain.agents import AgentType, create_sql_agent\n", - "from langchain.sql_database import SQLDatabase\n", - "from langchain.agents.agent_toolkits.sql.toolkit import SQLDatabaseToolkit" - ] - }, - { - "cell_type": "markdown", - "id": "4d09ef93-36e5-41e5-89e3-f7b34355b6da", - "metadata": {}, - "source": [ - "Enter in your OpenAI model's endpoint and key. For this tutorial we used gpt 3.5." 
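If you prefer not to paste credentials into the notebook itself, you can prompt for them at run time instead. A small optional sketch using the standard-library `getpass` module (the next cell shows the direct, hard-coded approach):

```python
import os
from getpass import getpass

# Prompt for the endpoint and key instead of hard-coding them in the notebook
os.environ["AZURE_OPENAI_ENDPOINT"] = input("Azure OpenAI endpoint: ")
os.environ["AZURE_OPENAI_API_KEY"] = getpass("Azure OpenAI API key: ")
```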
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2ad907b4-37f2-4536-b87d-1132d8dad04e", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "os.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"\"\n", - "os.environ[\"AZURE_OPENAI_API_KEY\"] = \"\"" - ] - }, - { - "cell_type": "markdown", - "id": "1e44d275-e7cc-4ad0-acc3-d61f8b97f0aa", - "metadata": {}, - "source": [ - "Set our model to the variable `llm` and enter the model name which was set when the model was deployed, this will connect langchain to our model. We are also setting the **temperature** to **0** because we don't want any randomness or creativity in the models answer only what is in the date." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0b026497-73b5-4d93-8cf4-3e0fd4a5c72c", - "metadata": {}, - "outputs": [], - "source": [ - "model_name=\"\"\n", - "\n", - "llm = AzureChatOpenAI(\n", - " openai_api_version=\"2023-05-15\",\n", - " azure_deployment=model_name,\n", - " temperature = 0\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "51f83b06-e938-459e-8f31-5223326ead34", - "metadata": {}, - "source": [ - "The first step to connecting our model to our database will be to create an engine that will help langchain connect to our SQL database using a package called `sqlalchemy`. The package will take the same info from the connection we name before but the format of driver is a little different where in this package it does not require curly brackets." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f4862c17-e38d-4220-9ccd-c1b295d6e401", - "metadata": {}, - "outputs": [], - "source": [ - "driver= \"ODBC Driver 18 for SQL Server\"" - ] - }, - { - "cell_type": "markdown", - "id": "c59d18f0-2d93-47e0-a743-f48275df47ed", - "metadata": {}, - "source": [ - "The database information will be entered as a connection string and then converted to our database engine using the command `create_engine`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c37e37ca-f6f0-4f5b-a94f-5906a25d6681", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "py_connectionString=f\"mssql+pyodbc://{user}:{password}@{server_name}.database.windows.net/{database}?driver={driver}\"\n", - "db_engine = create_engine(py_connectionString)" - ] - }, - { - "cell_type": "markdown", - "id": "0a696eeb-6a6e-4a32-994d-7460083413c2", - "metadata": {}, - "source": [ - "Now that we have established a connection to the databse we need to use the langchain package `SQLDatabase` to pass that connection to langchain. Notice that we leave the schema as **\"dbo\"** which stands for database owner and will be the default schema for all users, unless some other schema is specified. The dbo schema cannot be dropped." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "91adaf0e-1b38-4dee-988d-ceb2f3992b21", - "metadata": {}, - "outputs": [], - "source": [ - "db = SQLDatabase(db_engine, view_support=True, schema=\"dbo\")" - ] - }, - { - "cell_type": "markdown", - "id": "f3dce9b8-b054-4dad-a1c5-c53f6bcc5f04", - "metadata": {}, - "source": [ - "Lets run a test query below to ensure we are connected!" 
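You can also ask the LangChain wrapper which tables it can see; if the connection worked, `ds_salaries` should appear in the list. A minimal sketch, assuming the `db` object created above (`get_usable_table_names` is the method name in recent LangChain releases):

```python
# List the tables LangChain can access through this connection
print(db.get_usable_table_names())
```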
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4263a3d5-488e-44ee-8616-79e63758a141", - "metadata": {}, - "outputs": [], - "source": [ - "print(db.dialect)\n", - "db.run(\"SELECT COUNT(*) FROM ds_salaries WHERE work_year = 2023 AND experience_level = 'SE' \")" - ] - }, - { - "cell_type": "markdown", - "id": "46ad4ab1-84ac-43e2-b152-038abadb7184", - "metadata": {}, - "source": [ - "The last step will be to create a SQL agent. The SQL agent will provide our bot with the following instructions:\n", - "1. Taken in the users ask or question and survey the SQL table mentioned in the ask/question\n", - "2. Create a SQL query based on the columns that have relevant information to the ask/question\n", - "3. Submit the query to our database and present the results to the user\n", - "\n", - "There is no need for a prompt because the agent already supplies that.\n", - "\n", - "**Tip**: If you do not want to see the reasoning of the agent and only want to answer set `verbose` to `false` (e.g., `verbose=False`)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "10b17ab9-b1b2-4315-97be-06cb39572fcd", - "metadata": {}, - "outputs": [], - "source": [ - "toolkit = SQLDatabaseToolkit(db=db, llm=llm)\n", - "\n", - "agent_executor = create_sql_agent(llm=llm,\n", - "toolkit=toolkit,\n", - "verbose=True,\n", - "agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "faa22b65-e518-4e24-9154-e42bae28940e", - "metadata": {}, - "source": [ - "Now we can ask our bot questions about our data! Notice how in the question below we mention that the table we are looking at is **ds_salaries**." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8f968c90-4151-4d4f-b5a9-516ca34a7a58", - "metadata": {}, - "outputs": [], - "source": [ - "question = \"count the number of employees that worked in 2023 and have a experience level of SE in table ds_salaries.\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2d80a800-ba4f-431d-bab2-1e4b3e50f8da", - "metadata": {}, - "outputs": [], - "source": [ - "agent_executor.invoke(question)" - ] - }, - { - "cell_type": "markdown", - "id": "032bc4e2-9a10-4e20-b775-3e34dc3683ee", - "metadata": {}, - "source": [ - "## Conclusion " - ] - }, - { - "cell_type": "markdown", - "id": "edac27e0-45dd-450b-bb78-fa341a667575", - "metadata": {}, - "source": [ - "In this notebook you learned how to set up a Azure SQL database and connect your model to the database using langchain tools, creating a chatbot that can read and retrieve data from structured data formats." - ] - }, - { - "cell_type": "markdown", - "id": "116c547b-c569-4843-a6a9-e81c6c0f8252", - "metadata": {}, - "source": [ - "## Clean up " - ] - }, - { - "cell_type": "markdown", - "id": "5a66120b-79a4-4a5a-a78b-125cbb1e8aac", - "metadata": {}, - "source": [ - "Dont forget to turn off or delete any notebooks or compute resources! Below you will find instructions to delete the SQL database. With the first step to close the connection to the database." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0dfaedac-271b-4277-b34b-eac1c4c5fc62", - "metadata": {}, - "outputs": [], - "source": [ - "conn.close()" - ] - }, - { - "cell_type": "markdown", - "id": "d21fcfc6-23e0-40ca-bb25-6f000db03aad", - "metadata": {}, - "source": [ - "We will be using Azure CLI commands which first require use to login. Run the command below and follow the steps outputted." 
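If your notebook compute cannot open a browser window, device-code login is a convenient alternative: you copy a short code into a browser on your own machine instead.

```python
# Alternative to the cell below when a browser cannot be launched from the compute
! az login --use-device-code
```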
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5fd0edc8-af4d-469f-8eb5-81ffec8d3033", - "metadata": {}, - "outputs": [], - "source": [ - "! az login" - ] - }, - { - "cell_type": "markdown", - "id": "0190471e-7238-4a89-a276-e9ab4ff61f62", - "metadata": {}, - "source": [ - " Next we will delete our database, wait for the command to output **'Finished'**." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "27ff38bb-3a93-4c0e-90ee-dd34b8a37d9c", - "metadata": {}, - "outputs": [], - "source": [ - "resource_group=\"\"\n", - "!az sql db delete --name {database} --resource-group --server {server_name}" - ] - }, - { - "cell_type": "markdown", - "id": "5ecc188e-c4eb-40a2-8854-03027d99e079", - "metadata": {}, - "source": [ - "For this command you will need your subscriptions ID which can be found running the following command:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "34b2beb0-6cd0-4a89-ac0b-1cafcc7f762d", - "metadata": {}, - "outputs": [], - "source": [ - "!az sql server list --resource-group {resource_group}" - ] - }, - { - "cell_type": "markdown", - "id": "e70cea5d-ad6e-49f9-bac7-255f3cf3d147", - "metadata": {}, - "source": [ - "Finally delete your SQL server, wait for the command to output **'Finished'**." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2ca10666-81a4-47f2-b168-6fd0a590d4a3", - "metadata": {}, - "outputs": [], - "source": [ - "subscription_id=''\n", - "!az sql server delete --name {server_name} --resource-group {resource_group} --subscription {subscription_id} -y" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1f91700b-bfe3-452c-b5a0-0b8fed115fd8", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernel_info": { - "name": "python310-sdkv2" - }, - "kernelspec": { - "display_name": "Python 3.10 - SDK v2", - "language": "python", - "name": "python310-sdkv2" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.11" - }, - "microsoft": { - "ms_spell_check": { - "ms_spell_check_language": "en" - } - }, - "nteract": { - "version": "nteract-front-end@1.0.0" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/tutorials/notebooks/GenAI/notebooks/AzureOpenAI_embeddings.ipynb b/tutorials/notebooks/GenAI/notebooks/AzureOpenAI_embeddings.ipynb deleted file mode 100644 index cfc93cd..0000000 --- a/tutorials/notebooks/GenAI/notebooks/AzureOpenAI_embeddings.ipynb +++ /dev/null @@ -1,488 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "# Access Azure OpenAI LLMs from a notebook " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview\n", - "Models you deploy to Azure OpenAI can be accessed via API calls. This tutorial gives you the basics of creating local embeddings from custom data and querying over those." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "We assume you have access to Azure AI Studio and have already deployed an LLM." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Learning objectives\n", - "+ Get familiar with Azure OpenAI APIs\n", - "+ Learn how to create embeddings from custom data\n", - "+ Learn how to query over those embedings\n", - "+ Learn how to access deployed LLMs outside of the Azure console" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Install packages" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1696341373678 - } - }, - "outputs": [], - "source": [ - "pip install -r ../requirements.txt" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Run a query on a local csv file by creating local embeddings" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "Import required libraries" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1696365118786 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "import os\n", - "import openai\n", - "import requests\n", - "import numpy as np\n", - "import pandas as pd\n", - "from openai.embeddings_utils import get_embedding, cosine_similarity" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "You also need to [deploy a new model](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/create-resource?pivots=web-portal#deploy-a-model). You need to select and deploy `text-embedding-ada-0021`. If you get an error downstream about your model not being ready, give it up to five minutes for everything to sync. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "For simplicity, we just use a microsoft example here, but you could theoretically use any csv file as long as you match the expected format of the downstream code. This example is a recent earning report given by the CEO of Microsoft. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1696367383849 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# read the data file to be embedded\n", - "df = pd.read_csv('microsoft-earnings.csv')\n", - "print(df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1696367387035 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# set keys and configure Azure OpenAI\n", - "openai.api_type = \"azure\"\n", - "openai.api_base = \"\"\n", - "openai.api_version = \"2023-07-01-preview\"\n", - "# get the key from the instructions in the README of this repo. 
\n", - "#You can also just click View Code in the chat playground\n", - "openai.api_key = \"\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1696367395456 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# calculate word embeddings \n", - "df['embedding'] = df['text'].apply(lambda x:get_embedding(x, engine='text-embedding-ada-002'))\n", - "df.to_csv('microsoft-earnings_embeddings.csv')\n", - "print(df)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "Query the embeddings. After each query you put into the little box, you need to rerun this cell to reset the query. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1696346882392 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "# read in the embeddings .csv \n", - "# convert elements in 'embedding' column back to numpy array\n", - "df = pd.read_csv('microsoft-earnings_embeddings.csv')\n", - "df['embedding'] = df['embedding'].apply(eval).apply(np.array)\n", - "\n", - "# caluculate user query embedding \n", - "search_term = input(\"Enter a search term: \")\n", - "if search_term:\n", - " search_term_vector = get_embedding(search_term, engine='text-embedding-ada-002')\n", - "\n", - " # find similiarity between query and vectors \n", - " df['similarities'] = df['embedding'].apply(lambda x:cosine_similarity(x, search_term_vector))\n", - " df1 = df.sort_values(\"similarities\", ascending=False).head(5)\n", - "\n", - " # output the response \n", - " print('\\n')\n", - " print('Answer: ', df1['text'].loc[df1.index[0]])\n", - " print('\\n')\n", - " print('Similarity Score: ', df1['similarities'].loc[df1.index[0]]) \n", - " print('\\n')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "### Query your own data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "In the README, we show how to add your own data. When you have done this, type in a query, and then similar to what we show for above, if you click **View Code** in the Chat Playground, it will show you all the metadata you need to fill in here." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "openai.api_type = \"azure\"\n", - "openai.api_version = \"2023-08-01-preview\"\n", - "# Azure OpenAI setup\n", - "openai.api_base = \"\" # Add your endpoint here\n", - "deployment_id = \"\" # Add your deployment ID here\n", - "# Azure Cognitive Search setup\n", - "search_endpoint = \"\"; # Add your Azure Cognitive Search endpoint here\n", - "# This is different than the key from above, its the key for the Cog search\n", - "search_key = \"\"; # Add your Azure Cognitive Search admin key here\n", - "search_index_name = \"\"; # Add your Azure Cognitive Search index name here\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "Now run the query, note that the query is defined in the block below, and will output in Json format" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "gather": { - "logged": 1696353881797 - }, - "jupyter": { - "outputs_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "def setup_byod(deployment_id: str) -> None:\n", - " \"\"\"Sets up the OpenAI Python SDK to use your own data for the chat endpoint.\n", - "\n", - " :param deployment_id: The deployment ID for the model to use with your own data.\n", - "\n", - " To remove this configuration, simply set openai.requestssession to None.\n", - " \"\"\"\n", - "\n", - " class BringYourOwnDataAdapter(requests.adapters.HTTPAdapter):\n", - "\n", - " def send(self, request, **kwargs):\n", - " request.url = f\"{openai.api_base}/openai/deployments/{deployment_id}/extensions/chat/completions?api-version={openai.api_version}\"\n", - " return super().send(request, **kwargs)\n", - "\n", - " session = requests.Session()\n", - "\n", - " # Mount a custom adapter which will use the extensions endpoint for any call using the given `deployment_id`\n", - " session.mount(\n", - " prefix=f\"{openai.api_base}/openai/deployments/{deployment_id}\",\n", - " adapter=BringYourOwnDataAdapter()\n", - " )\n", - "\n", - " openai.requestssession = session\n", - "\n", - "setup_byod(deployment_id)\n", - "\n", - "completion = openai.ChatCompletion.create(\n", - " messages=[{\"role\": \"user\", \"content\": \"What were some of the phenotypic presentations of MPOX on patients with HIV?\"}],\n", - " deployment_id=deployment_id,\n", - " dataSources=[ # camelCase is intentional, as this is the format the API expects\n", - " {\n", - " \"type\": \"AzureCognitiveSearch\",\n", - " \"parameters\": {\n", - " \"endpoint\": search_endpoint,\n", - " \"key\": search_key,\n", - " \"indexName\": search_index_name,\n", - " }\n", - " }\n", - " ]\n", - ")\n", - "print(completion)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "nteract": { - "transient": { - "deleting": false - } - } - }, - "source": [ - "## Conclusion\n", - "In this notebook you learned how to feed a PDF document directly to an LLM that you deployed in the Azure console and summarize the document." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Clean up\n", - "Make sure to shut down your Azure ML compute and if desired you can delete your deployed model on Azure OpenAI." 
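If you prefer to do this from the command line, the Azure ML CLI (the v2 `ml` extension) can stop a compute instance; the names below are placeholders you would replace with your own:

```python
# Hypothetical example: stop an Azure ML compute instance (replace the placeholder names)
! az ml compute stop --name <compute-instance-name> --resource-group <resource-group> --workspace-name <workspace-name>
```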
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernel_info": { - "name": "python310-sdkv2" - }, - "kernelspec": { - "display_name": "Python 3.10 - SDK v2", - "language": "python", - "name": "python310-sdkv2" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.11" - }, - "microsoft": { - "host": { - "AzureML": { - "notebookHasBeenCompleted": true - } - }, - "ms_spell_check": { - "ms_spell_check_language": "en" - } - }, - "nteract": { - "version": "nteract-front-end@1.0.0" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/tutorials/notebooks/GenAI/notebooks/Pubmed_RAG_chatbot b/tutorials/notebooks/GenAI/notebooks/Pubmed_RAG_chatbot deleted file mode 100644 index 77f4d28..0000000 --- a/tutorials/notebooks/GenAI/notebooks/Pubmed_RAG_chatbot +++ /dev/null @@ -1,1664 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "2edc6187-82ae-44e2-852f-2ad2712c93aa", - "metadata": {}, - "source": [ - "# Creating a PubMed Chatbot using Azure" - ] - }, - { - "cell_type": "markdown", - "id": "8acf2f72", - "metadata": {}, - "source": [ - "## Overview\n", - "[PubMed](https://pubmed.ncbi.nlm.nih.gov/about/) supports the search and retrieval of biomedical and life sciences literature with the aim of improving health both globally and personally. Here we create a chatbot that is grounded on PubMed data. Most Azure command line tools are already installed and it is recommended to use the **AzureML** kernel in your Jupyter notebook." - ] - }, - { - "cell_type": "markdown", - "id": "58cb56d0", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "We assume you have access to both Azure AI Studio and Azure AI Search, and have already deployed an LLM." - ] - }, - { - "cell_type": "markdown", - "id": "3ecea2ad-7c65-4367-87e1-b021167c3a1d", - "metadata": {}, - "source": [ - "## Learning objectives\n", - "\n", - "This tutorial will cover the following topics:\n", - "+ Introduce Langchain\n", - "+ Explain the differences between zero-shot, one-shot, and few-shot prompting\n", - "+ Practice using different document retrievers" - ] - }, - { - "cell_type": "markdown", - "id": "f645b8cf", - "metadata": {}, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "id": "4d01e74b-b5b4-4be9-b16e-ec55419318ef", - "metadata": {}, - "source": [ - "### Optional: Deploy a model" - ] - }, - { - "cell_type": "markdown", - "id": "9dbd13e7-afc9-416b-94dc-418a93e14587", - "metadata": {}, - "source": [ - "In this tutorial we will be using Azure OpenAI which (if you havent already) you can learn how to deploy [here](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/create-resource?pivots=cli). This tutorial utilizes the model **gpt-35-turbo** version 0301 and the embeddings model **text-embedding-ada-002** version 2." - ] - }, - { - "cell_type": "markdown", - "id": "4f3e3ab1-5f7e-4028-a66f-9619926a2afd", - "metadata": {}, - "source": [ - "### PubMed API vs Azure AI Search" - ] - }, - { - "cell_type": "markdown", - "id": "5a820eea-1538-4f40-86c4-eb14fe09e127", - "metadata": {}, - "source": [ - "Our chatbot will rely on documents to answer our questions to do so we are supplying it a **vector index**. 
A vector index or index is a data structure that enables fast and accurate search and retrieval of vector embeddings from a large dataset of objects. We will be working with two options for our index: PubMed API vs Azure AI Search." - ] - }, - { - "cell_type": "markdown", - "id": "7314b115-9433-460d-b275-78aa50f0a858", - "metadata": {}, - "source": [ - "**What is the difference?**\n", - "\n", - "The **PubMed API** is provided free by LangChain to connect your model to more than **35 million citations** for biomedical literature from MEDLINE, life science journals, and online books. \n", - "\n", - "**Azure AI Search** (formally known as Azure Cognitive Search) is a vector store from Azure that allows the user more **security and control** on which documents you wish to supply to your model. AI Search is a vector store or database that stores the **embeddings** of your documents and the metadata. It can also act as a retriever by using the LangChain tool **AzureCognitiveSearchRetriever** which will be implementing **Retrieval-augmented generation** (RAG). RAG is a method or technique that **indexes documents** by first loading them in, splitting them into chucks (making it easier for our model to search for relevant splits), embedding the splits, then storing them in a vector store. The next steps in RAG are based on the question you ask your chatbot. If we were to ask it \"What is a cell?\" the vector store will be searched by a retriever to find relevant splits that have to do with our question, thus **retrieving relevant documents**. And finally our chatbot will **generate an answer** that makes sense of what a cell is, and point out which source documents it used to create the answer.\n", - "\n", - "We will be exploring both methods!" - ] - }, - { - "cell_type": "markdown", - "id": "bcf1690d-e93d-4cd3-89c6-8d06b5a071a8", - "metadata": {}, - "source": [ - "### Setting up Azure AI Search" - ] - }, - { - "cell_type": "markdown", - "id": "c6330ddf-7972-4451-9fcb-98cf83f5d118", - "metadata": {}, - "source": [ - "If you choose to use Azure AI Search to supply documents to your model follow the instructions below:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0c1b23ad-8809-4954-a4df-2ff3b8d9ee58", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install 'langchain' 'langchain-openai' 'langchain-community' 'unstructured' 'tiktoken'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "abddde62-f269-454e-bea6-538bd4267277", - "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "#Authenticate to use azure cli\n", - "! az login" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d8c1803b-829a-4256-a88c-1f4b57372ba2", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "#uncomment if update is needed\n", - "#! pip install -U \"azure-storage-blob\" \"azure-search-documents\"" - ] - }, - { - "cell_type": "markdown", - "id": "9428d5bc-76e3-4ee1-891b-0bc190c0ae2f", - "metadata": {}, - "source": [ - "### Setting up our storage container" - ] - }, - { - "cell_type": "markdown", - "id": "05b93a90-ff0b-430d-a5f4-4640bfb77b38", - "metadata": {}, - "source": [ - "The first step will be to create a container that we will later use as our data source for our index. Set your storage account name, location, and container name variables." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "43cdc419-25e5-4ba8-b836-a13b2ad77a26", - "metadata": { - "collapsed": false, - "gather": { - "logged": 1701806019922 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - }, - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "location = 'eastus2'\n", - "container_name = 'pubmed-chatbot-resources'\n", - "\n", - "#this should be the same as the one you used to set up your workspace\n", - "resource_group = ''\n", - "\n", - "# storage_account_name can be found by going to Azure Machine Learning Workspace > Storage \n", - "#or you can uncomment and run the command below to list the storage accounts names within your resource group\n", - "storage_account_name = ''\n", - "\n", - "#! az storage account list --resource-group {resource_group} --query \"[].{name:name}\" --output tsv" - ] - }, - { - "cell_type": "markdown", - "id": "a568fcc3-24a7-4f5d-9798-9016468a30ee", - "metadata": {}, - "source": [ - "Create your container within your storage account running the command below." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cdf6c373-a0ee-40c6-a5c6-6841c58cc3db", - "metadata": {}, - "outputs": [], - "source": [ - "! az storage container create -n {container_name} --account-name {storage_account_name}" - ] - }, - { - "cell_type": "markdown", - "id": "7adafdba-4c4d-4b96-b9bb-33143a72eafc", - "metadata": {}, - "source": [ - "Run the command below to list the key values of your storage account. The key values will be saved to a json file for protection. We will need one of these keys to create a SAS token that gives us temporary access and permissions to add objects to our container." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6414395e-80b1-4736-b116-70d82675b73b", - "metadata": {}, - "outputs": [], - "source": [ - "!az storage account keys list -g {resource_group} -n {storage_account_name} > keys.json" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eb2a94da-40de-4b69-8de9-e001b4ea98c7", - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "with open('keys.json', mode='r') as f:\n", - " data = json.load(f)\n", - "f.close()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "92cebae7-ddbb-4763-8d50-dd2c9f512696", - "metadata": {}, - "outputs": [], - "source": [ - "key=data[0]['value']" - ] - }, - { - "cell_type": "markdown", - "id": "b3ec2e81-f5c1-43f1-8db0-769069acf9f7", - "metadata": {}, - "source": [ - "Now we can create our SAS token that will last for 2 hours. Here we are giving our token the ability to read, write, list, add, and create objects (blobs) within our container." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4b311d9b-1713-4699-ab62-b228d1decc2d", - "metadata": {}, - "outputs": [], - "source": [ - "# create your SAS token\n", - "from datetime import datetime, timedelta\n", - "from azure.storage.blob import BlobServiceClient, generate_account_sas, ResourceTypes, AccountSasPermissions\n", - "start_time = datetime.utcnow()\n", - "expiry_time = start_time + timedelta(hours=2)\n", - "sas_token = generate_account_sas(\n", - " account_name=storage_account_name,\n", - " container_name=container_name,\n", - " account_key=key,\n", - " resource_types=ResourceTypes(object=True),\n", - " permission=AccountSasPermissions(read=True, write=True, delete=True, list=True, add=True, create=True),\n", - " expiry=expiry_time,\n", - " start=start_time\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "900efd36-371b-4400-9a9f-fffd1bc14cce", - "metadata": {}, - "source": [ - "### Gathering documents for the vector store" - ] - }, - { - "cell_type": "markdown", - "id": "1d1c9de7-4a06-4f85-b9ff-c8c9e51f8c70", - "metadata": {}, - "source": [ - "AWS marketplace has PubMed database named **PubMed Central® (PMC)** that contains free full-text archive of biomedical and life sciences journal article at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). We will be subsetting this database to add documents to our AI Search Index. Ensure that you have the correct permissions to allow your environment to connect to containers and AI Search." - ] - }, - { - "cell_type": "markdown", - "id": "b6ad30ba-cee8-47f9-bc1e-ece8961ac66a", - "metadata": {}, - "source": [ - "Here we are downloading the metadata file from the PMC index directory, this will list all of the articles within the PMC bucket and their paths. We will use this to subset the database into our own blob storage. Here we are using curl to connect to the public AWS s3 bucket where the metadata and documents are originally stored." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7b395e34-062d-4f77-afee-3601d471954a", - "metadata": { - "gather": { - "logged": 1701794361537 - } - }, - "outputs": [], - "source": [ - "#download the metadata file\n", - "!curl -O http://pmc-oa-opendata.s3.amazonaws.com/oa_comm/txt/metadata/csv/oa_comm.filelist.csv" - ] - }, - { - "cell_type": "markdown", - "id": "93a8595a-767f-4cad-9273-62d8e2cf60d1", - "metadata": {}, - "source": [ - "We only want the metadata of the first 100 files." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c26b0f29-2b07-43a6-800d-4aa5e957fe52", - "metadata": { - "gather": { - "logged": 1701794425470 - }, - "tags": [] - }, - "outputs": [], - "source": [ - "#import the file as a dataframe\n", - "import pandas as pd\n", - "\n", - "df = pd.read_csv('oa_comm.filelist.csv')\n", - "#first 100 files\n", - "first_100=df[0:100]" - ] - }, - { - "cell_type": "markdown", - "id": "abd1ae93-450e-4c79-83cc-ea46a1b507c1", - "metadata": {}, - "source": [ - "Lets look at our metadata! We can see that the s3 bucket path to the files are under the **Key** column this is what we will use to loop through the PMC bucket and copy the first 100 files to our bucket." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ff77b2aa-ed1b-4d27-8163-fdaa7a304582", - "metadata": { - "gather": { - "logged": 1701794430114 - } - }, - "outputs": [], - "source": [ - "first_100" - ] - }, - { - "cell_type": "markdown", - "id": "84e5f36a-239c-4c15-80ab-f896d45849d3", - "metadata": {}, - "source": [ - "The following commands uses `azcopy`, a tool that allows you to copy objects from AWS s3 buckets. The for loop we created will gather the location of each document with in AWS s3 bucket and save the documents to our container in the form of a text file." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7d63a7e2-dbf1-49ec-bc84-b8c2c8bde62d", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from io import BytesIO\n", - "#gather path to files in bucket\n", - "for i in first_100['Key']:\n", - " doc_name=i.split(r'/')[-1]\n", - " os.system(f'azcopy copy \"https://s3.amazonaws.com/pmc-oa-opendata/{i}\" \"https://{storage_account_name}.blob.core.windows.net/{container_name}/{doc_name}?{sas_token}\"')" - ] - }, - { - "cell_type": "markdown", - "id": "928de2ca-010a-4087-82a7-e548f84f3d95", - "metadata": {}, - "source": [ - "If you run into any errors make sure you have the `Storage Blob Data Contributor` role assigned to your storage account." - ] - }, - { - "cell_type": "markdown", - "id": "e5adf90e-e88b-4631-b860-81c2ea347786", - "metadata": {}, - "source": [ - "The command below sees if our files have any metadata already associated with them. If your data does not have metadata you can add it to your blob following the section **Adding Metadata to Our Data**." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3d2831bf-babd-45cf-9641-a34e1b9d7c37", - "metadata": {}, - "outputs": [], - "source": [ - "! az storage blob metadata show --container-name {container_name} --account-name {storage_account_name} --account-key {key} --name 'PMC10000000.txt'" - ] - }, - { - "cell_type": "markdown", - "id": "613cef7d-d0aa-42a8-a46e-7fd1f5c48c3b", - "metadata": {}, - "source": [ - "### Optional: Adding metadata to our dataset" - ] - }, - { - "cell_type": "markdown", - "id": "acd6b7cf-decf-4e1d-8a36-86031cc64faf", - "metadata": {}, - "source": [ - "To add metadata, our keys can't have spaces and need to be strings. Here we are making a new dataframe with wanted columns for our metadata, these columns are from the `first_100` variable we created earlier." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "22b9579b-bde9-4e57-bad6-700c7ee73645", - "metadata": {}, - "outputs": [], - "source": [ - "metadata_table = first_100[['Article Citation', 'AccessionID', 'PMID']].copy()\n", - "#make sure that all keys and values are strings the blob metadata with not beable to parse through our metadata if it is a integer\n", - "metadata_table['PMID'] = metadata_table['PMID'].apply(str)\n", - "metadata_table.rename(columns={'Article Citation': 'Article_Citation'}, inplace=True)" - ] - }, - { - "cell_type": "markdown", - "id": "c02c78e0-18d3-4162-9e58-c790ad85f76f", - "metadata": {}, - "source": [ - "Transform our table into a dictionary to add to our blob metadata." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6ae31175-f664-413d-8d2f-ecaef67038dd", - "metadata": {}, - "outputs": [], - "source": [ - "metadata_dict = metadata_table.to_dict('records')" - ] - }, - { - "cell_type": "markdown", - "id": "90dd8101-c635-43a0-9645-1115b32eb037", - "metadata": {}, - "source": [ - "Let's look at our metadata!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5a9de234-c83a-4d94-87b8-3bd51fb0c531", - "metadata": {}, - "outputs": [], - "source": [ - "metadata_dict[0]" - ] - }, - { - "cell_type": "markdown", - "id": "fd877ef2-155f-4fe2-b0c1-8e45293196e2", - "metadata": {}, - "source": [ - "Now that we have our metadata variables set we can connect to our container and the blobs within it by using a **BlobServiceClient**. This client service uses our storage account endpoint and our SAS token. Then we will construct a for loop that loops through the 'first_100' dataframe to gather our document name (which is also the blob name).\n", - "\n", - "Next it will do the following:\n", - "- Gather the metadata (if any exists) of the blob\n", - "- Update the metadata as the new metadata record we created 'metadata_dict'\n", - "- Set the metadata on the blob. Although we have updated the metadata it will not save on your blob unless you set it." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "46767760-8794-4860-9468-6b2d6b72022b", - "metadata": {}, - "outputs": [], - "source": [ - "from azure.storage.blob import BlobServiceClient\n", - "blob_service = BlobServiceClient(account_url=f'https://{storage_account_name}.blob.core.windows.net', credential=sas_token)\n", - "\n", - "for i in range(len(first_100['Key'])):\n", - " document_name = first_100['Key'][i].split(\"/\")[-1] \n", - " blob_client = blob_service.get_blob_client(container=container_name, blob=document_name)\n", - " # gather metadata properties for that blob\n", - " blob_metadata = blob_client.get_blob_properties().metadata\n", - " # Update blob metadata\n", - " more_blob_metadata = metadata_dict[i]\n", - " blob_metadata.update(more_blob_metadata)\n", - "\n", - " # Set metadata on the blob\n", - " blob_client.set_blob_metadata(metadata=blob_metadata)" - ] - }, - { - "cell_type": "markdown", - "id": "2eaf1733-a4f6-4d71-80d2-83089d6dd3f6", - "metadata": {}, - "source": [ - "Lets check the metadata of one of our blobs!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2dc7caf8-f021-4406-93bb-e4834a17cc56", - "metadata": {}, - "outputs": [], - "source": [ - "! az storage blob metadata show --container-name {container_name} --account-name {storage_account_name} --account-key {key} --name 'PMC10000000.txt'" - ] - }, - { - "cell_type": "markdown", - "id": "c1b396c8-baa9-44d6-948c-2326dc514839", - "metadata": {}, - "source": [ - "### Creating an Azure AI Search service" - ] - }, - { - "cell_type": "markdown", - "id": "bb6fa941-bf59-4cae-9aa8-2f2741f3a1b1", - "metadata": {}, - "source": [ - "To create our AI Search index, we will first need to create a search service, and request to create the free SKU to hold all our documents in our vector store. The **free** tier allows you to hold 50MB of data and 3 indexes, and indexers." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "63226024-d03e-4fa0-9557-2f18fec07bd5", - "metadata": {}, - "outputs": [], - "source": [ - "service_name = 'pubmed-search'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ea9458fa-3c0c-4249-a8bd-fd86f9bee8c7", - "metadata": {}, - "outputs": [], - "source": [ - "! az search service create --name {service_name} --sku free --location {location} --resource-group {resource_group} --partition-count 1 --replica-count 1" - ] - }, - { - "cell_type": "markdown", - "id": "4eea51b2-6511-4ae4-ba9d-963a861376cd", - "metadata": {}, - "source": [ - "Below will list the admin keys, select one of them to use for adding objects to our index." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "95382baa-abb9-4ad1-b1db-11cb9a606b7c", - "metadata": {}, - "outputs": [], - "source": [ - "! az search admin-key show --resource-group {resource_group} --service-name {service_name} > keys.json" - ] - }, - { - "cell_type": "markdown", - "id": "ee7c2455-9e61-4568-b2f1-03546c1f9878", - "metadata": {}, - "source": [ - "Save one of the keys." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1dde166b-d342-400a-ba1d-23436e1938ce", - "metadata": {}, - "outputs": [], - "source": [ - "with open('keys.json', mode='r') as f:\n", - " data = json.load(f)\n", - "search_key = data[\"primaryKey\"]" - ] - }, - { - "cell_type": "markdown", - "id": "51fbde69-5a23-45a8-a000-9952824d973a", - "metadata": {}, - "source": [ - "Now we can create our index using a SearchClient which will allow us to also define our fields within our index. Depending on the size of your documents you may need to split your document in chucks so that it fits within the token size of our model." 
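A quick way to see whether a document will fit is to count its tokens with `tiktoken` (already included in the pip install above). A minimal sketch, assuming the `cl100k_base` encoding used by the gpt-3.5 and text-embedding-ada-002 family of models:

```python
import tiktoken

# Count tokens the way gpt-3.5-turbo / text-embedding-ada-002 would see them
encoding = tiktoken.get_encoding("cl100k_base")
sample_text = "Example article text goes here."
print(f"Token count: {len(encoding.encode(sample_text))}")
```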
- ] - }, - { - "cell_type": "markdown", - "id": "59c5304e-8e14-485c-b452-e5af2da95e01", - "metadata": {}, - "source": [ - "### Creating an index and loading small documents" - ] - }, - { - "cell_type": "markdown", - "id": "c80ae3c4-ddec-473d-98e9-034e58542968", - "metadata": {}, - "source": [ - "Here we can create an index that connects our blobs in our container using an **Indexer** and a **Data Container**.\n", - "\n", - "**Warning:** This dataset contains large documents, while the below steps are only meant to show you how the process would go with smaller documents" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c0540581-7b16-49af-8c67-a3bbd8a247d2", - "metadata": {}, - "outputs": [], - "source": [ - "from azure.search.documents import SearchClient\n", - "from azure.core.credentials import AzureKeyCredential\n", - "from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient\n", - "from azure.search.documents.indexes.models import (\n", - " SearchIndexerDataContainer,\n", - " SearchIndexerDataSourceConnection,\n", - " SearchIndex,\n", - " SearchIndexer,\n", - " SearchableField,\n", - " SearchFieldDataType,\n", - " SimpleField,\n", - ")\n", - "\n", - "endpoint = \"https://{}.search.windows.net/\".format(service_name)\n", - "index_client = SearchIndexClient(endpoint, AzureKeyCredential(search_key))\n", - "indexers_client = SearchIndexerClient(endpoint, AzureKeyCredential(search_key))\n", - "connection_string = f\"DefaultEndpointsProtocol=https;AccountName={storage_account_name};AccountKey={key}\"" - ] - }, - { - "cell_type": "markdown", - "id": "0584fad3-a07f-4263-97c3-475d774e87a1", - "metadata": {}, - "source": [ - "Here we are stating our schema or fields before we create our index, these fields are from when we ran the `az storage blob metadata show` command after loading our blobs to our container.\n", - "\n", - "- **SimpleField:** A field that you can retrieve values but not search them this is ideal for keys which is a unique ID for each blob. Here we are setting the Md5 value as our key.\n", - "- **SearchableField:** A field that allows you to retrieve and search values." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2bddb00f-fa9e-4677-9d02-484b2eb5b02d", - "metadata": {}, - "outputs": [], - "source": [ - "s_index_name = \"pubmed-index-smalldocs\"\n", - "\n", - "fields = [\n", - " SimpleField(\n", - " name=\"Md5\",\n", - " type=SearchFieldDataType.String,\n", - " key=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"content\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"metadata_storage_path\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"metadata_storage_name\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"Citation\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"Accession_id\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " ),\n", - " SearchableField(\n", - " name=\"Pmid\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True,\n", - " )\n", - "]\n", - "#set our index values\n", - "index = SearchIndex(name=s_index_name, fields=fields)\n", - "#create our index\n", - "index_client.create_index(index)" - ] - }, - { - "cell_type": "markdown", - "id": "2db17d58-1cc3-4a9c-8872-aeca6da86638", - "metadata": {}, - "source": [ - "Now that our index is created we can create our **Data Container** which is the storage container that holds our documents. Once this is created we then create a **Indexer** that will link our data container and our index together, it also has the option to update our index if you were to add new blobs to your storage container." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6ae3eddb-af30-40cc-95b9-bbba29109af2", - "metadata": {}, - "outputs": [], - "source": [ - "# create a datasource\n", - "container = SearchIndexerDataContainer(name=container_name)\n", - "data_source_connection = SearchIndexerDataSourceConnection(\n", - " name=\"pubmed-datasource\", type=\"azureblob\", connection_string=connection_string, container=container\n", - ")\n", - "data_source = indexers_client.create_data_source_connection(data_source_connection)\n", - "\n", - "# create an indexer\n", - "indexer = SearchIndexer(\n", - " name=\"pubmed-indexer\", data_source_name=\"pubmed-datasource\", target_index_name=s_index_name\n", - ")\n", - "result = indexers_client.create_indexer(indexer)" - ] - }, - { - "cell_type": "markdown", - "id": "39913a9f-5026-4b50-91f9-2acd49d2999f", - "metadata": {}, - "source": [ - "Wait about 5 mins for the index and indexer to sync." - ] - }, - { - "cell_type": "markdown", - "id": "ab4cbeb1-5ba2-4f56-bf8e-7b5875c29538", - "metadata": {}, - "source": [ - "### Creating an index and loading large documents" - ] - }, - { - "cell_type": "markdown", - "id": "67ffaf25-4b54-498a-9b60-4fca4607e9e9", - "metadata": {}, - "source": [ - "For our model to retrieve information from larger documents we need to split the text in our documents into smaller chucks. This will make it easier for our model to sift through our docs to retrieve information without going over the model's token limit. " - ] - }, - { - "cell_type": "markdown", - "id": "0e7acee9-fa89-4d45-b577-5f374103792f", - "metadata": {}, - "source": [ - "If you remember before, we mentioned **RAG**, the process below follows this technique using LangChain. 
First, we will add metadata to our docs, split our docs into chunks, and embed them. Then much like for smaller documents we will create an index, the fields in our index will be different compared to the small document index. " - ] - }, - { - "cell_type": "markdown", - "id": "f1727147-3cf9-4f9f-a21e-6b33a2f5640d", - "metadata": {}, - "source": [ - "#### Adding metadata to loaded documents" - ] - }, - { - "cell_type": "markdown", - "id": "2fa34e7b-99c7-4a2e-b73b-146636a98285", - "metadata": {}, - "source": [ - "After we have our documents stored in our container we can start to load our files back. This step is necessary though redundant because we will need to embed our docs for our vector store and we need to attach metadata for each document. Although our blobs already have metadata attached to them, LangChain document loader tools only retrieves the path of our files so we need to add them back. In this case we will be using **AzureBlobStorageContainerLoader** to load in the container that holds all of our documents.\n", - "\n", - "If your data is in a directory within your container add the `prefix` variable to the loader definition.\n", - "\n", - "When we load in our documents they will be set as a tuple that is named **Documents**. This tuple will contain two items:\n", - "- **page content:** The text or content within our document\n", - "- **metadata:** The associated metadata which for now will only hold the source (path) to our documents" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4a11c98f-463b-48fd-84f7-f2b99f87d992", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from langchain_community.document_loaders import AzureBlobStorageContainerLoader\n", - "connection_string = f\"DefaultEndpointsProtocol=https;AccountName={storage_account_name};AccountKey={key}\"\n", - "print(f\"Processing documents from {container_name}\")\n", - "\n", - "loader = AzureBlobStorageContainerLoader(\n", - " conn_str=connection_string, container=container_name\n", - ")\n", - "\n", - "documents = loader.load()" - ] - }, - { - "cell_type": "markdown", - "id": "8b6ab068-2919-4d93-8711-15dd7eb19ada", - "metadata": {}, - "source": [ - "Next we use our blob service client to retrieve our metadata from our blobs to add our metadata back to our loaded docs via a for loop. The metadata will consist of the source, title, and the original metadata fields from our blob." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d0a805af-98b1-4367-9aa9-de519e38bdea", - "metadata": {}, - "outputs": [], - "source": [ - "from azure.storage.blob import BlobServiceClient\n", - "\n", - "blob_service = BlobServiceClient(account_url=f'https://{storage_account_name}.blob.core.windows.net', credential=sas_token)\n", - "\n", - "for i in range(len(documents)):\n", - " #set metadata to variable\n", - " doc_md = documents[i].metadata\n", - " #gather document name from metadata to correct source formatting\n", - " document_name = doc_md[\"source\"].split(\"/\")[-1]\n", - " source = f'{container_name}/{document_name}'\n", - " #set the first two fields of our metadata\n", - " documents[i].metadata = {\"source\": source, \"title\": document_name}\n", - " #connect to our blob to gather the metadata\n", - " blob_client = blob_service.get_blob_client(container=container_name, blob=document_name)\n", - " other_metadata = blob_client.get_blob_properties().metadata\n", - " #add the blob metadata to our loaded documents\n", - " documents[i].metadata.update(other_metadata)\n", - "print(f\"# of documents loaded (pre-chunking) = {len(documents)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "4abb10cd-4eb9-4678-aa18-b4f168f1d927", - "metadata": {}, - "source": [ - "Lets look at our metadata!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "db06d582-b84d-4b9e-9ff1-1695e37bb50e", - "metadata": {}, - "outputs": [], - "source": [ - "print(documents[0].metadata)" - ] - }, - { - "cell_type": "markdown", - "id": "57e21813-35fa-485a-ac2e-41d38676d87e", - "metadata": {}, - "source": [ - "#### Splitting our documents" - ] - }, - { - "cell_type": "markdown", - "id": "9dfb34dc-4b8d-4c92-9e64-f94926bd8793", - "metadata": {}, - "source": [ - "Splitting our data into chucks will help our vector store parse through our data faster and efficiently.\n", - "\n", - "For this step we will be using langchains **RecursiveCharacterTextSplitter**. This text splitter allows us to set the size of each chunk, if the chunks should have any text overlap (this is to help the model bridge some the chunks to make sense of them), and where best to separate texts. Each chunk will have the same metadata as the original document they came from." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3ae5e1eb-b2df-465c-a37d-3ddbad526602", - "metadata": {}, - "outputs": [], - "source": [ - "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", - "\n", - "text_splitter = RecursiveCharacterTextSplitter(\n", - " # Set a small chunk size.\n", - " chunk_size = 2000,\n", - " chunk_overlap = 20,\n", - " length_function = len,\n", - " separators=[\"\\n\\n\", \"\\n\", \".\", \"!\", \"?\", \",\", \" \", \"\"]\n", - ")\n", - "chunk = text_splitter.split_documents(documents)\n", - "\n", - "print(f\"# of documents loaded (pre-chunking) = {len(chunk)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "b70848ce-02ee-4c2c-9824-d231a4d9037a", - "metadata": {}, - "source": [ - "lets look at one of our chunks!" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a2ee4576-6d4b-4f70-8cf4-1f52abcb8208", - "metadata": {}, - "outputs": [], - "source": [ - "chunk[0]" - ] - }, - { - "cell_type": "markdown", - "id": "0c8ae84a-3e21-41b0-85d0-7093c563bb90", - "metadata": {}, - "source": [ - "#### Create an Index with a Vector Field" - ] - }, - { - "cell_type": "markdown", - "id": "10628e98-5486-4222-ad36-52ae4ad3a5c0", - "metadata": {}, - "source": [ - "For our index we will be adding in a **content_vector** field which represents each chuck embedded. **Embedding** means that we are converting our text into a **numerical vectors** that will help our model find similar objects like documents that hold similar texts or find similar photos based on the numbers assigned to the object, basically capturing texts meaning and relationship through numbers. Depending on the model you choose you have to find an embedder that is compatible to our model. Since we are using a OpenAI model the compatible embedding model will be **text-embedding-ada-002**." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "11fbd018-4197-4641-bbc3-9feff8c4b4e9", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from langchain_openai import AzureOpenAIEmbeddings\n", - "from azure.search.documents.indexes.models import SearchIndex\n", - "from azure.search.documents.indexes import SearchIndexClient\n", - "from azure.core.credentials import AzureKeyCredential\n", - "from azure.search.documents.indexes.models import (\n", - " SearchableField,\n", - " SearchField,\n", - " SearchFieldDataType,\n", - " SimpleField,\n", - " TextWeights,\n", - " VectorSearch,\n", - " VectorSearchProfile,\n", - " HnswAlgorithmConfiguration,\n", - " ComplexField\n", - ")\n", - "\n", - "endpoint = \"https://{}.search.windows.net/\".format(service_name)\n", - "index_client = SearchIndexClient(endpoint, AzureKeyCredential(search_key))\n", - "\n", - "#Setup embeddings model\n", - "os.environ[\"AZURE_OPENAI_API_KEY\"] = \"\"\n", - "os.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"\"\n", - "\n", - "embeddings = AzureOpenAIEmbeddings(\n", - " azure_deployment=\"text-embedding-ada-002\",\n", - " chunk_size=10, #processing our chunks in batches of 10\n", - ")\n", - "embedding_function = embeddings.embed_query" - ] - }, - { - "cell_type": "markdown", - "id": "f8643cfd-861c-4d6b-92cf-f21f4e15ccfb", - "metadata": {}, - "source": [ - "Now we can create our fields. You will notice that they are different from the small documents fields. Because we are using LangChain to add our chunks to our index all our metadata will be held in a field called metadata, the page_content will be held in content, and langchan will create ids for each chunk.\n", - "\n", - "Another field you might have noticed is the **content_vector** field this field will hold the content that has been embedded. To create this field we have to set a vector profile which dictates what algorithm we will have our vector store use to find text that are similar to each other (find the nearest neighbors) for this profile we will be using the **Hierarchical Navigable Small World (HNSW) algorithm**.\n", - "\n", - "- **SimpleField:** A field that you can retrieve values but not search them this is ideal for keys which is a unique ID for each blob. Here we are setting the id value as our key.\n", - "- **SearchableField:** A field that allows you to retrieve and search values." 
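For reference, `text-embedding-ada-002` produces 1536-dimensional vectors, which is the value `vector_search_dimensions` resolves to in the next cell. You can confirm this with a quick check, assuming the `embedding_function` defined above:

```python
# The length of one embedded string gives the vector dimension used in the index schema
print(len(embedding_function("hello world")))  # expected: 1536 for text-embedding-ada-002
```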
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8b542ff8-221a-4e7a-8fca-d5ce09e5976d", - "metadata": {}, - "outputs": [], - "source": [ - "fields = [\n", - " SimpleField(\n", - " name=\"id\",\n", - " type=SearchFieldDataType.String,\n", - " key=True\n", - " ),\n", - " SearchableField(\n", - " name=\"content\",\n", - " type=SearchFieldDataType.String,\n", - " searchable=True\n", - " ),\n", - " SearchField(\n", - " name=\"content_vector\",\n", - " type=SearchFieldDataType.Collection(SearchFieldDataType.Single),\n", - " searchable=True,\n", - " vector_search_dimensions=len(embedding_function(\"Text\")),\n", - " vector_search_profile_name=\"my-vector-config\"\n", - " ),\n", - " SearchableField(name=\"metadata\", type=SearchFieldDataType.String, searchable=True),\n", - "]\n", - "\n", - "vector_search = VectorSearch(\n", - " profiles=[VectorSearchProfile(name=\"my-vector-config\", algorithm_configuration_name=\"my-algorithms-config\")],\n", - " algorithms=[HnswAlgorithmConfiguration(name=\"my-algorithms-config\")],\n", - ")\n", - " \n", - "l_index_name = \"pubmed-index-largedocs\"\n", - "index = SearchIndex(name=l_index_name, fields=fields, vector_search=vector_search)\n", - "index_client.create_index(index)" - ] - }, - { - "cell_type": "markdown", - "id": "9e91998d-376f-4050-8080-50ee3c473ea6", - "metadata": {}, - "source": [ - "Define your vector store for langchain." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a1547a74-1727-4c14-8856-842b161fe201", - "metadata": {}, - "outputs": [], - "source": [ - "from langchain.retrievers import AzureCognitiveSearchRetriever\n", - "\n", - "vector_store = AzureSearch(\n", - " azure_search_endpoint=endpoint,\n", - " azure_search_key=search_key,\n", - " index_name=l_index_name,\n", - " embedding_function=embedding_function\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "fe658444-b194-4495-a53c-c39f98498178", - "metadata": {}, - "source": [ - "#### Embedding and Adding Data to Vector Store" - ] - }, - { - "cell_type": "markdown", - "id": "4e3bfb5b-a3a6-4156-bca3-394774a94565", - "metadata": {}, - "source": [ - "For our chunks to be read by our embedding model we need split the tuple within each chunk, remember that the chunks consists of tuple called **Document** that contains **page content** and **metadata**. The code below loops through the chunks and splits the page_content and metadata saving them as separate variable lists." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1ba20bef-5d38-4a99-9374-7642563d8716", - "metadata": {}, - "outputs": [], - "source": [ - "texts = [doc.page_content for doc in chunk]\n", - "metadatas = [doc.metadata for doc in chunk]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "78fcd8d9-1e07-4413-bfbf-53347adf2bcc", - "metadata": {}, - "outputs": [], - "source": [ - "chunk[0]" - ] - }, - { - "cell_type": "markdown", - "id": "62095449-f2bd-4038-ac6a-e1569887680e", - "metadata": {}, - "source": [ - "Finally we can upload our split content and metadata to our vector store! This may take 10 to 20 mins depending on how large your dataset is." 
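The cell below uploads everything in a single `add_texts` call. If that call stalls or times out on a larger corpus, a batched upload is one workaround; this is a minimal sketch (not part of the original notebook) that reuses the `vector_store`, `texts`, and `metadatas` variables defined above, with an arbitrary batch size of 100.

```python
# Optional: upload the chunks in batches rather than one large add_texts call.
# The batch size of 100 is an arbitrary starting point; tune it to your corpus.
batch_size = 100
for start in range(0, len(texts), batch_size):
    vector_store.add_texts(
        texts=texts[start:start + batch_size],
        metadatas=metadatas[start:start + batch_size],
    )
```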
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "27eff5e6-ff27-4ea8-9e6a-c0c5c05a245a", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "vector_store.add_texts(texts=texts, metadatas=metadatas)" - ] - }, - { - "cell_type": "markdown", - "id": "07b3bc6b-8c43-476f-a662-abda830dc2da", - "metadata": { - "tags": [] - }, - "source": [ - "### Creating an inference script " - ] - }, - { - "cell_type": "markdown", - "id": "3ba2291e-109e-4120-ad10-5dbfd341a07b", - "metadata": {}, - "source": [ - "In order for us to fluidly send input and receive outputs from our chatbot we need to create an **inference script** that will format inputs in a way that the chatbot can understand and format outputs in a way we can understand. We will also be supplying instructions to the chatbot through the script.\n", - "\n", - "Our script will utilize **LangChain** tools and packages to enable our model to:\n", - "- **Connect to sources of context** (e.g. providing our model with tasks and examples)\n", - "- **Rely on reason** (e.g. instruct our model how to answer based on provided context)\n", - "\n", - "The following tools must be installed via your terminal `pip install \"langchain\" \"langchain-openai\" \"langchain-community\" \"xmltodict\" \"openai\"` and the general inference script must be run on the terminal via the command `python YOUR_SCRIPT.py`." - ] - }, - { - "cell_type": "markdown", - "id": "ad374085-c4b1-4083-85a5-90cba35846d6", - "metadata": {}, - "source": [ - "The first section below will list all the tools that are required. \n", - "- **PubMedRetriever:** Utilizes the langchain retriever tool to specifically retrieve PubMed documents from the PubMed API.\n", - "- **AzureCognitiveSearchRetriever:** Connects to Azure AI Search to be used as a langchain retriever tool by specifically retrieving embedded documents stored in your vector store.\n", - "- **AzureChatOpenAI:** Connects to your deployed OpenAI model. \n", - "- **ConversationalRetrievalChain:** Allows the user to construct a conversation with the model and retrieves the outputs while sending inputs to the model.\n", - "- **PromptTemplate:** Allows the user to prompt the model to provide instructions, best method for zero and few shot prompting" - ] - }, - { - "cell_type": "markdown", - "id": "6f0ad48d-c6c8-421a-a48b-88e979d15b57", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "from langchain_community.retrievers import PubMedRetriever\n", - "from langchain_community.retrievers import AzureCognitiveSearchRetriever\n", - "from langchain_openai import AzureChatOpenAI\n", - "from langchain.chains import ConversationalRetrievalChain\n", - "from langchain.prompts import PromptTemplate\n", - "import sys\n", - "import json\n", - "import os\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "900f4c31-71cd-4f39-8bfc-de098bdbaafc", - "metadata": {}, - "source": [ - "Second will build a class that will hold the functions we need to send inputs and retrieve outputs from our model. For the beginning of our class we will establish some colors to our text conversation with our chatbot which we will utilize later." 
- ] - }, - { - "cell_type": "markdown", - "id": "decbb901-f811-4b8e-a956-4c8c7f914ae2", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "class bcolors:\n", - " HEADER = '\\033[95m'\n", - " OKBLUE = '\\033[94m'\n", - " OKCYAN = '\\033[96m'\n", - " OKGREEN = '\\033[92m'\n", - " WARNING = '\\033[93m'\n", - " FAIL = '\\033[91m'\n", - " ENDC = '\\033[0m'\n", - " BOLD = '\\033[1m'\n", - " UNDERLINE = '\\033[4m'\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "ba36d057-5189-4075-a243-18996c6fc932", - "metadata": {}, - "source": [ - "We need to extract environmental variables to connect to our Open AI model. They will be :\n", - "- OpenAI Key\n", - "- OpenAI Endpoint (url)\n", - "- Open AI Deployment Name\n", - "\n", - "If you are using Azure AI Search instead of the PubMed API we need to create a function that will gather the necessary information to connect to our vector store, which will be the:\n", - "- Azure AI Search Service Name\n", - "- Azure AI Search Index Name\n", - "- Azure AI Search API Key" - ] - }, - { - "cell_type": "markdown", - "id": "3f7a244a-7e71-40d3-ae78-8e166dd3c7ee", - "metadata": {}, - "source": [ - "```python\n", - "def build_chain():\n", - " os.getenv(\"AZURE_OPENAI_API_KEY\")\n", - " os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n", - " os.getenv(\"AZURE_COGNITIVE_SEARCH_SERVICE_NAME\")\n", - " os.getenv(\"AZURE_COGNITIVE_SEARCH_INDEX_NAME\")\n", - " os.getenv(\"AZURE_COGNITIVE_SEARCH_API_KEY\")\n", - " AZURE_OPENAI_DEPLOYMENT_NAME = os.environ[\"AZURE_OPENAI_DEPLOYMENT_NAME\"]\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "dab1012f-ed20-47b9-9162-924e03e836d5", - "metadata": {}, - "source": [ - "Now we can define our OpenAI model that has been predeployed. If you want to modify parameters, you can control them via:\n", - "- Temperature: Controls randomness, higher values increase diversity meaning a more unique response make the model to think harder. Must be a number from 0 to 1, 0 being less unique.\n", - "- Max Output Tokens: Limit of tokens outputted by the model.(optional: can assign if you like)" - ] - }, - { - "cell_type": "markdown", - "id": "8cadb1af-2c46-4ab1-92f9-6e0861f83324", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "llm = AzureChatOpenAI(\n", - " openai_api_version=\"2023-05-15\",\n", - " azure_deployment=AZURE_OPENAI_DEPLOYMENT_NAME,\n", - " temperature = 0.5\n", - " #max_tokens = 3000\n", - ")\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "c44b4f91-0c64-459b-a6e9-8a955c0797c7", - "metadata": {}, - "source": [ - "Make sure you use either the PubMed retreiver from LangChain or the Azure AI Search Index, but not both.\n", - "\n", - "If using Azure AI Search we need to specify what we are retrieving for our model to review, in this case it is the **content** part of our scheme we set within our index. We also set **'top_k'** to 2 meaning that our retriever will retrieve 2 documents that are the most similar to our query." 
- ] - }, - { - "cell_type": "markdown", - "id": "21c61724-23d3-4b49-8c72-cbd208bdb5df", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "retriever= PubMedRetriever()\n", - "\n", - "#only if using Azure AI Search as a retriever\n", - "\n", - "retriever = AzureCognitiveSearchRetriever(content_key=\"content\", top_k=2)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "ec8e464a-0931-444a-aa58-09ee0c4c9884", - "metadata": {}, - "source": [ - "Here we are constructing our **prompt_template**, this is where we can try zero-shot or few-shot prompting. Only add one method per script." - ] - }, - { - "cell_type": "markdown", - "id": "4431051e-0e84-408e-9821-f50a9b88c9c1", - "metadata": {}, - "source": [ - "#### Zero-shot prompting\n", - "\n", - "Zero-shot prompting does not require any additional training, but rather it asks a pre-trained language model to respond directly to a prompt, similar to if you were to ask Chat GPT a quick question without context. The model relies on its general language understanding and the patterns it has learned during its training to produce relevant output. In our script we grounded our model via a **retriever** to make sure it gathers information from our input data (PubMed API or Azure AI Search). \n", - "\n", - "See below that the task is more like instructions notifying our model they will be asked questions which it will answer based on the info of the scientific documents provided from the index provided (this can be the PubMed API or Vector Search index). All of this information is established as a **prompt template** for our model to receive." - ] - }, - { - "cell_type": "markdown", - "id": "c0316dc5-6274-4a5e-92e4-3d266ed6a4df", - "metadata": { - "tags": [] - }, - "source": [ - "```python\n", - "prompt_template = \"\"\"\n", - " Ignore everything before.\n", - " \n", - " Instructions:\n", - " I will provide you with research papers on a specific topic in English, and you will create a cumulative summary. \n", - " The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic. \n", - " You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers. \n", - " Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers.\n", - " \n", - " {question} Answer \"don't know\" if not present in the document. \n", - " {context}\n", - " Solution:\"\"\"\n", - " PROMPT = PromptTemplate(\n", - " template=prompt_template, input_variables=[\"context\", \"question\"],\n", - " )\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "edbe7032-8507-4d07-baab-1b3bf0e92074", - "metadata": {}, - "source": [ - "#### One-shot and Few-shot Prompting" - ] - }, - { - "cell_type": "markdown", - "id": "5614ea04-e1f8-4941-ae16-4359f718f98f", - "metadata": {}, - "source": [ - "One and few-shot prompting are similar to one-shot prompting, in addition to giving our model a task just like before we have also supplied an example of how we want the model to respond. See below for an example. " - ] - }, - { - "cell_type": "markdown", - "id": "5ffb9669-5b77-4d9b-9f4e-a0d3a18b0fae", - "metadata": {}, - "source": [ - "```python\n", - "prompt_template = \"\"\"\n", - " Instructions:\n", - " I will provide you with research papers on a specific topic in English, and you will create a cumulative summary. 
\n", - " The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic. \n", - " You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers. \n", - " Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers.\n", - " Examples:\n", - " Question: What is a cell?\n", - " Answer: '''\n", - " Cell, in biology, the basic membrane-bound unit that contains the fundamental molecules of life and of which all living things are composed. \n", - " Sources: \n", - " Chow, Christopher , Laskey, Ronald A. , Cooper, John A. , Alberts, Bruce M. , Staehelin, L. Andrew , \n", - " Stein, Wilfred D. , Bernfield, Merton R. , Lodish, Harvey F. , Cuffe, Michael and Slack, Jonathan M.W.. \n", - " \"cell\". Encyclopedia Britannica, 26 Sep. 2023, https://www.britannica.com/science/cell-biology. Accessed 9 November 2023.\n", - " '''\n", - " \n", - " {question} Answer \"don't know\" if not present in the document. \n", - " {context}\n", - " \n", - "\n", - " \n", - " Solution:\"\"\"\n", - " PROMPT = PromptTemplate(\n", - " template=prompt_template, input_variables=[\"context\", \"question\"],\n", - " )\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "82c66d53-97b2-46dc-a466-70a3d3bee4a7", - "metadata": {}, - "source": [ - "The following set of commands control the chat history essentially telling the model to expect another question after it finishes answering the previous one. Follow up questions can contain references to past chat history so the **ConversationalRetrievalChain** combines the chat history and the followup question into a standalone question, then looks up relevant documents from the retriever, and finally passes those documents and the question to a question-answering chain to return a response.\n", - "\n", - "All of these pieces such as our conversational chain, prompt, and chat history are passed through a function called **run_chain** so that our model can return a response. We have also set the length of our chat history to one, meaning that our model can only refer to the pervious conversation as a reference." - ] - }, - { - "cell_type": "markdown", - "id": "fda4d33b-60f2-4462-a8e6-bbce7f8a7b07", - "metadata": {}, - "source": [ - "```python\n", - "condense_qa_template = \"\"\"\n", - " Chat History:\n", - " {chat_history}\n", - " Here is a new question for you: {question}\n", - " Standalone question:\"\"\"\n", - " standalone_question_prompt = PromptTemplate.from_template(condense_qa_template)\n", - " \n", - " qa = ConversationalRetrievalChain.from_llm(\n", - " llm=llm, \n", - " retriever=retriever, \n", - " condense_question_prompt=standalone_question_prompt, \n", - " return_source_documents=True, \n", - " combine_docs_chain_kwargs={\"prompt\":PROMPT},\n", - " )\n", - " return qa\n", - "\n", - "def run_chain(chain, prompt: str, history=[]):\n", - " print(prompt)\n", - " return chain({\"question\": prompt, \"chat_history\": history})\n", - "\n", - "MAX_HISTORY_LENGTH = 1 #increase to refer to more pervious chats\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "b8f1ef8d-66fe-4f84-933b-af2d730bd114", - "metadata": {}, - "source": [ - "The final part of our script utilizes our class and incorporates colors to add a bit of flare to our conversation with our model. 
The model when first initialized should greet the user asking **\"Hello! How can I help you?\"** then instructs the user to ask a question or exit the session **\"Ask a question, start a New search: or CTRL-D to exit.\"**. With every question submitted to the model it is labeled as a **new search** we then run the run_chain function to get the models response or answer and add the response to the **chat history**. " - ] - }, - { - "cell_type": "markdown", - "id": "1aa6ef65-ced4-445e-875c-7fee3483b81d", - "metadata": {}, - "source": [ - "```python\n", - "if __name__ == \"__main__\":\n", - " chat_history = []\n", - " qa = build_chain()\n", - " print(bcolors.OKBLUE + \"Hello! How can I help you?\" + bcolors.ENDC)\n", - " print(bcolors.OKCYAN + \"Ask a question, start a New search: or CTRL-D to exit.\" + bcolors.ENDC)\n", - " print(\">\", end=\" \", flush=True)\n", - " for query in sys.stdin:\n", - " if (query.strip().lower().startswith(\"new search:\")):\n", - " query = query.strip().lower().replace(\"new search:\",\"\")\n", - " chat_history = []\n", - " elif (len(chat_history) == MAX_HISTORY_LENGTH):\n", - " chat_history.pop(0)\n", - " result = run_chain(qa, query, chat_history)\n", - " chat_history.append((query, result[\"answer\"]))\n", - " print(bcolors.OKGREEN + result['answer'] + bcolors.ENDC) \n", - " if 'source_documents' in result:\n", - " print(bcolors.OKGREEN + 'Sources:')\n", - " for d in result['source_documents']:\n", - " ###Use this for Azure Search AI\n", - " dict_meta=json.loads(d.metadata['metadata'])\n", - " print(dict_meta['source'])\n", - " ###\n", - " #Use this for PubMed retriever:\n", - " #print(\"PubMed UID: \"+d.metadata[\"uid\"])\n", - " print(bcolors.ENDC)\n", - " print(bcolors.OKCYAN + \"Ask a question, start a New search: or CTRL-D to exit.\" + bcolors.ENDC)\n", - " print(\">\", end=\" \", flush=True)\n", - " print(bcolors.OKBLUE + \"Bye\" + bcolors.ENDC)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "1abcbd48-bb84-4310-b8eb-ad87850a8649", - "metadata": {}, - "source": [ - "Running our script in the terminal will require us to export the following global variables before using the command `python NAME_OF_SCRIPT.py`. Example scripts are also ready to use within our 'example_scripts' folder." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ba97df23-6893-438d-8a67-cb7dbf83e407", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "#retreive info to allow langchain to connect to Azure Search AI\n", - "print(service_name)\n", - "print(l_index_name)\n", - "print(s_index_name)\n", - "print(s_index_name)\n", - "print(search_key)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7eab00a3-54ff-4873-8d25-eaf8bd18a2e6", - "metadata": {}, - "outputs": [], - "source": [ - "#enter the global variables in your terminal\n", - "export AZURE_OPENAI_API_KEY='' \\\n", - "export AZURE_OPENAI_ENDPOINT='' \\\n", - "export AZURE_OPENAI_DEPLOYMENT_NAME='' \\\n", - "export AZURE_COGNITIVE_SEARCH_SERVICE_NAME='' \\\n", - "export AZURE_COGNITIVE_SEARCH_INDEX_NAME='' \\\n", - "export AZURE_COGNITIVE_SEARCH_API_KEY='' " - ] - }, - { - "cell_type": "markdown", - "id": "bbe127e6-c0b1-4e07-ad56-38c30a9bf858", - "metadata": { - "tags": [] - }, - "source": [ - "You should see similar results on the terminal. In this example we ask the chatbot to summarize one of our documents!" 
- ] - }, - { - "cell_type": "markdown", - "id": "80c8fb4b-e74f-4e8d-892b-0f913eff747d", - "metadata": {}, - "source": [ - "![PubMed Chatbot Results](../../../docs/images/azure_chatbot.png)" - ] - }, - { - "cell_type": "markdown", - "id": "67776cc7", - "metadata": {}, - "source": [ - "## Conclusion\n", - "Here we built a chatbot using LangChain and Azure OpenAI. Key skills you learned were to:\n", - "+ Create embeddings and a vector store using Azure AI Search\n", - "+ Use the PubMed API via LangChain\n", - "+ Send prompts to the LLM and capture chat history\n", - "+ Experiment with zero-shot and one/few-shot prompting" - ] - }, - { - "cell_type": "markdown", - "id": "a178c1c6-368a-48c5-8beb-278443b685a2", - "metadata": {}, - "source": [ - "## Clean up" - ] - }, - { - "cell_type": "markdown", - "id": "7ec06a34-dc47-453f-b519-424804fa2748", - "metadata": {}, - "source": [ - "**Warning:** Don't forget to delete the resources we just made to avoid accruing additional costs!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c307bb17-757a-4579-a0d8-698eb1bb3f2e", - "metadata": {}, - "outputs": [], - "source": [ - "# delete the search service; this also deletes any indexes, data sources, and indexers\n", - "! az search service delete --name {service_name} --resource-group {resource_group} -y" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "280cea0a-a8fc-494e-8ce4-afb65847a222", - "metadata": {}, - "outputs": [], - "source": [ - "# delete the storage container\n", - "! az storage container delete -n {container_name} --account-name {storage_account_name}" - ] - }, - { - "cell_type": "markdown", - "id": "6928d95d-d7ec-43f6-9135-79fcfc9520d9", - "metadata": {}, - "source": [ - "Don't forget to also delete or undeploy your LLM and embedding model deployments within Azure AI Studio."
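If you prefer to clean up the model deployments from the CLI as well, something along these lines should work. This is an assumption rather than a command taken from the notebook, so verify it with `az cognitiveservices account deployment delete --help`, and replace the placeholder resource and deployment names with your own.

```python
# Hypothetical cleanup of an Azure OpenAI deployment (resource and deployment names are placeholders).
# Verify the command and flags with: az cognitiveservices account deployment delete --help
! az cognitiveservices account deployment delete --resource-group {resource_group} --name YOUR_AZURE_OPENAI_RESOURCE --deployment-name YOUR_DEPLOYMENT_NAME
```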
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d7350f02-aaf2-444d-b32a-c414d7d857ee", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "environment": { - "kernel": "python3", - "name": "common-cpu.m113", - "type": "gcloud", - "uri": "gcr.io/deeplearning-platform-release/base-cpu:m113" - }, - "kernel_info": { - "name": "python3" - }, - "kernelspec": { - "display_name": "Python 3.8 - AzureML", - "language": "python", - "name": "python38-azureml" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - }, - "microsoft": { - "ms_spell_check": { - "ms_spell_check_language": "en" - } - }, - "nteract": { - "version": "nteract-front-end@1.0.0" - }, - "toc-autonumbering": false - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/tutorials/notebooks/GenAI/requirements.txt b/tutorials/notebooks/GenAI/requirements.txt deleted file mode 100644 index ebf0d31..0000000 --- a/tutorials/notebooks/GenAI/requirements.txt +++ /dev/null @@ -1,12 +0,0 @@ -python-dotenv -openai -openai[embeddings] -pandas -numpy -streamlit -langchain -langchain-openai -langchain-community -azure-search-documents==11.4.0b6 -tiktoken -faiss-cpu diff --git a/tutorials/notebooks/GenAI/search_documents/Hurricane_Irene_(2005).pdf b/tutorials/notebooks/GenAI/search_documents/Hurricane_Irene_(2005).pdf deleted file mode 100644 index 01e5395..0000000 Binary files a/tutorials/notebooks/GenAI/search_documents/Hurricane_Irene_(2005).pdf and /dev/null differ diff --git a/tutorials/notebooks/GenAI/search_documents/Koutros_et_al_2023.pdf b/tutorials/notebooks/GenAI/search_documents/Koutros_et_al_2023.pdf deleted file mode 100644 index e15b849..0000000 Binary files a/tutorials/notebooks/GenAI/search_documents/Koutros_et_al_2023.pdf and /dev/null differ diff --git a/tutorials/notebooks/GenAI/search_documents/New_York_State_Route_373.pdf b/tutorials/notebooks/GenAI/search_documents/New_York_State_Route_373.pdf deleted file mode 100644 index 69f5c0d..0000000 Binary files a/tutorials/notebooks/GenAI/search_documents/New_York_State_Route_373.pdf and /dev/null differ diff --git a/tutorials/notebooks/GenAI/search_documents/Rai_et_al_2023.pdf b/tutorials/notebooks/GenAI/search_documents/Rai_et_al_2023.pdf deleted file mode 100644 index d3f5a12..0000000 Binary files a/tutorials/notebooks/GenAI/search_documents/Rai_et_al_2023.pdf and /dev/null differ diff --git a/tutorials/notebooks/GenAI/search_documents/Silverman_et_al_2023.pdf b/tutorials/notebooks/GenAI/search_documents/Silverman_et_al_2023.pdf deleted file mode 100644 index 0574732..0000000 Binary files a/tutorials/notebooks/GenAI/search_documents/Silverman_et_al_2023.pdf and /dev/null differ diff --git a/tutorials/notebooks/GenAI/search_documents/aoai_workshop_content.pdf b/tutorials/notebooks/GenAI/search_documents/aoai_workshop_content.pdf deleted file mode 100644 index df600c7..0000000 Binary files a/tutorials/notebooks/GenAI/search_documents/aoai_workshop_content.pdf and /dev/null differ diff --git a/tutorials/notebooks/GenAI/search_documents/grant_data_sub1.txt b/tutorials/notebooks/GenAI/search_documents/grant_data_sub1.txt deleted file mode 100644 index d3f5a12..0000000 --- a/tutorials/notebooks/GenAI/search_documents/grant_data_sub1.txt +++ /dev/null @@ -1 +0,0 @@ - diff --git 
a/tutorials/notebooks/GenAI/search_documents/grant_data_sub2.txt b/tutorials/notebooks/GenAI/search_documents/grant_data_sub2.txt deleted file mode 100644 index d3f5a12..0000000 --- a/tutorials/notebooks/GenAI/search_documents/grant_data_sub2.txt +++ /dev/null @@ -1 +0,0 @@ - diff --git a/tutorials/notebooks/SRADownload/SRA-Download.ipynb b/tutorials/notebooks/SRADownload/SRA-Download.ipynb deleted file mode 100644 index 963e317..0000000 --- a/tutorials/notebooks/SRADownload/SRA-Download.ipynb +++ /dev/null @@ -1,548 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "88625911", - "metadata": {}, - "source": [ - "# Download sequence data from the NCBI Sequence Read Archive (SRA)" - ] - }, - { - "cell_type": "markdown", - "id": "41cea78e", - "metadata": {}, - "source": [ - "## Overview\n", - "\n", - "DNA sequence data are typically deposited into the NCBI Sequence Read Archive, and can be accessed through the SRA website, or via a collection of command line tools called SRA Toolkit. Individual sequence entries are assigned an Accession ID, which can be used to find and download a particular file. For example, if you go to the [SRA database](https://www.ncbi.nlm.nih.gov/sra) in a browser window, and search for `SRX15695630`, you should see an entry for _C. elegans_. Alternatively, you can search the SRA metadata using Amazon Athena and generate a list of accession numbers. Here we are going to generate a list of accessions using Athena, use tools from the SRA Toolkit to download a few fastq files, then copy those fastq files to a cloud bucket. We really only scratch the surface of how to search Athena using SQL. If you want more examples, you can also try the notebooks from [this SRA GitHub repo](https://github.com/ncbi/ASHG-Workshop-2021). " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Learning objectives\n", - "+ Learn how to set up an Athena Database\n", - "+ Learn how to use AWS Glue to scrape the SRA metadata\n", - "+ Query Athena to find target Accession numbers\n", - "+ Use SRA tools to download genomic sequence data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "id": "39f62f42", - "metadata": {}, - "source": [ - "### Set up your Athena Database\n", - "You need to set up your Athena database in the Athena console before you start this notebook. Follow our [guide](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/create_athena_database.md) to walk you through it." - ] - }, - { - "cell_type": "markdown", - "id": "7aed7098", - "metadata": {}, - "source": [ - "### Install packages\n" - ] - }, - { - "cell_type": "markdown", - "id": "7e9e2c86", - "metadata": {}, - "source": [ - "Install dependencies, including mamba (you could also use conda). At the time of writing, the version of SRA tools available with the Anaconda distribution was v.2.11.0. If you want to install the latest version, download and install from [here](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). If you do the direct install, you will also need to configure interactively following [this guide](https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration), you can do that by opening a terminal and running the commands there." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c69dca1f", - "metadata": {}, - "outputs": [], - "source": [ - "! 
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n", - "! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "29930bfc", - "metadata": {}, - "outputs": [], - "source": [ - "#add to your path\n", - "import os\n", - "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bd8f7f67", - "metadata": {}, - "outputs": [], - "source": [ - "! mamba install -c bioconda -c conda-forge sra-tools==2.11.0 sql-magic pyathena -y" - ] - }, - { - "cell_type": "markdown", - "id": "0032d702", - "metadata": {}, - "source": [ - "Test that your install works and that fasterq-dump is available in your path" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a5e68c3f", - "metadata": {}, - "outputs": [], - "source": [ - "!fasterq-dump -h" - ] - }, - { - "cell_type": "markdown", - "id": "ddc46609", - "metadata": {}, - "source": [ - "### Setup Directory Structure and Create a Staging Bucket" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3ec72dd0", - "metadata": {}, - "outputs": [], - "source": [ - "! mkdir -p data data/fasterqdump/raw_fastq data/prefetch_fasterqdump/raw_fastq" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "827f2447", - "metadata": {}, - "outputs": [], - "source": [ - "cd data/" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0f583e32", - "metadata": {}, - "outputs": [], - "source": [ - "#make sure you change this name, it needs to be globally unique\n", - "%env BUCKET=sra-data-athena" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ddf58849", - "metadata": {}, - "outputs": [], - "source": [ - "# will only create the bucket if it doesn't yet exist\n", - "# if the bucket exists you won't see any output\n", - "! aws s3 ls s3://$BUCKET >& /dev/null || aws s3 mb s3://$BUCKET" - ] - }, - { - "cell_type": "markdown", - "id": "086a50c1", - "metadata": {}, - "source": [ - "### Create Accession List using Athena" - ] - }, - { - "cell_type": "markdown", - "id": "4033ef70", - "metadata": {}, - "source": [ - "Here we use Athena to generate a list of accessions. You can also generate a manual list by searching the [SRA Database](https://www.ncbi.nlm.nih.gov/sra) and saving to a file or list." - ] - }, - { - "cell_type": "markdown", - "id": "afa1369f", - "metadata": {}, - "source": [ - "If you get a module not found error for either of these, rerun the mamba commands above, make sure mamba is still in your path, or just use `pip install pyathena`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3d4c368e", - "metadata": {}, - "outputs": [], - "source": [ - "#import packages\n", - "from pyathena import connect\n", - "import pandas as pd" - ] - }, - { - "cell_type": "markdown", - "id": "fbdfda5e", - "metadata": {}, - "source": [ - "Establish connection. List your staging bucket and the region of your bucket. Make sure your bucket is in us-east-1 to avoid egress charges when downloading from sra." 
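The connection cell that follows hardcodes the staging bucket name. If you changed `BUCKET` earlier, a minimal alternative (a sketch, assuming the `BUCKET` environment variable set above) is:

```python
import os
from pyathena import connect

# Reuse the staging bucket set earlier with %env BUCKET instead of hardcoding it.
conn = connect(
    s3_staging_dir=f"s3://{os.environ['BUCKET']}/",
    region_name="us-east-1",
)
```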
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "180fb47b", - "metadata": {}, - "outputs": [], - "source": [ - "conn = connect(s3_staging_dir='s3://sra-data-athena/',\n", - " region_name='us-east-1')" - ] - }, - { - "cell_type": "markdown", - "id": "882f2dd9", - "metadata": {}, - "source": [ - "**When you run the query in the next cell you may get this error**:\n", - "`An error occurred (AccessDeniedException) when calling the StartQueryExecution operation: User: arn:aws:sts::055102001469:assumed-role/sagemaker-notebook-instance-role/SageMaker is not authorized to perform: athena:StartQueryExecution on resource: arn:aws:athena:us-east-1:055102001469:workgroup/primary because no identity-based policy allows the athena:StartQueryExecution action`\n", - "\n", - "If you get this error, read our [IAM guide](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/update_sagemaker_role.md) to set up the correct policy for your SageMaker role. \n" - ] - }, - { - "cell_type": "markdown", - "id": "5830d8e3", - "metadata": {}, - "source": [ - "Now that the permissions are all set up, let's download bacterial samples. You can change the SQL query as you like; feel free to take a look at the generated df, and then play with different parameters. For more inspiration on what is possible with SQL queries, look at this [SRA tutorial](https://github.com/ncbi/ASHG-Workshop-2021/blob/main/3_Biology_Example_AWS_Demo.ipynb)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b5eff316", - "metadata": {}, - "outputs": [], - "source": [ - "query = \"\"\"\n", - "SELECT *\n", - "FROM AwsDataCatalog.srametadata.metadata\n", - "WHERE organism = 'Mycobacteroides chelonae' \n", - "limit 3;\n", - "\"\"\"\n", - "df = pd.read_sql(\n", - " query, conn\n", - ")\n", - "df" - ] - }, - { - "cell_type": "markdown", - "id": "d3511937", - "metadata": {}, - "source": [ - "As you can see, most of what you need to know is shown in this data frame. If you wanted to show just the accession, you could replace the * with acc in the SELECT command. One other thing to think about is how large these files are, and whether you have space on your VM to download them. You can figure this out by looking at the 'jattr' column, converting the number of bytes to GB, and adding that up for a few samples to get a ballpark figure (a quick sketch for totaling these sizes follows below). If you need more space, stop the VM, go to the SageMaker console, and [resize your disk](https://aws.amazon.com/blogs/machine-learning/customize-your-notebook-volume-size-up-to-16-tb-with-amazon-sagemaker/). Make sure your notebook instance is stopped before you edit and resize it. You can see the amount of space on your disk from the command line using `!df -h .`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2615b98d", - "metadata": {}, - "outputs": [], - "source": [ - "df['jattr'][0]" - ] - }, - { - "cell_type": "markdown", - "id": "2ea5dfb0", - "metadata": {}, - "source": [ - "You can also get the same info using `vdb-dump --info <accession>`. You can also get the path for the compressed SRA file in a bucket using `srapath <accession>`."
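To turn the 'jattr' column into the ballpark size estimate mentioned above, you can sum the per-run byte counts and convert to GB. This is a sketch, not part of the original notebook; it assumes `jattr` is a JSON string containing a `bytes` field, so check the output of `df['jattr'][0]` to confirm the key name for your query.

```python
import json

# Sum per-run sizes from the jattr JSON column and convert bytes to GB.
# The "bytes" key is an assumption -- confirm it against df['jattr'][0].
total_bytes = sum(int(json.loads(j).get("bytes", 0)) for j in df["jattr"])
print(f"Approximate total download size: {total_bytes / 1e9:.2f} GB")
```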
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "21acb1be", - "metadata": {}, - "outputs": [], - "source": [ - "!vdb-dump --info SRR13349124 " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "88e3aa85", - "metadata": {}, - "outputs": [], - "source": [ - "!srapath SRR13349124" - ] - }, - { - "cell_type": "markdown", - "id": "e39e7a97", - "metadata": {}, - "source": [ - "Save our accession list to a text file" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0aca98b1", - "metadata": {}, - "outputs": [], - "source": [ - "with open('list_of_accessionIDS.txt', 'w') as f:\n", - " accs = df['acc'].to_string(header=False, index=False)\n", - " f.write(accs)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3ac5c48b", - "metadata": {}, - "outputs": [], - "source": [ - "cat list_of_accessionIDS.txt" - ] - }, - { - "cell_type": "markdown", - "id": "01437b57", - "metadata": {}, - "source": [ - "### Download FASTQ files with fasterq dump" - ] - }, - { - "cell_type": "markdown", - "id": "e3ff0d5e", - "metadata": {}, - "source": [ - "Fasterq-dump is the replacement for the legacy fastq-dump tool. You can read [this guide](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) to see the full details on this tool. You can also run `fasterq-dump -h` to see most of the options" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4764f355", - "metadata": {}, - "outputs": [], - "source": [ - "cd fasterqdump/" - ] - }, - { - "cell_type": "markdown", - "id": "37097eb4", - "metadata": {}, - "source": [ - "Fasterq dump doesn't run in batch mode, so one way to run a command on multiple samples is by using a for loop. There are many options you can explore, but here we are running -O for outdir, -e for the number of threads, -m for memory (4GB). The default number of threads = 6, so adjust -e based on your machine size. For large files, you may also benefit from a machine type with more memory and/or threads. You may need to stop this VM, resize it, then restart and come back. There are also a bunch of ways to split your fastq files (defined [here](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump)) but the default of `split 3` will split into forward, reverse, and unpaired reads. Depending on your machine size, expect about 5 min for these three files." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "80c2e3b4", - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "!for x in `cat ../list_of_accessionIDS.txt`; do fasterq-dump -f -O raw_fastq -e 8 -m 4G $x ; done" - ] - }, - { - "cell_type": "markdown", - "id": "84c5acc6", - "metadata": {}, - "source": [ - "On our VM that command took 6.5 min, although with a larger machine size it will run faster." - ] - }, - { - "cell_type": "markdown", - "id": "55bd52cd", - "metadata": {}, - "source": [ - "### Download FASTQ files with prefetch + fasterq dump" - ] - }, - { - "cell_type": "markdown", - "id": "b15200f2", - "metadata": {}, - "source": [ - "Using the example bacterial data, fasterq dump took about 6.5 min to download the files (ml.t3.2xlarge with 8 CPUs and 32 GB RAM). Under the hood, fasterq dump is pulling the compressed sra files from the database (in this case it should be coming from AWS) and converting them on the fly, which is slow (ish) because it has to do a lot over the network. 
A better method is to disaggregate these functions: use prefetch to pull the compressed files, then use fasterq-dump to convert them locally rather than over the network. For this to work, you need to either give the path to the prefetch directories in your text file, or make sure you cd into the raw_fastq dir so that fasterq-dump can find the directories with the .sra files." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ddefec2d", - "metadata": {}, - "outputs": [], - "source": [ - "cd ../prefetch_fasterqdump" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "935f6ca2", - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "!prefetch --option-file ../list_of_accessionIDS.txt -O raw_fastq -f yes" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7eece75e", - "metadata": {}, - "outputs": [], - "source": [ - "ls raw_fastq/" - ] - }, - { - "cell_type": "markdown", - "id": "14eb650a", - "metadata": {}, - "source": [ - "Now convert the prefetched records" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1852a71a", - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "!for x in `cat ../list_of_accessionIDS.txt`; do fasterq-dump -f -O raw_fastq -e 8 -m 4G raw_fastq/$x; done" - ] - }, - { - "cell_type": "markdown", - "id": "49507511", - "metadata": {}, - "source": [ - "Comparing the two methods, we can see that fasterq-dump on its own took 6.5 min, whereas prefetch + fasterq-dump took less than 1.5 min." - ] - }, - { - "cell_type": "markdown", - "id": "ea152fd7", - "metadata": {}, - "source": [ - "### Copy Files to a Bucket" - ] - }, - { - "cell_type": "markdown", - "id": "7a4eef67", - "metadata": {}, - "source": [ - "`--recursive` copies a whole directory, like `-r` in bash; the `--exclude`/`--include` filters limit the copy to the fastq files. S3 multithreads by default, so you don't have to specify threads." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ad73308f", - "metadata": {}, - "outputs": [], - "source": [ - "!aws s3 cp raw_fastq/ s3://sra-data-athena/raw_fastq/ --recursive --exclude \"*\" --include \"*.fastq\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "072ebc9a", - "metadata": {}, - "outputs": [], - "source": [ - "!aws s3 ls s3://sra-data-athena/raw_fastq/" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusions\n", - "You learned here how to bring the SRA metadata into Athena, query the Athena DB to find target accession numbers, and then use SRA tools to download sequence data locally." - ] - }, - { - "cell_type": "markdown", - "id": "a4026566", - "metadata": {}, - "source": [ - "## Clean up\n", - "Make sure you shut down this VM, or delete it if you don't plan to use it further.
You can also [delete the buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) if you don't want to pay for the data: `aws s3 rb s3://bucket-name --force`" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - } - ], - "metadata": { - "environment": { - "kernel": "python3", - "name": "common-cpu.m93", - "type": "gcloud", - "uri": "gcr.io/deeplearning-platform-release/base-cpu:m93" - }, - "kernelspec": { - "display_name": "conda_python3", - "language": "python", - "name": "conda_python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.13" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/tutorials/notebooks/SpleenLiverSegmentation/README.md b/tutorials/notebooks/SpleenLiverSegmentation/README.md deleted file mode 100644 index 976552f..0000000 --- a/tutorials/notebooks/SpleenLiverSegmentation/README.md +++ /dev/null @@ -1,49 +0,0 @@ -# Spleen Segmentation with Liver Example using NVIDIA Models and MONAI -_We have put together a training example that segments the Spleen in 3D CT Images. At the end is an example of combining both the Spleen model and the Liver model._ - -*Nvidia has changed some of the models used in this tutorial and it may crash, if you have issues, try commenting out the liver model, we are working on a patch* - -## Introduction -Two pre-trained models from NVIDIA are used in this training, a Spleen model and Liver. -The Spleen model is additionally retrained on the medical decathlon spleen dataset: [http://medicaldecathlon.com/](http://medicaldecathlon.com/) -Data is not necessary to be downloaded to run the notebook. The notebook downloads the data during it's run. -The notebook uses the Python package [MONAI](https://monai.io/), the Medical Open Network for Artificial Intelligence. - -- Spleen Model - [clara_pt_spleen_ct_segmentation_V2](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/monaitoolkit/models/monai_spleen_ct_segmentation) -- Liver Model - [clara_pt_liver_and_tumor_ct_segmentation_V1]() - -## Outcomes -After following along with this notebook the user will be familiar with: -- Downloading public datasets using MONAI -- Using MONAI transformations for training -- Downloading a pretrained NVIDIA Clara model using MONAI -- Retrain model using MONAI -- Visualizing medical images in python/matplotlib - -## Installing MONAI -Please follow the [instructions](https://monai.io/started.html#installation) on MONAI's website for up to date install. -Installing MONAI in a notebook environment can be completed with the commands: -- !python -c "import monai" || pip install -q 'monai[all]' -- !python -c "import matplotlib" || pip install -q matplotlib - -## Dependencies -_It is recommended to use an NVIDIA GPU for training. 
If the user does not have access to a NVIDIA GPU then it is recommended to skip the training cells._ - -The following packages and versions were installed during the testing of this notebook: -- MONAI version: 0.8.1 -- Numpy version: 1.21.1 -- Pytorch version: 1.9.0 -- Pytorch Ignite version: 0.4.8 -- Nibabel version: 3.2.1 -- scikit-image version: 0.18.2 -- Pillow version: 8.3.1 -- Tensorboard version: 2.5.0 -- gdown version: 3.13.0 -- TorchVision version: 0.10.0+cu111 -- tqdm version: 4.61.2 -- lmdb version: 1.2.1 -- psutil version: 5.8.0 -- pandas version: 1.3.0 -- einops version: 0.3.0 -- transformers version: 4.18.0 -- mlflow version: 1.25.1 diff --git a/tutorials/notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb b/tutorials/notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb deleted file mode 100644 index cf8b3fe..0000000 --- a/tutorials/notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb +++ /dev/null @@ -1,1017 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "1452463e", - "metadata": {}, - "source": [ - "# Spleen Model With NVIDIA Pretrain" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Overview\n", - "This notebook conducts image segmentation of spleen images using an NVIDIA pretrained model. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "We assume you have provisioned a compute environment in Azure ML Studio **with a GPU**! A T4 GPU will work fine." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Learning objectives\n", - "+ Learn how to use NVIDIA pre-trained models for image segmentation within Azure ML Studio" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Install packages" - ] - }, - { - "cell_type": "markdown", - "id": "f59ba435", - "metadata": {}, - "source": [ - "Uncomment below to install all dependencies." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "82db674f", - "metadata": {}, - "outputs": [], - "source": [ - "#!pip install 'monai[all]'\n", - "#!pip install matplotlib " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bb1228b3", - "metadata": {}, - "outputs": [], - "source": [ - "%matplotlib inline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "540e5d47", - "metadata": {}, - "outputs": [], - "source": [ - "# MONAI version: 0.6.0+38.gf6ad4ba5\n", - "# Numpy version: 1.21.1\n", - "# Pytorch version: 1.9.0\n", - "# Pytorch Ignite version: 0.4.5\n", - "# Nibabel version: 3.2.1\n", - "# scikit-image version: 0.18.2\n", - "# Pillow version: 8.3.1\n", - "# Tensorboard version: 2.5.0\n", - "# gdown version: 3.13.0\n", - "# TorchVision version: 0.10.0+cu111\n", - "# tqdm version: 4.61.2\n", - "# lmdb version: 1.2.1\n", - "# psutil version: 5.8.0\n", - "# pandas version: 1.3.0\n", - "# einops version: 0.3.0" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "07510582", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import tempfile\n", - "import glob\n", - "\n", - "import matplotlib.pyplot as plt\n", - "#import plotly.graph_objects as go\n", - "import torch\n", - "import numpy as np\n", - "\n", - "from monai.apps import download_and_extract\n", - "from monai.networks.nets import UNet\n", - "from monai.networks.layers import Norm\n", - "from monai.losses import DiceFocalLoss\n", - "from monai.metrics import DiceMetric\n", - "from monai.inferers import sliding_window_inference\n", - "from monai.data import (\n", - " LMDBDataset,\n", - " DataLoader,\n", - " decollate_batch,\n", - " ImageDataset,\n", - " Dataset\n", - ")\n", - "from monai.apps import load_from_mmar\n", - "from monai.transforms import (\n", - " AsDiscrete,\n", - " EnsureChannelFirstd,\n", - " Compose,\n", - " LoadImaged,\n", - " ScaleIntensityRanged,\n", - " Spacingd,\n", - " Orientationd,\n", - " CropForegroundd,\n", - " RandCropByPosNegLabeld,\n", - " RandAffined,\n", - " RandRotated,\n", - " EnsureType,\n", - " EnsureTyped,\n", - ")\n", - "from monai.utils import first, set_determinism\n", - "from monai.apps.mmars import RemoteMMARKeys\n", - "from monai.config import print_config\n", - "\n", - "print_config()" - ] - }, - { - "cell_type": "markdown", - "id": "6f523cbf", - "metadata": {}, - "source": [ - "### Running a pretrained model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0be7401d", - "metadata": {}, - "outputs": [], - "source": [ - "PRETRAINED = True" - ] - }, - { - "cell_type": "markdown", - "id": "e9f3e5f3", - "metadata": {}, - "source": [ - "Create the directory for storing data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "311c3282", - "metadata": {}, - "outputs": [], - "source": [ - "directory = \"monai_data/\"\n", - "root_dir = tempfile.mkdtemp() if directory is None else directory\n", - "print(root_dir)" - ] - }, - { - "cell_type": "markdown", - "id": "38463a18", - "metadata": {}, - "source": [ - "### Download the public dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "da7cfede", - "metadata": {}, - "outputs": [], - "source": [ - "resource = \"https://msd-for-monai.s3-us-west-2.amazonaws.com/Task09_Spleen.tar\"\n", - "md5 = \"410d4a301da4e5b2f6f86ec3ddba524e\"\n", - "\n", - "compressed_file = os.path.join(root_dir, \"Task09_Spleen.tar\")\n", - "download_and_extract(resource, compressed_file, root_dir, md5)\n", - 
"data_dir = os.path.join(root_dir, \"Task09_Spleen\")" - ] - }, - { - "cell_type": "markdown", - "id": "fae7c51b", - "metadata": {}, - "source": [ - "### Create Date Dictionaries and separate files from training and validation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2515b177", - "metadata": {}, - "outputs": [], - "source": [ - "train_images = sorted(\n", - " glob.glob(os.path.join(data_dir, \"imagesTr\", \"*.nii.gz\")))\n", - "train_labels = sorted(\n", - " glob.glob(os.path.join(data_dir, \"labelsTr\", \"*.nii.gz\")))\n", - "data_dicts = [\n", - " {\"image\": image_name, \"label\": label_name}\n", - " for image_name, label_name in zip(train_images, train_labels)\n", - "]\n", - "train_files, val_files = data_dicts[:-9], data_dicts[-9:]" - ] - }, - { - "cell_type": "markdown", - "id": "974fc5aa", - "metadata": {}, - "source": [ - "### Define your transformations for training and validation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2357d35d", - "metadata": {}, - "outputs": [], - "source": [ - "train_transforms = Compose( #Transformations for training dataset\n", - " [\n", - " LoadImaged(keys=[\"image\", \"label\"]), #Load dictionary based images and labels\n", - " EnsureChannelFirstd(keys=[\"image\", \"label\"]), #Ensures the first channel of each image is the channel dimension\n", - " Spacingd(keys=[\"image\", \"label\"], pixdim=( #Change spacing of voxels to be same across images\n", - " 1.5, 1.5, 2.0), mode=(\"bilinear\", \"nearest\")),\n", - " Orientationd(keys=[\"image\", \"label\"], axcodes=\"RAS\"), #Correct the orientation of images (Right, Anterior, Superior)\n", - " ScaleIntensityRanged( #Scale intensity of all images (For images only and not labels)\n", - " keys=[\"image\"], a_min=-57, a_max=164,\n", - " b_min=0.0, b_max=1.0, clip=True,\n", - " ),\n", - " CropForegroundd(keys=[\"image\", \"label\"], source_key=\"image\"), #Crop foreground of image\n", - " RandCropByPosNegLabeld( #Randomly crop fixed sized region\n", - " keys=[\"image\", \"label\"],\n", - " label_key=\"label\",\n", - " spatial_size=(96, 96, 96),\n", - " pos=1,\n", - " neg=1,\n", - " num_samples=4,\n", - " image_key=\"image\",\n", - " image_threshold=0,\n", - " ),\n", - " RandAffined( #Do a random affine transformation with some probability\n", - " keys=['image', 'label'],\n", - " mode=('bilinear', 'nearest'),\n", - " prob=0.5,\n", - " spatial_size=(96, 96, 96),\n", - " rotate_range=(np.pi/18, np.pi/18, np.pi/5),\n", - " scale_range=(0.05, 0.05, 0.05)\n", - " ),\n", - " EnsureTyped(keys=[\"image\", \"label\"]),\n", - " ]\n", - ")\n", - "val_transforms = Compose( #Transformations for testing dataset\n", - " [\n", - " LoadImaged(keys=[\"image\", \"label\"]),\n", - " EnsureChannelFirstd(keys=[\"image\", \"label\"]),\n", - " Spacingd(keys=[\"image\", \"label\"], pixdim=(\n", - " 1.5, 1.5, 2.0), mode=(\"bilinear\", \"nearest\")),\n", - " Orientationd(keys=[\"image\", \"label\"], axcodes=\"RAS\"),\n", - " ScaleIntensityRanged(\n", - " keys=[\"image\"], a_min=-57, a_max=164,\n", - " b_min=0.0, b_max=1.0, clip=True,\n", - " ),\n", - " RandRotated(\n", - " keys=['image', 'label'],\n", - " mode=('bilinear', 'nearest'),\n", - " range_x=np.pi/18,\n", - " range_y=np.pi/18,\n", - " range_z=np.pi/5,\n", - " prob=1.0,\n", - " padding_mode=('reflection', 'reflection'),\n", - " ),\n", - " CropForegroundd(keys=[\"image\", \"label\"], source_key=\"image\"),\n", - " EnsureTyped(keys=[\"image\", \"label\"]),\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "id": "ada5757a", - "metadata": {}, - "outputs": [], - "source": [ - "val_files" - ] - }, - { - "cell_type": "markdown", - "id": "ba3c7695", - "metadata": {}, - "source": [ - "### Visualize Image and Label (example)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "689eea4e", - "metadata": {}, - "outputs": [], - "source": [ - "check_ds = Dataset(data=val_files, transform=val_transforms)\n", - "check_loader = DataLoader(check_ds, batch_size=1)\n", - "check_data = first(check_loader)\n", - "image, label = (check_data[\"image\"][0][0], check_data[\"label\"][0][0])\n", - "print(f\"image shape: {image.shape}, label shape: {label.shape}\")\n", - "# plot the slice [:, :, 80]\n", - "plt.figure(\"check\", (12, 6))\n", - "plt.subplot(1, 2, 1)\n", - "plt.title(\"image\")\n", - "plt.imshow(image[:, :, 80], cmap=\"gray\")\n", - "plt.subplot(1, 2, 2)\n", - "plt.title(\"label\")\n", - "plt.imshow(label[:, :, 80])\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "f45ba707", - "metadata": {}, - "source": [ - "### Use a dataloader to load files\n", - "Ability to use LMDB (Lightning Memory-Mapped Database). Here is where transforms take place and they happen on both images and labels." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fe3285d0", - "metadata": {}, - "outputs": [], - "source": [ - "train_ds = LMDBDataset(data=train_files, transform=train_transforms, cache_dir=root_dir)\n", - "# initialize cache and print meta information\n", - "print(train_ds.info())\n", - "\n", - "# use batch_size=2 to load images and use RandCropByPosNegLabeld\n", - "# to generate 2 x 4 images for network training\n", - "train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=2)\n", - "\n", - "# the validation data loader will be created on the fly to ensure \n", - "# a deterministic validation set for demo purpose.\n", - "val_ds = LMDBDataset(data=val_files, transform=val_transforms, cache_dir=root_dir)\n", - "# initialize cache and print meta information\n", - "print(val_ds.info())" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "455cbcdc", - "metadata": {}, - "outputs": [], - "source": [ - "print(train_ds.info())" - ] - }, - { - "cell_type": "markdown", - "id": "a77e7856", - "metadata": {}, - "source": [ - "### Download the pretrained model from NVIDIA" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8539fb7d", - "metadata": {}, - "outputs": [], - "source": [ - "mmar = {\n", - " RemoteMMARKeys.ID: \"clara_pt_spleen_ct_segmentation_1\",\n", - " RemoteMMARKeys.NAME: \"clara_pt_spleen_ct_segmentation\",\n", - " RemoteMMARKeys.FILE_TYPE: \"zip\",\n", - " RemoteMMARKeys.HASH_TYPE: \"md5\",\n", - " RemoteMMARKeys.HASH_VAL: None,\n", - " RemoteMMARKeys.MODEL_FILE: os.path.join(\"models\", \"model.pt\"),\n", - " RemoteMMARKeys.CONFIG_FILE: os.path.join(\"config\", \"config_train.json\"),\n", - " RemoteMMARKeys.VERSION: 2,\n", - "}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "de7fb262", - "metadata": {}, - "outputs": [], - "source": [ - "mmar['name']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bf96f9f9", - "metadata": {}, - "outputs": [], - "source": [ - "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\") #torch.device(\"cpu\")\n", - "if PRETRAINED:\n", - " print(\"using a pretrained model.\")\n", - " try: #MONAI=0.8\n", - " unet_model = load_from_mmar(\n", - " item = mmar['name'], \n", - " 
mmar_dir=root_dir,\n", - " map_location=device,\n", - " version=mmar['version'],\n", - " pretrained=True)\n", - " except: #MONAI<0.8\n", - " unet_model = load_from_mmar(\n", - " mmar, \n", - " mmar_dir=root_dir,\n", - " map_location=device,\n", - " pretrained=True)\n", - " model = unet_model\n", - "else: \n", - " print(\"using a randomly init. model.\")\n", - " model = UNet(\n", - " dimensions=3,\n", - " in_channels=1,\n", - " out_channels=2,\n", - " channels=(16, 32, 64, 128, 256),\n", - " strides=(2, 2, 2, 2),\n", - " num_res_units=2,\n", - " norm=Norm.BATCH,\n", - " )\n", - "\n", - "model = model.to(device)" - ] - }, - { - "cell_type": "markdown", - "id": "39910557", - "metadata": {}, - "source": [ - "This will be our test file we will view for reference. Here we see how our initial model appears to perform." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4be7eb8f", - "metadata": {}, - "outputs": [], - "source": [ - "test_file = data_dicts[20:21]\n", - "test_ds = LMDBDataset(data=test_file, transform=None, cache_dir=root_dir)" - ] - }, - { - "cell_type": "markdown", - "id": "2544a774", - "metadata": {}, - "source": [ - "We use a sliding window technique to search the image." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "16fd4e94", - "metadata": {}, - "outputs": [], - "source": [ - "num_classes=2\n", - "post_pred = Compose([EnsureType(), AsDiscrete(argmax=True, to_onehot=num_classes)])\n", - "post_label = Compose([EnsureType(), AsDiscrete(to_onehot=num_classes)])\n", - "model.eval()\n", - "with torch.no_grad():\n", - " for data in DataLoader(test_ds, batch_size=1, num_workers=2):\n", - " test_inputs, test_labels = (\n", - " data[\"image\"].to(device),\n", - " data[\"label\"].to(device),\n", - " )\n", - " roi_size = (160, 160, 160)\n", - " sw_batch_size = 4\n", - " test_outputs = sliding_window_inference(\n", - " test_inputs, roi_size, sw_batch_size, model, overlap=0.5)\n", - " test_outputspre = [post_pred(i) for i in decollate_batch(test_outputs)] # Decollate our results\n", - " test_labelspre = [post_label(i) for i in decollate_batch(test_labels)]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9782ec96", - "metadata": {}, - "outputs": [], - "source": [ - "fig = plt.figure(frameon=False, figsize=(7,7))\n", - "plt.title('Actual Spleen')\n", - "plt.imshow(test_labelspre[0].cpu().numpy()[1][:,:,200], cmap='Greys_r') #Actual spleen" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "76cd38e6", - "metadata": {}, - "outputs": [], - "source": [ - "fig = plt.figure(frameon=False, figsize=(7,7))\n", - "plt.title('Pretrained CalculatedSpleen')\n", - "plt.imshow(test_outputspre[0].cpu().numpy()[1][:,:,200], cmap='Greys_r') #Pretrained model spleen" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "65c68242", - "metadata": {}, - "outputs": [], - "source": [ - "fig = plt.figure(frameon=False, figsize=(7,7))\n", - "plt.title('Differences Between Actual and Model')\n", - "pretraineddif = test_labelspre[0].cpu().numpy()[1][:,:,200] - test_outputspre[0].cpu().numpy()[1][:,:,200]\n", - "plt.imshow(pretraineddif, cmap='Greys_r') #Differences" - ] - }, - { - "cell_type": "markdown", - "id": "2f60e5b5", - "metadata": {}, - "source": [ - "Using just the pretrained model, it appears we are performing pretty well! 
We can now continue to train with our data using the NVIDIA models initial weights" - ] - }, - { - "cell_type": "markdown", - "id": "c3e40010", - "metadata": {}, - "source": [ - "## Training\n", - " Without a GPU, training can take a while, we recommend skipping next three cells and load in model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a8ad6aee", - "metadata": {}, - "outputs": [], - "source": [ - "loss_function = DiceFocalLoss(to_onehot_y=True, softmax=True)\n", - "optimizer = torch.optim.Adam(model.parameters(), 5e-4)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d91d340c", - "metadata": {}, - "outputs": [], - "source": [ - "max_epochs = 25\n", - "val_interval = 2\n", - "num_classes = 2\n", - "best_metric = -1\n", - "best_metric_epoch = -1\n", - "epoch_loss_values = []\n", - "metric_values = []\n", - "post_pred = Compose([EnsureType(), AsDiscrete(argmax=True, to_onehot=num_classes)])\n", - "post_label = Compose([EnsureType(), AsDiscrete(to_onehot=num_classes)])\n", - "dice_metric = DiceMetric(include_background=False, reduction=\"mean\", get_not_nans=False)\n", - "\n", - "for epoch in range(max_epochs):\n", - " print(\"-\" * 10)\n", - " print(f\"epoch {epoch + 1}/{max_epochs}\")\n", - " model.train()\n", - " epoch_loss = 0\n", - " step = 0\n", - " set_determinism(seed=42)\n", - " for batch_data in train_loader:\n", - " step += 1\n", - " inputs, labels = (\n", - " batch_data[\"image\"].to(device),\n", - " batch_data[\"label\"].to(device),\n", - " )\n", - " optimizer.zero_grad()\n", - " outputs = model(inputs)\n", - " loss = loss_function(outputs, labels)\n", - " loss.backward()\n", - " optimizer.step()\n", - " epoch_loss += loss.item()\n", - " print(\n", - " f\"{step}/{len(train_ds) // train_loader.batch_size}, \"\n", - " f\"train_loss: {loss.item():.4f}\")\n", - " epoch_loss /= step\n", - " epoch_loss_values.append(epoch_loss)\n", - " print(f\"epoch {epoch + 1} average loss: {epoch_loss:.4f}\")\n", - "\n", - " if (epoch + 1) % val_interval == 0:\n", - " model.eval()\n", - " with torch.no_grad():\n", - " set_determinism(seed=42)\n", - " for val_data in DataLoader(val_ds, batch_size=1, num_workers=2):\n", - " val_inputs, val_labels = (\n", - " val_data[\"image\"].to(device),\n", - " val_data[\"label\"].to(device),\n", - " )\n", - " roi_size = (160, 160, 160)\n", - " sw_batch_size = 4\n", - " val_outputs = sliding_window_inference(\n", - " val_inputs, roi_size, sw_batch_size, model, overlap=0.5)\n", - " val_outputs = [post_pred(i) for i in decollate_batch(val_outputs)]\n", - " val_labels = [post_label(i) for i in decollate_batch(val_labels)]\n", - " dice_metric(y_pred=val_outputs, y=val_labels)\n", - " metric = dice_metric.aggregate().item()\n", - " dice_metric.reset()\n", - " metric_values.append(metric)\n", - " if metric > best_metric:\n", - " best_metric = metric\n", - " best_metric_epoch = epoch + 1\n", - " torch.save(model.state_dict(), os.path.join(\n", - " root_dir, \"Spleen_best_metric_model_pretrained.pth\"))\n", - " print(\"saved new best metric model\")\n", - " print(\n", - " f\"current epoch: {epoch + 1} current mean dice: {metric:.4f}\"\n", - " f\"\\nbest mean dice: {best_metric:.4f} \"\n", - " f\"at epoch: {best_metric_epoch}\"\n", - " )\n", - "print(\n", - " f\"train completed, best_metric: {best_metric:.4f} \"\n", - " f\"at epoch: {best_metric_epoch}\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5cf1fd04", - "metadata": {}, - "outputs": [], - "source": [ - "plt.figure(\"train\", (12, 
6))\n", - "plt.subplot(1, 2, 1)\n", - "plt.title(\"Epoch Average Loss\")\n", - "x = [i + 1 for i in range(len(epoch_loss_values))]\n", - "y = epoch_loss_values\n", - "plt.xlabel(\"epoch\")\n", - "plt.ylim([0.1, 0.7])\n", - "plt.plot(x, y)\n", - "plt.subplot(1, 2, 2)\n", - "plt.title(\"Val Mean Dice\")\n", - "x = [val_interval * (i + 1) for i in range(len(metric_values))]\n", - "y = metric_values\n", - "plt.xlabel(\"epoch\")\n", - "plt.ylim([0, 1.0])\n", - "plt.plot(x, y)\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "4ff0035d", - "metadata": {}, - "source": [ - "The model shows that it has improved fairly quickly over just 25 epochs." - ] - }, - { - "cell_type": "markdown", - "id": "0499fa93", - "metadata": {}, - "source": [ - "## Inference\n", - "Without GPU skip to here to load previously trained best model (without a gpu the training will take a while)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "29441405", - "metadata": {}, - "outputs": [], - "source": [ - "model.load_state_dict(torch.load('monai_data/best_metric_model_pretrained.pth'))" - ] - }, - { - "cell_type": "markdown", - "id": "fab5b4b9", - "metadata": {}, - "source": [ - "With the model loaded let's see if much has changed for our example image." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "94615f38", - "metadata": {}, - "outputs": [], - "source": [ - "num_classes = 2\n", - "post_pred = Compose([EnsureType(), AsDiscrete(argmax=True, to_onehot=num_classes)])\n", - "post_label = Compose([EnsureType(), AsDiscrete(to_onehot=num_classes)])\n", - "model.eval()\n", - "with torch.no_grad():\n", - " for data in DataLoader(test_ds, batch_size=1, num_workers=2):\n", - " test_inputs, test_labels = (\n", - " data[\"image\"].to(device),\n", - " data[\"label\"].to(device),\n", - " )\n", - " roi_size = (160, 160, 160)\n", - " sw_batch_size = 4\n", - " test_outputs = sliding_window_inference(\n", - " test_inputs, roi_size, sw_batch_size, model, overlap=0.5)\n", - " test_outputsSpl = [post_pred(i) for i in decollate_batch(test_outputs)]\n", - " test_labelsSpl = [post_label(i) for i in decollate_batch(test_labels)]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a3f78dd4", - "metadata": {}, - "outputs": [], - "source": [ - "fig = plt.figure(frameon=False, figsize=(7,7))\n", - "plt.title('Trained Calculated Spleen')\n", - "plt.imshow(test_outputsSpl[0].cpu().numpy()[1][:,:,200], cmap='Greys_r') #Pretrained model spleen" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a67f89f2", - "metadata": {}, - "outputs": [], - "source": [ - "fig = plt.figure(frameon=False, figsize=(7,7))\n", - "plt.title('Differences Between Actual and Model')\n", - "traineddif = test_labelsSpl[0].cpu().numpy()[1][:,:,200] - test_outputsSpl[0].cpu().numpy()[1][:,:,200]\n", - "plt.imshow(traineddif, cmap='Greys_r') #Differences" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "382c7285", - "metadata": {}, - "outputs": [], - "source": [ - "fig = plt.figure(frameon=False, figsize=(7,7))\n", - "plt.title('Differences Between The Models')\n", - "modelsdif = test_outputspre[0].cpu().numpy()[1][:,:,200] - test_outputsSpl[0].cpu().numpy()[1][:,:,200]\n", - "plt.imshow(traineddif, cmap='Greys_r') #Differences" - ] - }, - { - "cell_type": "markdown", - "id": "6606bce2", - "metadata": {}, - "source": [ - "We see not much has changed, which is a good sign for how well the NVIDIA model performs out of the box." 
- ] - }, - { - "cell_type": "markdown", - "id": "5cfd20c6", - "metadata": {}, - "source": [ - "Here is the final image of our Spleen!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "91e83d40", - "metadata": {}, - "outputs": [], - "source": [ - "maskedspleen = np.ma.masked_where(test_outputsSpl[0].cpu().numpy()[1][:,:,200] == 0, test_outputsSpl[0].cpu().numpy()[1][:,:,200])\n", - "fig = plt.figure(frameon=False, figsize=(10,10))\n", - "plt.imshow(np.rot90(test_ds[0]['image'][0][:,:,200]), cmap='Greys_r')\n", - "plt.imshow(np.rot90(maskedspleen), cmap='viridis', alpha=1.0)" - ] - }, - { - "cell_type": "markdown", - "id": "6030d210", - "metadata": {}, - "source": [ - "Feel free to play around in this notebook or download it and use it where a GPU is accessible." - ] - }, - { - "cell_type": "markdown", - "id": "896388a1", - "metadata": {}, - "source": [ - "## Additional Exercise: Use liver segmentation in addition to spleen\n", - "Her we are loading in liver segmentation from NVIDIA. While we can't train this model, since we don't have training data, we can use it as a rough estimate." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "657e44a0", - "metadata": {}, - "outputs": [], - "source": [ - "mmarliver = {\n", - " RemoteMMARKeys.ID: \"clara_pt_liver_and_tumor_ct_segmentation_1\",\n", - " RemoteMMARKeys.NAME: \"clara_pt_liver_and_tumor_ct_segmentation\",\n", - " RemoteMMARKeys.FILE_TYPE: \"zip\",\n", - " RemoteMMARKeys.HASH_TYPE: \"md5\",\n", - " RemoteMMARKeys.HASH_VAL: None,\n", - " RemoteMMARKeys.MODEL_FILE: os.path.join(\"models\", \"model.pt\"),\n", - " RemoteMMARKeys.CONFIG_FILE: os.path.join(\"config\", \"config_train.json\"),\n", - " RemoteMMARKeys.VERSION: 1,\n", - "}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a6fb0da7", - "metadata": {}, - "outputs": [], - "source": [ - " try: #MONAI=0.8\n", - " unet_model = load_from_mmar(\n", - " item = mmarliver['name'], \n", - " mmar_dir=root_dir,\n", - " map_location=device,\n", - " version=mmarliver['version'],\n", - " pretrained=True)\n", - " except: #MONAI<0.8\n", - " unet_model = load_from_mmar(\n", - " mmarliver, \n", - " mmar_dir=root_dir,\n", - " map_location=device,\n", - " pretrained=True)\n", - " model = unet_model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "55034354", - "metadata": {}, - "outputs": [], - "source": [ - "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n", - "\n", - "print(\"using a pretrained model.\")\n", - "try: #MONAI=0.8\n", - " unet_model = load_from_mmar(\n", - " item = mmarliver['name'], \n", - " mmar_dir=root_dir,\n", - " map_location=device,\n", - " version=mmarliver['version'],\n", - " pretrained=True)\n", - "except: #MONAI<0.8\n", - " unet_model = load_from_mmar(\n", - " mmarliver, \n", - " mmar_dir=root_dir,\n", - " map_location=device,\n", - " pretrained=True)\n", - "model = unet_model.to(device)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a79c1731", - "metadata": {}, - "outputs": [], - "source": [ - "num_classesP=3\n", - "num_classesL=2\n", - "post_pred = Compose([EnsureType(), AsDiscrete(argmax=True, to_onehot=num_classesP)])\n", - "post_label = Compose([EnsureType(), AsDiscrete(to_onehot=num_classesL)])\n", - "model.eval()\n", - "with torch.no_grad():\n", - " for data in DataLoader(test_ds, batch_size=1, num_workers=2):\n", - " test_inputs, test_labels = (\n", - " data[\"image\"].to(device),\n", - " data[\"label\"].to(device),\n", - " )\n", 
- " roi_size = (160, 160, 160)\n", - " sw_batch_size = 4\n", - " test_outputs = sliding_window_inference(\n", - " test_inputs, roi_size, sw_batch_size, model, overlap=0.5)\n", - " test_outputsliv = [post_pred(i) for i in decollate_batch(test_outputs)] # Decollate our results\n", - " test_labelsliv = [post_label(i) for i in decollate_batch(test_labels)]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c0956706", - "metadata": {}, - "outputs": [], - "source": [ - "sliceval = 215\n", - "maskedliv = np.ma.masked_where(test_outputsliv[0].cpu().numpy()[1][:,:,sliceval] == 0, test_outputsliv[0].cpu().numpy()[1][:,:,sliceval])\n", - "maskedspleen = np.ma.masked_where(test_outputsSpl[0].cpu().numpy()[1][:,:,sliceval] == 0, test_outputsSpl[0].cpu().numpy()[1][:,:,sliceval])\n", - "fig = plt.figure(frameon=False, figsize=(7,7))\n", - "plt.title('Pretrained Calculated Liver and spleen')\n", - "plt.imshow(np.rot90(test_ds[0]['image'][0][:,:,sliceval]), cmap='Greys_r')\n", - "plt.imshow(np.rot90(maskedliv), cmap='cividis', alpha=0.75)\n", - "plt.imshow(np.rot90(maskedspleen), cmap='viridis', alpha=0.75)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5bdfdbe9", - "metadata": {}, - "outputs": [], - "source": [ - "sliceval = 110\n", - "maskedliv = np.ma.masked_where(test_outputsliv[0].cpu().numpy()[1][:,sliceval,:] == 0, test_outputsliv[0].cpu().numpy()[1][:,sliceval,:])\n", - "maskedspleen = np.ma.masked_where(test_outputsSpl[0].cpu().numpy()[1][:,sliceval,:] == 0, test_outputsSpl[0].cpu().numpy()[1][:,sliceval,:])\n", - "fig = plt.figure(frameon=False, figsize=(7,7))\n", - "plt.title('Pretrained Calculated Liver and Spleen')\n", - "plt.imshow(np.rot90(test_ds[0]['image'][0][:,sliceval,:]), cmap='Greys_r')\n", - "plt.imshow(np.rot90(maskedliv), cmap='cividis', alpha=0.75)\n", - "plt.imshow(np.rot90(maskedspleen), cmap='viridis', alpha=0.75)" - ] - }, - { - "cell_type": "markdown", - "id": "af1169b6", - "metadata": {}, - "source": [ - "Continue including more models found at the NGC Catalog: https://catalog.ngc.nvidia.com/models. We recommend filtering by 'CT'." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusions\n", - "Here you learned how to use NVIDIA pre-trained models for image segmentation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Clean up\n", - "Shut down your compute environment and delete any resource groups associated with this notebook." 
- ] - } - ], - "metadata": { - "environment": { - "name": "pytorch-gpu.1-9.m75", - "type": "gcloud", - "uri": "gcr.io/deeplearning-platform-release/pytorch-gpu.1-9:m75" - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/tutorials/notebooks/SpleenLiverSegmentation/monai_data/Spleen_best_metric_model_pretrained.pth b/tutorials/notebooks/SpleenLiverSegmentation/monai_data/Spleen_best_metric_model_pretrained.pth deleted file mode 100644 index 61ed04c..0000000 Binary files a/tutorials/notebooks/SpleenLiverSegmentation/monai_data/Spleen_best_metric_model_pretrained.pth and /dev/null differ diff --git a/tutorials/notebooks/pangolin/pangolin_pipeline.ipynb b/tutorials/notebooks/pangolin/pangolin_pipeline.ipynb deleted file mode 100644 index 453aa95..0000000 --- a/tutorials/notebooks/pangolin/pangolin_pipeline.ipynb +++ /dev/null @@ -1,361 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "31e8c3cd", - "metadata": {}, - "source": [ - "# Pangolin SARS-CoV-2 Pipeline Notebook" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview \n", - "SARS-CoV-2 sequence is usually analyzed using a bioinformatic pipeline called Pangolin. Here we will download some genomic data and run Pangolin following [standard instructions](https://cov-lineages.org/resources/pangolin/usage.html). " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "We assume you have access to Azure AI Studio and have already deployed an LLM " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Learning objectives\n", - "+ Download genomic data from NCBI from the commnd line\n", - "+ Run pangolin to identify viral lineages\n", - "+ Generate a phylogeny to visualize lineage identity" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "id": "03541941", - "metadata": {}, - "source": [ - "### Install packages" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f994b990", - "metadata": {}, - "outputs": [], - "source": [ - "#change this depending on how many threads are available in your notebook\n", - "CPU=4" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a19b662e", - "metadata": {}, - "outputs": [], - "source": [ - "! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n", - "! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a40f7ebc", - "metadata": {}, - "outputs": [], - "source": [ - "#add to your path\n", - "import os\n", - "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f421805e", - "metadata": {}, - "outputs": [], - "source": [ - "#install biopython to import packages below\n", - "! pip install biopython" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fd936fd6", - "metadata": {}, - "outputs": [], - "source": [ - "! 
mamba install ipyrad iqtree -c conda-forge -c bioconda" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5a99cf0d", - "metadata": {}, - "outputs": [], - "source": [ - "#import libraries\n", - "import os\n", - "from Bio import SeqIO\n", - "from Bio import Entrez\n", - "import ipyrad.analysis as ipa\n", - "import toytree" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Set up directory structure" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8f831fca", - "metadata": {}, - "outputs": [], - "source": [ - "if not os.path.exists('pangolin_analysis'):\n", - " os.mkdir('pangolin_analysis')\n", - "os.chdir('pangolin_analysis')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6423ca5d", - "metadata": {}, - "outputs": [], - "source": [ - "if os.path.exists('sarscov2_sequences.fasta'):\n", - " os.remove('sarscov2_sequences.fasta')\n", - "!rm sarscov2_*\n", - "!rm lineage_report.csv" - ] - }, - { - "cell_type": "markdown", - "id": "9d7015e6", - "metadata": {}, - "source": [ - "### Fetch viral sequences using a list of accession IDs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "16824bcf", - "metadata": {}, - "outputs": [], - "source": [ - "#give a list of accession number for covid sequences\n", - "acc_nums=['NC_045512','LR757995','LR757996','OL698718','OL677199','OL672836','MZ914912','MZ916499','MZ908464','MW580573','MW580574','MW580576','MW991906','MW931310','MW932027','MW424864','MW453109','MW453110']\n", - "print('the number of sequences we will analyze = ',len(acc_nums))" - ] - }, - { - "cell_type": "markdown", - "id": "9e382d33", - "metadata": {}, - "source": [ - "Let this block run without going to the next until it finishes, otherwise you may get an error about too many requests. If that happens, reset your kernel and just rerun everything (except installing software)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a28a7122", - "metadata": {}, - "outputs": [], - "source": [ - "#use the bio.entrez toolkit within biopython to download the accession numbers\n", - "#save those sequences to a single fasta file\n", - "Entrez.email = \"email@example.com\" # Always tell NCBI who you are\n", - "filename = \"sarscov2_seqs.fasta\"\n", - "if not os.path.isfile(filename):\n", - " # Downloading...\n", - " for acc in acc_nums:\n", - " net_handle = Entrez.efetch(\n", - " db=\"nucleotide\", id=acc, rettype=\"fasta\", retmode=\"text\"\n", - " )\n", - " out_handle = open(filename, \"a\")\n", - " out_handle.write(net_handle.read())\n", - " out_handle.close()\n", - " net_handle.close()\n", - " print(\"Saved\",acc)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "56acb7cc", - "metadata": {}, - "outputs": [], - "source": [ - "#make sure our fasta file has the same number of seqs as the acc_nums list\n", - "print('the number of seqs in our fasta file: ')\n", - "! grep '>' sarscov2_seqs.fasta | wc -l" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8606c352", - "metadata": {}, - "outputs": [], - "source": [ - "#let's peek at our new fasta file\n", - "! head sarscov2_seqs.fasta" - ] - }, - { - "cell_type": "markdown", - "id": "2db37b4e", - "metadata": { - "tags": [] - }, - "source": [ - "### Run pangolin to identify lineages and output alignment\n", - "Here we call pangolin, give it our input sequences and the number of threads. We also tell it to output the alignment. 
The full list of pangolin parameters can be found in the [docs](https://cov-lineages.org/resources/pangolin/usage.html)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f1a17a74", - "metadata": {}, - "outputs": [], - "source": [ - "! pangolin sarscov2_seqs.fasta --alignment --threads $CPU" - ] - }, - { - "cell_type": "markdown", - "id": "b0e56a4b", - "metadata": {}, - "source": [ - "You can view the output file from pangolin called lineage_report.csv (within pangolin_analysis folder) by double clicking on the file, or by right clicking and downloading. What lineages are present in the dataset? Is Omicron in there?" - ] - }, - { - "cell_type": "markdown", - "id": "37e6efbe", - "metadata": {}, - "source": [ - "### Run iqtree to estimate maximum likelihood tree for our sequences\n", - "iqtree can find the best nucleotide model for the data, but here we are going to assign a model to save time (HKY) and just estimate the phylogeny without any bootstrap support values. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f2782855", - "metadata": {}, - "outputs": [], - "source": [ - "#run iqtree with threads = $CPU variable, if you exclude the -m it will do a phylogenetic model search before tree search\n", - "! iqtree -s sequences.aln.fasta -nt $CPU -m HKY --prefix sarscov2_tree --redo-tree" - ] - }, - { - "cell_type": "markdown", - "id": "c7197dd4", - "metadata": {}, - "source": [ - "### Visualize the tree with toytree" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cef2ba18", - "metadata": {}, - "outputs": [], - "source": [ - "#Define the tree file\n", - "tre = toytree.tree('sarscov2_tree.treefile')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "842af165", - "metadata": {}, - "outputs": [], - "source": [ - "#draw the tree\n", - "rtre = tre.root(wildcard=\"OL\")\n", - "rtre.draw(tip_labels_align=True);" - ] - }, - { - "cell_type": "markdown", - "id": "52d9389f", - "metadata": {}, - "source": [ - "You can also visualize the tree by downloading it and opening in figtree." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusions\n", - "Here you learned how to use Azure ML Studio to conduct a basic phylogenetic analysis" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Clean Up\n", - "Make sure you stop your compute instance and if desired, delete the resource group associated with this tutorial." 
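If you would rather inspect the Pangolin output programmatically than open `lineage_report.csv` by hand, a short pandas sketch can tabulate the lineage assignments. This assumes the report uses Pangolin's standard `taxon` and `lineage` columns.

```python
import pandas as pd

# lineage_report.csv is written by the pangolin run above (inside pangolin_analysis/).
report = pd.read_csv("lineage_report.csv")

# How many sequences were assigned to each lineage?
print(report["lineage"].value_counts())

# Any Omicron-associated assignments (B.1.1.529 / BA.*)?
omicron = report[report["lineage"].str.startswith(("B.1.1.529", "BA"), na=False)]
print(omicron[["taxon", "lineage"]])
```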
- ] - } - ], - "metadata": { - "environment": { - "kernel": "python3", - "name": "r-cpu.4-1.m87", - "type": "gcloud", - "uri": "gcr.io/deeplearning-platform-release/r-cpu.4-1:m87" - }, - "kernelspec": { - "display_name": "conda_amazonei_mxnet_p36", - "language": "python", - "name": "conda_amazonei_mxnet_p36" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.13" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/tutorials/notebooks/rnaseq-myco-tutorial-main/LICENSE b/tutorials/notebooks/rnaseq-myco-tutorial-main/LICENSE deleted file mode 100644 index d420629..0000000 --- a/tutorials/notebooks/rnaseq-myco-tutorial-main/LICENSE +++ /dev/null @@ -1,21 +0,0 @@ -MIT License - -Copyright (c) 2021 MaineINBRE - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. diff --git a/tutorials/notebooks/rnaseq-myco-tutorial-main/README.md b/tutorials/notebooks/rnaseq-myco-tutorial-main/README.md deleted file mode 100644 index 42bf099..0000000 --- a/tutorials/notebooks/rnaseq-myco-tutorial-main/README.md +++ /dev/null @@ -1,2 +0,0 @@ -# rnaseq-myco-tutorial -Tutorial on RNA-Seq data analysis from a study of gene expression in a prokaryote. Open the notebook in AzureML and try and run all the way through. Learn about downloading data, conda environments, and bash commands. diff --git a/tutorials/notebooks/rnaseq-myco-tutorial-main/RNAseq_pipeline.ipynb b/tutorials/notebooks/rnaseq-myco-tutorial-main/RNAseq_pipeline.ipynb deleted file mode 100644 index fe594c4..0000000 --- a/tutorials/notebooks/rnaseq-myco-tutorial-main/RNAseq_pipeline.ipynb +++ /dev/null @@ -1,493 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# RNA-Seq Analysis Training Demo on Azure" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This short tutorial demonstrates how to run an RNA-Seq workflow using a prokaryotic data set. Steps in the workflow include read trimming, read QC, read mapping, and counting mapped reads per gene to quantitative gene expression." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "We assume you have provisioned a compute environment in Azure ML Studio" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Learning objectives\n", - "+ Learn how to copy data to and from Blob storage\n", - "+ Learn how to run and visualize basic RNAseq analysis" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get started" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Install packages" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that within Jupyter you can run a bash command either by using the magic '!' in front of your command, or by adding %%bash to the top of your cell." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For example\n", - "```\n", - "%%bash\n", - "example command\n", - "```\n", - "Or\n", - "```\n", - "!example command\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The first step is to install mambaforge, which is the newer and faster version of the conda package manager." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n", - "! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1682515170386 - } - }, - "outputs": [], - "source": [ - "#add to your path\n", - "import os\n", - "os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "! mamba info --envs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, we will install the necessary packages into the current environment." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "! mamba install -c conda-forge -c bioconda -c defaults -y sra-tools pigz pbzip2 fastp fastqc multiqc salmon" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a set of directories to store the reads, reference sequence files, and output files.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%bash\n", - "mkdir -p data\n", - "mkdir -p data/raw_fastq\n", - "mkdir -p data/trimmed\n", - "mkdir -p data/fastqc\n", - "mkdir -p data/aligned\n", - "mkdir -p data/reference\n", - "mkdir -p data/quants" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Copy FASTQ Files\n", - "In order for this tutorial to run quickly, we will only analyze 50,000 reads from a sample from both sample groups instead of analyzing all the reads from all six samples. These files have been posted on a Azure Blob storage containers that we made publicly accessible." 
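The next cell grabs the files with `curl`; if you prefer to stay in Python, the same public Blob URLs can be fetched with the standard library. This sketch writes into the `data/raw_fastq` directory created above.

```python
import urllib.request

# Same public Blob container used by the curl commands below.
base = "https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq"
files = [
    "raw_fastq/SRR13349122_1.fastq",
    "raw_fastq/SRR13349122_2.fastq",
    "raw_fastq/SRR13349128_1.fastq",
    "raw_fastq/SRR13349128_2.fastq",
]

for path in files:
    urllib.request.urlretrieve(f"{base}/{path}", f"data/{path}")
    print("downloaded", path)
```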
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349122_1.fastq --output data/raw_fastq/SRR13349122_1.fastq\n", - "!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349122_2.fastq --output data/raw_fastq/SRR13349122_2.fastq\n", - "!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349128_1.fastq --output data/raw_fastq/SRR13349128_1.fastq\n", - "!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/raw_fastq/SRR13349128_2.fastq --output data/raw_fastq/SRR13349128_2.fastq" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Copy reference transcriptome files that will be used by Salmon\n", - "Salmon is a tool that aligns RNA-Seq reads to a set of transcripts rather than the entire genome." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/reference/M_chelonae_transcripts.fasta --output data/reference/M_chelonae_transcripts.fasta\n", - "!curl https://storeshare.blob.core.windows.net/publicdata/testsample/RNAseq/reference/decoys.txt --output data/reference/decoys.txt" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1682517580413 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "ls data/raw_fastq" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Trim our data with Fastp" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "! fastp -i data/raw_fastq/SRR13349122_1.fastq -I data/raw_fastq/SRR13349122_2.fastq -o data/trimmed/SRR13349122_1_trimmed.fastq -O data/trimmed/SRR13349122_2_trimmed.fastq\n", - "! fastp -i data/raw_fastq/SRR13349128_1.fastq -I data/raw_fastq/SRR13349128_2.fastq -o data/trimmed/SRR13349128_1_trimmed.fastq -O data/trimmed/SRR13349128_2_trimmed.fastq" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Run FastQC\n", - "FastQC is an invaluable tool that allows you to evaluate whether there are problems with a set of reads. For example, it will provide a report of whether there is any bias in the sequence composition of the reads." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Once FastQC is done running, look at the outputs in data/fastqc. What can you say about the quality of the two samples we are looking at here? 
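A quick sanity check before digging into the FastQC reports is to confirm how many reads survived trimming. FASTQ stores one record per four lines, so a simple line count works; this sketch assumes the fastp commands above wrote their output into `data/trimmed`.

```python
import glob

# One FASTQ record spans four lines, so reads = line count / 4.
for path in sorted(glob.glob("data/trimmed/*_trimmed.fastq")):
    with open(path) as fastq:
        n_lines = sum(1 for _ in fastq)
    print(f"{path}: {n_lines // 4} reads")
```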
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%bash\n", - "fastqc -o data/fastqc data/trimmed/SRR13349122_1_trimmed.fastq\n", - "fastqc -o data/fastqc data/trimmed/SRR13349128_1_trimmed.fastq" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Run MultiQC\n", - "MultiQC reads in the FastQQ reports and generate a compiled report for all the analyzed FASTQ files.\n", - "Just as with fastqc, we can look at the mulitqc results after it finishes at data/multiqc_data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1682517201690 - } - }, - "outputs": [], - "source": [ - "! multiqc -f data/fastqc -f\n", - "#! mv multiqc_data/ data/" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Index the Transcriptome so that Trimmed Reads Can Be Mapped Using Salmon" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "! salmon index -t data/reference/M_chelonae_transcripts.fasta -p 8 -i data/reference/transcriptome_index --decoys data/reference/decoys.txt -k 31 --keepDuplicates" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Run Salmon to Map Reads to Transcripts and Quantify Expression Levels\n", - "Salmon aligns the trimmed reads to the reference transcriptome and generates the read counts per transcript. In this analysis, each gene has a single transcript." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "%%bash\n", - "salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349122_1_trimmed.fastq -p 8 --validateMappings -o data/quants/SRR13349122_quant\n", - "salmon quant -i data/reference/transcriptome_index -l SR -r data/trimmed/SRR13349128_1_trimmed.fastq -p 8 --validateMappings -o data/quants/SRR13349128_quant" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "gather": { - "logged": 1682518630201 - }, - "jupyter": { - "outputs_hidden": false, - "source_hidden": false - }, - "nteract": { - "transient": { - "deleting": false - } - } - }, - "outputs": [], - "source": [ - "ls data/quants/" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Report the top 10 most highly expressed genes in the samples" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Top 10 most highly expressed genes in the wild-type sample.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "! sort -nrk 4,4 data/quants/SRR13349122_quant/quant.sf | head -10" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Top 10 most highly expressed genes in the double lysogen sample.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!sort -nrk 4,4 data/quants/SRR13349128_quant/quant.sf | head -10" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Report the expression of a putative acyl-ACP desaturase (BB28_RS16545) that was downregulated in the double lysogen relative to wild-type\n", - "A acyl-transferase was reported to be downregulated in the double lysogen as shown in the table of the top 20 upregulated and downregulated genes from the paper describing the study." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use `grep` to report the expression in the wild-type sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields: \n", - "`Name Length EffectiveLength TPM NumReads`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!grep 'BB28_RS16545' data/quants/SRR13349122_quant/quant.sf" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use `grep` to report the expression in the double lysogen sample. The fields in the Salmon `quant.sf` file are as follows. The level of expression is reported in the Transcripts Per Million (`TPM`) and number of reads (`NumReads`) fields: \n", - "`Name Length EffectiveLength TPM NumReads`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!grep 'BB28_RS16545' data/quants/SRR13349128_quant/quant.sf" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "Here you learned how to import data to and from a Blob storage container and then use fastq files to run basic RNAseq analysis! " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Clean Up\n", - "Make sure you stop your compute instance and if desired, delete the resource group associated with this tutorial." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - } - ], - "metadata": { - "kernel_info": { - "name": "python3" - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.13" - }, - "microsoft": { - "ms_spell_check": { - "ms_spell_check_language": "en" - } - }, - "nteract": { - "version": "nteract-front-end@1.0.0" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/tutorials/notebooks/rnaseq-myco-tutorial-main/images/count-workflow.png b/tutorials/notebooks/rnaseq-myco-tutorial-main/images/count-workflow.png deleted file mode 100644 index 1a873dd..0000000 Binary files a/tutorials/notebooks/rnaseq-myco-tutorial-main/images/count-workflow.png and /dev/null differ diff --git a/tutorials/notebooks/rnaseq-myco-tutorial-main/images/rnaseq-workflow.png b/tutorials/notebooks/rnaseq-myco-tutorial-main/images/rnaseq-workflow.png deleted file mode 100644 index 2620809..0000000 Binary files a/tutorials/notebooks/rnaseq-myco-tutorial-main/images/rnaseq-workflow.png and /dev/null differ diff --git a/tutorials/notebooks/rnaseq-myco-tutorial-main/images/table-cushman.png b/tutorials/notebooks/rnaseq-myco-tutorial-main/images/table-cushman.png deleted file mode 100644 index ced39b9..0000000 Binary files a/tutorials/notebooks/rnaseq-myco-tutorial-main/images/table-cushman.png and /dev/null differ