Add Vector Based Text2SQL Code and Approach #14

Merged · 29 commits · Sep 12, 2024

Commits (all authored by BenConstable9 on Sep 11, 2024):
66da626  Add deployment for text2sql index
cf69d7e  Add entities for new version
0b43c72  Work in progress
6bb3c3d  Add deployment for text2sql index
3953ab8  Add entities for new version
c9ac623  Work in progress
6706df8  Merge branch 'feautre/text2sql-with-vector' of https://github.com/mic…
e21bcba  refactor the location
ef04b1e  Update envs
fbfdfd1  Update the envs
1dbe906  Update envs
93fef49  Finish indexer building
7fcae02  Update the location of the dictionary
78d1f7e  Update data dictionary
7826d7d  Update the plugin to load jsonl
abf099a  Update content
482e546  Move to use a skillset for indexing
7424246  Update the scripts
a154fcb  Improve the readme
27c1f33  Update the naming
7d9082f  Fix bad replacement
30f2802  Update the readmes
353a684  Update readme
79e454a  Update the readme
4a270e8  Run the vector example
15c6d82  Update the env
27dda9c  Update the code
6225006  Update readme
2af5d98  Update main readme
Files changed:
10 changes: 6 additions & 4 deletions README.md
@@ -6,11 +6,13 @@ It is intended that the plugins and skills provided in this repository, are adapted

 ## Components

-- `./text2sql` contains an Multi-Shot implementation for Text2SQL generation and querying which can be used to answer questions backed by a database as a knowledge base.
-- `./ai_search_with_adi_function_app` contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models (gpt4o) to interpret and understand these.
-- `./deploy_ai_search` provides an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search.
+- `./text_2_sql` contains two Multi-Shot implementations for Text2SQL generation and querying which can be used to answer questions backed by a database as a knowledge base. A prompt based and a vector based approach are shown, both of which exhibit great performance in answering SQL queries. With these plugins, your RAG application can now access and pull data from any SQL table exposed to it to answer questions.
+- `./adi_function_app` contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models (gpt4o) to interpret and understand these. With this custom skill, the RAG application can draw insights from complex charts and images during the vector search.
+- `./deploy_ai_search` provides an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search and for Text2SQL.

-The above components have been successfully used on production RAG projects to increase the quality of responses. The code provided in this repo is a sample of the implementation and should be adjusted before being used in production.
+The above components have been successfully used on production RAG projects to increase the quality of responses.
+
+_The code provided in this repo is a sample of the implementation and should be adjusted before being used in production._

 ## High Level Implementation
12 changes: 12 additions & 0 deletions adi_function_app/.env
@@ -0,0 +1,12 @@
+FunctionApp__ClientId=<clientId of the function app if using user assigned managed identity>
+IdentityType=<identityType> # system_assigned or user_assigned or key
+OpenAI__ApiKey=<openAIKey if using non managed identity>
+OpenAI__Endpoint=<openAIEndpoint>
+OpenAI__MultiModalDeployment=<openAIMultiModalDeploymentId>
+OpenAI__ApiVersion=<openAIApiVersion>
+AIService__DocumentIntelligence__Endpoint=<documentIntelligenceEndpoint>
+AIService__DocumentIntelligence__Key=<documentIntelligenceKey if not using identity>
+AIService__Language__Endpoint=<languageEndpoint>
+AIService__Language__Key=<languageKey if not using identity>
+StorageAccount__Endpoint=<Endpoint if using identity based connections>
+StorageAccount__ConnectionString=<connectionString if using non managed identity>
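
At runtime, the function app uses the `IdentityType` setting to decide how to authenticate against each service. A minimal sketch of that pattern, assembled from the diffs below (the helper name is illustrative, not a function defined in this PR):

import os

from azure.core.credentials import AzureKeyCredential
from azure.identity.aio import DefaultAzureCredential


def get_document_intelligence_credential():
    """Pick a credential based on the IdentityType setting (illustrative sketch)."""
    identity_type = os.environ.get("IdentityType", "key")

    if identity_type == "system_assigned":
        # System assigned managed identity: no client id needed.
        return DefaultAzureCredential()
    if identity_type == "user_assigned":
        # User assigned managed identity: pass the client id from the env file.
        return DefaultAzureCredential(
            managed_identity_client_id=os.environ["FunctionApp__ClientId"]
        )
    # Key based authentication falls back to the service key.
    return AzureKeyCredential(os.environ["AIService__DocumentIntelligence__Key"])
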
4 changes: 2 additions & 2 deletions adi_function_app/README.md
@@ -43,11 +43,11 @@ The properties returned from the ADI Custom Skill are then used to perform the following

 ## Deploying AI Search Setup

-To deploy the pre-built index and associated indexer / skillset setup, see instructions in `./ai_search/README.md`.
+To deploy the pre-built index and associated indexer / skillset setup, see instructions in `./deploy_ai_search/README.md`.

 ## ADI Custom Skill

-Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_deploy_ai_search` HTTP endpoint.
+Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.

 To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.
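
For context, AI Search custom skills exchange a standard `values` / `recordId` / `data` JSON envelope. A minimal Python sketch of a test request to the endpoint mentioned above (the `source` field and the function key handling are assumptions for illustration, not something this PR documents):

import requests

# Illustrative payload in the AI Search custom skill format; the `source`
# field is an assumed input and may not match the skill's actual contract.
payload = {
    "values": [
        {
            "recordId": "0",
            "data": {
                "source": "https://<storage-account>.blob.core.windows.net/<container>/example.pdf"
            },
        }
    ]
}

response = requests.post(
    "https://<function-app>.azurewebsites.net/api/adi_2_ai_search",
    json=payload,
    params={"code": "<function-key>"},  # function-level key, if auth is enabled
    timeout=300,
)
print(response.json())
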
10 changes: 6 additions & 4 deletions adi_function_app/adi_2_ai_search.py
@@ -188,11 +188,11 @@ async def understand_image_with_gptv(image_base64, caption, tries_left=3):
                 "role": "user",
                 "content": [
                     {
-                        "type": "text",
+                        "Type": "text",
                         "text": user_input,
                     },
                     {
-                        "type": "image_url",
+                        "Type": "image_url",
                         "image_url": {
                             "url": f"data:image/png;base64,{image_base64}"
                         },
@@ -371,10 +371,12 @@ async def analyse_document(file_path: str) -> AnalyzeResult:
             managed_identity_client_id=os.environ["FunctionApp__ClientId"]
         )
     else:
-        credential = AzureKeyCredential(os.environ["AIService__Services__Key"])
+        credential = AzureKeyCredential(
+            os.environ["AIService__DocumentIntelligence__Key"]
+        )

     async with DocumentIntelligenceClient(
-        endpoint=os.environ["AIService__Services__Endpoint"],
+        endpoint=os.environ["AIService__DocumentIntelligence__Endpoint"],
         credential=credential,
     ) as document_intelligence_client:
         poller = await document_intelligence_client.begin_analyze_document(
4 changes: 2 additions & 2 deletions adi_function_app/key_phrase_extraction.py
@@ -45,9 +45,9 @@ async def extract_key_phrases_from_text(
             managed_identity_client_id=os.environ.get("FunctionApp__ClientId")
         )
     else:
-        credential = AzureKeyCredential(os.environ.get("AIService__Services__Key"))
+        credential = AzureKeyCredential(os.environ.get("AIService__Language__Key"))
     text_analytics_client = TextAnalyticsClient(
-        endpoint=os.environ.get("AIService__Services__Endpoint"),
+        endpoint=os.environ.get("AIService__Language__Endpoint"),
         credential=credential,
     )
22 changes: 22 additions & 0 deletions deploy_ai_search/.env
@@ -0,0 +1,22 @@
+FunctionApp__Endpoint=<functionAppEndpoint>
+FunctionApp__Key=<functionAppKey>
+FunctionApp__PreEmbeddingCleaner__FunctionName=pre_embedding_cleaner
+FunctionApp__ADI__FunctionName=adi_2_ai_search
+FunctionApp__KeyPhraseExtractor__FunctionName=key_phrase_extractor
+FunctionApp__AppRegistrationResourceId=<App registration in form api://appRegistrationclientId if using identity based connections>
+IdentityType=<identityType> # system_assigned or user_assigned or key
+AIService__AzureSearchOptions__Endpoint=<searchServiceEndpoint>
+AIService__AzureSearchOptions__Identity__ClientId=<clientId if using user assigned identity>
+AIService__AzureSearchOptions__Key=<searchServiceKey if not using identity>
+AIService__AzureSearchOptions__UsePrivateEndpoint=<true/false>
+AIService__AzureSearchOptions__Identity__FQName=<fully qualified name of the identity if using user assigned identity>
+StorageAccount__FQEndpoint=<Fully qualified endpoint in form ResourceId=resourceId if using identity based connections>
+StorageAccount__ConnectionString=<connectionString if using non managed identity>
+StorageAccount__RagDocuments__Container=<containerName>
+StorageAccount__Text2Sql__Container=<containerName>
+OpenAI__ApiKey=<openAIKey if using non managed identity>
+OpenAI__Endpoint=<openAIEndpoint>
+OpenAI__EmbeddingModel=<openAIEmbeddingModelName>
+OpenAI__EmbeddingDeployment=<openAIEmbeddingDeploymentId>
+OpenAI__EmbeddingDimensions=1536
+Text2Sql__DatabaseName=<databaseName>
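
The `OpenAI__*` settings above feed the embedding step of the skillset. As a rough sketch of how they could be wired into an Azure OpenAI embedding skill (illustrative only, assuming azure-search-documents 11.5.x parameter names; the repository's actual `get_skills()` implementation may differ):

import os

from azure.search.documents.indexes.models import (
    AzureOpenAIEmbeddingSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
)

# Illustrative embedding skill built from the OpenAI__* settings above.
embedding_skill = AzureOpenAIEmbeddingSkill(
    name="embedding-skill",
    description="Generate embeddings for each chunk (sketch).",
    context="/document/pages/*",
    resource_uri=os.environ["OpenAI__Endpoint"],
    deployment_id=os.environ["OpenAI__EmbeddingDeployment"],
    api_key=os.environ.get("OpenAI__ApiKey"),  # omitted when using managed identity
    inputs=[InputFieldMappingEntry(name="text", source="/document/pages/*")],
    outputs=[OutputFieldMappingEntry(name="embedding", target_name="vector")],
)
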
16 changes: 13 additions & 3 deletions deploy_ai_search/README.md
@@ -1,18 +1,28 @@
-# AI Search Indexing with Azure Document Intelligence - Pre-built Index Setup
+# AI Search Indexing Pre-built Index Setup

 The associated scripts in this portion of the repository contain pre-built scripts to deploy the skillset with Azure Document Intelligence.

-## Steps
+## Steps for Rag Documents Index Deployment

 1. Update the `.env` file with the associated values. Not all values are required, depending on whether you are using System / User Assigned Identities or key based authentication.
 2. Adjust `rag_documents.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline. Make any adjustments here in the skills needed to enrich the data source.
 3. Run `deploy.py` with the following args:

-   - `indexer_type rag`. This selects the `rag_documents` sub class.
+   - `indexer_type rag`. This selects the `RagDocumentsAISearch` sub class.
    - `enable_page_chunking True`. This determines whether page wise chunking is applied in ADI, or whether the inbuilt skill is used for TextSplit. **Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks, when the chunking is performed.**
    - `rebuild`. Whether to delete and rebuild the index.
    - `suffix`. Optional parameter that will apply a suffix onto the deployed index and indexer. This is useful if you want to deploy a test version before overwriting the main version.

+## Steps for Text2SQL Index Deployment
+
+1. Update the `.env` file with the associated values. Not all values are required, depending on whether you are using System / User Assigned Identities or key based authentication.
+2. Adjust `text_2_sql.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline. Make any adjustments here in the skills needed to enrich the data source.
+3. Run `deploy.py` with the following args:
+
+   - `indexer_type text_2_sql`. This selects the `Text2SqlAISearch` sub class.
+   - `rebuild`. Whether to delete and rebuild the index.
+   - `suffix`. Optional parameter that will apply a suffix onto the deployed index and indexer. This is useful if you want to deploy a test version before overwriting the main version.

 ## ai_search.py & environment.py

 This includes a variety of helper files and scripts to deploy the index setup. This is useful for CI/CD to avoid having to write JSON files manually or use the UI to deploy the pipeline.
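
Taken together, the two deployment paths map onto a simple dispatch in `deploy.py` (shown in the diff further down). A condensed sketch of how an invocation might be parsed and routed (the `--flag` spellings and boolean handling here are assumptions, since the README only names the arguments):

import argparse

from rag_documents import RagDocumentsAISearch
from text_2_sql import Text2SqlAISearch

# Condensed sketch of the dispatch performed by deploy.py.
parser = argparse.ArgumentParser()
parser.add_argument("--indexer_type", required=True)
parser.add_argument("--rebuild", action="store_true")
parser.add_argument("--enable_page_chunking", action="store_true")
parser.add_argument("--suffix", default=None)
args = parser.parse_args()

if args.indexer_type == "rag":
    index_config = RagDocumentsAISearch(
        suffix=args.suffix,
        rebuild=args.rebuild,
        enable_page_by_chunking=args.enable_page_chunking,
    )
elif args.indexer_type == "text_2_sql":
    index_config = Text2SqlAISearch(suffix=args.suffix, rebuild=args.rebuild)
else:
    raise ValueError("Invalid Indexer Type")
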
29 changes: 26 additions & 3 deletions deploy_ai_search/ai_search.py
@@ -25,6 +25,7 @@
     SynonymMap,
     SplitSkill,
     SearchIndexerIndexProjections,
+    BlobIndexerParsingMode,
 )
 from azure.core.exceptions import HttpResponseError
 from azure.search.documents.indexes import SearchIndexerClient, SearchIndexClient
@@ -66,12 +67,16 @@ def __init__(
         self.environment = AISearchEnvironment(indexer_type=self.indexer_type)

         self._search_indexer_client = SearchIndexerClient(
-            self.environment.ai_search_endpoint, self.environment.ai_search_credential
+            endpoint=self.environment.ai_search_endpoint,
+            credential=self.environment.ai_search_credential,
         )
         self._search_index_client = SearchIndexClient(
-            self.environment.ai_search_endpoint, self.environment.ai_search_credential
+            endpoint=self.environment.ai_search_endpoint,
+            credential=self.environment.ai_search_credential,
         )

+        self.parsing_mode = BlobIndexerParsingMode.DEFAULT
+
     @property
     def indexer_name(self):
         """Get the indexer name for the indexer."""
@@ -156,7 +161,16 @@ def get_data_source(self) -> SearchIndexerDataSourceConnection:
         if self.get_indexer() is None:
             return None

-        data_deletion_detection_policy = NativeBlobSoftDeleteDeletionDetectionPolicy()
+        if self.parsing_mode in [
+            BlobIndexerParsingMode.DEFAULT,
+            BlobIndexerParsingMode.TEXT,
+            BlobIndexerParsingMode.JSON,
+        ]:
+            data_deletion_detection_policy = (
+                NativeBlobSoftDeleteDeletionDetectionPolicy()
+            )
+        else:
+            data_deletion_detection_policy = None

         data_change_detection_policy = HighWaterMarkChangeDetectionPolicy(
             high_water_mark_column_name="metadata_storage_last_modified"
@@ -268,6 +282,10 @@ def get_text_split_skill(self, context, source) -> SplitSkill:
     def get_adi_skill(self, chunk_by_page=False) -> WebApiSkill:
         """Get the custom skill for adi.

+        Args:
+        -----
+            chunk_by_page (bool, optional): Whether to chunk by page. Defaults to False.
+
         Returns:
         --------
             WebApiSkill: The custom skill for adi"""
@@ -528,6 +546,11 @@ def run_indexer(self):

     def reset_indexer(self):
         """This function resets the indexer."""
+
+        if self.get_indexer() is None:
+            logging.warning("Indexer not defined. Skipping reset operation.")
+
+            return
         self._search_indexer_client.reset_indexer(self.indexer_name)

         logging.info("%s reset.", self.indexer_name)
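
The new `parsing_mode` attribute defaults to `DEFAULT`, and subclasses can override it to consume other blob formats; the "Update the plugin to load jsonl" commit suggests the Text2SQL index uses JSON lines. A sketch of how a subclass might opt in (illustrative; the real `Text2SqlAISearch` class lives in `text_2_sql.py`, which is not shown in this diff, and its constructor arguments are assumed):

from azure.search.documents.indexes.models import BlobIndexerParsingMode

from ai_search import AISearch


class Text2SqlAISearch(AISearch):
    """Illustrative subclass; constructor arguments are assumed."""

    def __init__(self, suffix=None, rebuild=False):
        super().__init__(suffix=suffix, rebuild=rebuild)
        # JSON_LINES is not in the soft-delete list above, so
        # get_data_source() will skip the native deletion detection policy.
        self.parsing_mode = BlobIndexerParsingMode.JSON_LINES
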
5 changes: 5 additions & 0 deletions deploy_ai_search/deploy.py
@@ -2,6 +2,7 @@
 # Licensed under the MIT License.
 import argparse
 from rag_documents import RagDocumentsAISearch
+from text_2_sql import Text2SqlAISearch


 def deploy_config(arguments: argparse.Namespace):
@@ -15,6 +16,10 @@ def deploy_config(arguments: argparse.Namespace):
             rebuild=arguments.rebuild,
             enable_page_by_chunking=arguments.enable_page_chunking,
         )
+    elif arguments.indexer_type == "text_2_sql":
+        index_config = Text2SqlAISearch(
+            suffix=arguments.suffix, rebuild=arguments.rebuild
+        )
     else:
         raise ValueError("Invalid Indexer Type")
1 change: 1 addition & 0 deletions deploy_ai_search/environment.py
@@ -12,6 +12,7 @@ class IndexerType(Enum):
     """The type of the indexer"""

     RAG_DOCUMENTS = "rag-documents"
+    TEXT_2_SQL = "text-2-sql"


 class IdentityType(Enum):
1 change: 1 addition & 0 deletions deploy_ai_search/rag_documents.py
@@ -249,6 +249,7 @@ def get_indexer(self) -> SearchIndexer:
                 fail_on_unsupported_content_type=False,
                 index_storage_metadata_only_for_oversized_documents=True,
                 indexed_file_name_extensions=".pdf,.pptx,.docx,.xlsx,.txt,.png,.jpg,.jpeg",
+                parsing_mode=self.parsing_mode,
             ),
             max_failed_items=5,
         )