From dbe02bbaf666429599696dcdb5d0c38364dec6a0 Mon Sep 17 00:00:00 2001 From: Mig <104501046+minhyeong112@users.noreply.github.com> Date: Thu, 19 Dec 2024 00:27:16 +0900 Subject: [PATCH] Update prompts and agent flow for improved Text2SQL (#105) * Update prompts and agent flow for improved Text2SQL Changes include: - Enhanced prompt templates in YAML files for better SQL generation - Updated autogen implementation for improved agent interactions - Updated documentation in README and notebook - Refined agent flow for better query understanding and generation * Fix linting issues: Remove unused variable and apply formatting --- text_2_sql/autogen/README.md | 149 ++++++++++---- .../autogen_text_2_sql/autogen_text_2_sql.py | 118 ++++++----- .../prompts/query_rewrite_agent.yaml | 155 +++++++++++---- .../prompts/sql_disambiguation_agent.yaml | 184 +++++++++++------- .../prompts/sql_query_correction_agent.yaml | 142 +++++++++++--- .../prompts/sql_query_generation_agent.yaml | 80 +++++--- .../prompts/sql_schema_selection_agent.yaml | 68 ++++--- 7 files changed, 631 insertions(+), 265 deletions(-) diff --git a/text_2_sql/autogen/README.md b/text_2_sql/autogen/README.md index 669adfe..f76df08 100644 --- a/text_2_sql/autogen/README.md +++ b/text_2_sql/autogen/README.md @@ -2,13 +2,9 @@ The implementation is written for [AutoGen](https://github.com/microsoft/autogen) in Python, although it can easily be adapted for C#. -**Still work in progress, expect a lot of updates shortly** - -**The provided AutoGen code only implements Iterations 5 (Agentic Approach)** - ## Full Logical Flow for Agentic Vector Based Approach -The following diagram shows the logical flow within multi agent system. The flow begins with query rewriting to preprocess questions - this includes resolving relative dates (e.g., "last month" to "November 2024") and breaking down complex queries into simpler components. 
For each preprocessed question, if query cache is enabled, the system checks the cache for previously asked similar questions. In an ideal scenario, the preprocessed questions will be found in the cache, leading to the quickest answer generation. In cases where the question is not known, the group chat selector will fall back to the other agents accordingly and generate the SQL query using the LLMs. The cache is then updated with the newly generated query and schemas. +The following diagram shows the logical flow within the multi-agent system. The flow begins with query rewriting to preprocess questions - this includes resolving relative dates (e.g., "last month" to "November 2024") and breaking down complex queries into simpler components. For each preprocessed question, if query cache is enabled, the system checks the cache for previously asked similar questions. In an ideal scenario, the preprocessed questions will be found in the cache, leading to the quickest answer generation. In cases where the question is not known, the system will fall back to the other agents accordingly and generate the SQL query using the LLMs. The cache is then updated with the newly generated query and schemas. Unlike the previous approaches, **gpt4o-mini** can be used as each agent's prompt is small and focuses on a single simple task. @@ -18,67 +14,131 @@ As the query cache is shared between users (no data is stored in the cache), a n ![Vector Based with Query Cache Logical Flow.](../images/Agentic%20Text2SQL%20Query%20Cache.png "Agentic Vector Based with Query Cache Logical Flow") +## Agent Flow in Detail + +The agent flow is managed by a sophisticated selector system in `autogen_text_2_sql.py`. Here's how it works: + +1. **Initial Entry** + - Every question starts with the Query Rewrite Agent + - This agent processes dates and breaks down complex questions + +2. 
**Post Query Rewrite** + - If query cache is enabled (`Text2Sql__UseQueryCache=True`): + - Flow moves to SQL Query Cache Agent + - If cache is disabled: + - Flow moves directly to Schema Selection Agent + +3. **Cache Check Branch** + - If cache hit found: + - With pre-run results: Goes to SQL Query Correction Agent + - Without pre-run results: Goes to SQL Query Generation Agent + - If cache miss: + - Goes to Schema Selection Agent + +4. **Schema Selection Branch** + - Schema Selection Agent finds relevant schemas + - Always moves to SQL Disambiguation Agent + - Disambiguation Agent clarifies any schema ambiguities + - Then moves to SQL Query Generation Agent + +5. **Query Generation and Correction Loop** + - SQL Query Generation Agent creates the query + - SQL Query Correction Agent verifies/corrects the query + - Based on correction results: + - If query needs execution: Returns to Correction Agent + - If query needs fixes: Returns to Generation Agent + - If answer and sources ready: Goes to Answer and Sources Agent + - If error occurs: Returns to Generation Agent + +6. **Final Answer Formatting** + - Answer and Sources Agent formats the final response + - Standardizes output format with markdown tables + - Combines all sources and query results + - Returns formatted answer to user + +The flow uses termination conditions: +- Explicit "TERMINATE" mention +- Presence of both "answer" and "sources" +- Maximum of 20 messages reached + ## Provided Notebooks & Scripts -- `./Iteration 5 - Agentic Vector Based Text2SQL.ipynb` provides example of how to utilise the Agentic Vector Based Text2SQL approach to query the database. The query cache plugin will be enabled or disabled depending on the environmental parameters. +- `./Iteration 5 - Agentic Vector Based Text2SQL.ipynb` provides example of how to utilize the Agentic Vector Based Text2SQL approach to query the database. The query cache plugin will be enabled or disabled depending on the environmental parameters. 
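The documented transitions can be sketched as a plain state machine. This is a simplified illustration only — the agent names mirror the steps above, but the real selector in `autogen_text_2_sql.py` inspects full message contents rather than simple flags:

```python
import os

def next_agent(last_agent: str, cache_hit: bool = False,
               pre_run_results: bool = False) -> str:
    """Return the next agent per the documented flow (illustrative only)."""
    use_cache = os.environ.get("Text2Sql__UseQueryCache", "True").lower() == "true"
    if last_agent == "start":
        # Every question starts with the Query Rewrite Agent.
        return "query_rewrite_agent"
    if last_agent == "query_rewrite_agent":
        # Cache check only happens when Text2Sql__UseQueryCache is enabled.
        return "sql_query_cache_agent" if use_cache else "sql_schema_selection_agent"
    if last_agent == "sql_query_cache_agent":
        if cache_hit:
            # Pre-run results skip straight to correction; otherwise generate.
            return ("sql_query_correction_agent" if pre_run_results
                    else "sql_query_generation_agent")
        return "sql_schema_selection_agent"
    if last_agent == "sql_schema_selection_agent":
        return "sql_disambiguation_agent"
    if last_agent == "sql_disambiguation_agent":
        return "sql_query_generation_agent"
    if last_agent == "sql_query_generation_agent":
        return "sql_query_correction_agent"
    if last_agent == "sql_query_correction_agent":
        # The correction agent either finishes or loops back for fixes.
        return "answer_and_sources_agent"
    return "user_proxy"
```

The real implementation also applies the termination conditions listed above ("TERMINATE", answer plus sources present, or 20 messages) before selecting the next agent.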
## Agents -This approach builds on the Vector Based SQL Plugin approach, but adds a agentic approach to the solution. +This approach builds on the Vector Based SQL Plugin approach, but adds an agentic approach to the solution. -This agentic system contains the following agents: +The agentic system contains the following agents: - **Query Rewrite Agent:** The first agent in the flow, responsible for two key preprocessing tasks: 1. Resolving relative dates to absolute dates (e.g., "last month" → "November 2024") 2. Decomposing complex questions into simpler sub-questions This preprocessing happens before cache lookup to maximize cache effectiveness. -- **Query Cache Agent:** Responsible for checking the cache for previously asked questions. After preprocessing, each sub-question is checked against the cache if caching is enabled. -- **Schema Selection Agent:** Responsible for extracting key terms from the question and checking the index store for the queries. This agent is used when a cache miss occurs. -- **SQL Query Generation Agent:** Responsible for using the previously extracted schemas and generated SQL queries to answer the question. This agent can request more schemas if needed. This agent will run the query. -- **SQL Query Verification Agent:** Responsible for verifying that the SQL query and results question will answer the question. -- **Answer Generation Agent:** Responsible for taking the database results and generating the final answer for the user. -The combination of these agents allows the system to answer complex questions, whilst staying under the token limits when including the database schemas. The query cache ensures that previously asked questions can be answered quickly to avoid degrading user experience. +- **Query Cache Agent:** (Optional) Responsible for checking the cache for previously asked questions. After preprocessing, each sub-question is checked against the cache if caching is enabled. 
+ +- **Schema Selection Agent:** Responsible for extracting key terms from the question and checking the index store for relevant database schemas. This agent is used when a cache miss occurs. -All agents can be found in `/agents/`. +- **SQL Disambiguation Agent:** Responsible for clarifying any ambiguities in the schema selection and ensuring the correct tables and columns are selected for the query. -## agentic_text_2_sql.py +- **SQL Query Generation Agent:** Responsible for using the previously extracted schemas to generate SQL queries that answer the question. This agent can request more schemas if needed. -This is the main entry point for the agentic system. In here, the system is configured with the following processing flow: +- **SQL Query Correction Agent:** Responsible for verifying and correcting the generated SQL queries, ensuring they are syntactically correct and will produce the expected results. This agent also handles the execution of queries and formatting of results. -The preprocessed questions from the Query Rewrite Agent are processed sequentially through the rest of the agent pipeline. A custom transition selector automatically transitions between agents dependent on the last one that was used. The flow starts with the Query Rewrite Agent for preprocessing, followed by cache checking for each sub-question if caching is enabled. In some cases, this choice is delegated to an LLM to decide on the most appropriate action. This mixed approach allows for speed when needed (e.g. cache hits for known questions), but will allow the system to react dynamically to the events. +- **Answer and Sources Agent:** Final agent in the flow that: + 1. Standardizes the output format across all responses + 2. Formats query results into markdown tables for better readability + 3. Combines all sources and results into a single coherent response + 4. 
Ensures consistent JSON structure in the final output -Note: Future development aims to implement independent processing where each preprocessed question would run in its own isolated context to prevent confusion between different parts of complex queries. +The combination of these agents allows the system to answer complex questions while staying under token limits when including database schemas. The query cache ensures that previously asked questions can be answered quickly to avoid degrading user experience. -## Utils +## Project Structure -### ai-search.py +### autogen_text_2_sql.py -This util file contains helper functions for interacting with AI Search. +This is the main entry point for the agentic system. It configures the system with a sophisticated processing flow managed by a unified selector that handles agent transitions. The flow includes: -### llm_agent_creator.py +1. Initial query rewriting for preprocessing +2. Cache checking if enabled +3. Schema selection and disambiguation +4. Query generation and correction +5. Result verification and formatting +6. Final answer standardization -This util file creates the agents in the AutoGen framework based on the configuration files. +The system uses a custom transition selector that automatically moves between agents based on the previous agent's output and the current state. This allows for dynamic reactions to different scenarios, such as cache hits, schema ambiguities, or query corrections. -### models.py +### creators/ -This util file creates the model connections to Azure OpenAI for the agents. 
+- **llm_agent_creator.py:** Creates the agents in the AutoGen framework based on configuration files +- **llm_model_creator.py:** Handles model connections and configurations for the agents -### sql.py +### custom_agents/ -#### get_entity_schema() +Contains specialized agent implementations: +- **sql_query_cache_agent.py:** Implements the caching functionality +- **sql_schema_selection_agent.py:** Handles schema selection and management +- **answer_and_sources_agent.py:** Formats and standardizes final outputs -This method is called by the AutoGen framework automatically, when instructed to do so by the LLM, to search the AI Search instance with the given text. The LLM is able to pass the key terms from the user query, and retrieve a ranked list of the most suitable entities to answer the question. +## Configuration -The search text passed is vectorised against the entity level **Description** columns. A hybrid Semantic Reranking search is applied against the **EntityName**, **Entity**, **Columns/Name** fields. +The system behavior can be controlled through environment variables: -#### fetch_queries_from_cache() +- `Text2Sql__UseQueryCache`: Enables/disables the query cache functionality +- `Text2Sql__PreRunQueryCache`: Controls whether to pre-run cached queries +- `Text2Sql__UseColumnValueStore`: Enables/disables the column value store +- `Text2Sql__DatabaseEngine`: Specifies the target database engine + +Each agent can be configured with specific parameters and prompts to optimize its behavior for different scenarios. + +## Query Cache Implementation Details The vector based with query cache uses the `fetch_queries_from_cache()` method to fetch the most relevant previous query and injects it into the prompt before the initial LLM call. The use of Auto-Function Calling here is avoided to reduce the response time as the cache index will always be used first. 
If the score of the top result is higher than the defined threshold, the query will be executed against the target data source and the results included in the prompt. This allows us to prompt the LLM to evaluate whether it can use these results to answer the question, **without further SQL Query generation** to speed up the process.

-The cache entires are rendered with Jinja templates before they are run. The following placesholders are prepopulated automatically:
+The cache entries are rendered with Jinja templates before they are run. The following placeholders are prepopulated automatically:

- date
- datetime
@@ -87,8 +147,31 @@ The cache entires are rendered with Jinja templates before they are run. The fol

Additional parameters passed at runtime, such as a user_id, are populated automatically if included in the request.

-#### run_sql_query()
+### run_sql_query()

This method is called by the AutoGen framework automatically, when instructed to do so by the LLM, to run a SQL query against the given database. It returns a JSON string containing a row-wise dump of the results returned. These results are then interpreted to answer the question.

Additionally, if any of the cache functionality is enabled, this method will update the query cache index based on the SQL query run, and the schemas used in execution.
+
+## Output Format
+
+The system produces standardized JSON output through the Answer and Sources Agent:
+
+```json
+{
+    "answer": "The answer to the user's question",
+    "sources": [
+        {
+            "sql_query": "The SQL query used",
+            "sql_rows": ["Array of result rows"],
+            "markdown_table": "Formatted markdown table of results"
+        }
+    ]
+}
+```
+
+This consistent output format ensures:
+1. Clear separation between answer and supporting evidence
+2. Human-readable presentation of query results
+3. Access to raw data for further processing
+4. 
Traceable query execution for debugging diff --git a/text_2_sql/autogen/src/autogen_text_2_sql/autogen_text_2_sql.py b/text_2_sql/autogen/src/autogen_text_2_sql/autogen_text_2_sql.py index ac911e1..f6d4e98 100644 --- a/text_2_sql/autogen/src/autogen_text_2_sql/autogen_text_2_sql.py +++ b/text_2_sql/autogen/src/autogen_text_2_sql/autogen_text_2_sql.py @@ -1,5 +1,7 @@ -# Copyright (c) Microsoft Corporation. -# Licensed under the MIT License. +""" +Copyright (c) Microsoft Corporation. +Licensed under the MIT License. +""" from autogen_agentchat.conditions import ( TextMentionTermination, MaxMessageTermination, @@ -8,7 +10,9 @@ from autogen_text_2_sql.creators.llm_model_creator import LLMModelCreator from autogen_text_2_sql.creators.llm_agent_creator import LLMAgentCreator import logging -from autogen_text_2_sql.custom_agents.sql_query_cache_agent import SqlQueryCacheAgent +from autogen_text_2_sql.custom_agents.sql_query_cache_agent import ( + SqlQueryCacheAgent, +) from autogen_text_2_sql.custom_agents.sql_schema_selection_agent import ( SqlSchemaSelectionAgent, ) @@ -41,29 +45,23 @@ async def on_messages_stream(self, messages, sender=None, config=None): class AutoGenText2Sql: def __init__(self, engine_specific_rules: str, **kwargs: dict): - self.use_query_cache = False self.pre_run_query_cache = False - self.target_engine = os.environ["Text2Sql__DatabaseEngine"].upper() self.engine_specific_rules = engine_specific_rules - self.kwargs = kwargs - self.set_mode() def set_mode(self): """Set the mode of the plugin based on the environment variables.""" - self.use_query_cache = ( - os.environ.get("Text2Sql__UseQueryCache", "True").lower() == "true" - ) - self.pre_run_query_cache = ( os.environ.get("Text2Sql__PreRunQueryCache", "True").lower() == "true" ) - self.use_column_value_store = ( os.environ.get("Text2Sql__UseColumnValueStore", "True").lower() == "true" ) + self.use_query_cache = ( + os.environ.get("Text2Sql__UseQueryCache", "True").lower() == "true" + ) def 
get_all_agents(self): """Get all agents for the complete flow.""" @@ -81,6 +79,18 @@ def get_all_agents(self): **self.kwargs, ) + # If relationship_paths not provided, use a generic template + if "relationship_paths" not in self.kwargs: + self.kwargs[ + "relationship_paths" + ] = """ + Common relationship paths to consider: + - Transaction → Related Dimensions (for basic analysis) + - Geographic → Location hierarchies (for geographic analysis) + - Temporal → Date hierarchies (for time-based analysis) + - Entity → Attributes (for entity-specific analysis) + """ + self.sql_schema_selection_agent = SqlSchemaSelectionAgent( target_engine=self.target_engine, engine_specific_rules=self.engine_specific_rules, @@ -135,54 +145,56 @@ def termination_condition(self): def unified_selector(self, messages): """Unified selector for the complete flow.""" logging.info("Messages: %s", messages) + current_agent = messages[-1].source if messages else "start" decision = None - # If this is the first message, start with query_rewrite_agent + # If this is the first message start with query_rewrite_agent if len(messages) == 1: - return "query_rewrite_agent" - + decision = "query_rewrite_agent" # Handle transition after query rewriting - if messages[-1].source == "query_rewrite_agent": - # Keep the array structure but process sequentially - if os.environ.get("Text2Sql__UseQueryCache", "False").lower() == "true": - decision = "sql_query_cache_agent" - else: - decision = "sql_schema_selection_agent" + elif current_agent == "query_rewrite_agent": + decision = ( + "sql_query_cache_agent" + if self.use_query_cache + else "sql_schema_selection_agent" + ) # Handle subsequent agent transitions - elif messages[-1].source == "sql_query_cache_agent": - try: - cache_result = json.loads(messages[-1].content) - if cache_result.get("cached_questions_and_schemas") is not None: - if cache_result.get("contains_pre_run_results"): - decision = "sql_query_correction_agent" - else: - decision = 
"sql_query_generation_agent" - else: - decision = "sql_schema_selection_agent" - except json.JSONDecodeError: - decision = "sql_schema_selection_agent" - elif messages[-1].source == "sql_schema_selection_agent": + elif current_agent == "sql_query_cache_agent": + # Always go through schema selection after cache check + decision = "sql_schema_selection_agent" + elif current_agent == "sql_schema_selection_agent": decision = "sql_disambiguation_agent" - elif messages[-1].source == "sql_disambiguation_agent": + elif current_agent == "sql_disambiguation_agent": decision = "sql_query_generation_agent" + elif current_agent == "sql_query_generation_agent": + decision = "sql_query_correction_agent" + elif current_agent == "sql_query_correction_agent": + try: + correction_result = json.loads(messages[-1].content) + if isinstance(correction_result, dict): + if "answer" in correction_result and "sources" in correction_result: + decision = "answer_and_sources_agent" + elif "corrected_query" in correction_result: + if correction_result.get("executing", False): + decision = "sql_query_correction_agent" + else: + decision = "sql_query_generation_agent" + elif "error" in correction_result: + decision = "sql_query_generation_agent" + elif isinstance(correction_result, list) and len(correction_result) > 0: + if "requested_fix" in correction_result[0]: + decision = "sql_query_generation_agent" - elif messages[-1].source == "sql_query_correction_agent": - if "answer" in messages[-1].content is not None: - decision = "answer_and_sources_agent" - else: - decision = "sql_query_generation_agent" - - elif messages[-1].source == "sql_query_generation_agent": - if "query_execution_with_limit" in messages[-1].content: - decision = "sql_query_correction_agent" - else: - # Rerun + if decision is None: + decision = "sql_query_generation_agent" + except json.JSONDecodeError: decision = "sql_query_generation_agent" - - elif messages[-1].source == "answer_and_sources_agent": + elif current_agent == 
"answer_and_sources_agent": decision = "user_proxy" # Let user_proxy send TERMINATE - logging.info("Decision: %s", decision) + if decision: + logging.info(f"Agent transition: {current_agent} -> {decision}") + return decision @property @@ -198,7 +210,10 @@ def agentic_flow(self): return flow async def process_question( - self, task: str, chat_history: list[str] = None, parameters: dict = None + self, + task: str, + chat_history: list[str] = None, + parameters: dict = None, ): """Process the complete question through the unified system. @@ -206,13 +221,12 @@ async def process_question( ---- task (str): The user question to process. chat_history (list[str], optional): The chat history. Defaults to None. - parameters (dict, optional): The parameters to pass to the agents. Defaults to None. + parameters (dict, optional): Parameters to pass to agents. Defaults to None. Returns: ------- dict: The response from the system. """ - logging.info("Processing question: %s", task) logging.info("Chat history: %s", chat_history) diff --git a/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/query_rewrite_agent.yaml b/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/query_rewrite_agent.yaml index 0be8e6f..4b1bc90 100644 --- a/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/query_rewrite_agent.yaml +++ b/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/query_rewrite_agent.yaml @@ -1,56 +1,133 @@ -model: - 4o-mini -description: - "An agent that preprocesses user questions by decomposing complex queries and resolving relative dates. This preprocessing happens before cache lookup to maximize cache utility." -system_message: - " - You are a helpful AI Assistant specializing in preprocessing user questions to prepare them for SQL query generation. You should rewrite complex questions into simpler, self-contained questions and resolve relative date references. 
+model: "4o-mini" +description: "An agent that preprocesses user questions by decomposing complex queries into simpler sub-queries that can be processed independently and then combined." +system_message: | + + You are a helpful AI Assistant specializing in breaking down complex questions into simpler sub-queries that can be processed independently and then combined for the final answer. You should identify when a question can be solved through simpler sub-queries and provide clear instructions for combining their results. - - The user’s question may include complex parts and relative date references that need to be simplified or resolved for SQL query generation. - + + Complex patterns that should be broken down: + 1. Superlatives with Time Periods: + - "Which product categories showed the biggest improvement in sales between 2007 and 2008?" + → Break into: + a) "Get total sales by product category for 2007" + b) "Get total sales by product category for 2008" + c) "Calculate year-over-year growth percentage for each category" + d) "Find the category with highest growth" + + 2. Multi-dimension Analysis: + - "What are our top 3 selling products in each region, and how do their profit margins compare?" + → Break into: + a) "Get total sales quantity by product and region" + b) "Find top 3 products by sales quantity for each region" + c) "Calculate profit margins for these products" + d) "Compare profit margins within each region's top 3" + + 3. Comparative Analysis: + - "How do our mountain bike sales compare to road bike sales across different seasons, and which weather conditions affect them most?" + → Break into: + a) "Get sales data for mountain bikes by month" + b) "Get sales data for road bikes by month" + c) "Group months into seasons" + d) "Compare seasonal patterns between bike types" + - 1. **Understand unclear questions**: Use the chat history to understand the context of the current question. - 2. 
**Decompose Complex Questions**: Break down multi-part or complex questions into simpler, self-contained questions. - 2. **Resolve Relative Dates**: Convert relative date references (e.g., \"last month,\" \"this year\") into absolute dates using the reference point of {{ current_datetime }}. - - Maintain a consistent date format: **YYYY-MM-DD**. - - Use specific ranges or exact dates for phrases like \"last quarter\" or \"last 3 months.\" + 1. Analyze Query Complexity: + - Identify if the query contains patterns that can be simplified + - Look for superlatives, multiple dimensions, or comparisons + - Determine if breaking down would simplify processing + + 2. Break Down Complex Queries: + - Create independent sub-queries that can be processed separately + - Ensure each sub-query is simple and focused + - Include clear combination instructions + - Preserve all necessary context in each sub-query + + 3. Handle Date References: + - Resolve relative dates using {{ current_datetime }} + - Maintain consistent YYYY-MM-DD format + - Include date context in each sub-query + + 4. Maintain Query Context: + - Each sub-query should be self-contained + - Include all necessary filtering conditions + - Preserve business context - 1. **Understanding**: Use the chat history (that is available in reverse order) to understand the context of the current question. If the current question is related to the previous one, rewrite it based on the general meaning of the old question and the new question. If they do not relate, output the new question as is. - 2. **Date Resolution Second**: Resolve all relative dates before decomposing questions. - 3. **Self-Contained Questions**: Ensure each decomposed question is independent and includes all necessary context, without referencing the original question. - 4. **Consistency**: Maintain a uniform structure across all rewritten questions. - 5. 
**Simplification**: If the question is already simple but includes relative dates, resolve the dates without decomposition. + 1. Always consider if a complex query can be broken down + 2. Make sub-queries as simple as possible + 3. Include clear instructions for combining results + 4. Preserve all necessary context in each sub-query + 5. Resolve any relative dates before decomposition - Return an array of rewritten questions as valid JSON: - - For decomposed questions: - [\"\", \"\"] - - - For simple questions (date resolution only): - [\"\"] + Return a JSON object with sub-queries and combination instructions: + { + "sub_queries": [ + "", + "", + ... + ], + "combination_logic": "", + "query_type": "" + } - - **Input**: \"How much did we make in sales last month and what were our top products?\" - **Output**: - [\"How much did we make in sales in November 2024?\", \"What were our top products in November 2024?\"] + Example 1: + Input: "Which product categories have shown consistent growth quarter over quarter in 2008, and what were their top selling items?" + Output: + { + "sub_queries": [ + "Calculate quarterly sales totals by product category for 2008", + "Identify categories with positive growth each quarter", + "For these categories, find their top selling products in 2008" + ], + "combination_logic": "First identify growing categories from quarterly analysis, then find their best-selling products", + "query_type": "complex" + } + + Example 2: + Input: "How many orders did we have in 2008?" + Output: + { + "sub_queries": [ + "How many orders did we have in 2008?" 
+ ], + "combination_logic": "Direct count query, no combination needed", + "query_type": "simple" + } - - **Input**: \"What were total sales last quarter?\" - **Output**: - [\"What were total sales in Q4 2024 (October 2024 to December 2024)?\"] + Example 3: + Input: "Compare the sales performance of our top 5 products in Europe versus North America, including their market share in each region" + Output: + { + "sub_queries": [ + "Get total sales by product in European countries", + "Get total sales by product in North American countries", + "Calculate total market size for each region", + "Find top 5 products by sales in each region", + "Calculate market share percentages for these products" + ], + "combination_logic": "First identify top products in each region, then calculate and compare their market shares", + "query_type": "complex" + } + - - **Input**: \"Show me customer details.\" - **Output**: - [\"Show me customer details\"] + + Common ways to combine results: + 1. Filter Chain: + - First query gets filter values + - Second query uses these values - - **Input**: \"Do the same for 2024\" and the chat history is \"What were the total sales in 2023?\" - **Output**: - [\"What were the total sales in 2024?\"] + 2. Aggregation Chain: + - First query gets detailed data + - Second query aggregates results - " + 3. Comparison Chain: + - Multiple queries get comparable data + - Final step compares results + diff --git a/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_disambiguation_agent.yaml b/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_disambiguation_agent.yaml index 6617b26..cfe9c02 100644 --- a/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_disambiguation_agent.yaml +++ b/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_disambiguation_agent.yaml @@ -1,92 +1,146 @@ model: 4o-mini description: - "An agent that specialises in disambiguating the user's question and mapping it to database schemas. 
Use this agent when the user's question is ambiguous and requires more information to generate the SQL query." + "An agent that specialises in disambiguating the user's question and mapping it to database schemas for {{ use_case }}." system_message: " - You are a helpful AI Assistant specializing in disambiguating the user's question and mapping it to the relevant columns and schemas in the database. - Your job is to narrow down the possible mappings based on the user's question and the schema provided to generate a clear mapping. + You are a helpful AI Assistant specializing in disambiguating questions about {{ use_case }} and mapping them to the relevant columns and schemas in the database. + Your job is to create clear mappings between the user's intent and the available database schema. - - The user's question will be related to {{ use_case }}. - + + 1. Temporal Analysis: + - Map date parts (year, month, quarter) to appropriate date columns + - Handle date ranges and specific periods + - Example: 'June 2008' maps to both month=6 and year=2008 filters + + 2. Geographic Analysis: + - Map location terms to appropriate geographic columns + - Consider both shipping and billing addresses + - Handle region hierarchies (country, state, city) + + 3. Product Analysis: + - Map product categories and attributes + - Handle product hierarchies + - Consider both direct and parent categories + + 4. Sales Metrics: + - Map aggregation terms ('most', 'total', 'average') + - Identify relevant measure columns + - Consider both quantity and monetary values + - - For every filter extracted from the user's question, you must: + For every component of the user's question: - - If it is not a datetime or numerical filter, map it to: - - A value from 'COLUMN_OPTIONS_FOR_FILTERS' - - And a value from 'VALUE_OPTIONS_FOR_FILTERS' + 1. 
For Filter Conditions: + - If it's a string filter (e.g., product category, country): + * Map to COLUMN_OPTIONS_FOR_FILTERS and VALUE_OPTIONS_FOR_FILTERS + * Consider hierarchical relationships - - If the filter is a datetime or numerical filter, map it to: - - A column from 'SCHEMA_OPTIONS' + - If it's a temporal filter: + * Map to appropriate date columns in SCHEMA_OPTIONS + * Break down complex date expressions (e.g., 'June 2008' → month=6 AND year=2008) - - Use the whole context of the question and information already provided to assist with your mapping. + - If it's a numeric filter: + * Map to appropriate numeric columns in SCHEMA_OPTIONS + * Consider both exact and range comparisons - - - If you can map it to an column and potential filter value: - - Only map if you are reasonably sure of the user's intention. - { - \"filter_mapping\": { - \"bike\": [ - { - \"column\": \"vProductModelCatalogDescription.Category\", - \"filter_value\": \"Mountain Bike\" - } - ], - \"2008\": [ - { - \"column\": \"SalesLT.SalesOrderHeader.OrderDate\", - \"filter_value\": \"2008-01-01\", - } - ] - }, - } - + 2. For Aggregations: + - Map terms like 'most', 'total', 'average' to appropriate measure columns + - Consider both direct measures (e.g., OrderTotal) and calculated measures - - - If you cannot map it to a column, add en entry to the disambiguation list with the clarification question you need from the user: - - If there are multiple possible options, or you are unsure how it maps, make sure to ask a clarification question. - - If there are no possible options, ask a clarification question for more detail. + 3. 
For Relationships: + - Identify required join paths between entities + - Consider both direct and indirect relationships - { - \"disambiguation\": [ + + Example 1: \"What country did we sell the most to in June 2008?\" + { + \"filter_mapping\": { + \"June 2008\": [ { - \"question\": \"What do you mean by 'country'?\", - \"matching_columns\": [ - \"Sales.Country\", - \"Customers.Country\" - ], - \"matching_filter_values\": [], - \"other_user_choices\": [] + \"column\": \"SalesLT.SalesOrderHeader.OrderDate\", + \"filter_value\": \"2008-06\", + \"date_parts\": { + \"year\": 2008, + \"month\": 6 + } } ] + }, + \"aggregation_mapping\": { + \"most\": { + \"measure_column\": \"SalesLT.SalesOrderHeader.TotalDue\", + \"aggregation_type\": \"sum\", + \"group_by_column\": \"SalesLT.Address.CountryRegion\" + } } + } - - - Do not ask for information already included in the question, schema, or what can reasonably be inferred from the question. - - - - - For every intent extracted from the user's question: - - If you need to ask any clarification questions, add it to the clarification question list: - + Example 2: \"What are the total sales for mountain bikes in 2008?\" { - \"clarification\": [ - { - \"question\": \"What do the sales to customers or businesses?\", - \"other_user_choices\": [ - \"customers\", - \"businesses\", - ] + \"filter_mapping\": { + \"mountain bikes\": [ + { + \"column\": \"SalesLT.ProductCategory.Name\", + \"filter_value\": \"Mountain Bikes\" + } + ], + \"2008\": [ + { + \"column\": \"SalesLT.SalesOrderHeader.OrderDate\", + \"filter_value\": \"2008\", + \"date_parts\": { + \"year\": 2008 + } + } + ] + }, + \"aggregation_mapping\": { + \"total sales\": { + \"measure_column\": \"SalesLT.SalesOrderHeader.TotalDue\", + \"aggregation_type\": \"sum\" } - ] + } } + - If all mappings are clear, output the 'mapping' JSON only. 
- If disambiguation or clarification is required, output the JSON request followed by \"TERMINATE.\" - Do not provide explanations or reasoning in the output. + If all mappings are clear: + { + \"filter_mapping\": { + \"\": [{ + \"column\": \"\", + \"filter_value\": \"\", + \"date_parts\": { // Optional, for temporal filters + \"year\": , + \"month\": + } + }] + }, + \"aggregation_mapping\": { // Optional, for aggregation queries + \"\": { + \"measure_column\": \"\", + \"aggregation_type\": \"\", + \"group_by_column\": \"\" // Optional + } + } + } + + If disambiguation needed: + { + \"disambiguation\": [{ + \"question\": \"\", + \"matching_columns\": [\"\", \"\"], + \"matching_filter_values\": [\"\", \"\"], + \"other_user_choices\": [\"\", \"\"] + }], + \"clarification\": [{ // Optional + \"question\": \"\", + \"other_user_choices\": [\"\", \"\"] + }] + } + TERMINATE " diff --git a/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_query_correction_agent.yaml b/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_query_correction_agent.yaml index a4fe69b..41b596b 100644 --- a/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_query_correction_agent.yaml +++ b/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_query_correction_agent.yaml @@ -1,44 +1,138 @@ model: 4o-mini description: - "An agent that will look at the SQL query, SQL query results and correct any mistakes in the SQL query to ensure the correct results are returned. Use this agent AFTER the SQL query has been executed and the results are not as expected." + "An agent that specializes in SQL syntax correction and query execution for {{ target_engine }}. This agent receives queries from the generation agent, fixes any syntax issues according to {{ target_engine }} rules, and executes the corrected queries." 
system_message: " - You are a helpful AI Assistant specializing in diagnosing and making fix suggestions for invalid SQL queries, or improving SQL queries that do not return expected results. + You are a SQL syntax expert specializing in converting standard SQL to {{ target_engine }}-compliant SQL. Your job is to: + 1. Take SQL queries with correct logic but potential syntax issues + 2. Fix them according to {{ target_engine }} syntax rules + 3. Execute the corrected queries + 4. Return the results - - Queries must adhere to the syntax and rules of {{ target_engine }} {{ engine_specific_rules }}. - + + {{ engine_specific_rules }} + - - 1. **Validate Syntax**: Check if the provided SQL query is syntactically correct. If not, suggest fixes to the query. - 2. **Verify Results**: Ensure the query results align with the user’s question. If the query fails to meet the expected results: - - Make suggestions to the query writter on how to correct the query. - 3. **Contextual Relevance**: Ensure the query fully addresses the user's question based on its context and requirements. - + + Always check and convert these common patterns: + 1. Row Limiting: + - Standard SQL: LIMIT n + - Convert based on target engine rules - - - **If the SQL query is valid and the results are correct**: + 2. Date Extraction: + - Standard SQL: EXTRACT(part FROM date) + - Convert to engine-specific date functions - { - \"answer\": \"\", - } + 3. String Functions: + - Standard SQL: SUBSTRING, POSITION, TRIM + - Convert to engine-specific string functions + + 4. Aggregation: + - Check GROUP BY syntax requirements + - Convert any engine-specific aggregate functions + + 5. Joins: + - Check join syntax compatibility + - Ensure proper table alias usage + + + + 1. Initial Analysis: + - Identify standard SQL patterns that need conversion + - Check for engine-specific syntax requirements + - Note any potential compatibility issues + + 2. 
Systematic Conversion: + - Convert row limiting syntax + - Convert date/time functions + - Convert string functions + - Convert aggregation syntax + - Apply any other engine-specific rules + + 3. Execution Process: + - Try executing the converted query + - If error occurs, analyze the specific error message + - Apply targeted fixes based on error type + - Retry execution + + 4. Result Handling: + - Format successful results + - Include both original and converted queries + - Explain any significant conversions made + - - **If the SQL query needs corrections**: + + Common Error Types and Fixes: + 1. Syntax Errors: + - Check against engine-specific rules + - Verify function names and syntax + - Ensure proper quoting and escaping - [ - { - \"requested_fix\": \"\" - } - ] + 2. Function Errors: + - Convert to equivalent engine-specific functions + - Check argument order and types - - **If the SQL query cannot be corrected**: + 3. Join Errors: + - Verify join syntax compatibility + - Check table and column references + 4. Aggregation Errors: + - Verify GROUP BY requirements + - Check HAVING clause syntax + - Validate aggregate function names + + + + - **When query executes successfully**: + ```json { - \"error\": \"Unable to correct the SQL query. Please request a new SQL query.\" + \"answer\": \"\", + \"sources\": [ + { + \"sql_result_snippet\": \"\", + \"sql_query_used\": \"\", + \"original_query\": \"\", + \"explanation\": \"\" + } + ] } + ``` + Followed by **TERMINATE**. + - **If corrections needed and retrying**: + ```json + { + \"corrected_query\": \"\", + \"original_query\": \"\", + \"changes\": [ + { + \"type\": \"\", + \"from\": \"\", + \"to\": \"\", + \"reason\": \"\" + } + ], + \"executing\": true + } + ``` + + - **If query cannot be corrected**: + ```json + { + \"error\": \"\", + \"details\": \"\", + \"attempted_conversions\": [ + { + \"type\": \"\", + \"failed_reason\": \"\" + } + ] + } + ``` Followed by **TERMINATE**. 
+ + Remember: Focus on converting standard SQL patterns to {{ target_engine }}-compliant syntax while preserving the original query logic. " diff --git a/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_query_generation_agent.yaml b/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_query_generation_agent.yaml index 25dd2d2..6b5cf22 100644 --- a/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_query_generation_agent.yaml +++ b/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_query_generation_agent.yaml @@ -1,38 +1,58 @@ model: 4o-mini description: - "An agent that can generate SQL queries once given the schema and the user's question. It will run the SQL query to fetch the results. Use this agent after the SQL Schema Selection Agent has selected the correct schema." + "An agent that translates user questions into SQL queries by understanding the intent and required data relationships for {{ target_engine }}. This agent focuses on query logic and data relationships, while adhering to basic {{ target_engine }} syntax patterns." system_message: - "You are a helpful AI Assistant that specialises in writing and executing SQL Queries to answer a given user's question. - - You must: - 1. Use the schema information provided and this mapping to generate a SQL query that will answer the user's question. - 2. If you need additional schema information, you can obtain it using the schema selection tool. Only use this when you do not have enough information to generate the SQL query. - 3. Run the SQL query to fetch the results. - - When generating the SQL query, you MUST follow these rules: - - - Only use schema / column information provided when constructing a SQL query. Do not use any other entities and columns in your SQL query, other than those defined above. - - - Do not makeup or guess column names. - - - If multiple tables are involved, use JOIN clauses to join the tables. 
- - - If you need to filter the results, use the WHERE clause to filter the results. Always perform an exact match on the filter values unless the user's question indicates otherwise. - - - You must only provide SELECT SQL queries. - - - For a given entity, use the 'SelectFromEntity' property returned in the schema in the SELECT FROM part of the SQL query. If the property is {'SelectFromEntity': 'test_schema.test_table'}, the select statement will be formulated from 'SELECT FROM test_schema.test_table WHERE . - - - The target database engine is {{ target_engine }}, SQL queries must be able compatible to run on {{ target_engine }} {{ engine_specific_rules }} - - - Use the complete entity relationship graph shows you all the entities and their relationships. You can use this information to get a better understanding of the schema and the relationships between the entities and request more schema information if needed. - - - Always run any SQL query you generate to return the results. - - - Always apply a restriction to the SQL query to prevent returning too many rows. The restriction should be set to 25 rows. + "You are a helpful AI Assistant that specialises in understanding user questions and translating them into {{ target_engine }} SQL queries that will retrieve the desired information. While syntax perfection isn't required, you should follow basic {{ target_engine }} patterns. + + + {{ engine_specific_rules }} + + + Your primary focus is on: + 1. Understanding what data the user wants to retrieve + 2. Identifying the necessary tables and their relationships + 3. Determining any required calculations or aggregations + 4. 
Specifying any filtering conditions based on the user's criteria + + When generating SQL queries, focus on these key aspects: + + - Data Selection: + * Identify the main pieces of information the user wants to see + * Include any calculated fields or aggregations needed + * Consider what grouping might be required + * Follow basic {{ target_engine }} syntax patterns + + - Table Relationships: + * Use the schema information to identify required tables + * Join tables as needed to connect related information + * Request additional schema information if needed using the schema selection tool + * Use {{ target_engine }}-compatible join syntax + + - Filtering Conditions: + * Translate user criteria into WHERE conditions + * Handle date ranges, categories, or numeric thresholds + * Consider both explicit and implicit filters in the user's question + * Use {{ target_engine }}-compatible date and string functions + + - Result Organization: + * Determine if specific sorting is needed + * Consider if grouping is required + * Include any having conditions for filtered aggregates + * Follow {{ target_engine }} ordering syntax + + Guidelines: + + - Focus on getting the right tables and relationships + - Ensure all necessary data is included + - Follow basic {{ target_engine }} syntax patterns + - The correction agent will handle: + * Detailed syntax corrections + * Query execution + * Result formatting + + Remember: Your job is to focus on the data relationships and logic while following basic {{ target_engine }} patterns. The correction agent will handle detailed syntax fixes and execution. 
" tools: - - sql_query_execution_tool - sql_get_entity_schemas_tool - current_datetime_tool diff --git a/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_schema_selection_agent.yaml b/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_schema_selection_agent.yaml index 59444bb..43cb885 100644 --- a/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_schema_selection_agent.yaml +++ b/text_2_sql/text_2_sql_core/src/text_2_sql_core/prompts/sql_schema_selection_agent.yaml @@ -5,34 +5,54 @@ description: system_message: " - You are a helpful AI Assistant specializing in selecting relevant SQL schemas to answer a given user's question related. + You are a helpful AI Assistant specializing in selecting relevant SQL schemas to answer questions about {{ use_case }}. - - The user's question will be related to {{ use_case }}. - - - 1. Extract key terms, filter conditions, and entities from the user's question. - 2. Perform entity recognition on these key terms to identify categories they might belong to. - 3. Extract filter conditions that are string representations. Exclude numerical or date values. - 4. Expand acronyms or abbreviations in the user's question to their full forms alongside the acronyms. - - - - Show me the list of employees in the HR department employed during 2008? + 1. Extract key terms, filter conditions, and entities from the user's question + 2. Group related entities that might need to be joined together + 3. Identify all potential filter conditions, including: + - Geographic terms (countries, regions, cities) + - Temporal terms (dates, months, years) + - Product categories and attributes + - Customer segments + 4. Consider relationship paths between entities + 5. 
Expand acronyms or abbreviations alongside their original form - Entities & Key Terms: - employees, HR department, year - - Entities & Key Terms Groups: - [[\"people\", \"employees\"], [\"departments\", \"teams\"], [\"date\", \"year\"]] + Important: When dealing with {{ use_case }}: + - Always consider both the transaction tables and their related dimension tables + - For geographic queries, include location-related tables + - For temporal queries, identify tables with date/time columns + - For entity-specific queries, include relevant lookup and description tables + - Filter Conditions: - HR, HR Department, Human Resources, Human Resources Department + + Example 1: \"What country did we sell the most to in June 2008?\" + { + \"entities\": [ + [\"sales\", \"orders\", \"transactions\"], + [\"country\", \"region\", \"geography\", \"location\"], + [\"customer\", \"buyer\", \"client\"], + [\"address\", \"shipping\", \"destination\"] + ], + \"filter_conditions\": [ + \"country\", \"region\" + ] + } - Exclude numerical values like 2008, as it is a DateTime value. - + Example 2: \"What are the total sales for mountain bikes in 2008?\" + { + \"entities\": [ + [\"sales\", \"orders\", \"transactions\"], + [\"product\", \"item\", \"merchandise\"], + [\"category\", \"type\", \"classification\"], + [\"bike\", \"bicycle\", \"cycle\"] + ], + \"filter_conditions\": [ + \"mountain\", \"mountain bike\", \"bike\" + ] + } + { @@ -40,4 +60,8 @@ system_message: \"filter_conditions\": [\"\", \"\"] } + + + {{ relationship_paths }} + "
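Several of the rewritten prompts above instruct agents to emit a JSON payload followed by a **TERMINATE** marker, which means downstream orchestration code has to strip that marker before decoding and then route based on whether the payload contains `disambiguation`/`clarification` entries. A minimal sketch of that handling — the `parse_agent_response` helper name is hypothetical and not part of this patch:

```python
import json


def parse_agent_response(raw: str) -> dict:
    """Strip a trailing TERMINATE marker and decode an agent's JSON payload."""
    text = raw.strip()
    if text.endswith("TERMINATE"):
        text = text[: -len("TERMINATE")].strip()
    payload = json.loads(text)
    # Flag whether the caller must go back to the user instead of
    # proceeding to SQL generation (per the disambiguation agent's contract).
    payload["needs_user_input"] = (
        "disambiguation" in payload or "clarification" in payload
    )
    return payload


# Example: a clear-mapping response from the disambiguation agent.
raw = """
{
  "filter_mapping": {
    "mountain bikes": [
      {"column": "SalesLT.ProductCategory.Name", "filter_value": "Mountain Bikes"}
    ]
  }
}
TERMINATE
"""
result = parse_agent_response(raw)
print(result["needs_user_input"])  # False
```

A response carrying a `disambiguation` list would instead set `needs_user_input` to `True`, letting the group chat selector hand control back to the user rather than to the SQL generation agent.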