Update prompts and agent flow for improved Text2SQL (#105)

* Update prompts and agent flow for improved Text2SQL

  Changes include:
  - Enhanced prompt templates in YAML files for better SQL generation
  - Updated autogen implementation for improved agent interactions
  - Updated documentation in README and notebook
  - Refined agent flow for better query understanding and generation

* Fix linting issues: Remove unused variable and apply formatting
microsoft · Dec 18, 2024 · dbe02bb
1 parent 46b4cf8
commit dbe02bb

Showing 7 changed files with 631 additions and 265 deletions.
diff --git a/text_2_sql/autogen/README.md b/text_2_sql/autogen/README.md
@@ -2,13 +2,9 @@
 
 The implementation is written for [AutoGen](https://github.com/microsoft/autogen) in Python, although it can easily be adapted for C#.
 
-**Still work in progress, expect a lot of updates shortly**
-
-**The provided AutoGen code only implements Iterations 5 (Agentic Approach)**
-
 ## Full Logical Flow for Agentic Vector Based Approach
 
-The following diagram shows the logical flow within multi agent system. The flow begins with query rewriting to preprocess questions - this includes resolving relative dates (e.g., "last month" to "November 2024") and breaking down complex queries into simpler components. For each preprocessed question, if query cache is enabled, the system checks the cache for previously asked similar questions. In an ideal scenario, the preprocessed questions will be found in the cache, leading to the quickest answer generation. In cases where the question is not known, the group chat selector will fall back to the other agents accordingly and generate the SQL query using the LLMs. The cache is then updated with the newly generated query and schemas.
+The following diagram shows the logical flow within the multi-agent system. The flow begins with query rewriting to preprocess questions - this includes resolving relative dates (e.g., "last month" to "November 2024") and breaking down complex queries into simpler components. For each preprocessed question, if query cache is enabled, the system checks the cache for previously asked similar questions. In an ideal scenario, the preprocessed questions will be found in the cache, leading to the quickest answer generation. In cases where the question is not known, the system will fall back to the other agents accordingly and generate the SQL query using the LLMs. The cache is then updated with the newly generated query and schemas.
 
 Unlike the previous approaches, **gpt4o-mini** can be used as each agent's prompt is small and focuses on a single simple task.
 
@@ -18,67 +14,131 @@ As the query cache is shared between users (no data is stored in the cache), a n
 
 ![Vector Based with Query Cache Logical Flow.](../images/Agentic%20Text2SQL%20Query%20Cache.png "Agentic Vector Based with Query Cache Logical Flow")
 
+## Agent Flow in Detail
+
+The agent flow is managed by a custom selector in `autogen_text_2_sql.py` that picks the next agent based on the previous agent's output. Here's how it works:
+
+1. **Initial Entry**
+   - Every question starts with the Query Rewrite Agent
+   - This agent processes dates and breaks down complex questions
+
+2. **Post Query Rewrite**
+   - If query cache is enabled (`Text2Sql__UseQueryCache=True`):
+     - Flow moves to SQL Query Cache Agent
+   - If cache is disabled:
+     - Flow moves directly to Schema Selection Agent
+
+3. **Cache Check Branch**
+   - If cache hit found:
+     - With pre-run results: Goes to SQL Query Correction Agent
+     - Without pre-run results: Goes to SQL Query Generation Agent
+   - If cache miss:
+     - Goes to Schema Selection Agent
+
+4. **Schema Selection Branch**
+   - Schema Selection Agent finds relevant schemas
+   - Always moves to SQL Disambiguation Agent
+   - Disambiguation Agent clarifies any schema ambiguities
+   - Then moves to SQL Query Generation Agent
+
+5. **Query Generation and Correction Loop**
+   - SQL Query Generation Agent creates the query
+   - SQL Query Correction Agent verifies/corrects the query
+   - Based on correction results:
+     - If query needs execution: Returns to Correction Agent
+     - If query needs fixes: Returns to Generation Agent
+     - If answer and sources ready: Goes to Answer and Sources Agent
+     - If error occurs: Returns to Generation Agent
+
+6. **Final Answer Formatting**
+   - Answer and Sources Agent formats the final response
+   - Standardizes output format with markdown tables
+   - Combines all sources and query results
+   - Returns formatted answer to user
+
+The flow uses termination conditions:
+- Explicit "TERMINATE" mention
+- Presence of both "answer" and "sources"
+- Maximum of 20 messages reached
+
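The branching described in the added "Agent Flow in Detail" section can be sketched as a plain transition function. This is an illustrative sketch and not part of the commit: the agent name constants, the `outcome` dictionary and its keys, and the function names are all hypothetical, and the sketch assumes the three termination conditions are combined with a logical OR.

```python
from typing import Optional

# Hypothetical agent identifiers; the names used in autogen_text_2_sql.py may differ.
QUERY_REWRITE = "query_rewrite_agent"
QUERY_CACHE = "sql_query_cache_agent"
SCHEMA_SELECTION = "sql_schema_selection_agent"
DISAMBIGUATION = "sql_disambiguation_agent"
GENERATION = "sql_query_generation_agent"
CORRECTION = "sql_query_correction_agent"
ANSWER_AND_SOURCES = "answer_and_sources_agent"


def next_agent(last_agent: Optional[str], outcome: dict, use_query_cache: bool) -> str:
    """Return the next agent, given the last agent and a summary of its output.

    `outcome` is an assumed structure with optional boolean keys:
    cache_hit, pre_run_results, needs_execution, answer_ready, error.
    """
    if last_agent is None:
        return QUERY_REWRITE  # every question starts with query rewriting
    if last_agent == QUERY_REWRITE:
        return QUERY_CACHE if use_query_cache else SCHEMA_SELECTION
    if last_agent == QUERY_CACHE:
        if not outcome.get("cache_hit"):
            return SCHEMA_SELECTION  # cache miss
        return CORRECTION if outcome.get("pre_run_results") else GENERATION
    if last_agent == SCHEMA_SELECTION:
        return DISAMBIGUATION
    if last_agent == DISAMBIGUATION:
        return GENERATION
    if last_agent == GENERATION:
        return CORRECTION
    if last_agent == CORRECTION:
        if outcome.get("error"):
            return GENERATION
        if outcome.get("answer_ready"):
            return ANSWER_AND_SOURCES
        if outcome.get("needs_execution"):
            return CORRECTION
        return GENERATION  # query needs fixes
    raise ValueError(f"No transition defined after agent: {last_agent!r}")


def should_terminate(message_text: str, message_count: int, max_messages: int = 20) -> bool:
    """Stop on an explicit TERMINATE, on a message containing both
    "answer" and "sources", or once the message limit is reached."""
    if "TERMINATE" in message_text:
        return True
    if "answer" in message_text and "sources" in message_text:
        return True
    return message_count >= max_messages
```

Keeping the transitions in one deterministic function makes each branch of the flow easy to unit test without calling an LLM.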
 ## Provided Notebooks & Scripts
 
-- `./Iteration 5 - Agentic Vector Based Text2SQL.ipynb` provides example of how to utilise the Agentic Vector Based Text2SQL approach to query the database. The query cache plugin will be enabled or disabled depending on the environmental parameters.
+- `./Iteration 5 - Agentic Vector Based Text2SQL.ipynb` provides an example of how to utilize the Agentic Vector Based Text2SQL approach to query the database. The query cache plugin is enabled or disabled depending on the environment variables.
 
 ## Agents
 
-This approach builds on the Vector Based SQL Plugin approach, but adds a agentic approach to the solution.
+This approach builds on the Vector Based SQL Plugin approach, but adds an agentic approach to the solution.
 
-This agentic system contains the following agents:
+The agentic system contains the following agents:
 
 - **Query Rewrite Agent:** The first agent in the flow, responsible for two key preprocessing tasks:
   1. Resolving relative dates to absolute dates (e.g., "last month" → "November 2024")
   2. Decomposing complex questions into simpler sub-questions
   This preprocessing happens before cache lookup to maximize cache effectiveness.
-- **Query Cache Agent:** Responsible for checking the cache for previously asked questions. After preprocessing, each sub-question is checked against the cache if caching is enabled.
-- **Schema Selection Agent:** Responsible for extracting key terms from the question and checking the index store for the queries. This agent is used when a cache miss occurs.
-- **SQL Query Generation Agent:** Responsible for using the previously extracted schemas and generated SQL queries to answer the question. This agent can request more schemas if needed. This agent will run the query.
-- **SQL Query Verification Agent:** Responsible for verifying that the SQL query and results question will answer the question.
-- **Answer Generation Agent:** Responsible for taking the database results and generating the final answer for the user.
 
-The combination of these agents allows the system to answer complex questions, whilst staying under the token limits when including the database schemas. The query cache ensures that previously asked questions can be answered quickly to avoid degrading user experience.
+- **Query Cache Agent:** (Optional) Responsible for checking the cache for previously asked questions. After preprocessing, each sub-question is checked against the cache if caching is enabled.
+
+- **Schema Selection Agent:** Responsible for extracting key terms from the question and checking the index store for relevant database schemas. This agent is used when a cache miss occurs.
 
-All agents can be found in `/agents/`.
+- **SQL Disambiguation Agent:** Responsible for clarifying any ambiguities in the schema selection and ensuring the correct tables and columns are selected for the query.
 
-## agentic_text_2_sql.py
+- **SQL Query Generation Agent:** Responsible for using the previously extracted schemas to generate SQL queries that answer the question. This agent can request more schemas if needed.
 
-This is the main entry point for the agentic system. In here, the system is configured with the following processing flow:
+- **SQL Query Correction Agent:** Responsible for verifying and correcting the generated SQL queries, ensuring they are syntactically correct and will produce the expected results. This agent also handles the execution of queries and formatting of results.
 
-The preprocessed questions from the Query Rewrite Agent are processed sequentially through the rest of the agent pipeline. A custom transition selector automatically transitions between agents dependent on the last one that was used. The flow starts with the Query Rewrite Agent for preprocessing, followed by cache checking for each sub-question if caching is enabled. In some cases, this choice is delegated to an LLM to decide on the most appropriate action. This mixed approach allows for speed when needed (e.g. cache hits for known questions), but will allow the system to react dynamically to the events.
+- **Answer and Sources Agent:** Final agent in the flow that:
+  1. Standardizes the output format across all responses
+  2. Formats query results into markdown tables for better readability
+  3. Combines all sources and results into a single coherent response
+  4. Ensures consistent JSON structure in the final output
 
-Note: Future development aims to implement independent processing where each preprocessed question would run in its own isolated context to prevent confusion between different parts of complex queries.
+The combination of these agents allows the system to answer complex questions while staying under token limits when including database schemas. The query cache ensures that previously asked questions can be answered quickly to avoid degrading user experience.
 
-## Utils
+## Project Structure
 
-### ai-search.py
+### autogen_text_2_sql.py
 
-This util file contains helper functions for interacting with AI Search.
+This is the main entry point for the agentic system. It configures the processing flow, which is managed by a single selector that handles all agent transitions. The flow includes:
 
-### llm_agent_creator.py
+1. Initial query rewriting for preprocessing
+2. Cache checking if enabled
+3. Schema selection and disambiguation
+4. Query generation and correction
+5. Result verification and formatting
+6. Final answer standardization
 
-This util file creates the agents in the AutoGen framework based on the configuration files.
+The system uses a custom transition selector that automatically moves between agents based on the previous agent's output and the current state. This allows for dynamic reactions to different scenarios, such as cache hits, schema ambiguities, or query corrections.
 
-### models.py
+### creators/
 
-This util file creates the model connections to Azure OpenAI for the agents.
+- **llm_agent_creator.py:** Creates the agents in the AutoGen framework based on configuration files
+- **llm_model_creator.py:** Handles model connections and configurations for the agents
 
-### sql.py
+### custom_agents/
 
-#### get_entity_schema()
+Contains specialized agent implementations:
+- **sql_query_cache_agent.py:** Implements the caching functionality
+- **sql_schema_selection_agent.py:** Handles schema selection and management
+- **answer_and_sources_agent.py:** Formats and standardizes final outputs
 
-This method is called by the AutoGen framework automatically, when instructed to do so by the LLM, to search the AI Search instance with the given text. The LLM is able to pass the key terms from the user query, and retrieve a ranked list of the most suitable entities to answer the question.
+## Configuration
 
-The search text passed is vectorised against the entity level **Description** columns. A hybrid Semantic Reranking search is applied against the **EntityName**, **Entity**, **Columns/Name** fields.
+The system behavior can be controlled through environment variables:
 
-#### fetch_queries_from_cache()
+- `Text2Sql__UseQueryCache`: Enables/disables the query cache functionality
+- `Text2Sql__PreRunQueryCache`: Controls whether to pre-run cached queries
+- `Text2Sql__UseColumnValueStore`: Enables/disables the column value store
+- `Text2Sql__DatabaseEngine`: Specifies the target database engine
+
+Each agent can be configured with specific parameters and prompts to optimize its behavior for different scenarios.
+
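As a rough illustration of how the four variables listed above might be read, the sketch below parses them from a mapping such as `os.environ`. It is not taken from the commit: the `load_settings` helper, the returned key names, and the rule that only the string `true` (case-insensitive) enables a flag are assumptions.

```python
import os
from typing import Mapping, Optional


def _as_bool(value: Optional[str]) -> bool:
    """Treat only 'true' (any casing) as enabled; this parsing rule is an assumption."""
    return (value or "").strip().lower() == "true"


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the Text2Sql__* variables from the README into one dictionary."""
    return {
        "use_query_cache": _as_bool(env.get("Text2Sql__UseQueryCache")),
        "pre_run_query_cache": _as_bool(env.get("Text2Sql__PreRunQueryCache")),
        "use_column_value_store": _as_bool(env.get("Text2Sql__UseColumnValueStore")),
        # Left as None when unset, because the README does not state a default engine.
        "database_engine": env.get("Text2Sql__DatabaseEngine"),
    }
```

Passing the environment as a parameter means the parsing can be tested with a plain dictionary, without modifying the process environment.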
+## Query Cache Implementation Details
 
 The vector-based approach with query cache uses the `fetch_queries_from_cache()` method to fetch the most relevant previous query and injects it into the prompt before the initial LLM call. Auto-Function Calling is avoided here to reduce the response time, as the cache index is always used first.
 
 If the score of the top result is higher than the defined threshold, the query will be executed against the target data source and the results included in the prompt. This allows us to prompt the LLM to evaluate whether it can use these results to answer the question **without further SQL Query generation**, which speeds up the process.
 
-The cache entires are rendered with Jinja templates before they are run. The following placesholders are prepopulated automatically:
+The cache entries are rendered with Jinja templates before they are run. The following placeholders are prepopulated automatically:
 
 - date
 - datetime
@@ -87,8 +147,31 @@ The cache entires are rendered with Jinja templates before they are run. The fol
 
 Additional parameters passed at runtime, such as a user_id, are populated automatically if included in the request.
 
-#### run_sql_query()
+### run_sql_query()
 
 This method is called by the AutoGen framework automatically, when instructed to do so by the LLM, to run a SQL query against the given database. It returns a JSON string containing a row-wise dump of the returned results. These results are then interpreted to answer the question.
 
 Additionally, if any of the cache functionality is enabled, this method will update the query cache index based on the SQL query run, and the schemas used in execution.
+
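The "row-wise dump" returned by `run_sql_query()` can be pictured with the following sketch. It is an illustration only: it uses Python's built-in `sqlite3` module rather than the project's real database connector, the function name `run_sql_query_sketch` is hypothetical, and it leaves out the cache update step.

```python
import json
import sqlite3


def run_sql_query_sketch(connection: sqlite3.Connection, sql: str) -> str:
    """Run `sql` and return the result rows as a JSON string, one object per row."""
    cursor = connection.execute(sql)
    # cursor.description is None for statements that return no result set.
    columns = [column[0] for column in cursor.description or []]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    # default=str keeps values that json cannot encode natively serializable.
    return json.dumps(rows, default=str)
```

Each row becomes one JSON object keyed by column name, which is a shape an LLM can interpret without extra schema information.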
+## Output Format
+
+The system produces standardized JSON output through the Answer and Sources Agent:
+
+```json
+{
+  "answer": "The answer to the user's question",
+  "sources": [
+    {
+      "sql_query": "The SQL query used",
+      "sql_rows": ["Array of result rows"],
+      "markdown_table": "Formatted markdown table of results"
+    }
+  ]
+}
+```
+
+This consistent output format ensures:
+1. Clear separation between answer and supporting evidence
+2. Human-readable presentation of query results
+3. Access to raw data for further processing
+4. Traceable query execution for debugging
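A consumer of this output could check the structure before using it. The sketch below is one possible validator, not code from the commit: the `parse_output` name and the decision to treat all three source keys as required are assumptions based only on the JSON example above.

```python
import json

# Keys taken from the JSON example in the README; treating them all as required is an assumption.
REQUIRED_SOURCE_KEYS = {"sql_query", "sql_rows", "markdown_table"}


def parse_output(raw: str) -> dict:
    """Parse the final response and check it has the documented structure."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("Output must be a JSON object")
    if not isinstance(payload.get("answer"), str):
        raise ValueError("Output must contain a string 'answer'")
    sources = payload.get("sources")
    if not isinstance(sources, list):
        raise ValueError("Output must contain a list of 'sources'")
    for index, source in enumerate(sources):
        if not isinstance(source, dict):
            raise ValueError(f"Source {index} must be a JSON object")
        missing = REQUIRED_SOURCE_KEYS - source.keys()
        if missing:
            raise ValueError(f"Source {index} is missing keys: {sorted(missing)}")
    return payload
```

Failing early on a malformed response keeps downstream rendering code from having to guard against missing keys.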