Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Text to SQL support #1399

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

ishaandatta
Copy link
Contributor

Support for text-to-SQL for MySQL (https://github.com/users/imartinez/projects/3?pane=issue&itemId=44266590)

This uses sqlalchemy to initialise a client to the database.

Thereafter, a SQLDatabase instance is initialised, which is passed to the llama_index object SQLTableNodeMapping for creating a node per schema. (allowing query-time retrieval of schema)

Finally, the llama_index SQLTableRetrieverQueryEngine is used to-
1. Convert text to appropriate SQL query
2. Execute the SQL query
3. Use the result as context to answer prompt

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

@lopagela lopagela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current state of the Text to SQL support is not properly taking care of:

  • Multi SQL dialect
  • Execution without Text to SQL (i.e. Vanilla privateGPT) is broken
  • Declaration of new dependencies is not justified

I think this feature must be tested across different "privateGPT mode" (with, and without DB support installed).

fern/docs.yml Outdated
Comment on lines 61 to 62
- page: SQL Databases
path: ./docs/pages/manual/nlsql.mdx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be under Storage section, but more into Advance Setup or something like this.

Comment on lines 21 to 30
## Other Databases
To get started with exploring the use of other databases, set the `sqldatabase.dialect` and `sqldatabase.driver` properties in the `settings.yaml` file.

```yaml
sqldatabase:
dialect: <dialect_here>
driver: <driver_here>
```

Refer to [SQLAlchemy Engine Configuration](https://docs.sqlalchemy.org/en/20/core/engines.html) for constructing the DB connection string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be great if that section could be a bit more user friendly - for example, putting additional links to SQLAlchemy reference (for example, the list of dialect https://docs.sqlalchemy.org/en/20/dialects/), and at least one additional link (to postgres or sqlite for example, which are very common).

Comment on lines 8 to 12
from sqlalchemy import (
MetaData,
)
from sqlalchemy.engine import create_engine
from sqlalchemy.engine.base import Engine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try without the local mode?

Are these dependencies shipped from privateGPT by default?

It looks to me that these additional dependencies are brought by default, while it should be an opt-in feature I think. This would also means that this NLSQLComponent should be conditionally injected in the app context (basically, the app should still be able to run without that SQL component).

Ideally, the app should load this component only if the configuration is specifying a valid SQL configuration (thanks to some "on/off" switch in configuration maybe)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lopagela I have moved them inside the component, after checking on if settings.context_database.enabled

Comment on lines 225 to 226
dialect: Literal["mysql"]
driver: Literal["pymysql"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going against the documentation - Pydantic will fail to validate any other values, and consequently, one will not be able to change the dialect and driver.

Comment on lines 231 to 234
db_user: str | None = Field(
"root",
description="Username to be used for accessing the SQL Database Server. If user is None, set to 'root'.",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default should be left to None, as root is not a standard default user in SQL norm (actually, SQL does not tell anything on a default user, but implementations are usually defining some, e.g. postgres for postgreSQL).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MySQL default admin user is called root.

Regardless, wrt supporting other SQL databases, I'm making username and password both default to None

Comment on lines 235 to 238
db_password: str = Field(
"",
description="Password to be used for accessing the SQL Database Server. If password is None, set to empty string.",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description is misleading. The default is a blank string, which is different from None

pyproject.toml Outdated
Comment on lines 19 to 21
gradio = "4.4.1"
pymysql = "^1.1.0"
cryptography = "^41.0.7"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has not its place in the default dependencies. gradio is in the ui target.

Additionally, I think that pymsql should not be added in the dependencies of this project, as this is a driver that is specific to mySQL: users might not use mySQL, and might need other dependencies (psycopg2, psycopg, and many others for example - c.f. sqlalchemy dialects)

settings.yaml Outdated
Comment on lines 51 to 55
db_host: ${HOSTNAME:localhost}
db_user: ${USERNAME:root}
db_password: ${PASSWORD:}
database: ${DATABASE:}
tables: ["${TABLES:}"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The env var are specific to SQL/DB world, so I'd personally prefix them with something like SQL_ or DB_.

Additionally, TABLES is set to plural, while its usage shows that only a single name is supported, which is misleading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tables variable accepts a list of table names (you can setup multiple tables as context- SQLTableRetrieverQueryEngine will select the right table to query).

It's used in the code like-
for table_name in self.metadata_obj.tables: table_schema_objs.append(SQLTableSchema(table_name=table_name))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ishaandatta you are right - tables does accept a list of string, but the env var TABLES seems to only accept one table.

Let me ask this: how would you pass several tables names in the env variable TABLES?

Copy link
Contributor Author

@ishaandatta ishaandatta Dec 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, makes sense- I'm not actually sure how to define an array/list here in the env file.
Figured square brackets were sufficient in conveying the usage, as I tested and it works for multiple tables.

I tested it with the following value-
tables: ["EMPLOYEE", "DEPARTMENT"]

Currently if TABLES is not given, no default is set as per https://docs.privategpt.dev/manual/general-configuration/configuration#environment-variables-expansion

Is this better- db_tables: ${TABLES_LIST}

Comment on lines 6 to 15
```yaml
sqldatabase:
dialect: mysql
driver: pymysql
host: <db_hostname_here>
user: <db_username_here>
password: <db_password_here>
database: <database_name_here>
tables: <list_of_table_names_here>
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very rigid, why not move it to a parameter similar to context_files (like context_database(s))? Conceptually speaking there is not big difference between talking to your documents or talking to your database(s)

I understand that would require a big refactor of this PR but it will set a pattern for next features (like context_websites)

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

@ishaandatta ishaandatta requested a review from lopagela December 17, 2023 14:48
Copy link
Contributor

Copy link
Contributor

Copy link
Contributor

github-actions bot commented Jan 2, 2024

Stale pull request

@github-actions github-actions bot added the stale label Jan 2, 2024
@ishaandatta ishaandatta requested a review from pabloogc January 8, 2024 20:59
@ishaandatta
Copy link
Contributor Author

@lopagela addressed the changes, please check once

@ricklettow
Copy link

Thank you for your efforts!

I have hit a bump when running this on a large database schema. It appears that SQLAlchemy is running a reflect() multiple times (1. during initial startup, 2. Before the first query in ObjectIndex.from_objects() ). Subsequent queries perform better however the startup is > 30 minutes and the first query is another 30 minutes. I am limiting to a single table however, that table has relationships to other tables that are recursively parsed.

sql_database = SQLDatabase(engine, include_tables=tables)

obj_index = ObjectIndex.from_objects(
          table_schema_objs,
          table_node_mapping,
          VectorStoreIndex,
          service_context=service_context,
      )

Both call above trigger a reflection. Is this by design? Can the reflection results be persisted for quick loading when the schema has infrequent changes? It seems to be on the llama-index side but I was wondering if a different approach could be used.

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants