-
Notifications
You must be signed in to change notification settings - Fork 7.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Text to SQL support #1399
base: main
Are you sure you want to change the base?
Conversation
Published docs preview URL: https://privategpt-preview-b4ba5be9-1266-4c46-bec2-8bf74af32ebb.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-fb1cbf15-a3a8-435a-8e64-70a3ec9672c0.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-fbc14ba0-7d95-481c-ae64-eca391fc1b06.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-8e2ded0f-694a-40f6-a39b-c6efcc893dd8.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-94fadd41-9144-41a0-9930-b1df12abee5b.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-0af0ff27-2f23-49aa-9134-7c2dee8b0f66.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-d67b9c29-1211-4b95-8e6e-a6b8208de95c.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-ce3d69ad-7320-4edf-971d-3beaafcb1c09.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-b1d4501a-687d-427b-b7ea-115f246a62c9.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-51c06c73-27fe-478f-8a82-6aae609aae0e.docs.buildwithfern.com |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current state of the Text to SQL support is not properly taking care of:
- Multi SQL dialect
- Execution without Text to SQL (i.e. Vanilla privateGPT) is broken
- Declaration of new dependencies is not justified
I think this feature must be tested across different "privateGPT mode" (with, and without DB support installed).
fern/docs.yml
Outdated
- page: SQL Databases | ||
path: ./docs/pages/manual/nlsql.mdx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be under Storage
section, but more into Advance Setup
or something like this.
fern/docs/pages/manual/nlsql.mdx
Outdated
## Other Databases | ||
To get started with exploring the use of other databases, set the `sqldatabase.dialect` and `sqldatabase.driver` properties in the `settings.yaml` file. | ||
|
||
```yaml | ||
sqldatabase: | ||
dialect: <dialect_here> | ||
driver: <driver_here> | ||
``` | ||
|
||
Refer to [SQLAlchemy Engine Configuration](https://docs.sqlalchemy.org/en/20/core/engines.html) for constructing the DB connection string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be great if that section could be a bit more user friendly - for example, putting additional links to SQLAlchemy reference (for example, the list of dialect
https://docs.sqlalchemy.org/en/20/dialects/), and at least one additional link (to postgres or sqlite for example, which are very common).
from sqlalchemy import ( | ||
MetaData, | ||
) | ||
from sqlalchemy.engine import create_engine | ||
from sqlalchemy.engine.base import Engine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try without the local
mode?
Are these dependencies shipped from privateGPT
by default?
It looks to me that these additional dependencies are brought by default, while it should be an opt-in
feature I think. This would also means that this NLSQLComponent
should be conditionally injected in the app context (basically, the app should still be able to run without that SQL component).
Ideally, the app should load this component only if the configuration is specifying a valid SQL configuration (thanks to some "on/off" switch in configuration maybe)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lopagela I have moved them inside the component, after checking on if settings.context_database.enabled
private_gpt/settings/settings.py
Outdated
dialect: Literal["mysql"] | ||
driver: Literal["pymysql"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going against the documentation - Pydantic will fail to validate any other values, and consequently, one will not be able to change the dialect and driver.
private_gpt/settings/settings.py
Outdated
db_user: str | None = Field( | ||
"root", | ||
description="Username to be used for accessing the SQL Database Server. If user is None, set to 'root'.", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default should be left to None
, as root
is not a standard default user in SQL norm (actually, SQL does not tell anything on a default user, but implementations are usually defining some, e.g. postgres
for postgreSQL).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MySQL default admin user is called root
.
Regardless, wrt supporting other SQL databases, I'm making username and password both default to None
private_gpt/settings/settings.py
Outdated
db_password: str = Field( | ||
"", | ||
description="Password to be used for accessing the SQL Database Server. If password is None, set to empty string.", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description is misleading. The default is a blank string, which is different from None
pyproject.toml
Outdated
gradio = "4.4.1" | ||
pymysql = "^1.1.0" | ||
cryptography = "^41.0.7" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has not its place in the default dependencies. gradio
is in the ui
target.
Additionally, I think that pymsql
should not be added in the dependencies of this project, as this is a driver that is specific to mySQL: users might not use mySQL, and might need other dependencies (psycopg2
, psycopg
, and many others for example - c.f. sqlalchemy dialects)
settings.yaml
Outdated
db_host: ${HOSTNAME:localhost} | ||
db_user: ${USERNAME:root} | ||
db_password: ${PASSWORD:} | ||
database: ${DATABASE:} | ||
tables: ["${TABLES:}"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The env var are specific to SQL/DB world, so I'd personally prefix them with something like SQL_
or DB_
.
Additionally, TABLES
is set to plural, while its usage shows that only a single name is supported, which is misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tables
variable accepts a list of table names (you can setup multiple tables as context- SQLTableRetrieverQueryEngine
will select the right table to query).
It's used in the code like-
for table_name in self.metadata_obj.tables: table_schema_objs.append(SQLTableSchema(table_name=table_name))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ishaandatta you are right - tables
does accept a list of string, but the env var TABLES
seems to only accept one table.
Let me ask this: how would you pass several tables names in the env variable TABLES
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, makes sense- I'm not actually sure how to define an array/list here in the env file.
Figured square brackets were sufficient in conveying the usage, as I tested and it works for multiple tables.
I tested it with the following value-
tables: ["EMPLOYEE", "DEPARTMENT"]
Currently if TABLES
is not given, no default is set as per https://docs.privategpt.dev/manual/general-configuration/configuration#environment-variables-expansion
Is this better- db_tables: ${TABLES_LIST}
fern/docs/pages/manual/nlsql.mdx
Outdated
```yaml | ||
sqldatabase: | ||
dialect: mysql | ||
driver: pymysql | ||
host: <db_hostname_here> | ||
user: <db_username_here> | ||
password: <db_password_here> | ||
database: <database_name_here> | ||
tables: <list_of_table_names_here> | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very rigid, why not move it to a parameter similar to context_files
(like context_database(s)
)? Conceptually speaking there is not big difference between talking to your documents or talking to your database(s)
I understand that would require a big refactor of this PR but it will set a pattern for next features (like context_websites
)
Published docs preview URL: https://privategpt-preview-0d2a2cd6-18d1-45f0-94ea-8520b66a31da.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-8398e968-7a97-4f8c-a971-60effb050e48.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-a9cc24a7-2737-4620-92fd-5baef4d90794.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-6f1e054d-e90c-4487-8fbc-74549e3b9484.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-84b92ff1-390b-4124-9528-957ca2e65530.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-670e18aa-6151-4f9a-830d-9c6331e46eed.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-d29e4898-d133-4130-8c9c-8c5c54e35cce.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-0b1faa26-b2a0-4288-9cc8-cde2105dac95.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-86152132-67f6-40c2-9221-e3f98ff4a456.docs.buildwithfern.com |
Published docs preview URL: https://privategpt-preview-e0a139de-7820-4004-9e36-bb7f957fca78.docs.buildwithfern.com |
Stale pull request |
@lopagela addressed the changes, please check once |
Thank you for your efforts! I have hit a bump when running this on a large database schema. It appears that SQLAlchemy is running a reflect() multiple times (1. during initial startup, 2. Before the first query in ObjectIndex.from_objects() ). Subsequent queries perform better however the startup is > 30 minutes and the first query is another 30 minutes. I am limiting to a single table however, that table has relationships to other tables that are recursively parsed.
Both call above trigger a reflection. Is this by design? Can the reflection results be persisted for quick loading when the schema has infrequent changes? It seems to be on the llama-index side but I was wondering if a different approach could be used. Thank you. |
Support for text-to-SQL for MySQL (https://github.com/users/imartinez/projects/3?pane=issue&itemId=44266590)
This uses
sqlalchemy
to initialise a client to the database.Thereafter, a
SQLDatabase
instance is initialised, which is passed to the llama_index objectSQLTableNodeMapping
for creating a node per schema. (allowing query-time retrieval of schema)Finally, the llama_index
SQLTableRetrieverQueryEngine
is used to-1. Convert text to appropriate SQL query
2. Execute the SQL query
3. Use the result as context to answer prompt