-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement Approximate Nearest Neighbor support for DDL (CREATE TABLE, CREATE VECTOR INDEX) #124
base: main
Are you sure you want to change the base?
feat: implement Approximate Nearest Neighbor support for DDL (CREATE TABLE, CREATE VECTOR INDEX) #124
Conversation
dfeab0a
to
2ef461a
Compare
This change introduces new nox directives: * blacken: `nox -s blacken` * format: `nox -s format` to apply formatting to files * lint: `nox -s lint` to flag linting issues * unit: to run unit tests locally which are the basis to enable scalable development and continuous testing as I prepare to bring in Approximate Nearest Neighors (ANN) functionality into this package. Also while here, fixed a typo in the README.rst file that didn't have the correct import path.
This change adds ANN distance strategies for GoogleSQL semantics. While here started unit tests to effectively test out components without having to have a running Cloud Spanner instance. Updates googleapis#94
f29911a
to
36c552c
Compare
36c552c
to
64358ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the comments, focusing only on the VectorClass. Please run the lints add line breaks to improve the readability.
@@ -87,6 +87,10 @@ class SecondaryIndex: | |||
index_name: str | |||
columns: list[str] | |||
storing_columns: Optional[list[str]] = None | |||
num_leaves: Optional[int] = None # Only necessary for ANN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create a subclass of SecondaryIndex
named VectorSearchIndex
& add the fields num_leaves
, num_branches
, tree_depth
& index_type
in the subclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -87,6 +87,10 @@ class SecondaryIndex: | |||
index_name: str | |||
columns: list[str] | |||
storing_columns: Optional[list[str]] = None | |||
num_leaves: Optional[int] = None # Only necessary for ANN | |||
num_branches: Optional[int] = None # Only necessary for ANN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the default values for the Vector Index as recommended on https://cloud.google.com/spanner/docs/find-approximate-nearest-neighbors#best-practices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make num_leaves
& num_branches
mandatory and remove Optional.
@@ -87,6 +87,10 @@ class SecondaryIndex: | |||
index_name: str | |||
columns: list[str] | |||
storing_columns: Optional[list[str]] = None | |||
num_leaves: Optional[int] = None # Only necessary for ANN | |||
num_branches: Optional[int] = None # Only necessary for ANN | |||
tree_depth: Optional[int] = None # Only necessary for ANN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a post init method to validate tree_depth.
def __post_init__(self):
# Constraint: tree_depth must be either 2 or 3
if self. tree_depth not in [2, 3]:
raise ValueError(f"tree_depth must be either 2 or 3, got {self.access_level}")
num_leaves: Optional[int] = None # Only necessary for ANN | ||
num_branches: Optional[int] = None # Only necessary for ANN | ||
tree_depth: Optional[int] = None # Only necessary for ANN | ||
index_type: Optional[DistanceStrategy] = None # Only necessary for ANN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be a good idea to make it mandatory field so customer knows on what they are creating index and won't be surprised if they query on some other distance on which they don't have an index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks! I've added VectorSearchIndex separate from SecondaryIndex.
column_name: str, | ||
table_name: str, | ||
index_name: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to take it from the user... as it's defined in SpannerVectorStore at the time of initialization.
table_name: str, | ||
index_name: str, | ||
embedding: List[float], | ||
embedding_column_name: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to take it from the user... as it's defined in SpannerVectorStore at the time of initialization.
embedding: List[float], | ||
embedding_column_name: str, | ||
num_leaves: int, | ||
strategy: DistanceStrategy = DistanceStrategy.COSINE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to take it from the user... as it's defined in SpannerVectorStore at the time of initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method though isn't exposed to the user, which is why I preceded it with _
as _query_ANN
as I need to test it out individually.
embedding_column_name: str, | ||
num_leaves: int, | ||
strategy: DistanceStrategy = DistanceStrategy.COSINE, | ||
limit: int = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename it to k
embedding_column_name: str, | ||
num_leaves: int, | ||
strategy: DistanceStrategy = DistanceStrategy.COSINE, | ||
limit: int = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename it to K
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't the K be quite confusing given that that's used inside kNN while this is ANN?
7e5279a
to
44d0996
Compare
/gcbrun |
This change adds ANN distance strategies for GoogleSQL semantics.
While here started unit tests to effectively test out components
without having to have a running Cloud Spanner instance.
Implements Data Definition Language (DDL) functionality for:
Updates #94