feat: implement Approximate Nearest Neighbor support for DDL (CREATE TABLE, CREATE VECTOR INDEX) #124

Open
wants to merge 11 commits into
base: main

@odeke-em odeke-em commented Dec 26, 2024

This change adds ANN distance strategies for GoogleSQL semantics.
While here, I started adding unit tests so that components can be
tested without a running Cloud Spanner instance.

Implements Data Definition Language (DDL) functionality for:

  • CREATE TABLE
  • CREATE VECTOR INDEX
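
For illustration, here is a minimal sketch of what generating the `CREATE VECTOR INDEX` statement could look like. It is not the code in this PR: the helper name `build_create_vector_index_ddl`, its parameters, and the default values are hypothetical, and the statement shape follows my understanding of the GoogleSQL vector index syntax, so check it against the Spanner DDL reference before relying on it.

```python
from typing import Optional


def build_create_vector_index_ddl(
    index_name: str,
    table_name: str,
    column_name: str,
    distance_type: str,
    tree_depth: int = 2,  # assumed default, for illustration only
    num_leaves: int = 1000,  # assumed default, for illustration only
    num_branches: Optional[int] = None,
) -> str:
    """Return a GoogleSQL CREATE VECTOR INDEX statement (hypothetical helper)."""
    options = [
        f"distance_type = '{distance_type}'",
        f"tree_depth = {tree_depth}",
        f"num_leaves = {num_leaves}",
    ]
    # num_branches is only emitted for a three-level tree in this sketch.
    if tree_depth == 3 and num_branches is not None:
        options.append(f"num_branches = {num_branches}")

    return (
        f"CREATE VECTOR INDEX {index_name}\n"
        f"  ON {table_name}({column_name})\n"
        f"  WHERE {column_name} IS NOT NULL\n"
        f"  OPTIONS ({', '.join(options)})"
    )


ddl = build_create_vector_index_ddl(
    "DocEmbeddingIndex", "Documents", "DocEmbedding", "COSINE"
)
print(ddl)
```

Because the function only builds a string, it can be unit tested without a running Cloud Spanner instance.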

Updates #94

@odeke-em odeke-em requested review from a team as code owners December 26, 2024 12:40
@product-auto-label product-auto-label bot added the api: spanner label (Issues related to the googleapis/langchain-google-spanner-python API.) Dec 26, 2024
@odeke-em odeke-em changed the title from "Ann support update distance strategies" to "feat: add Approximate Nearest Neighbor support to distance strategies" Dec 26, 2024
@odeke-em odeke-em force-pushed the ANN-support-update-distance_strategies branch 2 times, most recently from dfeab0a to 2ef461a Compare December 26, 2024 15:06
@odeke-em odeke-em changed the title from "feat: add Approximate Nearest Neighbor support to distance strategies" to "feat: implement Approximate Nearest Neighbor support" Dec 26, 2024
This change introduces new nox directives:
* blacken: `nox -s blacken`
* format: `nox -s format` to apply formatting to files
* lint: `nox -s lint` to flag linting issues
* unit: to run unit tests locally

which are the basis for scalable development
and continuous testing as I prepare to bring
Approximate Nearest Neighbors (ANN) functionality into
this package.

Also, while here, fixed an incorrect import path
in the README.rst file.
This change adds ANN distance strategies for GoogleSQL semantics.
While here, I started adding unit tests so that components can be
tested without a running Cloud Spanner instance.

Updates googleapis#94
@odeke-em odeke-em force-pushed the ANN-support-update-distance_strategies branch 2 times, most recently from f29911a to 36c552c Compare January 20, 2025 08:21
@odeke-em odeke-em force-pushed the ANN-support-update-distance_strategies branch from 36c552c to 64358ab Compare January 20, 2025 08:56
@odeke-em odeke-em changed the title from "feat: implement Approximate Nearest Neighbor support" to "feat: implement Approximate Nearest Neighbor support for DDL (CREATE TABLE, CREATE VECTOR INDEX)" Jan 20, 2025
Contributor

@gauravpurohit06 gauravpurohit06 left a comment

I have added comments, focusing only on the VectorClass. Please run the linters and add line breaks to improve readability.

@@ -87,6 +87,10 @@ class SecondaryIndex:
index_name: str
columns: list[str]
storing_columns: Optional[list[str]] = None
num_leaves: Optional[int] = None # Only necessary for ANN
Contributor

Please create a subclass of SecondaryIndex named VectorSearchIndex & add the fields num_leaves, num_branches, tree_depth & index_type in the subclass.

Author

Done

@@ -87,6 +87,10 @@ class SecondaryIndex:
index_name: str
columns: list[str]
storing_columns: Optional[list[str]] = None
num_leaves: Optional[int] = None # Only necessary for ANN
num_branches: Optional[int] = None # Only necessary for ANN
Contributor

Please add the default values for the Vector Index as recommended on https://cloud.google.com/spanner/docs/find-approximate-nearest-neighbors#best-practices

Contributor

Make num_leaves & num_branches mandatory and remove Optional.

@@ -87,6 +87,10 @@ class SecondaryIndex:
index_name: str
columns: list[str]
storing_columns: Optional[list[str]] = None
num_leaves: Optional[int] = None # Only necessary for ANN
num_branches: Optional[int] = None # Only necessary for ANN
tree_depth: Optional[int] = None # Only necessary for ANN
Contributor

Add a post init method to validate tree_depth.

def __post_init__(self):
    # Constraint: tree_depth must be either 2 or 3
    if self.tree_depth not in [2, 3]:
        raise ValueError(f"tree_depth must be either 2 or 3, got {self.tree_depth}")
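
To show how this validation surfaces to a caller, here is a small self-contained sketch. The class name `TreeDepthExample` is hypothetical and carries only the one field; it is not the class from this PR.

```python
from dataclasses import dataclass


@dataclass
class TreeDepthExample:
    # Hypothetical stand-in carrying only the field being validated.
    tree_depth: int

    def __post_init__(self) -> None:
        # Runs right after the generated __init__, so invalid values
        # are rejected at construction time.
        if self.tree_depth not in (2, 3):
            raise ValueError(
                f"tree_depth must be either 2 or 3, got {self.tree_depth}"
            )


ok = TreeDepthExample(tree_depth=3)

try:
    TreeDepthExample(tree_depth=4)
except ValueError as err:
    print(f"rejected: {err}")
```

Failing at construction keeps the error close to the user's mistake instead of deferring it to the moment the DDL is sent to Spanner.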

num_leaves: Optional[int] = None # Only necessary for ANN
num_branches: Optional[int] = None # Only necessary for ANN
tree_depth: Optional[int] = None # Only necessary for ANN
index_type: Optional[DistanceStrategy] = None # Only necessary for ANN
Contributor

I think it would be a good idea to make this a mandatory field, so that customers know which distance type they are creating the index on and aren't surprised if they later query with a different distance type for which they have no index.

Author

Done, thanks! I've added VectorSearchIndex separate from SecondaryIndex.

src/langchain_google_spanner/vector_store.py (outdated, resolved)
Comment on lines +980 to +982
column_name: str,
table_name: str,
index_name: str,
Contributor

You don't need to take it from the user... as it's defined in SpannerVectorStore at the time of initialization.

table_name: str,
index_name: str,
embedding: List[float],
embedding_column_name: str,
Contributor

You don't need to take it from the user... as it's defined in SpannerVectorStore at the time of initialization.

embedding: List[float],
embedding_column_name: str,
num_leaves: int,
strategy: DistanceStrategy = DistanceStrategy.COSINE,
Contributor

You don't need to take it from the user... as it's defined in SpannerVectorStore at the time of initialization.

Author

This method isn't exposed to the user, though, which is why I prefixed it with _ (as _query_ANN); I need to be able to test it individually.
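
For context, a hedged sketch of the kind of SQL such a private helper might build. It is not the PR's `_query_ANN`: the function name `build_ann_query`, the `@embedding` parameter name, and the mapping table are hypothetical, and the `APPROX_*` functions, the `FORCE_INDEX` hint, and the `num_leaves_to_search` option reflect my understanding of Spanner's ANN query syntax rather than anything stated in this thread.

```python
import json
from typing import Optional

# Assumed mapping from strategy name to GoogleSQL approximate distance function.
APPROX_FUNCTIONS = {
    "COSINE": "APPROX_COSINE_DISTANCE",
    "EUCLIDEAN": "APPROX_EUCLIDEAN_DISTANCE",
}


def build_ann_query(
    table_name: str,
    index_name: str,
    embedding_column_name: str,
    num_leaves_to_search: int,
    strategy: str = "COSINE",
    limit: Optional[int] = None,
) -> str:
    """Return an ANN query string; the embedding is bound later as @embedding."""
    try:
        distance_function = APPROX_FUNCTIONS[strategy]
    except KeyError:
        raise ValueError(f"unsupported strategy: {strategy}") from None

    options_json = json.dumps({"num_leaves_to_search": num_leaves_to_search})
    sql = (
        f"SELECT * FROM {table_name}@{{FORCE_INDEX={index_name}}}\n"
        f"WHERE {embedding_column_name} IS NOT NULL\n"
        f"ORDER BY {distance_function}("
        f"{embedding_column_name}, @embedding, options => JSON '{options_json}')"
    )
    if limit is not None:
        sql += f"\nLIMIT {limit}"
    return sql


print(build_ann_query("Documents", "DocEmbeddingIndex", "DocEmbedding", 10, limit=5))
```

Keeping the string construction separate from execution is what makes it testable in isolation; a public method could then fill in the table, column, and strategy from the store's own configuration.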

embedding_column_name: str,
num_leaves: int,
strategy: DistanceStrategy = DistanceStrategy.COSINE,
limit: int = None,
Contributor

rename it to k

embedding_column_name: str,
num_leaves: int,
strategy: DistanceStrategy = DistanceStrategy.COSINE,
limit: int = None,
Contributor

rename it to K

Author

Won't the K be quite confusing given that that's used inside kNN while this is ANN?

@odeke-em odeke-em force-pushed the ANN-support-update-distance_strategies branch from 7e5279a to 44d0996 Compare January 21, 2025 10:53
@gauravpurohit06
Contributor

/gcbrun
