feat: Improve support for fixed length array in DuckDB #10719

Open
1 task done
Riezebos opened this issue Jan 24, 2025 · 1 comment
Labels
feature Features or general enhancements

Comments

@Riezebos
Contributor

Is your feature request related to a problem?

I was trying to rewrite a DuckDB query to Ibis, and ran into this error:

BinderException: Binder Error: No function matches the given name and argument types 'array_cosine_distance(DOUBLE[], DOUBLE[])'. You might need to add explicit type casts.

Here is the original DuckDB function, followed by my attempt at rewriting it with Ibis:

import duckdb
import ibis
from ibis import _
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

ibis.options.interactive = True

static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[static_embedding])


def similarity_search_duckdb(
    query: str,
    k: int = 5,
    dataset_name: str = "ai-blueprint/fineweb-bbc-news-embeddings",
    embedding_column: str = "embeddings",
):
    query_vector = model.encode(query)
    embedding_dim = model.get_sentence_embedding_dimension()

    sql = f"""
        SELECT 
            *,
            array_cosine_distance(
                {embedding_column}::float[{embedding_dim}], 
                {query_vector.tolist()}::float[{embedding_dim}]
            ) as distance
        FROM 'hf://datasets/{dataset_name}/**/*.parquet'
        ORDER BY distance
        LIMIT {k}
    """
    return ibis.memtable(duckdb.sql(sql).to_arrow_table())


t1 = similarity_search_duckdb("What is the future of AI?")
print(t1)


@ibis.udf.scalar.builtin
def array_cosine_distance(x, y) -> float:
    """Compute cosine distance between two vectors."""


def similarity_search_ibis(
    query: str = "What is the future of AI?",
    k: int = 5,
    dataset_name: str = "ai-blueprint/fineweb-bbc-news-embeddings",
    embedding_column: str = "embeddings",
):
    # Use same model as used for indexing
    query_vector = model.encode(query)
    embedding_dim = model.get_sentence_embedding_dimension()

    return (
        ibis.read_parquet(f"hf://datasets/{dataset_name}/**/*.parquet")
        .mutate(
            distance=array_cosine_distance(
                _[embedding_column].cast("array<float64>"),
                ibis.array(query_vector).cast("array<float64>"),
            )
        )
        # Ascending order: smallest distance first, matching the SQL version
        .order_by(_.distance)
        .limit(k)
        .drop(embedding_column)
    )

t2 = similarity_search_ibis("What is the future of AI?")
print(t2)
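For reference, a minimal pure-Python sketch of what a cosine distance computes, assuming the usual definition of 1 minus cosine similarity. The helper name `cosine_distance` is hypothetical and is not part of DuckDB or Ibis. It is only meant to show why results are ordered by ascending distance.

```python
import math
from typing import Sequence


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors.

    Hypothetical reference helper; raises ZeroDivisionError for a zero vector.
    """
    if len(x) != len(y):
        # Mirrors the fixed-length requirement: both inputs need the same size
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)
```

Under this definition, identical directions give a distance of 0 and opposite directions give 2, so the closest matches come first when sorting in ascending order.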

I found a related issue: #7963

That issue ensures DuckDB tables with fixed-length array columns can be used in Ibis, but from what I can tell Ibis treats them as variable-length arrays (the DOUBLE[] arguments in the error above). I haven't found a way to create a fixed-length array in a DuckDB table using Ibis, and the array_cosine_distance function only supports fixed-length arrays.
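Until such a type exists, one possible stopgap is to build the fixed-length casts as a SQL string, the same way the DuckDB version above does. The sketch below is only an illustration: the helpers `fixed_array_cast`, `fixed_array_literal`, and `cosine_distance_sql` are hypothetical names, not Ibis or DuckDB APIs, and they assume finite float values and a trusted column name.

```python
from typing import Sequence


def fixed_array_cast(expr: str, length: int, dtype: str = "FLOAT") -> str:
    """Wrap a SQL expression in a cast to a DuckDB fixed-length array type."""
    if length <= 0:
        raise ValueError("fixed-length arrays need a positive length")
    return f"({expr})::{dtype}[{length}]"


def fixed_array_literal(values: Sequence[float], dtype: str = "FLOAT") -> str:
    """Render Python numbers as a list literal cast to a fixed-length array."""
    items = ", ".join(repr(float(v)) for v in values)
    return fixed_array_cast(f"[{items}]", len(values), dtype)


def cosine_distance_sql(
    column: str, query_vector: Sequence[float], dtype: str = "FLOAT"
) -> str:
    """Build an array_cosine_distance call with both arguments cast to the same size."""
    size = len(query_vector)
    left = fixed_array_cast(column, size, dtype)
    right = fixed_array_literal(query_vector, dtype)
    return f"array_cosine_distance({left}, {right})"


print(cosine_distance_sql("embeddings", [1, 0.5]))
```

The resulting string can be dropped into a raw SQL query such as the one in similarity_search_duckdb. It does not solve the underlying request, since the expression still bypasses Ibis's type system.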

What is the motivation behind your request?

I'm trying to use Ibis wherever I would previously have used SQL or Pandas, both as a learning exercise and because I'm hoping Ibis can become my default data manipulation library.

Describe the solution you'd like

I'd like some way to create fixed-length array columns in DuckDB using Ibis.

What version of ibis are you running?

9.5.0

What backend(s) are you using, if any?

duckdb 1.1.3

Code of Conduct

  • I agree to follow this project's Code of Conduct
Riezebos added the "feature" (Features or general enhancements) label Jan 24, 2025
@cpcloud
Member

cpcloud commented Jan 24, 2025

Thanks for the issue!

I think we were just waiting for someone to request this before doing the work, since adding a new type to Ibis is a decent amount of work.

I'm not sure this will make it into 10.0, but it should be in the following feature release after that.

Projects
Status: backlog