
speeding up search by restricting the number of docs to search #194

Open
punkish opened this issue Dec 27, 2024 · 4 comments

punkish commented Dec 27, 2024

I built a db of 5000 articles and ran 9 queries against it. The response time varied between 6 and 16 seconds, which is understandably too slow for a real-life application. But I notice that the app not only searches over all 5000 of my articles, it also draws some of its answers from outside my domain. I asked in #190 if I could restrict the search to just my data. But now I am thinking it would be even better if, optionally, I could also restrict the search to a subset of my data.

For example, if my (scientific) articles are about butterflies, ants, and spiders, and someone asks a question that is obviously related to ants, the search could be restricted to only the ants-related documents. Of course, this would imply storing some kind of metadata that would allow such subsetting. Could that be possible?

I can think of doing a db JOIN against my original articles, which have all kinds of metadata, using the rowid (or some primary key), narrowing the basket with a WHERE clause, and then running the search on just that basket.

adhityan (Collaborator) commented Jan 6, 2025

There are multiple parts to this question.

Let's start with how to restrict search to just your data. For the sake of clarity, it's important that we separate two key concepts: the search (over your dataset) and the response (from the LLM).

The search always happens on your data. Your preloaded data is encoded into vectors and stored in a database. When a query comes in, it is encoded into the same vector space and a cosine-similarity search (there are other viable techniques) is run against the database. The results will only ever come from values in that database.

The response from the LLM, however, is a different beast altogether. As you may be aware, LLMs are trained on a huge corpus of data, and it is very likely that some of that data is relevant to your query. You can ask the LLM in the system prompt to only use the data you provided, but how well LLMs adhere to such a prompt varies; in general, OpenAI adheres to such a prompt very well. You may need to fine-tune the prompt yourself to make sure it meets your objective. To do this, use this method - https://llm-tools.mintlify.app/api-reference/overview#param-set-system-message. This answers #190.
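
To make the distinction concrete, the retrieval step conceptually boils down to something like this (a generic illustration of cosine ranking only, not the library's actual code; the names here are made up, and in practice the ranking runs inside the database rather than in Node.js):

// Cosine similarity between the embedded query and one stored document vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank all stored chunks against the query vector and keep the best k.
// Only these top chunks are then passed to the LLM as context.
function topK(
  queryVector: number[],
  chunks: { id: string; vector: number[] }[],
  k: number,
) {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(queryVector, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}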

The second bit of this question is more interesting. I agree that we may want to do additional metadata filtering, preferably before / as part of the database vector query. However, not all supported databases can do this; only a few, like libSQL, have robust support for SQL or other advanced filters. I have thought about this in the past and here's what I am thinking of doing now -

  1. The library supports some high-level (but basic) metadata filters via its API
  2. For databases that natively support such filters, the high-level filter API is translated to native database filters at query time
  3. For databases that don't support such filters, the library filters the results on the Node.js side after reading them from the database via a pure vector search (sketched below)
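
For point 3, the post-retrieval filtering could look roughly like this (a sketch only; the types and field names are hypothetical, not the library's actual interfaces):

// Hypothetical shape of a retrieved chunk; the real library types may differ.
interface RetrievedChunk {
  pageContent: string;
  score: number;
  metadata: Record<string, string | number>;
}

// A minimal high-level filter: every listed key must match exactly.
type MetadataFilter = Record<string, string | number>;

// Applied in Node.js after a pure vector search, for databases that cannot
// evaluate metadata predicates natively.
function applyMetadataFilter(
  chunks: RetrievedChunk[],
  filter: MetadataFilter,
): RetrievedChunk[] {
  return chunks.filter((chunk) =>
    Object.entries(filter).every(([key, value]) => chunk.metadata[key] === value),
  );
}

// e.g. applyMetadataFilter(results, { species: 'ants' })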

This is going to be a fairly large feature and will take some time for me to push through. If you want to contribute to this, let me know and maybe we can expedite it.

punkish (Author) commented Jan 7, 2025

Thanks for the detailed reply. Your explanation is very educational and should be included in the documentation, as many others might benefit from it as well. (Answers to my other, as yet unanswered, questions might also be useful to others if they were made part of the documentation.)

With regards to the first part of your reply, I will try different params to make the LLM response narrower. Fwiw, I am using Ollama with Llama 3.2 in order to stay completely free and open source. I am not sure how much can be done there, but I will explore. Any pointers would be very welcome.

With regards to the second part, maybe if embedJs provided more lower-level access, it wouldn't have to do everything for everyone. That would make it less of an "easy, out of the box" solution, but it would also make it more powerful and tuned to whatever stack the user may be running.

In my case, all my data is in SQLite, about 1M documents. I use SQLite for regular SQL searches as well as for FTS searches that drive an API (https://zenodeo.org) to power an application (https://ocellus.info).

If I knew the minimum essential structure of the vector tables that embedJs builds, I could just build them myself using the full power of libSQL (SQLite) transactions. I would add an additional column to that table, populated with the rowid of each row in the source table. This way I could create embeddings for 1M documents efficiently, and also keep the vector db updated with TRIGGERs every time my datastore gets new documents (on a daily basis), or if any existing documents get modified or deleted (possible although rare).
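
To make the TRIGGER idea concrete, something like the following could keep the vector table in sync (an untested sketch; "articles" and "embedding_queue" are placeholder names for my setup, and since a SQL trigger cannot call the embedding model, inserts only queue rows for a later batch embedding job):

import { createClient } from "@libsql/client";

// "file:local.db" is a placeholder; in practice this would be the libSQL db
// that embedJs writes to.
const db = createClient({ url: "file:local.db" });

// Remove the stale vector row as soon as the source row is deleted.
// vectors.rowid here is the extra column described above, holding the
// source table's rowid.
await db.execute(`
  CREATE TRIGGER IF NOT EXISTS articles_after_delete
  AFTER DELETE ON articles
  BEGIN
    DELETE FROM vectors WHERE rowid = old.rowid;
  END
`);

// New rows get queued in a (hypothetical) embedding_queue table; a nightly
// batch job reads the queue, computes embeddings, and inserts into vectors.
await db.execute(`
  CREATE TRIGGER IF NOT EXISTS articles_after_insert
  AFTER INSERT ON articles
  BEGIN
    INSERT INTO embedding_queue (source_rowid) VALUES (new.rowid);
  END
`);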

When searching, I could join the source table with the vector table on rowid. That would allow me to filter the vector rows using the data in the source table. Something like

SELECT embeddings 
FROM "vector table" vt JOIN "source table" st ON vt.rowid = st.rowid
WHERE 
    st.species = 'ants' 
    AND st.journalYear BETWEEN 1990 AND 2002

You get the idea…

This way you would not have to build some kind of universal translator for metadata filters, which would be a very tedious and needlessly complicated responsibility for every db you have chosen to support. Those using libSQL would be able to use its advanced SQL filtering capabilities, and those using other solutions might be able to use capabilities specific to their stack.

embedJs is really wonderful. Dipping my toes in it helped me understand the entire search toolchain. But the very ease and comprehensiveness that embedJs strives to provide shields the underlying power from users who might want to customize the toolchain.

I hope the above makes sense.

punkish (Author) commented Jan 7, 2025

An additional note: as I mentioned in my OP, the current search takes way too long. With only 5000 articles worth of embeddings, the LLM's responses take between 6 and 16 seconds. I need this to come down to a few hundred milliseconds for it to be useful. Being able to customize my libSQL instance means I can tune it for high performance with all the right indexes.
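
For example, something along these lines (an untested sketch; the libsql_vector_idx(), vector32() and vector_distance_cos() calls are my reading of libSQL's native vector search docs and may need checking, and embedQuery() is a placeholder for however the query gets embedded):

import { createClient } from "@libsql/client";

const db = createClient({ url: "file:local.db" });

// Ordinary B-tree index on the source-table columns used in WHERE clauses.
await db.execute(
  "CREATE INDEX IF NOT EXISTS idx_articles_species_year ON articles (species, journalYear)",
);

// libSQL's native vector search also offers an ANN index over F32_BLOB
// columns; libsql_vector_idx() is my reading of its docs.
await db.execute(
  "CREATE INDEX IF NOT EXISTS idx_vectors_ann ON vectors (libsql_vector_idx(vector))",
);

// Placeholder for the embedding step (e.g. Ollama producing a 768-dim vector).
declare function embedQuery(text: string): Promise<number[]>;

// The nearest-neighbour ranking itself can then stay entirely in SQL;
// vector32() builds an F32_BLOB from a JSON-style array of floats.
const queryEmbedding = await embedQuery("tell me about ants");
const top = await db.execute({
  sql: `SELECT id, pageContent
        FROM vectors
        ORDER BY vector_distance_cos(vector, vector32(?))
        LIMIT 10`,
  args: [JSON.stringify(queryEmbedding)],
});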

punkish (Author) commented Jan 7, 2025

I see that the vectors table structure (in SQLite) is as follows

CREATE TABLE vectors (
    id              TEXT PRIMARY KEY,
    pageContent     TEXT UNIQUE,
    uniqueLoaderId  TEXT NOT NULL,
    source          TEXT NOT NULL,
    vector          F32_BLOB(768),
    metadata        TEXT
);

I would like to be able to instruct embedJs to add a rowid column to the above so that I could insert the source table's rowid into that column:

CREATE TABLE vectors (
    id              TEXT PRIMARY KEY,
    pageContent     TEXT UNIQUE,
    uniqueLoaderId  TEXT NOT NULL,
    source          TEXT NOT NULL,
    vector          F32_BLOB(768),
    metadata        TEXT,
    rowid           INTEGER NOT NULL
);

But, as I mentioned above, this is too specific to my needs, and perhaps does not belong in a more universal package such as embedJs. So a better approach might be to expose the appLoader interface so I can customize it myself and use transactions. I timed the process again with 2000 docs, and it took an average of about a minute and 20 seconds per 500 documents.

➜  zai node index.js
building app
loading 0-499 docs: 1:27.696 (m:ss.mmm)
loading 500-999 docs: 1:04.585 (m:ss.mmm)
loading 1000-1499 docs: 1:12.623 (m:ss.mmm)
loading 1500-1999 docs: 1:20.226 (m:ss.mmm)
loading 2000-2000 docs: 19.614ms

That is just way too slow; I would be waiting forever to process a million docs. Exposing the API would allow me to use the full capabilities of my stack to my advantage, as transactions would make a world of difference here.
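
To illustrate what I mean by transactions, a sketch (the table shape follows my proposed extra rowid column above, vector32() is again my reading of the libSQL docs, and the libSQL client's batch() runs the statements as a single transaction):

import { createClient } from "@libsql/client";

const db = createClient({ url: "file:local.db" });

// Placeholder for chunks that have already been embedded elsewhere.
interface EmbeddedChunk {
  id: string;
  pageContent: string;
  uniqueLoaderId: string;
  source: string;
  vector: number[];
  metadata: Record<string, unknown>;
  sourceRowid: number;
}

// Insert a whole batch in one transaction instead of one round trip per
// document; this is where I expect the big speed-up for 1M docs.
async function insertBatch(chunks: EmbeddedChunk[]) {
  await db.batch(
    chunks.map((c) => ({
      sql: `INSERT INTO vectors (id, pageContent, uniqueLoaderId, source, vector, metadata, rowid)
            VALUES (?, ?, ?, ?, vector32(?), ?, ?)`,
      args: [
        c.id,
        c.pageContent,
        c.uniqueLoaderId,
        c.source,
        JSON.stringify(c.vector), // vector32() converts the JSON-style array to an F32_BLOB
        JSON.stringify(c.metadata),
        c.sourceRowid,
      ],
    })),
    "write",
  );
}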
