Improving Human-Readability of Similarity Search Results in FAISS-based Search Systems #1318

Open
Rajat-2001 opened this issue Jan 7, 2025 · 0 comments

Rajat-2001 commented Jan 7, 2025

Describe the bug

When performing similarity search using FAISS (Facebook AI Similarity Search), the results often come back as raw, low-level data that isn't easily readable or useful to a human.

For example, the output might look something like this:
Rank: 1, Distance: 1.629706859588623, Text: M *M 4M JM pM M M N qN N N N O TO \O ]O {O O P hP ~P P P IQ lQ Q Q Q Q FR XR ~R R R =S S S T T ;T |T T T T [U \U U U +V KV UV dV uV V V W W W W $X 4X X X X

Rank: The position of the item in the search results (1 is the most similar).
Distance: A numerical value indicating how close the result is to the query (lower distances generally mean more similarity).
Text: A string of seemingly random symbols and letters. This looks like raw vector data or some internal representation rather than the original text, and it is not human-readable.
In short, the search appears to return raw vector data or an internal representation, which needs additional processing or a mapping back to the source text (e.g., a string/text lookup for the nearest neighbors) to be interpretable.
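
For context, `index.search(queries, k)` on a FAISS index returns two arrays, distances and integer row IDs, and no text. Any `Text` field therefore has to come from a lookup done by the application. Below is a minimal sketch of such a lookup, assuming a hypothetical list `texts` whose order matches the order in which the embeddings were added to the index. `format_results` is an illustrative name, not a FAISS function, and plain lists stand in for the arrays so the sketch runs without FAISS installed.

```python
from typing import List, Sequence


def format_results(
    distances: Sequence[Sequence[float]],
    indices: Sequence[Sequence[int]],
    texts: Sequence[str],
) -> List[str]:
    """Turn (distances, indices), shaped like index.search output, into readable lines.

    `texts` is a hypothetical list where texts[i] is the source text of the
    i-th vector added to the index.
    """
    lines = []
    # Only the first query row is formatted here, to keep the sketch short.
    for rank, (dist, idx) in enumerate(zip(distances[0], indices[0]), start=1):
        idx = int(idx)
        if idx < 0 or idx >= len(texts):
            # FAISS pads with -1 when fewer than k neighbours are found.
            continue
        lines.append(f"Rank: {rank}, Distance: {float(dist):.4f}, Text: {texts[idx]}")
    return lines


# Stand-in values shaped like the output of `D, I = index.search(query, 3)`.
D = [[0.12, 0.57, 3.4e38]]
I = [[2, 0, -1]]
texts = ["FAISS stores vectors.", "Llamas are camelids.", "Indexes return row IDs."]
for line in format_results(D, I, texts):
    print(line)
```

If the `Text` column still looks like noise after a lookup of this kind, the problem is more likely in the stored texts or in their alignment with the embeddings than in the search itself.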

Minimal reproducible example

import time

import faiss
import numpy as np


def create_faiss_index():
    start_time = time.time()  # Start time for measuring the function's execution time

    embeddings_file = '/var/www/html/zsapiens/llama-models/models/data/embeddings.npy'

    # Check if embeddings file exists and is loaded correctly
    try:
        embeddings = np.load(embeddings_file, allow_pickle=True)
        if embeddings is None or embeddings.shape[0] == 0:
            raise ValueError("Embeddings file is empty or not loaded correctly.")
        print(f"Loaded {embeddings.shape[0]} embeddings.")
        print(f"Embeddings shape: {embeddings.shape}")  # Add this line to print the shape
    except Exception as e:
        print(f"Error loading embeddings: {e}")
        print("Regenerating embeddings...")
        # Regenerate embeddings if the file doesn't exist or is empty
        embeddings = create_embeddings()  # Replace with your actual embedding generation function
        np.save(embeddings_file, embeddings)
        print(f"Embeddings saved to {embeddings_file}.")

    # Create FAISS index on CPU
    try:
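        # FAISS expects a C-contiguous float32 array of shape (n, d)
        embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        index = faiss.IndexFlatL2(embeddings.shape[1])  # Exact search with (squared) L2 distance
        index.add(embeddings)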
        print(f"Added {embeddings.shape[0]} embeddings to the FAISS index.")
    except Exception as e:
        print(f"Error creating FAISS index: {e}")
        return

    # Serialize the FAISS index on the CPU
    try:
        index_file = '/var/www/html/zsapiens/llama-models/models/data/faiss_index.index'
        faiss.write_index(index, index_file)
        print(f"FAISS index created and saved to {index_file}.")
    except Exception as e:
        print(f"Error saving FAISS index: {e}")
        return

    end_time = time.time()  # End time
    execution_time = end_time - start_time  # Calculate execution time
    print(f"create_faiss_index executed in {execution_time:.2f} seconds.")
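
The function above persists only the vectors; nothing in it records which text each row came from. One possible approach, sketched under the assumption that the source chunks are available as a list of strings in the same order as the rows of `embeddings`, is to write them to a sidecar file next to the index. The file name `texts.json` and the helper names `save_texts` and `load_texts` are hypothetical.

```python
import json
import os
import tempfile
from typing import List


def save_texts(texts: List[str], path: str) -> None:
    """Write the source texts in row order; position i belongs to vector i."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(texts, f, ensure_ascii=False)


def load_texts(path: str, expected_count: int) -> List[str]:
    """Load the texts and check that their count matches the number of vectors."""
    with open(path, "r", encoding="utf-8") as f:
        texts = json.load(f)
    if len(texts) != expected_count:
        raise ValueError(
            f"{len(texts)} texts but {expected_count} vectors; mapping is misaligned."
        )
    return texts


# Demo in a temporary directory; in practice expected_count would be index.ntotal.
chunks = ["first chunk", "second chunk"]
with tempfile.TemporaryDirectory() as tmp:
    sidecar = os.path.join(tmp, "texts.json")  # hypothetical file name
    save_texts(chunks, sidecar)
    restored = load_texts(sidecar, expected_count=len(chunks))
print(restored)
```

The count check is there because a silent mismatch between the number of texts and the number of vectors would make every lookup return the wrong passage.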

Output

Rank: 1, Distance: 1.629706859588623, Text: M *M 4M JM pM M M N qN N N N O TO \O ]O {O O P hP ~P P P IQ lQ Q Q Q Q FR XR ~R R R =S S S T T ;T |T T T T [U \U U U +V KV UV dV uV V V W W W W $X 4X X X X
Rank: 2, Distance: 1.6545774936676025, Text: F F F F F G G H H PH RH nH I -I EI HI ZI I I I J J J J K K =L DL #M oM M M M ;N N N N sO O O LP P P P *Q 7Q TQ _Q Q Q R dR R ;S kS T KT T T T T T !U #U
Rank: 3, Distance: 1.6708736419677734, Text: K tL L (M 6M PM M M M 6N xN N N N MO XO O O O O P P P JQ ^Q Q R R S S S S S S S S T T T U JU U U U 1V 2V V W }W W W X X X UY bY Y Y Y >Z Z ?[ [ [
Rank: 4, Distance: 1.6997497081756592, Text: ` [ d m B | h n Y k w B H t \ w ( W & > 9 u ~ 6 u / 7
Rank: 5, Distance: 1.7031402587890625, Text: < c z W H D M % 0 " - 8 C N Y d o z ; .


Runtime Environment

  • Model: llama-3.1-8b
  • Using via Hugging Face?: no
  • OS: Ubuntu 22.04
  • GPU: 1 × NVIDIA RTX A6000 (48 GB VRAM)
  • FAISS version: (e.g., faiss-gpu)

Additional context
Expected: The output should contain human-readable data (e.g., the texts, objects, or descriptions of the nearest neighbors).
Actual: The output contains what looks like raw vector data (a string of random symbols), which isn't interpretable without further processing.
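
One way to narrow this down is to check whether the stored text chunks are already unreadable before they are embedded, for example because binary file content was decoded as text. That cause is an assumption; the report does not confirm it. Below is a rough heuristic sketch; the function name `looks_readable` and the 0.5 threshold are hypothetical choices.

```python
import string


def looks_readable(text: str, min_ratio: float = 0.5) -> bool:
    """Return True if enough whitespace-separated tokens look like words.

    A token counts as word-like if, after stripping surrounding punctuation,
    it is purely alphabetic and at least three characters long.
    """
    tokens = text.split()
    if not tokens:
        return False
    wordlike = 0
    for token in tokens:
        core = token.strip(string.punctuation)
        if len(core) >= 3 and core.isalpha():
            wordlike += 1
    return wordlike / len(tokens) >= min_ratio


# A fragment of the reported output versus an ordinary sentence.
garbled = r"M *M 4M JM pM M M N qN N N N O TO \O ]O {O O P hP ~P"
print(looks_readable(garbled))
print(looks_readable("The output should contain human-readable data."))
```

Running a check like this over the chunks before calling the embedding function would show whether the noise enters at text extraction or later, at lookup time.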
