Improving Human-Readability of Similarity Search Results in FAISS-based Search Systems #1318

Rajat-2001 · 2025-01-07T10:27:40Z

Describe the bug

When performing similarity search using FAISS (Facebook AI Similarity Search), the results often come back as raw, low-level data that isn't easily readable or useful to a human.

For example, the output might look something like this:
Rank: 1, Distance: 1.629706859588623, Text: M *M 4M JM pM M M N qN N N N O TO \O ]O {O O P hP ~P P P IQ lQ Q Q Q Q FR XR ~R R R =S S S T T ;T |T T T T [U \U U U +V KV UV dV uV V V W W W W $X 4X X X X

Rank: The position of the item in the search results (1 is the closest match).
Distance: A numerical value indicating how close the result is to the query. For an L2 index such as IndexFlatL2, lower distances mean greater similarity.
Text: A string of seemingly random symbols and letters, which is not human-readable.
In short, the results need additional processing before they are useful. Note that FAISS itself returns only distances and integer indices of the stored vectors; the Text field comes from mapping those indices back to stored strings, so the garbling may originate in that mapping or in the stored strings rather than in FAISS itself.
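
To illustrate the mapping step, here is a minimal sketch of turning one query's results into readable lines. It assumes the distances and indices are one row each of the `(D, I)` pair returned by `index.search(query, k)`, and that a list of passages exists in the same order as the embedding rows. The names `format_results` and `texts` are hypothetical and not part of FAISS, and the sample values are hand-written stand-ins for real search output.

```python
def format_results(distances, indices, texts):
    """Map one query's FAISS results back to their source texts.

    distances and indices are one row each of the (D, I) pair returned by
    index.search(query, k). texts is a hypothetical list whose position i
    holds the passage that was embedded into row i of the index.
    """
    lines = []
    rank = 0
    for dist, idx in zip(distances, indices):
        idx = int(idx)
        if idx < 0:  # FAISS pads with -1 when fewer than k results exist
            continue
        rank += 1
        lines.append(f"Rank: {rank}, Distance: {float(dist):.4f}, Text: {texts[idx]}")
    return lines


# Hand-written stand-ins for D[0] and I[0]. With a real index they would come from:
#   D, I = index.search(query_vectors, k)
texts = ["How to reset a password", "Billing and invoices", "Contact support"]
D_row = [0.12, 0.57, 3.4e38]
I_row = [2, 0, -1]

for line in format_results(D_row, I_row, texts):
    print(line)
```

If the printed text is still unreadable with a lookup like this, the stored passages themselves are likely the problem rather than the search.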

Minimal reproducible example


import time

import faiss
import numpy as np


def create_faiss_index():
    start_time = time.time()  # Start time for measuring the function's execution time

    embeddings_file = '/var/www/html/zsapiens/llama-models/models/data/embeddings.npy'

    # Check if embeddings file exists and is loaded correctly
    try:
        embeddings = np.load(embeddings_file, allow_pickle=True)
        if embeddings is None or embeddings.shape[0] == 0:
            raise ValueError("Embeddings file is empty or not loaded correctly.")
        print(f"Loaded {embeddings.shape[0]} embeddings.")
        print(f"Embeddings shape: {embeddings.shape}")  # Expected: (num_vectors, dim)
    except Exception as e:
        print(f"Error loading embeddings: {e}")
        print("Regenerating embeddings...")
        # Regenerate embeddings if the file doesn't exist or is empty
        embeddings = create_embeddings()  # Replace with your actual embedding generation function
        np.save(embeddings_file, embeddings)
        print(f"Embeddings saved to {embeddings_file}.")

    # Create FAISS index on CPU
    try:
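        # FAISS expects a C-contiguous float32 array of shape (num_vectors, dim)
        embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        index = faiss.IndexFlatL2(embeddings.shape[1])  # Exact search with L2 distance
        index.add(embeddings)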
        print(f"Added {embeddings.shape[0]} embeddings to the FAISS index.")
    except Exception as e:
        print(f"Error creating FAISS index: {e}")
        return

    # Serialize the FAISS index on the CPU
    try:
        index_file = '/var/www/html/zsapiens/llama-models/models/data/faiss_index.index'
        faiss.write_index(index, index_file)
        print(f"FAISS index created and saved to {index_file}.")
    except Exception as e:
        print(f"Error saving FAISS index: {e}")
        return

    end_time = time.time()  # End time
    execution_time = end_time - start_time  # Calculate execution time
    print(f"create_faiss_index executed in {execution_time:.2f} seconds.")
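
The function above saves only the vectors, so the source passages have to be stored somewhere else for results to be readable later. One possible approach, sketched here under assumptions, is a JSON sidecar file written in the same order as the embedding rows. The helpers `save_texts` and `load_texts` and the file name `texts.json` are hypothetical. The count check relies on `index.ntotal`, which FAISS exposes as the number of stored vectors.

```python
import json
import os
import tempfile


def save_texts(texts, path):
    """Write the passages as a JSON list, in the same order as the embedding rows."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(list(texts), f, ensure_ascii=False)


def load_texts(path, expected_count=None):
    """Read the passages back, optionally checking the count against index.ntotal."""
    with open(path, encoding="utf-8") as f:
        texts = json.load(f)
    if expected_count is not None and len(texts) != expected_count:
        raise ValueError(f"{len(texts)} texts but {expected_count} vectors in the index")
    return texts


# Demo with a temporary directory. In practice the file could sit next to the index file.
with tempfile.TemporaryDirectory() as tmp:
    texts_file = os.path.join(tmp, "texts.json")  # hypothetical file name
    save_texts(["first passage", "second passage"], texts_file)
    restored = load_texts(texts_file, expected_count=2)  # pass index.ntotal here
    print(restored)
```

Keeping the order identical matters because a flat index identifies each vector only by its row position.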

Output

Rank: 1, Distance: 1.629706859588623, Text: M *M 4M JM pM M M N qN N N N O TO \O ]O {O O P hP ~P P P IQ lQ Q Q Q Q FR XR ~R R R =S S S T T ;T |T T T T [U \U U U +V KV UV dV uV V V W W W W $X 4X X X X
Rank: 2, Distance: 1.6545774936676025, Text: F F F F F G G H H PH RH nH I -I EI HI ZI I I I J J J J K K =L DL #M oM M M M ;N N N N sO O O LP P P P *Q 7Q TQ _Q Q Q R dR R ;S kS T KT T T T T T !U #U
Rank: 3, Distance: 1.6708736419677734, Text: K tL L (M 6M PM M M M 6N xN N N N MO XO O O O O P P P JQ ^Q Q R R S S S S S S S S T T T U JU U U U 1V 2V V W }W W W X X X UY bY Y Y Y >Z Z ?[ [ [
Rank: 4, Distance: 1.6997497081756592, Text: ` [ d m B | h n Y k w B H t \ w ( W & > 9 u ~ 6 u / 7
Rank: 5, Distance: 1.7031402587890625, Text: < c z W H D M % 0 " - 8 C N Y d o z ; .
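
The Text fields above consist almost entirely of one- and two-character tokens, which may mean the stored passages were already unreadable before they were embedded (for example, binary content decoded as text). This is only a guess. A rough way to check, assuming the passages are available as Python strings, is to measure how many tokens look like words. The function `readable_ratio` and its thresholds are hypothetical, a heuristic sketch rather than a reliable detector.

```python
import string


def readable_ratio(text):
    """Return the fraction of whitespace-separated tokens that look like words.

    A token counts as a word if, after stripping surrounding punctuation,
    it is alphabetic and at least 3 characters long. This is a rough heuristic.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    words = 0
    for token in tokens:
        core = token.strip(string.punctuation)
        if len(core) >= 3 and core.isalpha():
            words += 1
    return words / len(tokens)


garbled = "M *M 4M JM pM M M N qN N N N O TO"  # excerpt from the Rank 1 result
normal = "FAISS returns the nearest neighbors of a query vector."

for label, sample in [("garbled", garbled), ("normal", normal)]:
    print(f"{label}: {readable_ratio(sample):.2f}")
```

Running a check like this over the passages before calling the embedding function would show whether the problem starts at text extraction.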


Runtime Environment

Model: llama-3.1-8b
Using via huggingface?: no
OS: Ubuntu 22.04
GPU VRAM: 48 GB (NVIDIA RTX A6000)
Number of GPUs: 1
GPU Make: NVIDIA RTX A6000
FAISS Version: (e.g., faiss-gpu)

Additional context
Expected: The output should contain human-readable data (e.g., nearest neighbor texts, objects, or descriptions).
Actual: The output contains raw vector data (e.g., a string of random symbols) which isn't interpretable without further processing.
