Port search_documents_by_keywords to C-Top2Vec #366

Open
wants to merge 2 commits into
base: master

CodingKoopa

This PR enables support for searching documents by keyword for C-Top2Vec models.

The first commit adds a Top2Vec parameter named combine_ngram_vocab that adds both single words and n-grams to the phrase embeddings. Enabling it restores the original behavior described in #364. The motivation is that we don't use the top topic words (and when we do, we manually select representative words or phrases), but we do want to be able to search by either word or phrase.

The second commit enables search_documents_by_keywords() to work with C-Top2Vec. It relies on the same self.document_vectors member variable as the original Top2Vec, which for C-Top2Vec holds the multi-vector document representation. Because the splitting of the original documents is a byproduct of the sliding-window average, I have introduced another member, self.multi_document_labels, to maintain the mapping from each multi-document vector to its original document.
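To illustrate the mapping idea, here is a minimal standalone sketch, not the code in this PR. It assumes document_vectors holds one vector per sliding-window chunk and multi_document_labels holds the index of the original document for each chunk. The function name search_documents, the cosine helper, and the choice to score a document by its best-matching chunk are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_documents(keyword_vector, document_vectors, multi_document_labels, num_docs):
    """Rank original documents by their best-matching chunk vector.

    Hypothetical helper: taking the max over chunks is an assumption,
    not necessarily the aggregation used in the PR.
    """
    best = {}
    for vec, doc_id in zip(document_vectors, multi_document_labels):
        score = cosine(keyword_vector, vec)
        # Keep only the highest chunk score seen so far for each original document.
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:num_docs]


# Toy data: four chunk vectors belonging to three original documents.
document_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.0, 1.0]]
multi_document_labels = [0, 0, 1, 2]

results = search_documents([1.0, 0.0], document_vectors, multi_document_labels, 2)
print(results)
```

With the toy data, document 0 ranks first because one of its two chunks matches the keyword vector exactly, which shows why the label mapping is needed to report results per original document rather than per chunk.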
