Port search_documents_by_keywords to C-Top2Vec #366
This PR enables support for searching documents by keyword for C-Top2Vec models.
The first commit adds a `Top2Vec` parameter named `combine_ngram_vocab` to add both single words and n-grams to the phrase embeddings. If this is enabled, then the original behavior described in #364 is restored. The motivation is that we don't use the top topic words (and when we do, we select representative words/phrases manually), but we do want to be able to search by either word or phrase.

The second commit enables `search_documents_by_keywords()` to work with C-Top2Vec. This relies on the same `self.document_vectors` member variable as the original Top2Vec, which here holds the multi-vector document representation. Since the splitting of the original documents is a byproduct of the sliding window average, I have introduced another member, `self.multi_document_labels`, to maintain the mapping of each multi-document vector to the original document it was derived from.
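
To illustrate the role of the label mapping, here is a minimal NumPy sketch, not the code in this PR. It assumes that `multi_document_labels` is an integer array holding one original-document index per row of `document_vectors`, and that a document's score is the best cosine similarity among its window vectors. The function name `rank_documents` and the max aggregation are hypothetical choices made for illustration only.

```python
import numpy as np

def rank_documents(keyword_vec, document_vectors, multi_document_labels, num_docs):
    """Rank original documents by their best-matching window vector.

    Hypothetical helper: the max-over-windows aggregation is an assumption,
    not necessarily what the PR implements.
    """
    vectors = np.asarray(document_vectors, dtype=float)
    labels = np.asarray(multi_document_labels, dtype=int)
    query = np.asarray(keyword_vec, dtype=float)

    # Cosine similarity between the query and every window vector.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    sims = vectors @ query

    # Collapse window scores to one score per original document (max).
    best = np.full(labels.max() + 1, -np.inf)
    np.maximum.at(best, labels, sims)

    # Highest-scoring documents first.
    order = np.argsort(-best)[:num_docs]
    return order, best[order]

# Toy data: four window vectors belonging to three original documents.
doc_ids, scores = rank_documents(
    keyword_vec=[0.0, 1.0],
    document_vectors=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]],
    multi_document_labels=[0, 0, 1, 2],
    num_docs=2,
)
```

Without the labels, a search over the multi-vector representation could only return window indices; the mapping is what lets the results refer back to the documents the caller originally supplied.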