How to compute similarity between vectors? #25

koheiw · 2024-11-14T07:20:15Z

I noticed that the library computes the similarity between vectors in an unusual way. The scores are not really cosine similarities but square roots of cross-products divided by the vector size. @jwijffels, do you think this is intentional?

word2vec/src/word2vec/include/word2vec.hpp

Lines 201 to 207 in 96b0e04

    
    float ret = 0.0f;
    for (uint16_t i = 0; i < m_vectorSize; ++i) {
        ret += _what[i] * _with[i];
    }
    if (ret > 0.0f) {
        return std::sqrt(ret / m_vectorSize);
    }

library(udpipe)
#> Warning: package 'udpipe' was built under R version 4.4.2
library(word2vec)

data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)

model <- word2vec::word2vec(x = x, dim = 15, iter = 20)
emb <- as.matrix(model)

pred <- predict(model, emb["bus",], type = "nearest", top_n = 10)
pred
#>        term similarity rank
#> 1      tram  0.9866303    1
#> 2      voet  0.9863607    2
#> 3        10  0.9838170    3
#> 4        15  0.9835117    4
#> 5       min  0.9818841    5
#> 6     lopen  0.9809218    6
#> 7     vanaf  0.9808209    7
#> 8  parkeren  0.9791722    8
#> 9        20  0.9773416    9
#> 10     auto  0.9745490   10

# similarity in the library
cross <- rowSums(sqrt(crossprod(t(emb), emb["bus",]) / ncol(emb)))
#> Warning in sqrt(crossprod(t(emb), emb["bus", ])/ncol(emb)): NaNs produced
head(sort(cross, decreasing = TRUE))
#>       bus      tram      voet        10        15       min 
#> 1.0000001 0.9866303 0.9863608 0.9838170 0.9835117 0.9818842

# cosine similarity 
cosine <- Matrix::rowSums(proxyC::simil(emb, emb["bus",,drop = FALSE]))
head(sort(cosine, decreasing = TRUE))
#>       bus      tram      voet        10        15       min 
#> 1.0000000 0.9734393 0.9729074 0.9678959 0.9672952 0.9640965

# they are very similar but not the same
cor(cross, cosine, use = "pair")
#> [1] 0.9825781
cor(cross, cosine, use = "pair", method = "spearman")    
#> [1] 1

See how cosine similarity is computed: https://koheiw.github.io/proxyC/articles/measures.html#similarity-measures
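
The relationship between the two scores can be sketched in C++. This is only an illustration, not the library's code: `cosine_similarity`, `scaled_crossprod_score` and `scale_to_sqrt_dim` are hypothetical names, and the sketch assumes the stored embeddings are scaled so that the sum of squares of each vector equals its dimension (which the self-similarity of 1.0000001 for "bus" above suggests). Under that assumption the library-style score is the square root of the cosine similarity whenever the cosine is positive.

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Plain cosine similarity: dot(a, b) / (|a| * |b|).
double cosine_similarity(const std::vector<double>& a, const std::vector<double>& b) {
    const double dot = std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
    const double norm_a = std::sqrt(std::inner_product(a.begin(), a.end(), a.begin(), 0.0));
    const double norm_b = std::sqrt(std::inner_product(b.begin(), b.end(), b.begin(), 0.0));
    return dot / (norm_a * norm_b);
}

// Score in the style of the quoted snippet: sqrt(dot(a, b) / n) for a positive dot product.
// Returning NaN otherwise is a choice made for this sketch (it mirrors the R reproduction);
// the quoted lines do not show what the library does in that branch.
double scaled_crossprod_score(const std::vector<double>& a, const std::vector<double>& b) {
    const double dot = std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
    if (dot > 0.0) {
        return std::sqrt(dot / static_cast<double>(a.size()));
    }
    return std::nan("");
}

// Hypothetical helper: rescale v so that the sum of its squared elements equals v.size().
// This is the assumed scaling of the stored embeddings, not a documented library step.
std::vector<double> scale_to_sqrt_dim(std::vector<double> v) {
    const double norm = std::sqrt(std::inner_product(v.begin(), v.end(), v.begin(), 0.0));
    const double factor = std::sqrt(static_cast<double>(v.size())) / norm;
    for (double& x : v) {
        x *= factor;
    }
    return v;
}
```

This is consistent with the output above: sqrt(0.9734393) ≈ 0.98663 for "tram" and sqrt(0.9729074) ≈ 0.98636 for "voet". Because the square root is monotone, both scores rank terms identically (Spearman correlation of 1), while the Pearson correlation stays below 1.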

jwijffels · 2024-11-14T08:20:49Z

I think it's intentional; see #5.
I've added cosine in ?word2vec_similarity and documented the behaviour a bit at

koheiw · 2024-11-14T08:27:57Z

Thanks for pointing me to the issue and the function. It would be nice to use your word2vec_similarity() function in predict(). The code for similarity computation makes the C++ library complicated. I removed it entirely from my wordvector package.

jwijffels · 2024-11-14T08:32:54Z

Yes, it's a bit silly that that part is in C++. It should indeed be easier to call it in predict.

koheiw · 2024-11-14T08:35:23Z

I can help you with these things once the development of my package is completed. I hope the first version will come soon.
