How to compute similarity between vectors? #25

Open
koheiw opened this issue Nov 14, 2024 · 4 comments

koheiw commented Nov 14, 2024

I noticed that the library computes the similarity between vectors in an unusual way. The scores are not cosine similarity but the square root of the cross-product divided by the vector size. @jwijffels, do you think this is intentional?

float ret = 0.0f;
for (uint16_t i = 0; i < m_vectorSize; ++i) {
    ret += _what[i] * _with[i];
}
if (ret > 0.0f) {
    return std::sqrt(ret / m_vectorSize);
}
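To make the difference concrete, here is a minimal C++ sketch that puts the score from the excerpt above next to a plain cosine similarity. The function names `scaled_dot_score` and `cosine_similarity` are hypothetical and not part of the library. Returning 0 for a non-positive sum is an assumption, since the excerpt does not show that branch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the excerpt: square root of the dot product
// divided by the vector size. Returning 0 for non-positive sums is an
// assumption; the excerpt does not show what happens in that case.
float scaled_dot_score(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size() && !a.empty());
    float ret = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        ret += a[i] * b[i];
    }
    if (ret > 0.0f) {
        return std::sqrt(ret / static_cast<float>(a.size()));
    }
    return 0.0f;
}

// Standard cosine similarity: dot product divided by the product of the
// Euclidean norms. Returns 0 when either vector has zero length.
float cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size() && !a.empty());
    float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0f || norm_b == 0.0f) {
        return 0.0f;
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

For vectors that are not rescaled, the two functions can differ a lot: with `{3, 4}` and `{4, 3}` the cosine is 24/25 while the scaled score is the square root of 12. They only become comparable when the inputs have a fixed norm.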

library(udpipe)
#> Warning: package 'udpipe' was built under R version 4.4.2
library(word2vec)

data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)

model <- word2vec::word2vec(x = x, dim = 15, iter = 20)
emb <- as.matrix(model)

pred <- predict(model, emb["bus",], type = "nearest", top_n = 10)
pred
#>        term similarity rank
#> 1      tram  0.9866303    1
#> 2      voet  0.9863607    2
#> 3        10  0.9838170    3
#> 4        15  0.9835117    4
#> 5       min  0.9818841    5
#> 6     lopen  0.9809218    6
#> 7     vanaf  0.9808209    7
#> 8  parkeren  0.9791722    8
#> 9        20  0.9773416    9
#> 10     auto  0.9745490   10

# similarity in the library
cross <- rowSums(sqrt(crossprod(t(emb), emb["bus",]) / ncol(emb)))
#> Warning in sqrt(crossprod(t(emb), emb["bus", ])/ncol(emb)): NaNs produced
head(sort(cross, decreasing = TRUE))
#>       bus      tram      voet        10        15       min 
#> 1.0000001 0.9866303 0.9863608 0.9838170 0.9835117 0.9818842

# cosine similarity 
cosine <- Matrix::rowSums(proxyC::simil(emb, emb["bus",,drop = FALSE]))
head(sort(cosine, decreasing = TRUE))
#>       bus      tram      voet        10        15       min 
#> 1.0000000 0.9734393 0.9729074 0.9678959 0.9672952 0.9640965

# they are very similar but not the same
cor(cross, cosine, use = "pair")
#> [1] 0.9825781
cor(cross, cosine, use = "pair", method = "spearman")    
#> [1] 1

See how cosine similarity is computed: https://koheiw.github.io/proxyC/articles/measures.html#similarity-measures
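The output above is consistent with the library score being the square root of the cosine similarity. This is a sketch under one assumption that is inferred from the numbers rather than confirmed in the library source: the stored vectors have squared norm equal to the dimension $n$, which is what the self-score of 1.0000001 for "bus" suggests. The symbols $a$, $b$, $n$ and $s_{\text{lib}}$ are introduced here for illustration.

```latex
s_{\text{lib}}(a, b) = \sqrt{\frac{a \cdot b}{n}}, \qquad a \cdot b > 0

\text{If } \lVert a \rVert^2 = \lVert b \rVert^2 = n, \text{ then }
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{a \cdot b}{n}
\quad\Longrightarrow\quad
s_{\text{lib}}(a, b) = \sqrt{\cos(a, b)}
```

The posted values fit this: for "tram", $0.9866303^2 \approx 0.9734393$, and for "voet", $0.9863607^2 \approx 0.9729074$, which match the cosine column. Because the square root is monotone, the ranking is unchanged for positive scores, which would explain a Spearman correlation of 1 alongside a Pearson correlation below 1.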

jwijffels commented Nov 14, 2024

koheiw commented Nov 14, 2024

Thanks for pointing me to the issue and the function. It would be nice to use your word2vec_similarity() function in predict(). The similarity code makes the C++ library complicated, so I removed it entirely from my wordvector package.

@jwijffels

Yes, it's a bit silly that that part is in C++. It should indeed be easier to call it in predict.

koheiw commented Nov 14, 2024

I can help you with these things once the development of my package is completed. I hope the first version will come soon.
