Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broadcasting bug in repick_unused_centers() for K-means #283

Merged
merged 2 commits into from
Jan 6, 2025

Conversation

AzamatB
Copy link
Contributor

@AzamatB AzamatB commented Sep 7, 2024

No description provided.

@codecov-commenter
Copy link

codecov-commenter commented Sep 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.40%. Comparing base (b4df21a) to head (5f280eb).
Report is 16 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #283      +/-   ##
==========================================
- Coverage   95.40%   95.40%   -0.01%     
==========================================
  Files          20       19       -1     
  Lines        1503     1500       -3     
==========================================
- Hits         1434     1431       -3     
  Misses         69       69              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@AzamatB
Copy link
Contributor Author

AzamatB commented Sep 21, 2024

This is a fix to a pretty significant correctness bug. Can we get this merged ASAP?

@alyst
Copy link
Member

alyst commented Dec 1, 2024

@AzamatB Thank you for the fix and sorry for very long delay in reviewing it! Do you have a test case for the fix?

@alyst alyst added the bug label Dec 1, 2024
@alyst
Copy link
Member

alyst commented Jan 6, 2025

It is clearly a bug, it affects the weights of points during empty clusters reinitialization.
But from what I can understand, it should not significantly affect the correctness of the results.
In some rare cases it can lead to the same point becoming the center of two unused clusters -- but it is hard to construct this case.
We should definitely fix it, but it would be nice to have a test case that reveals the problematic behaviour.
So if you discovered this bug because kmeans() generated wrong results for your data, it would be nice to have that example.

@alyst
Copy link
Member

alyst commented Jan 6, 2025

It's actually not easy at all to find the case when this bug would result in a different clustering (and that does not mean at all that the total cost would be worse) -- just because the situation where you would have to re-pick an unused cluster is rare, and re-picking 2 unused clusters (that's where the bug can start manifesting) is very rare. FWIW, here's the one that I have found:

Z = Float32[
    0.7469281 8.85936 26.104477 17.041048 -2.9290304 3.6850576 9.111796 -3.7082646 28.021303 -1.7768137 15.129808 10.813229 -0.37364542 9.3696785 0.058578808 16.494686 26.311668 26.63891 8.56788 24.445705 26.41015 9.561551 16.240145 15.186173 26.556469 26.59538 2.0695605 -0.21763714 -0.76779276 11.00544 -2.1536446 25.703747 24.663855 8.074589 26.758509 25.25317 1.5967612 0.31046262 0.55812395 8.941144 9.466517 25.507006 26.757868 2.425164 25.087585 26.443806 26.326286 -2.9491603 28.033964 9.664289 26.4217 23.865585 13.599952 12.534004 26.082083 9.602425 13.989668 -3.7962832 3.0748796 0.32880965 2.7096572 17.04881 9.982845 1.8529501 27.728786 -1.7613807 -4.2353554 0.8132015 24.837885 26.31062 1.0547543 1.3947017 3.0518942 -2.6900654 24.62906 -1.4229952 25.314922 10.536686 1.6732222 12.771361 12.719584 12.273293 15.357906 25.722084 24.90772 26.05082 1.7522457 2.5748842 -1.2829331 -2.350809 9.122712 15.469745 11.674798 2.4657478 -0.29506856 15.895254 3.0172286 10.194327 -1.0089047 16.014812
    10.079329 18.360752 9.28761 -10.738691 -11.377904 14.796508 14.586522 -9.868497 9.238739 -9.8855295 -9.807078 17.22134 9.412983 -1.2908658 6.4643354 -10.106296 10.545128 11.217348 16.879663 -3.241572 9.065372 -3.4642792 -10.768875 -11.025509 9.280065 6.852204 10.405429 10.598914 9.294289 15.889821 9.373185 -3.4529555 -4.348801 15.373303 -4.503598 7.714737 12.660385 10.243679 10.298691 14.378907 -3.7512515 -2.8529565 -2.5031896 13.335922 -2.0082817 8.725614 9.936363 -10.26723 8.548306 14.545052 -2.9661767 -3.8050623 -2.4869242 -1.7874911 8.777304 -2.453476 -1.549685 -10.284147 13.414661 10.497435 14.179338 -10.703788 15.013265 10.063806 -5.3631563 7.3521304 -9.017388 10.804814 -2.910087 -3.1865885 9.722448 14.545546 13.292266 -8.730829 -3.5044644 10.917173 8.672507 15.791334 12.100385 -2.5765386 -4.6828575 -2.410526 -11.11536 -4.988103 -4.856656 -3.0215774 11.570467 13.921742 6.850209 9.268251 -1.5591162 -3.2451906 15.143096 9.793272 11.128278 -10.880159 13.058364 -2.2190018 8.365673 -11.913434
]

rng = StableRNG(232424)
res = kmeans(Z, 10; rng=rng)

@alyst alyst merged commit a3768b6 into JuliaStats:master Jan 6, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants