switch to correctly working Jaro implementation #5235
use rapidfuzz::distance::jaro;
let scorer = jaro::BatchComparator::new(v.chars());
// Confidence of 0.7 so that bar -> baz is suggested
let args = jaro::Args::default().score_cutoff(0.7);
- use rapidfuzz::distance::jaro;
- let scorer = jaro::BatchComparator::new(v.chars());
- // Confidence of 0.7 so that bar -> baz is suggested
- let args = jaro::Args::default().score_cutoff(0.7);
+ use rapidfuzz::distance::jaro_winkler;
+ let scorer = jaro_winkler::BatchComparator::new(v.chars());
+ // Confidence of 0.7 so that bar -> baz is suggested
+ let args = jaro_winkler::Args::default().score_cutoff(0.8);
This could be reverted to the original behavior using Jaro-Winkler as well
The linked issue only seems to mention Jaro-Winkler.
For clap's use cases, binary size and build times would be more important measures, and those tend to be negatively impacted by performance optimizations. Overall, I'm hesitant on this change. On top of binary size measurements, I'd want test cases added (first commit in the PR showing current behavior, second commit changing the algorithm and showing the change in behavior) to show what problem this is solving. Even then, I'm a bit hesitant as this package has no other dependents and I'm not up for auditing it myself. Clap has a lot of people depending on it. If anything, I'd be tempted to copy/paste the algorithm used by cargo, which is copy/pasted out of rustc.
Yes, but I did check what actually causes the bug, which is https://github.com/dguo/strsim-rs/blob/65eac453cbd10ba4e13273002c843e95c81ae93f/src/lib.rs#L114-L117. The transpositions can't be counted in the same loop. Instead it should create two vectors to tag common positions and then take a second pass over them in the end to count the transpositions.
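For illustration, the two-pass approach described above can be sketched as follows. This is a minimal textbook-style version written for this discussion, not the code from strsim or rapidfuzz; the function name `jaro_two_pass` is hypothetical.

```rust
/// Jaro similarity in [0.0, 1.0], computed in two passes.
/// Illustrative sketch only; `jaro_two_pass` is a hypothetical name.
fn jaro_two_pass(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }

    // Two characters can only match if their positions differ by at most `window`.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);

    // Pass 1: tag the common characters in both strings.
    let mut a_flags = vec![false; a.len()];
    let mut b_flags = vec![false; b.len()];
    let mut matches = 0usize;
    for i in 0..a.len() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_flags[j] && a[i] == b[j] {
                a_flags[i] = true;
                b_flags[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }

    // Pass 2: walk the tagged characters of both strings in order and
    // count the positions where they disagree (half-transpositions).
    let a_common = a.iter().zip(&a_flags).filter(|(_, f)| **f).map(|(c, _)| c);
    let b_common = b.iter().zip(&b_flags).filter(|(_, f)| **f).map(|(c, _)| c);
    let half_transpositions = a_common.zip(b_common).filter(|(x, y)| x != y).count();

    let m = matches as f64;
    let t = (half_transpositions / 2) as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}
```

With this sketch, `jaro_two_pass("bar", "baz")` comes out at roughly 0.78, which is above the 0.7 cutoff used in the diff.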
They appear to use the OSA distance / restricted Damerau-Levenshtein distance.
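For readers unfamiliar with the metric, here is a rough sketch of the OSA (optimal string alignment) distance using the standard full-matrix dynamic program. It is a generic textbook formulation, not the code from cargo or rustc, and the name `osa_distance` is hypothetical.

```rust
/// Optimal string alignment (restricted Damerau-Levenshtein) distance.
/// Illustrative sketch only; `osa_distance` is a hypothetical name.
fn osa_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let (n, m) = (a.len(), b.len());

    // d[i][j] = distance between the first i chars of `a` and the first j chars of `b`.
    let mut d = vec![vec![0usize; m + 1]; n + 1];
    for i in 0..=n {
        d[i][0] = i;
    }
    for j in 0..=m {
        d[0][j] = j;
    }

    for i in 1..=n {
        for j in 1..=m {
            let cost = usize::from(a[i - 1] != b[j - 1]);
            let mut best = (d[i - 1][j] + 1) // deletion
                .min(d[i][j - 1] + 1) // insertion
                .min(d[i - 1][j - 1] + cost); // substitution or match
            // Transposition of two adjacent characters counts as one edit.
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                best = best.min(d[i - 2][j - 2] + 1);
            }
            d[i][j] = best;
        }
    }
    d[n][m]
}
```

The "restricted" part is that no substring may be edited more than once, so `osa_distance("ca", "abc")` is 3, whereas the unrestricted Damerau-Levenshtein distance for that pair is 2.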
I did a quick check of binary size running
Not sure about compile time. Looking at
If you are leaning towards binary size over performance, the switch is probably not worth it for Jaro except for correctness, but you could achieve that e.g. by integrating a fixed implementation as well (the fix is not particularly complex, but strsim doesn't appear to be maintained anymore). For OSA the size overhead is smaller and the performance improvement is significantly better, so in case that is the preferred metric it might be more interesting. Anyway, feel free to close the PR if it's simply not of interest :)
Considering I peel off individual kb's, the extra performance is not worth the size.
I noticed clap switched from Jaro-Winkler to Jaro similarity due to "a bug" in strsim. However the Jaro similarity in strsim is broken as well, since it doesn't calculate transpositions correctly (rapidfuzz/strsim-rs#49).
This switches from strsim to rapidfuzz for string matching. Besides calculating the correct Jaro / Jaro-Winkler similarity, this should be faster. For the relatively short text lengths involved in commands, the metric should be around 2-4x faster, though I doubt this matters a whole lot for this usage.
As a disclaimer: I am the author of rapidfuzz.