
switch to correctly working Jaro implementation #5235

Closed
wants to merge 1 commit into from

Conversation

@maxbachmann (Author)

I noticed clap switched from Jaro-Winkler to Jaro similarity due to "a bug" in strsim. However, the Jaro similarity in strsim is broken as well, since it doesn't calculate transpositions correctly (rapidfuzz/strsim-rs#49).

This switches from strsim to rapidfuzz for string matching. Besides calculating the correct Jaro / Jaro-Winkler similarity, this should be faster: for the relatively short text lengths involved in command names, the metric should be around 2-4x faster, though I doubt this matters a whole lot for this usage.

As a disclaimer: I am the author of rapidfuzz.

Comment on lines +16 to +19

```rust
use rapidfuzz::distance::jaro;
let scorer = jaro::BatchComparator::new(v.chars());
// Confidence of 0.7 so that bar -> baz is suggested
let args = jaro::Args::default().score_cutoff(0.7);
```

Suggested change:

```diff
-use rapidfuzz::distance::jaro;
-let scorer = jaro::BatchComparator::new(v.chars());
-// Confidence of 0.7 so that bar -> baz is suggested
-let args = jaro::Args::default().score_cutoff(0.7);
+use rapidfuzz::distance::jaro_winkler;
+let scorer = jaro_winkler::BatchComparator::new(v.chars());
+// Confidence of 0.7 so that bar -> baz is suggested
+let args = jaro_winkler::Args::default().score_cutoff(0.8);
```

This could be reverted to the original behavior using Jaro-Winkler as well

@epage (Member) commented Dec 1, 2023

> I noticed clap switched from Jaro-Winkler to Jaro similarity due to "a bug" in strsim. However the Jaro similarity in strsim is broken as well, since it doesn't calculate transpositions correctly (rapidfuzz/strsim-rs#49).

The linked issue only seems to mention Jaro-Winkler.

> For the relatively short text lengths involved in commands the metric should be around 2-4x faster

For clap's use cases, binary size and build times would be more important measures, and those tend to be negatively impacted by performance optimizations.

Overall, I'm hesitant on this change. On top of binary size measurements, I'd want test cases added (first commit in PR showing current behavior, second commit changing the algorithm and showing the change in behavior) to show what problem this is solving. Even then, I'm a bit hesitant as this package has no other dependents and I'm not up for auditing it myself. Clap has a lot of people depending on it. If anything, I'd be tempted to copy/paste the algorithm used by cargo which is copy/pasted out of rustc.

@maxbachmann (Author)

> The linked issue only seems to mention Jaro-Winkler.

Yes, but I did check what actually causes the bug: https://github.com/dguo/strsim-rs/blob/65eac453cbd10ba4e13273002c843e95c81ae93f/src/lib.rs#L114-L117. The transpositions can't be counted in the same loop that finds the matches. Instead, the implementation should create two vectors that tag the common (matched) positions in each string, and then take a second pass over them at the end to count the transpositions.
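The two-pass approach described above can be sketched as follows. This is a minimal, illustrative Jaro implementation for a single comparison — not rapidfuzz's or strsim's actual code, and the function name `jaro` is just for this sketch:

```rust
// Minimal Jaro similarity sketch: tag matches first, count
// transpositions in a separate second pass.
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    // Characters match only within this sliding window.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);

    // First pass: tag common (matched) positions in both strings.
    let mut a_match = vec![false; a.len()];
    let mut b_match = vec![false; b.len()];
    let mut matches = 0usize;
    for (i, &ca) in a.iter().enumerate() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_match[j] && b[j] == ca {
                a_match[i] = true;
                b_match[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }

    // Second pass: walk the matched characters of both strings in
    // order; each mismatched pair counts toward transpositions
    // (two mismatched positions = one transposition).
    let mut mismatched = 0usize;
    let mut j = 0usize;
    for i in 0..a.len() {
        if a_match[i] {
            while !b_match[j] {
                j += 1;
            }
            if a[i] != b[j] {
                mismatched += 1;
            }
            j += 1;
        }
    }
    let m = matches as f64;
    let t = (mismatched / 2) as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

fn main() {
    // "martha" vs "marhta": 6 matches, one transposition (th <-> ht)
    println!("{}", jaro("martha", "marhta")); // ≈ 0.9444
    println!("{}", jaro("bar", "baz")); // ≈ 0.7778, above the 0.7 cutoff
}
```

Counting transpositions inside the match loop, as the linked strsim code does, conflates the two passes and is exactly the bug being described here.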

> If anything, I'd be tempted to copy/paste the algorithm used by cargo which is copy/pasted out of rustc.

They appear to use the OSA distance (restricted Damerau-Levenshtein distance).
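For reference, OSA is a small dynamic-programming recurrence: Levenshtein edits plus an adjacent-transposition case, with the restriction that no substring is edited more than once. A sketch (illustrative only — not cargo's or rustc's actual code; the name `osa_distance` is made up for this example):

```rust
// OSA (restricted Damerau-Levenshtein) distance sketch.
fn osa_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let (n, m) = (a.len(), b.len());
    // dp[i][j] = edits needed to turn a[..i] into b[..j]
    let mut dp = vec![vec![0usize; m + 1]; n + 1];
    for i in 0..=n {
        dp[i][0] = i;
    }
    for j in 0..=m {
        dp[0][j] = j;
    }
    for i in 1..=n {
        for j in 1..=m {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j] + 1) // deletion
                .min(dp[i][j - 1] + 1) // insertion
                .min(dp[i - 1][j - 1] + cost); // substitution
            // Adjacent transposition; the "restricted" part is that
            // each swapped pair is never edited again afterwards.
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                dp[i][j] = dp[i][j].min(dp[i - 2][j - 2] + 1);
            }
        }
    }
    dp[n][m]
}

fn main() {
    // An adjacent swap counts as a single edit:
    println!("{}", osa_distance("ca", "ac")); // 1
    // The restriction shows up on "ca" -> "abc": OSA gives 3,
    // while unrestricted Damerau-Levenshtein would give 2.
    println!("{}", osa_distance("ca", "abc")); // 3
}
```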

> For clap's use cases, binary size and build times would be more important measures and those tend to be negatively impacted by performance.

I did a quick check of binary size by running cargo bloat --crates --release inside clap.
The baseline for performance is Jaro in strsim. For a quick runtime test I simply benchmarked text lengths of 8, which is obviously not the whole story. For binary size I summed the size of the external lib + clap_builder to get the complete binary size of the text matching.

| version | performance improvement | binary size |
| --- | --- | --- |
| jaro strsim | 1.0x | 15.1 kB |
| jaro rapidfuzz | 2.1x | 26 kB |
| osa strsim | 0.41x | 16 kB |
| osa rapidfuzz | 3.3x | 20.3 kB |
| osa cargo | 0.2x | 17.6 kB |

Not sure about compile time. Looking at cargo build --timings, it tells me the base time needed for the crate, which on my machine is around 0.2 s for strsim and 0.5 s for rapidfuzz. Integrating only a single metric would obviously perform better in this regard. None of them should impact incremental builds.

If you are leaning towards binary size over performance, the switch is probably not worth it for Jaro except for correctness, though you could also get correctness by integrating a fixed implementation yourself (the fix is not particularly complex, but strsim doesn't appear to be maintained anymore). For OSA the size overhead is smaller and the performance improvement significantly better, so if that is the preferred metric it might be more interesting. Anyway, feel free to close the PR if it's simply not of interest :)

@epage (Member) commented Dec 1, 2023

Considering I'm trying to peel off individual kilobytes, the extra performance is not worth the size.

@epage closed this on Dec 1, 2023