
switch to correctly working Jaro implementation #5235

Closed
wants to merge 1 commit into from

Conversation

@maxbachmann (Author)

I noticed clap switched from Jaro-Winkler to Jaro similarity due to "a bug" in strsim. However, the Jaro similarity in strsim is broken as well, since it doesn't calculate transpositions correctly (rapidfuzz/strsim-rs#49).

This switches from strsim to rapidfuzz for string matching. Besides calculating the correct Jaro / Jaro-Winkler similarity, this should be faster: for the relatively short text lengths involved in command names, the metric should be around 2-4x faster, though I doubt this matters a whole lot for this usage.

As a disclaimer: I am the author of rapidfuzz.

Comment on lines +16 to +19

```rust
use rapidfuzz::distance::jaro;
let scorer = jaro::BatchComparator::new(v.chars());
// Confidence of 0.7 so that bar -> baz is suggested
let args = jaro::Args::default().score_cutoff(0.7);
```

Suggested change:

```diff
-use rapidfuzz::distance::jaro;
-let scorer = jaro::BatchComparator::new(v.chars());
-// Confidence of 0.7 so that bar -> baz is suggested
-let args = jaro::Args::default().score_cutoff(0.7);
+use rapidfuzz::distance::jaro_winkler;
+let scorer = jaro_winkler::BatchComparator::new(v.chars());
+// Confidence of 0.7 so that bar -> baz is suggested
+let args = jaro_winkler::Args::default().score_cutoff(0.8);
```

This could be reverted to the original behavior using Jaro-Winkler as well

@epage (Member) commented Dec 1, 2023

> I noticed clap switched from Jaro-Winkler to Jaro similarity due to "a bug" in strsim. However the Jaro similarity in strsim is broken as well, since it doesn't calculate transpositions correctly (rapidfuzz/strsim-rs#49).

The linked issue only seems to mention Jaro-Winkler.

> For the relatively short text lengths involved in commands the metric should be around 2-4x faster

For clap's use cases, binary size and build times would be more important measures, and those tend to be negatively impacted by performance optimizations.

Overall, I'm hesitant on this change. On top of binary size measurements, I'd want test cases added (first commit in PR showing current behavior, second commit changing the algorithm and showing the change in behavior) to show what problem this is solving. Even then, I'm a bit hesitant as this package has no other dependents and I'm not up for auditing it myself. Clap has a lot of people depending on it. If anything, I'd be tempted to copy/paste the algorithm used by cargo which is copy/pasted out of rustc.

@maxbachmann (Author)

> The linked issue only seems to mention Jaro-Winkler.

Yes, but I did check what actually causes the bug: https://github.com/dguo/strsim-rs/blob/65eac453cbd10ba4e13273002c843e95c81ae93f/src/lib.rs#L114-L117. The transpositions can't be counted in the same loop that finds the matches. Instead, the implementation should create two vectors that tag the common (matched) positions in each string, and then take a second pass over them at the end to count the transpositions.
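The two-pass approach described above can be sketched as follows. This is a minimal, illustrative Jaro implementation for a single comparison — not rapidfuzz's or strsim's actual code, and the function name `jaro` is just for this sketch:

```rust
// Minimal Jaro similarity sketch: tag matches first, count
// transpositions in a separate second pass.
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    // Characters match only within this sliding window.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);

    // First pass: tag common (matched) positions in both strings.
    let mut a_match = vec![false; a.len()];
    let mut b_match = vec![false; b.len()];
    let mut matches = 0usize;
    for (i, &ca) in a.iter().enumerate() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_match[j] && b[j] == ca {
                a_match[i] = true;
                b_match[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }

    // Second pass: walk the matched characters of both strings in
    // order; each mismatched pair counts toward transpositions
    // (two mismatched positions = one transposition).
    let mut mismatched = 0usize;
    let mut j = 0usize;
    for i in 0..a.len() {
        if a_match[i] {
            while !b_match[j] {
                j += 1;
            }
            if a[i] != b[j] {
                mismatched += 1;
            }
            j += 1;
        }
    }
    let m = matches as f64;
    let t = (mismatched / 2) as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

fn main() {
    // "martha" vs "marhta": 6 matches, one transposition (th <-> ht)
    println!("{}", jaro("martha", "marhta")); // ≈ 0.9444
    println!("{}", jaro("bar", "baz")); // ≈ 0.7778, above the 0.7 cutoff
}
```

Counting transpositions inside the match loop, as the linked strsim code does, conflates the two passes and is exactly the bug being described here.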

> If anything, I'd be tempted to copy/paste the algorithm used by cargo which is copy/pasted out of rustc.

They appear to use the OSA distance (restricted Damerau-Levenshtein distance).
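For reference, OSA is a small dynamic-programming recurrence: Levenshtein edits plus an adjacent-transposition case, with the restriction that no substring is edited more than once. A sketch (illustrative only — not cargo's or rustc's actual code; the name `osa_distance` is made up for this example):

```rust
// OSA (restricted Damerau-Levenshtein) distance sketch.
fn osa_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let (n, m) = (a.len(), b.len());
    // dp[i][j] = edits needed to turn a[..i] into b[..j]
    let mut dp = vec![vec![0usize; m + 1]; n + 1];
    for i in 0..=n {
        dp[i][0] = i;
    }
    for j in 0..=m {
        dp[0][j] = j;
    }
    for i in 1..=n {
        for j in 1..=m {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j] + 1) // deletion
                .min(dp[i][j - 1] + 1) // insertion
                .min(dp[i - 1][j - 1] + cost); // substitution
            // Adjacent transposition; the "restricted" part is that
            // each swapped pair is never edited again afterwards.
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                dp[i][j] = dp[i][j].min(dp[i - 2][j - 2] + 1);
            }
        }
    }
    dp[n][m]
}

fn main() {
    // An adjacent swap counts as a single edit:
    println!("{}", osa_distance("ca", "ac")); // 1
    // The restriction shows up on "ca" -> "abc": OSA gives 3,
    // while unrestricted Damerau-Levenshtein would give 2.
    println!("{}", osa_distance("ca", "abc")); // 3
}
```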

> For clap's use cases, binary size and build times would be more important measures and those tend to be negatively impacted by performance.

I did a quick check of binary size by running cargo bloat --crates --release inside clap.
The baseline for performance is Jaro in strsim. For a quick runtime test I simply benchmarked text lengths of 8, which is obviously not the whole story. For binary size I summed the size of the external lib + clap_builder to get the complete binary size of the text matching.

| version | performance improvement | binary size |
| --- | --- | --- |
| jaro strsim | 1.0x | 15.1 kB |
| jaro rapidfuzz | 2.1x | 26 kB |
| osa strsim | 0.41x | 16 kB |
| osa rapidfuzz | 3.3x | 20.3 kB |
| osa cargo | 0.2x | 17.6 kB |

Not sure about compile time. Looking at cargo build --timings, it tells me the base time needed for the crate, which on my machine is around 0.2 s for strsim and 0.5 s for rapidfuzz. Integrating only a single metric would obviously perform better in this regard. None of them should impact incremental builds.

If you are leaning towards binary size over performance, the switch is probably not worth it for Jaro except for correctness, though you could also get correctness by integrating a fixed implementation yourself (the fix is not particularly complex, but strsim doesn't appear to be maintained anymore). For OSA the size overhead is smaller and the performance improvement significantly better, so if that is the preferred metric it might be more interesting. Anyway, feel free to close the PR if it's simply not of interest :)

@epage (Member) commented Dec 1, 2023

Considering I'm trying to peel off individual kilobytes, the extra performance is not worth the size.

@epage closed this on Dec 1, 2023