Issue with deletion in long words #8

Draft
wants to merge 3 commits into
base: main


ausi
Contributor

@ausi ausi commented Dec 2, 2024

If a word in the index is longer than the index length, deletions do not work anymore (they are incorrectly counted as 1 deletion + 1 insertion).

As you can see in the currently failing test case, the word Mustermann is indexed as Muster (index length 6), and a search for Mutermann is then evaluated as if we had searched for Muterm. So even though Mustermann and Mutermann have a distance of only 1, they are found only with a distance of 2 or higher, because Muster and Muterm have a distance of 2.
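
To make the numbers concrete, here is a minimal Python sketch of the effect. It is not the library's code: `levenshtein` is a plain textbook edit-distance function and `INDEX_LENGTH` is an illustrative name, both assumed for this example only. It compares the distance between the full words with the distance between their prefixes cut at the index length.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


INDEX_LENGTH = 6  # assumed name; the value 6 is the index length from the test case

word, query = "Mustermann", "Mutermann"
full_distance = levenshtein(word, query)
cut_distance = levenshtein(word[:INDEX_LENGTH], query[:INDEX_LENGTH])

print(full_distance)  # 1: only the "s" is deleted
print(cut_distance)   # 2: "Muster" vs "Muterm" needs 1 deletion + 1 insertion
```

Cutting both strings at the same length shifts the query's characters one position to the left after the deletion, so a character from beyond the cut ("m") moves into the prefix and shows up as an extra insertion.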

This is an issue in the original paper itself as far as I can see. But I think it can be fixed.

@Toflar
Owner

Toflar commented Dec 2, 2024

Had to re-enable CI as it's been 6 months. Rebasing should help to have it run.

@ausi ausi force-pushed the fix/deletion-cut-off-words branch from b703bb8 to 75523f3 Compare December 2, 2024 20:16
@ausi ausi marked this pull request as draft December 7, 2024 14:05
@ausi
Contributor Author

ausi commented Dec 7, 2024

45a893e is an attempt to fix the issue, but it fails the testResultsMatchResearchPaper test and ends up finding way too many states. I think we need to keep track of the number of deletions in $statesStarC so that zero-cost substitutions are allowed only for these states rather than for all states.

@Toflar
Owner

Toflar commented Dec 9, 2024

Yeah, I don't think that many states is an acceptable solution, because it will return way too many false positives 🤔
