Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(search): make graphql query autoCompleteForMultiple to show exact matches first #11586

Conversation

deepgarg-visa
Copy link
Contributor

@deepgarg-visa deepgarg-visa commented Oct 10, 2024

This PR is targeting the below Issue

#11411

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added product PR or Issue related to the DataHub UI/UX community-contribution PR or Issue raised by member(s) of DataHub Community labels Oct 10, 2024
…ry_result

# Conflicts:
#	metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/request/AutocompleteRequestHandler.java
if (field != null && !field.isEmpty()) {
return ImmutableList.of(field);
return ImmutableList.of(Pair.of(field, "2.0"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hardcoded to 2.0 and looks like it treats URN matches and non-URN matches the same. Previously it weighted URNs higher which probably doesn't make sense for auto-complete purposes, but now the conditional on line 220 is the same for both conditions. So we should probably simplify the conditional and likely remove the boostScore entirely?

Copy link
Contributor Author

@deepgarg-visa deepgarg-visa Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I checked, we can't remove the boostscore entirely. We need to have higher boost score for ngram fields for non-urn fields in "multi_match" clause to get the expected result. I would suggest that we can have boostscore for non-urn fields from their respective searchable spec(which I already did) and urn field can have only default boost score (1.0), as you have rightly mention that it doesn't make sense for auto-complete purposes.

Please find the expected multi_match clause

{ "multi_match": { "query": "aucdl", "fields": [ "name.delimited^1.0", "name.ngram^10.0", "name.ngram._2gram^10.0", "name.ngram._3gram^10.0", "name.ngram._4gram^10.0", "urn.delimited^1.0", "urn.ngram^1.0", "urn.ngram._2gram^1.0", "urn.ngram._3gram^1.0", "urn.ngram._4gram^1.0" ], "type": "phrase", "operator": "OR", "slop": 0, "prefix_length": 0, "max_expansions": 50, "zero_terms_query": "NONE", "auto_generate_synonyms_phrase_query": true, "fuzzy_transpositions": true, "boost": 1.0 } }

In case of function at line 266, we can put condition to have default boostscore of 10.0 for non-urn field and boostscore of 1.0 for urn field ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, there are two things to consider here, the relative weights between the fields being used for the query AND the relative weights of the subfields (delimited, ngram, etc). Most likely we should be applying a multiplier. The URN while important as a unique identifier for general search is not as important here so the urn field weight is likely 1.0.

Next lets consider a multiplier for ngram. I am not seeing these being applied ->

The field weights for ngrams should be impacted by that configuration and it doesn't appear to be based on. the example query above.

Generally the weight formula for ngram subfields should be something like:
<field annotation weight> * <configuration ngram factor>

Copy link
Contributor Author

@deepgarg-visa deepgarg-visa Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @david-leifker, As suggested done the following changes

  • add "match" clause for exact matches
  • make URN field default boostscore to 1.0
  • make ngram subfields weight as per the formula suggested
  • add a testcase in class "SampleDataFixtureTestBase.java"
  • add testdata in containerindex_v2.json.gz
  • reverted to "bool_prefix" type in "multi_match" clause
  • simplify the condition at line 222

Please find the final query

{ "multi_match": { "query": "container", "fields": [ "name.delimited^1.0", "name.ngram^10.0", "name.ngram._2gram^12.0", "name.ngram._3gram^15.0", "name.ngram._4gram^18.0", "urn.delimited^1.0", "urn.ngram^1.0", "urn.ngram._2gram^1.0", "urn.ngram._3gram^1.0", "urn.ngram._4gram^1.0" ], "type": "bool_prefix", "operator": "OR", "slop": 0, "prefix_length": 0, "max_expansions": 50, "zero_terms_query": "NONE", "auto_generate_synonyms_phrase_query": true, "fuzzy_transpositions": true, "boost": 1 } }, { "match": { "name.keyword": { "query": "container", "operator": "OR", "prefix_length": 0, "max_expansions": 50, "fuzzy_transpositions": true, "lenient": false, "zero_terms_query": "NONE", "auto_generate_synonyms_phrase_query": true, "boost": 10 } } }

Copy link
Collaborator

@david-leifker david-leifker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the one comment and wanted to point out the 3 failing tests (x2 for elasticsearch & opensearch) which likely need to be updated or taken into consideration.

metadata-io-search > elasticsearch-testcontainers > com.linkedin.metadata.search.elasticsearch.SampleDataFixtureElasticSearchTest > testGroupAutoComplete FAILED
    java.lang.AssertionError: Expected 1 results for `T` found 0 expected [true] but found [false]
        at org.testng.Assert.fail(Assert.java:111)
        at org.testng.Assert.failNotEquals(Assert.java:1578)
        at org.testng.Assert.assertTrue(Assert.java:57)
        at com.linkedin.metadata.search.fixtures.SampleDataFixtureTestBase.lambda$testGroupAutoComplete$23(SampleDataFixtureTestBase.java:915)
        at java.base/java.lang.Iterable.forEach(Iterable.java:75)
        at com.linkedin.metadata.search.fixtures.SampleDataFixtureTestBase.testGroupAutoComplete(SampleDataFixtureTestBase.java:909)
...
metadata-io-search > elasticsearch-testcontainers > com.linkedin.metadata.search.elasticsearch.SampleDataFixtureElasticSearchTest > testGroupAutoComplete FAILED
    java.lang.AssertionError: Expected 1 results for `T` found 0 expected [true] but found [false]
        at org.testng.Assert.fail(Assert.java:111)
        at org.testng.Assert.failNotEquals(Assert.java:1578)
        at org.testng.Assert.assertTrue(Assert.java:57)
        at com.linkedin.metadata.search.fixtures.SampleDataFixtureTestBase.lambda$testGroupAutoComplete$23(SampleDataFixtureTestBase.java:915)
...
metadata-io-search > elasticsearch-testcontainers > com.linkedin.metadata.search.elasticsearch.SampleDataFixtureElasticSearchTest > testUserAutoComplete FAILED
    java.lang.AssertionError: Expected at least 1 results for `D` found 0 expected [true] but found [false]
        at org.testng.Assert.fail(Assert.java:111)
        at org.testng.Assert.failNotEquals(Assert.java:1578)
        at org.testng.Assert.assertTrue(Assert.java:57)
        at com.linkedin.metadata.search.fixtures.SampleDataFixtureTestBase.lambda$testUserAutoComplete$24(SampleDataFixtureTestBase.java:935)
        at java.base/java.lang.Iterable.forEach(Iterable.java:75)
        at com.linkedin.metadata.search.fixtures.SampleDataFixtureTestBase.testUserAutoComplete(SampleDataFixtureTestBase.java:929)

912 tests completed, 6 failed, 71 skipped

> Task :metadata-io:test FAILED

@deepgarg-visa
Copy link
Contributor Author

deepgarg-visa commented Oct 13, 2024

Hi @david-leifker , "phrase" type in multi_match clause is not handling query with one and two characters very well. More than that its working fine. Fix the testcases accordingly. But needs your input regarding this

@david-leifker
Copy link
Collaborator

@deepgarg-visa - those test cases are important, autocomplete needs to properly handle < 3 character queries and is responsible for this function for type-ahead use cases in the UI. We need a solution which supports less than 3 characters and this may involve combining both of the queries (the previous one with the new one) so that both use-cases can work.

@deepgarg-visa
Copy link
Contributor Author

deepgarg-visa commented Oct 15, 2024

@deepgarg-visa - those test cases are important, autocomplete needs to properly handle < 3 character queries and is responsible for this function for type-ahead use cases in the UI. We need a solution which supports less than 3 characters and this may involve combining both of the queries (the previous one with the new one) so that both use-cases can work.

@david-leifker, I have reverted the changes in testcases

Copy link
Collaborator

@david-leifker david-leifker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Thank you!

@david-leifker david-leifker merged commit b4b9142 into datahub-project:master Oct 15, 2024
45 checks passed
@deepgarg-visa
Copy link
Contributor Author

@david-leifker, could you please mark this issue resolved:
#11411

@deepgarg-visa deepgarg-visa deleted the fix_autoCompleteForMultiple_query_result branch October 17, 2024 12:40
epatotski pushed a commit to acryldata/datahub that referenced this pull request Oct 22, 2024
sleeperdeep pushed a commit to sleeperdeep/datahub that referenced this pull request Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community product PR or Issue related to the DataHub UI/UX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants