RottenTomatoes title match failure #1249

Open
1 task done
benhaney opened this issue Jan 12, 2025 · 1 comment · May be fixed by #1265
Labels: bug (Something isn't working), confirmed (This bug has been reproduced)

Comments

@benhaney
Contributor

Description

Sorry in advance for this being super long for a relatively small problem.

I ran across another instance of a movie getting matched to the wrong RT ratings and have been thinking about how to improve the RT search logic and make it generally more robust. I'd like to have a discussion about some of these ideas and whether or not they're worthwhile, and then I'll be happy to open a PR implementing whatever is decided on.

For starters, the specific mismatch that prompted this involves the movie "The OctoGames" (/movie/1028703 in Jellyseerr) which ends up showing the RT ratings for the unrelated movie "The Stranger." The reason for this failure has two parts:

  1. TMDB lists the title for this movie as "The OctoGames" while RT has the title as just "Octogames" (note both the lack of "The" and the lowercase "g") with no alternative titles. Searching the RT algolia API for "The OctoGames" doesn't return the correct movie anywhere in the list of results. The movie is on RT, but you need to search without "The" to find it.
  2. The Jellyseerr RT search result matching logic does a desperate pass through the results looking for any result with a matching release year, and will use that result even if the titles don't match at all. In this case, the RT results are just a pile of random popular movies that have "The" in the title, and one of them happens to have the same release year as Octogames, so it gets selected.

There are, roughly, two separate issues to address here. One is normalizing search queries before sending them to the RT algolia API, which may be as simple as stripping "the" from search queries, since the API seems to be sensitive to it. The other issue is Jellyseerr's result selection logic lacking robustness when presented with low quality search results, which is my main focus here.
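A minimal sketch of the query-side fix, assuming a hypothetical helper named `normalizeQuery` that runs before the request is sent. Only the sensitivity to "the" was observed above; stripping "a" and "an" as well is an assumption that would need checking against the RT API.

```typescript
// Hypothetical helper (not part of Jellyseerr): strip one leading English
// article from a title before using it as an RT search query.
function normalizeQuery(title: string): string {
  // The article must be followed by whitespace, so "Them" or a bare "The"
  // are left untouched.
  return title.replace(/^(the|a|an)\s+/i, '').trim();
}

console.log(normalizeQuery('The OctoGames')); // "OctoGames"
```

Retrying with the stripped query only when the original query finds nothing would be a more conservative variant of the same idea.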

For context, the current result selection logic for movies performs 4 passes over the search results. Each pass tries to grab the first movie that satisfies its condition, and the logic moves on to the next pass if no result matches. The conditions of the 4 passes are:

  1. Exact title and exact year match
  2. Partial title (via String.includes) and exact year match
  3. Exact year match only (!)
  4. Exact title match only

The third pass is the most problematic one, and it is what caused the mismatch in this specific case.
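To make the failure mode concrete, here is a rough sketch of the four passes as described above. It is a paraphrase, not the actual Jellyseerr source: the `RTResult` shape, the function name, and the sample data are all made up for illustration.

```typescript
// Illustrative result shape; the real RT response has more fields.
interface RTResult {
  title: string;
  year: number;
}

// Each pass returns the first result satisfying its condition, and falls
// through to the next pass when nothing matches.
function pickByPasses(
  results: RTResult[],
  title: string,
  year: number
): RTResult | undefined {
  return (
    // Pass 1: exact title and exact year
    results.find((r) => r.title === title && r.year === year) ??
    // Pass 2: partial title (String.includes) and exact year
    results.find((r) => r.title.includes(title) && r.year === year) ??
    // Pass 3: exact year only, regardless of title
    results.find((r) => r.year === year) ??
    // Pass 4: exact title only
    results.find((r) => r.title === title)
  );
}

// Made-up search results: neither title matches the query.
const passResults: RTResult[] = [
  { title: 'The Other Movie', year: 1990 },
  { title: 'The Unrelated Movie', year: 2000 },
];

const picked = pickByPasses(passResults, 'The Wanted Movie', 2000);
console.log(picked?.title); // "The Unrelated Movie" (selected by pass 3)
```

With no title match anywhere in the results, pass 3 still returns a result purely because the year lines up.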

Here is a rough (and by no means complete) collection of potential data mismatches that should be accounted for:

  1. Capitalization differences in titles
  2. Minor article and punctuation differences in titles, such as missing "The" or "A", or ":" vs "-"
  3. Alternate titles being treated as primary titles
  4. Off-by-one years, for smaller movies with complicated release stories that happened near the end/beginning of a year

An interesting observation here is that both titles and years can have three potential match states, which I'll refer to as "exact," "partial," and "none." This makes me think that the best approach would be to scan the list of search results once, assign a "score" to each result by multiplying together subscores for how closely the titles and years match, and then pick the result with the highest score above a minimum threshold. An exact match would contribute a subscore of 1, a non-match would contribute a subscore of 0 (forcing the final score to 0 as well), and a partial match would contribute a subscore between 0 and 1, potentially based on how close the match is.

Here is my proposed scoring logic so far:

  • Year:
    • exact match: 1
    • off by one: 0.5 (?)
    • anything else: 0
  • Title:
    • exact match between normalized (?) titles: 1
    • partial match via any sensible string similarity algo (?): the score of that algo
    • Sufficiently low (?) similarity: 0

Under this scheme, exact title and year matches would have a score of 1 and automatically win. Results with very different titles, or with years more than 1 off, would have a score of 0 and automatically lose. Anything in between will compete for the highest title*year score.
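A minimal single-pass sketch of this scoring scheme, under a few assumptions: the names (`Candidate`, `yearScore`, `simpleTitleScore`, `bestMatch`) are hypothetical, the title subscore is a placeholder (case-insensitive exact match = 1, substring = 0.5), and the minimum threshold of 0.25 is an arbitrary value picked for the example.

```typescript
interface Candidate {
  title: string;
  year: number;
}

// Year subscore: exact = 1, off by one = 0.5, anything else = 0.
function yearScore(a: number, b: number): number {
  const diff = Math.abs(a - b);
  if (diff === 0) return 1;
  if (diff === 1) return 0.5;
  return 0;
}

// Placeholder title subscore: exact = 1, substring = 0.5, otherwise 0.
function simpleTitleScore(a: string, b: string): number {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  if (x === '' || y === '') return 0;
  if (x === y) return 1;
  return x.includes(y) || y.includes(x) ? 0.5 : 0;
}

// Scan once, multiply the subscores, keep the highest score that reaches
// the minimum threshold. A zero in either subscore zeroes the product.
function bestMatch(
  candidates: Candidate[],
  title: string,
  year: number,
  minScore = 0.25 // assumed threshold, not a decided value
): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = 0;
  for (const c of candidates) {
    const score = simpleTitleScore(c.title, title) * yearScore(c.year, year);
    if (score >= minScore && score > bestScore) {
      best = c;
      bestScore = score;
    }
  }
  return best;
}

// Made-up data: same year but unrelated title vs. close title, year off by one.
const scored = bestMatch(
  [
    { title: 'The Unrelated Movie', year: 2000 },
    { title: 'Wanted Movie', year: 2001 },
  ],
  'The Wanted Movie',
  2000
);
console.log(scored?.title); // "Wanted Movie" (0.5 * 0.5 = 0.25)
```

Because the subscores are multiplied rather than added, a result with a matching year and an unrelated title can never win, which is exactly the case the old third pass let through.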

I put (?)s in places of the scoring logic above where I would particularly like feedback.

  • The 0.5 score for off-by-one years is arbitrary. I don't think the exact value matters too much, but I could see arguments existing for lowering (0.3?) or raising (0.8?) it.
  • "Normalizing" is pretty open ended. I envision it as lowercasing, stripping non-alphanumeric characters like ":", and removing minor articles like "the," but I'm not married to any of those details.
  • A string similarity algo could be as simple as the one already being used (i.e., string-contains) or as complicated as something like Jaro distance. I don't think this needs to be particularly robust, and we'd probably be taking on an additional library dependency if we decide to go the Jaro route, so I think it's reasonable to stick with a simpler check if desired.
  • Depending on the string similarity algorithm, a cutoff should be determined below which values are forced to 0, so we don't accidentally end up with cases where a result with a very different title and a similar year gets a very low but non-zero total score and still wins because everything else is 0. If we go with Jaro, that might be "scores under 0.3 become 0." If we go with string-contains, it might just be "substring match -> 0.5, no match -> 0."
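One possible shape for the normalization and cutoff pieces, as a sketch only. The function names are hypothetical, the regexes handle ASCII titles only (non-ASCII letters are stripped along with punctuation), and the similarity function is passed in so either string-contains or a Jaro implementation could be plugged in.

```typescript
// Lowercase, strip non-alphanumeric characters, remove minor articles,
// and collapse whitespace.
function normalizeTitle(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .replace(/\b(the|a|an)\b/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Simple stand-in for a similarity algorithm: substring match -> 0.5.
function substringSimilarity(x: string, y: string): number {
  return x.includes(y) || y.includes(x) ? 0.5 : 0;
}

// Title subscore: 1 for equal normalized titles, otherwise the similarity
// value, forced to 0 when it falls below the cutoff.
function titleSubscore(
  a: string,
  b: string,
  similarity: (x: string, y: string) => number = substringSimilarity,
  cutoff = 0.3 // assumed value, taken from the Jaro example above
): number {
  const x = normalizeTitle(a);
  const y = normalizeTitle(b);
  // A title made up only of articles or punctuation normalizes to "", so
  // fall back to a plain case-insensitive comparison in that case.
  if (x === '' || y === '') return a.toLowerCase() === b.toLowerCase() ? 1 : 0;
  if (x === y) return 1;
  const s = similarity(x, y);
  return s < cutoff ? 0 : s;
}

console.log(normalizeTitle('Mission: Impossible - Fallout')); // "mission impossible fallout"
console.log(titleSubscore('The OctoGames', 'Octogames')); // 1
console.log(titleSubscore('Octogames', 'The Stranger')); // 0
```

Stripping punctuation outright (rather than replacing it with a space) means "Spider-Man" and "Spiderman" normalize to the same string, at the cost of occasionally gluing unrelated words together.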

Also note that the RT algolia API returns alternate titles, but we don't currently use them. I intend to check all of them in the updated logic, and not just the primary title.
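Checking alternate titles could be as simple as taking the best title subscore across all of a result's titles. This is a sketch under assumptions: the `alternateTitles` field name is hypothetical (the real RT response may name it differently), and the scoring function is injected so that any title comparison can be used.

```typescript
// Hypothetical result shape; the actual field holding alternate titles in
// the RT response may have a different name.
interface RTHit {
  title: string;
  alternateTitles?: string[];
}

// Score the wanted title against the primary title and every alternate
// title, and keep the highest subscore.
function bestTitleScore(
  hit: RTHit,
  wanted: string,
  score: (a: string, b: string) => number
): number {
  const titles = [hit.title, ...(hit.alternateTitles ?? [])];
  return Math.max(...titles.map((t) => score(t, wanted)));
}

// Example scorer: case-insensitive exact match only.
const exactScore = (a: string, b: string): number =>
  a.toLowerCase() === b.toLowerCase() ? 1 : 0;

// Made-up hit whose primary title differs from the wanted title.
const hit: RTHit = {
  title: 'Primary Title',
  alternateTitles: ['Some Other Name', 'Wanted Title'],
};
console.log(bestTitleScore(hit, 'wanted title', exactScore)); // 1
```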

If this approach sounds good then I'll get started on a PR. Please let me know about any thoughts you have or tweaks that should be made.

Version

latest develop

Steps to Reproduce

  1. Go to /movie/1028703
  2. Click RT ratings
  3. Observe being linked to the wrong RT movie page

Screenshots

Screen Shot 2025-01-11 at 10 32 01 PM
Screen Shot 2025-01-11 at 10 32 26 PM
Screen Shot 2025-01-11 at 10 32 45 PM

Logs

No response

Platform

desktop

Database

SQLite (default)

Device

Any

Operating System

Any

Browser

Any

Additional Context

No response

Code of Conduct

  • I agree to follow Jellyseerr's Code of Conduct
@benhaney added the "awaiting triage" (This issue needs to be reviewed) and "bug" (Something isn't working) labels on Jan 12, 2025
@Fallenbagel
Owner

@benhaney Proceed with a PR and we will review :)

@Fallenbagel added the "confirmed" (This bug has been reproduced) label and removed the "awaiting triage" (This issue needs to be reviewed) label on Jan 14, 2025
@benhaney benhaney linked a pull request Jan 15, 2025 that will close this issue
3 tasks